Code Execution
Intent
Let the agent write and execute code in a sandboxed environment, enabling computation, data analysis, and file manipulation.
Problem
LLMs are unreliable at arithmetic, data processing, and precise calculation when they work purely through text generation. They can reason about what code should do, but running it mentally, step by step, is where the errors creep in. Data analysis, file processing, and computation need actual code execution.
Solution
Give the agent access to a sandboxed code interpreter (typically Python). The agent writes code to solve the computational part of the task, executes it, and uses the output in its reasoning. The sandbox prevents the agent from affecting the host system. This is one of the most powerful tool patterns because it gives the agent general-purpose computation — anything expressible as code becomes a capability.
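The write, execute, observe cycle described above can be sketched as a small loop. This is a minimal illustration rather than a reference implementation: `run_code_loop`, `generate_code`, and `execute` are hypothetical names, the two callables stand in for the LLM call and the sandbox, and retrying with the error message as feedback is an assumption about how an agent might use the output.

```python
from typing import Callable, Tuple


def run_code_loop(
    task: str,
    generate_code: Callable[[str], str],          # hypothetical: wraps the LLM call
    execute: Callable[[str], Tuple[bool, str]],   # hypothetical: wraps the sandbox
    max_attempts: int = 3,
) -> str:
    """Write code, run it, and retry with the error as feedback if it fails."""
    prompt = task
    output = ""
    for _ in range(max_attempts):
        code = generate_code(prompt)
        ok, output = execute(code)
        if ok:
            return output
        # Feed the failure back so the next attempt can correct it.
        prompt = f"{task}\n\nPrevious attempt failed with:\n{output}"
    # Every attempt failed: return the last observation so the caller can report it.
    return output
```

Keeping the model and the sandbox behind plain callables makes the loop easy to test with stubs and lets either side be swapped out independently.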
Diagram
User: "Analyze this CSV and find the top 5 customers by revenue"
↓
LLM writes Python:
import pandas as pd
df = pd.read_csv('data.csv')
top5 = df.groupby('customer')['revenue'].sum().nlargest(5)
print(top5)
↓
[Sandbox executes code]
↓
Output: Customer A: $1.2M, Customer B: $890K, ...
↓
LLM: "Here are your top 5 customers by revenue: ..."
When to Use
- Data analysis and visualization tasks
- Mathematical computation requiring precision
- File processing (CSV, JSON, images)
- Any task that benefits from general-purpose computation
When NOT to Use
- Simple questions answerable from knowledge alone
- When security requirements prohibit code execution
- Tasks with no computational component
Pros & Cons
Pros
- Precise computation — no arithmetic errors
- General-purpose: any task expressible as code
- Can process files, generate visualizations, manipulate data
- Verifiable: the output comes from code that actually ran, so results can be inspected and reproduced
Cons
- Security risk: sandboxing must be robust
- Generated code may have bugs
- Execution adds latency
- Limited to available libraries in the sandbox
Implementation Steps
1. Set up a sandboxed execution environment (Docker, E2B, Pyodide)
2. Define security boundaries: filesystem access, network access, execution time
3. Give the agent a code execution tool with clear usage instructions
4. Capture stdout, stderr, and return values as observations
5. Implement file upload/download for data processing tasks
6. Add timeout and resource limits to prevent runaway code
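The capture and timeout steps above might look roughly like the sketch below. It uses only the Python standard library and runs the code in a child interpreter with a scratch working directory. The names `Observation` and `run_in_subprocess` are hypothetical. A child process on its own is not a sandbox: this sketch assumes filesystem, network, and memory limits are enforced by an outer layer such as a container.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What the agent sees after one execution (hypothetical structure)."""
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None when the run was killed by the timeout
    timed_out: bool


def run_in_subprocess(code: str, timeout_s: float = 5.0, max_chars: int = 10_000) -> Observation:
    """Run code in a child Python process and capture its output.

    Assumption: real isolation comes from whatever environment this
    function itself runs in; here we only bound wall-clock time and
    the size of the captured output.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* env vars
                cwd=workdir,                         # scratch directory, removed afterwards
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return Observation("", f"Timed out after {timeout_s} s", None, True)
    # Truncate so a chatty program cannot flood the model's context.
    return Observation(proc.stdout[:max_chars], proc.stderr[:max_chars], proc.returncode, False)
```

Returning a structured observation rather than a single string lets the agent tell a crash (non-zero exit code, traceback in `stderr`) apart from a timeout.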
Real-World Example
Financial Analysis
User uploads quarterly financial data. Agent writes Python to: parse the spreadsheet, calculate YoY growth rates, identify anomalies, generate charts, and produce a summary report. All computation is precise because it's executed code, not LLM text generation.
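As a rough idea of what the agent's generated code might contain for the growth-rate step, here is a minimal sketch in plain Python. The revenue figures are invented for illustration, and `yoy_growth` and the 20% anomaly threshold are assumptions, not part of any particular tool.

```python
from typing import Dict, Optional, Tuple


def yoy_growth(current: float, prior: float) -> Optional[float]:
    """Year-over-year growth as a fraction; None when the prior value is zero."""
    if prior == 0:
        return None
    return (current - prior) / prior


# Illustrative figures only (not from any real report), keyed by (year, quarter).
revenue: Dict[Tuple[int, int], float] = {
    (2023, 1): 100.0,
    (2023, 2): 120.0,
    (2024, 1): 110.0,
    (2024, 2): 90.0,
}

# Compare each quarter with the same quarter one year earlier, where available.
growth = {
    (year, quarter): yoy_growth(value, revenue[(year - 1, quarter)])
    for (year, quarter), value in revenue.items()
    if (year - 1, quarter) in revenue
}

# Flag quarters whose growth moved more than an assumed 20% in either direction.
flagged = [key for key, g in growth.items() if g is not None and abs(g) > 0.20]
```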
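import subprocess
import sys
import tempfile
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def solve_with_code(problem: str) -> str:
    # 1. Ask the model for a Python program that solves the problem.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Write Python code to solve this. Output only code, no markdown."},
            {"role": "user", "content": problem},
        ],
    )
    code = response.choices[0].message.content or ""

    # 2. Write the code to a temporary file and run it with a timeout.
    # NOTE: a bare subprocess is not a sandbox. Run this inside an isolated
    # environment (for example a container) before using it on untrusted input.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
        output = result.stdout or result.stderr
    except subprocess.TimeoutExpired:
        output = "Execution timed out after 10 seconds."
    finally:
        # Remove the temporary file even if the run failed or timed out.
        Path(path).unlink()

    # 3. Feed the captured output back to the model for an explanation.
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Code output:\n{output}\n\nExplain the result."}],
    )
    return summary.choices[0].message.content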
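The system prompt in `solve_with_code` asks for code only, but models sometimes wrap their answer in a markdown fence anyway, which would make the script fail with a syntax error. One possible guard, sketched here with the hypothetical helper `strip_code_fences`, is to extract the fenced body before writing the file. The fence marker is built from single backticks only to keep this listing easy to embed.

```python
import re

FENCE = "`" * 3  # the three-backtick markdown fence marker

# Opening fence with optional language tag, then the body, then the closing fence.
_FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text itself if there is none."""
    match = _FENCE_RE.search(text)
    if match:
        return match.group(1).strip()
    return text.strip()
```

In the example above this would be applied as `code = strip_code_fences(code)` just before the temporary file is written.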