Code Execution

Intermediate · 🔧 Tool Use Patterns · OpenAI / Industry practice

Intent

Let the agent write and execute code in a sandboxed environment, enabling computation, data analysis, and file manipulation.

Problem

LLMs are unreliable at arithmetic, data processing, and precise calculation when they work purely through text generation. They can reason about what code should do, but simulating its execution in text leads to errors. Data analysis, file processing, and computation need actual code execution.

Solution

Give the agent access to a sandboxed code interpreter (typically Python). The agent writes code to solve the computational part of the task, executes it, and uses the output in its reasoning. The sandbox prevents the agent from affecting the host system. This is one of the most powerful tool patterns because it gives the agent general-purpose computation — anything expressible as code becomes a capability.
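One way to expose the interpreter to the agent is as a tool with a single `code` argument. The sketch below is illustrative rather than a reference implementation: the tool name `run_python`, the `handle_tool_call` helper, and the 10-second limit are assumptions, and the schema follows the JSON-Schema style used by function-calling APIs. The plain subprocess stands in for a real sandbox.

```python
import json
import subprocess
import sys

# Hypothetical tool description handed to the model alongside the conversation.
RUN_PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute Python code and return what it prints. "
                       "Use print() to report results.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source to run"},
            },
            "required": ["code"],
        },
    },
}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call and return the observation text for the model."""
    if name != "run_python":
        return f"error: unknown tool {name!r}"
    code = json.loads(arguments_json)["code"]
    # NOTE: a plain subprocess is NOT a sandbox. Swap in a container or a
    # hosted interpreter before running untrusted code.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return "error: execution timed out"
    return proc.stdout if proc.returncode == 0 else f"error:\n{proc.stderr}"
```

The string returned by `handle_tool_call` is what the agent sees as the tool result and reasons over in its next turn.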

Diagram

User: "Analyze this CSV and find the top 5 customers by revenue"
    ↓
LLM writes Python:
    import pandas as pd
    df = pd.read_csv('data.csv')
    top5 = df.groupby('customer')['revenue'].sum().nlargest(5)
    print(top5)
    ↓
[Sandbox executes code]
    ↓
Output: Customer A: $1.2M, Customer B: $890K, ...
    ↓
LLM: "Here are your top 5 customers by revenue: ..."

When to Use

Data analysis and visualization tasks
Mathematical computation requiring precision
File processing (CSV, JSON, images)
Any task that benefits from general-purpose computation

When NOT to Use

Simple questions answerable from knowledge alone
When security requirements prohibit code execution
Tasks with no computational component

Pros & Cons

Pros

Precise computation: arithmetic is done by the interpreter, not by the model
General-purpose: any task expressible as code
Can process files, generate visualizations, manipulate data
Verifiable: the output shows what the code actually computed, though the code itself can still be wrong

Cons

Security risk: sandboxing must be robust
Generated code may have bugs
Execution adds latency
Limited to available libraries in the sandbox

Implementation Steps

1. Set up a sandboxed execution environment (Docker, E2B, Pyodide)
2. Define security boundaries: filesystem access, network access, execution time
3. Give the agent a code execution tool with clear usage instructions
4. Capture stdout, stderr, and return values as observations
5. Implement file upload/download for data processing tasks
6. Add timeout and resource limits to prevent runaway code
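Steps 2, 4, and 6 above can be sketched with the standard library alone. The `Observation` type, the `run_code` helper, and the default limits are hypothetical names and values chosen for illustration. This sketch only enforces a wall-clock timeout, a throwaway working directory, and an output size cap; it assumes that memory, CPU, and network limits are enforced by the surrounding sandbox (for example, a container), which is not shown.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None when the run was killed by the timeout
    timed_out: bool


def run_code(code: str, timeout_s: float = 5.0, max_chars: int = 10_000) -> Observation:
    """Run code in a child interpreter with a time limit and capped output."""
    # The scratch directory is deleted when the block exits, so files the
    # code writes there do not persist between runs.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I is Python's isolated mode: it ignores PYTHON* environment
                # variables and the user site-packages directory.
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return Observation("", f"timed out after {timeout_s}s", None, True)
    # Truncate both streams so a chatty script cannot flood the model's context.
    return Observation(
        proc.stdout[:max_chars], proc.stderr[:max_chars], proc.returncode, False
    )
```

Returning stderr and the exit code alongside stdout lets the agent read a traceback and retry with corrected code instead of failing silently.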

Real-World Example

Financial Analysis

User uploads quarterly financial data. Agent writes Python to: parse the spreadsheet, calculate YoY growth rates, identify anomalies, generate charts, and produce a summary report. All computation is precise because it's executed code, not LLM text generation.
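As a rough illustration of the kind of script the agent might write for this task, the sketch below computes year-over-year growth per quarter and flags outliers. The revenue figures are made up, and the anomaly rule (more than one standard deviation from the mean growth) is an assumed stand-in; the source does not say how anomalies are identified.

```python
from statistics import mean, stdev

# Hypothetical quarterly revenue in $M, keyed by (year, quarter). Not real data.
revenue = {
    (2022, 1): 100.0, (2022, 2): 110.0, (2022, 3): 120.0, (2022, 4): 130.0,
    (2023, 1): 110.0, (2023, 2): 121.0, (2023, 3): 90.0, (2023, 4): 143.0,
}


def yoy_growth(series):
    """Return {(year, quarter): growth} versus the same quarter one year earlier."""
    return {
        (year, q): (value - series[(year - 1, q)]) / series[(year - 1, q)]
        for (year, q), value in series.items()
        if (year - 1, q) in series
    }


def flag_anomalies(growth, threshold=1.0):
    """Flag periods whose growth is more than `threshold` std devs from the mean."""
    mu = mean(growth.values())
    sigma = stdev(growth.values())
    return [period for period, g in growth.items() if abs(g - mu) > threshold * sigma]


growth = yoy_growth(revenue)
for period, g in sorted(growth.items()):
    print(period, f"{g:+.1%}")
print("anomalies:", flag_anomalies(growth))
```

With this sample data, three quarters grow by 10% and the third quarter of 2023 falls by 25%, so only that quarter is flagged.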

Python: Agent Generates and Executes Code in Sandbox

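import subprocess
import sys
import tempfile
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def solve_with_code(problem: str) -> str:
    # Step 1: ask the model for a script that solves the problem.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Write Python code to solve this. Output only code, no markdown."},
            {"role": "user", "content": problem},
        ],
    )
    code = response.choices[0].message.content or ""

    # Step 2: write the script to a temporary file and run it.
    # NOTE: a bare subprocess is not a sandbox. In production, run this
    # step inside an isolated environment (Docker, E2B, Pyodide).
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # sys.executable avoids depending on a "python" binary being on PATH.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
        # Keep both streams so warnings and tracebacks reach the model.
        output = result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        output = "Execution timed out after 10 seconds."
    finally:
        Path(path).unlink()  # remove the script even if the run failed

    # Step 3: have the model explain the observed output.
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Problem:\n{problem}\n\nCode output:\n{output}\n\nExplain the result."}],
    )
    return summary.choices[0].message.content or ""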
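The system prompt asks for code with no markdown, but a model may still wrap its answer in a Markdown fence, which would make the script fail with a syntax error. A small guard like the one below could be applied to the model's reply before writing it to disk. `strip_code_fences` is a hypothetical helper, not part of any library, and it only handles a single fence around the whole reply.

```python
import re

TICKS = "`" * 3  # built indirectly so this listing contains no literal fence
_FENCED = re.compile(
    r"^" + TICKS + r"[\w+-]*[ \t]*\n(.*?)\n?" + TICKS + r"$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the code inside one surrounding Markdown fence, else the text stripped."""
    stripped = text.strip()
    match = _FENCED.match(stripped)
    return match.group(1) if match else stripped
```

A reply that is already bare code passes through unchanged apart from leading and trailing whitespace.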

References

Code Interpreter — OpenAI