Why AI Agents Need Code Execution Sandboxes
When an AI agent generates code, something needs to run it. That something cannot be your production server. User-generated or AI-generated code might contain infinite loops, excessive memory allocation, malicious system calls, or simple bugs that crash the host process. You need isolated, ephemeral environments that execute untrusted code safely and report results back.
Three categories of AI applications need code execution: AI coding agents (Devin, OpenDevin, SWE-Agent) that write, test, and debug code autonomously. AI data analysts that generate and run Python scripts to answer questions about datasets. AI assistants with tool use that execute code as one of many available tools. Each has different requirements for execution speed, language support, persistence, and GPU access.
E2B, Modal Sandboxes, and Fly Machines are the three leading infrastructure options, each designed for different use cases. E2B is purpose-built for AI code execution. Modal provides general serverless compute with sandbox capabilities. Fly Machines offer container-based isolation with global edge deployment. Your choice depends on latency requirements, language needs, and whether you need GPU access. Our guide to AI tool use agents covers the broader architecture for agents that execute code.
E2B: Purpose-Built for AI Code Execution
E2B is designed specifically for AI applications that need to run code. It provides lightweight, secure sandboxes that spin up in milliseconds and support multiple programming languages.
How It Works
E2B provides a cloud sandbox API. Your AI agent sends code to E2B, which spins up an isolated microVM (based on Firecracker, the same technology AWS Lambda uses), executes the code, and returns the output. Each sandbox gets its own filesystem, network isolation, and resource limits. Sandboxes persist for the session duration (configurable, default 5 minutes) so agents can run multiple code snippets in the same environment.
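The loop described above — create a sandbox, run several snippets in the same environment, tear it down — looks roughly like this with E2B's Python SDK. This is a hedged sketch: it assumes the `e2b-code-interpreter` package and an `E2B_API_KEY` environment variable, and method names may differ between SDK versions.

```python
# Sketch of a multi-snippet E2B session (assumes e2b-code-interpreter
# is installed and E2B_API_KEY is set; API names may vary by version).
import os

SANDBOX_TIMEOUT_SECONDS = 300  # the article's default session lifetime: 5 minutes

def run_snippets(snippets):
    """Run several code snippets in one sandbox so files and state persist."""
    from e2b_code_interpreter import Sandbox  # third-party SDK (assumed installed)

    sandbox = Sandbox(timeout=SANDBOX_TIMEOUT_SECONDS)
    try:
        # Later snippets see variables, files, and installed packages
        # from earlier ones, because they share the same microVM.
        return [sandbox.run_code(code) for code in snippets]
    finally:
        sandbox.kill()  # tear down the microVM when the session ends

if __name__ == "__main__" and os.environ.get("E2B_API_KEY"):
    results = run_snippets(["x = 40", "x + 2"])
    print(results[-1])
```

Because the sandbox persists across `run_code` calls, the second snippet can read the `x` defined by the first — the session-persistence behavior described above.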
Strengths
Cold start under 200ms. This is critical for interactive AI assistants where users expect near-instant responses. Support for Python, JavaScript, TypeScript, R, Julia, Bash, and custom Docker images. The SDK integrates cleanly with LangChain, Vercel AI SDK, and direct API calls. Built-in filesystem persistence within a session means agents can write files, install packages, and build on previous outputs. 8K+ GitHub stars and a growing ecosystem of templates for common use cases (data analysis, web scraping, document processing).
Weaknesses
No GPU access. If your AI agent needs to run ML models, image processing, or GPU-accelerated computation inside the sandbox, E2B cannot help. Limited compute per sandbox (1 vCPU, 512MB RAM on the free tier). Long-running processes (over 24 hours) are not supported. E2B is optimized for short code executions, not persistent compute.
Pricing
Free tier: 100 sandbox hours/month. Pro: $0.10 per sandbox hour (billed per second). Enterprise: custom pricing with dedicated infrastructure. For an AI assistant running 10,000 code executions per day averaging 5 seconds each, that is roughly 417 sandbox hours per month, or approximately $42 at per-second billing; keeping each sandbox alive for the default 5-minute session after every execution raises the cost considerably.
Modal Sandboxes: Serverless Compute with GPU Access
Modal started as a serverless compute platform for ML workloads and has added sandbox capabilities for AI agent use cases. It provides the most powerful execution environment of the three, including GPU access.
How It Works
Modal uses container-based isolation. You define a sandbox environment (base image, installed packages, resource allocation) and Modal provisions it on demand. Code executes in isolated containers with configurable CPU, memory, and GPU resources. Modal's infrastructure handles scaling, cold starts, and resource cleanup automatically.
Strengths
GPU access is Modal's unique advantage. If your AI agent needs to run inference on a local model, process images with computer vision, or execute GPU-accelerated data transformations, Modal is the only option of the three that supports it. A100 and H100 GPUs are available on demand. Cold starts are under 1 second for warmed containers and 5 to 10 seconds for cold containers (GPU containers take longer). Modal's Python SDK is elegant, and the function decorator pattern makes it easy to turn any Python function into a remote execution. Support for a range of GPU types and custom Docker images gives you maximum flexibility.
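The function decorator pattern mentioned above looks roughly like this — a sketch, not a verified implementation. It assumes the `modal` package and a configured Modal token; the app name, image, and the use of plain `exec` are all illustrative.

```python
# Sketch of Modal's decorator pattern: a local Python function becomes
# a remote, GPU-backed execution. Names follow Modal's public docs but
# may drift between versions; everything here is illustrative.
try:
    import modal  # third-party SDK (assumed installed and authenticated)

    app = modal.App("agent-sandbox")  # app name is illustrative
    image = modal.Image.debian_slim().pip_install("numpy")

    @app.function(image=image, gpu="A10G", timeout=60)
    def execute(code: str) -> dict:
        # Runs inside a Modal container. Plain exec() is illustration
        # only, not a hardened untrusted-code execution strategy.
        scope: dict = {}
        exec(code, scope)
        return {k: v for k, v in scope.items() if not k.startswith("__")}

except Exception:
    modal = None  # SDK missing or API drift; treat the above as a sketch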
Weaknesses
Higher latency than E2B for simple code execution (1 to 5 second cold starts vs 200ms). Pricing is higher because you are paying for more powerful infrastructure. The sandbox feature is newer and less mature than E2B's dedicated offering. JavaScript support is limited compared to Python: Modal is Python-first, and while you can run any language in a custom container, the SDK and developer experience are optimized for Python workflows.
Pricing
Pay per second of compute. CPU: $0.0000575/sec (roughly $0.21/hour). GPU: $0.001067/sec for A10G ($3.84/hour) up to $0.003/sec for H100 ($10.80/hour). For the same 10,000 daily code executions averaging 5 seconds, CPU-only costs approximately $86/month. With GPU, costs increase significantly based on GPU type and usage duration.
Fly Machines: Container-Based with Global Edge
Fly Machines provide lightweight VMs (based on Firecracker) that can be started, stopped, and destroyed via API. They are not purpose-built for AI code execution, but their API-driven lifecycle makes them suitable for sandbox use cases.
How It Works
Fly Machines are full Linux VMs that you control programmatically. Start a machine with a specific Docker image, send code to execute via SSH or HTTP, collect output, and destroy the machine. Machines can run in 30+ regions globally, so you can execute code close to your users for lower latency. Each machine is isolated at the VM level, providing strong security boundaries.
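The start → execute → destroy lifecycle described above can be sketched against Fly's Machines REST API with only the standard library. The endpoint and payload shape follow Fly's public docs but may change; the `FLY_API_TOKEN` variable, app name, and image are assumptions for illustration.

```python
# Sketch of creating a Fly Machine via the Machines REST API.
# Endpoint and payload fields follow Fly's public docs but may change;
# token, app name, and image are placeholders.
import json
import os
import urllib.request

API_BASE = "https://api.machines.dev/v1"

def machine_config(image: str, cpus: int = 1, memory_mb: int = 256) -> dict:
    """Build a minimal machine-creation payload."""
    return {
        "config": {
            "image": image,
            "guest": {"cpu_kind": "shared", "cpus": cpus, "memory_mb": memory_mb},
        }
    }

def create_machine(app: str, token: str, image: str) -> dict:
    """POST a new machine; returns Fly's machine description (incl. id)."""
    req = urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines",
        data=json.dumps(machine_config(image)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and os.environ.get("FLY_API_TOKEN"):
    machine = create_machine(
        "my-sandbox-app", os.environ["FLY_API_TOKEN"], "python:3.12-slim"
    )
    print(machine.get("id"))
```

After execution (via SSH or an HTTP endpoint baked into the image), a matching DELETE call destroys the machine — the execution layer itself is yours to build, as noted below.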
Strengths
Global edge deployment means code execution happens close to the user, reducing round-trip latency for interactive applications. Full Linux VM means you can run anything: system packages, background processes, network-accessible services, long-running computations. Persistent volumes let you attach storage that survives machine restarts. Machines can run for hours or days, unlike E2B's session limits. Pricing is competitive at $0.0000075/sec for shared CPU.
Weaknesses
No GPU access (Fly's GPU Machines are in limited availability and not suitable for ephemeral sandbox use). Cold start is 1 to 3 seconds for a pre-built image, slower than E2B. You need to build more infrastructure yourself: there is no built-in code execution API, output streaming, or file system management. You get a VM, and you build the execution layer on top. This flexibility is powerful but requires more engineering effort.
Pricing
Shared CPU: $0.0000075/sec (~$0.027/hour). Performance CPU: $0.0000575/sec (~$0.21/hour). Memory: $0.000002/sec per GB. For 10,000 daily executions at 5 seconds on shared CPU, monthly cost is approximately $11. The cheapest option by far, but you invest more engineering time building the execution framework.
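The per-platform arithmetic in the pricing sections can be reproduced with a small estimator. The rates below are the per-second prices quoted in this article and may have changed since publication.

```python
# Monthly cost estimator using the per-second rates quoted in this
# article (subject to change; check each platform's pricing page).
RATES_PER_SECOND = {
    "e2b_pro": 0.10 / 3600,    # $0.10 per sandbox hour, billed per second
    "modal_cpu": 0.0000575,    # Modal CPU
    "fly_shared_cpu": 0.0000075,  # Fly shared CPU
}

def monthly_cost(platform: str, executions_per_day: int,
                 seconds_per_execution: float, days: int = 30) -> float:
    """Estimated monthly compute cost in USD for pure execution time."""
    total_seconds = executions_per_day * seconds_per_execution * days
    return total_seconds * RATES_PER_SECOND[platform]
```

For the running example of 10,000 daily executions at 5 seconds, `monthly_cost("fly_shared_cpu", 10_000, 5)` returns 11.25, matching the ~$11 figure above; note the estimate covers execution time only, not idle sandboxes, memory, or egress.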
Security Isolation and Multi-Tenancy
Running untrusted code requires serious security isolation. Here is how each platform handles it.
E2B Security
Firecracker microVMs provide hardware-level isolation. Each sandbox gets its own kernel, filesystem, and network namespace. One sandbox cannot access another's data or resources. Network egress can be restricted per sandbox. E2B handles all the security infrastructure, so you do not need to configure kernel parameters or seccomp profiles yourself.
Modal Security
Modal containers run in isolated gVisor sandboxes (similar to how Google Cloud Run handles isolation). Network policies restrict inter-container communication. Secrets management is built in, so sensitive data (API keys, credentials) can be passed to sandboxes without embedding in code. SOC 2 Type II certified for enterprise workloads.
Fly Machines Security
Firecracker VM isolation (same as E2B). Network isolation between machines. But you manage more of the security posture yourself: restricting network egress, limiting filesystem access, setting resource limits, and handling cleanup of sensitive data after execution. The flexibility cuts both ways: more control, more responsibility.
Multi-Tenant Considerations
For B2B SaaS applications where different customers' code runs in sandboxes, ensure complete isolation between tenants. E2B and Fly Machines provide VM-level isolation by default. Modal's container isolation is strong but not quite VM-level. For the highest security requirements (financial services, healthcare), VM-level isolation (E2B or Fly) is preferable.
Performance Benchmarks
Real-world performance matters more than marketing claims. Here are benchmarks from production use cases:
Cold Start Latency
E2B: 150 to 300ms for a Python sandbox. Under 200ms for a pre-warmed sandbox. This is fast enough for interactive AI assistants where users expect sub-second responses. Modal: 800ms to 2 seconds for a CPU container. 5 to 15 seconds for a GPU container. Acceptable for batch processing and agent workflows, but noticeable for interactive use. Fly Machines: 1 to 3 seconds for a standard image. Faster with pre-built images cached in the region.
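Vendor latency figures drift, so numbers like these are worth measuring against your own account and region. A minimal harness, where the `start_sandbox` argument is a stand-in for whichever platform's create-and-destroy call you are benchmarking:

```python
# Minimal cold-start benchmark. `start_sandbox` is a hypothetical
# stand-in: pass a callable that creates (and tears down) a sandbox
# on the platform under test.
import statistics
import time

def measure_cold_start(start_sandbox, samples: int = 5) -> dict:
    """Time several sandbox creations and summarize the results."""
    timings = []
    for _ in range(samples):
        t0 = time.perf_counter()
        start_sandbox()  # platform-specific create/destroy call goes here
        timings.append(time.perf_counter() - t0)
    return {"median_s": statistics.median(timings), "worst_s": max(timings)}
```

Run each platform's create call through this with the same image and region settings you plan to ship, since pre-warmed and cached-image paths behave very differently from true cold starts.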
Execution Throughput
E2B handles burst workloads well, spinning up hundreds of sandboxes concurrently. Modal scales to thousands of concurrent containers with automatic queuing. Fly Machines scale based on your account limits (default 10 concurrent machines, increased on request).
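Whatever the platform-side cap, burst workloads are usually throttled client-side as well, so a burst of agent requests does not slam into an account limit. A minimal sketch with asyncio, where `spawn_sandbox` is a hypothetical stand-in for any platform's create-and-run call:

```python
# Client-side concurrency cap for burst sandbox execution.
# `spawn_sandbox` is a hypothetical async callable: it should create a
# sandbox, run one snippet, and return the result.
import asyncio

async def run_burst(snippets, spawn_sandbox, max_concurrent: int = 10):
    """Run many snippets with at most `max_concurrent` sandboxes alive."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(code):
        async with sem:  # block until a sandbox slot frees up
            return await spawn_sandbox(code)

    # gather preserves input order even though execution interleaves
    return await asyncio.gather(*(run_one(s) for s in snippets))
```

Setting `max_concurrent` just below the platform or account limit (for example, Fly's default of 10 machines) converts hard API errors into simple queuing.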
Practical Recommendation
For interactive AI assistants (code interpreters, data analysis chatbots): E2B. The cold start advantage is decisive. For ML workloads and GPU-dependent tasks: Modal. No alternative offers GPU sandboxes at comparable ease of use. For long-running or globally distributed execution: Fly Machines. Build once, run anywhere. For the architecture patterns behind computer-use agents, E2B and Modal are the most common infrastructure choices.
Choosing the Right Platform
Here are specific recommendations based on use case:
AI coding agents (SWE agents): E2B for most use cases. The fast cold starts and persistent filesystem per session match the agent workflow of writing code, running it, observing output, and iterating. Modal if the agent needs to run ML inference as part of its coding workflow.
AI data analysis tools: E2B for Python/R execution with data processing. Modal if datasets require GPU-accelerated processing (large DataFrame operations, image datasets). Fly Machines if you need persistent environments that users return to across sessions.
Code playground products (like Replit): Fly Machines for persistent, user-facing environments. E2B and Modal are optimized for ephemeral execution, not long-lived developer environments.
Computer-use agents: E2B with their desktop sandbox template (provides a virtual desktop accessible via VNC). Modal for agents that need to run browser automation with Playwright in containerized environments.
Start with E2B for its simplicity and speed. Migrate to Modal when you need GPUs or to Fly Machines when you need persistence and global distribution. Most AI applications start with E2B and only outgrow it when requirements become more specialized.
Need help designing your AI agent execution infrastructure? Book a free strategy call to discuss your workload patterns, security requirements, and scaling needs.