A Contrarian Thought Experiment: The Per-Agent VM Was a 24-Month Accident
An exercise in arguing against the consensus.
Published
Nov 20, 2025
Topic
Artificial Intelligence

Today's agent sandbox category may turn out to be transitional architecture rather than the endgame.
This piece is a thought experiment. It deliberately argues a contrarian position against the current consensus on agent infrastructure. Treat it as an exercise in stress-testing assumptions, not as a prediction the author is willing to defend without qualification.
The current consensus runs roughly as follows. e2b, Modal, Daytona, and the cloud hyperscaler sandbox products are the durable infrastructure layer for agent compute. Per-agent isolation is non-negotiable. The architecture works, the pricing works, and the security model works. Founders building agent products treat this layer as a settled question and move up the stack.
The contrarian case is that the consensus reflects a 2023-era assumption that has aged worse than people realize. The assumption was that one agent looks like one developer, so it needs one machine. That mental model traveled directly into the infrastructure, made sense for two years, and is now being relitigated by anyone running agent workloads at meaningful scale. Whether the relitigation results in a category shift or just a layer of abstraction sitting on top of the existing stack is the part that is unsettled.
The Original Sin
The first generation of coding agents in 2023 looked a lot like junior developers. They opened a repository, made changes, ran tests, committed. The natural thing to give them was the same environment a junior developer would have: a clean machine with the repo cloned and dependencies installed. The infrastructure category that materialized to serve this need treated the VM (or its container-flavored cousin) as the obvious unit.
The strongest version of this argument is real and worth acknowledging on its own terms. Vasek Mlejnsky and Tomas Valenta, the e2b co-founders, have publicly defended the per-sandbox model on the grounds that strict process isolation, language-runtime flexibility, and security boundaries are non-negotiable for executing untrusted code. Erik Bernhardsson at Modal has argued separately that the sandbox abstraction generalizes across data, ML, and agent workloads, which is part of why Modal's commercial trajectory has been broad rather than narrow. Both positions are coherent. Both have been validated in the market.
The thought experiment's claim is not that those arguments are wrong. It is that they were optimized for a workload pattern that was true in 2023 and is becoming less true with each quarter. Specifically: one agent, working on one task, for a duration measured in minutes to hours. Isolated, disposable, serialized. That workload pattern is still common. It is no longer dominant. The shift in what agents are actually doing has happened faster than the infrastructure category has acknowledged, and the gap between the two is what the contrarian case lives in.
What Changed
Three forces have shifted the calculation under the per-VM model.
First, agent count per task has scaled. A meaningful share of production agent workloads now involve dozens of parallel agents working on variants of the same task: exploring different solution paths, running different evaluation criteria, or sharding across files in a large repository. When the unit of work is a fleet rather than a single run, paying for 30 cold VM starts and 30 dependency installations every time becomes absurd. The shared-base model is the natural fit, and it does not exist inside a per-VM architecture.
Second, average task length has dropped. The first generation of coding agents took 10 to 30 minutes to complete a task. The current generation often completes in under 2 minutes. As the task shortens, the fixed cost of cold start becomes a larger share of total runtime. A 90-second task that takes 90 seconds to spin up is not a viable economic unit.
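The cold-start arithmetic is worth making concrete. The sketch below uses illustrative numbers, not measurements of any particular vendor, to show how the fixed spin-up cost dominates as tasks shorten and fleets grow:

```python
# Back-of-envelope: what fraction of total wall-clock time is cold start?
# All numbers are illustrative, not measurements of any particular vendor.

def cold_start_share(startup_s: float, task_s: float) -> float:
    """Fraction of total runtime spent on environment spin-up."""
    return startup_s / (startup_s + task_s)

# 2023-style workload: 90 s VM boot plus dependency install, 20-minute task.
print(f"{cold_start_share(90, 20 * 60):.0%}")   # -> 7%

# Current workload: same 90 s spin-up, 90 s task.
print(f"{cold_start_share(90, 90):.0%}")        # -> 50%

# Fleet variant: 30 parallel agents each paying the same fixed cost.
fleet_overhead_s = 30 * 90
print(f"{fleet_overhead_s / 60:.0f} minutes of pure spin-up per fleet run")
# -> 45 minutes of pure spin-up per fleet run
```

The shared-base model attacks exactly this term: it amortizes the 90 seconds once across the fleet instead of charging it per agent.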
Third, cost-per-run has become the primary GTM lever for agent platforms. Capability ceilings are converging across major model providers. Differentiation has shifted to how cheaply, how quickly, and how reliably an agent can complete the work. That makes infrastructure efficiency a first-order product concern rather than a back-office optimization. The platforms that ship per-run pricing today are doing so on a substrate that fights them on every dimension.
None of these forces invalidates the per-VM model on technical grounds. They invalidate it on economic grounds, which historically is the more decisive argument when an architectural shift is in play.
The Replacement Architecture
The substrate that emerges from these forces has a consistent shape. Sessions instead of VMs. A shared base that all sessions fork from, materializing in milliseconds rather than minutes. Copy-on-write semantics so that the marginal storage cost of an additional session is the delta, not the base. Content-addressable storage underneath, deduplicating at the blob level. Multi-protocol access surfaces so the same session can be driven by an agent over gRPC, hit by a CI pipeline over HTTP, or mounted locally by a human through FUSE.
AetherFS, a Rust-built filesystem currently in private beta, is the most articulated external example of this architecture. The system is built explicitly around copy-on-write overlay sessions on a shared base, with a content-addressable store handling deduplication and reference counting. Public design documents describe a goal of sub-10-second session materialization and per-session storage costs that scale with delta size, not base size.
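The deduplication and reference-counting behavior those design documents describe is a standard content-addressable pattern. The sketch below is a generic illustration of it; the names and structure are hypothetical, not AetherFS's actual API:

```python
# Generic sketch of a content-addressable blob store with reference counting.
# Hypothetical names; this is not AetherFS's actual API or implementation.

import hashlib

class BlobStore:
    def __init__(self):
        self.blobs: dict[str, bytes] = {}  # content hash -> bytes
        self.refs: dict[str, int] = {}     # content hash -> reference count

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:          # identical content is stored once
            self.blobs[key] = data
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def release(self, key: str) -> None:
        self.refs[key] -= 1
        if self.refs[key] == 0:            # last session gone: reclaim space
            del self.refs[key], self.blobs[key]

store = BlobStore()
a = store.put(b"same dependency tarball")
b = store.put(b"same dependency tarball")  # second session, identical bytes
assert a == b and len(store.blobs) == 1    # deduplicated at the blob level
store.release(a)
assert a in store.blobs                    # still referenced elsewhere
store.release(b)
assert a not in store.blobs                # reclaimed
```

Under this scheme, thirty sessions sharing a base image pay for the base's blobs once, and per-session cost tracks only what each session writes.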
AetherFS is not the only example. The same architecture, in less articulated form, is what most large agent platforms have built internally. Cursor, Cognition, Factory.ai, Replit, and others have each shipped some version of the same idea, optimized for their specific workload. The fact that the convergence has happened privately rather than as a category shift is part of what makes the contrarian case interesting. The technical answer is already settled inside several of the largest agent companies. The market structure has not caught up.
What the Sandbox Companies Do Next
Inside the thought experiment, the prediction follows mechanically. The current sandbox-as-a-service companies have three paths.
The first is to move down-stack into storage and state. Add overlay sessions, add content-addressable layers, change the unit of pricing from VM-second to session-second. This preserves the brand and the customer base. e2b is the company best positioned for this move. Modal has the technical breadth to do it but might prefer a different lane.
The second is acquisition into a broader platform play. A hyperscaler buys, a CDE vendor buys, an agent platform buys. The sandbox category becomes a layer in someone else's stack. This is the most likely outcome for at least two of the current independent vendors.
The third is to compete on inference adjacency, leaning into colocation with model serving rather than holding the substrate position. Modal already does some of this. The path is real but narrows the addressable category.
None of these outcomes is catastrophic for the existing vendors. None of them preserves the current shape of the category either.
The Hedge
The hedge is explicit: the contrarian case may be wrong. Bernhardsson's broad-substrate thesis at Modal has held up commercially so far, and there is no obvious reason it stops. Strict isolation requirements in regulated industries (financial services, healthcare, government workloads) could entrench the per-VM model in those segments regardless of the architectural argument. Inertia is a real force, and the existing vendors have distribution that any new entrant would need years to match.
But if you are building agent infrastructure today, and you have not stress-tested why the VM is your unit, you are betting on consensus rather than reasoning from the workload. The history of infrastructure transitions does not favor that bet. The history of consensus does not favor it either.