Agents Need Real Computers, Not Just Containers

Source: Stanislav Kozlovski with David Crawshaw, CEO and co-founder of exe.dev; YouTube video ID 1GX18UGoJRw; published as “agents need VMs, not containers.”

May 23, 2026

The next useful cloud for software builders may look less like a maze of managed services and more like a disposable, identity-protected machine with root access, fast local disk, snapshots, and enough isolation to let agents make a mess.

David Crawshaw’s argument is blunt: coding agents do not merely need a place to run commands. They need a real computer-shaped environment where they can run tests, start services, create network interfaces, install dependencies, mutate system configuration, and recover when they break things.

Who Is David Crawshaw?

David Crawshaw is the CEO and co-founder of exe.dev, a young cloud company built around agent-friendly virtual machines. Before exe, he co-founded Tailscale and served as its CTO; earlier, he spent years at Google working on systems problems including Go, large-scale log processing, and Fuchsia networking.

That background matters because his case is not a generic “AI will change cloud” slogan. It comes from building developer tools, networking systems, isolation layers, and now a cloud designed around what agents actually need to be useful.

The Product Started as an Agent. The Infrastructure Became the Product.

Crawshaw and his co-founder originally explored AI coding products before the current wave of cloud-based coding agents had stabilized. One early project, Sketch, copied a Git repository into a fresh container, let an agent work there, and pulled commits back onto a branch.

The architecture was elegant in a very developer-tooling way. The user interface, however, was Git. Every prompt produced work on a specialized branch, and the user had to move through Git commands to see what happened. It was powerful, but awkward.

“We really liked isolating agents. They work a lot better in some kind of universe of their own.”

That insight became more valuable than the original agent product. Agents improved when they had their own world, but creating that world correctly was surprisingly hard: it required the right machine shape, fast DNS, secure access, isolation, and enough system privileges for real development workflows. “Before you know it,” Crawshaw says, “you’ve built a cloud.”

Containers Are the Wrong Shape for Serious Agent Work

The team began with Docker because containers seem like the natural isolation boundary for untrusted automation. They are fast, familiar, and ubiquitous in developer workflows. But the first serious customers immediately ran into the difference between a sandbox and a developer machine.

Real test suites often expect to start Docker Compose, bring up Postgres, create networks, install K3s, modify cgroups, or exercise behavior close to production. Nesting all of that inside another container produces sharp edges. Crawshaw’s conclusion: the agent environment needs to be a VM.

“If your tests need to create a network interface, let them create a network interface. Give them root, let them do whatever they have to do… containers don’t really work at all.”

This is not an academic distinction. A useful coding agent must close its own feedback loop. It should change code, run the same meaningful tests a developer would run, start the server, inspect the result, and iterate. If the environment cannot support the test suite, the human becomes the error relay: paste failure, wait for patch, run again, paste another failure.
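The closed loop described above can be sketched in a few lines. This is an illustration only, not how exe.dev or any particular agent works: `run_tests` and `propose_patch` are hypothetical callables standing in for the real test runner and the model call, and the toy versions below exist just to make the sketch runnable.

```python
from typing import Callable, Optional, Tuple


def agent_loop(
    run_tests: Callable[[str], Optional[str]],
    propose_patch: Callable[[str, str], str],
    code: str,
    max_test_runs: int = 5,
) -> Tuple[str, int, bool]:
    """Patch and re-test until the suite passes or the budget runs out.

    run_tests returns None on success, or a failure message.
    propose_patch takes (code, failure) and returns revised code.
    Returns (final_code, test_runs_used, passed).
    """
    for runs in range(1, max_test_runs + 1):
        failure = run_tests(code)
        if failure is None:
            return code, runs, True
        if runs < max_test_runs:
            # The agent reads the failure itself; no human relays it.
            code = propose_patch(code, failure)
    return code, max_test_runs, False


# Toy stand-ins (hypothetical) so the loop can be exercised.
def toy_tests(code: str) -> Optional[str]:
    return None if "fixed" in code else "AssertionError: expected fix"


def toy_patch(code: str, failure: str) -> str:
    return code + " fixed"


result = agent_loop(toy_tests, toy_patch, "buggy")
print(result)  # ('buggy fixed', 2, True)
```

If the environment cannot run the suite at all, `run_tests` has no real implementation, and the human ends up playing that role by hand.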

For simple projects, containers may be enough. For production codebases with real dependencies, system-level assumptions, and CI parity requirements, agents need something closer to a laptop in the cloud: isolated from the user’s machine, but powerful and permissive inside its own boundary.

AI-Friendly Is Developer-Friendly

Crawshaw’s broader thesis is that AI-friendly infrastructure is developer-friendly infrastructure. Models are trained on developer behavior and developer tools. They are good at using systems in the way humans use them. When a cloud is confusing for a human, it is also expensive for an agent.

That expense is not only token cost in dollars. It is context-window cost. If an agent burns a quarter of its context reading cloud API documentation just to list VMs, that is intelligence no longer available for the actual task. A verbose CLI, ambiguous docs, and resource-model complexity become an “agent tax.”
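A back-of-the-envelope sketch makes the tax concrete. The 200,000-token window is an assumed figure chosen for illustration; only the one-quarter overhead comes from the example above.

```python
def usable_context(window_tokens: int, overhead_fraction: float) -> int:
    """Tokens left for the actual task after tooling overhead is paid."""
    if not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("overhead_fraction must be between 0 and 1")
    return int(window_tokens * (1.0 - overhead_fraction))


WINDOW_TOKENS = 200_000  # assumed window size, not from the source
remaining = usable_context(WINDOW_TOKENS, 0.25)
print(remaining)  # 150000
```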

“Every token that you’re just constantly wasting doing something is eating that context window… that’s intelligence that is no longer being used to solve your problem.”

Traditional clouds were optimized for enterprise procurement, scale, and fleet economics. A hundred-million-dollar cloud contract is negotiated around CPUs, discounts, commitments, and procurement leverage. Whether a developer or an agent needs two extra flags is rarely decisive. In an agent-heavy workflow, those small frictions compound into lower autonomy.

The New Default: Instant VM, TLS, IAM Proxy, No Public SSH

Crawshaw’s ideal development environment is opinionated: start a VM instantly, put TLS on it automatically, put an identity-aware proxy in front of it, make web services visible only to the right user or team, and avoid exposing SSH directly to the public internet.

That framing makes exe.dev feel less like “AWS for agents” and more like “a VPS with the hardening checklist already done.” The VPS crowd is right that a simple server is a wonderful programming substrate. The problem is that a responsible VPS setup comes with a long list: users, SSH, firewalls, TLS, DNS, reverse proxies, monitoring, backups, access control, and more.

Agent-native cloud design should preserve the simplicity of a machine while removing the unsafe defaults and repetitive setup work.

Cheap Experiments Need Zero-Marginal-Cost Machines

Agents change the economic unit of software creation. When generating a prototype gets cheaper, the number of prototypes explodes. Crawshaw describes having ten ideas for every one thing he actually builds. In an agentic workflow, some of those ideas can be pasted into an environment and, perhaps 25% of the time, something close to the desired result appears.

The remaining bottleneck is deployment and sharing. It should not require buying a fresh VM, configuring a reverse proxy, wiring DNS, and managing idle resources every time an agent produces a demo.

Exe’s model is to let users buy a pool of CPU and memory, then create many VMs inside that pool so the marginal cost of starting another machine approaches zero. Ten ideas can become ten isolated machines. Seven are abandoned. Two are revisited occasionally. One might become real.
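One way to picture the pool model is a toy billing calculation in which cost depends on the size of the pool and not on how many VMs live inside it. The `ResourcePool` class, the pool size, and the prices below are all hypothetical; the source does not describe exe.dev’s actual pricing or scheduling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourcePool:
    """A purchased pool of CPU and memory shared by many VMs (toy model)."""
    vcpus: int
    memory_gb: int
    vm_names: List[str] = field(default_factory=list)

    def create_vm(self, name: str) -> None:
        # Creating a VM adds a machine to the pool; it does not add a bill.
        self.vm_names.append(name)

    def monthly_cost(self, usd_per_vcpu: float, usd_per_gb: float) -> float:
        # Cost is a function of pool size only, never of VM count.
        return self.vcpus * usd_per_vcpu + self.memory_gb * usd_per_gb


pool = ResourcePool(vcpus=8, memory_gb=32)  # hypothetical pool size
cost_before = pool.monthly_cost(usd_per_vcpu=5.0, usd_per_gb=1.0)  # made-up prices
for i in range(10):
    pool.create_vm(f"idea-{i}")
cost_after = pool.monthly_cost(usd_per_vcpu=5.0, usd_per_gb=1.0)
print(len(pool.vm_names), cost_before, cost_after)  # 10 72.0 72.0
```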

That is a cloud model designed for abundance rather than scarcity: lots of disposable environments, cheap idleness, and low ceremony around sharing.

Local NVMe Exposes How Weird Cloud Storage Became

One of Crawshaw’s sharpest technical critiques is about disk. Hyperscale cloud architecture often pushes developers toward remote storage because it simplifies fleet management. In the hard-drive era, that made sense: if seek latency was around 10 milliseconds and network round trip was around 1 millisecond, remote storage overhead was tolerable.

With SSDs and NVMe, the math changes. Local storage latency can be measured in microseconds while the network is still roughly millisecond-scale. Suddenly, the cloud’s remote-storage abstraction is a tax on random I/O. Crawshaw points out that buying roughly 200,000 IOPS on EC2 can cost around $20,000 per month, while his laptop can deliver roughly 500,000 IOPS.
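The arithmetic behind that shift is worth spelling out. In the sketch below, the 10 ms seek, the 1 ms round trip, and the EC2 price point come from the figures above; the 100-microsecond NVMe latency is an assumed round number, since the text only says “microseconds.”

```python
def remote_slowdown(local_latency_us: int, network_rtt_us: int) -> float:
    """How many times slower one random read gets once a network hop is added."""
    return (local_latency_us + network_rtt_us) / local_latency_us


NETWORK_RTT_US = 1_000  # about 1 ms round trip

hdd_era = remote_slowdown(10_000, NETWORK_RTT_US)  # 10 ms disk seek
nvme_era = remote_slowdown(100, NETWORK_RTT_US)    # assumed 100 us NVMe read

print(f"{hdd_era:.1f}x, {nvme_era:.1f}x")  # 1.1x, 11.0x

# Rough price of provisioned random I/O from the figures quoted above.
usd_per_iops_month = 20_000 / 200_000
print(f"${usd_per_iops_month:.2f} per IOPS per month")  # $0.10 per IOPS per month
```

Under these assumptions the network hop added about 10% to a hard-drive read but makes an NVMe read roughly eleven times slower, which is why the same abstraction went from tolerable to expensive.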

“You write software locally, it’s fast and good. You ship it to the cloud, it’s slow.”

The reason is not that cloud providers are foolish. Remote storage simplifies capacity planning. It removes one dimension from the SKU matrix. AWS does not have to perfectly match customers who want 256 vCPUs, 1 TB of RAM, and only 100 GB of disk against physical machines with awkward local-drive shapes.

But what is efficient for hyperscaler logistics is not always what is best for developer performance. Agent environments, especially those running tests and dev servers, benefit from feeling local: fast disk, real root, and low-latency feedback.

AWS’s Real Superpower Is Capacity, Not 800 Services

Crawshaw is skeptical of the long tail of managed cloud services. He argues that many services reflect what seemed useful five years ago, and that agents plus open-source software make it increasingly feasible to recreate what a team actually needs.

But he gives AWS full credit for one extraordinary capability: “They’ve always got another computer.” Press a button and a machine appears in roughly 30 seconds. Buy and rack your own hardware, and the wait may be six weeks.

That distinction matters for companies “lost in the middle”: large enough for cloud bills to hurt, too small to negotiate the kind of hyperscaler discounts available to the biggest buyers. For them, alternative clouds can win if they combine better economics with enough elasticity and developer ergonomics.

Agents May Erode the Managed-Service Moat

Managed services are attractive because operations are scary. People use RDS, hosted Kafka, and managed data platforms because they want someone else to carry the pager, absorb the weird failure modes, and provide institutional expertise.

Crawshaw’s counterpoint is visceral: the idea of not being able to get to his own Postgres is terrifying. If he knows what to fix but the managed layer prevents direct access, he is stuck waiting for a provider.

Agents complicate the tradeoff. They can compress the learning curve of unfamiliar systems. Crawshaw says using agents to crash-course himself on how a system works is “extraordinarily effective.” Kafka is his example: with his systems background, an agent, and a hard week, he believes he could learn enough operational failure modes to run it with confidence in a way that would not have been true five years ago.

That does not mean every team should run every database itself. It does mean that “this is too hard to operate” becomes a weaker moat when an agent can synthesize docs, forum posts, configs, logs, and runbooks into concrete next actions.

The Iptables Story: Why Root Access Matters

The strongest case for agentic systems work came from exe’s own launch. After opening the service, the team immediately attracted abuse: Bitcoin miners, DDoS clients, and automated account creation every few seconds. Crawshaw spun up an isolated production-like instance containing the spammer accounts and installed a command-line agent directly on the machine with root access.

The issue was obscure Linux networking behavior. Malicious VMs were affecting neighboring VMs by collapsing a host-level connection-tracking table. Crawshaw had attempted to restrict the relevant interfaces, but the iptables configuration was wrong.

The agent found the problem quickly. The fix was one line. Without the agent, Crawshaw says he would have spent many hours researching an unfamiliar corner of the Linux kernel and networking stack at 2 a.m. from Australia.

“Once it happens to you you’re like wow okay I’m never going back.”

This is the promise and the danger in one story. The agent was valuable precisely because it could inspect and mutate the system. Read-only analysis would have been weaker. But giving an agent root on a real production system with customer data is an entirely different risk category.

The Safety Boundary Is Reversibility

Crawshaw is not casual about production autonomy. “AI SRE” is not solved. The hard part is not only model capability; it is containment, approval, data exposure, and rollback.

He is more comfortable giving agents broad permissions inside isolated environments: dev machines, toy projects, staging boxes, and eventually production-like VMs with strong snapshots. A disposable VM changes the psychology of risk. Let the agent break everything; create a new VM or roll back to a snapshot.
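The snapshot-and-rollback idea can be modeled in miniature. The `DisposableVM` class below is a hypothetical toy that treats machine state as a dictionary; real VM snapshots operate on disks and memory, but the control flow is the same: capture, let the agent break things, restore.

```python
import copy


class DisposableVM:
    """Toy model of a VM whose state can be snapshotted and restored."""

    def __init__(self, state: dict) -> None:
        self.state = copy.deepcopy(state)
        self._snapshots: dict = {}

    def snapshot(self, label: str) -> None:
        # Deep copy so later mutations cannot leak into the snapshot.
        self._snapshots[label] = copy.deepcopy(self.state)

    def rollback(self, label: str) -> None:
        self.state = copy.deepcopy(self._snapshots[label])


original = {"nginx.conf": "listen 443;", "packages": ["postgres"]}
vm = DisposableVM(original)
vm.snapshot("before-agent")

# The agent makes a mess with root access.
vm.state["nginx.conf"] = ""
vm.state["packages"].clear()

vm.rollback("before-agent")
print(vm.state == original)  # True
```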

This is where “dev equals prod” becomes more than nostalgia for live-edited Perl scripts. If the environment is production-shaped, isolated, and snapshot-backed, then agent changes can be explored against something realistic without giving the agent the keys to a laptop or an entire company.

Simon Willison’s “lethal trifecta” still looms: access to private data, exposure to untrusted content, and the ability to communicate externally are dangerous together. Internet access supplies the last two at once. Cut off internet access and the agent loses the ability to search for obscure fixes. Allow internet access and data exfiltration becomes the threat. Zero-data-retention contracts help with model-provider trust, but they do not eliminate runtime exfiltration risk.
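The trifecta reduces to a simple conjunction, which a policy check can make explicit. The sketch below is illustrative only. It names the three legs as Willison defines them (private data, untrusted content, external communication), with internet access typically bringing both of the last two; the `AgentCapabilities` type and `has_lethal_trifecta` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    private_data: bool            # can read secrets or customer data
    untrusted_content: bool       # reads web pages, issues, logs from outside
    external_communication: bool  # can send data off the machine


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three risk factors are present together."""
    return (
        caps.private_data
        and caps.untrusted_content
        and caps.external_communication
    )


prod_root_agent = AgentCapabilities(True, True, True)
offline_agent = AgentCapabilities(True, False, False)

print(has_lethal_trifecta(prod_root_agent))  # True
print(has_lethal_trifecta(offline_agent))    # False
```

Removing any one leg breaks the conjunction, which is why an isolated VM with no customer data can safely keep internet access and root.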

Key Lessons

Useful agents need full feedback loops. They should run real tests, start real services, and inspect real behavior without routing every failure through a human.
VMs are a better default boundary than containers for serious agent work. Containers are often too constrained for nested Docker, K3s, network interfaces, root-level changes, and CI parity.
Developer experience is model performance infrastructure. Confusing CLIs and cloud docs consume context that should be used on the actual problem.
Cheap, isolated, idle-friendly environments unlock more prototypes. Agents create software abundance; infrastructure must make deployment and sharing similarly cheap.
Rollback may be the most important safety feature. Snapshots and disposable machines make root-level agent work feasible before full production autonomy is safe.

Why This Matters for Diffie

Diffie is also in the business of giving frontend engineers a tighter feedback loop. The lesson from Crawshaw’s work is that agent quality depends less on clever prompting than on the environment around the agent: what it can run, what it can observe, how safely it can fail, and how quickly it can recover.

For an AI browser testing tool, that points to a concrete product direction. Diffie should not position itself merely as “tests generated by AI.” The higher-value promise is production-shaped frontend verification in an isolated, recoverable environment. A browser agent that can open the app, exercise flows, inspect console/network state, compare visual changes, and feed structured failures back to the developer is valuable because it closes the loop.

The go-to-market implication is equally specific: sell the cost of broken loops. Frontend teams already lose time when a CI failure, flaky browser state, missing auth setup, or environment mismatch forces a human to babysit the agent. Diffie can name that pain directly: every minute spent reconstructing browser state is engineering intelligence not spent shipping product.

There is also a strategic infrastructure question. If Diffie’s tests need realistic app state, auth, network behavior, browser permissions, local services, seeded databases, and repeatable rollbacks, then the winning architecture may look more VM-like than container-like at the edges. Even if the core service uses containers internally, the customer-facing story should emphasize safe realism: run the messy browser workflow, capture evidence, and make rollback/replay cheap.

For ideal-customer-profile targeting and outbound, this creates a sharper wedge: frontend teams adopting coding agents now have more generated UI changes than their review and QA loops can absorb. Diffie can be the missing verification machine for that new abundance. The message is not “AI testing is faster.” It is: your agents can write code faster than your team can trust it; Diffie gives every change a realistic browser feedback loop before it reaches humans.