
Engineering

Why we built a sandboxed runtime for AI agents (instead of trusting prompts)

Sofia Alvarez · Apr 8, 2026 · 8 min read

Prompts are not a security boundary. This is the opening we've learned to lead with, because most teams building with LLMs don't yet believe it. They believe they do. They've spent weeks on jailbreak-resistant system prompts, red-teamed their outputs, built guardrails into the prompt template itself. And then their first customer runs a prompt injection attack and those guardrails collapse in seconds.

We made the same bet. Six months into running autonomous agents in production, we stopped betting on prompts.

The guardrails we built first (and why they failed)

We started where everyone starts: the model is the boundary. You write a rock-solid system prompt that tells the model what it can and cannot do. You pair that with output filtering. Maybe you add a refusal training fine-tune. An attacker with a creative prompt still finds a way around it. They ask the model to "complete a creative fiction scenario" and suddenly it's writing database schemas. They use subtle semantic framing that bypasses your filter. They ask the model to answer in a way that defeats your detection pattern. None of this is novel. It's not even hard.

The problem is straightforward: the model is probabilistic. You cannot reliably enforce a security boundary with a language model because language is ambiguous. No amount of prompt engineering changes that. An attacker with five minutes and ChatGPT can outthink your guardrails. You're not fighting the model. You're fighting entropy.

We don't trust the model. We trust the runtime.

The runtime: capability-scoped execution

The mental model shift is simple: assume the model will eventually break every guard you put in prompts. Design the system so that doesn't matter.

V8 isolates for tool execution

When an agent calls a tool, it doesn't run arbitrary code on your servers. It runs inside a sandboxed V8 isolate—a completely isolated JavaScript execution context with zero access to the host process, file system, or network. The policy for what that isolate can do is written in configuration, not negotiated with the model. The model can ask for anything. The runtime says yes or no.

tools:
  - name: send_email
    capabilities: [send_message]
    rate_limit: 10/hour
    blast_radius: single_workspace

  - name: query_database
    capabilities: [read_only]
    allowed_tables: [users, orders, products]
    max_rows: 1000

  - name: write_to_crm
    requires: [approval_gate]
    audit_level: full

Every tool is a policy declaration. The model doesn't know what's possible—it only knows what the workspace has explicitly enabled. Ask the agent to delete all your customer data? The tool doesn't exist in the policy. The agent can request it and hallucinate all it wants; the request returns "tool not found."
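In sketch form, the dispatch path looks something like this. The names here (`dispatch`, `ToolPolicy`) are illustrative, not our actual runtime API—the point is that the policy lookup happens before any code runs, so an undeclared tool simply doesn't exist:

```typescript
// Hypothetical sketch: the model's tool request is resolved against
// the workspace policy before anything executes.
type ToolPolicy = {
  name: string;
  capabilities: string[];
};

type ToolResult =
  | { status: "ok"; output: unknown }
  | { status: "tool_not_found" };

function dispatch(
  requestedTool: string,
  policies: ToolPolicy[],
  run: (tool: ToolPolicy) => unknown
): ToolResult {
  // A tool the workspace never declared does not exist,
  // no matter what the model asks for.
  const policy = policies.find((p) => p.name === requestedTool);
  if (!policy) return { status: "tool_not_found" };
  return { status: "ok", output: run(policy) };
}

// The model can hallucinate any tool name it likes...
const policies: ToolPolicy[] = [
  { name: "send_email", capabilities: ["send_message"] },
];
console.log(dispatch("delete_all_customers", policies, () => null).status);
// prints "tool_not_found" — the runtime only executes what the policy declares.
```

The important property is that the decision is made by a lookup against static configuration, not by anything the model says.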

Deterministic kill switches

Capability scoping is the first layer. The second is blast radius control—the answer to the question "what's the maximum damage this agent can do if it goes wrong?" Every agent deployment specifies hard bounds on how many actions it can take, how fast, and against which resources.

These aren't suggestions. They're enforced at the runtime level. The model execution layer monitors every call, counts every action against the policy, and kills the execution if any bound is exceeded. No exceptions, no clever prompting around it.

Full observability and replay

Every agent invocation is fully logged: the model prompt, every tool call, the response, latency, errors. You can replay any execution, inspect exactly what happened, see where it deviated from expectations. When something goes wrong—and it will—you don't have to guess. You have a complete record.
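One way to picture the log, as a sketch with illustrative names only: each step of an invocation is a structured event, and "replay" is just a walk over the recorded events.

```typescript
// Hypothetical sketch of the audit log: every step of an agent
// invocation is recorded as a structured event.
type AgentEvent =
  | { kind: "prompt"; text: string; at: number }
  | { kind: "tool_call"; tool: string; args: unknown; at: number }
  | { kind: "tool_result"; tool: string; output: unknown; latencyMs: number; at: number }
  | { kind: "error"; message: string; at: number };

class InvocationLog {
  private events: AgentEvent[] = [];

  record(event: AgentEvent): void {
    this.events.push(event);
  }

  // Replay is reading what the agent actually did, in order—
  // no guessing about what happened.
  replay(): AgentEvent[] {
    return [...this.events];
  }
}

const log = new InvocationLog();
log.record({ kind: "prompt", text: "summarize new orders", at: Date.now() });
log.record({ kind: "tool_call", tool: "query_database", args: { table: "orders" }, at: Date.now() });
log.record({ kind: "tool_result", tool: "query_database", output: { rows: 12 }, latencyMs: 40, at: Date.now() });
```

The discriminated union matters: a replay tool can pattern-match on `kind` and reconstruct exactly where an execution deviated from expectations.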

In production: what the architecture actually prevents

In our first six months, this runtime blocked 12,000+ unsafe operations with zero false positives that broke a production workflow. That's more than 65 per day on average, across our customer base. Most were prompt injections—customers accidentally (or intentionally) testing whether they could get agents to do things outside their intended scope. Some were subtle: agents discovering they could make 200 requests to a tool in a loop and trying to exhaust an API quota. One was an agent hallucinating an entirely fictional tool name and trying to invoke it, then pivoting to a different tool when that failed.

The runtime stopped all of it before any code executed. No database tables were touched. No external APIs were hammered. The agent tried, logged the attempt (for audit), and the request returned "not authorized."

This is why we don't trust prompts. The model will always be creative. Creativity is what makes LLMs valuable. But creativity is the enemy of security. The answer isn't to train it out of the model. The answer is to build infrastructure that lets the model be creative within a hard boundary.

What remains hard

This architecture is not magic. It moves the problem, it doesn't eliminate it. Hard problems remain, and they're what we're solving now. But they're second-order. The first-order problem—an attacker using prompt injection to make the model break its guardrails—is solved. The model still tries. The runtime says no. That's the architecture we've learned to trust.
