Prompts are not a security boundary. This is the opening we've learned to lead with, because most teams building with LLMs don't yet believe it. They believe they do. They've spent weeks on jailbreak-resistant system prompts, red-teamed their outputs, built guardrails into the prompt template itself. And then their first customer runs a prompt injection attack and those guardrails collapse in seconds.
We made the same bet. Six months into running autonomous agents in production, we stopped betting on prompts.
The guardrails we built first (and why they failed)
We started where everyone starts: the model is the boundary. You write a rock-solid system prompt that tells the model what it can and cannot do. You pair it with output filtering. Maybe you fine-tune for refusals. An attacker with a creative prompt still finds a way around all of it. They ask the model to "complete a creative fiction scenario" and suddenly it's writing database schemas. They use subtle semantic framing that bypasses your filter. They ask the model to answer in a way that defeats your detection pattern. None of this is novel. It's not even hard.
The problem is straightforward: the model is probabilistic. You cannot reliably enforce a security boundary with a language model because language is ambiguous. No amount of prompt engineering changes that. An attacker with five minutes and ChatGPT can outthink your guardrails. You're not fighting the model. You're fighting entropy.
We don't trust the model. We trust the runtime.
The runtime: capability-scoped execution
The mental model shift is simple: assume the model will eventually break every guard you put in prompts. Design the system so that doesn't matter.
V8 isolates for tool execution
When an agent calls a tool, it doesn't run arbitrary code on your servers. It runs inside a sandboxed V8 isolate—a completely isolated JavaScript execution context with zero access to the host process, file system, or network. The policy for what that isolate can do is written in configuration, not negotiated with the model. The model can ask for anything. The runtime says yes or no.
```yaml
tools:
  - name: send_email
    capabilities: [send_message]
    rate_limit: 10/hour
    blast_radius: single_workspace
  - name: query_database
    capabilities: [read_only]
    allowed_tables: [users, orders, products]
    max_rows: 1000
  - name: write_to_crm
    requires: [approval_gate]
    audit_level: full
```
Every tool is a policy declaration. The model doesn't know what's possible—it only knows what the workspace has explicitly enabled. Ask the agent to delete all your customer data? The tool doesn't exist in the policy. The agent can request it and hallucinate all it wants; the runtime returns "tool not found."
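A minimal sketch of that lookup, in TypeScript. The types and function names here are ours, invented for illustration; the real runtime does far more (isolate setup, rate accounting, audit logging), but the core decision is this simple:

```typescript
// Hypothetical shape of a workspace tool policy (mirrors the YAML above).
type ToolPolicy = {
  name: string;
  capabilities?: string[];
  requires?: string[];
};

type Verdict = { ok: true } | { ok: false; reason: string };

// The runtime, not the model, decides. A tool absent from the policy
// simply does not exist, no matter what the model asks for.
function authorize(policies: ToolPolicy[], tool: string): Verdict {
  const policy = policies.find((p) => p.name === tool);
  if (!policy) return { ok: false, reason: "tool not found" };
  if (policy.requires?.includes("approval_gate")) {
    return { ok: false, reason: "approval required" }; // human sign-off path
  }
  return { ok: true };
}

const workspace: ToolPolicy[] = [
  { name: "send_email", capabilities: ["send_message"] },
  { name: "write_to_crm", requires: ["approval_gate"] },
];

authorize(workspace, "delete_all_customers"); // → { ok: false, reason: "tool not found" }
```

The point of the design is that this check runs before any model output is interpreted as an action, so a hallucinated tool name never reaches execution.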
Deterministic kill switches
Capability scoping is the first layer. The second is blast radius control—the answer to the question "what's the maximum damage this agent can do if it goes wrong?" Every agent deployment specifies:
- Rate limits: 10 emails/hour, not 10,000. If the agent loops, the runtime stops it.
- Approval gates: Destructive operations (delete, modify across accounts) require human sign-off. The model can't override this.
- Time bounds: Agent execution halts after 5 minutes. No runaway processes, period.
These aren't suggestions. They're enforced at the runtime level. The model execution layer monitors every call, counts every action against the policy, and kills the execution if any bound is exceeded. No exceptions, no clever prompting around it.
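To make the enforcement concrete, here is a simplified sketch of an execution budget in TypeScript. The class and its limits are hypothetical stand-ins for the runtime's real accounting, but the shape is the point: the counter lives outside the model's reach, and exceeding a bound kills the run:

```typescript
// Hypothetical execution budget, enforced by the runtime, not the prompt.
class ExecutionBudget {
  private actions = 0;
  private readonly startedAt = Date.now();

  constructor(
    private readonly maxActions: number, // e.g. 10 emails/hour
    private readonly maxMillis: number,  // e.g. a 5-minute wall clock
  ) {}

  // Called before every tool invocation; throwing halts execution.
  charge(): void {
    if (Date.now() - this.startedAt > this.maxMillis) {
      throw new Error("time bound exceeded: execution halted");
    }
    if (++this.actions > this.maxActions) {
      throw new Error("rate limit exceeded: execution halted");
    }
  }
}

const budget = new ExecutionBudget(10, 5 * 60 * 1000);
for (let i = 0; i < 10; i++) budget.charge(); // ten allowed actions
// An eleventh charge() throws, regardless of what the model asks for.
```

Because the model never sees this object, there is no prompt that can negotiate the limit upward.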
Full observability and replay
Every agent invocation is fully logged: the model prompt, every tool call, the response, latency, errors. You can replay any execution, inspect exactly what happened, see where it deviated from expectations. When something goes wrong—and it will—you don't have to guess. You have a complete record.
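The record types below are a hypothetical sketch of what such a log might look like, not our actual schema; they show why replay reduces to walking a data structure rather than guessing:

```typescript
// Hypothetical shape of one logged agent invocation.
type ToolCall = {
  tool: string;
  args: unknown;
  verdict: string;   // e.g. "allowed", "tool not found"
  latencyMs: number;
};

type InvocationRecord = {
  id: string;
  prompt: string;    // the exact model prompt, as sent
  calls: ToolCall[]; // every tool call, in order
  error?: string;
};

// Replay is just walking the record: every decision is already on disk.
function replay(record: InvocationRecord): string[] {
  return record.calls.map(
    (c) => `${c.tool} -> ${c.verdict} (${c.latencyMs}ms)`,
  );
}
```

With a record like this, a post-incident review is a diff between what the policy allowed and what the agent attempted.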
In production: what the architecture actually prevents
In our first six months, this runtime blocked 12,000+ unsafe operations with zero false positives that broke a production workflow. That's 40 per day on average, across our customer base. Most were prompt injections—customers accidentally (or intentionally) testing whether they could get agents to do things outside their intended scope. Some were subtle: agents discovering they could make 200 requests to a tool in a loop and trying to exhaust an API quota. One was an agent hallucinating an entirely fictional tool name and trying to invoke it, then pivoting to a different tool when that failed.
The runtime stopped all of it before any code executed. No database tables were touched. No external APIs were hammered. The agent tried, logged the attempt (for audit), and the request returned "not authorized."
This is why we don't trust prompts. The model will always be creative. Creativity is what makes LLMs valuable. But creativity is the enemy of security. The answer isn't to train it out of the model. The answer is to build infrastructure that lets the model be creative within a hard boundary.
What remains hard
This architecture is not magic. It moves the problem, it doesn't eliminate it. The hard problems that remain:
- Data exfiltration via summarization: An agent with read access to sensitive data can't delete it or move it, but can it subtly reference it in a summary? We can detect obvious patterns, but sophisticated extraction demands better intent detection than today's models provide.
- Multi-step attacks: An agent might not have the capability to do X directly, but can it chain multiple legitimate operations to achieve the same effect? This requires semantic analysis of intent, not just syntactic checks.
- Side-channel attacks: An agent can't exfiltrate data via the network, but can it encode it in request timing, error messages, or log output?
These are the problems we're solving now. But they're second-order. The first-order problem—an attacker using prompt injection to make the model break its guardrails—is solved. The model still tries. The runtime says no. That's the architecture we've learned to trust.