Language models are probabilistic. They emit tokens with a distribution. The world they touch is deterministic. A wire transfer either happens or it doesn't. A delete operation removes data or it doesn't. So we built a policy engine that stands between every model output that has side effects and the systems it can change.
We discovered this problem early. In our first few months in production, watching agents run on customer systems, we saw the gap: models could be genuinely smart and safe in isolation, but the moment they hit the real world — APIs that move money, delete records, change permissions — we needed certainty, not probability.
Why model self-evaluation doesn't cut it
The first instinct is to ask the model itself: embed a safety check in the prompt and have it answer "Is this action safe?" But that's letting the suspect police themselves. We tested it. In our threat model, an adversarial user can manipulate models, bend them, or simply deploy ones that aren't constrained at all. The model cannot be your enforcement boundary.
You need an external, deterministic policy layer. A guardrail you didn't write and can't social-engineer. A guardrail you compile and deploy before the model ever runs.
Why we chose CEL
We evaluated three approaches: OPA/Rego, a hand-rolled DSL, and Google's Common Expression Language (CEL). We built on OPA first. It was expressive and battle-tested, but its latency floor was 40-50ms on the hot path: too slow when you're making hundreds of evaluations per agent step.
We looked at a hand-rolled DSL next. Fast, sure, but impossible to share with customers and hard to debug. CEL, designed by Google and used in Envoy and gRPC, hit the sweet spot: simple, sandboxed, fast, and built precisely for evaluating untrusted expressions safely.
A real policy in production looks like this:
// Transfers over $10k require at least two approvers
request.amount > 10000 && request.approvers.size() < 2 ? deny("High-risk transfer") : allow()
// Blast radius: block if trying to touch >50 users
request.affected_users.size() > 50 ? deny("Exceeds blast radius") : allow()
// Time bounds: no operations outside business hours
hour(now()) < 9 || hour(now()) > 17 ? deny("Outside business hours") : allow()
These compile down to bytecode, so there is no parsing on the hot path. Policy violations get logged, tagged, and sent to audit. You can introspect them: which rule fired, which variables were evaluated, what the decision path was.
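For a sense of what that introspection yields, here is a minimal sketch of one audit record in Rust; the field names and log format are our illustration, not the engine's actual schema:

```rust
use std::time::SystemTime;

/// Hypothetical shape of one audit record. Field names are
/// illustrative, not the engine's real schema.
#[derive(Debug, Clone)]
pub struct AuditEntry {
    pub policy_id: String,                // which rule fired
    pub decision: &'static str,           // "allow" or "deny"
    pub reason: Option<String>,           // deny reason, if any
    pub variables: Vec<(String, String)>, // variables evaluated, as name/value pairs
    pub at: SystemTime,
}

impl AuditEntry {
    /// Render the entry as a single structured log line.
    pub fn to_log_line(&self) -> String {
        let vars: Vec<String> = self
            .variables
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect();
        format!(
            "policy={} decision={} reason={} vars=[{}]",
            self.policy_id,
            self.decision,
            self.reason.as_deref().unwrap_or("-"),
            vars.join(",")
        )
    }
}
```

In practice an entry like this is enough to answer the three questions above without replaying the request.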
The Rust engine
The policy engine itself is ~2000 lines of Rust. At a high level: a policy compiler (CEL to bytecode), an immutable policy cache (loaded at startup, zero allocation on evaluation), and a structured audit log.
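A minimal sketch of the immutable-cache piece, with a stand-in CompiledPolicy type; the names here are our assumptions, not the engine's real API:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a policy compiled to bytecode (illustrative type).
#[derive(Debug)]
pub struct CompiledPolicy {
    pub id: String,
    pub bytecode: Vec<u8>,
}

/// Immutable policy cache: built once at startup, then shared read-only.
pub struct PolicyCache {
    policies: Arc<HashMap<String, CompiledPolicy>>,
}

impl PolicyCache {
    /// Build the map once; after this, nothing mutates it.
    pub fn load(policies: Vec<CompiledPolicy>) -> Self {
        let map = policies.into_iter().map(|p| (p.id.clone(), p)).collect();
        Self { policies: Arc::new(map) }
    }

    /// Hot-path lookup: returns a borrow, so evaluation does no allocation.
    pub fn get(&self, id: &str) -> Option<&CompiledPolicy> {
        self.policies.get(id)
    }
}
```

Wrapping the map in an `Arc` lets every worker share one read-only copy, which is how the "zero allocation on evaluation" property falls out.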
We picked Rust because we wanted deterministic performance. No GC pauses on the hot path. No surprises. Memory safety. The numbers: p99 latency is under 200 microseconds per policy evaluation. We've hit p99.9 of ~400µs. That's acceptable when you're evaluating maybe 5-10 policies per agent action.
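Those percentiles are read off raw per-evaluation timings. As a sketch of the arithmetic (the helper name is ours), a nearest-rank percentile over microsecond samples looks like:

```rust
/// Nearest-rank percentile (p in 0.0..=1.0) over raw latency samples
/// in microseconds. A sketch of how figures like "p99 under 200µs"
/// are computed, not the production metrics pipeline.
pub fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty());
    samples.sort_unstable();
    // Nearest-rank: the ceil(p * n)-th smallest sample, 1-indexed.
    let rank = ((p * samples.len() as f64).ceil() as usize).max(1) - 1;
    samples[rank.min(samples.len() - 1)]
}
```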
The core trait looks roughly like this:
pub trait PolicyEvaluator {
    fn evaluate(&self, policy: &CompiledPolicy, context: &Context) -> Result<Decision, PolicyError>;
    fn validate(&self, policy_text: &str) -> Result<(), PolicyError>;
}
Context is just a map of variables: request details, user roles, resource counts, timestamps. Decisions are allow/deny with a reason. That's it.
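A toy version of those two types, with one hard-coded check standing in for a compiled policy; everything here is illustrative, and the variable map is flattened to integers to keep the sketch small:

```rust
use std::collections::HashMap;

/// Decisions are allow/deny with a reason.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny(String),
}

/// Context is just a map of variables. Real contexts hold richer values
/// (strings, lists, timestamps); i64 keeps this sketch minimal.
pub struct Context {
    vars: HashMap<String, i64>,
}

impl Context {
    pub fn new() -> Self {
        Self { vars: HashMap::new() }
    }
    pub fn set(&mut self, key: &str, value: i64) {
        self.vars.insert(key.to_string(), value);
    }
    pub fn get(&self, key: &str) -> i64 {
        *self.vars.get(key).unwrap_or(&0)
    }
}

/// A hard-coded blast-radius check, standing in for one compiled policy.
pub fn blast_radius(ctx: &Context) -> Decision {
    if ctx.get("record_count") > 100 && ctx.get("approved") == 0 {
        Decision::Deny("Exceeds blast radius".to_string())
    } else {
        Decision::Allow
    }
}
```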
Four policies from production
- Rate limiting. Track calls per user per resource per minute. Deny if exceeded. One line:
  request.calls_this_minute > 100 ? deny(...) : allow()
- Data scope. Agents can only touch data tagged with their team_id.
  request.resource.team_id != current_user.team_id ? deny(...) : allow()
- Blast radius controls. If an operation would affect >100 records, require approval.
  request.record_count > 100 && !request.approved ? deny(...) : allow()
- Capability bounds. Agents can call certain APIs only if their role includes that capability. Role-based policy evaluation backed by a permission matrix.
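The permission matrix behind that last policy can be sketched as a role-to-capability map; the role and capability names below are made up for illustration:

```rust
use std::collections::{HashMap, HashSet};

/// A minimal permission matrix: role -> set of capabilities.
/// Names are illustrative, not a real schema.
pub struct PermissionMatrix {
    grants: HashMap<String, HashSet<String>>,
}

impl PermissionMatrix {
    pub fn new() -> Self {
        Self { grants: HashMap::new() }
    }

    pub fn grant(&mut self, role: &str, capability: &str) {
        self.grants
            .entry(role.to_string())
            .or_default()
            .insert(capability.to_string());
    }

    /// Capability-bounds check: can this role call this API?
    pub fn allows(&self, role: &str, capability: &str) -> bool {
        self.grants
            .get(role)
            .map_or(false, |caps| caps.contains(capability))
    }
}
```

An unknown role simply has no grants, so the default answer is deny.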
A guardrail you can't compile is a guardrail you can't trust.
What we got wrong, then fixed
We started with OPA. After three months, we ripped it out; the latency tax was real, and it showed up in aggregate metrics. Then we tried embedding safety checks in the model prompt. That worked until a customer deployed a less-constrained model on their own. The policy layer caught it.
We also tried delegating policy composition to the model: "Here are the rules, apply them." The model couldn't be trusted to faithfully evaluate its own constraints. So policies now compile into a deterministic evaluator. No interpretation. No flexibility. No escape hatches.
Open problems
Two things we still wrestle with: composing policies across multi-step plans (if a policy denies step 2, should we backtrack, propose alternatives, or fail?), and debugging policy violations in development. When a policy fires in prod, we have perfect audit logs. In dev, reproducing the context is hard. We're building better tooling.
The core insight holds: probabilistic systems need deterministic governance. The policy engine is the thing we trust.