
Engineering

Treating evals as production infrastructure

Sofia Alvarez · Feb 9, 2026 · 7 min read

At most companies, "evals" are a thing the ML team owns and the rest of engineering is allowed to ignore. We treat evals like production infrastructure: every PR, every model swap, every prompt diff runs through 12,000 cases before merge. No exceptions.

Why evals are infrastructure

If an agent regression ships to prod, the customer notices in 3 minutes. The lag between "model nudged" and "customer pissed" is the same as the lag between "DB index dropped" and "queries slow". Same pager, same playbook.

That's why we stopped treating evals as a separate concern. They're not optional validation that someone should maybe run before shipping. They're a gate. They're middleware.

Building the case base

Every customer-reported issue becomes a case. Every blast-radius event becomes a case. Every novel tool integration gets a starter set. We mix automated grading (exact match, fuzzy token F1), human grading (expert review on sample), and golden answers (prompt variations against fixed outputs).
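The "fuzzy token F1" grader mentioned above can be sketched as token-level precision/recall against a golden answer. This is a minimal illustration, not our production grader (which also normalizes punctuation and numbers):

```python
from collections import Counter

def token_f1(predicted: str, golden: str) -> float:
    """Token-level F1 between agent output and the golden answer."""
    pred = predicted.lower().split()
    gold = golden.lower().split()
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    # Multiset intersection: each shared token counts at most
    # min(occurrences in pred, occurrences in gold) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Exact match is just the degenerate case where anything below 1.0 fails; fuzzy F1 lets a case tolerate harmless rephrasing while still catching dropped or invented content.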

The base grew from 1,200 cases last year to 12,000 now. The growth didn't come from cargo-culting—it came from practice. A case is a contract: "In situation X, the agent must not do Y."

Running 12k cases on every PR

Naive approach: run them sequentially. p99 latency: ∞. Instead we run multi-region and horizontally sharded, with cases partitioned across parallel workers.

p50 for a PR run: 8 minutes; p99: 14 minutes. That's fast enough that reviewers don't wait; the results are just there when they show up.
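The sharding itself is simple. A sketch of stable shard assignment, assuming each case has a unique id (function names here are illustrative, not our internal API):

```python
import hashlib

def shard_for(case_id: str, num_shards: int) -> int:
    """Stable shard assignment by hashing the case id.

    Uses sha256 rather than the built-in hash(), which is salted
    per process and would reshuffle cases on every run.
    """
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

def partition(case_ids: list[str], num_shards: int) -> list[list[str]]:
    """Split the case base into num_shards buckets for parallel workers."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    for cid in case_ids:
        shards[shard_for(cid, num_shards)].append(cid)
    return shards
```

Stable assignment matters more than perfect balance: a case always lands on the same shard, so per-shard caches (model outputs, fixtures) stay warm across runs.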

The flakes problem

A flake in CI is a flake in trust.

Probabilistic graders mean some cases will occasionally fail when they should pass (or vice versa). Our policy: 3 retries, then the case is quarantined and an alert goes to the eval-owner. A quarantined case stays out of the merge gate until someone diagnoses it.
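The policy fits in a few lines. A sketch, with the runner, quarantine store, and alerting passed in as callables (all names hypothetical):

```python
MAX_RETRIES = 3

def run_with_retries(case, run_case, quarantine, alert_owner) -> bool:
    """Run a probabilistic case up to MAX_RETRIES times.

    A pass on any attempt counts as a pass. Persistent failure
    quarantines the case and pages the eval-owner instead of
    letting it flake in CI.
    """
    for _ in range(MAX_RETRIES):
        if run_case(case):
            return True
    quarantine(case)   # removed from the merge gate until diagnosed
    alert_owner(case)  # a human must look before it returns
    return False
```

The key design choice is that the third failure produces a page, not a red X: the case leaves the gate entirely rather than teaching reviewers to click "rerun".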

Why quarantine instead of just ignoring? Because flakes train people to ignore failures. Once reviewers see "this test flakes 5% of the time", the signal is lost. Quarantine keeps trust intact.

The eval case schema

Here's the shape we standardized on:

case:
  id: eval-customer-stripe-invoice-001
  version: "2"
  category: regression

  input:
    event_type: "stripe.invoice.created"
    payload: |
      {
        "invoice_id": "inv_1234",
        "customer_id": "cus_5678",
        "amount": 15000
      }

  expected_output:
    action: "send_email"
    recipient: "billing@customer.com"
    template: "invoice_notification"

  grader:
    type: "deterministic"
    rule: "action == 'send_email' AND recipient matches customer.contact"

  created_at: "2026-02-01T14:30:00Z"
  owner: "sofia@ai-agents.bar"
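The deterministic grader rule in that schema reads naturally as a plain predicate. A sketch in Python, where the field names mirror the schema and the `customer_contact` lookup is an assumption about how `customer.contact` is resolved:

```python
def grade_case(actual: dict, customer_contact: str) -> bool:
    """Deterministic grader for the case above: the agent must choose
    send_email and the recipient must match the customer's contact
    on record."""
    return (
        actual.get("action") == "send_email"
        and actual.get("recipient") == customer_contact
    )
```

Deterministic graders like this one never flake, which is why the quarantine machinery only has to worry about the probabilistic ones.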

What this changes culturally

PRs now include eval diffs in the description. Reviewers actually look at them. A refactor that drops eval pass-rate gets held until the cause is understood—not buried, not waived.

We've also become deliberate about what we don't gate on. We don't require perfect pass-rate—that breeds tweaking, overfitting, politics around "which cases count". We gate on: "no novel regression class introduced."
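The "no novel regression class" gate can be sketched as a set comparison. This is a simplification: here a "class" is just the case's category, whereas a real taxonomy would be finer-grained:

```python
def gate(baseline_failures: dict[str, str],
         pr_failures: dict[str, str]) -> bool:
    """Merge gate on regression classes, not raw pass-rate.

    Each dict maps a failing case id to its regression class. The PR
    may merge iff every class failing on the PR branch already failed
    on the baseline, i.e. no novel class was introduced.
    """
    novel = set(pr_failures.values()) - set(baseline_failures.values())
    return not novel
```

Gating on classes rather than counts means a PR that trades one known-flaky billing case for another doesn't get blocked, while a PR that introduces the first-ever auth failure does.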

What we learned

Evals stop being defensive and start being predictive when you run them on every change. You see patterns: which model updates break which class of agents, which prompt diffs have outsized impact, where the agent is brittle.

The infrastructure work—parallelization, caching, flake quarantine—wasn't glamorous. But it's the difference between "we should run evals" and "evals run before everything ships."

