At most companies, "evals" are a thing the ML team owns and the rest of engineering is allowed to ignore. We treat evals like production infrastructure: every PR, every model swap, every prompt diff runs through 12,000 cases before merge. No exceptions.
Why evals are infrastructure
If an agent ships a regression to prod, the customer notices in 3 minutes. The lag between "model nudged" and "customer pissed" is the same as between "DB index dropped" and "queries slow". Same pager, same playbook.
That's why we stopped treating evals as a separate concern. They're not optional validation that someone should maybe run before shipping. They're a gate. They're middleware.
Building the case base
Every customer-reported issue becomes a case. Every blast-radius event becomes a case. Every novel tool integration gets a starter set. We mix automated grading (exact match, fuzzy token F1), human grading (expert review on sample), and golden answers (prompt variations against fixed outputs).
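The fuzzy token-F1 grader mentioned above can be sketched in a few lines (the function name and normalization are illustrative, not our exact implementation):

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 between a predicted answer and a golden answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    n_overlap = sum(overlap.values())
    if n_overlap == 0:
        return 0.0
    precision = n_overlap / len(pred_tokens)
    recall = n_overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Graders like this are deterministic, which matters later: their results can be cached and they never flake.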
The base grew from 1,200 cases last year to 12,000 now. The growth didn't come from cargo-culting—it came from practice. A case is a contract: "In situation X, the agent must not do Y."
Running 12k cases on every PR
Naive approach: run them sequentially. p99 latency: ∞. We went multi-region, horizontally sharded:
- Multi-region pool of eval workers (AWS US + EU, 64 vCPU each)
- Cases sharded by type (customer issues, regression suite, smoke tests)
- Deterministic graders cached (no redundant LLM calls)
- Result aggregation in 60 seconds
p50 PR run: 8 minutes. p99: 14. That's fast enough that reviewers don't wait; it's just there when they show up.
The flakes problem
A flake in CI is a flake in trust.
Probabilistic graders mean some cases will occasionally fail when they should pass (or vice versa). Our policy: 3 retries, then the case is quarantined and an alert goes to the eval owner. Quarantined cases never merge until diagnosed.
Why quarantine instead of just ignoring? Because flakes train people to ignore failures. Once reviewers see "this test flakes 5% of the time", the signal is lost. Quarantine keeps trust intact.
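One reading of that policy as code (a sketch, assuming a clean pass stays green, a reproducible failure fails the PR, and anything inconsistent is pulled out for diagnosis; `run_once` stands in for one graded attempt):

```python
def run_with_retries(run_once, retries: int = 3) -> str:
    """Classify a case as 'pass', 'fail', or 'quarantine'.

    run_once() -> bool executes one graded attempt of the case.
    """
    if run_once():
        return "pass"
    # First attempt failed: retry up to `retries` times.
    for _ in range(retries):
        if run_once():
            # The failure did not reproduce: that's a flake.
            # Quarantine it and alert the owner instead of going green.
            return "quarantine"
    # Failed every attempt: a real failure, so fail the PR.
    return "fail"
```

The key design choice is that a flake never resolves to "pass": an inconsistent case exits the signal path entirely until someone diagnoses it.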
The eval case schema
Here's the shape we standardized on:
```yaml
case:
  id: eval-customer-stripe-invoice-001
  version: "2"
  category: regression
  input:
    event_type: "stripe.invoice.created"
    payload: |
      {
        "invoice_id": "inv_1234",
        "customer_id": "cus_5678",
        "amount": 15000
      }
  expected_output:
    action: "send_email"
    recipient: "billing@customer.com"
    template: "invoice_notification"
  grader:
    type: "deterministic"
    rule: "action == 'send_email' AND recipient matches customer.contact"
  created_at: "2026-02-01T14:30:00Z"
  owner: "sofia@ai-agents.bar"
```
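Loading a case of this shape might look like the following (a sketch: the field list follows the schema above, but the validation rules and the `load_case` helper are illustrative):

```python
import json
from dataclasses import dataclass

REQUIRED = ("id", "version", "category", "input",
            "expected_output", "grader", "owner")

@dataclass
class EvalCase:
    id: str
    version: str
    category: str
    input: dict
    expected_output: dict
    grader: dict
    owner: str

def load_case(raw: dict) -> EvalCase:
    """Validate a parsed case dict (e.g. from YAML) and decode its payload."""
    missing = [f for f in REQUIRED if f not in raw]
    if missing:
        raise ValueError(f"case missing fields: {missing}")
    # The payload arrives as a JSON string block; decode it eagerly so a
    # malformed payload fails at load time, not mid-run.
    raw["input"]["payload"] = json.loads(raw["input"]["payload"])
    return EvalCase(**{f: raw[f] for f in REQUIRED})
```

Failing fast at load time is deliberate: a malformed case should break the eval pipeline loudly, not silently grade as a pass or a fail.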
What this changes culturally
PRs now include eval diffs in the description. Reviewers actually look at them. A refactor that drops eval pass-rate gets held until the cause is understood—not buried, not waived.
We've also become deliberate about what we don't gate on. We don't require perfect pass-rate—that breeds tweaking, overfitting, politics around "which cases count". We gate on: "no novel regression class introduced."
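That gate reduces to a set comparison (a sketch, assuming each failure carries a class label such as a case category or customer-issue tag; the labeling scheme is an assumption, not our exact taxonomy):

```python
def gate(baseline_failures: set[str], pr_failures: set[str]) -> tuple[bool, set[str]]:
    """Gate a PR on "no novel regression class introduced".

    baseline_failures: failure classes already present on main.
    pr_failures: failure classes observed on the PR's eval run.
    Returns (merge_ok, novel_classes).
    """
    novel = pr_failures - baseline_failures
    return (len(novel) == 0, novel)
```

Note what this deliberately does not check: the raw pass-rate. A PR that fails the same cases main already fails can merge; a PR that fails one case in a class nobody has broken before cannot.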
What we learned
Evals stop being defensive and start being predictive when you run them on every change. You see patterns: which model updates break which class of agents, which prompt diffs have outsized impact, where the agent is brittle.
The infrastructure work—parallelization, caching, flake quarantine—wasn't glamorous. But it's the difference between "we should run evals" and "evals run before everything ships."