Anatomy of an Email Responder: 100M tokens later

The Email Responder is our most-deployed agent on ai-agents.bar. Since launch, it's processed over 100 million tokens. It's not the flashiest agent we build — no complex orchestration, no multi-step workflows. But its story tells you almost everything we've learned about shipping autonomous systems that people actually trust to touch their inboxes.

Under the hood: the three-stage pipeline

The Email Responder runs a deliberate three-stage pipeline:

Triage: Classify intent (reply, forward, escalate) and urgency (same-day, this-week, low-priority). Prune by rules — auto-ignore OOO messages, meeting confirmations, auto-generated receipts.
Retrieve: Pull context from email threads, CRM records (if connected), and any docs the user flags. Build a semantic window of what matters for this specific conversation.
Draft: Generate a reply using the user's tone fingerprint — learned from their past emails. Inject guardrails: redaction logic, length limits, sentiment check.

The entire thing runs in under 2 seconds per email. We saw early users complain about latency over 3 seconds, so we optimized aggressively. The triage stage is now rule-based and instant. Retrieval uses sparse embeddings. The LLM call is streaming-friendly — we show the draft to users as it arrives.

The big call: drafts only

We shipped the Email Responder with full auto-send in week one. It felt amazing in demos. Real deployments told a different story.

Six weeks in, we had three high-profile auto-send errors that landed in client inboxes. One agent misread urgency and sent a casualty-tone follow-up when the customer was upset. Another over-filtered and auto-replied to an important partner, missing the escalation signal entirely. The third: tone mismatch — formal customer, the agent drafted something too flip.

"95% accuracy is a 5% disaster in someone else's inbox."

We killed auto-send and flipped the default to "always draft, human approves with one tap." It took one minute to implement. Adoption jumped 4x overnight. Same technology, different UX.

The insight: users don't need perfect automation. They need perfect visibility. They want the agent to shoulder the thinking work, but keep them in control of the sending.

What 100M tokens taught us

Over time, a few patterns emerged from the data:

Users rarely tweak the draft. We expected heavy editing. Instead, 87% of approved drafts go out unchanged. The agent learned their voice fast — by week three, editing rates dropped below 5%. This told us the tone fingerprinting was working, and it justified the engineering effort on tone inference.

The agent gets better at *their* tone over weeks, not token budgets. We tried bigger models, better prompts, longer context windows. None of it mattered as much as the agent seeing 50+ emails from the user. That's when it really locked in on quirks — how they open paragraphs, their favorite punctuation, their risk tolerance on formality.

Some industries need explicit redaction. Legal teams and healthcare orgs using the agent asked for a separate "scrub for compliance" step. We added an optional redaction pipeline that flags PII, PHI, and contract-sensitive language. It became a checkbox in setup. Not everyone needs it, but those who do need it badly.

Config example: the trigger

Here's a simplified YAML of how a typical Email Responder instance is configured:

name: Email Responder – Sales
enabled: true
trigger:
  type: email_received
  filters:
    - from_domain: [client1.com, partner.io]
      urgency: ["same-day", "this-week"]

pipeline:
  triage:
    auto_ignore: [ooo, receipt, confirm]
  retrieve:
    sources: [thread, crm, docs]
    context_window: 5000
  draft:
    tone_model: learned
    max_length: 500
    redaction: disabled

approval:
  mode: draft_only
  timeout: 24h

The `mode: draft_only` is the flag. No surprises, no automation anxiety.

What's next

We're shipping four things soon:

Thread-aware memory: The agent currently reads a thread cold. We're adding a memory layer so it learns "oh, we've been discussing budget with this contact for three weeks — I should remember the last pushback." This is a 10% accuracy lift on continuity.

CRM-grounded follow-ups: Right now the agent drafts. We want it to optionally auto-create follow-up tasks in your CRM, set reminders, surface next-best-actions. Still human-approved, but richer context.

Outlook multi-account: We're Gmail-first today. Outlook support is the #2 ask.

Native Gmail integrations: Looser coupling with the Gmail API so setup is one click, not "generate a service account key and paste it here."

The Email Responder taught us that shipping autonomously isn't about removing humans — it's about moving them from the critical path to the approval path. That shift unlocks adoption.

Anatomy of an Email Responder: 100M tokens later

Under the hood: the three-stage pipeline

The big call: drafts only

What 100M tokens taught us

Config example: the trigger

What's next

Subscribe for the long-form essays

Keep reading

Human-in-the-loop is not a feature — it's a contract

The case against autonomous-by-default

Q1 2026 product roundup