Fixing What Breaks: Email Integration Reliability

Email integrations are deceptively hard. They look simple on the surface—authenticate, poll for new messages, process them—but the reality involves service account tokens that expire, network hiccups, rate limits, and a dozen other failure modes that only show up in production.

Last week, we started seeing intermittent failures in Sabine's email integration. Users would connect their inbox, everything would work for a while, then messages would stop flowing. No error messages in the UI. Just silence.

What We Fixed

The root cause was a combination of two issues. First, the service account authentication flow wasn't properly handling token refresh edge cases. When a token expired mid-request, the system would fail rather than retry with a fresh token. Second, when the email poller encountered any error—network timeout, rate limit, malformed message—it would stop polling entirely until someone manually restarted it.

We rebuilt the authentication layer to treat token refresh as a first-class concern. Now when a request fails with an auth error, the system automatically refreshes the token and retries the request. We also added exponential backoff so rapid-fire retries don't trigger rate limits.
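
To make the refresh-and-retry idea concrete, here is a minimal sketch of the pattern in Python. It is not Sabine's actual code: AuthError, call_with_refresh, and the request and refresh_token callables are hypothetical names invented for illustration, and the attempt count and delays are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Hypothetical error: the request was rejected because the token expired."""


def call_with_refresh(
    request: Callable[[str], T],
    refresh_token: Callable[[], str],
    token: str,
    max_attempts: int = 4,       # assumed value, not taken from the post
    base_delay: float = 0.5,     # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run request(token); on an auth error, back off, refresh, and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request(token)
        except AuthError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
            token = refresh_token()
    raise AssertionError("unreachable")
```

The sleep function is passed in as a parameter so the backoff can be exercised in tests without real waiting. Only auth errors are retried here; any other exception propagates unchanged.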

For the poller, we implemented automatic recovery with circuit-breaker-style backoff. When an error occurs, the poller backs off, waits, then tries again. If errors persist, it increases the wait time up to a maximum interval. Once the underlying issue resolves (the network comes back, or the rate limit window passes), the poller resumes at its normal pace without human intervention.
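
The recovery loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production poller: run_poller, next_delay, and poll_once are hypothetical names, the doubling schedule is one plausible way to "increase the wait time", and the interval values are made up.

```python
import time
from typing import Callable, Optional


def next_delay(failures: int, base: float, cap: float) -> float:
    """Wait time before the next poll: base when healthy, doubling per failure up to cap."""
    return min(cap, base * 2 ** failures)


def run_poller(
    poll_once: Callable[[], None],
    base_interval: float = 5.0,    # assumed normal polling interval, in seconds
    max_interval: float = 300.0,   # assumed ceiling on the backoff, in seconds
    cycles: Optional[int] = None,  # None means run forever; a number is handy for tests
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll repeatedly; on any error, keep running but wait longer before retrying."""
    failures = 0
    done = 0
    while cycles is None or done < cycles:
        try:
            poll_once()
            failures = 0  # success: return to the normal interval
        except Exception:
            # Any failure (timeout, rate limit, malformed message) is counted,
            # never fatal. A real poller would also log the error here.
            failures += 1
        sleep(next_delay(failures, base_interval, max_interval))
        done += 1
```

The key property is that no exception from poll_once can end the loop, which is what removes the need for a manual restart. The consecutive-failure counter resets on the first success, so the poller drops back to its normal interval as soon as the underlying problem clears.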

Why This Matters

Reliability isn't a feature you ship once. It's built through dozens of small fixes like this one. Each improvement removes a category of failure that users shouldn't have to think about. Email is a foundational integration for Sabine—when it breaks, users lose trust fast.

This fix reduces manual intervention and makes the system self-healing. That means fewer support tickets, less downtime, and more time to build features that matter instead of firefighting infrastructure issues.

What's Next

We're adding observability around email integration health so we can catch issues before users notice them. That includes metrics on auth success rates, poller recovery events, and message processing latency. We're also building a diagnostic UI in Strug Central so operators can see integration status at a glance.

Longer term, we're applying these same reliability patterns to other integrations—calendar, Slack, Linear. The goal is to make Sabine's integration layer bulletproof, so users can trust that once they connect something, it stays connected.