---
title: "OAuth at Scale: The Architecture of Reliable Token Refreshes"
slug: oauth-at-scale-the-architecture-of-reliable-token-refreshes
date: 2026-02-26
author: Roopendra Talekar
categories: [Engineering]
excerpt: "OAuth token management is more than just storage. Learn how Truto handles concurrency, proactive refreshes, and race conditions for 100+ APIs at scale."
tldr: "Reliable OAuth requires proactive refresh alarms, mutex locks to prevent race conditions, and a robust state machine for handling revocations. Here is the architecture we use to keep 100+ integrations connected."
canonical: https://truto.one/blog/oauth-at-scale-the-architecture-of-reliable-token-refreshes/
---

# OAuth at Scale: The Architecture of Reliable Token Refreshes


If you have ever [built a Salesforce integration](https://truto.one/blog/integrate-salesforce/) in a weekend, you know the feeling of triumph when that first `200 OK` comes back. You store the access token, maybe toss the refresh token into a database column, and call it a day.

Then, three months later, your error logs light up.

Tokens are expiring mid-sync. Refresh requests are hitting race conditions because two background jobs tried to refresh the same token simultaneously. A customer changed their password, revoking all tokens, and your app is still hammering the API until you get rate-limited. 

At Truto, we maintain connections to over 100 different SaaS platforms—from HRIS systems like Workday to CRMs like HubSpot. We process millions of API requests, and every single one relies on a valid, fresh credential. 

We learned the hard way that **managing OAuth token lifecycles is not a storage problem; it is a distributed systems problem.**

Here is the engineering deep dive into how we architected a token refresh system that handles concurrency, proactive renewal, and graceful failure at scale.

## The "Happy Path" is a Lie

The OAuth 2.0 spec provides a framework, but every provider implements it with their own chaotic flair. 

- **Expiry Times:** Some tokens last 1 hour, some 24 hours, and some have no fixed expiry at all.
- **Refresh Behavior:** Some providers rotate the refresh token *every time* you use it (refreshing the refresh token). If you fail to capture the new one due to a network blip, you are locked out forever.
- **Concurrency:** If two processes try to refresh the same token at the exact same second, many providers will invalidate *both* requests, assuming a replay attack.

To handle this, we moved away from simple "check and refresh" logic to a multi-layered architecture involving proactive alarms, mutex locks, and self-healing state machines.

## Layer 1: The Two-Pronged Refresh Strategy

Waiting for a token to expire before refreshing it is a recipe for latency and failed user requests. Conversely, refreshing it too aggressively wastes API quota. We use a hybrid approach: **Proactive Alarms** and **Just-in-Time (JIT) Checks**.

### 1. Proactive Refresh (The Background Worker)
Whenever a token is created or updated, we schedule work in our auth layer to run **60 to 180 seconds before the token expires**, per account rather than on a coarse global cron.

Why the randomization? To prevent thundering herds. If 10,000 accounts were connected at 9:00 AM, we don't want 10,000 refresh requests firing exactly at 9:59 AM. Spreading the load ensures stability.

When that scheduled refresh runs, it negotiates a new token, updates durable storage, and re-encrypts the new credentials.
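The scheduling math can be sketched like this. The function name and constants are illustrative, not our actual internals; the point is that the lead time is randomized per account within the 60-180 second window:

```typescript
// Jittered proactive scheduling (illustrative sketch).
// Each account gets a random lead time in [60s, 180s) before expiry,
// so 10,000 accounts connected at the same moment do not all refresh
// at the same instant.
const MIN_LEAD_MS = 60_000;
const MAX_LEAD_MS = 180_000;

/** Epoch-ms timestamp at which the proactive refresh should fire. */
function scheduleRefreshAt(expiresAtMs: number): number {
  const lead = MIN_LEAD_MS + Math.random() * (MAX_LEAD_MS - MIN_LEAD_MS);
  return expiresAtMs - lead;
}
```

A token expiring at 10:00 AM would be refreshed somewhere between 9:57 and 9:59 AM, with each account landing at a different point in that window.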

### 2. Just-in-Time Safety Net
We cannot rely solely on background jobs. Clocks drift, schedulers miss edge cases, and sometimes a token expires faster than the provider claimed. 

Before *every single* API request—whether it's a proxy call or a sync job—our infrastructure checks the token's validity, applying a **30-second buffer**:

```typescript
// Simplified: treat the token as expired if it expires within the next 30s
const EXPIRY_BUFFER_MS = 30_000;

if (token.expiresAt < Date.now() + EXPIRY_BUFFER_MS) {
  await refreshCredentials();
}
```

This buffer is critical. Without it, a token might be valid *when the check runs* but expire 100ms later while the request is in flight to Salesforce.

## Layer 2: Solving Concurrency with Mutex Locks

This is where most [in-house integrations](https://truto.one/blog/3-models-for-product-integrations-a-choice-between-control-and-velocity/) fail. 

Imagine a scenario:
1. A scheduled sync job starts for Customer A.
2. Simultaneously, Customer A triggers a manual "Test Connection" in your UI.
3. Both processes see the token is about to expire.
4. Both processes fire a request to `POST /oauth/token`.

**The Result:** The provider receives two refresh requests. It processes the first, invalidates the old refresh token, and issues a new one. Then it processes the second request (using the now-invalid old refresh token), throws an error, and potentially revokes the entire grant for security reasons. Your customer is now disconnected.

### The Solution: Per-Account Serialized Refresh (Mutex)

To prevent this, we wrap the refresh operation in a **mutex (mutual exclusion) lock** scoped to each integrated account.

Conceptually, each account has a single serializer for refresh work. When a refresh is requested:

1. The request attempts to acquire a lock for that specific `integrated_account_id`.
2. If no operation is in progress, it proceeds, arms a short watchdog timeout (e.g., 30s) so a stuck provider cannot block the account forever, and executes the refresh.
3. If a refresh is *already* running, the second request simply **awaits the same in-flight operation** and reuses its result.

This ensures that no matter how many concurrent callers need a fresh token, we only send **one** HTTP request to the provider. Everyone else gets the outcome of that single successful call.
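The pattern above can be sketched with an in-flight promise map. This is a minimal illustration with hypothetical names (`refreshWithLock`, `doRefresh`), not our production code; in a multi-process deployment the map would need to be backed by shared coordination, but the single-flight shape is the same:

```typescript
// Per-account single-flight refresh (illustrative sketch).
type Token = { accessToken: string; expiresAt: number };

// One in-flight refresh promise per integrated account.
const inFlight = new Map<string, Promise<Token>>();

async function refreshWithLock(
  accountId: string,
  doRefresh: () => Promise<Token>,
  timeoutMs = 30_000,
): Promise<Token> {
  // If a refresh is already running for this account, await its result.
  const existing = inFlight.get(accountId);
  if (existing) return existing;

  // Watchdog: a stuck provider cannot block the account forever.
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`refresh timed out for ${accountId}`)),
      timeoutMs,
    );
  });

  const flight = Promise.race([doRefresh(), watchdog]).finally(() => {
    clearTimeout(timer);
    inFlight.delete(accountId); // release the "lock", success or failure
  });
  inFlight.set(accountId, flight);
  return flight;
}
```

Every concurrent caller for the same `accountId` resolves (or rejects) with the outcome of the single underlying HTTP request to the provider.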

> [!NOTE]
> **Why not just use a database row lock?** Row locks and external stores like Redis work well for many teams. We optimized for strictly serialized refresh per account with very low coordination overhead so concurrent proxy traffic, sync jobs, and scheduled refresh all converge on one refresh flight without tuning a separate lock service for every deployment shape.

## Layer 3: Handling "Invalid Grant" and Re-Auth

Sometimes, refresh fails. The user might have uninstalled the app, changed their password, or an admin might have revoked the token. 

When we receive a fatal error (like HTTP 400 `invalid_grant` or HTTP 401 `Unauthorized`), retrying is futile. We need to involve the human.

### The `needs_reauth` State Machine

1. **Detection:** We catch `invalid_grant` errors specifically. Transient errors (HTTP 500s) trigger a retry with exponential backoff. Auth errors trigger a state change.
2. **State Update:** The account status is flipped from `active` to `needs_reauth`.
3. **Notification:** We fire a webhook event: `integrated_account:authentication_error`. 

This allows our customers to listen for this event and immediately show a "Reconnect" banner in their UI. 
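The detection step boils down to classifying each failure as transient or fatal. A hedged sketch, with hypothetical names and a simplified error shape:

```typescript
// Classifying refresh failures (illustrative sketch).
type RefreshOutcome = "retry" | "needs_reauth";

function classifyRefreshError(
  status: number,
  body: { error?: string },
): RefreshOutcome {
  // Fatal auth errors: the grant itself is dead, retrying is futile.
  // Flip the account to needs_reauth and fire the webhook instead.
  if (status === 401 || (status === 400 && body.error === "invalid_grant")) {
    return "needs_reauth";
  }
  // Everything else (5xx, timeouts, network blips) is treated as
  // transient and retried with exponential backoff.
  return "retry";
}
```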

### Auto-Reactivation

We also support **auto-reactivation**. If an account is in `needs_reauth` but a subsequent API call succeeds (perhaps the user fixed the issue on the provider side, or it was a false positive from the API), we automatically flip the status back to `active` and fire an `integrated_account:reactivated` event.

## Security at Rest

Storing thousands of access and refresh tokens requires paranoia. As part of our strict [security standards](https://truto.one/blog/security-at-truto/), we never store tokens in plain text.

*   **Encryption:** All sensitive fields (`access_token`, `refresh_token`, `client_secret`) are encrypted using **AES-GCM** before hitting the database.
*   **Masking:** When developers list accounts via our API, these fields are masked. They are only decrypted inside the secure enclave of the refresh service just before being used.
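For illustration, AES-256-GCM encryption of a token column can be sketched with Node's built-in `crypto` module. This is a minimal example, not our production key-management setup; it assumes a 32-byte key supplied from a KMS or secret store, and packs the nonce and auth tag alongside the ciphertext:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// AES-256-GCM at rest (illustrative sketch, key assumed to come from a KMS).
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // unique nonce per encryption, never reused
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag(); // 16-byte integrity tag
  // Store iv + tag + ciphertext together, e.g. base64 in one column.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decryptToken(stored: string, key: Buffer): string {
  const buf = Buffer.from(stored, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ciphertext = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // decryption fails if the data was tampered with
  return Buffer.concat([
    decipher.update(ciphertext),
    decipher.final(),
  ]).toString("utf8");
}
```

GCM's auth tag matters here: unlike plain CBC, a tampered ciphertext fails decryption outright instead of silently producing garbage credentials.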

## Summary: The Checklist for Reliable Auth

If you are building this in-house, ensure your architecture covers these bases:

1.  **Buffer your expiry checks:** Don't wait for `0` seconds remaining.
2.  **Serialize your refreshes:** Never let two threads refresh the same token.
3.  **Handle revocation gracefully:** Distinguish between "API is down" (retry) and "Token is dead" (alert user).
4.  **Secure your storage:** Encrypt at rest, always.

Or, you can offload this entirely. At Truto, we treat authentication infrastructure as a core product, so you can focus on the data, not the handshake.

> Stop debugging OAuth race conditions. Let Truto handle the infrastructure.
>
> [Talk to us](https://cal.com/truto/partner-with-truto)
