---
title: Handling OAuth Token Refresh Failures in Production for Third-Party Integrations
slug: handling-oauth-token-refresh-failures-in-production-for-third-party-integrations
date: 2026-03-20
author: Uday Gajavalli
categories: [Engineering, Guides]
excerpt: "Stop losing customers to silent OAuth failures. Build production-ready token refresh with distributed locks, proactive alarms, graceful degradation, and post-connection UI that maps backend events to user actions."
tldr: "Production OAuth refresh failures stem from race conditions, silent revocation, and provider quirks. Fix them with proactive scheduling, mutex locks, error classification, and a post-connection UI that maps webhook events to self-service re-auth flows."
canonical: https://truto.one/blog/handling-oauth-token-refresh-failures-in-production-for-third-party-integrations/
---

# Handling OAuth Token Refresh Failures in Production for Third-Party Integrations


Your integration worked perfectly in staging. Then, three weeks after launch, a customer's Salesforce sync silently stopped. No alert. No error page. Just a `{"error": "invalid_grant"}` buried in your logs, and a growing backlog of stale data.

If that scenario sounds familiar, you're not alone. Handling OAuth token refresh failures in production is one of the most underestimated problems in B2B SaaS engineering. The OAuth 2.0 spec gives you a clean framework, but every provider implements it with its own undocumented quirks, silent revocation policies, and concurrency traps.

You handle OAuth token refresh failures by [treating token lifecycles as a distributed systems problem](https://truto.one/blog/beyond-bearer-tokens-architecting-secure-oauth-lifecycles-csrf-protection/)—not a simple "check and refresh" afterthought. This requires proactive scheduled refreshes, distributed mutex locks to prevent race conditions, and deterministic state machines that degrade gracefully when a token is permanently revoked.

## The Hidden Cost of OAuth Token Lifecycle Management

The "happy path" of OAuth takes an afternoon to build. You redirect the user, exchange the authorization code, store the tokens, done. The other 90% of the work—keeping those tokens alive across hundreds of customer accounts, each with different provider policies—is where engineering time disappears.

Industry data confirms this pain. <cite index="21-9">60% of respondents reported spending too many weekly hours troubleshooting third-party API issues.</cite> And <cite index="21-28">36% of companies spent more effort troubleshooting API issues than developing new features last year.</cite> That's not building product. That's babysitting connections.

<cite index="34-4,34-5">The 2025 Postman report reinforces this shift. 69% of developers now spend more than 10 hours a week on API related work, and over a quarter spend more than 20 hours.</cite> A significant chunk of those hours goes toward debugging authentication failures, chasing down expired tokens, and rebuilding connections that quietly broke.

The math gets brutal fast. If you integrate with 10 SaaS providers and each has 50 enterprise customers, you're managing 500 OAuth token lifecycles. Each with its own expiry window, refresh behavior, and revocation triggers. One provider rotates refresh tokens on every use. Another silently caps you at 100 tokens per account. A third lets admins override your app's token policy without notifying you.

Supporting one API is manageable. Supporting fifty requires a dedicated platform team. Tokens expire mid-sync. Refresh requests hit race conditions because two background jobs tried to renew the same credential simultaneously. A customer changes their password in Google Workspace, silently revoking all active tokens, and your application continues hammering the provider until your IP is rate-limited.

This isn't a storage problem. It's a distributed systems problem.

## Decoding the `invalid_grant` Error

If you work with OAuth 2.0 long enough, you will eventually stare at an HTTP 400 response containing a single, infuriating JSON payload:

```json
{
  "error": "invalid_grant",
  "error_description": "Token has been expired or revoked."
}
```

**`invalid_grant` is the catch-all error that tells you almost nothing.** The OAuth 2.0 spec defines it as: the authorization grant or refresh token is invalid, expired, revoked, or was issued to a different client. That's at least five different root causes hidden behind one error string. The spec intentionally obscures the exact reason for rejection to prevent enumeration attacks—secure for the protocol, a nightmare for debugging.

Here's what actually triggers `invalid_grant` in production across major providers:

| Root Cause | Google | Microsoft Entra | Salesforce | Linear |
|---|---|---|---|---|
| Refresh token expired | 6 months of inactivity | Inactivity-based (configurable) | Policy-dependent | Inactivity-based |
| User revoked access | ✅ | ✅ | ✅ | ✅ |
| Password changed | Gmail scopes only | All tokens revoked | Depends on org policy | No |
| Admin revoked sessions | ✅ | ✅ (bulk revocation) | ✅ | Workspace removal |
| Token limit exceeded | 100 per account per client | N/A | N/A | N/A |
| Refresh token rotated & old one reused | N/A | N/A | If RTR enabled | Always rotates |

Several of these deserve special attention:

- **Silent Revocation via Password Reset:** If a user resets their password, many providers automatically revoke all active OAuth tokens to secure the account. Google does this for tokens that carry Gmail scopes—routine security hygiene by your user instantly breaks your integration.
- **Inactivity Timeouts:** Providers aggressively prune unused tokens. A Google OAuth refresh token that remains unused for six consecutive months is automatically invalidated. If your integration relies on infrequent syncs, you will inevitably hit this.
- **Token Limits:** <cite index="41-8,41-9">There is currently a limit of 100 refresh tokens per Google Account per OAuth 2.0 client ID. If the limit is reached, creating a new refresh token automatically invalidates the oldest refresh token without warning.</cite> If your app requests a new authorization every time a user reconnects instead of reusing the existing refresh token, you'll silently burn through this quota.
- **Environment Restrictions:** If your Google Cloud project is left in "Testing" mode with an external user type, all refresh tokens automatically expire after exactly seven days.
- **Admin Policy Overrides:** In enterprise systems like [Salesforce](https://truto.one/blog/integrate-salesforce/), an administrator might update the Connected App policies to enforce strict IP restrictions or require manual admin approval. The token itself is valid, but the context of the request violates the new policy, resulting in a rejected grant.

<cite index="7-17">Microsoft often embeds an AADSTS code in error_description, which tells you why the refresh failed.</cite> Salesforce, on the other hand, gives you the same generic error string whether the user revoked access or an admin changed the org's token policy. <cite index="11-28,11-29,11-30">The problem is that invalid grant is a generic error that says your OAuth signup failed. It doesn't give any more details, and that is by design.</cite>

When your system receives `invalid_grant`, retrying the request is useless. The token is dead. The challenge is ensuring your architecture recognizes this finality instead of entering an infinite retry loop that consumes your API quota.

For a deep breakdown of each error variant and provider-specific debugging steps, see our [guide to fixing OAuth 2.0 errors](https://truto.one/blog/fixing-oauth-20-errors-a-developers-guide-to-invalidgrant-more/).

> [!TIP]
> **Debug rule #1:** Log the *entire* response body, the token endpoint URL, the `grant_type`, the `client_id`, and the environment. Most OAuth bugs survive because teams only log the error string and throw away the inputs that produced it.
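
A minimal sketch of such a log record (the field names are illustrative, not a prescribed schema):

```typescript
// Hypothetical shape for a structured OAuth failure log. The point is to
// capture every input that produced the error, not just the error string.
interface OAuthErrorLog {
  provider: string;
  tokenEndpoint: string;
  grantType: string;
  clientId: string;
  environment: string;
  httpStatus: number;
  responseBody: unknown; // the FULL body, never just body.error
  occurredAt: string;
}

function buildOAuthErrorLog(
  provider: string,
  request: { tokenEndpoint: string; grantType: string; clientId: string },
  httpStatus: number,
  responseBody: unknown,
  environment = 'production'
): OAuthErrorLog {
  return {
    provider,
    ...request,
    httpStatus,
    responseBody,
    environment,
    occurredAt: new Date().toISOString(),
  };
}
```

With a record like this, "which client ID, which environment, which grant type" stops being an archaeology exercise.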

## The Final Boss: OAuth Token Refresh Race Conditions

If `invalid_grant` is the most common OAuth failure, **race conditions during token refresh are the most dangerous.** They're nearly invisible in development, only surface under production load, and can permanently lock out customer accounts.

<cite index="1-5,1-6">The key insight is that these issues often appear under load or in production environments where multiple processes are running simultaneously. They're much harder to detect in development or testing environments with single-threaded execution.</cite>

Consider a standard enterprise SaaS application handling [enterprise authentication](https://truto.one/blog/bearer-tokens-were-the-easy-part-the-real-challenge-of-enterprise-auth/). A scheduled cron job kicks off at midnight to sync 10,000 records from a CRM. To process this efficiently, the job fans out into 50 concurrent worker threads. The access token expired at 11:59 PM. At exactly midnight, all 50 threads read the database, detect an expired access token, and fire off a POST request to the provider's `/oauth/token` endpoint using the exact same refresh token.

This triggers a massive security concern from the provider's perspective. Modern OAuth implementations use **Refresh Token Rotation (RTR)**. Under RTR, a refresh token is strictly single-use. When you use it, the provider issues a new access token *and* a new refresh token, immediately invalidating the old one.

If the provider receives two concurrent requests using the same refresh token, it assumes the token has been stolen and a malicious actor is attempting a replay attack. The provider's security system kicks in: it revokes the old refresh token, revokes the newly issued tokens, and locks the authorization grant entirely.

```mermaid
sequenceDiagram
    participant Worker 1
    participant Worker 2
    participant OAuth Provider
    participant Database

    Worker 1->>Database: Read Token (Expired)
    Worker 2->>Database: Read Token (Expired)
    Worker 1->>OAuth Provider: POST /oauth/token (refresh_token_v1)
    Worker 2->>OAuth Provider: POST /oauth/token (refresh_token_v1)
    OAuth Provider-->>Worker 1: 200 OK (access_token_v2, refresh_token_v2)
    Note over OAuth Provider: Detects reused refresh token.<br>Assumes replay attack.
    OAuth Provider-->>Worker 2: 400 Bad Request (invalid_grant)
    Note over OAuth Provider: Revokes ALL tokens for this user.
    Worker 1->>Database: Write Token v2 (Now useless)
    Worker 2->>Database: Mark Account as Failed
```

Your background jobs just permanently locked your customer out of their integration.

> [!CAUTION]
> **The Concurrency Trap**
> Optimistic locking (using version numbers in your database) does not solve this. If 50 threads try to update the database, 49 will fail and retry. Those 49 threads will still make network requests to the provider, triggering rate limits or rotation lockouts. The blast radius of a retry storm can exhaust your application's database connection pool.

This is especially nasty with providers that rotate refresh tokens on every use, like Linear. <cite index="6-15,6-16">Treat "save the new refresh token" as mandatory after every refresh. Linear rotates refresh tokens, and the old one becomes unusable immediately.</cite>

<cite index="1-10,1-11,1-12">Multiple processes might attempt to refresh the same token simultaneously. This creates a race condition where Process A detects an expired token and starts refreshing, Process B also detects the same expired token and starts refreshing, both processes make refresh requests to the OAuth provider, and the last process to complete might overwrite the good new token with an already-expired one. In the worst case, you might lose your valid refresh token entirely, forcing users to re-authenticate.</cite>

## Architecting a Production-Ready Token Refresh System

Four patterns, used together, eliminate most token refresh failures in production. Let's walk through each.

### Pattern 1: Proactive Refresh with Randomized Scheduling

Don't wait for an API call to fail with a 401. Refresh tokens *before* they expire.

When you receive a new token, calculate its absolute expiration time. Then, schedule an asynchronous alarm to fire 60 to 180 seconds before that expiry. The randomization window matters—if you have 500 accounts all with 1-hour tokens created around the same time, you don't want 500 simultaneous refresh requests hammering the provider's token endpoint. This jitter prevents thundering herd problems.

```typescript
import { DateTime } from 'luxon';
import { randomInt } from 'node:crypto';

// `IntegratedAccount` and `scheduler` (a durable job scheduler) are
// application-level types assumed to exist elsewhere.
function scheduleProactiveRefresh(account: IntegratedAccount) {
  const expiresAt = DateTime.fromISO(account.token.expires_at);
  // Fire 60-180 seconds before expiry, randomized to spread load
  const refreshAt = expiresAt.minus({ seconds: randomInt(60, 180) });

  if (refreshAt > DateTime.utc()) {
    scheduler.schedule({
      type: 'refresh_credentials',
      accountId: account.id,
      runAt: refreshAt.toISO(),
    });
  }
  // If already past the window, the next on-demand check will handle it
}
```

The proactive alarm always forces a refresh, bypassing the usual "is the token expired?" check. After every successful refresh, a new alarm is immediately scheduled for the next cycle. This creates a self-sustaining loop—by the time your application needs to make an API call, the token in the database is already fresh.
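
A sketch of that loop's alarm handler, with the refresh, persistence, and scheduling dependencies injected (all names here are illustrative, not a real scheduler API):

```typescript
// Alarm handler for the self-sustaining refresh loop.
async function onRefreshAlarm(
  accountId: string,
  deps: {
    refreshToken: (accountId: string) => Promise<{ expires_at: string }>;
    saveToken: (accountId: string, token: { expires_at: string }) => Promise<void>;
    scheduleNext: (accountId: string, expiresAt: string) => void;
  }
): Promise<void> {
  // The alarm always forces a refresh; there is no "is it expired?" check.
  const token = await deps.refreshToken(accountId);
  await deps.saveToken(accountId, token);
  // Immediately arm the next cycle so the loop never breaks.
  deps.scheduleNext(accountId, token.expires_at);
}
```

If the refresh throws, the error flows into your retry and classification logic instead of silently ending the loop.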

At Truto, we use this exact approach with **durable per-account scheduling** so refresh work is not lost across deploys or restarts. For the full architecture, see our deep dive on [OAuth token refreshes at scale](https://truto.one/blog/oauth-at-scale-the-architecture-of-reliable-token-refreshes/).

### Pattern 2: Distributed Mutex Locks (Single-Flight Requests)

Proactive alarms handle the common case, but you still need on-demand refresh as a fallback—what if the alarm missed? What if the proactive refresh failed? And on-demand refresh is where race conditions live.

You must ensure that for any given integrated account, only one refresh network request is ever in flight. This requires a per-account mutex lock.

```typescript
class TokenRefreshLock {
  private inProgress: Map<string, Promise<Token>> = new Map();

  async refreshWithLock(accountId: string, refreshFn: () => Promise<Token>): Promise<Token> {
    // If a refresh is already running for this account, wait for it
    if (this.inProgress.has(accountId)) {
      return this.inProgress.get(accountId)!;
    }

    const operation = (async () => {
      try {
        return await refreshFn();
      } finally {
        this.inProgress.delete(accountId);
      }
    })();

    this.inProgress.set(accountId, operation);
    return operation;
  }
}
```

When Thread A detects an expired token, it acquires the lock and initiates the HTTP request to the provider. When Threads B through Z detect the same expired token a millisecond later, they see the lock is held. Instead of making redundant network requests, they simply `await` Thread A's promise.

```mermaid
sequenceDiagram
    participant Thread A
    participant Thread B
    participant Mutex Lock
    participant OAuth Provider

    Thread A->>Mutex Lock: Acquire Lock (account_id)
    Note over Mutex Lock: Lock Acquired.<br>Starts Promise.
    Thread B->>Mutex Lock: Acquire Lock (account_id)
    Note over Mutex Lock: Lock Held.<br>Returns existing Promise.
    Mutex Lock->>OAuth Provider: POST /oauth/token
    OAuth Provider-->>Mutex Lock: 200 OK (New Tokens)
    Mutex Lock-->>Thread A: Resolves with New Tokens
    Mutex Lock-->>Thread B: Resolves with New Tokens
```

The key architectural decision: each account gets its **own** lock. Two refreshes for the same account are serialized. Two refreshes for different accounts run independently and in parallel. No global bottleneck.

This in-memory approach works for single-instance apps. For multi-instance deployments, you need a distributed lock—Redis with a TTL-based lease, a database advisory lock, or any primitive that gives you **single-flight refresh per account**. A watchdog timeout on the lock is essential: if the provider's token endpoint hangs for 30+ seconds, the lock must auto-release so future callers aren't permanently blocked. Without this safeguard, a single degraded provider can deadlock that account's entire worker fleet.
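
One way to sketch that watchdog for the in-memory lock above: race the refresh against a timer so the lock's `finally` cleanup always runs. The 30-second default is illustrative, and a production version should also abort the underlying HTTP request (e.g. via `AbortController`) rather than just abandoning it:

```typescript
// Wrap a refresh operation in a watchdog. If the provider's token endpoint
// hangs, the promise rejects, the lock's cleanup runs, and future callers
// can retry instead of waiting forever.
function withWatchdog<T>(operation: Promise<T>, timeoutMs = 30_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Token refresh exceeded ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer so a dangling
  // timeout doesn't keep the process alive.
  return Promise.race([operation, watchdog]).finally(() => clearTimeout(timer));
}
```

Inside `refreshWithLock`, you would wrap the call as `withWatchdog(refreshFn())` so a hung provider releases the lock on timeout.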

### Pattern 3: Token Merging, Not Replacement

Here's a subtle gotcha that bites most teams: **not all providers return a new refresh token on every refresh response.** Some (like Google for certain flows) only include the refresh token in the initial authorization exchange. If you naively overwrite your stored token with the refresh response, you lose the refresh token entirely.

Always perform a defensive merge of the new token response against the existing token state:

```typescript
const mergeTokens = (existingToken: OAuthToken, newTokenResponse: any): OAuthToken => {
  return {
    ...existingToken,
    ...newTokenResponse,
    // Preserve the existing refresh token if the provider omitted it
    refresh_token: newTokenResponse.refresh_token || existingToken.refresh_token,
    // Calculate absolute expiry to avoid relative time drift
    expires_at: calculateAbsoluteExpiry(newTokenResponse.expires_in)
  };
};
```

Never rely on relative `expires_in` values stored in the database. If a token expires in 3600 seconds, calculate the absolute UNIX timestamp (`Date.now() + 3600 * 1000`) and store that. Relative times drift based on network latency and queue wait times.
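
The `calculateAbsoluteExpiry` helper referenced in the merge above can be a one-liner. Shaving off a small skew allowance (10 seconds here, an arbitrary choice) also compensates for the latency between the provider issuing the token and your code storing it:

```typescript
// Convert a relative expires_in (seconds) into an absolute ISO-8601 timestamp.
// The skew allowance covers network latency and queue wait between token
// issuance and storage (10s is an illustrative default).
function calculateAbsoluteExpiry(expiresInSeconds: number, skewSeconds = 10): string {
  return new Date(Date.now() + (expiresInSeconds - skewSeconds) * 1000).toISOString();
}
```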

This pattern is simple, but missing it is a one-way ticket to `invalid_grant` for every account that refreshes.

### Pattern 4: Expiry Buffer for In-Flight Requests

Checking `token.expires_at > now` isn't enough. If a token expires in 5 seconds and your API request takes 10 seconds to complete, the request will fail mid-flight.

Use a **30-second buffer**: treat a token as expired if it will expire within the next 30 seconds. This gives enough headroom for in-flight requests to complete before the token actually expires.

```typescript
import { DateTime } from 'luxon';

function isTokenExpired(token: OAuthToken, bufferSeconds = 30): boolean {
  const expiresAt = DateTime.fromISO(token.expires_at);
  return expiresAt.minus({ seconds: bufferSeconds }) <= DateTime.utc();
}
```

This one-line check eliminates an entire class of intermittent 401 errors that are nearly impossible to reproduce in development.

## Graceful Degradation: Handling Irrecoverable Refresh Failures

No matter how resilient your infrastructure is, tokens will eventually die. A user will uninstall your app, an admin will restrict permissions, or a password reset will wipe the slate clean. When this happens, your system must recognize finality and stop hammering the provider.

### Classifying Errors: Retryable vs. Terminal

Not all refresh failures are terminal. The critical first step is classification:

| Error Type | HTTP Status | Action |
|---|---|---|
| Provider server error | 500+ | Retry with exponential backoff |
| Network timeout | N/A | Retry with backoff |
| Rate limited | 429 | Retry after `Retry-After` header |
| `invalid_grant` (revoked/expired) | 400/401 | Mark for re-auth, stop retrying |
| `invalid_client` | 401 | Alert engineering, stop retrying |

If the provider returns a `500` or `429`, the token is still valid—retry later. If the provider returns `invalid_grant`, the token is permanently dead. Stop immediately.
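
The table above collapses into a small classifier; the backoff values and the "unknown 4xx is terminal" choice are illustrative defaults, not prescriptions:

```typescript
type RefreshAction =
  | { kind: 'retry'; afterMs: number }
  | { kind: 'needs_reauth' }
  | { kind: 'alert_engineering' };

// status === null means the request never got an HTTP response (network error).
function classifyRefreshFailure(
  status: number | null,
  body: { error?: string } | null,
  retryAfterMs = 60_000 // parsed from the Retry-After header when present
): RefreshAction {
  if (status === null || status >= 500) return { kind: 'retry', afterMs: 30_000 };
  if (status === 429) return { kind: 'retry', afterMs: retryAfterMs };
  if (body?.error === 'invalid_grant') return { kind: 'needs_reauth' };
  if (body?.error === 'invalid_client') return { kind: 'alert_engineering' };
  // Unknown 4xx: treat as terminal rather than retrying blindly.
  return { kind: 'needs_reauth' };
}
```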

### Normalizing Errors with Evaluation Expressions

Providers return terminal errors in wildly inconsistent formats. Some return standard HTTP 401s. Others return HTTP 200 OK with an error buried inside the body. Slack famously returns `200 OK` with `{"ok": false, "error": "invalid_auth"}`. If your error handling only checks HTTP status codes, these errors sail right through as "successful" responses, corrupting your data silently.
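
A sketch of why status-code-only checks miss this. Each rule below stands in for a per-integration evaluation expression; the response shapes are illustrative:

```typescript
interface NormalizedError {
  status: number;
  message: string;
}

// Detect failures that hide behind a 200 OK, and normalize everything to one
// shape before the core pipeline sees it.
function normalizeProviderError(
  httpStatus: number,
  body: { ok?: boolean; error?: string; error_description?: string }
): NormalizedError | null {
  if (httpStatus === 401 || (httpStatus === 400 && body.error === 'invalid_grant')) {
    return { status: 401, message: `Auth failed: ${body.error_description ?? body.error}` };
  }
  if (httpStatus === 200 && body.ok === false) {
    // Slack-style: HTTP success wrapping an application-level failure
    return { status: 401, message: `Auth failed: ${body.error}` };
  }
  return null; // genuinely successful response
}
```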

Implement a per-integration evaluation layer—such as JSONata expressions—that intercepts the provider's response, normalizes it, and extracts the core failure reason:

```json
{
  "error_expression": "status = 401 or (status = 400 and body.error = 'invalid_grant') ? { 'status': 401, 'message': 'Auth failed: ' & body.error_description } : null"
}
```

This lets you handle non-standard APIs without writing provider-specific code in your core pipeline. For more on how we architect this error normalization layer, see our post on [how third-party APIs can't get their errors straight](https://truto.one/blog/404-reasons-third-party-apis-cant-get-their-errors-straight-and-how-to-fix-it/).

### Transition State and Alert the User

When your evaluation layer detects a true terminal authentication failure, the account should transition to a **`needs_reauth`** state. This triggers three things:

1. **Stop all sync jobs and API calls** for that account—don't waste compute and risk rate limits.
2. **Fire a webhook** (`integrated_account:authentication_error`) so your customers can notify their users.
3. **Store the exact error message** so support teams can diagnose without digging through logs.

```mermaid
stateDiagram-v2
    [*] --> active : OAuth flow complete
    active --> active : Token refresh succeeds
    active --> needs_reauth : Terminal refresh failure<br>(invalid_grant, revoked)
    needs_reauth --> active : User re-authenticates<br>successfully
    needs_reauth --> needs_reauth : Webhook sent,<br>retries stopped
```

Webhook delivery must be backed by a queue with its own retry mechanism. If your core application is temporarily down when the auth failure occurs, you cannot afford to lose the `needs_reauth` alert.

When the user successfully re-authenticates, automatically transition the account back to `active` and emit an `integrated_account:reactivated` webhook, resuming the suspended background jobs. Make this idempotent—if the account is already in `needs_reauth`, don't send duplicate webhooks.
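
A sketch of that idempotent transition, using the account status as the source of truth (`emitWebhook` stands in for your queue-backed delivery):

```typescript
type AccountStatus = 'active' | 'needs_reauth';

// Idempotent status transition: the webhook fires only when the status
// actually changes.
function transition(
  current: AccountStatus,
  next: AccountStatus,
  emitWebhook: (event: string) => void
): AccountStatus {
  if (current === next) return current; // no-op, so no duplicate webhooks
  emitWebhook(
    next === 'needs_reauth'
      ? 'integrated_account:authentication_error'
      : 'integrated_account:reactivated'
  );
  return next;
}
```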

One often-overlooked detail: **distinguish between your own 401s and the provider's 401s.** If your middleware returns a 401 because the caller's API key is wrong, that's a completely different problem than the provider returning a 401 because the user's token is revoked. Tag remote errors explicitly (e.g., `truto_is_remote_error: true`) so your global error handler routes them correctly.

## Building the Post-Connection UI: Mapping Backend Events to User Actions

All the backend machinery described above (proactive refresh, mutex locks, error classification) is invisible to your users until something breaks. The bridge between your token lifecycle infrastructure and a great user experience is the post-connection configuration UI: the settings page where customers monitor, troubleshoot, and manage their integrations after the initial OAuth flow completes.

<cite index="19-14,19-15">Your application must provide a mechanism for users to view the status of their connection and manage the authorization for your application. Typically, this is done with a settings/configuration dashboard in your application.</cite> This isn't just good UX; providers like Square explicitly require it for marketplace apps. <cite index="19-17,19-18">OAuth access token status must always be accurate and indicate the correct state of the access token (valid, expired, or revoked).</cite>

The foundational question: which backend events should trigger visible UI changes, and which should stay silent?

### The Event-to-UI Decision Table

Every OAuth lifecycle event in your backend maps to a specific frontend action. Get this mapping wrong and you either alarm users over routine operations or leave them in the dark when their integration is dead. Here's the complete mapping:

| Backend Event | Webhook Fired | UI Action | Suggested Copy | Urgency |
|---|---|---|---|---|
| Token refresh succeeded | None | No visible change | N/A | Silent |
| Account status → `needs_reauth` | `integrated_account:authentication_error` | Red banner + inline alert on integration card | "Your [Provider] connection has been disconnected. Reconnect to resume syncing." | **High** - show immediately |
| Account status → `active` (reactivated) | `integrated_account:reactivated` | Dismiss error banners, show green toast | "Your [Provider] connection is back online." | Low - transient notification |
| API call returns remote 401 | `integrated_account:authentication_error` | Same as `needs_reauth` above | "[Provider] rejected the current credentials. Please reconnect." | **High** |
| Proactive refresh retrying (transient 500) | None | No visible change | N/A | Silent |
| Rate limited by provider (429) | None | Optional subtle indicator | "Sync temporarily slowed - [Provider] rate limit reached." | Low - only if user is watching |
| Connection first activated | `integrated_account:active` | Green confirmation, enable integration controls | "Successfully connected to [Provider]." | Low - transient notification |
| Account status → `connecting` | None | Show spinner/progress on integration card | "Setting up your [Provider] connection..." | Informational |
| Account status → `post_install_error` | Depends on config | Yellow warning with retry option | "Setup for [Provider] didn't complete. Try again or contact support." | Medium |

The key principle: **successful token refreshes are silent; failures are loud.** Your users should never know that tokens expire. They should only see UI changes when the system can't recover on its own and needs human intervention.

### In-UI Re-Auth Flows and Microcopy

When a webhook delivers `integrated_account:authentication_error`, your frontend needs to guide the user through reconnection without making them feel like they broke something. The microcopy matters more than you'd think: a vague "Connection error" sends users to your support queue, while a specific, actionable message lets them self-serve.

**State: Active (healthy)**

Show a simple status indicator on the integration card. Keep it minimal:

- Status badge: green dot + "Connected"
- Last sync timestamp: "Last synced 3 minutes ago"
- No action needed; don't clutter with unnecessary controls

**State: `needs_reauth` (broken)**

This is where your UI earns its keep. Display a persistent, dismissal-resistant banner:

- **Banner headline:** "Action required: Reconnect your [Provider] account"
- **Body:** "Your [Provider] connection stopped working on [date]. This usually happens when passwords change or an admin updates security settings. Reconnect now to resume data syncing."
- **Primary CTA:** "Reconnect [Provider]" (launches OAuth flow)
- **Secondary link:** "Why did this happen?" (links to a help article or expands an inline explanation)

Don't say "token expired" or "OAuth refresh failed." Your users don't care about OAuth. They care that their data stopped syncing.

**State: Reconnection in progress**

- Replace the error banner with a progress indicator: "Reconnecting to [Provider]..."
- On success, show a green toast: "[Provider] reconnected successfully. Syncing will resume shortly."
- On failure, return to the error state with updated copy: "Reconnection failed. Double-check that you used the correct [Provider] account and try again."

**State: `post_install_error` (setup incomplete)**

- **Banner:** "[Provider] setup didn't finish"
- **Body:** "We connected to [Provider], but additional configuration is needed. This can happen when required permissions weren't granted."
- **CTA:** "Complete setup" or "Retry"

> [!TIP]
> **Microcopy rule of thumb:** Always tell the user (1) what happened in plain language, (2) the most likely cause, and (3) exactly what to do next. If your error message doesn't contain all three, it's going to generate a support ticket.

### Self-Service Maintenance Controls

Beyond handling failures, your post-connection UI should give users proactive control over their integrations. Three actions cover the vast majority of self-service needs:

**Test Connection**

A "Test Connection" button that fires a lightweight API call through the integration (e.g., a `GET /me` or list-one-record request) and reports the result. This lets users verify their integration is working without waiting for the next scheduled sync. Keep the feedback immediate:

- Success: "Connection is healthy. Last verified just now."
- Auth failure: "Connection test failed - your credentials may have been revoked. Try reconnecting."
- Timeout/network error: "Couldn't reach [Provider] right now. This is usually temporary - try again in a few minutes."
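
The probe-to-message mapping can be as simple as the following sketch; the outcome names are illustrative, and the copy mirrors the list above:

```typescript
type ProbeOutcome = 'ok' | 'auth_failed' | 'unreachable';

// Map a lightweight "Test Connection" probe result to user-facing copy.
function testConnectionMessage(outcome: ProbeOutcome, provider: string): string {
  switch (outcome) {
    case 'ok':
      return 'Connection is healthy. Last verified just now.';
    case 'auth_failed':
      return 'Connection test failed - your credentials may have been revoked. Try reconnecting.';
    case 'unreachable':
      return `Couldn't reach ${provider} right now. This is usually temporary - try again in a few minutes.`;
  }
}
```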

**Reauthorize**

A dedicated "Reauthorize" button that launches the OAuth flow even when the connection is healthy. Users need this when:

- They want to change which account is connected
- Their admin rotated credentials on the provider side
- They need to grant additional scopes for new features

The reauthorize flow should preserve existing configuration (field mappings, sync schedules) and only replace the credentials. Don't make users reconfigure everything from scratch.

**Disconnect**

A "Disconnect" button with a confirmation dialog. This should:

- Revoke the OAuth token at the provider (call the provider's revoke endpoint if available)
- Clear stored credentials from your system
- Stop all active sync jobs for that account
- Show a clear confirmation: "Your [Provider] integration has been disconnected. No further data will be synced."

Always include a path back: "You can reconnect [Provider] at any time from this page."

```mermaid
stateDiagram-v2
    [*] --> Connected
    Connected --> Testing : User clicks<br>"Test Connection"
    Testing --> Connected : Test passes
    Testing --> Error : Test fails
    Connected --> Reauthorizing : User clicks<br>"Reauthorize"
    Reauthorizing --> Connected : OAuth flow<br>succeeds
    Reauthorizing --> Error : OAuth flow<br>fails
    Error --> Reauthorizing : User clicks<br>"Reconnect"
    Connected --> Disconnected : User clicks<br>"Disconnect"
    Disconnected --> Reauthorizing : User clicks<br>"Connect"
    Error --> Disconnected : User clicks<br>"Disconnect"
```

### Monitoring Hooks: When to Surface Alerts vs. Silent Telemetry

Not every backend event deserves a user-facing notification. Sending an email every time a token refreshes would be absurd. But failing to notify when a connection dies is a direct path to customer churn. The distinction comes down to **recoverability**: if the system can fix it automatically, stay quiet. If it needs human action, be loud.

**Silent telemetry (log, don't alert):**

- Successful token refreshes (the vast majority of events)
- Transient provider errors (500s) that resolve on retry
- Rate limit encounters that clear within minutes
- Proactive refresh alarm scheduling and execution
- Token expiry buffer checks during API calls

These events should feed into your internal monitoring dashboards and alerting (for your engineering team), but never reach end users.

**Surface to end users:**

- Account status changed to `needs_reauth`, which always requires user action
- Post-install or validation errors during initial setup
- Repeated transient failures that suggest a deeper problem (e.g., a provider returning 500s for 24+ hours)

**Surface to your customers' admins (via webhook):**

- `integrated_account:authentication_error`: the customer's backend should receive this and route it to the right person, whether that's an email, a Slack message, or an in-app notification
- `integrated_account:reactivated`: lets the customer's system automatically clear alerts and resume dependent workflows

The webhook-to-notification pipeline on the customer's side should also be idempotent. If your platform fires `authentication_error` once and the customer's system sends an email, a second webhook for the same account in `needs_reauth` should not send a second email. Use the account status as the source of truth, not the webhook count.

> [!NOTE]
> **Implementation shortcut:** If you're using Truto, the `integrated_account:authentication_error` and `integrated_account:reactivated` webhooks give you everything you need to drive these UI states. Map the webhook `eventType` to your frontend component state, and use the `payload.status` field to determine which banner to render. No polling required.
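
That mapping can be a single switch; the event names follow this article, while the `BannerState` values are illustrative frontend component states:

```typescript
type BannerState = 'error_banner' | 'success_toast' | 'none';

// Map incoming webhook event types to the UI states from the decision table.
function bannerForEvent(eventType: string): BannerState {
  switch (eventType) {
    case 'integrated_account:authentication_error':
      return 'error_banner'; // persistent red banner + reconnect CTA
    case 'integrated_account:reactivated':
      return 'success_toast'; // transient green toast; dismiss error banners
    case 'integrated_account:active':
      return 'success_toast'; // first connection confirmed
    default:
      return 'none'; // every other event stays silent
  }
}
```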

## Provider-Specific Gotchas That Will Bite You

No amount of reading the OAuth spec will prepare you for these real-world quirks:

- **Google** caps refresh tokens at 100 per user account per OAuth client ID; minting a new one past that limit silently invalidates the oldest. Apps whose consent screen is still in "Testing" publishing status get refresh tokens that expire after just 7 days. And a routine password reset by your user instantly revokes active tokens for consumer (non-Workspace) Google accounts.
- **Microsoft Entra** refresh tokens can be invalidated when an admin changes Conditional Access policies. <cite index="7-24,7-25">If a user changes or resets their password, or if a security event occurs, Microsoft can revoke existing refresh tokens. This commonly shows up as AADSTS50173.</cite> <cite index="7-17">Microsoft often embeds an AADSTS code in error_description, which tells you why the refresh failed</cite>—a small mercy compared to most providers.
- **Salesforce** lets org admins override *your* app's token policy. <cite index="62-7,62-8">Admins of Salesforce organisations where your External Client App is installed can override your app's default refresh token policy. If they do, you may see every token refresh for users of that org fail.</cite> Your connected app might be set to "valid until revoked," but if an admin sets it to "immediately expire," every refresh fails—and nothing in the error message tells you why.
- **Linear** rotates refresh tokens on every use. <cite index="6-22">This is also the most common failure mode when teams run token refreshes in multiple processes/containers without proper locking: one process stores the "new" refresh token, another process overwrites it with the stale one, and suddenly every refresh starts failing.</cite>

These aren't edge cases. They're the normal operating conditions of production OAuth across popular SaaS providers.
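
The Linear-style rotation race above is exactly what a single-flight guard prevents. Here is a minimal in-process sketch; multi-instance deployments need a distributed lock (e.g. Redis) instead, and `refreshWithProvider` is a hypothetical function standing in for your actual token exchange:

```typescript
// One in-flight refresh per account within a single process.
// Concurrent callers await the same promise instead of racing,
// so a rotated refresh token is never overwritten by a stale one.
const inFlight = new Map<string, Promise<string>>();

async function refreshToken(
  accountId: string,
  refreshWithProvider: (id: string) => Promise<string>
): Promise<string> {
  const existing = inFlight.get(accountId);
  if (existing) return existing; // piggyback on the refresh already running

  const pending = refreshWithProvider(accountId).finally(() =>
    inFlight.delete(accountId)
  );
  inFlight.set(accountId, pending);
  return pending;
}
```

With this guard, two concurrent callers for the same account trigger a single provider request and both receive the same fresh token.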

## Why You Should Not Build This In-House

Let's be honest about the full scope of what we've described:

- Proactive alarm scheduling with randomized jitter across millions of accounts
- Per-account distributed mutex locks with watchdog timeout recovery
- Token merging to preserve refresh tokens across inconsistent providers
- Expiry buffers to prevent in-flight request failures
- Error classification with retry/terminal branching
- Per-provider error expressions for non-standard APIs
- Webhook-driven re-authentication flows with guaranteed delivery
- Encrypted token storage with field-level protection

That's a lot of infrastructure for something that isn't your core product. <cite index="21-27">88% of companies encounter problems with third-party APIs that require troubleshooting on a weekly basis.</cite> Every hour your team spends debugging token refresh races is an hour not spent on features that differentiate your product.

This is exactly the problem [Truto's unified API](https://truto.one/blog/what-is-a-unified-api/) solves, particularly for enterprise teams [finding an integration partner for white-label OAuth and on-prem compliance](https://truto.one/blog/finding-an-integration-partner-for-white-label-oauth-on-prem-compliance/). Truto handles the entire OAuth token lifecycle—from initial authorization through proactive refresh, concurrency control, error detection, and re-authentication flows—across 200+ SaaS integrations. **Per-account scheduling** keeps tokens renewed ahead of expiry; **per-account mutexes** guarantee single-flight refresh so concurrent callers never stampede the provider. JSONata error expressions parse provider-specific failures and fire standardized webhooks directly to your application.

Your team writes zero auth management code. When a token refresh fails, Truto classifies the error, marks the account, fires a webhook to your system, and guides the user through re-authentication. You just make standard API calls, and Truto ensures the underlying credentials are always fresh, valid, and secure.

For a fuller breakdown of the build-vs-buy calculus, see [The True Cost of Building SaaS Integrations In-House](https://truto.one/blog/build-vs-buy-the-true-cost-of-building-saas-integrations-in-house/).

## What to Do Monday Morning

If you're dealing with OAuth refresh failures right now, here's the priority order:

1. **Instrument first.** Log full request/response bodies for every token refresh attempt. You can't fix what you can't see.
2. **Add a 30-second expiry buffer** to your token freshness checks. This alone eliminates a class of in-flight expiry failures.
3. **Implement a per-account lock** for refresh operations. Even a simple in-memory lock prevents the most common race condition. Graduate to a distributed lock when you go multi-instance.
4. **Merge token responses** instead of replacing them. One defensive merge prevents losing refresh tokens forever.
5. **Classify errors and stop retrying terminal failures.** Mark accounts as `needs_reauth` and notify users instead of burning through rate limits.
6. **Build the connection health UI.** Map webhook events to frontend states, add a "Test Connection" button, and write microcopy that tells users exactly what to do. A great post-connection UI turns a support ticket into a self-service fix.
7. **Evaluate whether this is worth owning.** If you're integrating with more than 3-5 OAuth providers, the maintenance cost compounds fast. A unified API platform like Truto can absorb this complexity so your team focuses on product.
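
Items 2 and 4 above fit in a few lines. This is a sketch using standard OAuth token-response field names; the 30-second buffer and `StoredToken` shape are assumptions to adapt to your storage layer:

```typescript
// Expiry buffer: treat tokens expiring within 30s as already stale,
// so an in-flight API request never fails mid-call.
interface StoredToken {
  access_token: string;
  refresh_token?: string;
  expires_at: number; // epoch milliseconds
}

const EXPIRY_BUFFER_MS = 30_000;

function isFresh(token: StoredToken, now: number = Date.now()): boolean {
  return token.expires_at - now > EXPIRY_BUFFER_MS;
}

// Defensive merge: some providers omit refresh_token in refresh
// responses. Merging instead of replacing preserves the old one.
function mergeTokenResponse(
  stored: StoredToken,
  response: Partial<StoredToken>
): StoredToken {
  return {
    ...stored,
    ...response,
    refresh_token: response.refresh_token ?? stored.refresh_token,
  };
}
```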

> Managing OAuth token lifecycles across dozens of providers is a full-time job. Truto handles proactive refresh, concurrency control, error detection, and re-authentication flows for 200+ integrations—so your team doesn't have to.
>
> [Talk to us](https://cal.com/truto/partner-with-truto)
