How do you prevent OAuth token refresh race conditions in distributed systems?

Use a per-account mutex lock (via Redis distributed locks, database advisory locks, or another single-flight primitive per tenant) to serialize refresh requests. Concurrent callers wait for the in-progress refresh and receive the same result, preventing duplicate provider requests that can invalidate rotating refresh tokens.

What is proactive token refresh and why is it better than reactive refresh?

Proactive refresh schedules token renewal 60-180 seconds before expiration using background alarms, so tokens are always fresh when API requests arrive. Reactive refresh waits for a 401 error, adding latency spikes and risking permanent invalid_grant lockouts when refresh tokens rotate during network failures.

How should OAuth tokens be encrypted at rest?

Use AES-256-GCM encryption with a unique random 12-byte IV per encryption operation. Store the IV alongside the ciphertext. GCM provides both confidentiality and integrity verification, detecting tampering in addition to preventing unauthorized reads. Never reuse an IV with the same key.

What happens when an OAuth refresh token is permanently invalidated?

Mark the connected account as needing re-authentication, stop retry attempts immediately (retrying invalid_grant errors is pointless), and fire a webhook notification so the customer can re-authorize. Automatically reactivate the account if a subsequent re-authentication succeeds.

How do third-party API providers handle concurrent refresh requests?

Strict providers enforce refresh token rotation. If they receive multiple concurrent requests using the same refresh token, they process the first and invalidate the old token. Subsequent requests are treated as a replay attack, and the provider may revoke the entire token family, permanently locking out your application.

How to Architect a Scalable OAuth Token Management System for B2B SaaS Integrations

Architect OAuth token management for Brex, AI agents, and B2B SaaS: proactive refresh, per-account mutex locks, encryption, and BYO-OAuth portability.

Sidharth Verma · March 20, 2026 · 42 min read

If you have ever built a third-party integration in a weekend, you know the feeling of triumph when that first 200 OK comes back. You store the access token in your database, maybe save the refresh token alongside it, and push the feature to production.

Three months later, your error logs light up.

Tokens expire mid-sync. Background jobs hit race conditions because two worker threads tried to refresh the same token simultaneously. A customer changes their password, revoking all active sessions, and your application blindly hammers the provider's API with an invalid token for hours before anyone notices. If you are building B2B SaaS integrations, your OAuth token management system is either already broken or about to be.

The initial OAuth handshake is the easy part. The hard part is everything that happens afterward. And the stakes are not theoretical. In 2025, 22% of breaches began with stolen or compromised credentials according to Verizon, the highest of any vector. Breaches involving stolen credentials are costly - averaging $4.8M per incident. The Salesloft Drift breach in August 2025 proved this applies directly to OAuth tokens: Salesloft experienced a supply chain breach through its Drift chatbot integration that impacted more than 700 organizations. Threat actors stole OAuth authentication tokens that allowed them to impersonate the trusted Drift application and gain unauthorized access to customer environments.

Managing OAuth token lifecycles is not a database storage problem. It is a distributed systems problem with real security implications. This guide breaks down exactly how to architect a scalable, concurrent-safe OAuth token management system for enterprise integrations.

The Hidden Complexity of OAuth Token Management in B2B SaaS

OAuth token management is the continuous process of acquiring, storing, refreshing, and revoking OAuth tokens for customer-connected third-party accounts. In B2B SaaS, this means managing tokens for every customer's Salesforce org, HubSpot portal, Workday tenant, and dozens of other platforms - simultaneously.

The OAuth 2.0 spec gives you a framework. Every provider then implements it with their own creative interpretation:

Token lifetimes vary wildly. HubSpot access tokens expire in 30 minutes. Salesforce tokens last longer but can be revoked at any time by an admin. Some providers do not return an expires_in field at all, leaving you to guess.
Refresh token rotation is inconsistent. Some providers (like Microsoft Entra ID) issue a new refresh token with every refresh request, invalidating the old one. Others keep the refresh token stable. With rotating providers, a race condition can cost you the only valid refresh token, making future refreshes impossible.
Revocation is silent. A customer changes their password, an admin removes your app from their org, or the provider decides to expire refresh tokens after 90 days of inactivity. You find out when your next API call returns a 401.

Many legacy enterprise APIs make things worse by returning 403 Forbidden, 500 Internal Server Error, or even a 200 OK with an error message buried deep inside the JSON payload when a token expires. Writing custom interceptors for 50 different API error formats is an unsustainable engineering burden.

Implementation Overview and Goals

Before you write a single line of token management code, agree on the operational contract. A scalable OAuth architecture for customer-facing integrations has to satisfy five non-negotiable goals:

Goal	What it means in practice
Zero silent failures	Every failed refresh produces a durable event (webhook, log, metric) and moves the account into an explicit state. No token failure should be discovered through a customer support ticket.
Bounded blast radius	A single misbehaving account cannot exhaust worker pools, provider rate limits, or the token store for other tenants. Per-account isolation is not optional.
Horizontal scalability	Adding worker processes or regions must not increase the rate of `invalid_grant` errors. All coordination happens through a shared, correct primitive - not luck.
Auditability and rotation	Every credential has a monotonic version, a `last_used_at`, and a rotation history. Both application secrets and encryption keys can be rotated without downtime.
Portability	Tokens are stored in a way that survives platform migration, key rotation, and provider client-secret rotation. Customers never re-authenticate because you refactored something internal.

Everything else - the specific lock primitive, the queue technology, the encryption key manager - is an implementation detail that flows from these goals. If your design fails any of them, no amount of throughput will save it.

The concrete artifacts you will build to satisfy the contract:

A token table whose columns encode the state machine, the concurrency version, and the encryption envelope.
A per-account serializer (Redis SETNX, database advisory lock, or actor-per-account) that funnels concurrent refreshes into one flight.
A proactive scheduler that fires an alarm ahead of every expiry with jitter to smooth load.
A refresh state machine with explicit retry, backoff, and reauth transitions.
An observability plane that surfaces per-provider refresh success rate, latency, and stale-lock counts.

The rest of this guide walks through each artifact with code you can adapt.

Why Reactive Token Refresh Fails at Scale

Reactive Token Refresh: A pattern where an application waits for an API request to fail with an HTTP 401 Unauthorized error before attempting to use a refresh token to obtain a new access token.

Most engineering teams default to the reactive approach because it feels logical. You use a token until the provider tells you it is invalid. The execution flow looks like this:

The client sends a request to the third-party API.
The provider returns a 401 Unauthorized status.
The client intercepts the error, pauses the original request, and sends a POST /oauth/token request.
The provider validates the refresh token and returns a new access token.
The client updates the database and retries the original request.

At low volumes, this works. At scale, it collapses under its own weight.

Latency injection on every expired token. If a standard API request takes 200 milliseconds, a reactive refresh forces the user to wait through a 401 rejection (200ms), a token exchange handshake (500ms), and a subsequent retry (200ms). That is roughly 900ms instead of 200ms, more than four times the latency on the critical path. For batch operations processing thousands of records, this pause cascades across every worker that hits the same expired token.

Disrupted long-running operations. Data sync jobs that run for minutes or hours will inevitably encounter token expiration mid-execution. A reactive approach means the job fails partway through, requiring retry logic, idempotency guarantees, and partial-state recovery - all because a token expired predictably.

Partial state corruption. If the network drops exactly after the provider issues the new token but before your database commits the update, your system state is corrupted. The provider has rotated the refresh token, but your application is still holding the old one. The next refresh attempt hits an invalid_grant error. Your application is permanently locked out, and the end user must manually re-authenticate. Look for warning signs like API requests failing with "access token invalid" error messages, or access token refreshes failing with "revoked refresh token" or invalid_grant messages. While refresh tokens can be revoked for many reasons, frequent revocation errors might indicate race conditions.

You are debugging timing, not logic. The nastiest part of reactive refresh is that it works perfectly in development, where you have a single process and low traffic. These issues often appear under load or in production environments where multiple processes are running simultaneously. They're much harder to detect in development or testing environments with single-threaded execution.

The fix is straightforward in principle: do not wait for tokens to expire.

Solving OAuth Token Refresh Concurrency and Race Conditions

Token Refresh Race Condition: A concurrency failure where multiple parallel processes attempt to refresh the same expired OAuth token simultaneously, leading to provider rate limits, token invalidation, or corrupted database state.

The most dangerous architectural flaw in token management is the "thundering herd" problem.

Imagine your application runs a nightly background sync job that pulls 50,000 contact records from a customer's Salesforce instance. To speed up the process, your worker spins up 20 concurrent threads to fetch paginated data. Midway through the sync, the access token expires. All 20 threads hit a 401 Unauthorized error at the exact same millisecond.

Strict OAuth providers enforce refresh token rotation: every time you use a refresh token, the provider issues a brand new one and immediately invalidates the old one. Here is what happens when 20 threads race to refresh:

sequenceDiagram
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant DB as Token Store
    participant P as OAuth Provider

    W1->>DB: Read token (expired)
    W2->>DB: Read token (expired)
    W1->>P: POST /token (refresh_token_v1)
    W2->>P: POST /token (refresh_token_v1)
    P-->>W1: new access_token + refresh_token_v2
    P-->>W2: 400 invalid_grant (token already used)
    W1->>DB: Store refresh_token_v2
    W2->>DB: ❌ Fails - writes stale data or crashes

The provider processes the first request and issues a new token pair. When the second request arrives milliseconds later using the now-invalidated refresh token, the provider's security model assumes the token has been stolen and is being replayed by an attacker. To protect the account, it revokes the entire token family. All active tokens are instantly destroyed. Your sync job crashes, and the customer receives an alarming email from their IT department about a suspected security breach.

This is not a hypothetical edge case. It happens constantly in production systems. A recent GitHub issue against the Claude Code CLI documented the exact failure mode: when multiple CLI processes run concurrently, they race on refreshing the single-use OAuth refresh token. The loser of the race gets a 404 and loses authentication with no automatic recovery. A similar issue hit OpenAI Codex users running parallel sessions: 18 agents on openai-codex shared one OAuth profile, and roughly every 12 hours, when the access token expired, a burst of agents all tried to refresh simultaneously.

How to Prevent Refresh Token Race Conditions

The solution is serialized refresh with request deduplication: ensure only one refresh operation runs per connected account at any given time, and have all other callers wait for the result. Optimistic locking or a simple SQL UPDATE statement is not enough on its own when the provider rotates refresh tokens. A version check detects the conflict only after every racing worker has already called the provider, and the resulting retries can quickly exhaust your API rate limits.

There are several implementation strategies, each with trade-offs:

Strategy	Best For	Trade-offs
In-memory mutex	Single-instance apps	Does not work across multiple servers
Database advisory locks	Multi-instance, single-region	Adds DB load; lock timeout complexity
Redis distributed lock	Multi-instance, multi-region	Requires Redis infrastructure; SETNX expiry tuning
Actor / per-tenant serializer	Edge-native or serverless	Platform-specific; higher per-request cost

The pattern in pseudocode:

// Pseudocode: tokenStore, lock, oauthProvider, and isExpiringSoon stand in
// for your own storage, locking, provider client, and expiry helper.
const BUFFER_SECONDS = 30;

async function getValidToken(accountId: string): Promise<string> {
  const token = await tokenStore.get(accountId);

  // Check with a 30-second safety buffer
  if (!isExpiringSoon(token, BUFFER_SECONDS)) {
    return token.accessToken;
  }

  // Acquire a lock scoped to this specific account
  return await lock.acquire(accountId, async () => {
    // Re-read token - another caller may have already refreshed it
    const freshToken = await tokenStore.get(accountId);
    if (!isExpiringSoon(freshToken, BUFFER_SECONDS)) {
      return freshToken.accessToken;
    }

    // Perform the actual refresh
    const newToken = await oauthProvider.refresh(freshToken.refreshToken);
    await tokenStore.save(accountId, newToken);
    return newToken.accessToken;
  });
}

Two things matter here. First, the lock is scoped per account, not global. Refreshes for different customer accounts should run in parallel. Only refreshes for the same account need serialization. Second, the double-check pattern inside the lock is essential - the first caller refreshes, and subsequent callers that were waiting re-read the already-refreshed token without making a duplicate provider request.
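
For the single-instance row of the table, the "callers wait for the result" half of the pattern can be sketched as an in-memory single-flight map. This is a minimal sketch, not a distributed lock: the SingleFlight class and the fakeRefresh function are hypothetical names used for illustration, and the map only deduplicates callers inside one process.

```typescript
// Hypothetical helper: deduplicates concurrent async calls per key
// within ONE process. It does not coordinate across servers.
class SingleFlight {
  private inflight = new Map<string, Promise<unknown>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) {
      // A refresh for this key is already running: share its result.
      return existing as Promise<T>;
    }
    // Start fn on the next microtask so a synchronous throw inside fn
    // becomes a rejection instead of escaping run().
    const flight = Promise.resolve()
      .then(fn)
      .finally(() => {
        // Clear the slot on success or failure so the next caller
        // can start a new flight.
        this.inflight.delete(key);
      });
    this.inflight.set(key, flight);
    return flight;
  }
}

// Usage: callers asking for the same account share one refresh promise.
const flights = new SingleFlight();
const fakeRefresh = async (): Promise<string> => {
  return 'new-access-token'; // stands in for POST /oauth/token
};
const first = flights.run('tenant_123', fakeRefresh);
const second = flights.run('tenant_123', fakeRefresh);
const other = flights.run('tenant_456', fakeRefresh);
```

Keying the map by account ID keeps refreshes for different accounts parallel, which is the first of the two properties described above.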

The resolved flow looks like this:

sequenceDiagram
    participant W1 as Worker Thread 1
    participant W2 as Worker Thread 2
    participant Lock as Mutex Lock
    participant DB as Database
    participant API as Provider API

    Note over W1, W2: Distributed Mutex Pattern
    W1->>Lock: Acquire Lock (tenant_123)
    Lock-->>W1: Lock Granted
    W2->>Lock: Acquire Lock (tenant_123)
    Lock-->>W2: Promise Pending (Wait)
    W1->>API: POST /oauth/token
    API-->>W1: 200 OK (New Token Pair)
    W1->>DB: Save Encrypted Tokens
    W1->>Lock: Release Lock / Resolve Promise
    Lock-->>W2: Promise Resolved
    W2->>DB: Read Fresh Token
    W2->>API: Proceed with API Request

Exactly one refresh request hits the provider, completely eliminating the race condition. For a deeper dive into the distributed systems concepts behind this, read our guide on OAuth at Scale: The Architecture of Reliable Token Refreshes.

Per-Account Mutex Patterns: Redis, Advisory Locks, and Optimistic CAS

Every production-grade OAuth architecture needs a per-account mutex. Which primitive you pick depends on your stack, your latency budget, and how many regions you run in. Here are the three patterns that actually work, with the trade-offs made concrete.

Pattern 1: Redis SETNX with a Fencing Token

Redis SET NX PX is the most common distributed lock for OAuth refresh because it is cheap, network-local, and supports a millisecond-precision TTL that guarantees the lock releases even if the worker crashes.

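import { randomUUID } from 'crypto';
import Redis from 'ioredis'; // assumption: the call shapes below match ioredis

// Small helper used by the retry loop in withAccountLock.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));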
 
const LOCK_TTL_MS = 30_000; // longer than any refresh should take
 
async function acquireLock(
  redis: Redis,
  accountId: string
): Promise<{ ownerId: string } | null> {
  const ownerId = randomUUID();
  const key = `oauth:refresh:${accountId}`;
  const ok = await redis.set(key, ownerId, 'PX', LOCK_TTL_MS, 'NX');
  return ok === 'OK' ? { ownerId } : null;
}
 
// Release only if we still own the lock - prevents releasing someone else's
// lock after our operation timed out and another worker took over.
const RELEASE_SCRIPT = `
  if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
  else
    return 0
  end
`;
 
async function releaseLock(redis: Redis, accountId: string, ownerId: string) {
  await redis.eval(RELEASE_SCRIPT, 1, `oauth:refresh:${accountId}`, ownerId);
}
 
async function withAccountLock<T>(
  redis: Redis,
  accountId: string,
  fn: () => Promise<T>,
  maxWaitMs = 15_000
): Promise<T> {
  const start = Date.now();
  while (Date.now() - start < maxWaitMs) {
    const held = await acquireLock(redis, accountId);
    if (held) {
      try {
        return await fn();
      } finally {
        await releaseLock(redis, accountId, held.ownerId);
      }
    }
    // Another worker is refreshing - short randomized backoff, then try again
    await sleep(50 + Math.random() * 100);
  }
  throw new Error(`Timed out acquiring lock for account ${accountId}`);
}

Trade-offs to know:

The TTL must exceed your slowest legitimate refresh. If it does not, a slow provider will cause the lock to expire while the operation is still running, and a second worker will start a duplicate refresh.
If you rely on Redis for correctness, run it as a durable primary with a replica. Sentinel-based failover during a refresh can lose the lock. For multi-region deployments, Redlock or a consensus-backed KV is worth the extra complexity.
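The ownerId plays the role of the fencing token here, although strictly it is an ownership token: it guarantees that a worker only releases a lock it still holds, but the token store never checks it on write. Any release path that does not check ownership can accidentally free another worker's lock after its own TTL expired. If a lock can expire mid-refresh, pair it with a version check on the token row (such as the rotation_id counter described later) so that a stale writer is rejected.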

Pattern 2: PostgreSQL Advisory Locks

If you already run PostgreSQL for your token store, advisory locks give you correct per-account serialization with zero new infrastructure. They are held for the duration of a transaction (pg_advisory_xact_lock) and released automatically on commit or rollback.

async function refreshWithAdvisoryLock(
  db: Pool,
  accountId: string
): Promise<Token> {
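  // Assumption: db.transaction is a helper that runs the callback between
  // BEGIN and COMMIT on a single connection and rolls back on error.
  // node-postgres' Pool does not ship this method, so wrap pool.connect()
  // yourself if your driver lacks it.
  return db.transaction(async (tx) => {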
    // hashtext produces a deterministic 32-bit hash of the account id.
    // pg_advisory_xact_lock blocks until the lock is granted; use
    // pg_try_advisory_xact_lock if you want a non-blocking variant.
    await tx.query(
      `SELECT pg_advisory_xact_lock(hashtext($1))`,
      [`oauth-refresh:${accountId}`]
    );
 
    // Re-read the row inside the lock - another worker may have refreshed
    // while we were waiting.
    const { rows } = await tx.query(
      `SELECT encrypted_context, access_token_expires_at, rotation_id
         FROM integrated_account WHERE id = $1 FOR UPDATE`,
      [accountId]
    );
 
    const account = rows[0];
    if (!isExpiringSoon(account.access_token_expires_at, 30)) {
      return decrypt(account.encrypted_context);
    }
 
    const newToken = await providerRefresh(account);
    await tx.query(
      `UPDATE integrated_account
         SET encrypted_context = $1,
             access_token_expires_at = $2,
             rotation_id = rotation_id + 1,
             last_refresh_at = NOW(),
             refresh_attempts = 0
       WHERE id = $3`,
      [encrypt(newToken), newToken.expiresAt, accountId]
    );
    return newToken;
  });
}

Trade-offs:

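Advisory locks are per-connection and per-transaction. They do not survive connection loss, which is what you want, but the provider call has to run inside the same transaction so the lock covers it. Put a strict timeout on that call, because the transaction pins a database connection and a row lock for as long as the provider takes to answer.
The lock key is either one 64-bit integer or a pair of 32-bit integers. hashtext() returns only 32 bits, so collisions become likely once you have tens of thousands of accounts. A collision is harmless for correctness (two unrelated accounts occasionally wait on each other), but a 64-bit key makes it rare.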
No new infrastructure to run. If your traffic is small enough that a single database can handle every refresh, this is the simplest correct answer.
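
If the 32-bit key space feels too tight, one option is to derive a 64-bit key in application code and pass it to pg_advisory_xact_lock directly. The sketch below assumes the first 8 bytes of a SHA-256 digest are an acceptable key; the advisoryLockKey name and the 'oauth-refresh' namespace prefix are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Derive a signed 64-bit advisory lock key from an account id.
// Assumption: the first 8 bytes of SHA-256 are used as the key. The
// namespace prefix keeps OAuth locks apart from any other advisory
// locks taken in the same database.
function advisoryLockKey(
  accountId: string,
  namespace = 'oauth-refresh'
): string {
  const digest = createHash('sha256')
    .update(`${namespace}:${accountId}`)
    .digest();
  // PostgreSQL bigint is signed, so read the bytes as a signed integer
  // and return it as a decimal string that is safe to send as a parameter.
  return digest.readBigInt64BE(0).toString();
}

// Pass the key as a parameter and cast it in SQL, for example:
//   SELECT pg_advisory_xact_lock($1::bigint)
const lockKeyForAccount = advisoryLockKey('acct_1');
```

Returning a string avoids precision loss, since a 64-bit value does not fit in a JavaScript number.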

Pattern 3: Optimistic CAS with a Rotation Counter

Sometimes you want a lockless design. Optimistic Compare-and-Swap works well when refresh conflicts are rare and you would rather retry than block. Every token row carries a monotonic rotation_id; the update only succeeds if the row still has the version you read.

async function refreshOptimistic(
  db: Pool,
  accountId: string
): Promise<Token> {
  for (let attempt = 0; attempt < 5; attempt++) {
    const { rows: [account] } = await db.query(
      `SELECT encrypted_context, access_token_expires_at, rotation_id
         FROM integrated_account WHERE id = $1`,
      [accountId]
    );
 
    if (!isExpiringSoon(account.access_token_expires_at, 30)) {
      return decrypt(account.encrypted_context);
    }
 
    const newToken = await providerRefresh(account);
 
    const { rowCount } = await db.query(
      `UPDATE integrated_account
         SET encrypted_context = $1,
             access_token_expires_at = $2,
             rotation_id = rotation_id + 1,
             last_refresh_at = NOW()
       WHERE id = $3 AND rotation_id = $4`,
      [encrypt(newToken), newToken.expiresAt, accountId, account.rotation_id]
    );
 
    if (rowCount === 1) {
      return newToken;
    }
 
    // Someone else won the race. Re-read and use their refreshed token.
    await sleep(50 * Math.pow(2, attempt) + Math.random() * 50);
  }
  throw new Error(`CAS refresh failed after retries for ${accountId}`);
}

Trade-offs:

CAS is the wrong choice when providers rotate refresh tokens. Two workers both call the provider, both succeed, one loses the CAS - but the losing worker has already burned a refresh token that no one will ever store, and the provider may invalidate the whole token family. Use CAS only when refresh tokens are stable, or combine it with a soft lock.
CAS is excellent for lightweight metadata updates (touching last_used_at, updating refresh_attempts) where a lost update is harmless.

Choosing Between the Three

Signal	Pick
Single-region deployment with Postgres already in the stack	Advisory locks
Multi-region, latency-sensitive, existing Redis	Redis SETNX + fencing token
Provider rotates refresh tokens on every use (Brex, Microsoft Entra, Xero)	Locking (Redis or advisory) - never pure CAS
Small connected-account count, refresh conflicts rare	Advisory locks or CAS for metadata
Edge or serverless with actor primitives available	Actor-per-account (functionally a mutex keyed by account id)

The pattern matters less than the invariant: exactly one refresh flight in progress per account at any time. Every pattern above enforces it. Pick the one that fits your infrastructure and stop there.

Proactive Refresh Architecture: Renewing Tokens Before They Expire

Proactive Refresh: An automated background process that schedules and executes a token renewal before the current access token expires, guaranteeing that valid credentials are always available for in-flight requests.

Once you have concurrency under control, the next step is eliminating reactive refreshes entirely. When you complete an OAuth token exchange, the provider returns a JSON payload containing an expires_in field:

{
  "access_token": "eyJhbGciOiJIUzI1NiIs...",
  "refresh_token": "def50200234a...",
  "token_type": "Bearer",
  "expires_in": 3600
}

Instead of waiting for the token to die, calculate the absolute expires_at timestamp and store it in your database. Then, schedule a background worker or distributed alarm to trigger a refresh before that timestamp.
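
The conversion from expires_in to an absolute timestamp might look like the sketch below. The computeExpiresAt name and the DEFAULT_TTL_SECONDS fallback are assumptions for illustration; the fallback exists only because some providers omit expires_in, and its value should come from each provider's documentation.

```typescript
// Fields of the token endpoint response used here.
interface TokenResponse {
  access_token: string;
  refresh_token?: string;
  expires_in?: number; // seconds; some providers omit it
}

// Hypothetical fallback when a provider omits expires_in.
// 30 minutes is only a placeholder value.
const DEFAULT_TTL_SECONDS = 1800;

function computeExpiresAt(
  response: TokenResponse,
  receivedAtMs: number = Date.now()
): Date {
  const ttl =
    typeof response.expires_in === 'number' && response.expires_in > 0
      ? response.expires_in
      : DEFAULT_TTL_SECONDS;
  return new Date(receivedAtMs + ttl * 1000);
}

const expiresAt = computeExpiresAt(
  { access_token: 'abc', expires_in: 3600 },
  Date.UTC(2026, 0, 1, 9, 0, 0)
);
// expiresAt.toISOString() is "2026-01-01T10:00:00.000Z"
```

Capture receivedAtMs as close to the provider response as possible, so queueing delays in your own system do not make the stored expiry later than the real one.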

The architecture ties together a scheduler, the per-account lock, and explicit failure handling:

flowchart LR
    A[Token Acquired<br>via OAuth] --> B[Schedule Alarm<br>60-180s before expiry]
    B --> C{Alarm Fires}
    C --> D[Acquire Mutex Lock]
    D --> E[Refresh Token<br>with Provider]
    E --> F{Success?}
    F -- Yes --> G[Store New Token<br>Schedule Next Alarm]
    F -- No, Retryable --> H[Schedule Retry<br>in 3 hours]
    F -- No, Fatal --> I[Mark needs_reauth<br>Notify Customer]

Implementing Jitter and Buffer Times

If you onboarded 1,000 enterprise users at 9:00 AM, and all their tokens expire in exactly one hour, a naive cron job will attempt to refresh 1,000 tokens at exactly 9:59 AM. This spikes your infrastructure load and likely triggers abuse filters on the provider's side.

Introduce randomized jitter into your scheduling. If the token expires at 10:00 AM, schedule the refresh alarm to fire at a random interval between 9:57 AM and 9:59 AM. This spreads the network load evenly across your workers.
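
A minimal sketch of that scheduling rule, assuming a uniform random lead time inside the 60-180 second window. The scheduleRefreshAt name and the injectable random parameter are illustrative choices, not part of any scheduler API.

```typescript
// Refresh window relative to expiry, matching the 60-180 second lead time.
const MIN_LEAD_MS = 60_000;
const MAX_LEAD_MS = 180_000;

// Returns the epoch-millisecond time at which the refresh alarm should fire.
// `random` is injectable so the function can be tested deterministically.
function scheduleRefreshAt(
  expiresAtMs: number,
  nowMs: number = Date.now(),
  random: () => number = Math.random
): number {
  const lead = MIN_LEAD_MS + random() * (MAX_LEAD_MS - MIN_LEAD_MS);
  // Never schedule in the past: a short-lived token is refreshed right away.
  return Math.max(nowMs, Math.round(expiresAtMs - lead));
}

const tenAm = Date.UTC(2026, 0, 1, 10, 0, 0);
const nineAm = Date.UTC(2026, 0, 1, 9, 0, 0);
const fireAt = scheduleRefreshAt(tenAm, nineAm);
// fireAt falls between 09:57:00 and 09:59:00 UTC
```

With 1,000 tokens expiring at the same instant, a uniform draw spreads the refreshes across the two-minute window instead of landing them in a single second.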

Distinguish Retryable from Fatal Errors

An HTTP 500 from the provider's token endpoint is transient - schedule a retry. An invalid_grant or HTTP 401 means the refresh token itself is dead. No amount of retrying will fix it. Mark the account as requiring re-authentication and stop the alarm. Ship a webhook to the customer so they know immediately.
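
The classification can be sketched as a small pure function. This is an illustration under stated assumptions: the classifyRefreshFailure name is hypothetical, treating 429 as retryable and other 4xx responses as fatal are policy choices rather than provider guarantees, and a status of 0 is assumed to represent a network failure with no HTTP response.

```typescript
type RefreshOutcome = 'retryable' | 'fatal';

// OAuth error codes that retrying cannot fix. invalid_grant is the one
// named in the text; extend the set per provider if needed.
const FATAL_OAUTH_ERRORS = new Set<string>(['invalid_grant']);

function classifyRefreshFailure(
  httpStatus: number,
  oauthError?: string
): RefreshOutcome {
  if (oauthError !== undefined && FATAL_OAUTH_ERRORS.has(oauthError)) {
    return 'fatal';
  }
  if (httpStatus === 401) return 'fatal';
  // Rate limits and server-side failures are expected to clear up.
  if (httpStatus === 429 || httpStatus >= 500) return 'retryable';
  // Assumption: any other 4xx is fatal so a bad request is not retried
  // forever; anything below 400 (e.g. 0 for a network error) is retryable.
  return httpStatus >= 400 ? 'fatal' : 'retryable';
}
```

Checking the error code before the status matters because many providers return invalid_grant with an HTTP 400, which would otherwise look like a generic client error.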

On-Demand Safety Buffer

You must also implement a strict buffer time for on-demand checks. Before executing any API request, check the token's expiration timestamp. If the token will expire within the next 30 seconds, treat it as already expired and force a refresh. This safety margin ensures that long-running API requests or large file uploads do not fail mid-flight because the token expired while the payload was in transit.
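
The buffer check itself is small. The sketch below is one possible shape for the isExpiringSoon helper that the earlier snippets call with an expiry timestamp; treating a missing expiry as "expiring soon" is an assumption, and some teams prefer to try the token first.

```typescript
// Treat a token as expired once it is within `bufferSeconds` of expiry.
// Assumption: a missing expiry counts as "expiring soon", so the caller
// refreshes rather than trusting a token of unknown age.
function isExpiringSoon(
  expiresAt: Date | null | undefined,
  bufferSeconds = 30,
  nowMs: number = Date.now()
): boolean {
  if (!expiresAt) return true;
  return expiresAt.getTime() - nowMs <= bufferSeconds * 1000;
}
```

For uploads or requests that routinely run longer than 30 seconds, pass a larger bufferSeconds for those call sites instead of raising the default everywhere.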

This two-pronged approach - proactive alarms plus on-demand refresh as a fallback before each API call - means tokens are almost always fresh when a request arrives. The on-demand path only activates if the alarm system fails or if a token was just created and does not yet have a scheduled refresh.

Token Data Model: Schema, Metadata, and Encryption Envelope

Every design decision in this guide - proactive refresh, mutex acquisition, state transitions - eventually reads or writes a row in the token store. Getting the schema right up front saves a painful migration later. Here is a canonical shape that supports every pattern in this article.

CREATE TABLE integrated_account (
  id                        UUID PRIMARY KEY,
  tenant_id                 VARCHAR(255) NOT NULL,
  environment_id            UUID NOT NULL,
  integration_name          VARCHAR(64) NOT NULL,
 
  -- Encrypted credential envelope. Format: base64(iv) :: base64(tag || ciphertext)
  -- Contains access_token, refresh_token, and any provider-specific secrets.
  encrypted_context         TEXT NOT NULL,
  encryption_key_id         VARCHAR(64) NOT NULL,        -- for key rotation
 
  -- Non-secret token metadata, indexable and cheap to read
  access_token_expires_at   TIMESTAMPTZ,
  refresh_token_expires_at  TIMESTAMPTZ,                 -- some providers do expire these
  token_type                VARCHAR(32) NOT NULL DEFAULT 'Bearer',
  scope                     TEXT,
 
  -- Concurrency + audit
  rotation_id               BIGINT NOT NULL DEFAULT 0,   -- monotonic; used by CAS
  refresh_attempts          INT NOT NULL DEFAULT 0,      -- reset on success
  last_refresh_at           TIMESTAMPTZ,
  last_refresh_error        TEXT,
  last_used_at              TIMESTAMPTZ,
 
  -- Optional soft lock (for schemes that do not use Redis/advisory locks)
  lock_owner                VARCHAR(128),
  lock_expires_at           TIMESTAMPTZ,
 
  -- State machine
  status                    VARCHAR(32) NOT NULL DEFAULT 'active',
                            -- active | connecting | refreshing | needs_reauth | disabled
 
  created_at                TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at                TIMESTAMPTZ NOT NULL DEFAULT NOW(),
 
  UNIQUE (tenant_id, environment_id, integration_name)
);
 
-- The proactive scheduler scans for tokens near expiry. This partial index
-- keeps the scan cheap even at millions of rows.
CREATE INDEX idx_ia_refresh_due
  ON integrated_account (access_token_expires_at)
  WHERE status IN ('active', 'refreshing');
 
-- Alerting and dashboards for auth health.
CREATE INDEX idx_ia_needs_reauth
  ON integrated_account (environment_id, status)
  WHERE status = 'needs_reauth';

Field-by-field rationale:

encrypted_context stores the entire credential blob as a single AES-256-GCM envelope. Do not split access token and refresh token across columns - it complicates rotation and makes it easy to accidentally leak one without the other.
encryption_key_id identifies which key the envelope was sealed with. When you rotate the encryption key (annually at minimum), new writes use the new key while old rows can still be decrypted. A background job re-encrypts old rows under the new key over time.
access_token_expires_at is duplicated outside the encrypted envelope on purpose. The scheduler needs to answer "which tokens expire in the next 3 minutes?" without decrypting anything.
rotation_id is the CAS version. Every successful token update bumps it by 1. It is also a useful audit signal: a rotation_id that jumps by 20 in an hour indicates a refresh loop bug.
refresh_attempts counts consecutive failed refreshes. Reset it to 0 on success. Use it to decide when to give up and mark needs_reauth.
lock_owner / lock_expires_at are only needed if you serialize refreshes at the row level rather than through Redis or advisory locks. Leave them null if you use an external lock.
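
To make the interplay between status and refresh_attempts concrete, here is a hedged sketch of the transition logic as a pure function. The onRefreshResult name, the RefreshState shape, and MAX_REFRESH_ATTEMPTS are hypothetical; the 3-hour delay mirrors the retry shown in the flowchart earlier.

```typescript
type AccountStatus = 'active' | 'refreshing' | 'needs_reauth';

interface RefreshState {
  status: AccountStatus;
  refreshAttempts: number;
  nextRetryAtMs: number | null;
}

// Hypothetical policy values; tune them per provider.
const MAX_REFRESH_ATTEMPTS = 5;
const RETRY_DELAY_MS = 3 * 60 * 60 * 1000; // 3 hours

function onRefreshResult(
  state: RefreshState,
  result: 'success' | 'retryable' | 'fatal',
  nowMs: number = Date.now()
): RefreshState {
  if (result === 'success') {
    // Reset the failure counter on every successful refresh.
    return { status: 'active', refreshAttempts: 0, nextRetryAtMs: null };
  }
  const attempts = state.refreshAttempts + 1;
  if (result === 'fatal' || attempts >= MAX_REFRESH_ATTEMPTS) {
    // Stop retrying; the caller should also emit the re-auth webhook.
    return { status: 'needs_reauth', refreshAttempts: attempts, nextRetryAtMs: null };
  }
  return {
    status: 'active',
    refreshAttempts: attempts,
    nextRetryAtMs: nowMs + RETRY_DELAY_MS,
  };
}
```

Keeping the transition pure makes it easy to unit test and to persist the returned state in the same UPDATE that records last_refresh_error.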

Encryption Envelope Format

Every write to encrypted_context uses AES-256-GCM with a fresh 96-bit IV. Store the envelope as concatenated base64 fields so decryption is unambiguous:

<base64 IV (12 bytes)> :: <base64 (authTag || ciphertext)>

import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

function seal(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return `${iv.toString('base64')}::${Buffer.concat([tag, ct]).toString('base64')}`;
}
 
function open(envelope: string, key: Buffer): string {
  const [ivB64, payloadB64] = envelope.split('::');
  const iv = Buffer.from(ivB64, 'base64');
  const payload = Buffer.from(payloadB64, 'base64');
  const tag = payload.subarray(0, 16);
  const ct = payload.subarray(16);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString('utf8');
}

The auth tag lives inside the envelope so a corrupted or tampered ciphertext fails loudly at decrypt time rather than silently producing garbage.
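
The "fails loudly" behavior is easy to check. The sketch below repeats compact copies of the helpers under the hypothetical names sealEnvelope and openEnvelope so it runs on its own, then flips one bit of a stored envelope and confirms that decryption throws.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Compact copies of the seal/open helpers so this snippet is self-contained.
function sealEnvelope(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return `${iv.toString('base64')}::${Buffer.concat([tag, ct]).toString('base64')}`;
}

function openEnvelope(envelope: string, key: Buffer): string {
  const [ivB64, payloadB64] = envelope.split('::');
  const payload = Buffer.from(payloadB64, 'base64');
  const decipher = createDecipheriv('aes-256-gcm', key, Buffer.from(ivB64, 'base64'));
  decipher.setAuthTag(payload.subarray(0, 16));
  return Buffer.concat([
    decipher.update(payload.subarray(16)),
    decipher.final(), // throws if the auth tag does not match
  ]).toString('utf8');
}

// Flip one bit of the stored payload and report whether decryption rejects it.
function isTamperDetected(envelope: string, key: Buffer): boolean {
  const [ivB64, payloadB64] = envelope.split('::');
  const payload = Buffer.from(payloadB64, 'base64');
  payload[payload.length - 1] ^= 0x01; // corrupt the last byte
  try {
    openEnvelope(`${ivB64}::${payload.toString('base64')}`, key);
    return false;
  } catch {
    return true;
  }
}

const demoKey = randomBytes(32);
const demoEnvelope = sealEnvelope('{"refresh_token":"r1"}', demoKey);
const tamperCaught = isTamperDetected(demoEnvelope, demoKey); // true
```

A check like this is worth keeping in your test suite, because a refactor that drops the setAuthTag call would otherwise go unnoticed.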

Securing Token Storage: Encryption and Zero Exposure

A stolen refresh token is a persistent backdoor. Unlike access tokens that expire in minutes, refresh tokens persist for days or weeks and operate independently of your SSO and MFA controls. They're the bridges attackers walk across to move laterally between your SaaS applications.

The Salesloft Drift breach illustrated this at scale. The actor systematically exported large volumes of data from numerous corporate Salesforce instances. Google's Threat Intelligence Group (GTIG) assessed that the primary intent of the threat actor was to harvest credentials. Storing OAuth tokens in plain text is the equivalent of leaving your house keys under the welcome mat.

Token Encryption at Rest

All sensitive fields - including access tokens, refresh tokens, API keys, and client secrets - must be encrypted at rest. Use AES-256-GCM (Advanced Encryption Standard in Galois/Counter Mode). GCM belongs to the class of authenticated encryption with associated data (AEAD) methods, so it provides confidentiality, integrity, and data authenticity: it does not just encrypt the data, it also detects tampering.

Generate a cryptographically secure, random 12-byte Initialization Vector (IV) for every single encryption operation, and store the IV alongside the ciphertext.

import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';
 
function encryptToken(plaintext: string, encryptionKey: Buffer): string {
  const iv = randomBytes(12); // 96-bit IV, unique per encryption
  const cipher = createCipheriv('aes-256-gcm', encryptionKey, iv);
  const encrypted = Buffer.concat([
    cipher.update(plaintext, 'utf8'),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  // Store IV + tag + ciphertext together
  return `${iv.toString('base64')}::${Buffer.concat([tag, encrypted]).toString('base64')}`;
}

The IV must be unique for every encryption operation carried out with a given key. Put another way: never reuse an IV with the same key. The AES-GCM specification recommends a 96-bit IV. If you reuse an IV under the same key, the security of GCM collapses: an attacker can learn relationships between the plaintexts and forge authentication tags.

Securing the OAuth Initiation Flow

The vulnerability surface begins before the user even authorizes your application. When initiating the OAuth flow, many applications pass raw tenant IDs or environment variables in the state parameter of the authorization URL. This exposes internal system identifiers and invites Cross-Site Request Forgery (CSRF) attacks.

Instead of passing raw data, generate a secure, time-bound Link Token. Hash this token using HMAC and store the digest in a fast key-value store with a strict 7-day Time-To-Live (TTL). Pass this opaque identifier in the state parameter. For a comprehensive guide on securing the initial handshake, including PKCE, review our breakdown on Beyond Bearer Tokens: Architecting Secure OAuth Lifecycles & CSRF Protection.
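The Link Token flow described above can be sketched as follows. This is a minimal illustration, not a reference implementation: an in-memory Map stands in for the key-value store, and the names LINK_TOKEN_SECRET, issueLinkToken, and consumeLinkToken are hypothetical.

```typescript
import { createHmac, randomBytes } from 'crypto';

// Hypothetical names throughout. A Map stands in for a real key-value store with TTL support.
const LINK_TOKEN_TTL_MS = 7 * 24 * 60 * 60 * 1000; // the 7-day TTL from the text
const LINK_TOKEN_SECRET = randomBytes(32); // assumption: in practice, loaded from a secret manager

interface LinkRecord {
  tenantId: string;
  expiresAt: number; // epoch milliseconds
}

const linkStore = new Map<string, LinkRecord>();

function digest(token: string): string {
  return createHmac('sha256', LINK_TOKEN_SECRET).update(token).digest('hex');
}

// Returns the opaque value to place in the OAuth `state` parameter.
function issueLinkToken(tenantId: string, now: number = Date.now()): string {
  const token = randomBytes(32).toString('hex');
  // Only the HMAC digest is stored, so a leaked store does not reveal usable tokens.
  linkStore.set(digest(token), { tenantId, expiresAt: now + LINK_TOKEN_TTL_MS });
  return token;
}

// Resolves the `state` value from the callback back to a tenant. Single use.
function consumeLinkToken(token: string, now: number = Date.now()): string | null {
  const key = digest(token);
  const record = linkStore.get(key);
  if (!record) return null;
  linkStore.delete(key); // deleting on first use blocks replay of the same state value
  return record.expiresAt > now ? record.tenantId : null;
}
```

The callback handler would call consumeLinkToken with the returned state value and reject the request when it gets null, which covers unknown, replayed, and expired values with one check.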

Principle of Least Exposure

Beyond encryption, limit when and where tokens are decrypted:

List endpoints should mask sensitive fields. When returning a list of connected accounts to your dashboard, show only metadata. Never include raw tokens in API responses.
Decrypt only at the point of use. The only code path that needs the raw access token is the one making the outgoing API call to the third-party provider.
Audit access. Log every decryption event. If your decryption rate suddenly spikes, something is wrong.
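As an illustration of the masking rule, here is a small sketch that builds the list-endpoint view from an explicit allowlist of metadata fields. The record shape and the toSummary name are assumptions made for this example.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface ConnectedAccount {
  id: string;
  provider: string;
  status: 'connecting' | 'active' | 'needs_reauth';
  access_token_expires_at: string;
  encrypted_context: string; // sealed token envelope; must never leave the backend
}

type AccountSummary = Omit<ConnectedAccount, 'encrypted_context'>;

// Copies an explicit allowlist of fields rather than deleting known-sensitive ones,
// so a sensitive column added later is excluded by default.
function toSummary(account: ConnectedAccount): AccountSummary {
  return {
    id: account.id,
    provider: account.provider,
    status: account.status,
    access_token_expires_at: account.access_token_expires_at,
  };
}
```

An allowlist fails closed: forgetting to update it hides a new field instead of leaking it.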

For server-side applications that store tokens for many users, encrypt them at rest and ensure that your data store is not publicly accessible to the Internet.

Building a Self-Healing Token State Machine

Your connected accounts need a clear state model that handles failure gracefully and recovers automatically when possible.

stateDiagram-v2
    [*] --> connecting : OAuth initiated
    connecting --> active : Token acquired
    active --> active : Proactive refresh succeeds
    active --> needs_reauth : Refresh fails (invalid_grant)
    needs_reauth --> active : Customer re-authenticates
    needs_reauth --> needs_reauth : API calls blocked

The key behaviors:

Idempotent status transitions. If an account is already marked needs_reauth, a second failure should not send a duplicate notification to the customer. Check current state before updating.
Automatic reactivation. If a customer re-authenticates (or the provider starts accepting the refresh token again after a transient issue), detect the successful refresh and flip the account back to active automatically. Fire a webhook so the customer's system knows the integration is healthy again.
Webhook notifications for auth failures. When an account enters needs_reauth, immediately fire a webhook event like integrated_account:authentication_error. This lets your customers build alerting and self-service re-authentication flows. Do not force them to check a dashboard manually.
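The three behaviors above can be expressed as one pure transition function. The sketch below is hypothetical (the applyRefreshOutcome name and the Transition shape are assumptions); the two event names are the ones used in this guide.

```typescript
type Status = 'connecting' | 'active' | 'needs_reauth';

interface Transition {
  status: Status;
  webhook: string | null; // event to enqueue, or null when nothing should be sent
}

// Decides the next status and whether a webhook is due, based only on the current state.
function applyRefreshOutcome(current: Status, outcome: 'success' | 'fatal'): Transition {
  if (outcome === 'fatal') {
    // Already flagged: stay put and stay quiet so the customer is not notified twice.
    if (current === 'needs_reauth') return { status: current, webhook: null };
    return { status: 'needs_reauth', webhook: 'integrated_account:authentication_error' };
  }
  // A successful refresh or re-authentication heals a flagged account.
  if (current === 'needs_reauth') {
    return { status: 'active', webhook: 'integrated_account:reactivated' };
  }
  return { status: 'active', webhook: null };
}
```

Keeping the decision pure makes idempotency easy to test; the caller persists the status and enqueues the webhook only when the returned value is not null.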

For a detailed breakdown of handling specific refresh failure scenarios, see our post on Handling OAuth Token Refresh Failures in Production.

Proactive Refresh State Machine and Pseudocode

The state machine above is the visible half; the transitions and retry policy are what make it correct under load. Here is the full state graph including the transient refreshing state and the alarm-driven retry loop:

stateDiagram-v2
    [*] --> connecting : OAuth initiated
    connecting --> active : Initial token stored
    connecting --> needs_reauth : Callback error or provider denial

    active --> refreshing : Alarm fires OR expiry < 30s
    refreshing --> active : Refresh 2xx, token stored
    refreshing --> refreshing_backoff : 5xx / network / timeout
    refreshing --> needs_reauth : 400 invalid_grant / 401 / 403

    refreshing_backoff --> refreshing : Exponential backoff elapsed
    refreshing_backoff --> needs_reauth : refresh_attempts >= max

    needs_reauth --> active : Customer reconnects, new token stored
    active --> disabled : Admin action / tenant offboarded
    needs_reauth --> disabled : Admin action

The pseudocode below is what the alarm handler runs. It is idempotent, mutex-guarded, and produces exactly one state transition per invocation.

async function handleRefreshAlarm(accountId: string) {
  await withAccountLock(redis, accountId, async () => {
    const account = await tokenStore.get(accountId);
 
    if (account.status === 'disabled' || account.status === 'needs_reauth') {
      return; // Nothing to do; alarms for these states are cancelled elsewhere
    }
 
    if (!isExpiringSoon(account.access_token_expires_at, 30)) {
      // A concurrent on-demand refresh already handled it.
      await scheduler.reschedule(accountId, account.access_token_expires_at);
      return;
    }
 
    await tokenStore.setStatus(accountId, 'refreshing');
 
    try {
      const newToken = await providerRefresh(account);
      await tokenStore.saveToken(accountId, newToken, {
        status: 'active',
        refresh_attempts: 0,
        rotation_id_increment: 1,
      });
      await scheduler.reschedule(accountId, newToken.expiresAt);
      await webhooks.enqueue('integrated_account:reactivated', account, {
        onlyIf: account.previousStatus === 'needs_reauth',
      });
    } catch (err) {
      await handleRefreshFailure(accountId, account, err);
    }
  });
}
 
async function handleRefreshFailure(
  accountId: string,
  account: Account,
  err: Error
) {
  const classification = classifyRefreshError(err);
  // 'retryable' | 'fatal' | 'transient'
 
  if (classification === 'fatal') {
    await tokenStore.markNeedsReauth(accountId, err.message);
    await webhooks.enqueue('integrated_account:authentication_error', account);
    await scheduler.cancel(accountId);
    return;
  }
 
  const nextAttempts = account.refresh_attempts + 1;
  if (nextAttempts >= MAX_REFRESH_ATTEMPTS) {
    await tokenStore.markNeedsReauth(
      accountId,
      `Gave up after ${nextAttempts} attempts: ${err.message}`
    );
    await webhooks.enqueue('integrated_account:authentication_error', account);
    await scheduler.cancel(accountId);
    return;
  }
 
  await tokenStore.recordFailedAttempt(accountId, nextAttempts, err.message);
 
  const backoffMs =
    Math.min(BASE_BACKOFF_MS * 2 ** nextAttempts, MAX_BACKOFF_MS) +
    Math.random() * JITTER_MS;
 
  await scheduler.scheduleIn(accountId, backoffMs);
}
 
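function classifyRefreshError(err: any): 'retryable' | 'fatal' | 'transient' {
  // Normalise the body so the check works for both raw strings and parsed JSON.
  const body = typeof err.body === 'string' ? err.body : JSON.stringify(err.body ?? '');
  if (err.status === 400 && /invalid_grant/i.test(body)) return 'fatal';
  if (err.status === 401 || err.status === 403) return 'fatal';
  if (err.status >= 500 || err.code === 'ECONNRESET' || err.code === 'ETIMEDOUT')
    return 'transient';
  return 'retryable'; // e.g. 429: retried with backoff, same as a transient failure
}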

Notable design choices:

refreshing is a real state, not a boolean flag. It gives you a clean answer to "why is this account stuck?" - if a row sits in refreshing past its lock_expires_at, a janitor promotes it back to active and schedules a retry.
Backoff caps at a hard maximum (typically 3 hours). Longer than that and the customer is better served by a needs_reauth alert.
Error classification is provider-aware. invalid_grant, 401, and 403 are fatal. Everything else is retried with backoff by default. If a provider has a quirk (say, it returns 200 with error in the body), plug it in via a per-integration error mapper.
Webhooks fire on transitions, not on every attempt. Customers should not receive an alert email for every 500 during a 3-hour backoff.
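One way to wire in a per-integration error mapper is sketched below. The registry, the 'example-provider' key, and the RefreshResult shape are all hypothetical; the default rules mirror the classification described in this section.

```typescript
type Classification = 'fatal' | 'transient' | 'retryable';

interface RefreshResult {
  status: number;
  body: unknown; // parsed JSON body, if any
}

// A mapper returns a classification, or null to defer to the default rules.
type ErrorMapper = (res: RefreshResult) => Classification | null;

// Hypothetical registry keyed by integration name.
const errorMappers: Record<string, ErrorMapper | undefined> = {
  // The quirk from the text: HTTP 200 with an error in the body.
  'example-provider': (res) => {
    const body = res.body as { error?: string } | null;
    if (res.status === 200 && body?.error === 'invalid_grant') return 'fatal';
    if (res.status === 200 && body?.error) return 'retryable';
    return null;
  },
};

function defaultClassify(res: RefreshResult): Classification | null {
  const text = JSON.stringify(res.body ?? '');
  if (res.status === 400 && /invalid_grant/i.test(text)) return 'fatal';
  if (res.status === 401 || res.status === 403) return 'fatal';
  if (res.status >= 500) return 'transient';
  if (res.status >= 200 && res.status < 300) return null; // genuine success
  return 'retryable';
}

// Returns null when the response is a genuine success.
function classify(integration: string, res: RefreshResult): Classification | null {
  const mapper = errorMappers[integration];
  const mapped = mapper ? mapper(res) : null;
  return mapped ?? defaultClassify(res);
}
```

The mapper runs first and only overrides when it recognises a quirk, so the shared engine stays free of provider-specific branches.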

Handling Refresh-Token Rotation and Provider Quirks

Refresh-token rotation is where good token architectures go to die. The OAuth 2.0 spec allows a provider to issue a new refresh token with every refresh, but does not require it. Every provider makes its own choice, and some change the rules without notice.

The Two Rotation Modes

Mode	Behavior	Providers
Stable refresh token	The refresh token stays the same across refreshes; only the access token rotates.	Slack, GitHub, many older OAuth 2.0 apps
Rotating refresh token	Every refresh response includes a new refresh token; the old one is invalidated immediately.	Microsoft Entra ID, Brex, Xero, Google (for some grant types)

You cannot tell from the initial token exchange which mode a provider uses; you have to read the docs. Design your code to handle both without any per-provider branching.

The Merge Rule

The one rule that saves you: treat the refresh response as a partial update, not a full replacement. When storing a new token, merge it on top of the existing token blob.

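function mergeTokens(previous: TokenBlob, refreshResponse: Partial<TokenBlob>): TokenBlob {
  return {
    ...previous,           // preserve refresh_token if the provider did not send one
    ...refreshResponse,    // overlay new access_token, expires_at, and (maybe) refresh_token
    // Guard: an explicit null or undefined refresh_token must not wipe the stored one
    refresh_token: refreshResponse.refresh_token ?? previous.refresh_token,
  };
}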

For a stable-refresh-token provider, refreshResponse.refresh_token is undefined, so the previous value survives. For a rotating provider, refreshResponse.refresh_token overwrites the old one. Same code path, both modes.

Atomic Write with Rotation

For rotating providers, the write of the new refresh token has to be atomic with the store of the new access token. If your process crashes between them, the old refresh token is invalidated on the provider side but you never persisted the new one - the account is dead.

Use a single statement with a CAS on rotation_id:

async function persistRefresh(
  db: Pool,
  accountId: string,
  merged: TokenBlob,
  expectedRotationId: bigint
) {
  const envelope = seal(JSON.stringify(merged), currentEncryptionKey);
  const { rowCount } = await db.query(
    `UPDATE integrated_account
       SET encrypted_context = $1,
           access_token_expires_at = $2,
           rotation_id = rotation_id + 1,
           last_refresh_at = NOW(),
           refresh_attempts = 0,
           status = 'active'
     WHERE id = $3 AND rotation_id = $4`,
    [envelope, merged.expires_at, accountId, expectedRotationId]
  );
  if (rowCount !== 1) {
    // If we hold the mutex this should be impossible; if we do not, throw
    // and let the caller re-read.
    throw new Error('Rotation ID mismatch - concurrent refresh detected');
  }
}

Provider Quirks You Will Actually Hit

Missing expires_in. Some legacy providers do not return an expiry. Configure a per-integration default (typically 3600 seconds) and treat any 401 as an immediate reactive refresh.
expires_in: 0 or negative values. Treat as invalid; log and use the default.
Refresh token returned only on initial exchange. Salesforce and a few others return the refresh token only in the initial code exchange, not on subsequent refreshes. The merge rule handles this automatically.
Scope shrinkage. Some providers drop scopes on refresh if the underlying user's permissions changed. Compare the returned scope against the requested set and mark needs_reauth if a required scope disappeared.
Refresh token TTL. Google refresh tokens can be revoked after 6 months of inactivity for consumer accounts. Xero's expire in 60 days. Track refresh_token_expires_at separately and warn the customer via webhook 7 days before it lapses.
Sliding-window refresh tokens. Microsoft Entra ID extends the refresh token's validity every time you use it, up to a maximum. Regular use keeps them alive; long-idle accounts eventually die.
Static-IP requirements. Providers like NetSuite and some banking APIs only accept refresh calls from allowlisted IPs. Route those refreshes through a dedicated egress proxy.
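The first two quirks and the refresh-token TTL warning reduce to a couple of small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the 3600-second default and 7-day warning window are the values suggested in the list above.

```typescript
const DEFAULT_EXPIRES_IN_S = 3600; // per-integration default suggested in the text
const WARN_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // warn 7 days before the refresh token lapses

// Returns the absolute expiry (epoch ms) for a token response received at `nowMs`.
// Missing, zero, negative, or non-numeric expires_in values fall back to the default.
function resolveExpiry(
  expiresIn: unknown,
  nowMs: number,
  defaultSeconds: number = DEFAULT_EXPIRES_IN_S
): { expiresAtMs: number; usedDefault: boolean } {
  const seconds = typeof expiresIn === 'string' ? Number(expiresIn) : expiresIn;
  const valid = typeof seconds === 'number' && Number.isFinite(seconds) && seconds > 0;
  const effective = valid ? (seconds as number) : defaultSeconds;
  return { expiresAtMs: nowMs + effective * 1000, usedDefault: !valid };
}

// True when the refresh token itself is within the warning window (or already past it).
function shouldWarnRefreshTokenExpiry(
  refreshTokenExpiresAtMs: number | null,
  nowMs: number
): boolean {
  if (refreshTokenExpiresAtMs === null) return false; // provider publishes no TTL
  return refreshTokenExpiresAtMs - nowMs <= WARN_WINDOW_MS;
}
```

The usedDefault flag gives the caller a hook for the "log and use the default" behavior without burying a logger inside the helper.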

A Rotation-Safe Refresh Flight

Putting it together, here is what a refresh flight looks like when the provider rotates refresh tokens:

async function refreshWithRotation(accountId: string) {
  await withAccountLock(redis, accountId, async () => {
    const account = await tokenStore.get(accountId);
    const previous = decrypt(account.encrypted_context);
 
    const response = await callProviderRefreshEndpoint({
      client_id: creds.clientId,
      client_secret: creds.clientSecret,
      refresh_token: previous.refresh_token,
      grant_type: 'refresh_token',
    });
 
    // If the provider did not return a refresh_token, keep the old one.
    // If it did, the old one is now invalidated on the provider side.
    const merged = mergeTokens(previous, response);
 
    // Single-statement atomic write. If this fails, we know we lost the
    // token on the provider side and must mark needs_reauth.
    try {
      await persistRefresh(db, accountId, merged, account.rotation_id);
    } catch (err) {
      await tokenStore.markNeedsReauth(
        accountId,
        'Write failed after provider rotation'
      );
      throw err;
    }
  });
}

The lock ensures only one flight per account. The merge preserves refresh tokens from stable-mode providers. The CAS on rotation_id catches anything that should be impossible (a lock bug, a rogue write path). The catch-all markNeedsReauth on write failure prevents a silent lockout.

Sequence Diagrams and Failure Modes

Understanding what should happen is one half of the design. Anticipating what will actually go wrong is the other. Below are the two most important lifecycle diagrams and a failure-mode catalogue.

Happy Path: Proactive Refresh with a Contending Reader

sequenceDiagram
    participant Alarm as Scheduler Alarm
    participant W1 as Refresh Worker
    participant Reader as API Request Handler
    participant Lock as Per-Account Mutex
    participant Store as Token Store
    participant Provider as OAuth Provider

    Alarm->>W1: Fire (accountId, t = expiry - 120s)
    W1->>Lock: acquire(accountId)
    Lock-->>W1: granted

    Note over Reader: Customer API call arrives during refresh
    Reader->>Lock: acquire(accountId)
    Lock-->>Reader: waiting

    W1->>Store: read encrypted_context, rotation_id
    W1->>Provider: POST /token (grant_type=refresh_token)
    Provider-->>W1: 200 { access_token, refresh_token, expires_in }
    W1->>Store: CAS write (rotation_id + 1)
    Store-->>W1: OK
    W1->>Lock: release

    Lock-->>Reader: granted
    Reader->>Store: read (fresh token)
    Reader->>Provider: API call with new access_token
    Provider-->>Reader: 200 payload

Failure Path: Provider 5xx During Refresh

sequenceDiagram
    participant W as Refresh Worker
    participant Lock as Mutex
    participant Store as Token Store
    participant Provider as OAuth Provider
    participant Sched as Scheduler
    participant WH as Webhook Bus

    W->>Lock: acquire(accountId)
    Lock-->>W: granted
    W->>Provider: POST /token
    Provider-->>W: 503 Service Unavailable
    W->>Store: refresh_attempts++, status = refreshing_backoff
    W->>Sched: scheduleIn(accountId, base * 2^attempts + jitter)
    W->>Lock: release
    Note over W,WH: No webhook fired - transient failure

    Note over Sched: ...backoff elapses...
    Sched->>W: retry
    W->>Lock: acquire(accountId)
    W->>Provider: POST /token
    Provider-->>W: 200 { access_token, refresh_token }
    W->>Store: save token, status = active, refresh_attempts = 0
    W->>Sched: reschedule at new expiry
    W->>Lock: release

The Failure-Mode Catalogue

Failure	Where it hits	How the design handles it
Two workers race to refresh the same account	Concurrency	Per-account mutex funnels callers into one flight; second caller waits and reads the fresh token
Provider returns `invalid_grant` (refresh token revoked)	Provider side	Error classified as fatal; account moves to `needs_reauth`; webhook fires; alarm cancelled
Provider returns 500 or times out	Network / provider	Classified as transient; `refresh_attempts` incremented; exponential backoff scheduled
Mutex holder crashes mid-refresh	Worker	TTL on the lock expires; janitor promotes stuck `refreshing` rows back to `active` and reschedules
Rotating refresh token issued but write fails	Storage	CAS write inside the mutex; on failure the row is marked `needs_reauth` immediately to avoid silent lockout
Clock skew between worker and provider	Infrastructure	30-second expiry buffer on every read; scheduler jitter avoids synchronised spikes
Encryption key rotated mid-refresh	Storage	Rows carry `encryption_key_id`; decrypt looks up the key by id; new writes use the current key
Provider rate-limits refresh endpoint	Provider side	Backoff honours `Retry-After`; per-provider concurrency cap prevents one busy tenant from starving others
Alarm delivery is late	Scheduler	On-demand refresh before every API call catches expired tokens the alarm missed
Customer disables the connection in the provider dashboard	Customer action	Next refresh returns `invalid_grant`; account moves to `needs_reauth`; webhook informs the customer's system
Encryption key compromised	Security	Rotate the key; re-encrypt rows lazily under `encryption_key_id`; revoke and re-mint any tokens that were exfiltrated

The catalogue is the argument for every architectural choice in this guide. Remove the mutex and the first row breaks. Remove the state machine and the tenth row silently corrupts data. Remove the CAS and the fifth row leaves you locked out.
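For the rate-limit row, honouring Retry-After means parsing a header that may carry either a number of seconds or an HTTP date. A minimal sketch follows; the function names are hypothetical, and it assumes the computed exponential backoff has already been capped elsewhere.

```typescript
// Parses a Retry-After header (delta-seconds or HTTP-date) into milliseconds.
// Returns null when the header is absent or unparseable.
function parseRetryAfterMs(header: string | null | undefined, nowMs: number): number | null {
  if (!header) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Waits at least as long as the provider asked, and never less than our own backoff.
function nextDelayMs(
  computedBackoffMs: number,
  retryAfter: string | null | undefined,
  nowMs: number
): number {
  const hinted = parseRetryAfterMs(retryAfter, nowMs) ?? 0;
  return Math.max(computedBackoffMs, hinted);
}
```

Taking the larger of the two values keeps the exponential schedule intact when the provider sends no hint or a shorter one.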

Case Study: Connecting an AI Agent to Brex Expense Data

Corporate spend platforms are a natural target for AI agents: expense categorization, anomaly detection, month-end close automation, and receipt reconciliation. Brex exposes a wide REST API surface across transactions, expenses, budgets, cards, and payments, and also publishes an official Model Context Protocol server at https://api.brex.com/mcp for MCP-compatible AI clients. If you are building an AI agent that reads Brex expense data - or a B2B SaaS product syncing Brex transactions into an accounting stack - the token lifecycle rules below are the operational contract you have to follow.

This section consolidates every Brex token-lifecycle detail scattered across our other Brex guides into a single authoritative reference.

Brex Token Types and Lifetimes

Brex exposes four distinct authentication artifacts, and mixing them up is the fastest way to break a production integration:

Token type	Grant	Lifetime	Refresh behavior	Best for
User token	Dashboard-generated	Expires after 30 days of inactivity; does not expire when used regularly	No refresh - regenerate manually if expired	Single-tenant tools, admin scripts
Partner OAuth token	Authorization Code Grant	Access token expires in 3600 seconds (1 hour) and must be refreshed	Refresh token rotates on every refresh	Multi-tenant B2B SaaS
Onboarding partner	Client Credentials Grant, no user associated with API access	Provider-issued, short	Re-issued via client credentials	Partner-only onboarding endpoints
MCP session	OAuth via MCP discovery	Session-bound; ends on inactivity	Reconnect to re-authorize	AI agents (Claude, Cursor, ChatGPT connectors)

Two facts drive every design decision downstream:

Partner OAuth access tokens live for exactly one hour. Any long-running integration or persistent AI agent must implement refresh well before that window closes.
User tokens survive as long as you use them. User tokens expire if they are not used to make an API call for 30 days. When used regularly, they will not expire. For always-on agents, a cheap weekly no-op call keeps a user token alive indefinitely.

The API version split matters too: the Transactions API lives at /v2/transactions/card/primary while the Expenses API lives at /v1/expenses/card. Monitor them as independent signals - one path can degrade without the other.

Reconciling Inconsistencies in Brex's Own Docs

Brex's public documentation uses "user token" and "OAuth token" almost interchangeably in places, but they are not the same artifact. The FAQ is explicit: "The user token is used to authenticate your own Brex account. The OAuth token is used when building partner applications to authenticate other Brex accounts." If you are building anything multi-tenant, ignore the user-token path entirely and use partner OAuth.

Similarly, the MCP documentation lets you authenticate with either a Brex API token or OAuth, but Brex recommends OAuth for AI clients: "Prefer OAuth over API keys. OAuth scopes access to what you authorize, and sessions can be revoked from the Dashboard." For an AI agent that acts on behalf of an end user, OAuth is the only defensible choice.

Scope to Endpoint Mapping

Brex uses fine-grained scopes that map directly to endpoint families. Scopes define which endpoints your app has access to. You will specify your scopes when generating your user token. Requesting the wrong scope produces a 403 insufficient_scope error at call time - not at consent time - so it is easy to miss until an end user hits the broken path.

Endpoint family	Read scope	Write scope	Notes
Transactions (`/v2/transactions/*`)	`transactions.card.readonly`	-	Card and cash transaction history
Expenses (`/v1/expenses/card`)	`expenses.card.readonly`	`expenses.card`	Receipt upload, memo edits, category assignments
Team (`/v1/users`, `/v1/departments`)	`users.readonly`	`users`	Also needed for virtual card issuance
Cards (`/v2/cards`)	`cards.readonly`	`cards`	Request the readonly version if you are just listing cards
Budgets / Spend Limits	`budgets.readonly`	`budgets`	/v1/budgets targets spend limits and will soon be deprecated in favor of /v2/spend_limits
Payments (`/v1/transfers`, `/v1/vendors`)	`payments.readonly`	`payments`	Idempotency keys required on writes
Accounting (`/v1/accounting/*`)	`accounting.readonly`	`accounting`	Bookkeepers may access the accounting API

For an AI agent that reads expense data, the minimum viable scope set is expenses.card.readonly plus transactions.card.readonly. Add users.readonly if the agent needs to attribute spend to specific employees. As a general security practice, you should request the minimum set of scopes required for whatever action the user is performing.

Two edge cases worth designing for up front:

Role permissions can still block a correctly scoped token. Scopes define which API endpoints your integration can access, but permissions are based on the customer's role type. The role type defines which actions a customer can perform on the Brex dashboard and, consequently, the Brex APIs. So even a correctly-scoped token can fail at call time if the underlying user's role forbids the action.
Incremental scope upgrades. If you later want to add new functionality that requires fetching user information, or make updates to cards, you can request those scopes (users, cards) which will send the user through the authentication flow again and add those scopes to their previously consented scopes.
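Because scope problems only surface at call time, it can help to compare granted scopes against the required set as soon as a token is stored. The sketch below assumes the token response carries a space-delimited scope string, as RFC 6749 describes; whether a given provider returns that field is something to verify against its docs. The missingScopes name is hypothetical.

```typescript
// Minimum scope set for a read-only expense agent, per the text above.
const REQUIRED_SCOPES = ['expenses.card.readonly', 'transactions.card.readonly'];

// Assumption: `grantedScope` is the space-delimited `scope` value from the token response.
// Returns the required scopes that were not granted; an empty array means all are present.
function missingScopes(grantedScope: string | undefined, required: string[]): string[] {
  const granted = new Set((grantedScope ?? '').split(/\s+/).filter(Boolean));
  return required.filter((scope) => !granted.has(scope));
}
```

A non-empty result is a signal to send the user back through the authorization flow before an end user hits the broken path.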

Refresh Best Practices and Secure Storage for Brex

Brex partner integrations require refresh token rotation. Partner integrations should always rotate the refresh_token. Whenever your integration exchanges a refresh token for a new access token, it should retain both the new access token and the refresh token (which is returned in the same response). If you do not persist the new refresh token, the next refresh will fail with invalid_grant and the customer has to re-authenticate.

Combine that with the concurrency patterns earlier in this guide, and the operational contract for Brex looks like this:

Schedule refresh at ~55 minutes. Access tokens live for 3600 seconds. Schedule proactive refresh at expires_at - 300s, with 60-180 seconds of jitter to spread load across many customer accounts.
Serialize refreshes per integrated account. Because Brex refresh tokens rotate on every use, a single-flight mutex per account is mandatory. Two concurrent refreshes will race, one wins the rotation, and the loser corrupts your stored refresh token.
Persist the new refresh token atomically. In the same transaction that stores the new access token, overwrite the refresh token from the refresh response. Never assume it is stable.
Encrypt every credential at rest. AES-256-GCM with a unique 96-bit IV per token, as covered earlier. Client secrets belong in the same encrypted store. Keep secrets out of source code by loading them from environment variables, and never check a client_id or client_secret into version control.
Keep user tokens warm with continuous usage. For long-lived user tokens driving internal agents, schedule a lightweight call (for example GET /v2/users/me) every 20-25 days to reset the 30-day inactivity clock.
Serialize AI-agent workloads. Agents with multiple concurrent worker processes are the exact workload that produces refresh-token race conditions. Funnel all Brex calls through a single credential proxy that owns the mutex per account.
CSRF-protect the authorization redirect. To mitigate cross-site request forgery, partner integrations should always use the optional state parameter when initiating the OAuth 2.0 Authorization Code Grant. The value of your state parameter must be a random string, at least 9 characters long.
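The scheduling rule in the first item can be sketched as a pure function. One assumption to flag: the text does not say whether the jitter is added or subtracted, so this sketch subtracts it, which means the refresh only ever moves earlier and never closer to expiry. The nextRefreshAtMs name is hypothetical.

```typescript
const REFRESH_LEAD_MS = 300 * 1000; // refresh at expires_at - 300s
const JITTER_MIN_MS = 60 * 1000;
const JITTER_MAX_MS = 180 * 1000;

// `random` is injectable so the schedule can be made deterministic in tests.
function nextRefreshAtMs(expiresAtMs: number, random: () => number = Math.random): number {
  const jitter = JITTER_MIN_MS + random() * (JITTER_MAX_MS - JITTER_MIN_MS);
  // Assumption: jitter is subtracted, so the refresh lands 6 to 8 minutes before expiry.
  return expiresAtMs - REFRESH_LEAD_MS - Math.round(jitter);
}
```

For a 3600-second token this fires between 52 and 54 minutes after issue, leaving room for a few retries before the access token actually expires.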

Decision Matrix: Which Brex Auth Should You Use?

Scenario	Recommended auth	Why
Internal analytics dashboard for one Brex account	User token	No refresh plumbing; the token stays alive as long as it is used at least once every 30 days
AI agent for your own finance team	Brex MCP + OAuth	MCP handles auth discovery; OAuth grants are revocable per session
B2B SaaS syncing many customers' Brex data	Partner OAuth (Auth Code)	Only multi-tenant option; refresh rotation required
MCP-compatible AI assistant for customers (Claude, Cursor)	Brex MCP + OAuth	Any employee can authorize via OAuth and access tools based on their Brex capabilities
Onboarding-only partner integration	Client Credentials	No user is associated with this API access

Checklist: Avoid Common Brex Auth Failures

Before shipping a Brex integration, verify each of these:

How Truto Automates the Entire OAuth Token Lifecycle

Building distributed locking systems, proactive alarm schedulers, encryption pipelines, and self-healing state machines requires months of dedicated engineering time. Maintaining that infrastructure as you scale to millions of API requests per day requires a dedicated platform team. Everything described above is infrastructure that has nothing to do with your core product.

Truto handles this entire OAuth lifecycle automatically for every connected account across 200+ integrations:

Zero race conditions. Each integrated account gets strictly serialized refresh—concurrent API calls, sync jobs, and scheduled refreshes funnel through one flight per account while different accounts stay parallel. If a refresh is already in progress, callers await the same operation and share the result without duplicate provider requests. No Redis lock tuning on your side.
Proactive + on-demand refresh. Scheduled refresh runs 60 to 180 seconds before expiry (randomized to spread load). Every API request also checks token freshness as a fallback. Tokens are valid when your request hits the provider.
Automatic reauth detection. When a refresh fails with a non-retryable error like invalid_grant, the account is marked needs_reauth and a webhook (integrated_account:authentication_error) fires immediately. When the customer re-authenticates and the next refresh succeeds, the account reactivates automatically.
Encryption by default. All tokens, API keys, client secrets, and sensitive credential fields are AES-GCM encrypted at rest with per-value random IVs.
Zero integration-specific code. There is no if (provider === 'salesforce') logic in the token management engine. Integration behavior is defined entirely as declarative JSON configuration. The same execution pipeline handles OAuth 2.0 Authorization Code flows, Client Credentials flows, and custom API key injection. When refresh scheduling or locking behavior improves, every single one of 200+ supported integrations benefits instantly.

The honest trade-off: using Truto (or any managed integration platform) means you are delegating credential management to a third party. That is a real trust decision. We address it with SOC 2 Type II compliance, zero-storage architecture options, and the ability to deploy in your own infrastructure. But it is a trade-off worth evaluating explicitly. For more on how to evaluate this, see our post on passing enterprise security reviews with API aggregators.

Avoiding Vendor Lock-In: A Dual-Layer Architecture for API Portability

When you adopt any integration platform, you are trusting a third party with your customers' credentials and data flows. In 2026, as systems rely more heavily on third-party APIs, vendor lock-in has become a strategic risk - not just a technical inconvenience. API vendor lock-in occurs when your system becomes deeply dependent on a single API provider in a way that makes switching difficult, expensive, or risky.

The architectural defense is a dual-layer API pattern that separates your normalized data flows from your raw provider access, while keeping credentials portable across both.

The Dual-Layer Pattern: Unified API + Proxy Escape Hatch

Most integration platforms offer either a unified (normalized) API or a raw proxy. The strongest architecture uses both layers simultaneously, backed by a shared credential store:

flowchart TD
    App[Your Application] --> Decision{Need a common<br>data model?}
    Decision -- Yes --> Unified[Unified API Layer<br>Normalized schema across providers]
    Decision -- No --> Proxy[Proxy API Layer<br>Raw provider-native access]
    Unified --> Creds[Shared Credential Store<br>Encrypted tokens + proactive refresh]
    Proxy --> Creds
    Creds --> Provider[Third-Party Provider APIs<br>Salesforce, HubSpot, Workday, etc.]

    style Creds fill:#f0f4ff,stroke:#4a6fa5,stroke-width:2px

The unified layer handles 80-90% of your integration needs: listing contacts, creating tickets, syncing employees. When you need provider-specific features - a custom Salesforce SOQL query, a HubSpot workflow trigger, or a proprietary endpoint that no common data model covers - you drop down to the proxy layer. Both layers share the same authenticated connection and the same token management pipeline.

This gives you two forms of portability:

Provider portability. Your core integration logic talks to the unified API. Swapping a customer from HubSpot to Salesforce requires zero code changes on your side because both produce the same normalized response.
Platform portability. If you ever need to move off your integration platform, the proxy layer means your application already knows how to consume raw provider responses. You are not locked into a proprietary data model with no escape hatch.

Credential Portability: Bring-Your-Own-OAuth

The deepest form of vendor lock-in in integration platforms is credential ownership. If the platform's OAuth app is the one authorized to access your customers' data, leaving that platform means every customer must re-authenticate. For enterprise customers with complex approval chains, that could take weeks.

The solution is Bring-Your-Own-OAuth (BYO-OAuth): register your own OAuth application credentials with the integration platform, so the authorization grant belongs to your app, not the platform's.

sequenceDiagram
    participant App as Your Application
    participant Platform as Integration Platform
    participant Provider as OAuth Provider

    Note over App,Platform: One-time setup
    App->>Platform: Register your OAuth client_id + client_secret
    Platform-->>App: Credentials stored (encrypted)

    Note over App,Provider: Customer connects
    App->>Platform: Initiate OAuth for customer
    Platform->>Provider: Authorization redirect (YOUR client_id)
    Provider-->>Platform: Auth code callback
    Platform->>Provider: Exchange code for tokens (YOUR credentials)
    Provider-->>Platform: Access token + refresh token
    Platform-->>App: Connection active

    Note over App,Provider: Ongoing API usage
    App->>Platform: GET /unified/crm/contacts
    Platform->>Platform: Decrypt token, refresh if needed
    Platform->>Provider: API call with Bearer token
    Provider-->>Platform: Response data
    Platform-->>App: Normalized response

    Note over App: If you leave the platform
    App->>Provider: Tokens still valid (issued to YOUR OAuth app)

With BYO-OAuth, the authorization grant is between your OAuth application and the provider. The integration platform manages the token lifecycle - proactive refresh, encryption, concurrency control - but the credentials are portable. If you leave the platform, existing tokens remain valid because they were issued to your OAuth application, not the platform's.

Here is how to register your own OAuth credentials at the environment level in Truto:

// Register your own Salesforce OAuth app for a specific environment
const response = await fetch(
  'https://api.truto.one/environment-integration',
  {
    method: 'PATCH',
    headers: {
      'Authorization': 'Bearer YOUR_TRUTO_API_TOKEN',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      integration_name: 'salesforce',
      environment_id: 'env_your_environment_id',
      credentials: {
        oauth2: {
          config: {
            client_id: 'YOUR_SALESFORCE_CONNECTED_APP_CLIENT_ID',
            client_secret: 'YOUR_SALESFORCE_CONNECTED_APP_SECRET',
            scope: 'api refresh_token offline_access',
          },
        },
      },
    }),
  }
);

Once registered, every new OAuth connection for that integration in that environment uses your OAuth app. The platform's credential resolution merges configuration across three levels - platform defaults, environment overrides, and per-account overrides - with the most specific level taking priority. Your environment-level credentials override the platform's defaults without affecting other integrations or environments.
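
To make the three-level resolution concrete, here is a minimal sketch of "most specific level wins". The function names are hypothetical, and the deep-merge behavior is an assumption for illustration; the platform's actual merge semantics may differ.

```javascript
// Returns true for plain objects that should be merged key by key.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively merge `override` onto `base` without mutating either input.
function deepMerge(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override ?? {})) {
    result[key] =
      isPlainObject(value) && isPlainObject(result[key])
        ? deepMerge(result[key], value)
        : value;
  }
  return result;
}

// Hypothetical resolver. Priority: platform defaults < environment < account.
function resolveCredentials(platformDefaults, environmentOverrides, accountOverrides) {
  return [environmentOverrides, accountOverrides].reduce(
    (merged, level) => deepMerge(merged, level),
    platformDefaults ?? {}
  );
}
```

With this sketch, an environment-level `client_id` and `client_secret` replace the platform's values, while any key the environment does not set (a default `scope`, for example) still falls through from the platform defaults.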

Calling Provider-Native Endpoints via Proxy

The proxy layer gives you raw access to any endpoint the provider exposes, using the same managed credentials as the unified API. No schema transformation, no field mapping - you send what the provider expects and get back exactly what the provider returns.

// Query a custom Salesforce object via SOQL - not covered by any unified model
const salesforceResponse = await fetch(
  'https://api.truto.one/proxy/query?' +
    new URLSearchParams({
      integrated_account_id: 'ia_customer_abc',
      q: 'SELECT Id, Risk_Score__c, Compliance_Status__c FROM Custom_Audit__c WHERE Risk_Score__c > 80',
    }),
  { headers: { Authorization: 'Bearer YOUR_TRUTO_API_TOKEN' } }
);
 
const data = await salesforceResponse.json();
// Returns raw Salesforce response - no transformation applied
// {
//   "result": {
//     "totalSize": 12,
//     "done": true,
//     "records": [
//       { "Id": "a01xx000003GYb1", "Risk_Score__c": 92, "Compliance_Status__c": "Review" }
//     ]
//   }
// }

// Trigger a HubSpot workflow enrollment - provider-specific, no unified equivalent
const hubspotResponse = await fetch(
  'https://api.truto.one/proxy/workflows-enrollments?' +
    new URLSearchParams({
      integrated_account_id: 'ia_customer_xyz',
    }),
  {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_TRUTO_API_TOKEN',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      inputs: [{ email: 'prospect@example.com' }],
    }),
  }
);

The proxy handles authentication, token refresh, rate limiting, and pagination identically to the unified layer. Your application gets managed credentials without being constrained to a predefined data model. Start with the proxy to ship fast, then layer unified models on top as common patterns emerge across your integrations.

Operational Runbook: Monitoring, Rotation, and Export

A multi-provider API strategy is only as strong as your operational visibility into token health across every connected account.

Token health monitoring. Subscribe to webhook events that surface credential issues in real time:

// Webhook payload when a token refresh fails permanently
{
  "event_type": "integrated_account:authentication_error",
  "payload": {
    "id": "ia_customer_abc",
    "integration_name": "salesforce",
    "status": "needs_reauth",
    "last_error": "invalid_grant: expired refresh token"
  }
}
 
// Webhook payload when a customer re-authenticates successfully
{
  "event_type": "integrated_account:reactivated",
  "payload": {
    "id": "ia_customer_abc",
    "integration_name": "salesforce",
    "status": "active"
  }
}

Build dashboards that track refresh success rates and needs_reauth counts per provider. Spikes in auth failures for a specific provider often signal an upstream issue - a provider-side token policy change, a revoked scope, or an API deprecation - before the provider's own status page reflects it.
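
As a starting point for the `needs_reauth` metric, a webhook handler can keep a per-provider tally. This is a minimal sketch: the event types and payload fields follow the sample payloads above, while `recordAuthEvent`, `needsReauthCount`, and the in-memory `Map` are hypothetical stand-ins for your own handler and metrics store.

```javascript
// Hypothetical in-memory tally: integration_name -> Set of account ids
// currently waiting on re-authentication.
const needsReauthByProvider = new Map();

// Apply one auth lifecycle webhook event to the tally.
function recordAuthEvent(event) {
  const { event_type: eventType, payload } = event;
  const accounts = needsReauthByProvider.get(payload.integration_name) ?? new Set();

  if (eventType === 'integrated_account:authentication_error') {
    accounts.add(payload.id);
  } else if (eventType === 'integrated_account:reactivated') {
    accounts.delete(payload.id);
  }
  needsReauthByProvider.set(payload.integration_name, accounts);
}

// Number of accounts for a provider that still need re-authentication.
function needsReauthCount(integrationName) {
  return needsReauthByProvider.get(integrationName)?.size ?? 0;
}
```

Storing account ids in a `Set` keeps the count stable if the same failure event is delivered more than once, which matters because webhook deliveries are commonly retried.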

Credential rotation. When rotating OAuth client secrets (a common security practice, often done at least annually):

Generate a new client secret in the provider's developer console.
Update the environment-level credentials via the integration platform's API.
Existing refresh tokens typically remain valid - most OAuth providers do not invalidate tokens when the client secret changes.
New token exchanges and refreshes will use the updated secret automatically.

Portability checklist. Before you commit to any integration platform, verify these exit capabilities:

Capability	What to verify
BYO-OAuth support	Can you register your own OAuth app credentials so authorization grants belong to you?
Proxy / raw access	Can you call provider-native endpoints through the platform without schema transformation?
Webhook parity	Does the platform expose auth lifecycle events (failures, reactivations) so you can monitor externally?
Zero-storage option	Does the platform offer a pass-through mode that does not persist your customer data at rest?
Override hierarchy	Can you customize field mappings, query translations, and endpoint routing per-environment or per-account without forking?

Avoiding vendor lock-in means making these architectural decisions deliberately from day one. Lock-in is an architectural problem, not a procurement mistake. The time to verify these capabilities is during evaluation, not during an emergency migration.

What to Build Next

If you are designing your OAuth token management system from scratch, here is the priority order:

Start with proactive refresh. This eliminates the majority of token-related failures immediately. Even a simple cron job that refreshes tokens 5 minutes before expiry is better than reactive-only.
Add per-account locking. Use whatever locking primitive your stack already has - database advisory locks, Redis SETNX, or in-memory mutexes for single-instance apps. An in-memory mutex stops protecting you once you run more than one instance, so move to a distributed lock before you scale out.
Encrypt tokens at rest. AES-256-GCM with per-value random IVs. This is non-negotiable after the Salesloft Drift breach demonstrated what happens when OAuth tokens are compromised at scale.
Build the state machine. Track account health explicitly. Fire webhooks on auth failures so customers can self-serve re-authentication. Auto-reactivate the account when a subsequent re-authentication succeeds.
Instrument everything. Track refresh success rates, latency, and failure reasons per provider. You will discover that Provider X randomly returns 500s every Tuesday at 2am, and you will be glad you have the data to prove it.
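
The first two items in that list can be prototyped in a few lines. The sketch below is a minimal illustration for a single-instance app, not a production implementation: `isDueForRefresh`, `refreshOnce`, `refreshDueAccounts`, and the `{ id, expiresAtMs }` account shape are hypothetical, and `refreshFn` stands in for whatever function calls the provider's token endpoint.

```javascript
// Refresh this long before expiry; 5 minutes matches the cron example above.
const REFRESH_LEAD_MS = 5 * 60 * 1000;

// True when the token expires within the lead window (or already has).
function isDueForRefresh(expiresAtMs, nowMs = Date.now(), leadMs = REFRESH_LEAD_MS) {
  return expiresAtMs - nowMs <= leadMs;
}

// accountId -> promise for the refresh currently in progress.
const inFlightRefreshes = new Map();

// In-memory single-flight: concurrent callers for one account share one refresh.
// This only works within a single process; use a distributed lock across instances.
function refreshOnce(accountId, refreshFn) {
  const existing = inFlightRefreshes.get(accountId);
  if (existing) return existing;

  const promise = Promise.resolve()
    .then(() => refreshFn(accountId))
    .finally(() => inFlightRefreshes.delete(accountId));
  inFlightRefreshes.set(accountId, promise);
  return promise;
}

// One pass of a hypothetical cron job over stored accounts.
// allSettled keeps one failing account from hiding the results of the others.
function refreshDueAccounts(accounts, refreshFn, nowMs = Date.now()) {
  return Promise.allSettled(
    accounts
      .filter((account) => isDueForRefresh(account.expiresAtMs, nowMs))
      .map((account) => refreshOnce(account.id, refreshFn))
  );
}
```

Request handlers can call the same `refreshOnce` as the cron pass, so a request that arrives mid-refresh waits on the in-progress promise instead of sending a second request with the same refresh token.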

Or skip the infrastructure work entirely and let Truto handle it.

Tip

Pro Tip: Stop treating integration errors as purely technical faults. When an OAuth token fails permanently, it is a customer success issue. Automate the communication layer so your users know exactly which integration needs to be reconnected before their data syncs fall behind.

FAQ

How do you prevent OAuth token refresh race conditions in distributed systems?

Use a per-account mutex lock (via Redis distributed locks, database advisory locks, or another single-flight primitive per tenant) to serialize refresh requests. Concurrent callers wait for the in-progress refresh and receive the same result, preventing duplicate provider requests that can invalidate rotating refresh tokens.

What is proactive token refresh and why is it better than reactive refresh?

Proactive refresh schedules token renewal 60-180 seconds before expiration using background alarms, so tokens are always fresh when API requests arrive. Reactive refresh waits for a 401 error, adding latency spikes and risking permanent invalid_grant lockouts when refresh tokens rotate during network failures.

How should OAuth tokens be encrypted at rest?

Use AES-256-GCM encryption with a unique random 12-byte IV per encryption operation. Store the IV alongside the ciphertext. GCM provides both confidentiality and integrity verification, detecting tampering in addition to preventing unauthorized reads. Never reuse an IV with the same key.

What happens when an OAuth refresh token is permanently invalidated?

Mark the connected account as needing re-authentication, stop retry attempts immediately (retrying invalid_grant errors is pointless), and fire a webhook notification so the customer can re-authorize. Automatically reactivate the account if a subsequent re-authentication succeeds.

How do third-party API providers handle concurrent refresh requests?

Strict providers enforce refresh token rotation. If they receive multiple concurrent requests using the same refresh token, they process the first and invalidate the old token. Subsequent requests are treated as a replay attack, and the provider may revoke the entire token family, permanently locking out your application.

Updates

Jul 15, 2026 Added six new sections: Implementation Overview and Goals; Per-Account Mutex Patterns (Redis SETNX, Postgres advisory locks, optimistic CAS) with working code; Token Data Model with canonical SQL schema and AES-GCM envelope format; Proactive Refresh State Machine with full state graph and alarm-handler pseudocode; Handling Refresh-Token Rotation and Provider Quirks; and Sequence Diagrams with a Failure-Mode Catalogue.
Jul 3, 2026 Added a Brex-focused case study covering token types and lifetimes, scope-to-endpoint mapping, refresh best practices, a decision matrix, and an auth-failure checklist, oriented toward AI agent and MCP use cases.
May 29, 2026 Added new section on avoiding vendor lock-in covering dual-layer API architecture (unified + proxy), BYO-OAuth credential portability with sequence diagram and code, proxy API code examples for provider-native endpoints, and an operational runbook for monitoring, credential rotation, and platform exit.
