How do you handle webhooks from legacy APIs that don't support them?

Use incremental polling with updated_at cursors to detect changes, then convert those diffs into webhook-shaped events. This 'virtual webhook' pattern lets your application code stay event-driven regardless of whether the upstream API supports native webhooks. Pair it with native webhooks where available for a hybrid real-time + reconciliation architecture.

What retry strategy should I use for webhook delivery?

Use exponential backoff with jitter. A practical schedule starts at 5 seconds and caps the interval at 6-12 hours, with 5-7 total attempts spanning roughly a business day. Always honor 429 Retry-After headers, use constant-time signature comparison, and route exhausted retries to a dead letter queue for manual inspection and replay.

What metrics should I monitor for webhook reliability?

Track delivery success rate (alert below 95%), ingress ACK latency p99 (alert above 500ms), queue depth and drain time, retry rate by destination, and end-to-end delivery latency p95/p99. Also alert on absence - a provider that stops sending events is often more dangerous than one returning errors.

How do reconciliation jobs complement webhooks?

Webhooks provide near-real-time event delivery, but they can miss events due to outages, misconfigurations, or network issues. A reconciliation job periodically polls the provider API using updated_at cursors, diffs results against local state, and emits synthetic events for any gaps. Running this nightly or hourly catches what webhooks miss.

What is the claim-check pattern for webhook processing?

The claim-check pattern separates payload storage from queue messaging. When a webhook arrives, the receiver writes the full payload to object storage, then enqueues a lightweight message containing only the storage key and metadata. Workers retrieve the payload from storage when processing. This decouples payload size from queue message limits and allows retries without re-transmitting large payloads.

Designing Reliable Webhooks: Lessons from Production

Unified webhook architecture for production: signature verification, retry patterns, SLIs, reconciliation, plus a Brex-to-accounting integration recipe.

Sidharth Verma · March 5, 2026 · 19 min read

You pointed a URL at a Lambda function, added an if (req.headers['x-hub-signature']) check, and called it a day. Now you're three integrations deep, your webhook handler is 800 lines of provider-specific spaghetti, and a silent failure from HiBob just caused a 2:00 AM PagerDuty alert because the payload contained only an employee ID, with no actual data.

Webhooks are often marketed as the "simple" way to keep data in sync. In reality, they are the Wild West of software engineering. Every provider—from Salesforce and Slack to HiBob and Jira—has a different "vibe" for how they handle security, payload structures, and retries. If your integration strategy is just pointing a URL at a server and hoping for a valid JSON body, you are building a liability—a common pitfall when building direct integrations in-house.

A unified webhook architecture is a system that centralizes the ingestion, verification, and normalization of asynchronous events from multiple third-party providers into a single, predictable data stream. This architecture eliminates vendor-specific security logic and ensures your application receives enriched, actionable data rather than "thin" notifications.

At Truto, we have processed millions of events across hundreds of SaaS platforms. Here is why the standard approach to webhooks is fundamentally broken and how we engineered a unified engine to fix it.

The Verification Wild West: Solving HMAC, JWT, and Handshakes

Webhook signature verification is the process of cryptographically proving that an incoming HTTP request originated from the expected third-party provider and that the payload has not been tampered with in transit.

There is no "Standard Webhook Security." One service uses HMAC-SHA256, another uses JWT, and a third—like Slack or Microsoft Graph—requires a custom "challenge" handshake before it even starts sending data.

Provider	Verification Method	What You Need to Implement
GitHub	HMAC-SHA256 on raw body	Compute hash, compare with `X-Hub-Signature-256`
Slack	Challenge handshake + HMAC	Echo `challenge` value on setup, verify `x-slack-signature` on events
Microsoft Graph	Validation token	Return the token as plain text during subscription creation
HiBob	Bearer token	Compare token in `Authorization` header with stored secret

The Timing Attack Risk

Most developers compare signatures using a standard string equality operator (==). This is a security failure. Standard string comparison returns as soon as it finds a mismatch, allowing an attacker to use timing analysis to guess the signature byte-by-byte.

Danger

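Critical Security Note: Always use a constant-time comparison function such as Node's crypto.timingSafeEqual to prevent timing side-channel attacks during webhook verification. The standard Web Crypto API does not define timingSafeEqual; crypto.subtle.timingSafeEqual is a non-standard extension available on runtimes such as Cloudflare Workers.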
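As a concrete illustration of constant-time comparison, here is a minimal sketch of GitHub-style verification, where the `X-Hub-Signature-256` header carries `sha256=` followed by a hex digest of the raw body. The function name `verifyHmacSha256` is illustrative, not part of any library.

```javascript
const crypto = require('crypto');

// Verify a GitHub-style "sha256=<hex>" signature over the raw request body.
function verifyHmacSha256(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const provided = Buffer.from(signatureHeader.slice('sha256='.length), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return provided.length === expected.length &&
    crypto.timingSafeEqual(provided, expected);
}
```

The length check leaks only the digest length, which is public knowledge for SHA-256, so it does not weaken the comparison.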

Truto's Unified Verification Engine

We handle this via a declarative verification layer. Instead of writing boilerplate code for every new vendor, we use JSONata expressions to define how to handle challenges and validate signatures. For example, when Slack sends a url_verification event, our engine identifies the challenge field and returns it immediately. Your backend never sees the handshake garbage; it only sees verified, legitimate events.
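For readers building this themselves, the Slack handshake can be sketched in plain JavaScript. This is not Truto's JSONata configuration, only an illustration of the behavior: Slack posts a body with `type: "url_verification"` and a `challenge` string, and expects the challenge echoed back. The function name and the response-descriptor shape are assumptions made for this example.

```javascript
// Returns a response descriptor for handshake requests, or null for normal events
// that should continue through signature verification and processing.
function handleSlackHandshake(body) {
  if (body && body.type === 'url_verification' && typeof body.challenge === 'string') {
    return { status: 200, contentType: 'text/plain', body: body.challenge };
  }
  return null;
}
```

Running this check before any business logic keeps handshake traffic out of the event pipeline entirely.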

Solving the "Thin Payload" Problem with Automated Enrichment

Webhook payload enrichment is an architectural pattern where a receiver automatically calls the provider's API to fetch the full resource data after receiving a "thin" notification that only contains an ID.

Most enterprise webhooks are "thin." You get a notification saying employee.updated. Great. What changed? Often, the webhook only gives you a resource_id.

{
  "type": "employee.updated",
  "employee": {
    "id": "2934871"
  }
}

This forces your engineering team to build a manual "fetch-back" loop: receive the webhook, look up the credentials, call the third-party API, handle potential rate limits, and then finally process the data.
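A hand-rolled version of that loop might look like the sketch below. Everything in `deps` is hypothetical (`getCredentials`, `fetchEmployee`, and `sleep` stand in for your own credential store, API client, and timer), and the linear wait is a placeholder for a real backoff policy.

```javascript
// Fetch the full record for a thin event, retrying only on 429 responses.
async function enrichThinEvent(event, accountId, deps, maxAttempts = 3) {
  const creds = await deps.getCredentials(accountId);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await deps.fetchEmployee(creds, event.employee.id);
    if (res.status === 200) {
      return { type: event.type, employee: res.body };
    }
    // Give up on non-retryable errors or once the retry budget is spent.
    if (res.status !== 429 || attempt === maxAttempts) {
      throw new Error(`enrichment failed with status ${res.status}`);
    }
    await deps.sleep(1000 * attempt); // simple linear wait before the next try
  }
}
```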

The Truto Approach: Automatic Enrichment

When an event hits Truto, our mapping engine—designed to solve the hardest problems in schema normalization—determines if the payload is complete. If it's thin, the system automatically uses our Unified API to fetch the full, up-to-date resource. This is handled via a method_config in the integration's JSONata mapping, which tells Truto exactly which endpoint to call to retrieve the complete record:

# Example: HiBob mapping triggering automated enrichment
webhooks:
  hibob: |
    (
      $action := $split(body.type,'.')[1];
      body.{
        "event_type": $action = "joined" ? "created" : "updated",
        "resource": "hris/employees",
        "method": "get",
        "method_config": {
          "id" : employee.id
        }
      }
    )

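By the time the webhook hits your endpoint, it arrives in Truto's standard event envelope. The example below shows that envelope for an integrated_account:created event; record events use the same top-level event and payload fields: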

{
  "id": "3a0da6ba-b2d1-473f-957c-51f6825e3623",
  "event": "integrated_account:created",
  "payload": {
    "id": "79a39d69-e27e-49cb-b9a9-79f5eea7aa26",
    "tenant_id": "acme-1",
    "environment_integration_id": "27a8c0ff-0c2e-4383-b651-0772b3515921",
    "context": {},
    "created_at": "2023-06-16T09:21:21.000Z",
    "updated_at": "2023-06-16T09:21:21.000Z",
    "is_sandbox": false,
    "unified_model_override": {},
    "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
    "integration": {
      "id": "1fa47bf3-5f1f-4b65-bcd0-8d07ab455e15",
      "name": "helpscout",
      "category": "helpdesk",
      "is_beta": false,
      "team_id": "68ea7267-2aec-4da0-b5d9-192cc84eb2de",
      "sharing": "allow",
      "default_oauth_app_id": null,
      "created_at": "2023-02-16T09:27:09.000Z",
      "updated_at": "2023-02-17T12:56:55.000Z"
    },
    "environment_integration": {
      "id": "27a8c0ff-0c2e-4383-b651-0772b3515921",
      "integration_id": "1fa47bf3-5f1f-4b65-bcd0-8d07ab455e15",
      "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
      "show_in_catalog": true,
      "is_enabled": true,
      "override": {},
      "created_at": "2023-02-16T09:27:16.000Z",
      "updated_at": "2023-02-16T09:27:16.000Z"
    }
  },
  "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
  "created_at": "2023-06-16T09:21:22.369Z",
  "webhook_id": "077ec306-5756-43fa-9a06-0cc0da4eabe0"
}

Your system receives a complete, unified record. You don't need to know that HiBob sent a thin event while Workday sent a thick one. The data is already there.

Architecture of a Unified Receiver: Sync vs. Async Fan-out

To handle webhooks at scale, you have to account for two distinct integration architectures: per-account and per-integration.

Feature	Per-Account Webhooks	Per-Integration (Environment) Webhooks
URL Structure	Unique per customer	Single URL for all customers
Processing Path	Synchronous enrichment	Asynchronous queue-based fan-out
Scaling Strategy	Direct ingestion	`context_lookup` mapping
Primary Use Case	Salesforce, HiBob, HubSpot	Slack, Microsoft Teams, Asana

Why the per-integration path needs a queue

When a single webhook could affect dozens of connected accounts, processing them all synchronously would blow past the provider's timeout window. Truto handles this with an asynchronous fan-out path: ingest the event, acknowledge the vendor within milliseconds, then hand off to background processing that resolves the relevant integrated accounts and routes data without blocking the HTTP handler.

Handling Legacy APIs That Don't Support Webhooks

Not every API sends webhooks. Many legacy systems and older SaaS platforms simply don't offer event-driven integrations. When you're dealing with an on-premise HRIS, a legacy ERP, or a niche vertical SaaS tool, polling is often the only path to near-real-time data sync.

The standard approach is incremental polling - call the API on a schedule, filter by updated_at or a change token, and diff the results against what you already have. It works, but it carries real costs: wasted API quota on empty responses, latency determined by your polling interval rather than actual changes, and rate-limit pressure that multiplies across customers and endpoints.

A better pattern is to push the polling responsibility into the integration layer, so your application still receives change events through a standard webhook interface. This is sometimes called virtual webhooks - the integration platform polls for changes and converts detected diffs into webhook-shaped events delivered to your endpoint. Your code doesn't care whether the upstream system supports native webhooks or not; it processes the same unified payload either way.
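The diff-to-event conversion at the heart of this pattern can be sketched as follows. It assumes each record exposes `id` and `updated_at`, and that `previous` is a Map of id to the last seen `updated_at`. Deletions can only be inferred when `fetched` is a full listing rather than an incremental page, so treat that branch with care. The function name is illustrative.

```javascript
// Convert a polled snapshot into webhook-shaped change events.
function diffToEvents(previous, fetched) {
  const events = [];
  const seen = new Set();
  for (const record of fetched) {
    seen.add(record.id);
    if (!previous.has(record.id)) {
      events.push({ event: 'record:created', record });
    } else if (previous.get(record.id) !== record.updated_at) {
      events.push({ event: 'record:updated', record });
    }
  }
  // Only valid when `fetched` is a complete listing of the resource.
  for (const id of previous.keys()) {
    if (!seen.has(id)) events.push({ event: 'record:deleted', record: { id } });
  }
  return events;
}
```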

Truto's sync jobs handle this exact pattern. For providers that lack native webhook support, scheduled sync runs detect changes and emit record:created, record:updated, and record:deleted events through the same delivery pipeline that native webhooks use. The key distinction is that sync-job events use a different event family (sync_job_run:*) so your consumer can differentiate between true real-time events and poll-detected changes if needed.

Info

Hybrid is the default. The most reliable architecture pairs native webhooks for immediacy with periodic sync jobs for reconciliation. Webhooks catch events in near-real-time; sync jobs fill gaps left by missed deliveries, provider outages, or misconfigurations.

Recommended Ingestion Topology and Queueing Patterns

A production webhook receiver needs to decouple ingestion from processing. The moment you start doing work inside the HTTP handler - enrichment, database writes, downstream API calls - you're one slow query away from timing out and triggering provider retries that compound the problem.

The architecture that works at scale follows a claim-check pattern:

flowchart LR
    A[Provider POST] --> B[Receiver:<br>Verify + ACK]
    B --> C[Object Store:<br>Write Payload]
    C --> D[Queue:<br>Lightweight Message]
    D --> E[Worker:<br>Fetch Payload,<br>Enrich, Deliver]
    E --> F[Customer Endpoint]

Receiver layer - Verify the signature, write the raw payload to object storage keyed by a unique event ID, enqueue a lightweight message containing only metadata (event ID, type, account ID), and return 200 OK. This entire path should complete in under 100ms.
Object storage - Decouples payload size from queue message limits. Large enterprise payloads (batch events from Workday, for example) don't blow up your queue.
Processing queue - Workers pull messages, retrieve the payload from storage, run enrichment, map to unified schema, and deliver to customer endpoints.
Delivery queue - A separate queue handles outbound delivery to customer webhook subscriptions, with its own retry and backoff logic independent of ingestion.

This two-queue topology is what Truto uses in production. The ingestion queue handles fan-out for per-integration webhooks (resolving which accounts an event belongs to), while the delivery queue handles outbound retries to customer endpoints. Keeping them separate prevents a single failing customer endpoint from creating backpressure on ingestion.
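The worker half of the claim-check pattern can be sketched like this. The `deps` object is hypothetical: `objectStore.get`, `enrichAndMap`, and `deliveryQueue.send` stand in for whatever storage client, mapping step, and queue client you use, and the `events/` key prefix is an assumption.

```javascript
// Process one lightweight queue message by redeeming its claim check.
async function processMessage(message, deps) {
  const rawBody = await deps.objectStore.get(`events/${message.eventId}`);
  if (rawBody == null) {
    // Throwing lets the queue retry or dead-letter the message.
    throw new Error(`payload missing for event ${message.eventId}`);
  }
  const event = JSON.parse(rawBody);
  const unified = await deps.enrichAndMap(event, message.accountId);
  // Hand off to the separate delivery queue so outbound retries
  // never create backpressure on ingestion.
  await deps.deliveryQueue.send({
    eventId: message.eventId,
    accountId: message.accountId,
    body: unified,
  });
}
```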

Retry, Backoff, and Rate-Limit Handling

Webhook delivery will fail. The question is how gracefully your system handles it.

Exponential Backoff with Jitter

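The industry-standard pattern is exponential backoff with jitter. The general formula:

delay = min(base * factor^(attempt - 1), max_delay) * random(1 - jitter, 1 + jitter)

Here's a concrete retry schedule that works well for most webhook delivery scenarios. It grows by a factor of 5 from a 5-second base for the first four attempts, then steps up to 30 minutes, 2 hours, and a 6-hour cap, with roughly ±25% jitter: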

Attempt	Base Delay	With Jitter (approx.)	Cumulative Wait
1	5s	3-7s	~5s
2	25s	18-32s	~30s
3	125s	90-160s	~2.5 min
4	625s	450-800s	~13 min
5	30 min	22-38 min	~43 min
6	2 hr	1.5-2.5 hr	~3 hr
7	6 hr (cap)	4.5-7.5 hr	~9 hr

This gives the consumer roughly a full business day to notice and fix issues before retries are exhausted. Cap the maximum interval at 6-12 hours to avoid events sitting in limbo indefinitely.
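The schedule in the table can be expressed as a small lookup with symmetric jitter. The sketch below is one way to do it, assuming roughly ±25% jitter (inferred from the table's ranges). `BASE_DELAYS_SEC` and `retryDelaySec` are illustrative names, and the `rand` parameter exists only to make the function testable.

```javascript
// Base delays from the table above, in seconds.
const BASE_DELAYS_SEC = [5, 25, 125, 625, 30 * 60, 2 * 3600, 6 * 3600];

// Returns the delay before the given attempt (1-based), or null once
// retries are exhausted and the event should go to a dead letter queue.
function retryDelaySec(attempt, jitterRatio = 0.25, rand = Math.random) {
  if (attempt < 1 || attempt > BASE_DELAYS_SEC.length) return null;
  const base = BASE_DELAYS_SEC[attempt - 1];
  // Uniform multiplier in [1 - jitterRatio, 1 + jitterRatio).
  const factor = 1 + (rand() * 2 - 1) * jitterRatio;
  return base * factor;
}
```

Without jitter the seven base delays sum to 31,380 seconds, or about 8.7 hours, which matches the cumulative column in the table.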

Handling 429s and Rate Limits

When a consumer returns 429 Too Many Requests with a Retry-After header, that value should override your backoff algorithm entirely. If the header says wait 120 seconds, wait at least 120 seconds - even if your exponential schedule would have retried sooner. Ignoring explicit backpressure signals risks getting your traffic blocked at the infrastructure level.
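A sketch of that override, under the assumption that you already have a computed backoff in milliseconds. `Retry-After` can be either a number of seconds or an HTTP-date, so both forms are parsed. Taking the larger of the two values is a design choice made here so the client never retries sooner than the server asked. Both function names are illustrative.

```javascript
// Parse a Retry-After header (delay-seconds or HTTP-date) into milliseconds.
function parseRetryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return null;
  const trimmed = String(headerValue).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Pick the wait before the next attempt: never shorter than Retry-After.
function nextDelayMs(backoffMs, retryAfterHeader, nowMs = Date.now()) {
  const retryAfterMs = parseRetryAfterMs(retryAfterHeader, nowMs);
  return retryAfterMs === null ? backoffMs : Math.max(backoffMs, retryAfterMs);
}
```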

For outbound delivery, Truto uses queue-based retry with built-in exponential backoff. Failed deliveries are retried automatically. If an endpoint consistently fails (over 50% failure rate across 20+ attempts in a 2-day window), the health monitoring system sends alerts and can auto-disable the webhook to protect both sides.

Dead Letter Queues

After exhausting retries, events should land in a dead letter queue (DLQ) rather than disappearing silently. The DLQ gives your team a place to inspect failed events, fix the root cause, and replay them in controlled batches. When replaying, rate-limit the replay to avoid a thundering herd - don't dump 10,000 events on an endpoint that just recovered.
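Controlled replay can be sketched as a batched loop with a pause between batches. The `deliver` callback, the default batch size, and the pause length are all assumptions to tune against the endpoint's capacity.

```javascript
// Replay dead-lettered events in small batches, pausing between batches.
// Returns the events that still failed so they can stay in the DLQ.
async function replayDeadLetters(events, deliver, options = {}) {
  const { batchSize = 50, pauseMs = 1000 } = options;
  const sleep = options.sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  const stillFailing = [];
  for (let i = 0; i < events.length; i += batchSize) {
    const batch = events.slice(i, i + batchSize);
    // The async wrapper turns synchronous throws into rejections too.
    const results = await Promise.allSettled(batch.map(async e => deliver(e)));
    results.forEach((result, idx) => {
      if (result.status === 'rejected') stillFailing.push(batch[idx]);
    });
    if (i + batchSize < events.length) await sleep(pauseMs);
  }
  return stillFailing;
}
```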

SLIs, Alert Thresholds, and Dashboard Metrics

You can't fix what you can't see. Webhook reliability requires a small set of metrics that make problems obvious before they cascade into data loss.

Key SLIs to Track

Metric	What It Tells You	Alert Threshold
Delivery success rate	% of events delivered on first or retried attempt	< 99% warrants investigation; < 95% should fire an alert
Ingress ACK latency (p99)	Time from request received to `200` returned	> 500ms - you're doing too much work in the handler
End-to-end latency (p95/p99)	Time from provider POST to customer delivery	> 30s needs attention
Queue depth	Number of events waiting for processing	Growing steadily means consumers can't keep up
Queue drain time	Estimated time to clear the backlog	> 10 min triggers scaling
Retry rate	% of deliveries requiring retries	Rising rate is the first signal of destination stress
Error rate by destination	Failures grouped by customer endpoint	Isolates which endpoints are struggling
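Queue drain time is not measured directly; it is usually estimated from queue depth and the gap between processing and arrival rates. A minimal sketch, with an illustrative function name:

```javascript
// Estimate seconds needed to clear the backlog at current rates.
function estimateDrainSeconds(queueDepth, processedPerSec, arrivingPerSec) {
  if (queueDepth === 0) return 0;
  const netRate = processedPerSec - arrivingPerSec;
  // If consumers are not outpacing producers, the backlog never drains.
  return netRate > 0 ? queueDepth / netRate : Infinity;
}
```

For example, 6,000 queued events with workers processing 30 per second against 20 arriving per second gives 6,000 / 10 = 600 seconds, right at the 10-minute scaling threshold.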

Dashboard Layout

A useful webhook operations dashboard has four panels:

Ingestion health - Event volume by provider, verification failure rate, ACK latency histogram
Processing pipeline - Queue depth over time, enrichment success/failure ratio, events processed per second
Outbound delivery - Success rate by customer endpoint, retry distribution, active circuit breakers
Business signals - Events by type (record:created, record:updated, record:deleted), top 10 noisiest integrations, "zero events received" alerts per provider

Tip

Alert on absence, not just failure. A provider that suddenly stops sending webhooks is often more dangerous than one returning errors. Set up "no events received in X minutes" alerts for each active integration.
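An absence check can be as simple as comparing each provider's last-seen timestamp against a threshold. The sketch below assumes you keep a map of provider name to the epoch-millisecond time of its most recent event; the function name and data shape are illustrative.

```javascript
// Return providers whose most recent event is older than the threshold.
function findSilentProviders(lastEventAtMs, nowMs, thresholdMs) {
  const silent = [];
  for (const [provider, lastSeen] of Object.entries(lastEventAtMs)) {
    if (nowMs - lastSeen > thresholdMs) silent.push(provider);
  }
  return silent;
}
```

In practice the threshold should vary per integration, since a low-volume provider can be legitimately quiet for hours.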

Incident Runbook: Detect, Escalate, Reconcile

When webhook delivery breaks, speed matters. Here's a concrete playbook:

1. Detect

Automated signal: Alert fires on delivery success rate < 95%, queue depth > 5,000, or zero events from a provider for > 15 minutes.
Manual signal: Customer reports stale data or missing updates.
First check: Is the problem on our side (queue backlog, worker crash, deployment regression) or the provider's side (provider outage, changed signature format, expired credentials)?

2. Triage

Single customer endpoint failing: Likely a customer-side issue. Open a circuit breaker on that endpoint, let other deliveries continue. Notify the customer with specifics (HTTP status codes, response bodies).
Single provider's events failing: Check for provider API changes, expired OAuth tokens, or signature format changes. If auth-related, the integrated account gets flagged for re-authentication automatically.
Broad delivery failure: Check infrastructure health - queue consumer crashes, object storage availability, network issues.

3. Mitigate

Protect ingestion: Keep ACK fast. Never block inbound processing because of outbound failures.
Scale consumers: If the bottleneck is processing throughput, add workers.
Shed non-critical load: Temporarily pause enrichment for lower-priority event types to free capacity.
Open circuit breakers: Stop hammering failing endpoints. Queue events for later replay.
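A per-endpoint circuit breaker, as referenced in the triage and mitigation steps, can be sketched in a few lines. The threshold and cooldown defaults are placeholders, and this is a generic illustration rather than a description of Truto's internal implementation.

```javascript
// Minimal circuit breaker: opens after N consecutive failures,
// then allows a single probe once the cooldown has elapsed.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 60_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }

  canAttempt(nowMs = Date.now()) {
    if (this.openedAt === null) return true; // closed
    return nowMs - this.openedAt >= this.cooldownMs; // half-open probe
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(nowMs = Date.now()) {
    this.failures += 1;
    // A failed probe re-opens the breaker and restarts the cooldown.
    if (this.failures >= this.failureThreshold) this.openedAt = nowMs;
  }
}
```

Events skipped while the breaker is open should be queued for later replay rather than dropped.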

4. Recover and Reconcile

Replay DLQ: Once the root cause is fixed, replay dead-lettered events in controlled batches with rate limiting.
Run a reconciliation sync: Trigger a sync job for affected providers/accounts to catch any events missed during the outage window.
Validate metrics: Confirm delivery success rate returns to baseline, queue depth drains to normal, and no data gaps remain.
Post-mortem: Document the timeline, root cause, and what monitoring gap allowed the issue to reach the severity it did.

Reconciliation Job Design

Webhooks are nominally at-least-once delivery systems, but when things go wrong they can become zero-times delivery systems. Provider outages, misconfigured subscriptions, deployment windows, and network blips all create gaps. The standard practice is to pair real-time webhooks with periodic reconciliation.

The Hybrid Model

The pattern is straightforward: webhooks handle the "hot path" for immediate change detection, and a scheduled reconciliation job runs as a safety net to catch anything that slipped through.

flowchart TB
    subgraph Real-Time Path
        A[Provider Webhook] --> B[Verify + Enrich]
        B --> C[Deliver to Customer]
    end
    subgraph Reconciliation Path
        D[Scheduled Job] --> E[Fetch records updated<br>since last checkpoint]
        E --> F[Diff against<br>local state]
        F --> G{Gaps<br>found?}
        G -->|Yes| H[Emit synthetic<br>events]
        G -->|No| I[Update checkpoint,<br>done]
        H --> C
    end

Key Design Principles

Use updated_at cursors - Query the provider's API for records modified since your last successful checkpoint. Most APIs support filtering by modification timestamp.
Diff, don't blindly overwrite - Compare fetched records against your local state. Only emit events for records that actually differ, to avoid noisy duplicate processing downstream.
Separate reconciliation events from live events - Tag reconciliation-sourced events so consumers can handle them differently if needed (e.g., skip re-sending notification emails for gap-filled records).
Run on a schedule that matches your freshness SLO - Nightly works for most use cases. Hourly is appropriate for payment or compliance-critical data. Every-5-minutes is overkill unless you're handling financial transactions.
Make it idempotent - The reconciliation job itself might fail partway through. Use checkpoint-based progress tracking so it can resume safely.
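The principles above can be sketched as a checkpointed loop. Everything in `deps` is hypothetical. In particular, `fetchUpdatedSince(checkpoint, limit)` is assumed to return records with `updated_at` strictly greater than the checkpoint, sorted ascending; real APIs differ, and records sharing a timestamp at a batch boundary need extra handling that is omitted here.

```javascript
// Reconcile in batches, saving a checkpoint after each batch so a
// crashed run resumes where it left off. Returns the number of events emitted.
async function reconcile(deps, batchSize = 100) {
  let checkpoint = await deps.loadCheckpoint();
  let emitted = 0;
  for (;;) {
    const batch = await deps.fetchUpdatedSince(checkpoint, batchSize);
    if (batch.length === 0) return emitted;
    for (const record of batch) {
      const local = await deps.getLocal(record.id);
      // Diff, don't blindly overwrite: skip records that already match.
      if (!local || local.updated_at !== record.updated_at) {
        await deps.emit({
          event: local ? 'record:updated' : 'record:created',
          source: 'reconciliation', // tag so consumers can treat it differently
          record,
        });
        emitted += 1;
      }
    }
    checkpoint = batch[batch.length - 1].updated_at;
    await deps.saveCheckpoint(checkpoint);
  }
}
```

Because a crash between `emit` and `saveCheckpoint` replays part of a batch, consumers still need to process reconciliation events idempotently.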

Truto's sync jobs serve exactly this purpose. They run on configurable schedules, fetch records from provider APIs using incremental cursors, and emit events through the same unified delivery pipeline. For providers that support native webhooks, sync jobs act purely as reconciliation. For providers that don't, sync jobs are the primary change detection mechanism.

Recipe: Integrating Brex Webhooks with Accounting Software

Everything above is generic. Here's what it looks like applied to a specific, common integration: syncing Brex spend data (transactions, expenses, transfers, card activity) into a customer's accounting system - QuickBooks Online, Xero, NetSuite, or a custom general ledger. This is the exact pattern behind any Brex-powered AP automation, expense management, or financial reporting product.

Subscribing to and Verifying Brex Webhooks

Brex APIs offer webhooks to notify you in real-time when events happen in your account, delivered as HTTPS POST requests to an endpoint you register. Register your endpoint by POSTing to /v1/webhooks with the event types you care about:

curl -X POST 'https://platform.brexapis.com/v1/webhooks' \
  -H 'Authorization: Bearer <api_token>' \
  -H 'Content-Type: application/json' \
  -H 'Idempotency-Key: <uuid>' \
  -d '{
    "url": "https://api.yourapp.com/webhooks/brex",
    "event_types": [
      "EXPENSE_PAYMENT_UPDATED",
      "TRANSFER_PROCESSED",
      "TRANSFER_FAILED"
    ]
  }'

A few Brex-specific constraints to internalize:

Only one webhook endpoint can be registered per customer/client_id, but that endpoint can subscribe to multiple event types. Your single endpoint must route internally by event_type.
Every payload arrives with three headers: Webhook-Id (a unique message identifier that stays the same across retries), Webhook-Timestamp (seconds since epoch), and Webhook-Signature (base64, space-delimited).
Brex signs webhooks with HMAC-SHA256 using a secret retrieved from GET /v1/webhooks/secrets.
The content to sign is the id, timestamp, and raw payload concatenated with a full-stop separator (id.timestamp.payload).
Compare Webhook-Timestamp against your system clock and reject anything outside a tolerance window to prevent replay attacks.
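Because a single endpoint receives every subscribed event type, internal routing is typically a lookup on `event_type`. The sketch below is one way to structure it; `buildRouter` and the handler return values are illustrative, and only the event type names and the `transfer_id` field come from this article.

```javascript
// Build a dispatcher that routes a parsed Brex event by its event_type.
function buildRouter(handlers, onUnknown = () => null) {
  return function route(event) {
    const known = Object.prototype.hasOwnProperty.call(handlers, event.event_type);
    const handler = known ? handlers[event.event_type] : onUnknown;
    return handler(event);
  };
}

const routeBrexEvent = buildRouter({
  EXPENSE_PAYMENT_UPDATED: event => ({ action: 'upsert_expense', event }),
  TRANSFER_PROCESSED: event => ({ action: 'post_payment', transferId: event.transfer_id }),
  TRANSFER_FAILED: event => ({ action: 'reverse_pending', transferId: event.transfer_id }),
});
```

Unknown event types fall through to `onUnknown` instead of throwing, so subscribing to a new type on the Brex side does not break the endpoint before a handler ships.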

Here's a reference verifier in Node.js:

const crypto = require('crypto');

// Returns true/false for the signature check. Throws if the timestamp is
// malformed or outside the tolerance window.
function verifyBrexSignature(rawBody, headers, secret) {
  const webhookId = headers['webhook-id'];
  const timestamp = headers['webhook-timestamp'];
  const signatureHeader = headers['webhook-signature']; // "v1,sig1 v1,sig2"

  // A request missing any of the three headers can never verify
  if (!webhookId || !timestamp || !signatureHeader) return false;

  // Reject stale or non-numeric timestamps (5-minute tolerance).
  // Without the NaN check, a malformed timestamp would slip past this guard.
  const now = Math.floor(Date.now() / 1000);
  const ts = Number.parseInt(timestamp, 10);
  if (Number.isNaN(ts) || Math.abs(now - ts) > 300) {
    throw new Error('Timestamp outside tolerance window');
  }

  const signedContent = `${webhookId}.${timestamp}.${rawBody}`;
  const expBuf = crypto
    .createHmac('sha256', secret)
    .update(signedContent)
    .digest();

  // Brex may send multiple signatures (space-delimited) to support key rotation
  const provided = signatureHeader
    .split(' ')
    .map(s => s.split(',')[1])
    .filter(Boolean); // drop entries without a "version,signature" shape

  return provided.some(sig => {
    const sigBuf = Buffer.from(sig, 'base64');
    // timingSafeEqual throws on unequal lengths, so check length first
    return sigBuf.length === expBuf.length &&
      crypto.timingSafeEqual(sigBuf, expBuf);
  });
}

Two common mistakes to avoid:

Reformatting the body before verifying - even a small change causes the signature to be completely different, so verify against the raw payload as received.
Using === for signature comparison. Always use timingSafeEqual.

Sample Event Payloads and Schema

Brex event payloads are intentionally thin. They carry the event type and identifiers, not the full resource. Here's what a TRANSFER_PROCESSED event body looks like on the wire:

{
  "event_type": "TRANSFER_PROCESSED",
  "transfer_id": "dptx_ckyypz30n000101kgzgnrtqlf",
  "company_id": "cuacc_ckqckhadg000601r95ox48c2s"
}

For multi-tenant integrations, company_id is the critical field: partners maintain a mapping of company_id to access_token, since company_id is passed along webhook payloads associated with a single company. Use it to look up the correct token, then call the relevant Brex endpoint to fetch the full record before pushing to the customer's accounting system.

The events that matter most for an accounting integration:

Event Type	Fires When	Downstream Action
`EXPENSE_PAYMENT_UPDATED`	Card charge posts or updates	Create/update a journal entry or expense record
`TRANSFER_PROCESSED`	ACH, wire, or check outflow settles	Post a payment against a vendor bill
`TRANSFER_FAILED`	Outflow fails	Reverse pending entries, notify AP team
`ACCOUNTING_RECORD_READY_FOR_EXPORT`	Brex explicitly flags a record for ERP export (alpha)	Push directly to the destination ledger

The Webhooks API has expanded WebhookEventType to include ACCOUNTING_RECORD_READY_FOR_EXPORT (alpha), which is the cleanest signal to build against if it's available to your account - it lets Brex own the "is this record ready to sync?" decision instead of you inferring it from raw expense state.

Warning

Pending vs. settled matters. Only settled transactions are returned from the Brex API - pending transactions are not returned in real-time as they happen, and only appear as they settle. If your product needs to show pending charges, use EXPENSE_PAYMENT_UPDATED webhooks - do not try to poll for pending state.

Idempotency and Durable Delivery Patterns

Webhook-Id is stable across retries - the same value is sent when a webhook is re-delivered (e.g., due to a previous failure). That makes it the natural anchor for idempotent processing.

Combine three primitives:

Dedup guard keyed on Webhook-Id at ingestion.
Claim-check storage so the raw body survives independently of the queue message.
Idempotent downstream writes so retries into QuickBooks/Xero/NetSuite don't produce duplicate journal entries.

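async function ingestBrexWebhook(req, res) {
  const rawBody = req.rawBody; // captured before JSON parsing
  const headers = req.headers;

  // `secret` is the signing secret fetched from GET /v1/webhooks/secrets
  try {
    if (!verifyBrexSignature(rawBody, headers, secret)) {
      return res.status(401).end();
    }
  } catch (err) {
    return res.status(401).end(); // stale or malformed timestamp
  }

  const webhookId = headers['webhook-id'];

  // 1. Claim-check first: persist the raw body. The key is the stable
  //    Webhook-Id, so a retried delivery just overwrites the same object.
  await objectStore.put(`brex/${webhookId}`, rawBody);

  // 2. Dedup guard - stable across Brex retries
  const { rowCount } = await db.query(
    `INSERT INTO processed_webhooks (webhook_id, received_at)
     VALUES ($1, NOW())
     ON CONFLICT (webhook_id) DO NOTHING`,
    [webhookId]
  );

  if (rowCount === 0) {
    return res.status(200).json({ status: 'duplicate' });
  }

  // 3. Enqueue a lightweight message. If this fails, release the dedup row
  //    and return 5xx so the next Brex retry is not swallowed as a duplicate.
  try {
    await queue.send({ webhook_id: webhookId, provider: 'brex' });
  } catch (err) {
    await db.query('DELETE FROM processed_webhooks WHERE webhook_id = $1', [webhookId]);
    return res.status(500).end();
  }

  // 4. Ack fast - all real work happens in the worker
  return res.status(200).json({ received: true });
}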

On the downstream side, when pushing a Brex transaction into QuickBooks or Xero, derive the destination idempotency token from the Brex Webhook-Id (e.g., brex-{webhookId}). If the worker retries after the accounting system has already accepted that entry, a destination API that supports idempotency keys can recognize the repeat instead of creating a duplicate. Where it does not, look up the existing entry by that token before writing.

Rate-Limit Handling and Retry-After

Exceeding a Brex rate limit results in an HTTP 429 (Too Many Requests) response. The guidance is to implement exponential backoff and respect the Retry-After header.

A correct client honors Retry-After absolutely - if Brex says wait 30 seconds, wait 30 seconds, even if your algorithm would have retried sooner:

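// `makeRequest` must return a fetch-style Response (status, headers.get)
async function brexRequestWithBackoff(makeRequest, maxAttempts = 6) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await makeRequest();

    if (res.status !== 429) return res;
    if (attempt === maxAttempts - 1) break; // no point sleeping before giving up

    // Default: exponential backoff + jitter, capped at 60 seconds
    let waitMs = Math.min(1000 * Math.pow(2, attempt) + Math.random() * 500, 60_000);

    // Honor Retry-After if present. It may be delay-seconds or an HTTP-date.
    const retryAfter = res.headers.get('retry-after');
    if (retryAfter) {
      const seconds = Number(retryAfter);
      const retryAfterMs = Number.isFinite(seconds)
        ? seconds * 1000
        : Date.parse(retryAfter) - Date.now();
      // Keep the backoff value if the header is unparseable (NaN)
      if (Number.isFinite(retryAfterMs)) waitMs = Math.max(0, retryAfterMs);
    }

    // Also log the Brex trace id so support can correlate later
    console.warn('brex_429', {
      trace_id: res.headers.get('x-brex-trace-id'),
      attempt,
      wait_ms: waitMs,
    });

    await new Promise(r => setTimeout(r, waitMs));
  }
  throw new Error('brex_rate_limit_exceeded');
}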

The same wrapper applies to enrichment fetches. When a TRANSFER_PROCESSED webhook arrives, your GET /v1/transfers/{id} call can also return 429 - wrap it in the same primitive so a burst of Brex events doesn't stampede your enrichment layer.

If you route Brex API calls through Truto's proxy layer, the platform surfaces upstream rate-limit information via standardized ratelimit-* headers. Truto does not swallow 429s on your behalf; the error passes through so your backoff logic gets an accurate signal.

Sync Pattern for Backfills and Reconciliation

Webhooks cover the incremental path. For initial backfills and daily reconciliation, poll Brex list endpoints with cursor pagination and filter by updated_at. A reasonable topology for a Brex-to-accounting integration:

Real-time - Handle EXPENSE_PAYMENT_UPDATED and TRANSFER_* webhooks as they arrive.
Hourly - Incremental pull filtered by updated_at > last_checkpoint to catch anything the webhook path missed.
Nightly - Full 24-48 hour reconciliation window, diffed against the customer's ledger. Flag mismatches for finance-team review.

Truto's sync jobs automate this exact pattern. Configure the cursor field, schedule, and destination once, and the platform handles pagination, incremental checkpoints, and 429 backoff on Brex - emitting unified record:* events regardless of whether the underlying provider supports native webhooks or not. The event contract is the same for Brex, QuickBooks, Xero, and NetSuite, so your accounting-sync worker doesn't need per-provider branches.

Logging, Metrics, and Monitoring Checklist

For a production Brex integration, capture and alert on the following:

Trace correlation - Log Brex's trace ID (returned in the X-Brex-Trace-Id response header) on every outbound call so you can quote it to Brex support during incidents. Brex recommends using the vendor-agnostic X-Brex-Trace-Id header, though the older X-Datadog-Trace-Id is still in use.
Webhook ingestion ACK latency (p99) - Time from receiving Brex's POST to returning 200. Alert if > 500ms; it means you're doing too much work in the handler.
Signature verification failure rate - Should be near zero. A spike suggests a rotated secret you haven't picked up (fetch /v1/webhooks/secrets again) or a spoofing attempt.
Timestamp-rejection rate - Non-zero means Brex's clock and yours are drifting, or an attacker is replaying old messages.
429 rate by Brex endpoint - Bucket by endpoint path. Rising rates mean you need to reduce concurrency or request a higher limit.
Enrichment fetch success rate - When a thin webhook triggers a Brex API call for full resource data, track how often that call succeeds within your retry budget.
End-to-end freshness lag - Time from Brex event fired to record appearing in the customer's accounting system. This is the metric customers actually feel.
DLQ depth - Any non-zero DLQ needs eyes-on. Alert immediately.
Reconciliation delta - Records the nightly sync had to correct. A rising delta is the loudest possible signal that webhooks are silently dropping.
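
Two of the thresholds in this checklist (ACK latency p99 above 500ms, any non-zero DLQ depth) are simple enough to sketch as code. The function names and the shape of the `metrics` object below are assumptions for illustration. In practice these checks usually live in your metrics platform's alert rules rather than in application code.

```javascript
// Nearest-rank percentile over a window of samples. Returns null for an empty window.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical metrics shape: { ackLatenciesMs: number[], dlqDepth: number }.
function evaluateAlerts(metrics) {
  const alerts = [];
  const p99 = percentile(metrics.ackLatenciesMs, 99);
  // Threshold from the checklist: p99 ACK latency above 500ms.
  if (p99 !== null && p99 > 500) alerts.push('ack_latency_p99');
  // Threshold from the checklist: any non-zero DLQ depth.
  if (metrics.dlqDepth > 0) alerts.push('dlq_non_empty');
  return alerts;
}
```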

Log every inbound webhook with Webhook-Id, event_type, company_id, verification result, and processing outcome. When a customer asks "why didn't my Brex expense post to QuickBooks?", you want to answer in one query.
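
A structured log entry with those fields could look something like the sketch below. Where each field lives is an assumption here (`Webhook-Id` read from a lowercased header map, `event_type` and `company_id` read from the parsed body), and `buildWebhookLogEntry` is a hypothetical helper. Adjust the lookups to match the payloads you actually receive.

```javascript
// Builds one flat, queryable log record per inbound webhook.
// Assumption: `headers` has lowercased keys, as Node's HTTP server provides them.
function buildWebhookLogEntry({ headers, body, verified, outcome }) {
  return {
    webhook_id: headers['webhook-id'] ?? null,
    event_type: body?.event_type ?? null,
    company_id: body?.company_id ?? null,
    verified: Boolean(verified),
    outcome, // e.g. 'enqueued', 'duplicate', 'rejected'
    logged_at: new Date().toISOString(),
  };
}
```

Emitting each entry as a single JSON line (for example `console.log(JSON.stringify(entry))`) keeps every field filterable in most log platforms.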

Best Practices for Webhook Consumers

Whether you use Truto or build your own, follow these three rules to avoid data corruption:

Idempotency is Non-Negotiable: Webhooks are "at-least-once" delivery systems, so expect to receive the same event more than once. Always check whether you have already processed an event_id before updating your database.
Verify, Then Process: Never trust a POST request just because it hit your endpoint. Use the X-Truto-Signature header to validate the request with a constant-time comparison before your business logic runs.
Fast Acknowledgement: Return a 200 OK immediately. If you need to perform a long-running task, put that task in a queue and acknowledge the webhook first. If your handler takes longer than about 10 seconds, many vendors will time out and retry, which leads to duplicate deliveries and race conditions.
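
The three rules can be combined in one small handler core. This is a sketch under stated assumptions, not Truto's documented scheme: it assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body and that the event carries an `event_id` field. The in-memory `seenEventIds` set and `workQueue` array are stand-ins for a durable store and a real queue.

```javascript
const crypto = require('node:crypto');

// Assumption: signature = hex(HMAC-SHA256(secret, rawBody)). Check your provider's docs.
function isValidSignature(rawBody, signatureHex, secret) {
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(String(signatureHex), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}

// In-memory stand-ins for illustration; use a durable unique-key store and a real queue in production.
const seenEventIds = new Set();
const workQueue = [];

function handleWebhook({ rawBody, signature, secret }) {
  // Rule 2: verify before any business logic runs.
  if (!isValidSignature(rawBody, signature, secret)) return { status: 401 };

  let event;
  try {
    event = JSON.parse(rawBody);
  } catch (err) {
    return { status: 400 };
  }

  // Rule 1: acknowledge duplicates without processing them again.
  if (seenEventIds.has(event.event_id)) return { status: 200, duplicate: true };
  seenEventIds.add(event.event_id);

  // Rule 3: enqueue and return; a separate worker does the slow part.
  workQueue.push(event);
  return { status: 200, duplicate: false };
}
```

Verification needs the exact bytes that were sent, so capture the raw body before any JSON middleware re-serializes it.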

Moving Past Brittle Webhook Logic

Building webhook listeners is easy. Building a unified webhook architecture that handles signature verification, automated enrichment, and reliable delivery across 100+ vendors is an engineering project that takes months and significant capital, often costing upwards of $50,000 per integration to maintain.

By moving the mapping and verification into a declarative layer with no integration-specific code, we ensure that when a vendor changes their signature format or payload structure, we patch it in the config, and your code never has to change. The webhook handler you actually want to write looks like this:

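app.post('/webhooks/truto', async (req, res) => {
  // verifyTrutoSignature is your own helper: check X-Truto-Signature in constant time.
  if (!verifyTrutoSignature(req)) return res.status(401).end();

  const { event, payload } = req.body;
  try {
    // payload.data contains the full, enriched, unified resource.
    // Keep processEvent fast (e.g. enqueue and return) so the 200 goes out quickly.
    await processEvent(event, payload);
    res.status(200).json({ ok: true });
  } catch (err) {
    // Signal failure with a non-2xx status instead of leaving the request hanging.
    res.status(500).end();
  }
});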

Updates

Jul 3, 2026 Added a Brex-to-accounting recipe section covering webhook subscription and HMAC-SHA256 signature verification, thin-payload event schemas (EXPENSE_PAYMENT_UPDATED, TRANSFER_*, ACCOUNTING_RECORD_READY_FOR_EXPORT), Webhook-Id-based idempotency with claim-check storage, 429 handling that honors Retry-After, and a Brex-specific logging/metrics checklist including X-Brex-Trace-Id correlation.
May 20, 2026 Added five new sections: legacy API handling with virtual webhooks, recommended ingestion topology (claim-check pattern), retry/backoff with parametric examples, SLIs/alert thresholds/dashboard metrics, incident runbook (detect/triage/mitigate/recover), and reconciliation job design with hybrid model diagram.
Mar 6, 2026 Updated title to 'Designing Reliable Webhooks: Lessons from Production' and revised the enrichment section to reference the Unified API with a technical JSONata example.
Mar 6, 2026 Updated webhook payload examples to reflect Truto's actual outbound delivery format and removed health monitoring section.
