Designing Reliable Webhooks: Lessons from Production

Enterprise webhooks are unreliable by design. Learn how to build a unified webhook architecture with retry patterns, SLIs, reconciliation jobs, and legacy API support.

Sidharth Verma · 13 min read

You pointed a URL at a Lambda function, added an if (req.headers['x-hub-signature']) check, and called it a day. Now you're three integrations deep, your webhook handler is 800 lines of provider-specific spaghetti, and a silent failure from HiBob just caused a 2:00 AM PagerDuty alert because the payload only contained an employee ID, with no actual data.

Webhooks are often marketed as the "simple" way to keep data in sync. In reality, they are the Wild West of software engineering. Every provider—from Salesforce and Slack to HiBob and Jira—has a different "vibe" for how they handle security, payload structures, and retries. If your integration strategy is just pointing a URL at a server and hoping for a valid JSON body, you are building a liability—a common pitfall when building direct integrations in-house.

A unified webhook architecture is a system that centralizes the ingestion, verification, and normalization of asynchronous events from multiple third-party providers into a single, predictable data stream. This architecture eliminates vendor-specific security logic and ensures your application receives enriched, actionable data rather than "thin" notifications.

At Truto, we have processed millions of events across hundreds of SaaS platforms. Here is why the standard approach to webhooks is fundamentally broken and how we engineered a unified engine to fix it.

The Verification Wild West: Solving HMAC, JWT, and Handshakes

Webhook signature verification is the process of cryptographically proving that an incoming HTTP request originated from the expected third-party provider and that the payload has not been tampered with in transit.

There is no "Standard Webhook Security." One service uses HMAC-SHA256, another uses JWT, and a third—like Slack or Microsoft Graph—requires a custom "challenge" handshake before it even starts sending data.

| Provider | Verification Method | What You Need to Implement |
| --- | --- | --- |
| GitHub | HMAC-SHA256 on raw body | Compute hash, compare with X-Hub-Signature-256 |
| Slack | Challenge handshake + HMAC | Echo challenge value on setup, verify x-slack-signature on events |
| Microsoft Graph | Validation token | Return the token as plain text during subscription creation |
| HiBob | Bearer token | Compare token in Authorization header with stored secret |

The Timing Attack Risk

Most developers compare signatures using a standard string equality operator (==). This is a security failure. Standard string comparison returns as soon as it finds a mismatch, allowing an attacker to use timing analysis to guess the signature byte-by-byte.

Danger

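Critical Security Note: Always use a constant-time comparison function, such as Node's crypto.timingSafeEqual or, on Cloudflare Workers, the non-standard crypto.subtle.timingSafeEqual extension, to prevent timing side-channel attacks during webhook verification. The standard Web Crypto API does not define timingSafeEqual, so check what your runtime actually provides.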
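As a rough illustration of the advice above, here is a minimal Node.js sketch of a GitHub-style HMAC-SHA256 check. The function name verifyHmacSignature and the "sha256=<hex>" header format are assumptions for this example; other providers encode their signatures differently, so treat it as a starting point and not a drop-in verifier.

```javascript
const crypto = require('node:crypto');

// Verify a "sha256=<hex>" signature computed over the raw request body.
// rawBody must be the exact bytes received, not a re-serialized JSON object.
function verifyHmacSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string') return false;

  const expected =
    'sha256=' + crypto.createHmac('sha256', secret).update(rawBody).digest('hex');

  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(signatureHeader, 'utf8');

  // timingSafeEqual throws when lengths differ, so reject those up front.
  if (a.length !== b.length) return false;
  return crypto.timingSafeEqual(a, b);
}
```

The length check leaks only the length of the expected signature, which is public knowledge for a fixed-size digest.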

Truto's Unified Verification Engine

We handle this via a declarative verification layer. Instead of writing boilerplate code for every new vendor, we use JSONata expressions to define how to handle challenges and validate signatures. For example, when Slack sends a url_verification event, our engine identifies the challenge field and returns it immediately. Your backend never sees the handshake garbage; it only sees verified, legitimate events.
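For readers building this by hand, the Slack handshake can be sketched in plain JavaScript. This is not Truto's JSONata configuration; handleSlackHandshake is a hypothetical helper that only shows the shape of the check: a url_verification request carries a challenge string that must be echoed back.

```javascript
// Returns an HTTP-style response for Slack's setup handshake,
// or null when the body is a normal event that should continue
// on to signature verification and processing.
function handleSlackHandshake(body) {
  if (body && body.type === 'url_verification' && typeof body.challenge === 'string') {
    return {
      status: 200,
      headers: { 'content-type': 'text/plain' },
      body: body.challenge, // echo the challenge value back to Slack
    };
  }
  return null;
}
```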

Solving the "Thin Payload" Problem with Automated Enrichment

Webhook payload enrichment is an architectural pattern where a receiver automatically calls the provider's API to fetch the full resource data after receiving a "thin" notification that only contains an ID.

Most enterprise webhooks are "thin." You get a notification saying employee.updated. Great. What changed? Often, the webhook only gives you a resource_id.

{
  "type": "employee.updated",
  "employee": {
    "id": "2934871"
  }
}

This forces your engineering team to build a manual "fetch-back" loop: receive the webhook, look up the credentials, call the third-party API, handle potential rate limits, and then finally process the data.
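To make the cost of that loop concrete, here is a minimal sketch of a hand-rolled fetch-back step. Everything in it is an assumption for illustration: fetchResource is an injected function standing in for the provider API call (credentials lookup included), and it is assumed to resolve to an object with status, data, and retryAfterSeconds fields.

```javascript
// Hypothetical fetch-back loop for a thin "employee.updated" event.
// fetchResource(id) resolves to { status, data, retryAfterSeconds }.
async function enrichThinEvent(event, fetchResource, opts = {}) {
  const {
    maxAttempts = 3,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;
  const id = event.employee.id;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchResource(id);
    if (res.status === 200) {
      return { type: event.type, data: res.data };
    }
    // Back off on rate limits while attempts remain.
    if (res.status === 429 && attempt < maxAttempts) {
      await sleep((res.retryAfterSeconds ?? 1) * 1000);
      continue;
    }
    throw new Error(`enrichment failed for ${id} with status ${res.status}`);
  }
}
```

Even this stripped-down version has to make decisions about attempts, rate limits, and failure handling, and it has to make them again for every provider.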

The Truto Approach: Automatic Enrichment

When an event hits Truto, our mapping engine—designed to solve the hardest problems in schema normalization—determines if the payload is complete. If it's thin, the system automatically uses our Unified API to fetch the full, up-to-date resource. This is handled via a method_config in the integration's JSONata mapping, which tells Truto exactly which endpoint to call to retrieve the complete record:

# Example: HiBob mapping triggering automated enrichment
webhooks:
  hibob: |
    (
      $action := $split(body.type,'.')[1];
      body.{
        "event_type": $action = "joined" ? "created" : "updated",
        "resource": "hris/employees",
        "method": "get",
        "method_config": {
          "id" : employee.id
        }
      }
    )

By the time the event reaches your endpoint, it is wrapped in a consistent envelope. The sample below shows that envelope for an integrated_account:created event:

{
  "id": "3a0da6ba-b2d1-473f-957c-51f6825e3623",
  "event": "integrated_account:created",
  "payload": {
    "id": "79a39d69-e27e-49cb-b9a9-79f5eea7aa26",
    "tenant_id": "acme-1",
    "environment_integration_id": "27a8c0ff-0c2e-4383-b651-0772b3515921",
    "context": {},
    "created_at": "2023-06-16T09:21:21.000Z",
    "updated_at": "2023-06-16T09:21:21.000Z",
    "is_sandbox": false,
    "unified_model_override": {},
    "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
    "integration": {
      "id": "1fa47bf3-5f1f-4b65-bcd0-8d07ab455e15",
      "name": "helpscout",
      "category": "helpdesk",
      "is_beta": false,
      "team_id": "68ea7267-2aec-4da0-b5d9-192cc84eb2de",
      "sharing": "allow",
      "default_oauth_app_id": null,
      "created_at": "2023-02-16T09:27:09.000Z",
      "updated_at": "2023-02-17T12:56:55.000Z"
    },
    "environment_integration": {
      "id": "27a8c0ff-0c2e-4383-b651-0772b3515921",
      "integration_id": "1fa47bf3-5f1f-4b65-bcd0-8d07ab455e15",
      "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
      "show_in_catalog": true,
      "is_enabled": true,
      "override": {},
      "created_at": "2023-02-16T09:27:16.000Z",
      "updated_at": "2023-02-16T09:27:16.000Z"
    }
  },
  "environment_id": "ac15abdc-b38e-47d0-97a2-69194017c177",
  "created_at": "2023-06-16T09:21:22.369Z",
  "webhook_id": "077ec306-5756-43fa-9a06-0cc0da4eabe0"
}

Your system receives a complete, unified record. You don't need to know that HiBob sent a thin event while Workday sent a thick one. The data is already there.

Architecture of a Unified Receiver: Sync vs. Async Fan-out

To handle webhooks at scale, you have to account for two distinct integration architectures: per-account and per-integration.

| Feature | Per-Account Webhooks | Per-Integration (Environment) Webhooks |
| --- | --- | --- |
| URL Structure | Unique per customer | Single URL for all customers |
| Processing Path | Synchronous enrichment | Asynchronous queue-based fan-out |
| Scaling Strategy | Direct ingestion | context_lookup mapping |
| Primary Use Case | Salesforce, HiBob, HubSpot | Slack, Microsoft Teams, Asana |

Why the per-integration path needs a queue

When a single webhook could affect dozens of connected accounts, processing them all synchronously would blow past the provider's timeout window. Truto handles this with an asynchronous fan-out path: ingest the event, acknowledge the vendor within milliseconds, then hand off to background processing that resolves the relevant integrated accounts and routes data without blocking the HTTP handler.
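The split between acknowledging and fanning out can be sketched with in-memory stand-ins. This is an illustration under assumptions, not Truto's implementation: the array plays the role of a durable queue, accountsByWorkspace is a hypothetical lookup table, and workspaceId is an invented field name.

```javascript
// In-memory stand-ins; a real system would use a durable queue and a database.
const queue = [];
const accountsByWorkspace = new Map([
  ['T123', ['acct-1', 'acct-2']], // hypothetical workspace -> integrated accounts
]);

// HTTP-handler side: enqueue and acknowledge without resolving any accounts.
function ingest(event) {
  queue.push(event);
  return { status: 200 };
}

// Worker side: resolve affected accounts and produce one delivery per account.
function drain() {
  const deliveries = [];
  while (queue.length > 0) {
    const event = queue.shift();
    const accounts = accountsByWorkspace.get(event.workspaceId) ?? [];
    for (const accountId of accounts) {
      deliveries.push({ accountId, event });
    }
  }
  return deliveries;
}
```

The handler's cost stays constant no matter how many accounts an event touches, because the account lookup only happens in drain.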

Handling Legacy APIs That Don't Support Webhooks

Not every API sends webhooks. Many legacy systems and older SaaS platforms simply don't offer event-driven integrations. When you're dealing with an on-premise HRIS, a legacy ERP, or a niche vertical SaaS tool, polling is often the only path to near-real-time data sync.

The standard approach is incremental polling - call the API on a schedule, filter by updated_at or a change token, and diff the results against what you already have. It works, but it carries real costs: wasted API quota on empty responses, latency determined by your polling interval rather than actual changes, and rate-limit pressure that multiplies across customers and endpoints.

A better pattern is to push the polling responsibility into the integration layer, so your application still receives change events through a standard webhook interface. This is sometimes called virtual webhooks - the integration platform polls for changes and converts detected diffs into webhook-shaped events delivered to your endpoint. Your code doesn't care whether the upstream system supports native webhooks or not; it processes the same unified payload either way.

Truto's sync jobs handle this exact pattern. For providers that lack native webhook support, scheduled sync runs detect changes and emit record:created, record:updated, and record:deleted events through the same delivery pipeline that native webhooks use. The key distinction is that sync-job events use a different event family (sync_job_run:*) so your consumer can differentiate between true real-time events and poll-detected changes if needed.
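The virtual webhook idea boils down to diffing two snapshots and emitting events for the differences. The sketch below assumes each snapshot is a Map from record ID to a record with an updated_at field; diffToEvents is a hypothetical name, and the event names are borrowed from the ones mentioned above.

```javascript
// previous and current map record id -> record (each with an updated_at string).
function diffToEvents(previous, current) {
  const events = [];

  for (const [id, record] of current) {
    const old = previous.get(id);
    if (!old) {
      events.push({ event: 'record:created', id, record });
    } else if (old.updated_at !== record.updated_at) {
      events.push({ event: 'record:updated', id, record });
    }
  }

  // Anything that disappeared between snapshots is treated as deleted.
  for (const id of previous.keys()) {
    if (!current.has(id)) events.push({ event: 'record:deleted', id });
  }

  return events;
}
```

Note that detecting deletions by absence only works when current is a full snapshot. With an incremental fetch filtered by updated_at, deleted records simply never show up, so deletions need a separate mechanism.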

Info

Hybrid is the default. The most reliable architecture pairs native webhooks for immediacy with periodic sync jobs for reconciliation. Webhooks catch events in near-real-time; sync jobs fill gaps left by missed deliveries, provider outages, or misconfigurations.

A production webhook receiver needs to decouple ingestion from processing. The moment you start doing work inside the HTTP handler - enrichment, database writes, downstream API calls - you're one slow query away from timing out and triggering provider retries that compound the problem.

The architecture that works at scale follows a claim-check pattern:

flowchart LR
    A[Provider POST] --> B[Receiver:<br>Verify + ACK]
    B --> C[Object Store:<br>Write Payload]
    C --> D[Queue:<br>Lightweight Message]
    D --> E[Worker:<br>Fetch Payload,<br>Enrich, Deliver]
    E --> F[Customer Endpoint]

  1. Receiver layer - Verify the signature, write the raw payload to object storage keyed by a unique event ID, enqueue a lightweight message containing only metadata (event ID, type, account ID), and return 200 OK. This entire path should complete in under 100ms.
  2. Object storage - Decouples payload size from queue message limits. Large enterprise payloads (batch events from Workday, for example) don't blow up your queue.
  3. Processing queue - Workers pull messages, retrieve the payload from storage, run enrichment, map to unified schema, and deliver to customer endpoints.
  4. Delivery queue - A separate queue handles outbound delivery to customer webhook subscriptions, with its own retry and backoff logic independent of ingestion.

This two-queue topology is what Truto uses in production. The ingestion queue handles fan-out for per-integration webhooks (resolving which accounts an event belongs to), while the delivery queue handles outbound retries to customer endpoints. Keeping them separate prevents a single failing customer endpoint from creating backpressure on ingestion.
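A toy version of the claim-check flow helps show what actually travels through the queue. In this sketch a Map stands in for object storage and an array stands in for the ingestion queue; receive, work, and the metadata fields are illustrative names, and signature verification is assumed to have happened already.

```javascript
const objectStore = new Map(); // stand-in for S3-style object storage
const ingestQueue = [];        // stand-in for a durable queue

// Receiver: store the full payload, enqueue only the claim check, then ACK.
function receive(eventId, rawPayload, meta) {
  objectStore.set(eventId, rawPayload);
  ingestQueue.push({ eventId, ...meta });
  return 200;
}

// Worker: pull one message, redeem the claim check, and hand off the payload.
function work(handle) {
  const message = ingestQueue.shift();
  if (!message) return false;
  const payload = JSON.parse(objectStore.get(message.eventId));
  handle(message, payload);
  return true;
}
```

The stored payload is deliberately left in place after processing so that a failed delivery can be replayed from storage. A production worker would also need to re-enqueue or dead-letter a message when handle throws, which this sketch omits.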

Retry, Backoff, and Rate-Limit Handling

Webhook delivery will fail. The question is how gracefully your system handles it.

Exponential Backoff with Jitter

The industry-standard pattern is exponential backoff with jitter. The formula:

delay = min(base * 2^attempt + random(0, jitter_max), max_delay)

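Here's a concrete retry schedule that works well for most webhook delivery scenarios. Note that it is more aggressive than the doubling formula above: the first four base delays grow by a factor of 5, the later steps are hand-picked up to a 6-hour cap, and the jitter column is roughly ±30% of the base delay rather than a fixed additive range: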

| Attempt | Base Delay | With Jitter (approx.) | Cumulative Wait |
| --- | --- | --- | --- |
| 1 | 5s | 3-7s | ~5s |
| 2 | 25s | 18-32s | ~30s |
| 3 | 125s | 90-160s | ~2.5 min |
| 4 | 625s | 450-800s | ~13 min |
| 5 | 30 min | 22-38 min | ~43 min |
| 6 | 2 hr | 1.5-2.5 hr | ~3 hr |
| 7 | 6 hr (cap) | 4.5-7.5 hr | ~9 hr |

This gives the consumer roughly a full business day to notice and fix issues before retries are exhausted. Cap the maximum interval at 6-12 hours to avoid events sitting in limbo indefinitely.
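Both the formula and the schedule can be written down in a few lines. The sketch below is one possible reading: backoffDelay implements the formula as stated, while retryDelaySeconds reproduces the table's base delays with an assumed ±30% jitter that approximates, but does not exactly match, the jitter column. The random parameter is injectable only to make the functions testable.

```javascript
// Formula from the text: min(base * 2^attempt + random(0, jitterMax), maxDelay).
function backoffDelay(attempt, { base, jitterMax, maxDelay }, random = Math.random) {
  return Math.min(base * 2 ** attempt + random() * jitterMax, maxDelay);
}

// Base delays from the schedule table, in seconds.
const BASE_DELAYS_SECONDS = [5, 25, 125, 625, 1800, 7200, 21600];
const JITTER_RATIO = 0.3; // assumption: roughly ±30% of the base delay

// Returns the delay before the given attempt (1-based), or null when exhausted.
function retryDelaySeconds(attempt, random = Math.random) {
  if (attempt < 1 || attempt > BASE_DELAYS_SECONDS.length) return null;
  const base = BASE_DELAYS_SECONDS[attempt - 1];
  const jitter = (random() * 2 - 1) * JITTER_RATIO * base;
  return base + jitter;
}
```

A null return is the signal to stop retrying and route the event to a dead letter queue.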

Handling 429s and Rate Limits

When a consumer returns 429 Too Many Requests with a Retry-After header, that value should override your backoff algorithm entirely. If the header says wait 120 seconds, wait at least 120 seconds - even if your exponential schedule would have retried sooner. Ignoring explicit backpressure signals risks getting your traffic blocked at the infrastructure level.
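One way to apply this rule is to parse the header and let it replace the computed backoff whenever it is present. The helper names below are made up for this sketch; the two accepted formats (a number of seconds or an HTTP date) are the ones HTTP defines for Retry-After.

```javascript
// Parse Retry-After (delta-seconds or HTTP-date) into milliseconds, or null.
function parseRetryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return null;
  const value = String(headerValue).trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const dateMs = Date.parse(value);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// The consumer's explicit signal overrides the computed backoff when present.
function nextDelayMs(backoffMs, retryAfterHeader, nowMs = Date.now()) {
  const retryAfterMs = parseRetryAfterMs(retryAfterHeader, nowMs);
  return retryAfterMs === null ? backoffMs : retryAfterMs;
}
```

A more conservative variant would take the larger of the two values, so a short Retry-After never shortens a long backoff.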

For outbound delivery, Truto uses queue-based retry with built-in exponential backoff. Failed deliveries are retried automatically. If an endpoint consistently fails (over 50% failure rate across 20+ attempts in a 2-day window), the health monitoring system sends alerts and can auto-disable the webhook to protect both sides.

Dead Letter Queues

After exhausting retries, events should land in a dead letter queue (DLQ) rather than disappearing silently. The DLQ gives your team a place to inspect failed events, fix the root cause, and replay them in controlled batches. When replaying, rate-limit the replay to avoid a thundering herd - don't dump 10,000 events on an endpoint that just recovered.
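A controlled replay can be as simple as fixed-size batches with a pause in between. The sketch below assumes an injected deliver function that rejects on failure; replayDeadLetters, batchSize, and pauseMs are illustrative names and defaults, not recommended values.

```javascript
// Replay dead-lettered events in batches, pausing between batches.
// Returns the events that failed again so they can stay in the DLQ.
async function replayDeadLetters(events, deliver, opts = {}) {
  const {
    batchSize = 100,
    pauseMs = 1000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;
  const stillFailing = [];

  for (let i = 0; i < events.length; i += batchSize) {
    const batch = events.slice(i, i + batchSize);
    for (const event of batch) {
      try {
        await deliver(event);
      } catch {
        stillFailing.push(event);
      }
    }
    // Pause only when another batch is still waiting.
    if (i + batchSize < events.length) await sleep(pauseMs);
  }
  return stillFailing;
}
```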

SLIs, Alert Thresholds, and Dashboard Metrics

You can't fix what you can't see. Webhook reliability requires a small set of metrics that make problems obvious before they cascade into data loss.

Key SLIs to Track

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| Delivery success rate | % of events delivered on first or retried attempt | < 99% warrants investigation; < 95% should fire an alert |
| Ingress ACK latency (p99) | Time from request received to 200 returned | > 500ms - you're doing too much work in the handler |
| End-to-end latency (p95/p99) | Time from provider POST to customer delivery | > 30s needs attention |
| Queue depth | Number of events waiting for processing | Growing steadily means consumers can't keep up |
| Queue drain time | Estimated time to clear the backlog | > 10 min triggers scaling |
| Retry rate | % of deliveries requiring retries | Rising rate is the first signal of destination stress |
| Error rate by destination | Failures grouped by customer endpoint | Isolates which endpoints are struggling |

Dashboard Layout

A useful webhook operations dashboard has four panels:

  1. Ingestion health - Event volume by provider, verification failure rate, ACK latency histogram
  2. Processing pipeline - Queue depth over time, enrichment success/failure ratio, events processed per second
  3. Outbound delivery - Success rate by customer endpoint, retry distribution, active circuit breakers
  4. Business signals - Events by type (record:created, record:updated, record:deleted), top 10 noisiest integrations, "zero events received" alerts per provider

Tip

Alert on absence, not just failure. A provider that suddenly stops sending webhooks is often more dangerous than one returning errors. Set up "no events received in X minutes" alerts for each active integration.
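An absence check needs nothing more than the timestamp of the last event seen per provider. In this sketch, lastEventAt is an assumed plain object mapping provider names to millisecond timestamps, and findSilentProviders is a hypothetical helper that a scheduled job could call.

```javascript
// Returns the providers whose most recent event is older than maxSilenceMs.
function findSilentProviders(lastEventAt, nowMs, maxSilenceMs) {
  const silent = [];
  for (const [provider, lastMs] of Object.entries(lastEventAt)) {
    if (nowMs - lastMs > maxSilenceMs) silent.push(provider);
  }
  return silent;
}
```

The threshold should be tuned per provider, since a quiet weekend for an HRIS integration looks very different from a quiet minute for a chat integration.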

Incident Runbook: Detect, Escalate, Reconcile

When webhook delivery breaks, speed matters. Here's a concrete playbook:

1. Detect

  • Automated signal: Alert fires on delivery success rate < 95%, queue depth > 5,000, or zero events from a provider for > 15 minutes.
  • Manual signal: Customer reports stale data or missing updates.
  • First check: Is the problem on our side (queue backlog, worker crash, deployment regression) or the provider's side (provider outage, changed signature format, expired credentials)?

2. Triage

  • Single customer endpoint failing: Likely a customer-side issue. Open a circuit breaker on that endpoint, let other deliveries continue. Notify the customer with specifics (HTTP status codes, response bodies).
  • Single provider's events failing: Check for provider API changes, expired OAuth tokens, or signature format changes. If auth-related, the integrated account gets flagged for re-authentication automatically.
  • Broad delivery failure: Check infrastructure health - queue consumer crashes, object storage availability, network issues.

3. Mitigate

  • Protect ingestion: Keep ACK fast. Never block inbound processing because of outbound failures.
  • Scale consumers: If the bottleneck is processing throughput, add workers.
  • Shed non-critical load: Temporarily pause enrichment for lower-priority event types to free capacity.
  • Open circuit breakers: Stop hammering failing endpoints. Queue events for later replay.
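For teams that have not built one before, a per-endpoint circuit breaker can be sketched in a small class. The thresholds and the half-open behavior here are assumptions chosen for illustration, and the class is not tied to any particular library.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 60000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  // True when a delivery may be attempted. After the cooldown the
  // breaker is half-open and lets a probe request through.
  allows(nowMs = Date.now()) {
    if (this.openedAt === null) return true;
    return nowMs - this.openedAt >= this.cooldownMs;
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(nowMs = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = nowMs;
  }
}
```

While allows returns false, events for that endpoint should be queued for later replay instead of being dropped.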

4. Recover and Reconcile

  • Replay DLQ: Once the root cause is fixed, replay dead-lettered events in controlled batches with rate limiting.
  • Run a reconciliation sync: Trigger a sync job for affected providers/accounts to catch any events missed during the outage window.
  • Validate metrics: Confirm delivery success rate returns to baseline, queue depth drains to normal, and no data gaps remain.
  • Post-mortem: Document the timeline, root cause, and what monitoring gap allowed the issue to reach the severity it did.

Reconciliation Job Design

Webhooks are nominally at-least-once delivery systems, but when things go wrong they can become zero-times delivery systems. Provider outages, misconfigured subscriptions, deployment windows, and network blips all create gaps. The standard practice is to pair real-time webhooks with periodic reconciliation.

The Hybrid Model

The pattern is straightforward: webhooks handle the "hot path" for immediate change detection, and a scheduled reconciliation job runs as a safety net to catch anything that slipped through.

flowchart TB
    subgraph Real-Time Path
        A[Provider Webhook] --> B[Verify + Enrich]
        B --> C[Deliver to Customer]
    end
    subgraph Reconciliation Path
        D[Scheduled Job] --> E[Fetch records updated<br>since last checkpoint]
        E --> F[Diff against<br>local state]
        F --> G{Gaps<br>found?}
        G -->|Yes| H[Emit synthetic<br>events]
        G -->|No| I[Update checkpoint,<br>done]
        H --> C
    end

Key Design Principles

  1. Use updated_at cursors - Query the provider's API for records modified since your last successful checkpoint. Most APIs support filtering by modification timestamp.
  2. Diff, don't blindly overwrite - Compare fetched records against your local state. Only emit events for records that actually differ, to avoid noisy duplicate processing downstream.
  3. Separate reconciliation events from live events - Tag reconciliation-sourced events so consumers can handle them differently if needed (e.g., skip re-sending notification emails for gap-filled records).
  4. Run on a schedule that matches your freshness SLO - Nightly works for most use cases. Hourly is appropriate for payment or compliance-critical data. Every-5-minutes is overkill unless you're handling financial transactions.
  5. Make it idempotent - The reconciliation job itself might fail partway through. Use checkpoint-based progress tracking so it can resume safely.

Truto's sync jobs serve exactly this purpose. They run on configurable schedules, fetch records from provider APIs using incremental cursors, and emit events through the same unified delivery pipeline. For providers that support native webhooks, sync jobs act purely as reconciliation. For providers that don't, sync jobs are the primary change detection mechanism.
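The design principles above can be combined into one small function. This is a sketch under assumptions, not Truto's sync job: fetchUpdatedSince is a hypothetical injected function, records carry an id and an updated_at in a uniform ISO 8601 UTC format (so string comparison orders them correctly), and deletions are out of scope.

```javascript
// localState maps record id -> record. Returns synthetic events plus the
// checkpoint to persist once the run has completed successfully.
function reconcile(localState, checkpoint, fetchUpdatedSince) {
  const events = [];
  let newCheckpoint = checkpoint;

  for (const record of fetchUpdatedSince(checkpoint)) {
    const local = localState.get(record.id);
    // Diff first: only emit when the record is new or actually changed.
    if (!local || local.updated_at !== record.updated_at) {
      events.push({
        event: local ? 'record:updated' : 'record:created',
        source: 'reconciliation', // tag so consumers can tell it from live events
        record,
      });
      localState.set(record.id, record);
    }
    if (record.updated_at > newCheckpoint) newCheckpoint = record.updated_at;
  }

  return { events, checkpoint: newCheckpoint };
}
```

Because the diff runs against local state, re-running from an old checkpoint after a partial failure emits nothing for records that were already applied.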

Best Practices for Webhook Consumers

Whether you use Truto or build your own, follow these three rules to avoid data corruption:

  1. Idempotency is Non-Negotiable: Webhooks are "at-least-once" delivery systems. You will receive the same event twice. Always check if you have already processed an event_id before updating your database.
  2. Verify, Then Process: Never trust a POST request just because it hit your endpoint. Use the X-Truto-Signature to validate the request using constant-time comparison before your business logic runs.
  3. Fast Acknowledgement: Return a 200 OK immediately. If you need to perform a long-running task, put that task in a queue and acknowledge the webhook first. Timeouts vary by sender (Slack expects a response within 3 seconds, GitHub within 10), and once you exceed them the sender will time out and retry, leading to duplicate deliveries and race conditions.
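The idempotency rule is the one most often skipped, so here is a minimal sketch of it. The Set stands in for a table with a unique key on the event ID, and the sketch assumes the ID is the top-level id field of the delivered envelope; handleOnce and applyChange are hypothetical names.

```javascript
const processedEventIds = new Set(); // stand-in for a unique-keyed table

// Applies the change at most once per event ID.
function handleOnce(envelope, applyChange) {
  if (processedEventIds.has(envelope.id)) return 'duplicate';
  applyChange(envelope);
  // Record the ID only after the change succeeded, so a failure can be retried.
  processedEventIds.add(envelope.id);
  return 'processed';
}
```

In a real database, the ID insert and the business write should share one transaction. Otherwise a crash between the two steps can still apply the same event twice.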

Moving Past Brittle Webhook Logic

Building webhook listeners is easy. Building a unified webhook architecture that handles signature verification, automated enrichment, and reliable delivery across 100+ vendors is an engineering project that takes months and significant capital, often costing upwards of $50,000 per integration to maintain.

By moving the mapping and verification into a declarative, zero-integration-specific code layer, we ensure that when a vendor changes their signature format or payload structure, we patch it in the config—and your code never has to change. The webhook handler you actually want to write looks like this:

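app.post('/webhooks/truto', async (req, res) => {
  // Reject anything that fails signature verification before touching the body
  if (!verifyTrutoSignature(req)) return res.status(401).end();

  const { event, payload } = req.body;
  // payload.data contains the full, enriched, unified resource.
  // processEvent should only persist or enqueue the event so the ACK stays fast;
  // heavy work belongs in a background worker.
  await processEvent(event, payload);

  res.status(200).json({ ok: true });
});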

FAQ

How do you handle webhooks from legacy APIs that don't support them?
Use incremental polling with updated_at cursors to detect changes, then convert those diffs into webhook-shaped events. This "virtual webhook" pattern lets your application code stay event-driven regardless of whether the upstream API supports native webhooks. Pair it with native webhooks where available for a hybrid real-time + reconciliation architecture.

What retry strategy should I use for webhook delivery?
Use exponential backoff with jitter. A practical schedule starts at 5 seconds and caps at 6-12 hours, with 5-7 total attempts spanning roughly 9 hours (about one business day). Always honor 429 Retry-After headers, use constant-time signature comparison, and route exhausted retries to a dead letter queue for manual inspection and replay.

What metrics should I monitor for webhook reliability?
Track delivery success rate (alert below 95%), ingress ACK latency p99 (alert above 500ms), queue depth and drain time, retry rate by destination, and end-to-end delivery latency p95/p99. Also alert on absence - a provider that stops sending events is often more dangerous than one returning errors.

How do reconciliation jobs complement webhooks?
Webhooks provide near-real-time event delivery, but they can miss events due to outages, misconfigurations, or network issues. A reconciliation job periodically polls the provider API using updated_at cursors, diffs results against local state, and emits synthetic events for any gaps. Running this nightly or hourly catches what webhooks miss.

What is the claim-check pattern for webhook processing?
The claim-check pattern separates payload storage from queue messaging. When a webhook arrives, the receiver writes the full payload to object storage, then enqueues a lightweight message containing only the storage key and metadata. Workers retrieve the payload from storage when processing. This decouples payload size from queue message limits and allows retries without re-transmitting large payloads.
