How to Handle Webhooks and Real-Time Data Sync from Legacy APIs
Handle webhooks and real-time data sync from legacy APIs with copy-paste verification code, retry policies, idempotency patterns, JSONata transforms, and monitoring runbooks.
Legacy APIs are the silent saboteurs of your data pipeline. Your team wants real-time sync between your product and a customer's on-prem ERP or aging CRM instance—the kind of native connectivity your sales team actually asks for. But the API documentation is a PDF from 2017, there are no webhooks, and the rate limit is 100 requests per minute. So you build a poller, ship it, and move on. Six months later, that poller is eating 40% of your API quota, missing records, and a customer is on a call asking why their data is three hours stale.
This is not an edge case. It is the default experience for any team building direct integrations to legacy systems. What starts as a simple Jira ticket to sync data quickly mutates into an ongoing maintenance nightmare of undocumented edge cases, silent failures, and exhausted API quotas.
This guide covers how to architect reliable, real-time data sync from legacy APIs that either lack webhooks entirely or implement them so poorly they might as well not exist. We will break down the business cost of stale data, the technical failure modes of legacy systems, the polling-vs-webhook trade-off, and the architectural patterns — including virtual webhooks, the claim-check pattern, and unified webhook engines — that actually hold up in production.
Quick TL;DR: The Recommended Pattern Checklist
If you only have five minutes, here is the playbook:
- Accept webhooks where providers offer them, fall back to incremental polling (virtual webhooks) where they do not, and merge both into a single normalized event stream.
- Verify every inbound webhook with timing-safe signature checks.
- Respond with 200 immediately and process asynchronously via a queue.
- Use the claim-check pattern to decouple payload size from queue limits.
- Enrich thin payloads by fetching the full record before forwarding.
- Deduplicate with idempotency keys derived from provider + event_type + resource_id + event_id.
- Retry outbound delivery with exponential backoff (start at 30s, increase the delay on each attempt, cap at 8 hours, TTL of 24 hours, add jitter).
- Monitor inbound event rates per provider and alert when volume drops below baseline. Silence is the most dangerous failure mode.
- Run a daily reconciliation job to catch anything both paths missed.
The Hidden Cost of Stale Data in the Enterprise
Late data is not just an engineering annoyance — it is a measurable financial drain. Gartner estimates that poor data quality costs organizations an average of $15 million per year. And the bleeding is constant: B2B contact data decays at approximately 2.1% per month according to Marketing Sherpa, meaning over 22% of a database becomes outdated within a year.
The revenue impact hits harder than most executives realize. According to a report from Validity, 76% of respondents characterize their CRM data quality as either "good" or "very good," yet a staggering 44% of respondents estimated their company loses over 10% of annual revenue due to poor data quality. That gap between perception and reality is where real money evaporates. Research shows that 50% of workers' time is spent finding, correcting, and confirming inaccurate data — time that should be spent closing deals or shipping features.
The consequences cascade across the organization:
- Revenue Loss: Decaying data directly impacts sales targeting and pipeline velocity.
- Operational Friction: Teams spend hours manually verifying records across disparate systems.
- Missed SLAs: Critical automated workflows fail to trigger when prerequisite data is delayed.
- Engineering Drain: Developers abandon core product work to troubleshoot broken batch jobs.
When a B2B SaaS application captures a critical engagement signal — a signed contract, a high-intent product action, or a support escalation — that state change must propagate to the source of truth immediately. Batch sync, the traditional approach of running a nightly ETL job, guarantees your users are operating on outdated context.
If your product captures a lead score change at 2 PM but your batch job runs at midnight, that signal sits unsynced for 10 hours and is most of a day old by the time a sales rep picks up the phone the next morning. In fast-moving enterprise sales cycles, that is the difference between a warm lead and a cold one. If you are building real-time CRM syncs at enterprise scale, the data freshness problem compounds with every hour of delay.
"Real-time" does not mean "instant." Between event delivery, your queue, retries, and rate-limit backoff, real-time usually means seconds, sometimes minutes, and occasionally "we will reconcile later." If your stakeholders expect a hard 200ms SLA, reset those expectations before writing a line of code.
Why Legacy APIs Break Real-Time Sync Architectures
Modern SaaS APIs from Stripe, GitHub, and Slack ship with well-documented webhooks, signature verification, and retry policies. Legacy systems — your customer's on-prem NetSuite instance, a decade-old HRIS, or a vertical SaaS tool with a SOAP API — actively fight real-time sync patterns.
Here is what breaks:
No webhook support at all. Many legacy systems simply cannot push events. Keeping external systems in sync with data from platforms like NetSuite is a common requirement, but while external integrations that periodically pull or push data are well-documented, the opposite approach — proactively pushing updates via webhooks — is often overlooked. NetSuite does not offer native webhooks out of the box. To get real-time event notifications, your team must write, deploy, and maintain custom SuiteScript RESTlets directly inside the customer's NetSuite environment. NetSuite also imposes severe concurrency restrictions — standard accounts typically allow a maximum of 5 simultaneous requests. If your system fires off six concurrent updates, the sixth request does not neatly queue; it fails or returns a rejection error.
Brutal rate limits. Salesforce enforces a 100,000 daily API request limit for Enterprise Edition orgs, plus 1,000 additional requests per user license. They also cap concurrent long-running requests to a maximum of 25. That sounds generous until you realize your marketing automation, support tools, BI platform, and your integration are all sharing the same quota. If you attempt to achieve "real-time" sync by polling Salesforce endpoints every few seconds across dozens of customer accounts, you will exhaust your API quota before lunch. One runaway poller can starve every other integration in the org.
Undocumented schemas and breaking changes. Legacy API documentation is often incomplete, outdated, or flat-out wrong. Fields appear and disappear between versions. Date formats change without notice. Pagination tokens expire silently. Every one of these becomes a production incident you did not plan for.
The "brittle connector" trap. When you build direct integrations in-house, each one starts as a manageable project. But by the time you are maintaining 10 or more, the maintenance burden cannibalizes core product development. You are not building features — you are babysitting API connections.
Enterprise iPaaS platforms like MuleSoft promise to solve this, but they come with their own pain. MuleSoft implementation timelines typically span 6-8 months, affecting time-to-value compared to alternatives. Implementation costs frequently exceed $100,000 for initial deployment. Companies migrating from legacy platforms like MuleSoft report 20-65% lower Total Cost of Ownership and 4-10x faster development cycles, according to Workato. For most B2B SaaS companies, a six-month, six-figure iPaaS deployment is not a realistic path to shipping integrations fast.
The Webhook Wild West: Reliability and Verification Challenges
Some legacy APIs do support webhooks. The problem is that "support" is a generous word. As we have covered in our guide on designing reliable webhooks from production experience, every provider has a different approach to security, payload structure, and retry behavior.
Here is the reality of what you face across providers:
| Challenge | Example |
|---|---|
| Inconsistent verification | HiBob uses HMAC-SHA256; Slack requires a challenge handshake; Microsoft Graph needs JWT verification |
| Thin payloads | Many providers send only an entity ID, not the actual data. You get { "employee_id": "12345" } and need to call back to get the full record |
| Out-of-order delivery | Event A (created) arrives after Event B (updated). Your handler overwrites the newer state with the older one |
| No retry guarantees | Some providers fire the webhook once and forget it. If your endpoint was down for 30 seconds, that event is gone |
| Duplicate events | Others retry aggressively, sending the same event 3-5 times. Without idempotency, you process the same record multiple times |
The Verification Handshake
Webhook signature verification is the cryptographic proof that a payload actually originated from the expected provider. There is no standard. One legacy API might use HMAC-SHA256, another might use a simple Bearer token, and systems like Microsoft Graph require a synchronous "challenge" handshake where the provider sends a validation token in the query string and your server must echo it back in plain text within seconds. If your generic webhook handler only accepts signed JSON POST bodies, the validation request (a GET for some providers, a POST carrying the token in the query string for others) will fail, and the webhook will never activate.
Thin Payloads and Rate Limit Traps
The thin payload problem is especially painful. Instead of sending the updated record, the provider sends a payload containing nothing but {"event": "contact_updated", "id": "8675309"}. To do anything useful, your system must immediately turn around and make a GET request to fetch the full record.
sequenceDiagram
participant Provider as Third-Party<br>Provider
participant Handler as Your Webhook<br>Handler
participant API as Provider API
participant App as Your Application
Provider->>Handler: POST /webhook {employee_id: "123", event: "updated"}
Note over Handler: Thin payload - no actual data
Handler->>API: GET /employees/123 (fetch full record)
API-->>Handler: {name: "Jane", title: "VP Sales", ...}
Handler->>App: Forward enriched event
Note over Handler: But what if the API call fails?<br>Rate limited? Auth expired?

If a customer bulk-updates 10,000 contacts, you receive 10,000 thin webhooks, resulting in 10,000 immediate GET requests. This creates a self-inflicted DDoS attack that instantly triggers the provider's rate limits and gets your API token temporarily banned. Your "real-time" webhook handler, which should be fast and stateless, is now a slow, stateful API client subject to rate limits, auth token expiry, and network failures.
The Fan-Out Routing Problem
Modern APIs typically allow you to register a unique webhook URL per connected account. Legacy APIs often force you to register a single, global webhook URL for your entire developer application. Events for Customer A and Customer B arrive at the exact same endpoint. Your infrastructure must inspect the payload, extract a tenant identifier (like a company_id), query your database to find the matching OAuth token, and then route the event to the correct internal queue. When you are running a multi-tenant SaaS platform, this routing logic alone becomes a significant source of bugs and operational overhead.
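A minimal sketch of that routing step, assuming the provider puts a `company_id` field at the top level of the payload and using an in-memory map as a stand-in for your tenant database. The names `TenantRecord`, `routeInboundEvent`, and the queue names are hypothetical and exist only for illustration.

```typescript
// Hypothetical tenant record; in production this would live in your database.
interface TenantRecord {
  tenantId: string;
  queueName: string;
}

type RouteResult =
  | { status: 'routed'; tenantId: string; queueName: string }
  | { status: 'unroutable'; reason: string };

// Maps the provider-side identifier (assumed to be `company_id`) to a tenant.
const tenantsByCompanyId = new Map<string, TenantRecord>([
  ['acme-001', { tenantId: 'tenant_a', queueName: 'events-tenant-a' }],
  ['globex-002', { tenantId: 'tenant_b', queueName: 'events-tenant-b' }],
]);

function routeInboundEvent(payload: unknown): RouteResult {
  // A global endpoint receives everything, so never trust the payload shape.
  if (typeof payload !== 'object' || payload === null) {
    return { status: 'unroutable', reason: 'payload is not an object' };
  }
  const companyId = (payload as { [key: string]: unknown })['company_id'];
  if (typeof companyId !== 'string') {
    return { status: 'unroutable', reason: 'missing company_id' };
  }
  const tenant = tenantsByCompanyId.get(companyId);
  if (!tenant) {
    // Surface unknown tenants instead of silently dropping their events.
    return { status: 'unroutable', reason: `unknown company_id ${companyId}` };
  }
  return { status: 'routed', tenantId: tenant.tenantId, queueName: tenant.queueName };
}
```

Returning an explicit `unroutable` result, rather than throwing or dropping the event, makes it straightforward to count and alert on events that could not be matched to a tenant.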
Polling vs. Webhooks: Solving the Legacy API Dilemma
When a legacy API does not offer webhooks, polling is your only option. And polling gets a bad reputation for good reason. Zapier estimates that only 1.5% of polling requests find an update. That means 98.5% of your API calls return nothing new — pure waste against a finite rate limit budget.
But here is the thing: smart polling, done right, is a perfectly valid strategy. The goal is not to eliminate polling. It is to make polling behave like an event stream.
The "Virtual Webhook" Pattern: Incremental Polling as Event Stream
A virtual webhook is an architectural pattern where incremental polling is transformed into an event-driven data stream. Instead of fetching all records every cycle, you track a high-water mark (typically an updated_at timestamp) and only fetch records that changed since the last successful run.
The math is compelling. If a provider has 50,000 employee records but only 12 changed in the last hour, an incremental poll fetches 12 records instead of 50,000. That is a reduction of more than 99.9% in records fetched, with a corresponding drop in paginated API calls.
Here is what this looks like in practice:
{
"resource": "hris/employees",
"method": "list",
"query": {
"updated_at": {
"gt": "{{previous_run_date}}"
}
}
}

The previous_run_date is a cursor that tracks the last successful sync. On the very first run, it defaults to epoch (1970-01-01T00:00:00.000Z) to pull a full snapshot. Every subsequent run fetches only the delta. Once you fetch the changed records, you emit them as events - record:created, record:updated, record:deleted - downstream to your application, exactly as if a webhook had fired.
Your downstream application does not know — and should not care — whether an event originated from a real HTTP webhook or a virtual webhook generated by a polling cron job. The interface remains identical.
If the legacy API returns paginated data, your polling engine needs to handle spooling: fetch all pages, temporarily store the blocks, and combine them into a single, comprehensive event. Without this, a partial page failure midway through can leave your data in an inconsistent state. You should also implement exponential backoff to dynamically slow down polling frequency when the API returns 429 errors — hammering a rate-limited endpoint just guarantees a longer lockout.
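The polling loop can be sketched roughly as follows. This is an illustration under stated assumptions, not a production poller: `FetchPage` is a hypothetical function you would implement against the provider, records are assumed to carry `created_at` and `updated_at` as ISO 8601 UTC strings in one consistent format (so plain string comparison orders them correctly), and deletions and 429 backoff are left out because both depend on the provider.

```typescript
interface SourceRecord {
  id: string;
  created_at: string; // ISO 8601 UTC, e.g. 2024-01-05T00:00:00.000Z
  updated_at: string;
}

interface Page {
  records: SourceRecord[];
  nextPage: number | null; // null when there are no more pages
}

interface SyncEvent {
  type: 'record:created' | 'record:updated';
  record: SourceRecord;
}

// Hypothetical fetcher: returns one page of records changed after `updatedAfter`.
type FetchPage = (updatedAfter: string, page: number) => Promise<Page>;

const EPOCH = '1970-01-01T00:00:00.000Z';

async function pollOnce(fetchPage: FetchPage, previousRunDate: string = EPOCH) {
  // Spool every page first: if any page fails, the error propagates
  // before a single event is emitted or the cursor is advanced.
  const spooled: SourceRecord[] = [];
  let page: number | null = 0;
  while (page !== null) {
    const result: Page = await fetchPage(previousRunDate, page);
    spooled.push(...result.records);
    page = result.nextPage;
  }

  // Records created after the cursor are new; the rest are updates.
  const events: SyncEvent[] = spooled.map((record) => ({
    type: record.created_at > previousRunDate ? 'record:created' : 'record:updated',
    record,
  }));

  // Advance the high-water mark to the newest updated_at actually seen.
  const nextCursor = spooled.reduce(
    (max, r) => (r.updated_at > max ? r.updated_at : max),
    previousRunDate
  );
  return { events, nextCursor };
}
```

The caller persists `nextCursor` only after the events have been handed off downstream, so a crash between fetching and emitting results in a repeated poll rather than a gap.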
When to Use Each Approach
| Factor | Real Webhooks | Incremental Polling (Virtual Webhooks) |
|---|---|---|
| Latency | Seconds | Minutes (depends on schedule) |
| API call efficiency | Zero wasted calls | Some waste, but minimized by delta queries |
| Works with legacy APIs | Only if provider supports it | Always |
| Reliability | Depends on provider retry policy | You control the schedule and retries |
| Complexity | Receiver infrastructure, verification, enrichment | Scheduler, cursor management, deduplication |
| Best for | High-frequency events, modern APIs | Legacy systems, APIs without webhook support |
The hybrid approach is the gold standard. Many integrations rely on webhooks for event-driven updates and fall back to periodic REST polling (with conditional requests) only as a safety net. If your integration requires "near real time" updates, the best practice is usually webhooks first plus occasional syncs via conditional requests to catch anything that might have been missed.
How to Architect a Unified Webhook Engine
If you are integrating with more than a handful of third-party systems, the per-provider approach falls apart fast. You end up with separate verification logic for each provider, custom payload parsers, and bespoke retry handling. The right architecture is a unified webhook engine — a centralized system that handles ingestion, verification, transformation, enrichment, and delivery for all providers through a single pipeline.
The cardinal rule: never process webhook business logic synchronously in the HTTP handler. Accept the webhook, respond with 200 immediately, and process asynchronously. A slow webhook handler that times out is worse than no handler at all — most providers will mark your endpoint as dead after a few consecutive timeouts.
Here is the architecture that works at scale:
graph TD
A[Legacy API Provider] -->|Raw Event Payload| B(Ingress Router)
B --> C{Verification Engine}
C -->|Challenge Request| D[Respond to Handshake]
C -->|Valid Signature| E[JSONata Transformation]
E --> F{Is Payload Thin?}
F -->|Yes| G[Fetch Full Resource <br> via Proxy API]
F -->|No| H[Normalized Event Payload]
G --> H
H --> I[(Object Storage / R2)]
I -->|Store Payload <br> Generate Claim-Check ID| J[Event Queue]
J --> K[Outbound Delivery Worker]
K -->|Signed HMAC Payload| L[Customer Application Endpoint]

Layer 1: Declarative Verification
Instead of writing if (provider === 'hubspot') { verifyHmac(...) } else if (provider === 'slack') { handleChallenge(...) }, define each provider's verification as configuration:
- HMAC: Specify the algorithm, the header containing the signature, and which parts of the payload to hash
- JWT: Specify the token location and verification key
- Basic Auth / Bearer: Simple credential comparison
- Challenge-Response: A JSONata expression that detects handshake requests and returns the expected response
The ingress router must support both POST requests (for actual events) and GET requests (for verification handshakes). All cryptographic comparisons should use timing-safe equality checks (crypto.timingSafeEqual in Node.js, crypto.subtle.timingSafeEqual in Cloudflare Workers, or an equivalent) to prevent side-channel attacks. This is a detail that most hand-rolled implementations miss.
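To make "verification as configuration" concrete, here is a minimal sketch covering just the HMAC and Bearer strategies. The config shape (`VerificationConfig`), the provider names, header names, and secrets are all hypothetical; JWT and challenge-response strategies are omitted for brevity. Only `createHmac` and `timingSafeEqual` from Node's `crypto` module are real APIs.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical config shapes: each provider is described as data, not code.
type VerificationConfig =
  | { kind: 'hmac'; algorithm: 'sha256' | 'sha1'; header: string; prefix?: string; secret: string }
  | { kind: 'bearer'; header: string; token: string };

interface InboundRequest {
  headers: { [name: string]: string | undefined }; // header names lowercased
  rawBody: Buffer;
}

// Constant-time string comparison. Length is checked first because
// timingSafeEqual throws when the two buffers differ in length.
function safeEqual(a: string, b: string): boolean {
  const bufA = Buffer.from(a, 'utf8');
  const bufB = Buffer.from(b, 'utf8');
  return bufA.length === bufB.length && timingSafeEqual(bufA, bufB);
}

function verifyRequest(config: VerificationConfig, req: InboundRequest): boolean {
  const provided = req.headers[config.header.toLowerCase()];
  if (!provided) return false;
  switch (config.kind) {
    case 'hmac': {
      const digest = createHmac(config.algorithm, config.secret)
        .update(req.rawBody)
        .digest('hex');
      return safeEqual(provided, (config.prefix ?? '') + digest);
    }
    case 'bearer':
      return safeEqual(provided, `Bearer ${config.token}`);
  }
}

// Adding a provider means adding an entry here, not a new code path.
const providers: { [name: string]: VerificationConfig } = {
  'provider-a': { kind: 'hmac', algorithm: 'sha256', header: 'x-signature', prefix: 'sha256=', secret: 'secret-a' },
  'provider-b': { kind: 'bearer', header: 'authorization', token: 'token-b' },
};
```

In a real system the secrets would come from a secret store rather than being inlined, and the config would typically be loaded per tenant.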
Layer 2: Transformation via JSONata
Once verified, the raw, provider-specific payload must be normalized. Hardcoding transformation logic creates massive technical debt. Using a declarative expression language like JSONata means adding a new provider is a configuration change, not a code change.
A JSONata mapping transforms a provider's raw event into a canonical format:
(
$action := $split(body.type, '.')[1];
$event_type := $lookup({
"created": "created",
"updated": "updated",
"deleted": "deleted"
}, $action);
{
"event_type": $event_type,
"resource": "hris/employees",
"method": "get",
"method_config": { "id": body.employee.id }
}
)

The output is always the same shape - resource, event_type, and enough information to fetch or construct the full record - regardless of which provider sent it.
Layer 3: Enrichment
When the transformation determines the payload is "thin" (containing only an ID), the engine pauses. It securely retrieves the OAuth credentials for that specific tenant, handles any necessary token refreshes, and makes a direct proxy call to the legacy API to fetch the complete record. This ensures that the event pushed to your application always contains the full, actionable data model.
This is where a unified API layer earns its keep. The enrichment call goes through the same normalization pipeline as a regular API request, so the webhook payload matches the exact same schema your application gets from a direct API call. When the payload already contains the full record, this step is skipped entirely.
Layer 4: Reliable Delivery with the Claim-Check Pattern
Message queues (like AWS SQS or Cloudflare Queues) have strict message size limits — often around 256KB. Enterprise CRM payloads can easily exceed this. The claim-check pattern solves this:
- The normalized event payload is written to durable object storage (like AWS S3 or Cloudflare R2), keyed by a unique Event ID.
- A lightweight message containing only the Event ID and metadata is pushed to the queue.
- The outbound delivery worker picks up the message, retrieves the full payload from object storage, and attempts delivery to your application.
If your application is down and returns a 503, the worker leverages exponential backoff to retry delivery. Because the payload is safely stored in object storage, no data is lost during the outage. Outbound events should be signed with HMAC-SHA256 so the receiving application can verify authenticity — the signature and a timestamp should travel in HTTP headers, not the body.
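The three claim-check steps above can be sketched with in-memory stand-ins. This is an illustration only: the `Map` plays the role of object storage (S3 or R2), the array plays the role of the queue, and the function names and the 256KB constant are assumptions chosen to mirror the text.

```typescript
// In-memory stand-ins for object storage and a size-limited queue.
const MAX_QUEUE_MESSAGE_BYTES = 256 * 1024;
const objectStore = new Map<string, string>();
const queue: string[] = [];

interface ClaimCheck {
  eventId: string;
  tenantId: string;
  storedAt: number;
}

function enqueueWithClaimCheck(eventId: string, tenantId: string, payload: unknown): ClaimCheck {
  // 1. Write the full payload to durable storage, keyed by event ID.
  objectStore.set(eventId, JSON.stringify(payload));

  // 2. Enqueue only the lightweight reference and metadata.
  const claim: ClaimCheck = { eventId, tenantId, storedAt: Date.now() };
  const message = JSON.stringify(claim);
  if (new TextEncoder().encode(message).length > MAX_QUEUE_MESSAGE_BYTES) {
    throw new Error('claim check itself exceeds the queue size limit');
  }
  queue.push(message);
  return claim;
}

function dequeueAndResolve(): { claim: ClaimCheck; payload: unknown } | null {
  const message = queue.shift();
  if (message === undefined) return null;

  // 3. Redeem the claim check: fetch the full payload from storage.
  const claim = JSON.parse(message) as ClaimCheck;
  const stored = objectStore.get(claim.eventId);
  if (stored === undefined) {
    throw new Error(`payload missing for event ${claim.eventId}`);
  }
  // The stored payload is deliberately not deleted here, so a failed
  // delivery can be retried by re-reading the same object.
  return { claim, payload: JSON.parse(stored) };
}
```

Because the queue message size is constant regardless of payload size, a 1 MB CRM record and a 200-byte ping travel through the queue identically.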
Signature Verification: Copy-Paste Examples
Every webhook handler must verify that inbound payloads actually came from the expected provider. Skipping verification means any attacker who discovers your webhook URL can inject fake events into your pipeline. Here are the three patterns you will encounter most often.
HMAC-SHA256 Verification (Node.js)
Most modern providers (GitHub, Stripe, HiBob) sign payloads with HMAC-SHA256. The provider computes a hash over the raw request body using a shared secret, then sends the hash in a header. Your handler recomputes the hash and compares.
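import crypto from 'crypto';
import express from 'express';

function verifyHmacSignature(
  rawBody: Buffer,
  signatureHeader: string | undefined,
  secret: string,
  algorithm: string = 'sha256'
): boolean {
  // A missing signature header is an automatic failure
  if (!signatureHeader) return false;
  const expected = crypto
    .createHmac(algorithm, secret)
    .update(rawBody)
    .digest('hex');
  // Strip the "sha256=" prefix that some providers (e.g. GitHub) prepend
  const provided = signatureHeader.replace(/^sha256=/, '');
  // Always use timing-safe comparison to prevent side-channel attacks.
  // timingSafeEqual throws on unequal lengths, so check length first.
  const a = Buffer.from(provided, 'utf8');
  const b = Buffer.from(expected, 'utf8');
  if (a.length !== b.length) return false;
  return crypto.timingSafeEqual(a, b);
}

// Express does not expose the raw body by default, so capture the exact
// bytes in the JSON parser's verify hook before any parsing happens.
const app = express();
app.use(express.json({
  verify: (req: any, _res, buf) => { req.rawBody = buf; },
}));
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET as string;

// Usage in an Express handler
app.post('/webhook/:provider', (req: any, res) => {
  const signature = req.headers['x-hub-signature-256'] as string | undefined;
  if (!verifyHmacSignature(req.rawBody, signature, WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  // Signature valid - enqueue for async processing
  res.status(200).json({ received: true });
});

Critical detail: You must verify against the raw request body (the exact bytes received over the wire), not a re-serialized JSON object. Parsing JSON and re-stringifying it can change whitespace, key ordering, or Unicode escaping - any of which will produce a different hash.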
JWT Verification
Some providers (Microsoft Graph, certain enterprise APIs) send a JWT in the Authorization header or as a query parameter. You verify the token's signature against the provider's public key.
import jwt from 'jsonwebtoken';
async function verifyJwtWebhook(
token: string,
publicKey: string
): Promise<{ valid: boolean; payload?: any }> {
try {
const decoded = jwt.verify(token, publicKey, {
algorithms: ['RS256'],
clockTolerance: 30, // Allow 30s clock skew
});
return { valid: true, payload: decoded };
} catch (err) {
return { valid: false };
}
}

Challenge-Response Handshake
Slack, Microsoft Graph, and other providers verify your endpoint by sending a challenge request before delivering real events. Your handler must detect these and respond correctly, or the webhook subscription will never activate.
app.post('/webhook/slack', (req, res) => {
  const { type, challenge } = req.body;
  // Slack sends a url_verification event during setup
  if (type === 'url_verification') {
    return res.status(200).json({ challenge });
  }
  // For real events, verify the signature first
  const signature = req.headers['x-slack-signature'] as string | undefined;
  const timestampHeader = req.headers['x-slack-request-timestamp'] as string | undefined;
  const timestamp = Number(timestampHeader); // NaN when the header is missing
  // Reject missing headers and requests older than 5 minutes (replay protection)
  if (!signature || !Number.isFinite(timestamp) ||
      Math.abs(Date.now() / 1000 - timestamp) > 300) {
    return res.status(401).json({ error: 'Missing signature or request too old' });
  }
  // req.rawBody must hold the exact request bytes captured by your body parser
  const sigBasestring = `v0:${timestampHeader}:${req.rawBody}`;
  const mySignature = 'v0=' + crypto
    .createHmac('sha256', SLACK_SIGNING_SECRET)
    .update(sigBasestring)
    .digest('hex');
  const expectedBuf = Buffer.from(mySignature);
  const providedBuf = Buffer.from(signature);
  // timingSafeEqual throws on unequal lengths, so compare lengths first
  if (expectedBuf.length !== providedBuf.length ||
      !crypto.timingSafeEqual(expectedBuf, providedBuf)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  res.status(200).json({ ok: true });
  // Process event asynchronously...
});

The key insight: your webhook ingress must support multiple verification strategies simultaneously. Hardcoding one provider's verification logic into your handler is fine for a single integration. At five or more providers, you need a declarative verification layer where each provider's strategy is defined as configuration, not code.
Idempotency and Event Ordering
Webhooks are delivered at-least-once, not exactly-once. No webhook provider can guarantee exactly-once delivery, because exactly-once delivery is a proven impossibility in distributed systems. Providers retry on timeout, network blips cause duplicate deliveries, and queue-based processing can replay messages. Without idempotency keys, your system may mistakenly treat each retry as a new request, leading to duplicate orders or inventory miscounts.
Designing Idempotency Keys
An idempotency key uniquely identifies a single logical event. The best key combines enough fields to be globally unique without being so specific that legitimate retries are treated as new events.
Recommended key structure:
{provider}:{event_type}:{resource_id}:{event_id}
For example: hubspot:contact.updated:contact_8675309:evt_abc123
Extract the provider event ID (event.id, X-GitHub-Delivery, MessageSid, etc.). If the provider includes a unique event ID in the payload, use it. If not, fall back to a composite of resource ID and event timestamp. Keys should be collision-resistant and stable across retries. Do not include volatile fields like Date headers or random salts.
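A small sketch of that key derivation, including the fallback to a timestamp when the provider supplies no event ID. The `InboundEvent` shape and the `buildIdempotencyKey` name are assumptions made for this example.

```typescript
interface InboundEvent {
  provider: string;
  eventType: string;
  resourceId: string;
  eventId?: string;    // provider-assigned unique event ID, if any
  occurredAt?: string; // provider-supplied event timestamp, if any
}

function buildIdempotencyKey(event: InboundEvent): string {
  // Prefer the provider's event ID; fall back to the event timestamp.
  // Both are stable across retries, unlike arrival time or random salts.
  const discriminator = event.eventId ?? event.occurredAt;
  if (!discriminator) {
    throw new Error('cannot derive a stable idempotency key: no event ID or timestamp');
  }
  return [event.provider, event.eventType, event.resourceId, discriminator].join(':');
}
```

Throwing when neither field is present is a deliberate choice in this sketch: silently generating a random key would make every retry look like a new event.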
Lookup and Deduplication
Store processed idempotency keys in a fast-lookup store with a TTL. Most systems set a TTL (Stripe uses 24 hours; many systems use 7 days). Pick a TTL longer than the longest legitimate retry window and prune older entries.
async function processWebhookEvent(
event: WebhookEvent,
store: IdempotencyStore
) {
const key = [
event.provider,
event.event_type,
event.resource_id,
event.event_id,
].join(':');
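// Note: exists() followed by set() is not atomic. Two concurrent deliveries of
// the same event can both pass this check, so prefer an atomic set-if-absent
// operation when your store provides one.
const alreadyProcessed = await store.exists(key);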
if (alreadyProcessed) {
console.log(`Duplicate event skipped: ${key}`);
return { status: 'duplicate', key };
}
// Process the event
await handleEvent(event);
// Mark as processed with a 72-hour TTL
await store.set(key, { processed_at: Date.now() }, { ttl: 72 * 60 * 60 });
return { status: 'processed', key };
}

Handling Out-of-Order Events
Events arrive out of order more often than you would expect. A contact.updated event triggered at 2:00:01 PM can arrive before a contact.created event triggered at 2:00:00 PM, especially when providers use multiple delivery servers.
The simple solution: attach a version or updated_at timestamp to each record and reject stale updates.
// Minimal record shape for this example, named SyncRecord to avoid
// clashing with TypeScript's built-in Record<K, V> utility type
interface SyncRecord { id: string; [field: string]: unknown; }

async function applyUpdate(
  record: SyncRecord,
  incomingVersion: string
) {
  const existing = await db.get(record.id);
  // String comparison is only valid when both values are ISO 8601 UTC
  // timestamps in the same format
  if (existing && existing.updated_at >= incomingVersion) {
    // Incoming event is older than (or the same as) current state - skip it
    return { status: 'stale', skipped: true };
  }
  await db.upsert(record.id, {
    ...record,
    updated_at: incomingVersion,
  });
  return { status: 'applied' };
}

If the provider does not include a reliable version indicator, use the event's arrival timestamp as a last resort - but be aware this is imperfect. For critical data, a periodic reconciliation job that fetches the full record and overwrites local state is the only way to guarantee eventual consistency.
JSONata Transforms for Common Webhook Shapes
Legacy APIs send webhooks in wildly different formats. JSONata expressions let you normalize them into a consistent event structure without writing provider-specific code. Here are the patterns you will encounter most often.
Pattern 1: Dot-Notation Event Types
Many providers encode the resource and action in a single string like employee.created or contact.property_changed.
(
$parts := $split(body.event_type, '.');
$action := $parts[1];
$event := $lookup({
"created": "created",
"updated": "updated",
"property_changed": "updated",
"removed": "deleted"
}, $action);
{
"event_type": $event,
"resource": "crm/contacts",
"raw_event_type": body.event_type,
"method": "get",
"method_config": $action != "removed" ? { "id": body.data.id }
}
)

The $lookup function acts as a switch statement - mapping provider-specific action names to your canonical event types. When a provider adds a new action, you add one line to the mapping object.
Pattern 2: Envelope-Style Payloads with Nested Data
Some providers wrap events in an envelope with metadata at the top level and the actual record nested inside a data or payload key.
(
$event := $lookup({
"INSERT": "created",
"UPDATE": "updated",
"DELETE": "deleted"
}, body.changeType);
{
"event_type": $event,
"resource": "hris/employees",
"raw_event_type": body.changeType,
"data": body.payload.current,
"raw_payload": body
}
)

When the data field is present in the output, the engine skips the enrichment API call and maps the supplied data directly through the response mapping. This is the efficient path - no extra API call needed.
Pattern 3: Batch Event Arrays
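Some APIs batch multiple events into a single webhook delivery. The JSONata expression must return an array so each event is processed independently. Be aware that JSONata unwraps a single-element sequence into a bare value, so a one-event batch yields an object unless your engine normalizes the result or the path forces an array (for example with body.events[].{ ... }).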
body.events.{
"event_type": $lookup({
"add": "created",
"modify": "updated",
"remove": "deleted"
}, action),
"resource": "ticketing/tickets",
"raw_event_type": action,
"method": "get",
"method_config": action != "remove" ? { "id": ticket_id }
}

The body.events.{ ... } syntax iterates over each element in the array, producing one output object per event. The engine processes each one independently - if enrichment fails for one event, the others still get delivered.
Recommended Retry and Backoff Policy
Not all retries are equal. Inbound webhook processing and outbound event delivery need different strategies because the failure modes are different.
Outbound Delivery (Your System to Customer Endpoint)
For most webhook systems, a sensible default is exponential backoff starting at 30 seconds, with the delay growing roughly 2-4x per attempt up to a maximum of 8 hours, full jitter applied, and a maximum of 6 to 8 attempts inside a 24-hour TTL. The schedule below uses 8 attempts.
| Attempt | Delay (base) | Cumulative Wait | Covers |
|---|---|---|---|
| 1 | Immediate | 0s | Initial delivery |
| 2 | 30s | 30s | Transient network blip |
| 3 | 1 min | ~1.5 min | Brief server hiccup |
| 4 | 4 min | ~5.5 min | Service restarting |
| 5 | 15 min | ~20 min | Deployment rollback |
| 6 | 1 hour | ~1.3 hours | Extended disruption |
| 7 | 4 hours | ~5.3 hours | Major outage |
| 8 | 8 hours (cap) | ~13.3 hours | Infrastructure incident |
| TTL | - | 24 hours | Drop and dead-letter |
Exponential backoff has a subtle problem. If many webhooks fail simultaneously (common during an outage), they will all retry at exactly the same intervals. When the endpoint recovers, it gets hit with a synchronized wave of retry traffic that can trigger another failure. This is the thundering herd problem, and jitter solves it. Add full jitter (randomize each delay between 0 and the calculated backoff value) to prevent this.
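As a rough sketch, full jitter can be layered on top of the base delays from the schedule table above. The function name and the injectable `random` parameter are assumptions for illustration; the delay values are copied from the table.

```typescript
// Base delays before retries 1..7, taken from the schedule table above
// (30s, 1 min, 4 min, 15 min, 1 hour, 4 hours, 8 hours).
const BASE_DELAYS_MS = [
  30_000, 60_000, 240_000, 900_000, 3_600_000, 14_400_000, 28_800_000,
];

// retryNumber is 1 for the first retry (attempt 2 in the table).
// `random` is injectable so the function can be tested deterministically.
function jitteredDelayMs(
  retryNumber: number,
  random: () => number = Math.random
): number | null {
  const base = BASE_DELAYS_MS[retryNumber - 1];
  // Schedule exhausted (or invalid retry number): caller should dead-letter.
  if (base === undefined) return null;
  // Full jitter: pick uniformly between 0 and the base delay.
  return Math.floor(random() * base);
}
```

With full jitter, two events that failed at the same instant will almost never retry at the same instant, which spreads the recovery load across the whole backoff window.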
Which HTTP status codes should trigger retries?
| Response | Action |
|---|---|
| 2xx | Success - acknowledge and move on |
| 429 | Rate limited - honor Retry-After header if present, otherwise use your backoff schedule |
| 5xx | Server error - retry with backoff |
| 4xx (except 429) | Client error - do not retry; these are configuration problems (bad URL, auth failure) |
| Timeout / connection refused | Retry with backoff |
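The table translates into a small classification function. This is a sketch under a few assumptions: a `null` status stands for a timeout or refused connection, only the delta-seconds form of `Retry-After` is parsed (the HTTP-date form falls back to the normal schedule), and anything outside the table, such as a 3xx response, is treated as a non-retryable failure.

```typescript
type RetryDecision =
  | { action: 'ack' }
  | { action: 'retry'; retryAfterMs?: number }
  | { action: 'fail' };

// `status` is null when there was no HTTP response (timeout, connection refused).
function classifyResponse(status: number | null, retryAfterHeader?: string): RetryDecision {
  if (status === null) return { action: 'retry' };
  if (status >= 200 && status < 300) return { action: 'ack' };

  if (status === 429) {
    // Only the delta-seconds form of Retry-After is handled in this sketch.
    const seconds =
      retryAfterHeader !== undefined && retryAfterHeader.trim() !== ''
        ? Number(retryAfterHeader)
        : NaN;
    if (Number.isFinite(seconds) && seconds >= 0) {
      return { action: 'retry', retryAfterMs: seconds * 1000 };
    }
    return { action: 'retry' }; // fall back to the normal backoff schedule
  }

  if (status >= 500) return { action: 'retry' };

  // Remaining 4xx responses are configuration problems. Responses the table
  // does not cover (1xx, 3xx) are also treated as failures here by assumption.
  return { action: 'fail' };
}
```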
Treat any non-zero dead-letter queue depth as a signal worth reviewing, and escalate when depth exceeds roughly 10 events - a small standing backlog is the early sign that a downstream handler or endpoint is broken. Also alert when the oldest event in the dead-letter queue has sat unreviewed for more than about an hour. Age catches the slow leak that depth alone misses.
Inbound Enrichment (Fetching Full Record from Provider)
| Attempt | Delay | Notes |
|---|---|---|
| 1 | Immediate | First try |
| 2 | 2s | Handles brief rate limit windows |
| 3 | 8s | Provider may be throttling |
| 4 | 30s | Respect rate limit reset window |
| Give up | - | Log failure, deliver event with partial data |
If the provider returns a 429 with a Retry-After header, honor that value instead of your own backoff schedule. If enrichment fails after all retries, deliver the event with whatever data you have (even if it is just the thin payload with a resource ID) rather than dropping it entirely. A partial event is better than a missing one.
Monitoring and Runbook: Detecting Silent Drops
The most dangerous webhook failure is the one you never notice. A provider silently stops sending events, your poller crashes without alerting, or an enrichment call starts returning empty responses. Your data goes stale and nobody knows until a customer complains.
Metrics to Track
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| inbound_events_per_provider (rate) | Volume of incoming webhooks per provider | Drop below 50% of 7-day rolling average |
| inbound_verification_failures (count) | Signature mismatches - possible secret rotation or attack | Sustained spike (>5% of inbound traffic) |
| enrichment_failure_rate (ratio) | API callbacks failing during thin-payload enrichment | >10% over 15 minutes |
| outbound_delivery_success_rate (ratio) | Events successfully delivered to customer endpoints | <95% over 1 hour |
| outbound_delivery_latency_p99 (duration) | Time from event receipt to successful customer delivery | >5 minutes |
| dead_letter_queue_depth (gauge) | Events that exhausted all retry attempts | Any non-zero value |
| virtual_webhook_last_run (timestamp) | Last successful polling job per provider | More than 2x the expected schedule interval |
The Silent Drop Problem
When a provider stops sending webhooks - because they revoked your subscription, hit an internal error, or silently deprecated an event type - you receive zero errors. There is nothing to alert on because nothing is happening. This is why volume-based alerting is non-negotiable.
Set up a baseline for each provider's expected event volume over a rolling 7-day window. When inbound volume drops below 50% of that baseline for more than 30 minutes, fire an alert. For high-volume providers (>1,000 events/day), tighten this to a 15-minute window.
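One way to express that rule is sketched below. It makes a simplifying assumption that traffic is spread evenly across the day, so the expected count for a window is just the daily average scaled down. Real traffic usually has daily and weekly cycles, so a production baseline would more likely compare against the same hour of previous days. All names are illustrative.

```python
from statistics import mean
from typing import Sequence

DROP_RATIO = 0.5             # alert below 50% of baseline
HIGH_VOLUME_PER_DAY = 1000   # providers above this get the tighter window


def alert_window_minutes(daily_counts: Sequence[int]) -> int:
    """15-minute window for high-volume providers, 30 minutes otherwise."""
    return 15 if mean(daily_counts) > HIGH_VOLUME_PER_DAY else 30


def expected_in_window(daily_counts: Sequence[int], window_minutes: int) -> float:
    """Baseline event count for one window, assuming evenly spread traffic."""
    return mean(daily_counts) * window_minutes / (24 * 60)


def volume_dropped(daily_counts: Sequence[int], observed_in_window: int) -> bool:
    """True when the trailing window saw less than half the expected volume."""
    window = alert_window_minutes(daily_counts)
    return observed_in_window < DROP_RATIO * expected_in_window(daily_counts, window)
```

Here `daily_counts` holds the last seven daily totals for one provider, and `observed_in_window` is the number of events received during the trailing window returned by `alert_window_minutes`.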
Replay Procedure for Missed Events
When you detect a gap - whether from a silent drop, a deployment incident, or a provider outage - here is the recovery playbook:
1. Identify the gap window. Check your event logs to find the last successfully processed event timestamp per provider and resource type.
2. Run a targeted incremental poll. Execute a one-off sync job with updated_at > {last_known_timestamp}. This is where virtual webhooks pay for themselves - the same polling infrastructure you use for legacy APIs becomes your disaster recovery tool.
3. Deduplicate against existing events. The incremental poll may return records that were already processed via webhook before the gap. Your idempotency layer handles this automatically.
4. Verify completeness. Compare record counts between your system and the provider's API (if they expose a total count or a list endpoint with filtering). Flag any discrepancies for manual review.
5. Re-register the webhook if needed. Some providers require you to re-create the webhook subscription after prolonged failures. Check the provider's webhook management API or dashboard.
Automate step 2. The best teams do not wait for a human to trigger replay. They run a lightweight reconciliation job on a daily schedule that compares local record counts and timestamps against the provider, and automatically kicks off an incremental sync for any resource that looks out of date.
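The poll-and-deduplicate part of the replay procedure can be sketched as a single function. This assumes a caller-supplied `fetch_updated_since` that wraps the provider's updated_at filter, and it uses the pair (record ID, updated_at) as a stand-in idempotency key, since polled records carry no provider event ID. Both choices are assumptions for illustration.

```python
from typing import Callable, Iterable, Set, Tuple


def replay_gap(last_known_timestamp: str,
               fetch_updated_since: Callable[[str], Iterable[dict]],
               seen_keys: Set[Tuple[str, str]],
               emit: Callable[[dict], None]) -> int:
    """Re-poll records changed since the gap began and emit the unseen ones.

    Returns the number of records replayed.
    """
    replayed = 0
    for record in fetch_updated_since(last_known_timestamp):
        key = (record["id"], record["updated_at"])
        if key in seen_keys:
            continue            # already processed via webhook before the gap
        seen_keys.add(key)
        emit(record)
        replayed += 1
    return replayed
```

Running the same function on a schedule, with the last reconciled timestamp as input, gives you the automated reconciliation job described above. A second run over the same window emits nothing, which is what makes it safe to automate.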
The Truto Approach: Declarative Legacy API Integration
Building this architecture in-house requires months of engineering and introduces severe maintenance overhead. Truto handles this entirely through a generic execution engine driven by declarative configuration — zero integration-specific code in the codebase.
Handling Webhooks Declaratively
Instead of writing custom Node.js handlers for every legacy system, Truto uses YAML-based Unified Models. When a webhook arrives, Truto executes a JSONata mapping block. Here is a conceptual example of how Truto normalizes a legacy CRM webhook:
webhooks:
  legacy_crm: |
    (
      $event_type := $mapValues(body.action, {
        "insert": "created",
        "update": "updated",
        "delete": "deleted"
      });
      body.{
        "event_type": $event_type,
        "raw_event_type": action,
        "raw_payload": $,
        "resource": "crm/contacts",
        "method": "get",
        "method_config": action != "delete" ? {
          "id": data.contact_id
        }
      }
    )

If the method_config block is present, Truto automatically knows the payload is thin. It halts, fetches the full crm/contacts resource from the provider, enriches the payload, and then delivers a standardized record:updated event to your application.
Truto supports two ingestion paths: per-account webhooks (where the provider sends events to an account-specific URL) and environment-level webhooks (where a single URL serves all accounts). For the latter, Truto uses a context_lookup_field configuration to automatically extract the tenant ID from the payload, match it to the correct integrated account, and fan the event out to the right destination — requiring zero routing logic on your end.
RapidBridge for Virtual Webhooks
For older APIs that completely lack webhooks, Truto provides RapidBridge. RapidBridge implements the virtual webhook pattern through declarative Sync Jobs that run on a schedule.
Here is a practical example — syncing tickets and their comments from a legacy ticketing system, incrementally:
{
  "integration_name": "zendesk",
  "resources": [
    {
      "resource": "ticketing/tickets",
      "method": "list",
      "query": {
        "updated_at": { "gt": "{{previous_run_date}}" }
      }
    },
    {
      "resource": "ticketing/comments",
      "method": "list",
      "depends_on": "ticketing/tickets",
      "query": {
        "ticket_id": "{{resources.ticketing.tickets.id}}"
      }
    }
  ]
}

The depends_on directive means comments are only fetched for tickets that actually changed. No wasted calls. If the legacy API returns paginated data, Truto uses spool nodes to fetch all pages, temporarily store them, and combine them into a single comprehensive webhook event. The output is delivered as webhook events in the same record:created / record:updated format as real-time webhooks, giving your engineering team the exact same event-driven developer experience regardless of how archaic the underlying API might be.
The honest trade-off: Truto's unified approach works well when your integration needs align with supported unified models (CRM, HRIS, ATS, ticketing, accounting, etc.). For deeply custom or proprietary API workflows that do not map to any standard data model, you may still need the proxy API for direct, unmapped access or a custom resource configuration. A unified API is not a silver bullet; it is a force multiplier for the 80% of integration work that follows common patterns.
When virtual webhooks meet real ones. Some providers only support webhooks for certain event types. A CRM might send contact.created events via webhook but require polling for deal.stage_changed. The best architectures handle both paths and merge them into a single event stream for your application.
The Decision Framework: Choosing Your Architecture
Before you start building, map your integration landscape against these questions:
- Does the provider offer webhooks? If yes, use them as the primary path. If no, implement incremental polling.
- Are the webhooks reliable? Check: Does the provider retry failed deliveries? Do they send full payloads or thin notifications? Is there signature verification? If any answer is "no," you need enrichment and reconciliation layers.
- How many providers are you integrating? One or two? Direct integration is fine. Five or more? Build or buy a unified engine. The maintenance cost of bespoke connectors grows linearly — or worse — with each new provider.
- What is your latency budget? Seconds? You need webhooks with a fast delivery pipeline. Minutes? Incremental polling on a schedule works. Hours? Batch is acceptable, but you are leaving money on the table.
- What are the rate limits? Map every provider's limits before designing your polling frequency. A single integration that consumes 60% of a customer's API quota will get you fired from the deal.
What to Build Next
The gap between "we have a webhook endpoint" and "we have reliable, real-time data sync across 20 providers" is enormous. Here is how to close it without drowning in technical debt:
- Start with the hybrid pattern. Use webhooks where available, incremental polling where not, and a reconciliation job that runs daily to catch anything both paths missed.
- Invest in normalization early. The longer you wait to standardize event formats across providers, the more application code you write that is tightly coupled to provider-specific schemas.
- Monitor delivery health. Track failure rates per webhook subscription. Auto-disable endpoints that have been failing for days — they are just burning compute and filling your retry queue.
- Separate ingestion from processing. Accept the webhook, respond with 200 immediately, and process asynchronously. Your HTTP handler's only job is to acknowledge receipt and enqueue.
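The last item above, separating ingestion from processing, can be sketched with an in-process queue. In production this would typically be a durable queue, and signature verification would run before the enqueue; both are left out here to keep the shape visible. The function names are hypothetical.

```python
import json
import queue
import uuid
from typing import Callable

event_queue: "queue.Queue[dict]" = queue.Queue()


def handle_webhook(raw_body: bytes, headers: dict) -> int:
    """HTTP handler body: enqueue the raw payload and acknowledge immediately."""
    event_queue.put({
        "id": str(uuid.uuid4()),
        "headers": dict(headers),
        "body": raw_body,
    })
    return 200  # status code returned to the provider


def process_next(handler: Callable[[dict], None]) -> bool:
    """Worker step: parse and process one queued event, if any is waiting."""
    try:
        message = event_queue.get_nowait()
    except queue.Empty:
        return False
    handler(json.loads(message["body"]))
    event_queue.task_done()
    return True
```

Parsing happens in the worker rather than the handler, so a malformed payload or a slow downstream call never delays the acknowledgement sent to the provider.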
If you are shipping integrations for a B2B product and the provider list keeps growing, the build-vs-buy calculus tips toward buying sooner than most teams expect. The engineering hours spent wrestling with provider-specific verification, pagination edge cases, and undocumented schema changes are hours not spent on your core product.
FAQ
- How do you sync data from an API that doesn't support webhooks?
- Use incremental polling (the 'virtual webhook' pattern). Track the last successful sync timestamp and query only for records updated since then, using a filter like updated_at > previous_run_date. Emit the changed records as events downstream, making your application agnostic to whether data arrived via webhook or poll.
- What is the biggest problem with webhook implementations in legacy APIs?
- Thin payloads. Many legacy providers send only an entity ID in the webhook body, forcing you to make a synchronous API callback to fetch the full record. This turns your fast, stateless webhook handler into a slow API client subject to rate limits and auth failures.
- How do you handle different webhook signature verification methods across providers?
- Use a declarative verification layer that defines each provider's method (HMAC, JWT, Basic Auth, challenge-response) as configuration rather than code. All cryptographic comparisons should use timing-safe equality checks to prevent side-channel attacks.
- What is the claim-check pattern for webhook delivery?
- Store the full webhook payload in object storage (like S3 or R2) keyed by event ID, then enqueue a lightweight message with only metadata. This decouples payload size from queue message limits and ensures payloads survive retries without data loss.
- When should I build my own integration engine vs. using a unified API?
- For 1-3 integrations, building in-house is manageable. Beyond 5-10 providers, the maintenance burden of bespoke connectors - each with unique auth, pagination, rate limits, and schemas - typically exceeds the cost of a unified API platform.