How to Handle Webhooks and Real-Time Data Sync from Legacy APIs
Legacy APIs break real-time sync with missing webhooks, thin payloads, and brutal rate limits. Here's how to architect reliable data pipelines without drowning in technical debt.
Legacy APIs are the silent saboteurs of your data pipeline. Your team wants real-time sync between your product and a customer's on-prem ERP or aging CRM instance—the kind of native connectivity your sales team actually asks for. But the API documentation is a PDF from 2017, there are no webhooks, and the rate limit is 100 requests per minute. So you build a poller, ship it, and move on. Six months later, that poller is eating 40% of your API quota, missing records, and a customer is on a call asking why their data is three hours stale.
This is not an edge case. It is the default experience for any team building direct integrations to legacy systems. What starts as a simple Jira ticket to sync data quickly mutates into an ongoing maintenance nightmare of undocumented edge cases, silent failures, and exhausted API quotas.
This guide covers how to architect reliable, real-time data sync from legacy APIs that either lack webhooks entirely or implement them so poorly they might as well not exist. We will break down the business cost of stale data, the technical failure modes of legacy systems, the polling-vs-webhook trade-off, and the architectural patterns — including virtual webhooks, the claim-check pattern, and unified webhook engines — that actually hold up in production.
The Hidden Cost of Stale Data in the Enterprise
Late data is not just an engineering annoyance — it is a measurable financial drain. Gartner estimates that poor data quality costs organizations an average of $15 million per year. And the bleeding is constant: B2B contact data decays at approximately 2.1% per month according to MarketingSherpa, meaning over 22% of a database becomes outdated within a year.
The revenue impact hits harder than most executives realize. According to a report from Validity, 76% of respondents characterize their CRM data quality as either "good" or "very good," yet a staggering 44% of respondents estimated their company loses over 10% of annual revenue due to poor data quality. That gap between perception and reality is where real money evaporates. Research shows that 50% of workers' time is spent finding, correcting, and confirming inaccurate data — time that should be spent closing deals or shipping features.
The consequences cascade across the organization:
- Revenue Loss: Decaying data directly impacts sales targeting and pipeline velocity.
- Operational Friction: Teams spend hours manually verifying records across disparate systems.
- Missed SLAs: Critical automated workflows fail to trigger when prerequisite data is delayed.
- Engineering Drain: Developers abandon core product work to troubleshoot broken batch jobs.
When a B2B SaaS application captures a critical engagement signal — a signed contract, a high-intent product action, or a support escalation — that state change must propagate to the source of truth immediately. Batch sync, the traditional approach of running a nightly ETL job, guarantees your users are operating on outdated context.
If your product captures a lead score change at 2 PM but your batch job runs at midnight, the sales rep picking up the phone the next morning is working with 10-hour-old data. In fast-moving enterprise sales cycles, that is the difference between a warm lead and a cold one. If you are building real-time CRM syncs at enterprise scale, the data freshness problem compounds with every hour of delay.
"Real-time" does not mean "instant." Between event delivery, your queue, retries, and rate-limit backoff, real-time usually means seconds, sometimes minutes, and occasionally "we will reconcile later." If your stakeholders expect a hard 200ms SLA, reset those expectations before writing a line of code.
Why Legacy APIs Break Real-Time Sync Architectures
Modern SaaS APIs from Stripe, GitHub, and Slack ship with well-documented webhooks, signature verification, and retry policies. Legacy systems — your customer's on-prem NetSuite instance, a decade-old HRIS, or a vertical SaaS tool with a SOAP API — actively fight real-time sync patterns.
Here is what breaks:
No webhook support at all. Many legacy systems simply cannot push events. Keeping external systems in sync with data from platforms like NetSuite is a common requirement, but while external integrations that periodically pull or push data are well-documented, the opposite approach — proactively pushing updates via webhooks — is often overlooked. NetSuite does not offer native webhooks out of the box. To get real-time event notifications, your team must write, deploy, and maintain custom SuiteScript RESTlets directly inside the customer's NetSuite environment. NetSuite also imposes severe concurrency restrictions — standard accounts typically allow a maximum of 5 simultaneous requests. If your system fires off six concurrent updates, the sixth request does not neatly queue; it fails or returns a rejection error.
Brutal rate limits. Salesforce enforces a 100,000 daily API request limit for Enterprise Edition orgs, plus 1,000 additional requests per user license. They also cap concurrent long-running requests to a maximum of 25. That sounds generous until you realize your marketing automation, support tools, BI platform, and your integration are all sharing the same quota. If you attempt to achieve "real-time" sync by polling Salesforce endpoints every few seconds across dozens of customer accounts, you will exhaust your API quota before lunch. One runaway poller can starve every other integration in the org.
Undocumented schemas and breaking changes. Legacy API documentation is often incomplete, outdated, or flat-out wrong. Fields appear and disappear between versions. Date formats change without notice. Pagination tokens expire silently. Every one of these becomes a production incident you did not plan for.
The "brittle connector" trap. When you build direct integrations in-house, each one starts as a manageable project. But by the time you are maintaining 10 or more, the maintenance burden cannibalizes core product development. You are not building features — you are babysitting API connections.
Enterprise iPaaS platforms like MuleSoft promise to solve this, but they come with their own pain. MuleSoft implementation timelines typically span 6-8 months, significantly delaying time-to-value. Implementation costs frequently exceed $100,000 for initial deployment. Companies migrating off legacy platforms like MuleSoft report 20-65% lower total cost of ownership and 4-10x faster development cycles, according to Workato. For most B2B SaaS companies, a six-month, six-figure iPaaS deployment is not a realistic path to shipping integrations fast.
The Webhook Wild West: Reliability and Verification Challenges
Some legacy APIs do support webhooks. The problem is that "support" is a generous word. As we have covered in our guide on designing reliable webhooks from production experience, every provider has a different approach to security, payload structure, and retry behavior.
Here is the reality of what you face across providers:
| Challenge | Example |
|---|---|
| Inconsistent verification | HiBob uses HMAC-SHA256; Slack requires a challenge handshake; Microsoft Graph needs JWT verification |
| Thin payloads | Many providers send only an entity ID, not the actual data. You get { "employee_id": "12345" } and need to call back to get the full record |
| Out-of-order delivery | Event A (created) arrives after Event B (updated). Your handler overwrites the newer state with the older one |
| No retry guarantees | Some providers fire the webhook once and forget it. If your endpoint was down for 30 seconds, that event is gone |
| Duplicate events | Others retry aggressively, sending the same event 3-5 times. Without idempotency, you process the same record multiple times |
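The duplicate-delivery row in the table above is the easiest to defend against. A minimal sketch of an idempotency guard, keyed on a provider event ID when one exists and a payload hash otherwise (the in-memory set and the `event_id` field name are illustrative — production systems would use Redis or a database with a TTL, and each provider names its ID field differently):

```python
import hashlib

class IdempotencyGuard:
    """Drops webhook events that have already been processed.

    An in-memory set stands in for what would normally be a Redis or
    database-backed store with a TTL.
    """

    def __init__(self):
        self._seen = set()

    def event_key(self, provider: str, event: dict) -> str:
        # Prefer a provider-supplied event ID; fall back to hashing the payload.
        raw = event.get("event_id") or repr(sorted(event.items()))
        return hashlib.sha256(f"{provider}:{raw}".encode()).hexdigest()

    def should_process(self, provider: str, event: dict) -> bool:
        key = self.event_key(provider, event)
        if key in self._seen:
            return False  # duplicate delivery - skip it
        self._seen.add(key)
        return True

guard = IdempotencyGuard()
event = {"event_id": "evt_42", "type": "contact.updated"}
print(guard.should_process("legacy_crm", event))  # True on first delivery
print(guard.should_process("legacy_crm", event))  # False on the retry
```

The same key doubles as a deduplication handle for out-of-order protection if you extend it to compare `updated_at` timestamps before writing.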
The Verification Handshake
Webhook signature verification is the cryptographic proof that a payload actually originated from the expected provider. There is no standard. One legacy API might use HMAC-SHA256, another might use a simple Bearer token, and systems like Microsoft Graph require a synchronous "challenge" handshake where the provider sends a validation token in the query string and your server must echo it back in plain text within seconds. If your generic webhook handler expects a JSON POST body, the initial GET request will fail, and the webhook will never activate.
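A single handler can cover both cases described above. This sketch assumes a Microsoft Graph-style `validationToken` query parameter for the handshake and an `X-Signature` header carrying a hex HMAC-SHA256 digest for event posts — both names vary by provider:

```python
import hashlib
import hmac

def handle_webhook_request(method, query, headers, raw_body, secret):
    """Handle both the one-time validation handshake and signed event posts."""
    # Handshake: the provider expects its token echoed back in plain text.
    if method == "GET" and "validationToken" in query:
        return 200, query["validationToken"]

    # Event delivery: verify HMAC-SHA256 over the *raw* body bytes,
    # using a timing-safe comparison.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers.get("X-Signature", "")):
        return 401, "invalid signature"
    return 200, "ok"
```

Note the verification runs over the raw request bytes, not a re-serialized JSON object — re-serialization reorders keys and breaks the digest.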
Thin Payloads and Rate Limit Traps
The thin payload problem is especially painful. Instead of sending the updated record, the provider sends a payload containing nothing but {"event": "contact_updated", "id": "8675309"}. To do anything useful, your system must immediately turn around and make a GET request to fetch the full record.
```mermaid
sequenceDiagram
    participant Provider as Third-Party<br>Provider
    participant Handler as Your Webhook<br>Handler
    participant API as Provider API
    participant App as Your Application
    Provider->>Handler: POST /webhook {employee_id: "123", event: "updated"}
    Note over Handler: Thin payload - no actual data
    Handler->>API: GET /employees/123 (fetch full record)
    API-->>Handler: {name: "Jane", title: "VP Sales", ...}
    Handler->>App: Forward enriched event
    Note over Handler: But what if the API call fails?<br>Rate limited? Auth expired?
```

If a customer bulk-updates 10,000 contacts, you receive 10,000 thin webhooks, resulting in 10,000 immediate GET requests. This creates a self-inflicted DDoS attack that instantly triggers the provider's rate limits and gets your API token temporarily banned. Your "real-time" webhook handler, which should be fast and stateless, is now a slow, stateful API client subject to rate limits, auth token expiry, and network failures.
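The enrichment callback therefore needs its own retry discipline. A minimal sketch, assuming `fetch_record` wraps the provider's GET endpoint and raises a `RateLimited` exception on HTTP 429 (both names are illustrative):

```python
import time

class RateLimited(Exception):
    """Raised by fetch_record when the provider returns HTTP 429."""

def enrich_thin_event(event, fetch_record, max_attempts=5, base_delay=1.0):
    """Turn a thin {id}-only payload into a full record, surviving 429s."""
    delay = base_delay
    for _ in range(max_attempts):
        try:
            return {**event, "data": fetch_record(event["id"])}
        except RateLimited:
            time.sleep(delay)  # exponential backoff before the next attempt
            delay *= 2
    raise RuntimeError("enrichment failed after retries; park event for reconciliation")
```

In a bulk-update storm, this retry loop should run inside a worker pulling from a queue, never inside the HTTP handler itself — otherwise every 429 stalls an inbound webhook response.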
The Fan-Out Routing Problem
Modern APIs typically allow you to register a unique webhook URL per connected account. Legacy APIs often force you to register a single, global webhook URL for your entire developer application. Events for Customer A and Customer B arrive at the exact same endpoint. Your infrastructure must inspect the payload, extract a tenant identifier (like a company_id), query your database to find the matching OAuth token, and then route the event to the correct internal queue. When you are running a multi-tenant SaaS platform, this routing logic alone becomes a significant source of bugs and operational overhead.
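The routing step itself is small; the operational risk is in the lookups around it. A sketch of the fan-out, with plain dicts standing in for the tenant-lookup and per-tenant queues that would normally be database- and queue-backed (the `company_id` field name is an example — each provider embeds its tenant identifier differently):

```python
def route_global_webhook(payload, tenant_index, queues):
    """Route an event arriving on a single global webhook URL to its tenant.

    tenant_index maps the provider's company_id to an internal account;
    queues maps account IDs to per-tenant event queues.
    """
    company_id = payload.get("company_id")
    account = tenant_index.get(company_id)
    if account is None:
        # Unknown tenant: still acknowledge the webhook upstream, but park
        # the event for review - otherwise the provider retries it forever.
        return None
    queues[account["id"]].append(payload)
    return account["id"]
```

The unknown-tenant branch matters more than it looks: a customer who disconnects their account keeps generating events until the provider-side subscription is torn down.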
Polling vs. Webhooks: Solving the Legacy API Dilemma
When a legacy API does not offer webhooks, polling is your only option. And polling gets a bad reputation for good reason. Zapier estimates that only 1.5% of polling requests find an update. That means 98.5% of your API calls return nothing new — pure waste against a finite rate limit budget.
But here is the thing: smart polling, done right, is a perfectly valid strategy. The goal is not to eliminate polling. It is to make polling behave like an event stream.
The "Virtual Webhook" Pattern: Incremental Polling as Event Stream
A virtual webhook is an architectural pattern where incremental polling is transformed into an event-driven data stream. Instead of fetching all records every cycle, you track a high-water mark (typically an updated_at timestamp) and only fetch records that changed since the last successful run.
The math is compelling. If a provider has 50,000 employee records but only 12 changed in the last hour, an incremental poll fetches 12 records instead of 50,000. That is a 99.98% reduction in data fetched.
Here is what this looks like in practice:
```json
{
  "resource": "hris/employees",
  "method": "list",
  "query": {
    "updated_at": {
      "gt": "{{previous_run_date}}"
    }
  }
}
```

The previous_run_date is a cursor that tracks the last successful sync. On the very first run, it defaults to epoch (1970-01-01T00:00:00.000Z) to pull a full snapshot. Every subsequent run fetches only the delta. Once you fetch the changed records, you emit them as events — record:created, record:updated, record:deleted — downstream to your application, exactly as if a webhook had fired.
Your downstream application does not know — and should not care — whether an event originated from a real HTTP webhook or a virtual webhook generated by a polling cron job. The interface remains identical.
If the legacy API returns paginated data, your polling engine needs to handle spooling: fetch all pages, temporarily store the blocks, and combine them into a single, comprehensive event. Without this, a partial page failure midway through can leave your data in an inconsistent state. You should also implement exponential backoff to dynamically slow down polling frequency when the API returns 429 errors — hammering a rate-limited endpoint just guarantees a longer lockout.
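One polling cycle of the virtual webhook pattern can be sketched as follows. `list_changed`, `cursor_store`, and `emit` are assumed interfaces: a wrapper around the provider's filtered list endpoint, a persistent key-value store for the high-water mark, and the downstream event sink, respectively:

```python
from datetime import datetime, timezone

EPOCH = "1970-01-01T00:00:00.000Z"

def run_sync(list_changed, cursor_store, emit):
    """One incremental polling cycle that emits webhook-style events."""
    since = cursor_store.get("previous_run_date", EPOCH)  # first run: full snapshot
    started_at = datetime.now(timezone.utc).isoformat()
    for record in list_changed(since):  # only the delta, not the full table
        emit({"event_type": "record:updated", "data": record})
    # Advance the cursor only after a fully successful run, so a crash
    # mid-cycle replays the window instead of silently dropping records.
    cursor_store["previous_run_date"] = started_at
```

Capturing `started_at` before the fetch, rather than after, deliberately overlaps consecutive windows — combined with downstream idempotency, a record updated mid-poll is delivered twice rather than never.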
When to Use Each Approach
| Factor | Real Webhooks | Incremental Polling (Virtual Webhooks) |
|---|---|---|
| Latency | Seconds | Minutes (depends on schedule) |
| API call efficiency | Zero wasted calls | Some waste, but minimized by delta queries |
| Works with legacy APIs | Only if provider supports it | Always |
| Reliability | Depends on provider retry policy | You control the schedule and retries |
| Complexity | Receiver infrastructure, verification, enrichment | Scheduler, cursor management, deduplication |
| Best for | High-frequency events, modern APIs | Legacy systems, APIs without webhook support |
The hybrid approach is the gold standard. Many integrations rely on webhooks for event-driven updates and fall back to periodic REST polling (with conditional requests) only as a safety net. If your integration requires "near real time" updates, the best practice is usually webhooks first plus occasional syncs via conditional requests to catch anything that might have been missed.
How to Architect a Unified Webhook Engine
If you are integrating with more than a handful of third-party systems, the per-provider approach falls apart fast. You end up with separate verification logic for each provider, custom payload parsers, and bespoke retry handling. The right architecture is a unified webhook engine — a centralized system that handles ingestion, verification, transformation, enrichment, and delivery for all providers through a single pipeline.
The cardinal rule: never process webhook business logic synchronously in the HTTP handler. Accept the webhook, respond with 200 immediately, and process asynchronously. A slow webhook handler that times out is worse than no handler at all — most providers will mark your endpoint as dead after a few consecutive timeouts.
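In code, the cardinal rule reduces to almost nothing — which is the point. A sketch of the ingestion boundary, with Python's in-process `queue.Queue` standing in for a real message broker:

```python
import json
import queue

event_queue = queue.Queue()  # stands in for SQS, Cloudflare Queues, etc.

def webhook_handler(raw_body: bytes) -> int:
    """HTTP handler body: acknowledge fast, defer all real work.

    Only parse-and-enqueue happens inline; everything slow (verification
    callbacks, enrichment, DB writes) runs in workers draining event_queue.
    """
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # malformed payload - tell the provider not to retry
    event_queue.put(event)  # hand off to the async worker
    return 200              # respond immediately; processing happens later
```

Anything beyond enqueueing — even the thin-payload enrichment call — belongs on the worker side of that queue.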
Here is the architecture that works at scale:
```mermaid
graph TD
    A[Legacy API Provider] -->|Raw Event Payload| B(Ingress Router)
    B --> C{Verification Engine}
    C -->|Challenge Request| D[Respond to Handshake]
    C -->|Valid Signature| E[JSONata Transformation]
    E --> F{Is Payload Thin?}
    F -->|Yes| G[Fetch Full Resource <br> via Proxy API]
    F -->|No| H[Normalized Event Payload]
    G --> H
    H --> I[(Object Storage / R2)]
    I -->|Store Payload <br> Generate Claim-Check ID| J[Event Queue]
    J --> K[Outbound Delivery Worker]
    K -->|Signed HMAC Payload| L[Customer Application Endpoint]
```

Layer 1: Declarative Verification
Instead of writing if (provider === 'hubspot') { verifyHmac(...) } else if (provider === 'slack') { handleChallenge(...) }, define each provider's verification as configuration:
- HMAC: Specify the algorithm, the header containing the signature, and which parts of the payload to hash
- JWT: Specify the token location and verification key
- Basic Auth / Bearer: Simple credential comparison
- Challenge-Response: A JSONata expression that detects handshake requests and returns the expected response
The ingress router must support both POST requests (for actual events) and GET requests (for verification handshakes). All cryptographic comparisons should use timing-safe equality checks (crypto.subtle.timingSafeEqual or equivalent) to prevent side-channel attacks. This is a detail that most hand-rolled implementations miss.
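A minimal sketch of the configuration-driven dispatch, assuming hypothetical provider entries and header names — the shape of the config is the point, not the specific schemes:

```python
import hashlib
import hmac

# Per-provider verification defined as data, not as branching code.
VERIFICATION_CONFIG = {
    "hibob_style":  {"type": "hmac",   "header": "X-Signature", "algorithm": "sha256"},
    "legacy_style": {"type": "bearer", "header": "Authorization"},
}

def verify(provider, headers, raw_body, secret: bytes) -> bool:
    cfg = VERIFICATION_CONFIG[provider]
    supplied = headers.get(cfg["header"], "")
    if cfg["type"] == "hmac":
        algo = getattr(hashlib, cfg["algorithm"])
        expected = hmac.new(secret, raw_body, algo).hexdigest()
    elif cfg["type"] == "bearer":
        expected = f"Bearer {secret.decode()}"
    else:
        return False
    # Timing-safe comparison regardless of scheme.
    return hmac.compare_digest(expected, supplied)
```

Adding a provider becomes a one-line config entry; the JWT and challenge-response cases extend the same dispatch table rather than adding another `elif` chain per integration.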
Layer 2: Transformation via JSONata
Once verified, the raw, provider-specific payload must be normalized. Hardcoding transformation logic creates massive technical debt. Using a declarative expression language like JSONata means adding a new provider is a configuration change, not a code change.
A JSONata mapping transforms a provider's raw event into a canonical format:
```jsonata
(
  $action := $split(body.type, '.')[1];
  $event_type := $lookup({
    "created": "created",
    "updated": "updated",
    "deleted": "deleted"
  }, $action);
  {
    "event_type": $event_type,
    "resource": "hris/employees",
    "method": "get",
    "method_config": { "id": body.employee.id }
  }
)
```

The output is always the same shape — resource, event_type, and enough information to fetch or construct the full record — regardless of which provider sent it.
Layer 3: Enrichment
When the transformation determines the payload is "thin" (containing only an ID), the engine pauses. It securely retrieves the OAuth credentials for that specific tenant, handles any necessary token refreshes, and makes a direct proxy call to the legacy API to fetch the complete record. This ensures that the event pushed to your application always contains the full, actionable data model.
This is where a unified API layer earns its keep. The enrichment call goes through the same normalization pipeline as a regular API request, so the webhook payload matches the exact same schema your application gets from a direct API call. When the payload already contains the full record, this step is skipped entirely.
Layer 4: Reliable Delivery with the Claim-Check Pattern
Message queues (like AWS SQS or Cloudflare Queues) have strict message size limits — often around 256KB. Enterprise CRM payloads can easily exceed this. The claim-check pattern solves this:
- The normalized event payload is written to durable object storage (like AWS S3 or Cloudflare R2), keyed by a unique Event ID.
- A lightweight message containing only the Event ID and metadata is pushed to the queue.
- The outbound delivery worker picks up the message, retrieves the full payload from object storage, and attempts delivery to your application.
If your application is down and returns a 503, the worker leverages exponential backoff to retry delivery. Because the payload is safely stored in object storage, no data is lost during the outage. Outbound events should be signed with HMAC-SHA256 so the receiving application can verify authenticity — the signature and a timestamp should travel in HTTP headers, not the body.
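The two halves of the pattern — publish and deliver — can be sketched like this, with dicts and lists standing in for object storage and the queue, and illustrative `X-Timestamp` / `X-Signature` header names:

```python
import hashlib
import hmac
import json
import time
import uuid

object_store = {}    # stands in for S3 / R2
delivery_queue = []  # stands in for SQS / Cloudflare Queues

def publish_event(payload: dict) -> str:
    """Claim-check: park the full payload in object storage, queue only the ID."""
    event_id = str(uuid.uuid4())
    object_store[event_id] = json.dumps(payload)
    delivery_queue.append({"event_id": event_id})  # tiny message, never near 256KB
    return event_id

def deliver(message, signing_secret: bytes, send):
    """Worker: rehydrate the payload and deliver it with an HMAC signature."""
    body = object_store[message["event_id"]]
    timestamp = str(int(time.time()))
    signature = hmac.new(
        signing_secret, f"{timestamp}.{body}".encode(), hashlib.sha256
    ).hexdigest()
    # Signature and timestamp travel in headers, not the body.
    send(body, headers={"X-Timestamp": timestamp, "X-Signature": signature})
```

Signing `timestamp.body` rather than the body alone lets the receiver reject stale replays as well as tampered payloads.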
The Truto Approach: Declarative Legacy API Integration
Building this architecture in-house requires months of engineering and introduces severe maintenance overhead. Truto handles this entirely through a generic execution engine driven by declarative configuration — zero integration-specific code in the codebase.
Handling Webhooks Declaratively
Instead of writing custom Node.js handlers for every legacy system, Truto uses YAML-based Unified Models. When a webhook arrives, Truto executes a JSONata mapping block. Here is a conceptual example of how Truto normalizes a legacy CRM webhook:
```yaml
webhooks:
  legacy_crm: |
    (
      $event_type := $mapValues(body.action, {
        "insert": "created",
        "update": "updated",
        "delete": "deleted"
      });
      body.{
        "event_type": $event_type,
        "raw_event_type": action,
        "raw_payload": $,
        "resource": "crm/contacts",
        "method": "get",
        "method_config": action != "delete" ? {
          "id": data.contact_id
        }
      }
    )
```

If the method_config block is present, Truto automatically knows the payload is thin. It halts, fetches the full crm/contacts resource from the provider, enriches the payload, and then delivers a standardized record:updated event to your application.
Truto supports two ingestion paths: per-account webhooks (where the provider sends events to an account-specific URL) and environment-level webhooks (where a single URL serves all accounts). For the latter, Truto uses a context_lookup_field configuration to automatically extract the tenant ID from the payload, match it to the correct integrated account, and fan the event out to the right destination — requiring zero routing logic on your end.
RapidBridge for Virtual Webhooks
For older APIs that completely lack webhooks, Truto provides RapidBridge. RapidBridge implements the virtual webhook pattern through declarative Sync Jobs that run on a schedule.
Here is a practical example — syncing tickets and their comments from a legacy ticketing system, incrementally:
```json
{
  "integration_name": "zendesk",
  "resources": [
    {
      "resource": "ticketing/tickets",
      "method": "list",
      "query": {
        "updated_at": { "gt": "{{previous_run_date}}" }
      }
    },
    {
      "resource": "ticketing/comments",
      "method": "list",
      "depends_on": "ticketing/tickets",
      "query": {
        "ticket_id": "{{resources.ticketing.tickets.id}}"
      }
    }
  ]
}
```

The depends_on directive means comments are only fetched for tickets that actually changed. No wasted calls. If the legacy API returns paginated data, Truto uses spool nodes to fetch all pages, temporarily store them, and combine them into a single comprehensive webhook event. The output is delivered as webhook events in the same record:created / record:updated format as real-time webhooks — giving your engineering team the exact same event-driven developer experience regardless of how archaic the underlying API might be.
The honest trade-off: Truto's unified approach works well when your integration needs align with supported unified models (CRM, HRIS, ATS, ticketing, accounting, etc.). For deeply custom or proprietary API workflows that do not map to any standard data model, you may still need the proxy API for direct, unmapped access or a custom resource configuration. A unified API is not a silver bullet; it is a force multiplier for the 80% of integration work that follows common patterns.
When virtual webhooks meet real ones. Some providers only support webhooks for certain event types. A CRM might send contact.created events via webhook but require polling for deal.stage_changed. The best architectures handle both paths and merge them into a single event stream for your application.
The Decision Framework: Choosing Your Architecture
Before you start building, map your integration landscape against these questions:
- Does the provider offer webhooks? If yes, use them as the primary path. If no, implement incremental polling.
- Are the webhooks reliable? Check: Does the provider retry failed deliveries? Do they send full payloads or thin notifications? Is there signature verification? If any answer is "no," you need enrichment and reconciliation layers.
- How many providers are you integrating? One or two? Direct integration is fine. Five or more? Build or buy a unified engine. The maintenance cost of bespoke connectors grows linearly — or worse — with each new provider.
- What is your latency budget? Seconds? You need webhooks with a fast delivery pipeline. Minutes? Incremental polling on a schedule works. Hours? Batch is acceptable, but you are leaving money on the table.
- What are the rate limits? Map every provider's limits before designing your polling frequency. A single integration that consumes 60% of a customer's API quota will get you fired from the deal.
What to Build Next
The gap between "we have a webhook endpoint" and "we have reliable, real-time data sync across 20 providers" is enormous. Here is how to close it without drowning in technical debt:
- Start with the hybrid pattern. Use webhooks where available, incremental polling where not, and a reconciliation job that runs daily to catch anything both paths missed.
- Invest in normalization early. The longer you wait to standardize event formats across providers, the more application code you write that is tightly coupled to provider-specific schemas.
- Monitor delivery health. Track failure rates per webhook subscription. Auto-disable endpoints that have been failing for days — they are just burning compute and filling your retry queue.
- Separate ingestion from processing. Accept the webhook, respond with 200 immediately, and process asynchronously. Your HTTP handler's only job is to acknowledge receipt and enqueue.
If you are shipping integrations for a B2B product and the provider list keeps growing, the build-vs-buy calculus tips toward buying sooner than most teams expect. The engineering hours spent wrestling with provider-specific verification, pagination edge cases, and undocumented schema changes are hours not spent on your core product.
FAQ
- How do you sync data from an API that doesn't support webhooks?
- Use incremental polling (the 'virtual webhook' pattern). Track the last successful sync timestamp and query only for records updated since then, using a filter like updated_at > previous_run_date. Emit the changed records as events downstream, making your application agnostic to whether data arrived via webhook or poll.
- What is the biggest problem with webhook implementations in legacy APIs?
- Thin payloads. Many legacy providers send only an entity ID in the webhook body, forcing you to make a synchronous API callback to fetch the full record. This turns your fast, stateless webhook handler into a slow API client subject to rate limits and auth failures.
- How do you handle different webhook signature verification methods across providers?
- Use a declarative verification layer that defines each provider's method (HMAC, JWT, Basic Auth, challenge-response) as configuration rather than code. All cryptographic comparisons should use timing-safe equality checks to prevent side-channel attacks.
- What is the claim-check pattern for webhook delivery?
- Store the full webhook payload in object storage (like S3 or R2) keyed by event ID, then enqueue a lightweight message with only metadata. This decouples payload size from queue message limits and ensures payloads survive retries without data loss.
- When should I build my own integration engine vs. using a unified API?
- For 1-3 integrations, building in-house is manageable. Beyond 5-10 providers, the maintenance burden of bespoke connectors - each with unique auth, pagination, rate limits, and schemas - typically exceeds the cost of a unified API platform.