# Declarative Data Sync Pipelines: Ship Integrations as Config, Not Code
Replace brittle ETL scripts with declarative data sync pipelines. Learn how to use JSON config, incremental sync, and JSONata to ship integrations faster.
If you are spending engineering sprints writing Python or Node.js scripts to pull data from your customers' SaaS tools into your product, you already know the pattern. A new enterprise prospect needs Zendesk, so you write a Zendesk sync script. Then someone needs Jira. Then ServiceNow. Building imperative data pipelines means maintaining brittle state, handling endless edge cases, and updating code every time a vendor changes an endpoint.
Each script has its own pagination logic, its own incremental sync cursor, its own error handling, and its own maintenance burden. Six months later, you have a graveyard of brittle ETL scripts that nobody wants to touch. When an upstream provider deprecates a field, your team drops product work to push an emergency fix. This cycle makes scaling integrations a massive engineering bottleneck.
The pressure to support more platforms is a structural reality for modern software companies. The average company uses 106 SaaS applications in 2024, according to BetterCloud's State of SaaS report. Every one of those apps is a potential integration point your customers will ask about during a sales call. Furthermore, 60% of all SaaS sales cycles now involve integration discussions, and for 84% of customers, integrations are either very important or a dealbreaker. Customers with five or more integrations are up to 80% less likely to churn.
The math is simple: if each integration requires a dedicated sync script, and each script takes an engineer two to four weeks to build, test, and harden for production, your integration backlog will outpace your hiring plan by an order of magnitude. To capture this revenue without expanding engineering headcount, you must fundamentally change how you build integrations. You need to move away from imperative scripts and adopt declarative data sync pipelines.
This guide breaks down the architectural shift from imperative integration scripts to declarative sync pipelines, walks through the real mechanics of incremental sync, JSONata-powered transforms, and pagination spooling, and addresses the uncomfortable truth about API rate limits that most integration vendors gloss over.
## The Embedded Integration Bottleneck: Why Imperative ETL Scripts Fail at Scale
Building integrations in-house usually starts with a logical approach: you write a dedicated API client for each provider. A `HubSpotAdapter` class handles HubSpot's nested `properties` objects. A `SalesforceAdapter` class handles SOQL queries and PascalCase field names. Engineers write imperative logic to fetch, paginate, transform, and load the data.
This approach fails spectacularly at scale. Managing incremental loads, handling dependencies, and dealing with late-arriving data are all "undifferentiated heavy lifting" that distract from core business logic. As data sources evolve, imperative pipelines become brittle.
Here is what this looks like in practice - a condensed sketch in code follows the list. A typical imperative sync script for pulling tickets from Zendesk might contain:
- A `fetch_all_tickets()` function with manual cursor management
- A database table storing the last sync timestamp per customer account
- Custom retry logic for 429 errors (that will inevitably drift from Zendesk's actual behavior)
- A transformation layer mapping Zendesk's field names to your internal schema
- Scheduled execution via cron or a task queue
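To make the maintenance burden concrete, here is a hypothetical, condensed version of such a script in Node.js. The `db` helper, endpoint path, and field names are invented for illustration; the shape of the plumbing is the point:

```javascript
// Hypothetical imperative Zendesk sync - invented for illustration.
// Assumes a `db` helper for persistence and a ZENDESK_TOKEN env var.
async function syncZendeskTickets(accountId) {
  // Manual incremental-sync state: read our own cursor table
  const lastSync =
    (await db.get('sync_cursors', accountId)) ?? '1970-01-01T00:00:00Z';
  let url = `https://example.zendesk.com/api/v2/tickets.json?updated_since=${lastSync}`;

  while (url) {
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${process.env.ZENDESK_TOKEN}` },
    });
    // Hand-rolled rate limit handling that will drift from vendor behavior
    if (res.status === 429) {
      await new Promise((r) => setTimeout(r, 60_000));
      continue;
    }
    const body = await res.json();
    // Transformation layer: map vendor fields to the internal schema
    await db.insert(
      'tickets',
      body.tickets.map((t) => ({
        external_id: t.id,
        title: t.subject,
        updated_at: t.updated_at,
      }))
    );
    url = body.next_page; // manual cursor management
  }
  // Update the cursor only after the whole run succeeds
  await db.set('sync_cursors', accountId, new Date().toISOString());
}
```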
Multiply that by 20 integrations and you have thousands of lines of integration-specific code, each with its own edge cases and failure modes. Schema drift from upstream API changes silently breaks these scripts. An engineer leaves the company, and nobody remembers why the Salesforce sync has a special `sleep(3)` call on line 247. When you maintain separate code paths for each integration, adding a new CRM means writing new endpoint handler functions, creating new database schemas, adding conditional branches in shared code, and going through a full CI/CD deployment cycle.
Embedded ETL platforms like Hotglue attempt to solve this by giving you a platform to host these scripts, but they are still fundamentally code-first: you are still writing Python transformation scripts for every integration. At the other end of the spectrum, internal data warehousing tools like Airbyte or Fivetran offer zero-maintenance pipelines, but they are built for internal analytics, not embedded B2B SaaS use cases. Fivetran operates on a usage-based pricing model where costs are based on Monthly Active Rows (MAR) - the number of rows inserted, updated, or deleted. As of March 2025, MAR is calculated per connection, meaning each data source has its own MAR count. That per-connection calculation significantly raises costs for companies syncing multiple high-volume data sources across thousands of individual customer tenants.
To truly reduce technical debt from maintaining dozens of API integrations, you must abstract the integration logic out of your runtime code entirely.
## What Are Declarative Data Sync Pipelines?
Declarative data pipelines shift the engineering focus from how to execute a sync to what data needs to be synced.
A declarative data sync pipeline defines what data to fetch, from which resources, and in what order - without specifying the procedural steps for HTTP calls, pagination, authentication, or error handling.
In a declarative system, you define the desired end state using configuration data. The underlying execution engine reads this configuration and determines the most efficient execution plan. It handles the HTTP requests, pagination, authentication, and error formatting automatically. Adding a new integration becomes a data operation, not a code deployment.
Truto's RapidBridge is a configuration-driven engine that eliminates integration-specific code. With RapidBridge, you define a Sync Job as a JSON object. This object specifies which resources to fetch, how they depend on each other, and where to send the data. The same engine that syncs Zendesk tickets syncs Jira issues, ServiceNow incidents, or Asana tasks - because all integration-specific behavior lives as data-only configuration, not compiled code.
Here is an example of a declarative Sync Job definition that fetches Zendesk users, tickets, and the comments associated with each ticket:
```json
{
  "integration_name": "zendesk",
  "resources": [
    {
      "resource": "ticketing/users",
      "method": "list"
    },
    {
      "resource": "ticketing/tickets",
      "method": "list"
    },
    {
      "resource": "ticketing/comments",
      "method": "list",
      "depends_on": "ticketing/tickets",
      "query": {
        "ticket_id": "{{resources.ticketing.tickets.id}}"
      }
    }
  ]
}
```

Notice the `depends_on` key. Instead of writing nested for loops in Node.js to fetch comments for every ticket, you simply declare the dependency graph. The RapidBridge engine resolves the execution graph, replaces the `{{resources.ticketing.tickets.id}}` placeholder dynamically, and handles the orchestration. Comments cannot be fetched until tickets are available, and the engine enforces this execution order automatically.
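For intuition, here is a conceptual sketch - explicitly not RapidBridge's internals, which are not public - of how an engine might resolve such a graph, assuming resources are listed in dependency order and `fetchResource(name, query)` returns an array of records:

```javascript
// Conceptual sketch of declarative dependency resolution - not RapidBridge's
// actual implementation. Assumes `job.resources` is in dependency order.
async function runSyncJob(job, fetchResource) {
  const results = {};
  for (const node of job.resources) {
    if (!node.depends_on) {
      // Independent resources are fetched directly
      results[node.resource] = await fetchResource(node.resource, {});
      continue;
    }
    // Dependent resources run once per parent record, with the
    // {{resources...}} placeholder replaced by each parent's field value
    results[node.resource] = [];
    for (const parent of results[node.depends_on]) {
      const children = await fetchResource(node.resource, {
        ticket_id: parent.id, // substituted placeholder
      });
      results[node.resource].push(...children);
    }
  }
  return results;
}
```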
This architecture allows you to sync unified API data directly to a datastore or route it to a webhook endpoint, entirely driven by configuration. Compare this to code-first embedded ETL platforms where you write and maintain Python transformation scripts for each integration. Code-first gives you maximum flexibility; declarative gives you maximum leverage. At 5 integrations, the difference is marginal. At 50, it is the difference between a sustainable system and an engineering crisis.
## Mastering Incremental Sync: Escaping the Full-Refresh Trap
Full-refresh syncs - pulling every record from a source on every run - are the default in most hand-rolled integration scripts because they are the simplest to implement. They are also highly wasteful. If your customer has 500,000 Zendesk tickets and only 200 changed since your last sync, fetching all 500,000 every six hours is burning API quota, bandwidth, and processing time for no reason.
In production environments, you must implement incremental syncing to fetch only the records that have changed since the last successful run. The trick is reliable state tracking: you need to know when the last sync completed successfully and use that timestamp as a filter.
In imperative scripts, managing incremental state is notoriously difficult. You have to read the last sync timestamp from a database, pass it to the API request, handle timezone conversions, and update the timestamp only if the entire batch succeeds. If a script crashes halfway through, you risk data duplication or missed records.
Declarative pipelines handle state tracking automatically. RapidBridge exposes a system-managed context variable called `previous_run_date`. This variable stores the exact UTC timestamp of the last successful sync job for a specific integrated account.
To convert a full refresh into an incremental sync, you bind this variable to the upstream API's filter parameters:
```json
{
  "resource": "ticketing/tickets",
  "method": "list",
  "query": {
    "updated_at": {
      "gt": "{{previous_run_date}}"
    }
  }
}
```

That is the entire incremental sync implementation. No database migration for a `sync_cursors` table. No custom logic to handle the first-run edge case. No manual timestamp bookkeeping. On the very first run, RapidBridge evaluates `previous_run_date` as `1970-01-01T00:00:00.000Z`, triggering a full historical sync. On subsequent runs, it injects the precise timestamp of the last success.
### Overrides and Dynamic Arguments
There are scenarios where you need to override this automatic state. If a customer reports stale data and you want to force a full historical sync on demand, you can pass an `ignore_previous_run` flag in the API request. This bypasses the stored state without altering the pipeline definition:
```json
{
  "sync_job_id": "7279a917-b447-4629-9e46-a1eeb791ad6b",
  "integrated_account_id": "7ae7b0ab-c6a7-4f29-aec1-1f123517af5d",
  "webhook_id": "a5b21886-3b4d-4fd0-9956-ffc0714d701c",
  "ignore_previous_run": true
}
```

Sometimes you need to sync data based on user input rather than automated timestamps, such as during targeted backfill operations. RapidBridge supports dynamic arguments via an `args_schema` definition. If you want to allow users to specify a custom start date for their ticket sync, you define the schema and reference it in your query:
```json
{
  "args_schema": {
    "ticket_sync_start_date": {
      "type": "string",
      "format": "date-time"
    }
  }
}
```

When triggering the job, you pass the argument:
```json
{
  "args": {
    "ticket_sync_start_date": "2025-01-15T00:00:00.000Z"
  }
}
```

You can then use a conditional placeholder to fall back to the previous run date if the user does not provide a custom argument:
```json
{
  "query": {
    "updated_at": {
      "gt": "{{args.ticket_sync_start_date|previous_run_date}}"
    }
  }
}
```

This declarative flexibility completely removes the need for conditional logic and state management in your application backend.
## Transforming Data on the Fly with JSONata
Pulling raw data from a third-party API is only half the job. Unified schemas are highly effective for standardizing core data models across providers, but enterprise customers frequently use custom fields, unique naming conventions, or highly specific data structures that fall outside standard schemas. The data almost always needs filtering, reshaping, or enrichment before it is useful in your system.
Transforming deeply nested object hierarchies in plain JavaScript or Python usually means writing layers of nested loops and defensive checks. That mentally taxing mapping code soaks up development time that should go toward business logic - the part that delivers real value to customers - while the mapping scripts themselves only slow velocity.
RapidBridge solves this using JSONata. JSONata is a lightweight, Turing-complete functional query and transformation language purpose-built for JSON data, inspired by the location path semantics of XPath 3.1. It allows you to express complex transformations - filtering, aggregation, string manipulation, and conditional logic - as compact strings that live as configuration.
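If you have not seen JSONata before, a single expression gives the flavor. The input document and field names here are invented for illustration:

```jsonata
/* Keep only open tickets, sort newest first, and reshape each record.
   Assumes an invented input document like:
   { "tickets": [ { "id": 1, "status": "open",
                    "subject": "...", "updated_at": "..." } ] } */
tickets[status = "open"]^(>updated_at).{
  "external_id": id,
  "title": subject,
  "last_touched": updated_at
}
```

The equivalent JavaScript would be a filter, a sort, and a map spread across several statements; in JSONata it is one expression that fits inside a configuration string.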
In RapidBridge, you implement these transformations using Transform Nodes. A Transform Node depends on a resource node and applies a JSONata expression to the fetched data before passing it downstream.
Consider a real use case: you fetch a massive list of contacts from Zendesk, but you only want to persist contacts that were updated after the `previous_run_date`. While some APIs support `updated_at` filtering at the query level, many older APIs do not. You have to fetch everything and filter it in memory.
Here is how you handle that declaratively with a Transform Node:
```json
{
  "integration_name": "zendesk",
  "resources": [
    {
      "name": "all-contacts",
      "resource": "ticketing/contacts",
      "method": "list",
      "persist": false
    },
    {
      "name": "filtered-contacts",
      "type": "transform",
      "config": {
        "expression": "resources.ticketing.contacts[updated_at >= %.%.%.previous_run_date]"
      },
      "depends_on": "all-contacts",
      "persist": true
    }
  ]
}
```

In this configuration:
- The `all-contacts` node fetches the data but sets `persist: false`, meaning this raw data will not be sent to your webhook or datastore.
- The `filtered-contacts` node depends on `all-contacts`.
- The JSONata expression filters the array, keeping only objects where `updated_at` is greater than or equal to the `previous_run_date` context variable.
- The filtered node sets `persist: true`, ensuring only the relevant records are delivered.
Transform nodes also chain. You can have a transform that depends on another transform, building multi-step processing pipelines without writing a single line of procedural code. JSONata expressions have full access to a rich context object, including the arguments passed to the run, the data fetched by previous resources, and any custom variables stored on the integrated account.
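A chained pipeline might look like the sketch below. The node shapes follow the filtering example above, but the names, expressions, and the exact context semantics for chained nodes are simplified for illustration:

```json
{
  "resources": [
    {
      "name": "raw-tickets",
      "resource": "ticketing/tickets",
      "method": "list",
      "persist": false
    },
    {
      "name": "open-tickets",
      "type": "transform",
      "config": { "expression": "resources.ticketing.tickets[status = 'open']" },
      "depends_on": "raw-tickets",
      "persist": false
    },
    {
      "name": "ticket-summaries",
      "type": "transform",
      "config": { "expression": "$.{ 'id': id, 'title': subject }" },
      "depends_on": "open-tickets",
      "persist": true
    }
  ]
}
```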
JSONata is a domain-specific language, and onboarding developers to it takes some ramp-up time. Its compact syntax has a learning curve, and debugging complex expressions is different from stepping through JavaScript in a debugger. For teams new to functional expression languages, tools like the JSONata Exerciser are invaluable for iterating on expressions interactively, and the long-term maintenance benefits far outweigh the initial learning curve.
## Handling Complex Pagination and Spooling
Not all APIs return clean, flat arrays of data. Document APIs, knowledge bases, and file storage systems often return heavily nested or fragmented data structures.
Some APIs fragment a single logical resource across many paginated responses. Notion is the poster child for this: fetching a page does not return the full document text. It returns a paginated list of block IDs. You have to recursively fetch the content of each block, handle pagination for each request, and stitch the text back together. Assembling that into a single coherent document in an imperative script is a recursive nightmare requiring significant memory management.
RapidBridge handles this with two features that work together: recursive fetching and Spool Nodes.
Recursive fetching follows parent-child relationships within a single resource automatically. If a block has children, the engine fetches them based on a condition:
```json
{
  "resource": "file-storage/drive-items",
  "method": "list",
  "recurse": {
    "if": "{{resources.file-storage.drive-items.has_children:bool}}",
    "config": {
      "query": {
        "parent": {
          "id": "{{resources.file-storage.drive-items.id}}"
        }
      }
    }
  }
}
```

Spool Nodes collect all paginated results into a single batch before forwarding them. Instead of receiving 15 separate webhook events for 15 pages of content blocks, you receive one event with the complete, assembled collection. You can then pass the entire collection to a Transform Node to be combined into a single document.
```mermaid
sequenceDiagram
    participant Truto as RapidBridge
    participant API as Third-Party API
    participant Spool as Spool Node
    participant Transform as Transform Node
    Truto->>API: Fetch Block List (Page 1)
    API-->>Truto: Blocks 1-50 + Next Cursor
    Truto->>Spool: Store Blocks 1-50
    Truto->>API: Fetch Block List (Page 2)
    API-->>Truto: Blocks 51-100
    Truto->>Spool: Store Blocks 51-100
    Spool->>Transform: Pass all 100 blocks
    Transform->>Transform: Evaluate JSONata (Combine Text)
    Transform-->>Truto: Single Markdown Document
```

To prevent memory exhaustion, Spool Nodes limit temporary storage to 128KB per block. If your pages contain massive embedded content or remote data payloads, you should pair the Spool Node with an intermediate Transform Node that strips out heavy, unnecessary metadata before the data hits the spool.
By chaining a resource node, a stripping transform node, a spool node, and a combining transform node, you can convert highly fragmented, paginated API responses into clean, single-document webhook events - entirely through configuration.
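Sketched as configuration, that chain might look like the following. Treat this as a shape illustration only: the `"type": "spool"` node and the expression contexts are assumptions for this sketch, not documented syntax:

```json
{
  "integration_name": "notion",
  "resources": [
    {
      "name": "raw-blocks",
      "resource": "document/blocks",
      "method": "list",
      "persist": false
    },
    {
      "name": "light-blocks",
      "type": "transform",
      "config": { "expression": "$.{ 'id': id, 'text': text }" },
      "depends_on": "raw-blocks",
      "persist": false
    },
    {
      "name": "all-blocks",
      "type": "spool",
      "depends_on": "light-blocks",
      "persist": false
    },
    {
      "name": "combined-document",
      "type": "transform",
      "config": { "expression": "$join(text, '\\n')" },
      "depends_on": "all-blocks",
      "persist": true
    }
  ]
}
```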
## The Reality of API Rate Limits and 429 Errors
If you are building data sync pipelines at scale, you will hit rate limits. There is no avoiding it. How your integration platform handles these limits dictates the reliability of your entire architecture.
Many integration platforms market "automatic retries" and "built-in exponential backoff" as selling points. They intercept HTTP 429 (Too Many Requests) errors, hold the connection open, and retry the request behind the scenes. This sounds great until you realize that opaque retry logic makes debugging impossible and creates cascading failures in distributed systems. Your background workers time out waiting for a response, queue backlogs explode, and you lose all visibility into the actual health of the upstream API.
Here is how rate limits actually work in the real world: every major API gateway and proxy - Red Hat 3scale, Kong, Envoy, Azure API Management - implements rate limiting differently. Salesforce uses a rolling 24-hour window. HubSpot enforces per-second and daily limits. Zendesk returns `Retry-After` headers. Each vendor expresses its limits in different response headers with different naming conventions.
Truto takes a radically transparent approach: it does not retry, throttle, or absorb rate limit errors.
If an upstream API returns an HTTP 429, Truto passes that error directly back to your caller. You are explicitly responsible for handling the failure and implementing your own retry logic.
To solve the chaos of inconsistent upstream headers, Truto normalizes all upstream rate limit data into standardized response headers based on the IETF RateLimit specification:
| Header | Meaning |
|---|---|
| `ratelimit-limit` | The maximum number of requests permitted in the current window |
| `ratelimit-remaining` | The number of requests remaining in the current window |
| `ratelimit-reset` | The number of seconds until the rate limit window resets |
Why is this better than automatic retry? Because you - the caller - know your business context. You know whether a failed sync can wait 30 seconds or needs to be escalated immediately. You know whether you should back off exponentially or switch to a different customer's sync job while the rate limit window resets. An opaque retry layer inside your integration platform strips you of that control.
A practical implementation looks like this:
```javascript
// Minimal promise-based sleep helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function syncWithRateLimitHandling(syncFn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await syncFn();
    if (response.status === 429) {
      // The normalized IETF-style headers are identical across providers
      const resetIn = parseInt(
        response.headers.get('ratelimit-reset') || '60',
        10
      );
      const remaining = response.headers.get('ratelimit-remaining');
      console.log(`Rate limited. ${remaining} requests left. Waiting ${resetIn}s.`);
      await sleep(resetIn * 1000);
      continue;
    }
    return response;
  }
  throw new Error('Rate limit retries exhausted');
}
```

The standardized headers mean this retry logic works identically whether the upstream API is Salesforce, Zendesk, or BambooHR. You write it once and it works across every integration. This explicit failure state gives your engineering team complete, deterministic control, which is the only way to adhere to best practices for handling API rate limits across multiple third-party APIs.
**Architectural takeaway:** Never trust a system that hides rate limits from your application logic. For sync jobs that run via RapidBridge, rate limit events are reported as `sync_job_run:rate_limited` webhook events, so your system can track and react to throttling across all your connected accounts programmatically.
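Consuming that event is ordinary webhook plumbing. Here is a minimal sketch, assuming an Express-style server; only the event name comes from the behavior described above, and the payload fields are assumptions for illustration:

```javascript
import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical webhook endpoint for Truto events. The payload shape
// (event/data fields) is an assumption, not documented syntax.
app.post('/webhooks/truto', (req, res) => {
  const { event, data } = req.body;
  if (event === 'sync_job_run:rate_limited') {
    // React however your business context demands: pause this account's
    // schedule, alert on-call, or re-queue the run for later.
    console.warn('Sync throttled for account', data?.integrated_account_id);
  }
  res.sendStatus(200);
});

app.listen(3000);
```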
## Where Declarative Pipelines Sit in the Integration Stack
Declarative sync pipelines are not a silver bullet. They work best when:
- You need to pull data from customers' SaaS accounts into your own data store - the classic embedded integration use case for B2B SaaS.
- The data model is well-defined - tickets, contacts, employees, invoices - resources that map cleanly to unified schemas.
- You are syncing at scheduled intervals - every 15 minutes, every hour, every 6 hours.
They are less well-suited for:
- Complex bidirectional sync with conflict resolution - that requires application-level logic that goes beyond what a declarative config can express.
- Real-time event-driven workflows where sub-second latency matters - webhooks plus a real-time unified API are a better fit.
- Highly custom ETL where every customer needs radically different transformation logic - though RapidBridge's JSONata transforms handle more of this than you might expect.
Combine this with scheduled cron triggers and you have a complete sync pipeline - sketched below - that runs every 6 hours, fetches only incremental changes, transforms the data to your schema, and delivers it to your database, all defined as a JSON configuration that fits on a single screen.
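Putting the pieces together, such a pipeline might look like this. The `schedule` key is an assumption for illustration; the query binding follows the incremental sync pattern shown earlier:

```json
{
  "integration_name": "zendesk",
  "schedule": "0 */6 * * *",
  "resources": [
    {
      "resource": "ticketing/tickets",
      "method": "list",
      "query": {
        "updated_at": { "gt": "{{previous_run_date}}" }
      },
      "persist": true
    }
  ]
}
```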
## Shipping Integrations as Data Operations
The pattern here is the same one that transformed infrastructure management: move from imperative scripts to declarative specifications. Just as Terraform replaced shell scripts for provisioning servers, declarative sync pipelines replace hand-rolled ETL scripts for provisioning integrations. Relying on imperative ETL scripts, custom Python functions, and code-heavy orchestration platforms creates a maintenance burden that eventually stalls product development.
The practical impact for engineering leaders:
- New integrations become data operations, not engineering sprints. Adding a new CRM or HRIS source means writing a JSON configuration, not a new codebase.
- Incremental sync is a configuration flag, not an architecture project. Binding `previous_run_date` to a query filter takes 30 seconds.
- Transform logic lives alongside sync definitions, not in separate codebases. JSONata expressions are part of the sync job config - versioned, auditable, and hot-swappable.
- Rate limit handling is explicit and under your control. Standardized IETF headers give you the data you need to implement retry logic that matches your business requirements.
For product managers staring at a backlog of integration requests blocking six-figure deals, this architecture changes the economics. The question shifts from "How many engineers do we need to hire for integrations?" to "How quickly can we add a new sync configuration?"
Stop writing custom API clients. Start shipping integrations as data operations.
## Frequently Asked Questions
**What is a declarative data sync pipeline?**
A declarative data sync pipeline defines what data to pull from third-party APIs using configuration (typically JSON) rather than procedural code. The pipeline engine handles pagination, authentication, state tracking, and delivery automatically.

**How does RapidBridge handle incremental syncs?**
RapidBridge tracks state using a `previous_run_date` context variable per job and connected account. You bind this variable to a query filter in your config, allowing the engine to fetch only records updated since the last successful run.

**What is JSONata and why is it used for data transformation?**
JSONata is a functional query and transformation language for JSON data. It allows you to filter, reshape, and aggregate API responses using compact expressions stored as configuration strings, eliminating the need for complex nested loops in code.

**How does Truto handle API rate limits in sync pipelines?**
Truto does not automatically retry or absorb HTTP 429 errors. Instead, it passes the error back to the caller and normalizes upstream rate limits into IETF-standard headers (`ratelimit-limit`, `ratelimit-remaining`, `ratelimit-reset`) so you can implement precise retry logic.

**What is the difference between declarative sync pipelines and tools like Fivetran?**
Tools like Fivetran are built for internal analytics data warehousing and often charge per row (MAR). Declarative sync pipelines like RapidBridge are designed for embedded B2B integrations, syncing your customers' SaaS data directly into your product's operational datastore.