
ETL Workflows Using Unified APIs: Solving the Bulk Extraction Problem

Learn why traditional ETL tools and store-and-sync unified APIs fail at bulk extraction, and how to architect zero-retention data pipelines for B2B SaaS.

Roopendra Talekar · 22 min read

Your AI agent needs to ingest every ticket, employee record, and CRM contact from your customers' SaaS tools. You need data from HubSpot, Zendesk, BambooHR, and 40 other platforms — normalized, delivered to your data store, and refreshed on a schedule. Building 40 custom ETL pipelines is not an option. Neither is asking your team to manage OAuth tokens, pagination quirks, and rate limits for each one.

This is the bulk extraction problem for B2B SaaS: pulling high-volume data from your customers' third-party tools into your own system, reliably, at scale. Unified APIs promise to solve it. But the architectural choices your unified API provider makes — particularly around data caching — determine whether you are building a clean pipeline or an expensive, latency-ridden mess.

If you are a VP Eng or PM choosing between warehouse ETL, a cached unified API, and a real-time unified API with sync jobs, the architecture question boils down to this: where does the extra copy of customer data live, and who pays for it?

This post breaks down exactly how ETL workflows interact with unified APIs, where the major approaches fail, and how to architect a data pipeline that actually scales without storing your customers' sensitive data on someone else's servers.

The Collision of ETL Workflows and Unified APIs

Bulk data extraction (ETL) in B2B SaaS is the automated process of authenticating into a customer's third-party application, paginating through large datasets, normalizing proprietary data into a standard schema, and loading it into your own database or vector store.

SaaS portfolios have flattened at around 305 applications per company, but costs keep rising — organizations now spend an average of $55.7M on SaaS annually, an 8% increase year over year. That is 305 potential data sources your product might need to pull from, each with its own API contract, authentication scheme, and data model.

According to the State of SaaS Integrations 2024 report, 60% of all SaaS sales cycles now involve integration discussions, and for 84% of customers, integrations are either very important or a dealbreaker. If you are building an analytics platform, a compliance tool, an AI agent, or anything that depends on data from your customers' existing tools, bulk data extraction is not a nice-to-have. It is your product.

When building an AI agent or a data-intensive application, your system requires context. A customer support AI needs every historical Zendesk ticket. A sales forecasting model needs every Salesforce opportunity and HubSpot engagement. A compliance platform needs employee records from BambooHR and access logs from Okta. To provide value, your product must connect to the specific subset of tools your customer happens to use.

A bulk extraction architecture needs four things that point-to-point API code rarely handles well:

  • Tenant-aware auth across dozens or hundreds of customer connections
  • A common data model so your downstream pipeline is not littered with provider branches
  • A replayable sync system for backfills and incremental runs
  • An escape hatch for provider-specific endpoints the unified model does not cover

That last point matters more than most vendor demos admit. The minute you onboard a large customer, you stop dealing with vanilla HubSpot or Zendesk. You start dealing with custom fields, weird filters, half-documented endpoints, and webhooks that omit the one identifier you actually needed. If your integration layer cannot normalize the common path and still give you a proxy path for the weird stuff, you are back to vendor-specific code.

Unified APIs emerged to solve the common path by providing a single, normalized interface. You write code against one schema (e.g., "CRM Contacts"), and the unified API translates that request to Salesforce, Pipedrive, or Zoho. But when you attempt to use them for bulk ETL workflows, the underlying architectural choices matter far more than the marketing pitch. Two questions should drive your evaluation:

  1. Where does the data live between the source API and your system?
  2. Who controls the extraction schedule and data freshness?

The answers split the market into two fundamentally different architectures.

Why Traditional ETL Tools Fail at B2B SaaS Integrations

When engineering leaders realize they have a data extraction problem, their first instinct is often to reach for established data engineering tools. Frameworks like Apache Airflow, Apache Flink, or GUI-based tools like Portable.io are incredible pieces of technology. They excel at internal business intelligence, data warehousing, and moving terabytes of data between systems you control.

But traditional ETL tools fail spectacularly when applied to customer-facing B2B SaaS integrations. The abstraction leaks immediately because external SaaS integrations introduce constraints that internal data pipelines never face.

| Problem | Internal ETL Stack | Customer-Facing SaaS Integration Layer |
| --- | --- | --- |
| Warehouse loads | Great fit | Necessary but not sufficient |
| Per-customer OAuth and token refresh | Usually not the focus | Mandatory |
| Normalized API objects across vendors | Usually your job | Should be built in |
| Product-grade webhooks and sync jobs | Limited | Mandatory |
| Passthrough for weird provider endpoints | Rare | Mandatory |
| Procurement and privacy review | Internal | Customer-facing and high scrutiny |

Multi-Tenant OAuth Is a Different Beast

Internal ETL tools assume you have a static API key or database credentials. In B2B SaaS integrations, you are dealing with multi-tenant OAuth.

If you have 1,000 customers connecting their Salesforce accounts, you have 1,000 distinct OAuth access tokens. These tokens expire every hour. You must reliably exchange refresh tokens, handle token revocation, and ensure cross-tenant data isolation. Airflow is a workflow orchestrator; it does not ship with a secure, multi-tenant credential vault or an OAuth lifecycle management engine.
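To make the lifecycle concrete, here is a minimal sketch of the per-tenant token management an internal ETL tool leaves to you. Every name here (`TokenStore`, `refresh_fn`, the record fields) is hypothetical; a production vault would add encryption at rest, revocation handling, and locking around concurrent refreshes:

```python
import time

class TokenStore:
    """Hypothetical multi-tenant token vault: one record per connected account."""

    def __init__(self):
        self._tokens = {}  # tenant_id -> {access, refresh, expires_at}

    def put(self, tenant_id, access, refresh, expires_in):
        self._tokens[tenant_id] = {
            "access": access,
            "refresh": refresh,
            "expires_at": time.time() + expires_in,
        }

    def get_valid_access(self, tenant_id, refresh_fn, skew=60):
        """Return a usable access token, refreshing if it expires within `skew` seconds.
        `refresh_fn` stands in for the provider's token-endpoint exchange."""
        rec = self._tokens[tenant_id]
        if time.time() >= rec["expires_at"] - skew:
            access, refresh, expires_in = refresh_fn(rec["refresh"])
            # Many providers rotate refresh tokens on every exchange, so store the new one.
            self.put(tenant_id, access, refresh, expires_in)
            rec = self._tokens[tenant_id]
        return rec["access"]
```

Multiply this by every provider's quirks (rotating vs. static refresh tokens, revocation webhooks, scope downgrades) and the gap between an orchestrator and an integration platform becomes obvious.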

Schema Normalization Across Vendors, Not Just Sources

Fivetran and Airbyte handle schema drift within a single source well. But they are not designed to normalize across sources. A "contact" in HubSpot lives under properties.firstname and properties.lastname. In Salesforce, it is FirstName and LastName at the top level. In Pipedrive, it is name as a single field. Traditional ETL tools will faithfully replicate each vendor's raw schema into your warehouse. The normalization layer — making all of these look identical — is your problem.
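The three contact shapes above can be collapsed into one schema with a few lines of mapping code, but note that this per-provider branching is exactly what you end up maintaining yourself. A hedged sketch (field paths mirror the examples in the text; real payloads carry many more fields):

```python
def normalize_contact(provider, raw):
    """Collapse provider-specific contact shapes into one unified schema."""
    if provider == "hubspot":
        props = raw["properties"]
        first, last = props.get("firstname", ""), props.get("lastname", "")
    elif provider == "salesforce":
        first, last = raw.get("FirstName", ""), raw.get("LastName", "")
    elif provider == "pipedrive":
        # Pipedrive exposes a single `name` field; split on the first space.
        parts = raw.get("name", "").split(" ", 1)
        first, last = parts[0], parts[1] if len(parts) > 1 else ""
    else:
        raise ValueError(f"unmapped provider: {provider}")
    # Keep the raw payload so vendor-specific fields are never lost.
    return {"first_name": first, "last_name": last, "remote_data": raw}
```

Now imagine this branch ladder across 40 providers and a dozen object types, kept current as each vendor evolves its API.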

No Embedded Authentication UX

When your customer connects their BambooHR account to your product, they expect an in-app OAuth flow: click "Connect," authorize, done. Traditional ETL tools do not provide embeddable authentication components. You would need to build the entire Link/Connect UI, token storage, and refresh lifecycle yourself.

Hostile API Environments

Internal databases do not rate-limit you. External SaaS APIs do, and they do so aggressively.

Rate limits vary not just by provider, but by the specific enterprise tier your customer pays for. A standard ETL pipeline will blast an API with concurrent requests, hit a 429 Too Many Requests error, and fail. You need an engine that understands provider-specific rate limit headers (like X-RateLimit-Reset), implements exponential backoff, and respects concurrency limits per connected account.
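The retry logic described above can be sketched in a few lines. This is an illustrative pattern, not any provider's exact contract: header names (`Retry-After`, `X-RateLimit-Reset`) and semantics vary by vendor, and `do_request` is a stand-in for your HTTP call:

```python
import time

def fetch_with_backoff(do_request, max_retries=5, sleep=time.sleep):
    """Retry on 429s: honor a provider-supplied wait header when present,
    fall back to capped exponential backoff otherwise.
    `do_request()` returns a (status, headers, body) tuple."""
    for attempt in range(max_retries):
        status, headers, body = do_request()
        if status != 429:
            return body
        wait = headers.get("Retry-After") or headers.get("X-RateLimit-Reset")
        delay = float(wait) if wait else min(2 ** attempt, 60)  # cap the backoff
        sleep(delay)
    raise RuntimeError("rate limited after retries")
```

A real engine also needs per-account concurrency limits, since ten customers on the same provider must not share one retry budget.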

There is also a subtle operational difference. Traditional ETL assumes you own the source relationship. In B2B SaaS, you do not. Your customer owns the Salesforce instance. Your customer owns the NetSuite role permissions. Your customer can revoke scopes on Friday night and ask why payroll did not sync on Monday morning. That operational model is much closer to an integration platform than an analytics pipeline.

75% of companies take 3+ months to build integrations. Multiply that by 40 platforms and you have burned your entire engineering roadmap — a trap we've discussed in our breakdown of the three models for product integrations. Because traditional ETL tools lack these primitives, companies turn to unified APIs. But not all unified APIs are built the same.

The "Double ETL" Trap: Why Store-and-Sync Architectures Break at Scale

Some unified API providers recognized the ETL gap and built syncing infrastructure on top of their platforms. Merge.dev is the most prominent example. The approach sounds reasonable: Merge syncs data from each customer's third-party tools into Merge's own data store, normalizes it, and then you pull the normalized data from Merge via their API.

The problem? You have just introduced a double ETL without realizing it.

flowchart LR
    A["Customer's<br>SaaS App"] -->|ETL #1| B["Merge.dev<br>Cached Store"]
    B -->|ETL #2| C["Your<br>Data Store"]
    style B fill:#ff6b6b,stroke:#333,color:#fff

Here is what actually happens:

  • ETL #1 (invisible to you): Merge pulls data from the third-party API (HubSpot, BambooHR, etc.) on a schedule they control and stores it in their infrastructure.
  • ETL #2 (your responsibility): You pull normalized data from Merge's API into your own system.

In some alternatives, every data request is pass-through to the source. Merge instead stores data and serves it from cache under the guise of differential syncing. This creates three compounding problems.

Data Staleness You Cannot Control

Merge controls the sync frequency between the source API and their cache. Their documentation instructs you to store timestamps and use modified_after in subsequent requests to pull updates from Merge since your last sync. But the freshness of Merge's cache depends on their sync schedule — not yours.

If your AI agent needs to act on a newly created Zendesk ticket, but Merge only syncs that endpoint every few hours, your product feels broken to the end user. You cannot force a real-time extraction because the architecture is designed around batch caching. You are now debugging phantom inconsistencies that have nothing to do with your code.

Data Residency and Compliance Liability

In a store-and-sync model, highly sensitive customer data — payroll details, HR records, private CRM notes, candidate EEOC data — is sitting on a third-party vendor's servers.

GDPR requires data controllers to ensure data encryption, strong access controls, and regular security assessments. Data controllers are accountable for ensuring compliance even when using third-party processors, including conducting due diligence on vendors.

Your customers' employee records from BambooHR are now sitting on Merge's infrastructure. If your customer is a European company, you need to verify that Merge stores that data within the EEA, or that adequate transfer mechanisms are in place. Non-compliance with GDPR can result in fines of up to 4% of an organization's global annual turnover or €20 million, whichever is higher.

If you are selling into the enterprise, your security team and your customer's InfoSec team will flag this immediately. Passing SOC2, ISO 27001, or GDPR audits becomes exponentially harder when you have to explain why a middleman is permanently storing a copy of your customer's database. You are expanding your threat surface area for no functional benefit.

Pricing That Punishes Growth

Storing and processing billions of rows of cached data requires massive infrastructure. Unified API providers pass that cost directly to you through aggressive pricing models.

Merge charges $650/month for up to 10 total production Linked Accounts, with $65 per additional Linked Account after.

If integrations are a core part of your product, these costs quickly become prohibitive. If 100 customers each connect 2 integrations, you are looking at $13,000 per month ($156,000 annually) just in linked account fees.
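The arithmetic behind that figure, using the pricing quoted above (actual tiers and discounts will vary), can be expressed as a simple cost model:

```python
def linked_account_cost(accounts, base=650, included=10, per_extra=65):
    """Monthly cost under a base-plus-per-account model,
    using the figures quoted in the text."""
    extra = max(0, accounts - included)
    return base + extra * per_extra

# 100 customers x 2 integrations each = 200 linked accounts
monthly = linked_account_cost(200)   # base covers 10, the other 190 bill at $65
annual = monthly * 12
```

The model is linear in connected accounts, which is precisely the problem: your costs scale with adoption of your integrations, not with the value you extract from them.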

It gets worse for ETL-heavy use cases. Every new integration category you support multiplies your costs. If you add HRIS integrations and 40% of your customers adopt them, your linked account costs increase by 40%. When you are building a data-intensive product that needs CRM + HRIS + Ticketing data for each customer, the unit economics collapse.

Merge's Remote Data feature — which gives you access to non-common fields — is only available on Professional or Enterprise plans. Their docs also say Remote Data is updated only during a sync, and only when a common-model field changes. If you need vendor-specific custom fields to behave like first-class data, you are fighting the architecture.

The Real Cost of the Double ETL

| Cost Factor | Pass-Through Architecture | Store-and-Sync (Double ETL) |
| --- | --- | --- |
| Data freshness | Real-time or near-real-time | Dependent on provider's sync schedule |
| Data residency liability | None (data never stored on middleware) | Full GDPR/SOC2 audit surface |
| Sync control | You control schedule and scope | Provider controls ETL #1 |
| Debugging complexity | One hop: source → your system | Two hops: source → cache → your system |
| Vendor lock-in | Low (swap providers, keep pipeline) | High (migration requires re-syncing all data) |

The double ETL is not just an architectural inconvenience. It is an ongoing operational and compliance cost that scales linearly with your customer count.

For a deeper breakdown of Merge.dev's limitations at scale, see our full comparison post.

How a Pass-Through Unified API Handles Bulk Extraction

The alternative to store-and-sync is a pass-through architecture: the unified API acts as a transformation and routing layer, not a data store. When you request data, it fetches directly from the source API, normalizes the response on the fly, and delivers it to your system. Nothing is cached on the middleware.

flowchart LR
    A["Customer's<br>SaaS App"] -->|Single hop| B["Unified API<br>Pass-Through"]
    B -->|Normalized data| C["Your<br>Data Store"]
    style B fill:#51cf66,stroke:#333,color:#fff

This is how Truto's architecture works. The unified API engine reads a declarative configuration describing how to talk to the third-party API, applies JSONata-based transformation expressions to normalize the response, and delivers the result directly. No intermediate storage. When you need provider-specific endpoints the unified model does not cover, Truto exposes a Proxy API for direct provider-native access alongside the Unified API — so you never have to fork your architecture for edge cases.

But "pass-through" alone does not solve the bulk extraction problem. Pulling 50,000 contact records one page at a time through a live API is slow, error-prone, and burns through rate limits. You need orchestration primitives specifically designed for high-volume syncing.

This is where Truto's RapidBridge comes in.

Handling Bulk Data Extraction with Truto's RapidBridge

RapidBridge is Truto's data pipeline engine, built specifically for the problem of extracting bulk data from third-party APIs through the unified API layer. It is not an ETL tool in the traditional sense — it is a sync orchestrator that handles the painful mechanics of multi-resource, multi-page, dependency-aware data extraction and delivers results directly to your webhook endpoint or data store.

Here is what RapidBridge does:

  • Defines a set of resources to sync for a given integration (e.g., users, contacts, tickets, and comments from Zendesk)
  • Resolves dependencies between resources (fetch tickets first, then fetch comments for each ticket)
  • Paginates automatically through every page of results
  • Delivers data via webhooks directly to your endpoint — Truto never stores the payload
  • Tracks sync state for incremental runs

Setting Up a Sync Pipeline

A RapidBridge Sync Job is a declarative JSON configuration. Here is a real example that syncs users, contacts, tickets, and per-ticket comments from Zendesk:

curl --location 'https://api.truto.one/sync-job' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <api_token>' \
--data '{
    "integration_name": "zendesk",
    "resources": [
        {
            "resource": "ticketing/users",
            "method": "list"
        },
        {
            "resource": "ticketing/contacts",
            "method": "list"
        },
        {
            "resource": "ticketing/tickets",
            "method": "list"
        },
        {
            "resource": "ticketing/comments",
            "method": "list",
            "depends_on": "ticketing/tickets",
            "query": {
                "ticket_id": "{{resources.ticketing.tickets.id}}"
            }
        }
    ]
}'

The depends_on field is where it gets interesting. Comments require a ticket_id, so RapidBridge first fetches all tickets, then iterates through each ticket and fetches its comments. The query block uses placeholder bindings — Truto automatically extracts each ticket's ID and passes it as a query parameter to the comments endpoint. This dependency resolution happens automatically. You do not write nested loops, manage cursor state across dependent resources, or handle partial failures in your application code.

Truto handles the pagination, rate limiting, and credential injection automatically. As the data flows through Truto's proxy layer, it is normalized via JSONata expressions and pushed directly to your webhook. No intermediate database. No double ETL.
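For contrast, here is roughly the hand-rolled logic that one `depends_on` declaration replaces. This is a hypothetical sketch: `fetch_page(resource, cursor, **params)` is an invented helper, and the real version would also need rate limiting, retries, and checkpointing of partial progress:

```python
def sync_tickets_and_comments(fetch_page):
    """The nested-loop pagination you avoid writing.
    `fetch_page(resource, cursor, **params)` -> (records, next_cursor)."""
    tickets, cursor = [], None
    while True:  # paginate the parent resource to exhaustion
        page, cursor = fetch_page("tickets", cursor)
        tickets.extend(page)
        if cursor is None:
            break
    comments = []
    for ticket in tickets:  # then one full pagination loop per parent record
        c = None
        while True:
            page, c = fetch_page("comments", c, ticket_id=ticket["id"])
            comments.extend(page)
            if c is None:
                break
    return tickets, comments
```

Two nested cursors for one dependency is manageable; three levels deep across ten providers is where hand-rolled sync code goes to die.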

What Your Webhook Receives

Each webhook event delivers normalized data in the unified schema. A contact from Zendesk looks identical to a contact from Freshdesk or Intercom. Your ingestion pipeline writes one handler, and it works across every integration. The raw, provider-specific response is also available in the remote_data field if you need vendor-specific fields the unified schema does not cover.

This is a single-hop pipeline. Data goes from the source API, through Truto's transformation layer (in-flight, not stored), and into your system. Zero data retention.
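Because every provider arrives in the same shape, the ingestion side collapses to one handler. A minimal sketch, assuming an illustrative event payload (the exact webhook envelope is Truto's, not shown here) and a hypothetical `upsert` sink:

```python
def handle_webhook_event(event, upsert):
    """One ingestion handler for every integration: the unified schema
    means no provider branches in your pipeline code."""
    for record in event.get("data", []):
        row = {
            "external_id": record["id"],
            "first_name": record.get("first_name"),
            "last_name": record.get("last_name"),
            # Vendor-specific extras ride along without branching on provider.
            "remote_data": record.get("remote_data", {}),
        }
        upsert(row)
```

Adding the 41st integration changes nothing in this function; that is the entire point of a common data model.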

Error Handling and Resilience

Bulk extraction is inherently messy. APIs go down, records contain malformed data, and tokens get revoked mid-sync.

By default, RapidBridge ignores individual record errors during a Sync Job Run. If one Zendesk ticket fails to parse, Truto logs a sync_job_run:record_error webhook event and continues processing the remaining 99,999 tickets. You get the data you need, and an actionable error log for the anomalies.

For workflows that require strict consistency, you can set the error_handling attribute to fail_fast in the execution request, causing the job to halt immediately upon encountering an issue.

| Pipeline Type | Recommended Error Handling |
| --- | --- |
| Search indexing, low-risk enrichment | Continue on error, capture dead-letter events |
| Customer-facing analytics with tolerable lag | Continue on error, alert on error rate |
| Payroll, accounting, compliance evidence | Fail fast, fix root cause, rerun deterministically |
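The continue-on-error rows above imply a dead-letter pattern on your side of the webhook. A hedged sketch: the `sync_job_run:record_error` event name comes from the text, but the payload shape, sinks, and alert window here are all illustrative:

```python
def route_sync_event(event, ingest, dead_letter, alert, stats,
                     error_rate_threshold=0.01, min_events=100):
    """Continue-on-error handling: good records flow to `ingest`,
    per-record failures go to a dead-letter sink for later replay,
    and an alert fires once the error rate climbs past a threshold."""
    if event["type"] == "sync_job_run:record_error":
        stats["err"] += 1
        dead_letter(event)
    else:
        stats["ok"] += 1
        ingest(event)
    total = stats["ok"] + stats["err"]
    if total >= min_events and stats["err"] / total > error_rate_threshold:
        alert(dict(stats))  # a handful of bad records is noise; a spike is a signal
    return stats
```

The dead-letter sink gives you deterministic replay for the anomalies without stalling the other 99,999 records.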

Real-World Adoption

The practical upside is speed without the usual rewrite tax. As we've highlighted in our guide to enterprise SaaS integrations, Thoropass rolled out 85+ integrations in less than two weeks and used RapidBridge for incremental and event-driven data extraction. Sprinto reached 200+ integrations with Truto. Spendflo migrated ten accounting integrations off Merge at a pace of one per week, keeping its application code unchanged by mirroring Merge's data model field-for-field. Those are not toy numbers — they are the kind of rollout speeds you need when integrations are tied to enterprise revenue.

Architecting Scalable Data Pipelines: Incremental Syncs and Spooling

Extracting historical data is only the first step. To maintain a performant ETL pipeline, you must handle ongoing state changes without re-fetching the entire dataset. If you only remember one implementation rule, make it this: never design bulk extraction around repeated full syncs.

Incremental Syncing via Dynamic Bindings

Truto tracks a previous_run_date per Sync Job per Integrated Account — the timestamp of the last successfully completed sync. You bind this value into your resource's query parameters:

{
    "resource": "ticketing/tickets",
    "method": "list",
    "query": {
        "updated_at": {
            "gt": "{{previous_run_date}}"
        }
    }
}

On the first run, previous_run_date defaults to 1970-01-01T00:00:00.000Z, so you get a full initial sync. Every subsequent run only fetches tickets modified since the last successful completion. The keyword here is successful — if a sync fails midway, the timestamp is not updated, so the next run retries from the same checkpoint. This is idempotency for free.
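The checkpoint discipline described above, commit the watermark only after a fully successful run, is worth internalizing even if Truto handles it for you. A minimal sketch with illustrative names (`fetch_since`, `load`, the checkpoint dict are all hypothetical):

```python
EPOCH = "1970-01-01T00:00:00.000Z"

def run_incremental_sync(account_id, checkpoints, fetch_since, load):
    """Fetch records modified since the last checkpoint, load them,
    and advance the watermark only on full success so a failed run
    retries from the same point."""
    since = checkpoints.get(account_id, EPOCH)  # first run = full sync
    records, high_water_mark = fetch_since(since)
    load(records)                                # may raise; checkpoint untouched
    checkpoints[account_id] = high_water_mark    # only reached on success
    return len(records)
```

Because the checkpoint never moves past a failed load, re-running the job is safe by construction, which is exactly the idempotency the paragraph above describes.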

When you need a full re-sync — data migration, schema change, disaster recovery — you override this behavior on a per-run basis:

curl --location 'https://api.truto.one/sync-job-run' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <api_token>' \
--data '{
    "sync_job_id": "<sync_job_id>",
    "integrated_account_id": "<integrated_account_id>",
    "webhook_id": "<webhook_id>",
    "ignore_previous_run": true
}'

Setting ignore_previous_run to true triggers a full sync without resetting the stored timestamp, so subsequent scheduled runs resume incremental behavior.

A few engineering notes the glossy vendor pages usually skip:

  • Keep checkpoints per integrated account. Global watermarks will corrupt multi-tenant syncs.
  • Store the provider object ID and your unified object ID separately.
  • Treat updated_at as a hint, not gospel. Plenty of APIs forget to bump it on related changes.
  • Have a strategy for deletes. Incremental create-and-update is the easy part.
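On that last bullet: `updated_at` filters never surface deletions, because a deleted record simply stops appearing. One common strategy (a hypothetical sketch, not a Truto feature) is a periodic ID sweep that compares the provider's full ID list against yours and tombstones anything that vanished:

```python
def reconcile_deletes(source_ids, stored_ids, tombstone):
    """Detect deletions that incremental syncs can't see: anything you
    have stored that the source no longer returns gets tombstoned
    (soft-deleted) rather than silently kept forever."""
    missing = set(stored_ids) - set(source_ids)
    for record_id in sorted(missing):
        tombstone(record_id)
    return missing
```

The sweep is expensive relative to an incremental run, so most teams schedule it far less frequently, nightly or weekly, than the delta syncs.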

Scheduling and Automation

Sync Jobs can be triggered on a schedule using cron triggers, or fired automatically when an Integrated Account is first connected. A new customer connects their Zendesk account, and their initial data sync starts immediately — no manual intervention, no waiting for a batch window.

sequenceDiagram
    participant Customer as Customer's App
    participant Truto as Truto RapidBridge
    participant Source as Third-Party API
    participant You as Your Webhook
    
    Customer->>Truto: Connect account (OAuth)
    Truto->>Truto: Trigger initial sync job
    loop For each resource
        Truto->>Source: Fetch page (with auth, pagination)
        Source-->>Truto: Raw response
        Truto->>Truto: Normalize via JSONata
        Truto->>You: Deliver webhook event
    end
    Note over Truto: Store previous_run_date
    
    loop Scheduled runs (cron)
        Truto->>Source: Fetch only modified records
        Source-->>Truto: Delta response
        Truto->>Truto: Normalize
        Truto->>You: Deliver webhook event
    end

For teams looking to pipe sync data directly into a database rather than through webhooks, Truto also supports direct syncing to your data store.

Spool Nodes: Solving the Nested Content Problem for RAG

One of the hardest challenges in building AI agents is ingesting heavily nested, paginated resources. This is where most unified API demos fall apart.

Consider a Notion knowledge base. A Notion page is not a single string of text — it is a tree of blocks. Fetching a page requires paginating through its content blocks, checking if any blocks have children, and recursively fetching those children. If you send these blocks to your system one by one, your application has to maintain complex state to reassemble the document before feeding it to a vector database for RAG (Retrieval-Augmented Generation).

Spool nodes allow you to paginate, fetch, and assemble a complete resource within Truto's pipeline, delivering the final, consolidated asset in a single webhook event.

Info

Spool Nodes temporarily hold paginated chunks of data in memory (up to 128KB) during a sync run, allowing you to run JSONata transformations across the entire dataset before pushing it to your destination. Once the run completes, the memory is cleared. Strip unnecessary fields (like remote_data) using transform nodes before spooling to stay within bounds.

Here is how you configure a RapidBridge pipeline to extract a complete Notion page, recursively fetch its children, strip out unnecessary metadata, and combine it into a single Markdown blob:

{
    "integration_name": "notion",
    "args_schema": {
        "page_id": { "type": "string", "required": true }
    },
    "resources": [
        {
            "name": "page-content",
            "resource": "knowledge-base/page-content",
            "method": "list",
            "query": {
                "page": { "id": "{{args.page_id}}" },
                "truto_ignore_remote_data": true
            },
            "recurse": {
                "if": "{{resources.knowledge-base.page-content.has_children:bool}}",
                "config": {
                    "query": {
                        "page_content_id": "{{resources.knowledge-base.page-content.id}}"
                    }
                }
            },
            "persist": false
        },
        {
            "name": "remove-remote-data",
            "type": "transform",
            "config": {
                "expression": "[resources.`knowledge-base`.`page-content`.$sift(function($v, $k) {$k != 'remote_data'})]"
            },
            "depends_on": "page-content"
        },
        {
            "name": "all-page-content",
            "type": "spool",
            "depends_on": "remove-remote-data"
        },
        {
            "name": "combine-page-content",
            "type": "transform",
            "config": {
                "expression": "$blob($reduce($sortNodes(resources.`knowledge-base`.`page-content`, 'id', 'parent.id'), function($acc, $v) { $acc & $v.body.content }, ''), { \"type\": \"text/markdown\" })"
            },
            "depends_on": "all-page-content",
            "persist": true
        }
    ]
}

Here is exactly what this pipeline does:

  1. Fetch and Recurse: The page-content node fetches the top-level blocks. The recurse block evaluates a JSONata expression (has_children:bool). If true, it automatically triggers another API call to fetch the child blocks, injecting the parent ID.
  2. Transform (Clean): The remove-remote-data node uses a JSONata $sift function to drop the raw provider payload, keeping data under the 128KB spool memory limit.
  3. Spool: The all-page-content node acts as a barrier, waiting until every single block and sub-block has been fetched and paginated.
  4. Transform (Combine): The final node uses $sortNodes to organize the blocks into their proper hierarchy, $reduce to concatenate the text content, and $blob to format the output as text/markdown.

The result? Your webhook receives a single, perfectly formatted Markdown file ready for chunking and vectorization. You offloaded the entire recursive pagination and formatting burden to Truto. No recursive API call management. No partial page fragments. No state machines.

Why Zero Integration-Specific Code Matters for ETL Reliability

One under-discussed aspect of ETL pipelines through unified APIs: what happens when the underlying third-party API changes?

Enterprise customers heavily customize their SaaS instances. Salesforce orgs have custom objects, custom fields with __c suffixes, and deeply nested relationships. HubSpot accounts have custom properties, association types, and calculated fields. Your ETL pipeline needs to handle these variations without breaking, rather than falling victim to the hidden costs of rigid schemas that plague traditional unified APIs.

Most unified API providers maintain separate code paths for each integration — a Salesforce adapter, a HubSpot adapter, a Pipedrive adapter. When Salesforce deprecates an API version or HubSpot changes their pagination format, the provider must update that specific adapter, deploy it, and hope it does not break the 50 other adapters sharing the same runtime.

Truto takes a fundamentally different approach. The entire runtime engine contains zero integration-specific code. Every integration behavior — how to authenticate, which endpoints to call, how to paginate, how to map fields — is defined entirely as data: JSON configuration and JSONata transformation expressions. The same generic execution pipeline that handles a HubSpot CRM contact sync also handles Salesforce, Pipedrive, Zoho, and every other CRM without knowing which one it is talking to.

What this means for ETL reliability:

  • Bug fixes propagate universally. When the pagination engine is improved, all integrations benefit immediately.
  • New integrations are data operations. Adding support for a new SaaS platform does not require code deployment — just configuration.
  • Custom field handling is built in. Truto's response mapping preserves the original API response in a remote_data field alongside the normalized schema, so vendor-specific custom fields are never lost.
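The behavior-as-data idea can be illustrated with a toy version: a generic engine plus per-integration mapping dicts. Truto uses JSONata expressions for this; the dotted paths below are a deliberately simplified stand-in, and the mapping constants are hypothetical:

```python
def apply_mapping(mapping, raw):
    """A toy behavior-as-data engine: the code is generic, and each
    integration is just data (unified field -> dotted source path)."""
    def dig(obj, path):
        for key in path.split("."):
            obj = obj.get(key) if isinstance(obj, dict) else None
            if obj is None:
                return None
        return obj
    out = {field: dig(raw, path) for field, path in mapping.items()}
    out["remote_data"] = raw  # original response preserved alongside the schema
    return out

# Adding a provider is a data operation, not a code deployment:
HUBSPOT_CONTACT = {"first_name": "properties.firstname", "last_name": "properties.lastname"}
SALESFORCE_CONTACT = {"first_name": "FirstName", "last_name": "LastName"}
```

A bug fix in `apply_mapping` (or, in the real system, the JSONata runtime) improves every integration at once, which is the reliability argument in miniature.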

JSONata itself is gaining traction as an industry standard. AWS Step Functions added JSONata support in November 2024, and AWS now recommends JSONata as the query language for new state machines. Declarative transformation languages are no longer niche toys — they are becoming normal infrastructure. The right place for provider-specific data shaping is a declarative mapping layer, not a pile of provider branches in your application code.

The Override Hierarchy

This architecture also gives you a three-level override hierarchy. If your largest customer has a heavily modified Salesforce instance with a custom object that needs to be synced, you do not have to wait for Truto to support it. You can apply an Account Override directly to that specific customer's integrated account — providing a custom JSONata expression that maps their bespoke fields into the unified schema. Truto's engine evaluates your override during the next RapidBridge sync run. You get the scalability of a unified API with the exact flexibility of a custom in-house integration.

For a deeper dive, see Look Ma, No Code! Why Truto's Zero-Code Architecture Wins.

The Trade-Offs: Honest Assessment

Pass-through is not magic. Here are the real trade-offs you should consider:

Latency on first request. Because data is fetched live from the source API, a first-time list request that returns 50,000 records will take longer than hitting a pre-warmed cache. RapidBridge mitigates this by handling pagination asynchronously and delivering results as they arrive via webhooks, but the total time-to-complete-sync depends on the source API's performance.

Source API rate limits. You are subject to each vendor's rate limits directly. A store-and-sync provider absorbs some of this by pre-fetching and caching. With pass-through, Truto's proxy layer handles rate limit detection and backoff automatically, but a particularly aggressive sync against a low-limit API (looking at you, Salesforce's daily API quota) will take longer.

No offline querying. With a cached provider, you can query their cache even if the source API is down. With pass-through, if HubSpot is having an outage, your sync fails until they recover. For teams that need local queryability, Truto offers a SuperQuery feature that can query previously synced data — but this is an opt-in feature, not the default.

Not a replacement for your data warehouse. RapidBridge gets data to your system. The transformations beyond schema normalization, and loading into your specific warehouse schema, are still your responsibility. Truto normalizes the data model; you handle the business logic transformations.

Tip

The integration layer should normalize and transport data, not quietly become another warehouse. If you need a warehouse, build or buy one on your terms. If you need caching, do it deliberately — not because your vendor's architecture forces it.

For a detailed analysis of when caching makes sense and when it does not, see Tradeoffs Between Real-time and Cached Unified APIs.

Comparing Approaches: A Concrete Scenario

Scenario: You are building an AI-powered compliance platform. You need to ingest employee data from HRIS tools (BambooHR, HiBob, Gusto, Workday), ticket data from helpdesks (Zendesk, Jira, Freshdesk), and document data from knowledge bases (Notion, Confluence) for RAG indexing. You have 200 enterprise customers.

Store-and-Sync (e.g., Merge.dev)

  • Data flow: Source → Merge cache → Your system (double ETL)
  • Freshness: Dependent on Merge's sync frequency per category
  • Cost estimate: 200 customers × 3 categories = 600 linked accounts. At $65/account: $39,000/month ($468,000/year)
  • Compliance: Customer HR data cached on Merge's infrastructure. Requires a GDPR DPA, data residency audit, and SOC 2 scope expansion
  • Custom fields: Limited to supported fields; custom objects require passthrough requests outside the unified model
  • RAG content: No built-in mechanism for deep recursive page content extraction

Pass-Through with RapidBridge (Truto)

  • Data flow: Source → Truto (transform in-flight) → Your system (single hop)
  • Freshness: You control the sync schedule; data is fetched directly from the source
  • Cost: Unlimited connections and API calls on higher tiers. No per-linked-account pricing penalty for multi-category integrations
  • Compliance: Zero data retention. Customer data never persists on middleware
  • Custom fields: remote_data preserves the full vendor response; JSONata overrides handle custom schemas per customer
  • RAG content: Spool nodes handle recursive pagination and deliver complete documents
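Conceptually, a spool node is a depth-first walk over nested pages and blocks that spools everything into one complete document. The tree shape and callables below are illustrative, not any vendor's actual schema or Truto's implementation:

```python
# Sketch: recursively flatten a nested page tree (think knowledge-base
# pages containing child blocks, each of which may have children) into
# ordered text chunks ready for RAG indexing.

def spool(node, fetch_children):
    """Depth-first flatten: a node's text, then its subtree in order."""
    chunks = [node["text"]]
    for child in fetch_children(node["id"]):
        chunks.extend(spool(child, fetch_children))
    return chunks
```

The detail that matters for RAG quality is ordering: a child block's text stays adjacent to its parent's, so the assembled document reads the way the original page does.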

The architectural difference is not subtle. One approach forces you to trust a third party with your customers' most sensitive data and pay per connection for the privilege. The other gives you a transformation layer you control.

Stop Paying for Double ETLs

If you are evaluating vendors for ETL workflows using unified APIs, ask six blunt questions:

  1. Does data flow source → vendor → us, or source → us?
  2. Can we sync directly into our own datastore?
  3. How are incremental checkpoints tracked per tenant and per resource?
  4. What triggers a full sync or resync?
  5. How do we access raw provider data when the unified model falls short?
  6. What does pricing look like at 100, 500, and 1,000 connected accounts?

Those questions cut through most marketing pages in ten minutes.

Here is the decision framework:

  • If your extraction needs are light (a few standard fields, low volume, infrequent syncs), a store-and-sync provider might be acceptable. The double ETL tax is low when volume is low.
  • If you are building a data-intensive product (RAG pipelines, analytics platforms, compliance auditing, AI agents), the double ETL becomes an engineering and financial liability. You need a pass-through architecture with incremental syncing, spool nodes for nested content, and zero data retention.
  • If you serve regulated industries (healthcare, fintech, enterprise software), every intermediate data store is a compliance surface. Pass-through eliminates the largest one.

The cost of ignoring this decision? Ask your security team what happens when they discover your unified API provider has been caching employee SSNs on their servers for the past six months. Or ask your CFO what happens when your linked account bill hits six figures.

Build your ETL pipeline on an architecture that gives you control over your data flow, your sync schedule, and your customers' privacy. The data should move from source to destination in one hop — not two.

FAQ

What is the double ETL problem with cached unified APIs?
When a unified API provider caches your customers' data on their servers, the data moves twice: once from the source to the provider's cache (ETL #1), and again from their cache to your system (ETL #2). This adds data staleness, privacy liability, debugging complexity, and extra cost to your data pipeline — all scaling linearly with your customer count.
Can unified APIs handle bulk data extraction for ETL workflows?
Yes, but the architecture matters. Pass-through unified APIs fetch data directly from third-party sources on demand and deliver it to your system in a single hop. Look for features like dependency-aware sync jobs, incremental syncing via tracked watermarks, spool nodes for nested content assembly, and webhook-based delivery with zero data retention.
How does incremental syncing work with unified APIs?
Incremental syncing tracks the last successful sync timestamp per tenant and per resource, then only fetches records modified since that checkpoint. If a sync fails midway, the timestamp is not updated, so the next run retries from the same point. This reduces API calls, minimizes data transfer, and avoids re-processing your entire dataset on every run.
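The checkpoint discipline is small but strict in code. A sketch, assuming illustrative `fetch_since` and `load` callables supplied by your pipeline:

```python
# Sketch: per-tenant, per-resource incremental sync checkpoints.
# The watermark only advances after a fully successful load, so a
# failed run naturally retries from the same point next time.

def incremental_sync(checkpoints, tenant, resource, fetch_since, load):
    """checkpoints: dict[(tenant, resource)] -> last successful watermark."""
    since = checkpoints.get((tenant, resource))
    records, new_watermark = fetch_since(since)  # only records modified after `since`
    load(records)                                # raises on failure -> watermark untouched
    checkpoints[(tenant, resource)] = new_watermark
    return len(records)
```

Keying the checkpoint by `(tenant, resource)` is what keeps one customer's failed Zendesk sync from forcing a re-sync of anyone else's data.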
Why are traditional ETL tools like Fivetran or Airbyte bad for B2B SaaS integrations?
Traditional ETL tools are designed for internal data warehousing with single-tenant credentials. They lack multi-tenant OAuth lifecycle management, cross-vendor schema normalization, embeddable authentication UIs, and provider-specific rate limit handling that customer-facing B2B SaaS integrations require. You end up building the missing half yourself.
How much does Merge.dev cost for high-volume ETL workflows?
Merge.dev charges $650/month for up to 10 linked accounts and $65 per additional account. At 200 customers with 3 integration categories each (600 linked accounts), that is approximately $39,000/month or $468,000/year in linked account fees alone, before enterprise features or custom sync frequencies.
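The arithmetic behind those figures is easy to reproduce; this cost model uses only the pricing quoted above ($650/month covering 10 linked accounts, $65 per additional account):

```python
# Sketch: linked-account cost model from the quoted Merge.dev pricing.

def merge_monthly_cost(linked_accounts, base=650, included=10, per_extra=65):
    """Monthly fee: flat base covers `included` accounts, then per-account pricing."""
    extra = max(0, linked_accounts - included)
    return base + extra * per_extra

accounts = 200 * 3  # 200 customers x 3 integration categories = 600 linked accounts
```

Swap in your own customer count and category mix to see where your bill lands before talking to sales.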
