ETL Workflows Using Unified APIs: Solving the Bulk Extraction Problem
Learn why traditional ETL tools and store-and-sync unified APIs fail at bulk extraction, and how to architect zero-retention data pipelines for B2B SaaS.
Your AI agent needs to ingest every ticket, employee record, and CRM contact from your customers' SaaS tools. You need data from HubSpot, Zendesk, BambooHR, and 40 other platforms — normalized, delivered to your data store, and refreshed on a schedule. Building 40 custom ETL pipelines is not an option. Neither is asking your team to manage OAuth tokens, pagination quirks, and rate limits for each one.
This is the bulk extraction problem for B2B SaaS: pulling high-volume data from your customers' third-party tools into your own system, reliably, at scale. Unified APIs promise to solve it. But the architectural choices your unified API provider makes — particularly around data caching — determine whether you are building a clean pipeline or an expensive, latency-ridden mess.
If you are a VP Eng or PM choosing between warehouse ETL, a cached unified API, and a real-time unified API with sync jobs, the architecture question boils down to this: where does the extra copy of customer data live, and who pays for it?
This post breaks down exactly how ETL workflows interact with unified APIs, where the major approaches fail, and how to architect a data pipeline that actually scales without storing your customers' sensitive data on someone else's servers.
The Collision of ETL Workflows and Unified APIs
Bulk data extraction (ETL) in B2B SaaS is the automated process of authenticating into a customer's third-party application, paginating through large datasets, normalizing proprietary data into a standard schema, and loading it into your own database or vector store.
SaaS portfolios have flattened at around 305 applications per company, but costs keep rising — organizations now spend an average of $55.7M on SaaS annually, an 8% increase year over year. That is 305 potential data sources your product might need to pull from, each with its own API contract, authentication scheme, and data model.
According to the State of SaaS Integrations 2024 report, 60% of all SaaS sales cycles now involve integration discussions, and for 84% of customers, integrations are either very important or a dealbreaker. If you are building an analytics platform, a compliance tool, an AI agent, or anything that depends on data from your customers' existing tools, bulk data extraction is not a nice-to-have. It is your product.
When building an AI agent or a data-intensive application, your system requires context. A customer support AI needs every historical Zendesk ticket. A sales forecasting model needs every Salesforce opportunity and HubSpot engagement. A compliance platform needs employee records from BambooHR and access logs from Okta. To provide value, your product must connect to the specific subset of tools your customer happens to use.
A bulk extraction architecture needs four things that point-to-point API code rarely handles well:
- Tenant-aware auth across dozens or hundreds of customer connections
- A common data model so your downstream pipeline is not littered with provider branches
- A replayable sync system for backfills and incremental runs
- An escape hatch for provider-specific endpoints the unified model does not cover
That last point matters more than most vendor demos admit. The minute you onboard a large customer, you stop dealing with vanilla HubSpot or Zendesk. You start dealing with custom fields, weird filters, half-documented endpoints, and webhooks that omit the one identifier you actually needed. If your integration layer cannot normalize the common path and still give you a proxy path for the weird stuff, you are back to vendor-specific code.
Unified APIs emerged to solve the common path by providing a single, normalized interface. You write code against one schema (e.g., "CRM Contacts"), and the unified API translates that request to Salesforce, Pipedrive, or Zoho. But when you attempt to use them for bulk ETL workflows, the underlying architectural choices matter far more than the marketing pitch. Two questions should drive your evaluation:
- Where does the data live between the source API and your system?
- Who controls the extraction schedule and data freshness?
The answers split the market into two fundamentally different architectures.
Why Traditional ETL Tools Fail at B2B SaaS Integrations
When engineering leaders realize they have a data extraction problem, their first instinct is often to reach for established data engineering tools. Frameworks like Apache Airflow, Apache Flink, or GUI-based tools like Portable.io are incredible pieces of technology. They excel at internal business intelligence, data warehousing, and moving terabytes of data between systems you control.
But traditional ETL tools fail spectacularly when applied to customer-facing B2B SaaS integrations. The abstraction leaks immediately because external SaaS integrations introduce constraints that internal data pipelines never face.
| Problem | Internal ETL Stack | Customer-Facing SaaS Integration Layer |
|---|---|---|
| Warehouse loads | Great fit | Necessary but not sufficient |
| Per-customer OAuth and token refresh | Usually not the focus | Mandatory |
| Normalized API objects across vendors | Usually your job | Should be built in |
| Product-grade webhooks and sync jobs | Limited | Mandatory |
| Passthrough for weird provider endpoints | Rare | Mandatory |
| Procurement and privacy review | Internal | Customer-facing and high scrutiny |
Multi-Tenant OAuth Is a Different Beast
Internal ETL tools assume you have a static API key or database credentials. In B2B SaaS integrations, you are dealing with multi-tenant OAuth.
If you have 1,000 customers connecting their Salesforce accounts, you have 1,000 distinct OAuth access tokens. These tokens typically expire within an hour. You must reliably exchange refresh tokens, handle token revocation, and ensure cross-tenant data isolation. Airflow is a workflow orchestrator; it does not ship with a secure, multi-tenant credential vault or an OAuth lifecycle management engine.
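To make the lifecycle concrete, here is a minimal sketch of per-tenant refresh-on-expiry using only the standard OAuth 2.0 refresh_token grant. The class name, in-memory storage, and token-response handling are illustrative assumptions, not any vendor's SDK — in production the token map would be an encrypted, access-controlled vault.

```python
import json
import time
import urllib.parse
import urllib.request


class TenantTokenStore:
    """Per-tenant OAuth credentials with refresh-on-expiry (illustrative sketch)."""

    def __init__(self, token_url: str, client_id: str, client_secret: str):
        self.token_url = token_url
        self.client_id = client_id
        self.client_secret = client_secret
        self._tokens = {}  # tenant_id -> {access_token, refresh_token, expires_at}

    def get_access_token(self, tenant_id: str) -> str:
        tok = self._tokens[tenant_id]
        # Refresh slightly early so in-flight requests never race expiry
        if tok["expires_at"] - time.time() < 60:
            tok = self._refresh(tenant_id, tok["refresh_token"])
        return tok["access_token"]

    def _refresh(self, tenant_id: str, refresh_token: str) -> dict:
        # Standard OAuth 2.0 refresh_token grant (RFC 6749, section 6)
        data = urllib.parse.urlencode({
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": self.client_id,
            "client_secret": self.client_secret,
        }).encode()
        with urllib.request.urlopen(self.token_url, data=data) as resp:
            body = json.loads(resp.read())
        tok = {
            "access_token": body["access_token"],
            # Some providers rotate the refresh token on every use
            "refresh_token": body.get("refresh_token", refresh_token),
            "expires_at": time.time() + body.get("expires_in", 3600),
        }
        self._tokens[tenant_id] = tok
        return tok
```

A token revoked by the customer surfaces here as an HTTP 400/401 from the token endpoint — that is your cue to mark the connection disconnected in your product, not to retry forever. Now multiply this by every provider's quirks (rotating refresh tokens, revocation webhooks, scope downgrades) and 1,000 tenants.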
Schema Normalization Across Vendors, Not Just Sources
Fivetran and Airbyte handle schema drift within a single source well. But they are not designed to normalize across sources. A "contact" in HubSpot lives under properties.firstname and properties.lastname. In Salesforce, it is FirstName and LastName at the top level. In Pipedrive, it is name as a single field. Traditional ETL tools will faithfully replicate each vendor's raw schema into your warehouse. The normalization layer — making all of these look identical — is your problem.
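The divergence is easy to see in code. Here is a sketch of the normalization shim you end up writing yourself — the three provider field shapes are exactly the ones described above; the unified output schema is an illustrative assumption:

```python
def normalize_contact(provider: str, raw: dict) -> dict:
    """Map three vendors' contact shapes onto one unified schema.

    Field paths match each provider's raw response: HubSpot nests names
    under `properties`, Salesforce keeps them top-level, Pipedrive has a
    single `name` string. Error handling is omitted for brevity.
    """
    if provider == "hubspot":
        first = raw["properties"].get("firstname", "")
        last = raw["properties"].get("lastname", "")
    elif provider == "salesforce":
        first = raw.get("FirstName", "")
        last = raw.get("LastName", "")
    elif provider == "pipedrive":
        # Single field; split on the first space (a lossy heuristic)
        first, _, last = raw.get("name", "").partition(" ")
    else:
        raise ValueError(f"unmapped provider: {provider}")
    return {"first_name": first, "last_name": last, "remote_data": raw}
```

Three providers is manageable. Forty providers times every object type your product touches is an adapter codebase of its own.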
No Embedded Authentication UX
When your customer connects their BambooHR account to your product, they expect an in-app OAuth flow: click "Connect," authorize, done. Traditional ETL tools do not provide embeddable authentication components. You would need to build the entire Link/Connect UI, token storage, and refresh lifecycle yourself.
Hostile API Environments
Internal databases do not rate-limit you. External SaaS APIs do, and they do so aggressively.
Rate limits vary not just by provider, but by the specific enterprise tier your customer pays for. A standard ETL pipeline will blast an API with concurrent requests, hit a 429 Too Many Requests error, and fail. You need an engine that understands provider-specific rate limit headers (like X-RateLimit-Reset), implements exponential backoff, and respects concurrency limits per connected account.
There is also a subtle operational difference. Traditional ETL assumes you own the source relationship. In B2B SaaS, you do not. Your customer owns the Salesforce instance. Your customer owns the NetSuite role permissions. Your customer can revoke scopes on Friday night and ask why payroll did not sync on Monday morning. That operational model is much closer to an integration platform than an analytics pipeline.
75% of companies take 3+ months to build integrations. Multiply that by 40 platforms and you have burned your entire engineering roadmap—a trap we've discussed in our breakdown of the three models for product integrations. Because traditional ETL tools lack these primitives, companies turn to unified APIs. But not all unified APIs are built the same.
The "Double ETL" Trap: Why Store-and-Sync Architectures Break at Scale
Some unified API providers recognized the ETL gap and built syncing infrastructure on top of their platforms. Merge.dev is the most prominent example. The approach sounds reasonable: Merge syncs data from each customer's third-party tools into Merge's own data store, normalizes it, and then you pull the normalized data from Merge via their API.
The problem? You have just introduced a double ETL without realizing it.
```mermaid
flowchart LR
    A["Customer's<br>SaaS App"] -->|ETL #1| B["Merge.dev<br>Cached Store"]
    B -->|ETL #2| C["Your<br>Data Store"]
    style B fill:#ff6b6b,stroke:#333,color:#fff
```

Here is what actually happens:
- ETL #1 (invisible to you): Merge pulls data from the third-party API (HubSpot, BambooHR, etc.) on a schedule they control and stores it in their infrastructure.
- ETL #2 (your responsibility): You pull normalized data from Merge's API into your own system.
Not every unified API works this way — in some alternatives, all data requests are pass-through. Merge, by contrast, stores data and serves it from its own cache, presented as differential syncing. This creates three compounding problems.
Data Staleness You Cannot Control
Merge controls the sync frequency between the source API and their cache. Their documentation instructs you to store timestamps and use modified_after in subsequent requests to pull updates from Merge since your last sync. But the freshness of Merge's cache depends on their sync schedule — not yours.
If your AI agent needs to act on a newly created Zendesk ticket, but Merge only syncs that endpoint every few hours, your product feels broken to the end user. You cannot force a real-time extraction because the architecture is designed around batch caching. You are now debugging phantom inconsistencies that have nothing to do with your code.
Data Residency and Compliance Liability
In a store-and-sync model, highly sensitive customer data — payroll details, HR records, private CRM notes, candidate EEOC data — is sitting on a third-party vendor's servers.
GDPR requires data controllers to ensure data encryption, strong access controls, and regular security assessments. Data controllers are accountable for ensuring compliance even when using third-party processors, including conducting due diligence on vendors.
Your customers' employee records from BambooHR are now sitting on Merge's infrastructure. If your customer is a European company, you need to verify that Merge stores that data within the EEA, or that adequate transfer mechanisms are in place. Non-compliance with GDPR can result in fines of up to 4% of an organization's global annual turnover or €20 million, whichever is higher.
If you are selling into the enterprise, your security team and your customer's InfoSec team will flag this immediately. Passing SOC2, ISO 27001, or GDPR audits becomes exponentially harder when you have to explain why a middleman is permanently storing a copy of your customer's database. You are expanding your threat surface area for no functional benefit.
Pricing That Punishes Growth
Storing and processing billions of rows of cached data requires massive infrastructure. Unified API providers pass that cost directly to you through aggressive pricing models.
Merge charges $650/month for up to 10 total production Linked Accounts, with $65 per additional Linked Account after.
If integrations are a core part of your product, these costs quickly become prohibitive. If 100 customers each connect 2 integrations, you are looking at $13,000 per month ($156,000 annually) just in linked account fees.
It gets worse for ETL-heavy use cases. Every new integration category you support multiplies your costs. If you add HRIS integrations and 40% of your customers adopt them, your linked account costs increase by 40%. When you are building a data-intensive product that needs CRM + HRIS + Ticketing data for each customer, the unit economics collapse.
Merge's Remote Data feature — which gives you access to non-common fields — is only available on Professional or Enterprise plans. Their docs also say Remote Data is updated only during a sync, and only when a common-model field changes. If you need vendor-specific custom fields to behave like first-class data, you are fighting the architecture.
The Real Cost of the Double ETL
| Cost Factor | Pass-Through Architecture | Store-and-Sync (Double ETL) |
|---|---|---|
| Data freshness | Real-time or near-real-time | Dependent on provider's sync schedule |
| Data residency liability | None (data never stored on middleware) | Full GDPR/SOC2 audit surface |
| Sync control | You control schedule and scope | Provider controls ETL #1 |
| Debugging complexity | One hop: source → your system | Two hops: source → cache → your system |
| Vendor lock-in | Low (swap providers, keep pipeline) | High (migration requires re-syncing all data) |
The double ETL is not just an architectural inconvenience. It is an ongoing operational and compliance cost that scales linearly with your customer count.
For a deeper breakdown of Merge.dev's limitations at scale, see our full comparison post.
How a Pass-Through Unified API Handles Bulk Extraction
The alternative to store-and-sync is a pass-through architecture: the unified API acts as a transformation and routing layer, not a data store. When you request data, it fetches directly from the source API, normalizes the response on the fly, and delivers it to your system. Nothing is cached on the middleware.
```mermaid
flowchart LR
    A["Customer's<br>SaaS App"] -->|Single hop| B["Unified API<br>Pass-Through"]
    B -->|Normalized data| C["Your<br>Data Store"]
    style B fill:#51cf66,stroke:#333,color:#fff
```

This is how Truto's architecture works. The unified API engine reads a declarative configuration describing how to talk to the third-party API, applies JSONata-based transformation expressions to normalize the response, and delivers the result directly. No intermediate storage. When you need provider-specific endpoints the unified model does not cover, Truto exposes a Proxy API for direct provider-native access alongside the Unified API — so you never have to fork your architecture for edge cases.
But "pass-through" alone does not solve the bulk extraction problem. Pulling 50,000 contact records one page at a time through a live API is slow, error-prone, and burns through rate limits. You need orchestration primitives specifically designed for high-volume syncing.
This is where Truto's RapidBridge comes in.
Handling Bulk Data Extraction with Truto's RapidBridge
RapidBridge is Truto's data pipeline engine, built specifically for the problem of extracting bulk data from third-party APIs through the unified API layer. It is not an ETL tool in the traditional sense — it is a sync orchestrator that handles the painful mechanics of multi-resource, multi-page, dependency-aware data extraction and delivers results directly to your webhook endpoint or data store.
Here is what RapidBridge does:
- Defines a set of resources to sync for a given integration (e.g., users, contacts, tickets, and comments from Zendesk)
- Resolves dependencies between resources (fetch tickets first, then fetch comments for each ticket)
- Paginates automatically through every page of results
- Delivers data via webhooks directly to your endpoint — Truto never stores the payload
- Tracks sync state for incremental runs
Setting Up a Sync Pipeline
A RapidBridge Sync Job is a declarative JSON configuration. Here is a real example that syncs users, contacts, tickets, and per-ticket comments from Zendesk:
```bash
curl --location 'https://api.truto.one/sync-job' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <api_token>' \
--data '{
    "integration_name": "zendesk",
    "resources": [
        {
            "resource": "ticketing/users",
            "method": "list"
        },
        {
            "resource": "ticketing/contacts",
            "method": "list"
        },
        {
            "resource": "ticketing/tickets",
            "method": "list"
        },
        {
            "resource": "ticketing/comments",
            "method": "list",
            "depends_on": "ticketing/tickets",
            "query": {
                "ticket_id": "{{resources.ticketing.tickets.id}}"
            }
        }
    ]
}'
```

The depends_on field is where it gets interesting. Comments require a ticket_id, so RapidBridge first fetches all tickets, then iterates through each ticket and fetches its comments. The query block uses placeholder bindings — Truto automatically extracts each ticket's ID and passes it as a query parameter to the comments endpoint. This dependency resolution happens automatically. You do not write nested loops, manage cursor state across dependent resources, or handle partial failures in your application code.
Truto handles the pagination, rate limiting, and credential injection automatically. As the data flows through Truto's proxy layer, it is normalized via JSONata expressions and pushed directly to your webhook. No intermediate database. No double ETL.
What Your Webhook Receives
Each webhook event delivers normalized data in the unified schema. A contact from Zendesk looks identical to a contact from Freshdesk or Intercom. Your ingestion pipeline writes one handler, and it works across every integration. The raw, provider-specific response is also available in the remote_data field if you need vendor-specific fields the unified schema does not cover.
This is a single-hop pipeline. Data goes from the source API, through Truto's transformation layer (in-flight, not stored), and into your system. Zero data retention.
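On your side, that means one ingestion handler for every integration. A sketch of what that handler can look like — the envelope fields shown here (`data`, the unified contact fields, `remote_data`) are illustrative assumptions about the payload shape, and `MemStore` is a stand-in for your real datastore client:

```python
class MemStore:
    """Stand-in destination; swap for your real datastore client."""

    def __init__(self):
        self.rows = []

    def upsert(self, collection: str, record: dict) -> None:
        self.rows.append((collection, record))


def handle_sync_event(payload: dict, store) -> None:
    """One handler for every integration.

    Because the unified schema is identical across providers, this
    never branches on which SaaS tool the record came from. `store`
    is anything with an `upsert(collection, record)` method.
    """
    record = payload["data"]
    unified = {
        "id": record["id"],
        "first_name": record.get("first_name"),
        "last_name": record.get("last_name"),
        "email": record.get("email"),
        # Keep the raw provider response for vendor-specific fields
        "remote_data": record.get("remote_data"),
    }
    store.upsert("contacts", unified)
```

The absence of provider branches is the whole point: adding your 41st integration should change zero lines in this handler.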
Error Handling and Resilience
Bulk extraction is inherently messy. APIs go down, records contain malformed data, and tokens get revoked mid-sync.
By default, RapidBridge ignores individual record errors during a Sync Job Run. If one Zendesk ticket fails to parse, Truto logs a sync_job_run:record_error webhook event and continues processing the remaining 99,999 tickets. You get the data you need, and an actionable error log for the anomalies.
For workflows that require strict consistency, you can set the error_handling attribute to fail_fast in the execution request, causing the job to halt immediately upon encountering an issue.
| Pipeline Type | Recommended Error Handling |
|---|---|
| Search indexing, low-risk enrichment | Continue on error, capture dead-letter events |
| Customer-facing analytics with tolerable lag | Continue on error, alert on error rate |
| Payroll, accounting, compliance evidence | Fail fast, fix root cause, rerun deterministically |
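In your webhook consumer, "continue on error, capture dead-letter events" reduces to routing by event type. The event name sync_job_run:record_error comes from Truto's webhook events as described above; the payload shape and the list-backed dead-letter store in this sketch are illustrative — in production the dead letters would land in a queue or table for replay:

```python
def route_event(event: dict, ingest, dead_letters: list) -> None:
    """Split successfully synced records from per-record errors.

    `ingest` is your normal record handler; failed records are kept
    whole so they can be inspected and replayed later.
    """
    if event.get("event") == "sync_job_run:record_error":
        dead_letters.append(event)  # keep the full event for debugging/replay
    else:
        ingest(event)


def error_rate(processed: int, dead_letters: list) -> float:
    """Alerting input: fraction of records that failed this run."""
    total = processed + len(dead_letters)
    return len(dead_letters) / total if total else 0.0
```

Wiring error_rate into an alert threshold gives you the middle row of the table above: the sync keeps flowing, and you find out when anomalies stop being anomalies.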
Real-World Adoption
The practical upside is speed without the usual rewrite tax. As we've highlighted in our guide to enterprise SaaS integrations, Thoropass rolled out 85+ integrations in less than two weeks and used RapidBridge for incremental and event-driven data extraction. Sprinto reached 200+ integrations with Truto. Spendflo migrated ten accounting integrations from Merge at a pace of one per week while keeping its application code unchanged by mirroring the model field-for-field. Those are not toy numbers — they are the kind of rollout speeds you need when integrations are tied to enterprise revenue.
Architecting Scalable Data Pipelines: Incremental Syncs and Spooling
Extracting historical data is only the first step. To maintain a performant ETL pipeline, you must handle ongoing state changes without re-fetching the entire dataset. If you only remember one implementation rule, make it this: never design bulk extraction around repeated full syncs.
Incremental Syncing via Dynamic Bindings
Truto tracks a previous_run_date per Sync Job per Integrated Account — the timestamp of the last successfully completed sync. You bind this value into your resource's query parameters:
```json
{
    "resource": "ticketing/tickets",
    "method": "list",
    "query": {
        "updated_at": {
            "gt": "{{previous_run_date}}"
        }
    }
}
```

On the first run, previous_run_date defaults to 1970-01-01T00:00:00.000Z, so you get a full initial sync. Every subsequent run only fetches tickets modified since the last successful completion. The keyword here is successful — if a sync fails midway, the timestamp is not updated, so the next run retries from the same checkpoint. This is idempotency for free.
When you need a full re-sync — data migration, schema change, disaster recovery — you override this behavior on a per-run basis:
```bash
curl --location 'https://api.truto.one/sync-job-run' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <api_token>' \
--data '{
    "sync_job_id": "<sync_job_id>",
    "integrated_account_id": "<integrated_account_id>",
    "webhook_id": "<webhook_id>",
    "ignore_previous_run": true
}'
```

Setting ignore_previous_run to true triggers a full sync without resetting the stored timestamp, so subsequent scheduled runs resume incremental behavior.
A few engineering notes the glossy vendor pages usually skip:
- Keep checkpoints per integrated account. Global watermarks will corrupt multi-tenant syncs.
- Store the provider object ID and your unified object ID separately.
- Treat updated_at as a hint, not gospel. Plenty of APIs forget to bump it on related changes.
- Have a strategy for deletes. Incremental create-and-update is the easy part.
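The first bullet is worth making concrete. A minimal per-tenant checkpoint table, sketched in memory — in production this lives in your database, keyed exactly like this:

```python
EPOCH = "1970-01-01T00:00:00.000Z"


class SyncCheckpoints:
    """Watermarks keyed by (integrated_account_id, resource) — never global.

    A single global watermark would let one tenant's completed sync
    mask another tenant's pending changes; scoping the key per account
    and per resource avoids that corruption.
    """

    def __init__(self):
        self._marks = {}

    def get(self, account_id: str, resource: str) -> str:
        # First run for this account/resource -> full sync from epoch
        return self._marks.get((account_id, resource), EPOCH)

    def advance(self, account_id: str, resource: str, run_completed_at: str) -> None:
        """Call only after a *successful* run, so failed runs retry from
        the previous checkpoint instead of silently skipping data."""
        self._marks[(account_id, resource)] = run_completed_at
```

Truto maintains this state for you per Sync Job and Integrated Account; the sketch is here so you can verify your own bookkeeping (or your vendor's) against the same invariants.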
Scheduling and Automation
Sync Jobs can be triggered on a schedule using cron triggers, or fired automatically when an Integrated Account is first connected. A new customer connects their Zendesk account, and their initial data sync starts immediately — no manual intervention, no waiting for a batch window.
```mermaid
sequenceDiagram
    participant Customer as Customer's App
    participant Truto as Truto RapidBridge
    participant Source as Third-Party API
    participant You as Your Webhook
    Customer->>Truto: Connect account (OAuth)
    Truto->>Truto: Trigger initial sync job
    loop For each resource
        Truto->>Source: Fetch page (with auth, pagination)
        Source-->>Truto: Raw response
        Truto->>Truto: Normalize via JSONata
        Truto->>You: Deliver webhook event
    end
    Note over Truto: Store previous_run_date
    loop Scheduled runs (cron)
        Truto->>Source: Fetch only modified records
        Source-->>Truto: Delta response
        Truto->>Truto: Normalize
        Truto->>You: Deliver webhook event
    end
```

For teams looking to pipe sync data directly into a database rather than through webhooks, Truto also supports direct syncing to your data store.
Spool Nodes: Solving the Nested Content Problem for RAG
One of the hardest challenges in building AI agents is ingesting heavily nested, paginated resources. This is where most unified API demos fall apart.
Consider a Notion knowledge base. A Notion page is not a single string of text — it is a tree of blocks. Fetching a page requires paginating through its content blocks, checking if any blocks have children, and recursively fetching those children. If you send these blocks to your system one by one, your application has to maintain complex state to reassemble the document before feeding it to a vector database for RAG (Retrieval-Augmented Generation).
Spool nodes allow you to paginate, fetch, and assemble a complete resource within Truto's pipeline, delivering the final, consolidated asset in a single webhook event.
Spool Nodes temporarily hold paginated chunks of data in memory (up to 128KB) during a sync run, allowing you to run JSONata transformations across the entire dataset before pushing it to your destination. Once the run completes, the memory is cleared. Strip unnecessary fields (like remote_data) using transform nodes before spooling to stay within bounds.
Here is how you configure a RapidBridge pipeline to extract a complete Notion page, recursively fetch its children, strip out unnecessary metadata, and combine it into a single Markdown blob:
```json
{
    "integration_name": "notion",
    "args_schema": {
        "page_id": { "type": "string", "required": true }
    },
    "resources": [
        {
            "name": "page-content",
            "resource": "knowledge-base/page-content",
            "method": "list",
            "query": {
                "page": { "id": "{{args.page_id}}" },
                "truto_ignore_remote_data": true
            },
            "recurse": {
                "if": "{{resources.knowledge-base.page-content.has_children:bool}}",
                "config": {
                    "query": {
                        "page_content_id": "{{resources.knowledge-base.page-content.id}}"
                    }
                }
            },
            "persist": false
        },
        {
            "name": "remove-remote-data",
            "type": "transform",
            "config": {
                "expression": "[resources.`knowledge-base`.`page-content`.$sift(function($v, $k) {$k != 'remote_data'})]"
            },
            "depends_on": "page-content"
        },
        {
            "name": "all-page-content",
            "type": "spool",
            "depends_on": "remove-remote-data"
        },
        {
            "name": "combine-page-content",
            "type": "transform",
            "config": {
                "expression": "$blob($reduce($sortNodes(resources.`knowledge-base`.`page-content`, 'id', 'parent.id'), function($acc, $v) { $acc & $v.body.content }, ''), { \"type\": \"text/markdown\" })"
            },
            "depends_on": "all-page-content",
            "persist": true
        }
    ]
}
```

Here is exactly what this pipeline does:
- Fetch and Recurse: The page-content node fetches the top-level blocks. The recurse block evaluates a JSONata expression (has_children:bool). If true, it automatically triggers another API call to fetch the child blocks, injecting the parent ID.
- Transform (Clean): The remove-remote-data node uses a JSONata $sift function to drop the raw provider payload, keeping data under the 128KB spool memory limit.
- Spool: The all-page-content node acts as a barrier, waiting until every single block and sub-block has been fetched and paginated.
- Transform (Combine): The final node uses $sortNodes to organize the blocks into their proper hierarchy, $reduce to concatenate the text content, and $blob to format the output as text/markdown.
The result? Your webhook receives a single, perfectly formatted Markdown file ready for chunking and vectorization. You offloaded the entire recursive pagination and formatting burden to Truto. No recursive API call management. No partial page fragments. No state machines.
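From there, preparing the document for a vector store is a chunking problem, not a pagination problem. A naive paragraph-based chunker as a sketch — the chunk size is an arbitrary starting point, and real pipelines usually add overlap and heading-aware splitting:

```python
def chunk_markdown(doc: str, max_chars: int = 1500) -> list[str]:
    """Split a Markdown document on paragraph boundaries, packing
    paragraphs greedily into chunks of at most `max_chars`.

    A single paragraph longer than `max_chars` passes through as one
    oversized chunk; handling that case is left to the real pipeline.
    """
    chunks, current = [], ""
    for para in doc.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because spool nodes hand you one coherent document instead of a stream of block fragments, this is the entirety of the client-side work before embedding.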
Why Zero Integration-Specific Code Matters for ETL Reliability
One under-discussed aspect of ETL pipelines through unified APIs: what happens when the underlying third-party API changes?
Enterprise customers heavily customize their SaaS instances. Salesforce orgs have custom objects, custom fields with __c suffixes, and deeply nested relationships. HubSpot accounts have custom properties, association types, and calculated fields. Your ETL pipeline needs to handle these variations without breaking, rather than falling victim to the hidden costs of rigid schemas that plague traditional unified APIs.
Most unified API providers maintain separate code paths for each integration — a Salesforce adapter, a HubSpot adapter, a Pipedrive adapter. When Salesforce deprecates an API version or HubSpot changes their pagination format, the provider must update that specific adapter, deploy it, and hope it does not break the 50 other adapters sharing the same runtime.
Truto takes a fundamentally different approach. The entire runtime engine contains zero integration-specific code. Every integration behavior — how to authenticate, which endpoints to call, how to paginate, how to map fields — is defined entirely as data: JSON configuration and JSONata transformation expressions. The same generic execution pipeline that handles a HubSpot CRM contact sync also handles Salesforce, Pipedrive, Zoho, and every other CRM without knowing which one it is talking to.
What this means for ETL reliability:
- Bug fixes propagate universally. When the pagination engine is improved, all integrations benefit immediately.
- New integrations are data operations. Adding support for a new SaaS platform does not require code deployment — just configuration.
- Custom field handling is built in. Truto's response mapping preserves the original API response in a remote_data field alongside the normalized schema, so vendor-specific custom fields are never lost.
JSONata itself is gaining traction as an industry standard. AWS Step Functions added JSONata support in November 2024, and AWS now recommends JSONata as the query language for new state machines. Declarative transformation languages are no longer niche toys — they are becoming normal infrastructure. The right place for provider-specific data shaping is a declarative mapping layer, not a pile of provider branches in your application code.
The Override Hierarchy
This architecture also gives you a three-level override hierarchy. If your largest customer has a heavily modified Salesforce instance with a custom object that needs to be synced, you do not have to wait for Truto to support it. You can apply an Account Override directly to that specific customer's integrated account — providing a custom JSONata expression that maps their bespoke fields into the unified schema. Truto's engine evaluates your override during the next RapidBridge sync run. You get the scalability of a unified API with the exact flexibility of a custom in-house integration.
For a deeper dive, see Look Ma, No Code! Why Truto's Zero-Code Architecture Wins.
The Trade-Offs: Honest Assessment
Pass-through is not magic. Here are the real trade-offs you should consider:
Latency on first request. Because data is fetched live from the source API, a first-time list request that returns 50,000 records will take longer than hitting a pre-warmed cache. RapidBridge mitigates this by handling pagination asynchronously and delivering results as they arrive via webhooks, but the total time-to-complete-sync depends on the source API's performance.
Source API rate limits. You are subject to each vendor's rate limits directly. A store-and-sync provider absorbs some of this by pre-fetching and caching. With pass-through, Truto's proxy layer handles rate limit detection and backoff automatically, but a particularly aggressive sync against a low-limit API (looking at you, Salesforce's daily API quota) will take longer.
No offline querying. With a cached provider, you can query their cache even if the source API is down. With pass-through, if HubSpot is having an outage, your sync fails until they recover. For teams that need local queryability, Truto offers a SuperQuery feature that can query previously synced data — but this is an opt-in feature, not the default.
Not a replacement for your data warehouse. RapidBridge gets data to your system. The transformations beyond schema normalization, and loading into your specific warehouse schema, are still your responsibility. Truto normalizes the data model; you handle the business logic transformations.
The integration layer should normalize and transport data, not quietly become another warehouse. If you need a warehouse, build or buy one on your terms. If you need caching, do it deliberately — not because your vendor's architecture forces it.
For a detailed analysis of when caching makes sense and when it does not, see Tradeoffs Between Real-time and Cached Unified APIs.
Comparing Approaches: A Concrete Scenario
Scenario: You are building an AI-powered compliance platform. You need to ingest employee data from HRIS tools (BambooHR, HiBob, Gusto, Workday), ticket data from helpdesks (Zendesk, Jira, Freshdesk), and document data from knowledge bases (Notion, Confluence) for RAG indexing. You have 200 enterprise customers.
Store-and-Sync (e.g., Merge.dev)
| Factor | Impact |
|---|---|
| Data flow | Source → Merge cache → Your system (double ETL) |
| Freshness | Dependent on Merge's sync frequency per category |
| Cost estimate | 200 customers × 3 categories = 600 linked accounts. At $65/account: $39,000/month ($468,000/year) |
| Compliance | Customer HR data cached on Merge's infrastructure. Requires GDPR DPA, data residency audit, SOC2 scope expansion |
| Custom fields | Limited to supported fields; custom objects require passthrough outside the unified model |
| RAG content | No built-in mechanism for deep recursive page content extraction |
Pass-Through with RapidBridge (Truto)
| Factor | Impact |
|---|---|
| Data flow | Source → Truto (transform in-flight) → Your system (single hop) |
| Freshness | You control sync schedule; data fetched directly from source |
| Cost | Unlimited connections and API calls on higher tiers. No per-linked-account pricing penalty for multi-category integrations |
| Compliance | Zero data retention. Customer data never persists on middleware |
| Custom fields | `remote_data` preserves the full vendor response; JSONata overrides handle custom schemas per customer |
| RAG content | Spool nodes handle recursive pagination and deliver complete documents |
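To make the `remote_data` row concrete, here is a hedged sketch of what that pattern looks like from the consumer side. The field names are illustrative, not Truto's exact response schema: the unified record carries the normalized fields, and the raw vendor payload rides along for anything the unified model does not cover.

```typescript
// Illustrative shape only — not Truto's exact response schema.
interface UnifiedEmployee {
  id: string;
  first_name: string;
  last_name: string;
  remote_data: Record<string, unknown>; // full, untransformed vendor response
}

// Read a vendor-specific custom field that the unified model omits,
// by walking a path into the raw payload.
function customField(emp: UnifiedEmployee, path: string[]): unknown {
  return path.reduce<unknown>(
    (node, key) =>
      node && typeof node === "object"
        ? (node as Record<string, unknown>)[key]
        : undefined,
    emp.remote_data,
  );
}

const emp: UnifiedEmployee = {
  id: "e1",
  first_name: "Ada",
  last_name: "Lovelace",
  remote_data: { customFields: { badgeColor: "blue" } },
};
customField(emp, ["customFields", "badgeColor"]); // → "blue"
```

The point is that normalization and fidelity are not mutually exclusive: you get the unified schema for the common case and an escape hatch to the raw payload when a customer's custom fields matter.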
The architectural difference is not subtle. One approach forces you to trust a third party with your customers' most sensitive data and pay per connection for the privilege. The other gives you a transformation layer you control.
Stop Paying for Double ETLs
If you are evaluating vendors for ETL workflows using unified APIs, ask six blunt questions:
- Does data flow source → vendor → us, or source → us?
- Can we sync directly into our own datastore?
- How are incremental checkpoints tracked per tenant and per resource?
- What triggers a full sync or resync?
- How do we access raw provider data when the unified model falls short?
- What does pricing look like at 100, 500, and 1,000 connected accounts?
Those questions cut through most marketing pages in ten minutes.
Here is the decision framework:
- If your extraction needs are light (a few standard fields, low volume, infrequent syncs), a store-and-sync provider might be acceptable. The double ETL tax is low when volume is low.
- If you are building a data-intensive product (RAG pipelines, analytics platforms, compliance auditing, AI agents), the double ETL becomes an engineering and financial liability. You need a pass-through architecture with incremental syncing, spool nodes for nested content, and zero data retention.
- If you serve regulated industries (healthcare, fintech, enterprise software), every intermediate data store is a compliance surface. Pass-through eliminates the largest one.
The cost of ignoring this decision? Ask your security team what happens when they discover your unified API provider has been caching employee SSNs on their servers for the past six months. Or ask your CFO what happens when your linked account bill hits six figures.
Build your ETL pipeline on an architecture that gives you control over your data flow, your sync schedule, and your customers' privacy. The data should move from source to destination in one hop — not two.
FAQ
- What is the double ETL problem with cached unified APIs?
- When a unified API provider caches your customers' data on their servers, the data moves twice: once from the source to the provider's cache (ETL #1), and again from their cache to your system (ETL #2). This adds data staleness, privacy liability, debugging complexity, and extra cost to your data pipeline — all scaling linearly with your customer count.
- Can unified APIs handle bulk data extraction for ETL workflows?
- Yes, but the architecture matters. Pass-through unified APIs fetch data directly from third-party sources on demand and deliver it to your system in a single hop. Look for features like dependency-aware sync jobs, incremental syncing via tracked watermarks, spool nodes for nested content assembly, and webhook-based delivery with zero data retention.
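On the receiving end, webhook-based delivery with zero retention means your side owns persistence. A minimal sketch of a delivery handler, assuming an HMAC-SHA256 signature scheme and a paged payload shape — both are assumptions for illustration, not any specific vendor's documented contract:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hedged sketch: the signature scheme and payload shape below are
// assumptions, not a specific vendor's documented webhook contract.
function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Constant-time comparison; length check first because
  // timingSafeEqual throws on unequal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}

interface SyncWebhook {
  tenantId: string;
  resource: string;   // e.g. "employees", "tickets"
  records: unknown[]; // normalized records for this page
  cursor?: string;    // present when more pages follow
}

// Upsert each delivered page straight into your own store;
// the middleware keeps nothing.
async function handleSyncDelivery(
  rawBody: string,
  signature: string,
  secret: string,
  upsert: (tenant: string, resource: string, recs: unknown[]) => Promise<void>,
): Promise<void> {
  if (!verifySignature(rawBody, signature, secret)) throw new Error("bad signature");
  const payload = JSON.parse(rawBody) as SyncWebhook;
  await upsert(payload.tenantId, payload.resource, payload.records);
}
```

The key design property: the handler is stateless apart from your own datastore, so there is no second copy of customer data to secure, audit, or pay for.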
- How does incremental syncing work with unified APIs?
- Incremental syncing tracks the last successful sync timestamp per tenant and per resource, then only fetches records modified since that checkpoint. If a sync fails midway, the timestamp is not updated, so the next run retries from the same point. This reduces API calls, minimizes data transfer, and avoids re-processing your entire dataset on every run.
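The checkpoint logic described above can be sketched in a few lines. Everything here is a minimal illustration: `fetchModifiedSince` and the `load` callback stand in for your provider client and your own persistence layer.

```typescript
// Minimal sketch of per-tenant, per-resource watermark tracking.
// In production the watermark map would live in durable storage.
type Watermarks = Map<string, string>; // `${tenant}:${resource}` -> ISO timestamp

async function incrementalSync(
  watermarks: Watermarks,
  tenant: string,
  resource: string,
  fetchModifiedSince: (since: string) => Promise<unknown[]>,
  load: (records: unknown[]) => Promise<void>,
): Promise<number> {
  const key = `${tenant}:${resource}`;
  const since = watermarks.get(key) ?? "1970-01-01T00:00:00Z";
  const started = new Date().toISOString();

  const records = await fetchModifiedSince(since); // only records changed after the checkpoint
  await load(records);                             // if this throws, the watermark is NOT advanced

  watermarks.set(key, started); // advance only after a fully successful run
  return records.length;
}
```

Note that the watermark is captured at the start of the run but committed only at the end: a mid-run failure leaves the old checkpoint in place, so the next attempt retries the same window instead of silently skipping records.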
- Why are traditional ETL tools like Fivetran or Airbyte bad for B2B SaaS integrations?
- Traditional ETL tools are designed for internal data warehousing with single-tenant credentials. They lack multi-tenant OAuth lifecycle management, cross-vendor schema normalization, embeddable authentication UIs, and provider-specific rate limit handling that customer-facing B2B SaaS integrations require. You end up building the missing half yourself.
- How much does Merge.dev cost for high-volume ETL workflows?
- Merge.dev charges $650/month for up to 10 linked accounts and $65 per additional account. At 200 customers with 3 integration categories each (600 linked accounts), that is approximately $39,000/month or $468,000/year in linked account fees alone, before enterprise features or custom sync frequencies.