What is a pass-through proxy architecture for API integrations?

A pass-through proxy handles credential injection, token refresh, pagination, and response mapping in memory without persisting third-party payloads. Data flows through the proxy to the caller without being written to any database, minimizing your data retention footprint and simplifying compliance with SOC 2 and GDPR.

How does BYOK (Bring Your Own Key) work for SaaS integrations?

BYOK lets enterprise customers supply their own encryption master key from their KMS (AWS KMS, Google Cloud KMS, Azure Key Vault). The integration platform uses this customer-owned key to wrap data encryption keys that protect OAuth tokens and API credentials. The customer can independently audit key usage and revoke access at any time by disabling their master key.

How do you prevent webhook replay attacks?

Prevent replay attacks by including a timestamp in the signed webhook payload and rejecting any request older than 5 minutes. Combine this with idempotency keys (unique event IDs) so that even if a duplicate delivery arrives, the consumer processes it only once.
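Here is a minimal TypeScript sketch of that check. It assumes a hypothetical delivery format in which the sender signs `eventId.timestamp.body` with HMAC-SHA256 and sends the event ID, timestamp and hex signature alongside the body. The in-memory `seenEventIds` set stands in for a shared idempotency store that has its own TTL.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_AGE_SECONDS = 5 * 60; // reject deliveries more than 5 minutes old (or ahead)
const seenEventIds = new Set<string>(); // stand-in for a shared idempotency store

// Sign the event ID and timestamp together with the body so neither can be swapped on replay
function signWebhook(secret: string, eventId: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${eventId}.${timestamp}.${body}`).digest("hex");
}

type Delivery = { eventId: string; timestamp: number; signature: string; body: string };

function verifyWebhook(
  secret: string,
  d: Delivery,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): "accept" | "duplicate" | "reject" {
  // 1. Freshness: a captured request is useless once the window closes
  if (Math.abs(nowSeconds - d.timestamp) > MAX_AGE_SECONDS) return "reject";

  // 2. Authenticity: constant-time comparison (timingSafeEqual throws on length mismatch)
  const expected = Buffer.from(signWebhook(secret, d.eventId, d.timestamp, d.body), "hex");
  const received = Buffer.from(d.signature, "hex");
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) return "reject";

  // 3. Idempotency: a valid but repeated event ID is acknowledged, not reprocessed
  if (seenEventIds.has(d.eventId)) return "duplicate";
  seenEventIds.add(d.eventId);
  return "accept";
}
```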

What is envelope encryption and why does it matter for API credentials?

Envelope encryption uses a two-tier key hierarchy: a master key (KEK) stored in a KMS wraps a data encryption key (DEK), and the DEK encrypts the actual credential. A database breach yields only ciphertext because the DEK is encrypted and the KEK never leaves the KMS.

What observability metrics should I track for API integration audits?

Track request latency per provider (p99), error rates by HTTP status, OAuth token refresh failure rates, webhook delivery success rates, and rate limit exhaustion counts. Each metric should include dimensions for provider, environment, and integrated account to support per-customer incident investigation.

The SaaS API Integration Audit Runbook: Retention, Tokens, Logging & SLAs

Your account executive just moved a six-figure enterprise deal to the final procurement stage. The technical validation is complete, the champion is sold, and the contract is waiting on a single signature. Then, the enterprise InfoSec team sends over a 200-question vendor risk assessment targeting your third-party API integrations. They want to know exactly how you handle OAuth token concurrency, where third-party payloads are cached, and what your data retention policies are for external API logs.

If your engineering team responds by scrambling to pull architecture diagrams for a patchwork of cron jobs, legacy webhooks, and raw API keys stored in plaintext, the deal is dead.

As B2B SaaS companies move upmarket, enterprise security teams no longer accept generic assurances about integration security. They expect a heavily documented API integration audit runbook - a standardized operational framework that proves you have systemic control over data retention, token lifecycles, boundary logging, and upstream service-level agreements (SLAs).

Why You Need an API Integration Audit Runbook

Third-party API reliability is getting worse, downtime costs have escalated into seven-figure territory, and enterprise InfoSec teams now treat your integration architecture as a primary attack surface. An audit runbook is the cheapest insurance policy you can buy.

The numbers are blunt. ITIC's 2024 Hourly Cost of Downtime Survey shows that over 90% of mid-size and large enterprises now lose more than $300,000 per hour to downtime. Among those, 41% put the hourly cost between $1M and more than $5M, and 98% of large enterprises report at least $100,000 per hour.

Reliability is also actively degrading. Between Q1 2024 and Q1 2025, average API uptime dropped from 99.66% to 99.46%. Downtime therefore rose from 0.34% to 0.54% of the time, an increase of roughly 60% year-over-year. A week has 10,080 minutes, so every 0.1 percentage-point drop in uptime adds roughly 10 minutes of downtime per week. Across dozens of integrations, your system is constantly exposed to partial outages and degraded performance.
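The arithmetic behind those figures is easy to check. This small TypeScript helper assumes that uptime is a simple percentage of wall-clock time over a 7-day week:

```typescript
const MINUTES_PER_WEEK = 7 * 24 * 60; // 10,080

// Weekly downtime implied by an uptime percentage (linear in the uptime figure)
function weeklyDowntimeMinutes(uptimePercent: number): number {
  return (1 - uptimePercent / 100) * MINUTES_PER_WEEK;
}

const q1_2024 = weeklyDowntimeMinutes(99.66); // ~34.3 minutes per week
const q1_2025 = weeklyDowntimeMinutes(99.46); // ~54.4 minutes per week
const increase = (q1_2025 - q1_2024) / q1_2024; // ~0.59, i.e. roughly 60% more downtime
// Because the relation is linear, every 0.1-point drop adds the same amount at any baseline
const perTenthOfAPoint = weeklyDowntimeMinutes(99.9); // ~10.1 minutes per week
```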

Worse, when systems fail, APIs are overwhelmingly responsible - 67% of monitoring errors originate from API, HTTP, Timeout, or TLS failures, making API performance the primary determinant of overall system reliability. Your integrations layer is now the single biggest reliability risk in your product. Treat it like one.

If you want to bypass procurement bottlenecks and stop burning core engineering cycles on silent integration failures, you need to transition from reactive firefighting to a defensible, standardized integration posture. If you do not yet have a baseline monitoring strategy, review our guide on how to create an operational runbook and monitoring playbook first.

An audit runbook (which pairs well with an operational runbook for declarative syncs) covers five critical areas. Skip any of them and InfoSec will find the gap before your customers do:

Data retention - what third-party payloads you cache, encrypt, or persist
OAuth token lifecycle - refresh, concurrency control, revocation handling
Logging - what's captured, what's redacted, retention windows
Upstream SLAs and fallback behavior - rate limits, error normalization, circuit breakers
Webhook security - inbound verification, outbound signatures, replay protection

Step 1: Auditing Data Retention and Privacy Controls

Definition: A data retention audit traces every byte of third-party API payload through your system - request, response, cache, queue, log, warehouse - and documents the retention window, encryption state, and legal basis for each hop.

Storing third-party API payloads expands your toxic data footprint. Every time your system caches a Salesforce contact record, a Workday employee profile, or a NetSuite invoice to "simplify" processing, you introduce an unmanaged sub-processor into your customer's compliance boundary.

Compliance practitioners increasingly treat zero data retention as a baseline expectation for passing enterprise security reviews such as SOC 2 audits and GDPR assessments. Zero data retention is an integration design pattern in which the integration middleware processes third-party payloads and passes them through to the destination system. It never caches, queues or stores them in persistent databases. Data passes through your platform on the way to its destination but is never stored at rest. This is what makes your architecture defensible during enterprise reviews: the auditor cannot find data you do not have.

The Data Retention Audit Checklist

To pass a strict enterprise Data Protection Impact Assessment (DPIA), audit your integration infrastructure against these constraints:

Layer	Question to answer	Acceptable answer
Edge ingestion	Are raw request/response bodies persisted?	No, or with strict TTL and encryption
Queue / buffer	What's the message retention window?	Minutes, not days
Logs	Are response bodies logged?	Headers only, bodies sampled and redacted
Warehouse / cache	Is third-party data mirrored?	Only with explicit customer opt-in
Backups	Are integration secrets in backups?	Encrypted with separate KMS key

Eliminate database caching for third-party payloads: Are you storing raw JSON responses from HubSpot or Zendesk in your primary database to make pagination easier? Rip it out. Use a pass-through proxy architecture that streams data directly to the client or destination system.
Audit message queues and durable state: If you use message brokers (like Kafka or RabbitMQ) for webhook ingestion, verify the Time-To-Live (TTL) configurations. Webhook payloads should not sit in a dead-letter queue for 30 days.
Verify encryption at rest for credentials: OAuth access tokens, refresh tokens, and API keys must be encrypted at rest using AES-GCM or equivalent standards. The encryption keys must be managed via a dedicated secret management service, entirely separate from the application database.
Implement claim-check patterns for large payloads: If you process massive webhooks, do not push the raw payload through your event bus. Write the payload to an ephemeral object store, pass a reference ticket (the "claim check") through the queue, and delete the object immediately after the consumer processes it.
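Here is a minimal TypeScript sketch of the claim-check pattern. `EphemeralStore` is a hypothetical in-memory stand-in for an object store with a lifecycle TTL, and the array stands in for a message broker. The 15-minute TTL is an illustrative assumption.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical stand-in for an ephemeral object store with a lifecycle TTL
class EphemeralStore {
  private objects = new Map<string, { body: string; expiresAt: number }>();

  put(body: string, ttlMs: number, now = Date.now()): string {
    const key = randomUUID();
    this.objects.set(key, { body, expiresAt: now + ttlMs });
    return key;
  }

  read(key: string, now = Date.now()): string | undefined {
    const obj = this.objects.get(key);
    if (!obj || obj.expiresAt <= now) {
      this.objects.delete(key); // expired objects are never served
      return undefined;
    }
    return obj.body;
  }

  delete(key: string): void {
    this.objects.delete(key);
  }

  get size(): number {
    return this.objects.size;
  }
}

// Only the claim check (a reference) travels through the queue, never the payload
type ClaimCheck = { key: string; eventType: string };

function enqueueWebhook(store: EphemeralStore, queue: ClaimCheck[], eventType: string, payload: string): void {
  const key = store.put(payload, 15 * 60 * 1000); // assumed 15-minute TTL
  queue.push({ key, eventType });
}

async function consumeOne(
  store: EphemeralStore,
  queue: ClaimCheck[],
  handler: (eventType: string, payload: string) => Promise<void>,
): Promise<void> {
  const msg = queue.shift();
  if (!msg) return;
  const payload = store.read(msg.key);
  if (payload === undefined) return; // TTL elapsed: rely on the provider's redelivery
  try {
    await handler(msg.eventType, payload);
    store.delete(msg.key); // delete as soon as processing succeeds
  } catch (err) {
    queue.push(msg); // retry later; the TTL still bounds how long the payload lives
    throw err;
  }
}
```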

If you're building this from scratch, the architectural principle is simple: never persist what you can proxy. For a deeper dive into formalizing these policies for enterprise buyers, adapt the templates in our SaaS integration compliance and operations checklist.

The Pass-Through Proxy Architecture

When you route customer data through any third-party integration platform - unified API or otherwise - the first security question an auditor will ask is: does that platform store my data? A pass-through proxy architecture makes the answer a hard no.

In this architecture, the integration layer receives your application's request, decrypts stored credentials in memory, injects them into the outbound request, forwards it to the third-party API, transforms the response using in-memory mapping expressions, and returns the result to your application. No payload is written to disk, queued for later processing, or cached in a database.

flowchart LR
    App["Your Application"] -->|"1. Unified API request"| Proxy["Integration Proxy<br>(stateless, in-memory only)"]
    Proxy -->|"2. Inject credentials,<br>forward to provider"| API["Third-Party API"]
    API -->|"3. Raw response"| Proxy
    Proxy -->|"4. Transform and return<br>(zero persistence)"| App
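Steps 1 to 4 can be sketched in TypeScript as follows. The dependency names (`getCredentials`, `fetchUpstream`) and the provider response shape (`{ records: [...] }`) are assumptions, and the sketch leaves out token refresh and pagination. The point is structural: the payload exists only in local variables for the lifetime of the request.

```typescript
type Credentials = { accessToken: string };
type UpstreamResponse = { status: number; body: unknown };

// Hypothetical injected dependencies: a vault that decrypts credentials in memory,
// and an HTTP client for the provider
type ProxyDeps = {
  getCredentials: (accountId: string) => Promise<Credentials>;
  fetchUpstream: (url: string, headers: Record<string, string>) => Promise<UpstreamResponse>;
};

// Mapping expression: provider-specific record -> unified schema
type Mapper = (raw: Record<string, unknown>) => Record<string, unknown>;

type ProxyResult = { data: Record<string, unknown>[] } | { error: { status: number } };

async function proxyList(deps: ProxyDeps, accountId: string, url: string, map: Mapper): Promise<ProxyResult> {
  // 1-2. Decrypt credentials in memory and inject them into the outbound request
  const creds = await deps.getCredentials(accountId);
  const res = await deps.fetchUpstream(url, { Authorization: `Bearer ${creds.accessToken}` });

  // Normalize errors without keeping the raw upstream body
  if (res.status >= 400) return { error: { status: res.status } };

  // 3-4. Transform in memory and return; nothing is written to disk, a queue, or a cache
  const records = (res.body as { records: Record<string, unknown>[] }).records;
  return { data: records.map(map) };
}
```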

The proxy layer centralizes five responsibilities that would otherwise force your application to handle - and store - sensitive data directly:

Responsibility	What happens	Why it eliminates storage
Credential injection	OAuth tokens are decrypted in memory and attached to the outbound request	Your app never sees or stores raw tokens
Token lifecycle	Expired tokens are refreshed before the request is forwarded	No stale token cache to manage
Pagination	Multi-page responses are assembled on the fly	No intermediate page storage needed
Error normalization	Provider-specific errors are mapped to a standard format in memory	No raw error payloads persisted
Response transformation	Mapping expressions transform responses into unified schemas without intermediate writes	Data flows through, never lands

The rule is: if the data exists in the third-party system, fetch it on demand through the proxy. Only persist when the customer explicitly opts into synchronization for offline querying or analytics. For webhook payloads that must be queued for async delivery, use a claim-check pattern - write the payload to ephemeral object storage with a strict TTL, pass a reference through the queue, and delete the object after the consumer processes it.

Envelope Encryption and BYOK Patterns

OAuth tokens, API keys, and webhook secrets require encryption at rest. Simple column-level encryption is a start, but enterprise auditors demand a provable key hierarchy. The standard pattern is envelope encryption - a two-tier system where a master key (Key Encryption Key, or KEK) wraps a data-specific key (Data Encryption Key, or DEK), and the DEK encrypts the actual credential.

flowchart TB
    KEK["Master Key - KEK<br>(lives in KMS, never exported)"] -->|"Wraps"| EDEK["Encrypted DEK<br>(stored alongside ciphertext)"]
    EDEK -->|"Unwrap via KMS<br>at runtime"| DEK["Plaintext DEK<br>(in-memory only)"]
    DEK -->|"AES-GCM"| CRED["Encrypted Credential<br>(OAuth tokens, API keys, secrets)"]
    DEK -.->|"Discarded immediately"| X["Memory cleared"]

In practice, the KMS generates the data key and you use it to encrypt the credential locally. You then store the encrypted copy of the data key alongside the ciphertext. The workflow:

Encrypt: Generate a unique DEK, encrypt the credential with AES-GCM, wrap the DEK with the KEK via your KMS, store the encrypted DEK alongside the ciphertext, and immediately discard the plaintext DEK from memory.
Decrypt: Send the encrypted DEK to the KMS for unwrapping, use the plaintext DEK to decrypt the credential in memory, execute the API call, and discard the plaintext DEK.

A database breach yields only ciphertext. Without KMS access, the data is useless.

BYOK (Bring Your Own Key) for enterprise customers: BYOK is a cloud architecture that gives customers ownership of the encryption keys that protect some or all of their data stored in SaaS applications. It is per-tenant encryption where your customers can independently monitor usage of their data and revoke all access to it if desired. In practice, the customer supplies their own KEK from their own KMS instance, and the integration platform wraps all credential DEKs with that customer-owned key.

Recommendations for BYOK implementation:

Per-tenant KEKs: Wrap each customer's credentials with their own master key, not a shared platform key. A breach of one tenant's key material must not expose another tenant's credentials.
Key rotation without downtime: When a customer rotates their KEK, re-wrap all existing DEKs with the new key. The underlying encrypted data does not change - only the wrapping layer. This is a metadata operation, not a re-encryption of all credentials.
Audit logging on every KMS operation: Every wrap, unwrap, and rotate call should be logged and accessible to the customer. Your customer can independently monitor all data access.
Document KMS unavailability behavior: What happens when the customer's KMS is unreachable? Cached DEKs with a short TTL (minutes, not hours) can prevent total outage, but the trade-off must be documented explicitly in your DPA.
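Node's built-in `crypto` module is enough to sketch both the envelope workflow and the fact that rotation only re-wraps the DEK. `FakeKms` is a hypothetical local stand-in. With a real KMS, the KEK never leaves the service and `wrap`/`unwrap` become API calls. A per-tenant deployment would hold one KEK per customer.

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

type Sealed = { iv: Buffer; tag: Buffer; ciphertext: Buffer };

// AES-256-GCM with a fresh 96-bit IV per encryption
function sealWith(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), ciphertext };
}

function openWith(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  // final() throws if the data was tampered with or the key is wrong
  return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]);
}

// Hypothetical stand-in for a customer's KEK; a real KEK never leaves the KMS
class FakeKms {
  private readonly kek = randomBytes(32);
  wrap(dek: Buffer): Sealed {
    return sealWith(this.kek, dek);
  }
  unwrap(wrapped: Sealed): Buffer {
    return openWith(this.kek, wrapped);
  }
}

type StoredCredential = { encryptedDek: Sealed; credential: Sealed };

function encryptCredential(kms: FakeKms, secret: string): StoredCredential {
  const dek = randomBytes(32); // unique DEK per credential
  try {
    return { credential: sealWith(dek, Buffer.from(secret, "utf8")), encryptedDek: kms.wrap(dek) };
  } finally {
    dek.fill(0); // best-effort wipe of the plaintext DEK
  }
}

function decryptCredential(kms: FakeKms, stored: StoredCredential): string {
  const dek = kms.unwrap(stored.encryptedDek);
  try {
    return openWith(dek, stored.credential).toString("utf8");
  } finally {
    dek.fill(0);
  }
}

// KEK rotation: re-wrap the DEK only; the credential ciphertext is reused untouched
function rewrapDek(oldKms: FakeKms, newKms: FakeKms, stored: StoredCredential): StoredCredential {
  const dek = oldKms.unwrap(stored.encryptedDek);
  try {
    return { encryptedDek: newKms.wrap(dek), credential: stored.credential };
  } finally {
    dek.fill(0);
  }
}
```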

Warning

Audit gotcha: "We don't store data" is not the same as "we don't process data." If you decrypt, transform, or buffer payloads, you're still a sub-processor under GDPR Article 28. Your DPA needs to reflect the actual processing chain, not the marketing claim.

Zero Data Retention for MCP Servers

Model Context Protocol (MCP) servers extend the same zero-data-retention question to AI agents. When Claude, ChatGPT, or Cursor calls a tool that hits Salesforce or HubSpot on your customer's behalf, the audit questions do not change - they multiply.

Audit trail completeness is a common gap: developer-grade MCP implementations log at the application level, if at all. They record that the AI made a request - not which user authorized it, not what specific data was retrieved, not what action was taken with it. For organizations subject to HIPAA, GDPR, SOX, or FedRAMP, this is not a logging gap - it is a compliance gap. These frameworks require attribution-level documentation of data access that generic MCP logging does not provide. The fix is to hold MCP servers to the exact same architectural constraints as your REST proxy layer, not to a lower bar.

Three properties are non-negotiable for MCP servers that must pass an enterprise DPIA:

Stateless tool execution. Each tools/call invocation creates a fresh server context, executes against the upstream API through the same pass-through proxy that serves the unified API, and returns the result. No caching of tool arguments, no persistence of response bodies, no side-channel storage of the AI's reasoning trace. Tools inherit zero-data-retention guarantees automatically because they share the proxy path.
Hashed token storage. Raw MCP tokens are hashed with an HMAC signing key before being stored. The plaintext token is returned exactly once at creation time and is never recoverable from any store. If the token database leaks, the attacker gets hashes, not tokens.
Enforced TTLs with independent cleanup. MCP servers support explicit expires_at timestamps. Both the token store's built-in expiration and a scheduled cleanup job independently remove the record on expiry, so no orphaned credential lingers if one enforcement path fails.
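Here is a minimal TypeScript sketch of hashed token storage with two independent expiry paths. The in-memory `Map` stands in for the token store. The signing key is generated here for the demo, but in practice it would come from a secret manager. Both are assumptions for illustration.

```typescript
import { createHmac, randomBytes } from "node:crypto";

// Assumption: loaded from a secret manager in production; generated here for the demo
const SIGNING_KEY = randomBytes(32);

type TokenRecord = { serverId: string; expiresAt: number };
const tokenStore = new Map<string, TokenRecord>(); // keyed by HMAC, never by plaintext token

function hashToken(token: string): string {
  return createHmac("sha256", SIGNING_KEY).update(token).digest("hex");
}

// The plaintext token is returned exactly once; only its HMAC is stored
function createMcpToken(serverId: string, ttlMs: number, now = Date.now()): string {
  const token = randomBytes(32).toString("base64url");
  tokenStore.set(hashToken(token), { serverId, expiresAt: now + ttlMs });
  return token;
}

// Enforcement path 1: expiry is checked on every lookup
function authenticate(token: string, now = Date.now()): string | null {
  const record = tokenStore.get(hashToken(token));
  if (!record || record.expiresAt <= now) return null;
  return record.serverId;
}

// Enforcement path 2: a scheduled sweep deletes expired records independently
function sweepExpired(now = Date.now()): number {
  let removed = 0;
  for (const [hash, record] of tokenStore) {
    if (record.expiresAt <= now) {
      tokenStore.delete(hash);
      removed++;
    }
  }
  return removed;
}
```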

The MCP-specific audit table an enterprise reviewer will hand you:

Question	Acceptable answer
Where are tool arguments logged?	Metadata only (tool name, timestamp, integrated account ID) - never argument values
Are tool responses cached?	No. Every call re-executes against the upstream API
How are MCP server tokens stored?	Hashed with HMAC before storage; plaintext returned once at creation
Can an MCP server outlive its purpose?	No. TTLs are enforced at the token store and by a scheduled cleanup job
Does the MCP endpoint support additional auth?	Optional per-server: layer API token or session auth on top of the URL
How is the tool surface scoped?	Method filters (read / write / custom) and tag filters restrict which tools each server exposes

Two operational patterns further reduce the retention surface. Short-lived servers (24-hour or 7-day TTLs) scope AI access to a specific workflow, so revoked contractors or completed automations never leave behind stale credentials. Optional API token layering requires the caller to present a valid session or API token on top of the MCP URL - so possession of the URL alone is not sufficient to invoke tools.

Robust observability is essential for MCP environments. Log every tool and model invocation with:

- the tool invoked
- the identities involved
- where feasible, cryptographic hashes of the arguments and results, never their raw values

These logs form the backbone of forensic response in the event of a breach or anomaly. Where possible, feed this telemetry into the organization's existing security monitoring infrastructure, such as SIEM systems, threat detection pipelines or compliance dashboards. The trick is to log the metadata around each MCP call without persisting the tool arguments or response bodies. That distinction between metadata-only invocation logs and zero payload persistence is what makes MCP compatible with enterprise data retention requirements.
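One way to build such a metadata-only record is sketched below in TypeScript, with hypothetical field names. A plain SHA-256 of a low-entropy argument such as an email address can be reversed by guessing, so a keyed HMAC is the safer choice in production. `JSON.stringify` is also sensitive to key order, so callers must serialize arguments consistently for the hashes to be comparable.

```typescript
import { createHash } from "node:crypto";

type McpInvocationLog = {
  timestamp: string;
  tool: string;
  integrated_account_id: string;
  user_id: string; // who authorized the call
  argument_names: string[]; // parameter names only, never values
  arguments_sha256: string;
  result_sha256: string;
  status: "ok" | "error";
};

function sha256(value: unknown): string {
  // JSON.stringify(undefined) returns undefined, so fall back to an empty string
  return createHash("sha256").update(JSON.stringify(value) ?? "").digest("hex");
}

function buildInvocationLog(
  tool: string,
  integratedAccountId: string,
  userId: string,
  args: Record<string, unknown>,
  result: unknown,
  status: "ok" | "error",
  now: Date = new Date(),
): McpInvocationLog {
  return {
    timestamp: now.toISOString(),
    tool,
    integrated_account_id: integratedAccountId,
    user_id: userId,
    argument_names: Object.keys(args).sort(),
    arguments_sha256: sha256(args),
    result_sha256: sha256(result),
    status,
  };
}
```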

Step 2: The OAuth Token Lifecycle and Concurrency Audit

Definition: An OAuth token audit verifies how access tokens are acquired, refreshed, stored, and revoked - and proves that concurrent operations cannot corrupt token state or trigger lockouts at the provider.

A large share of integration downtime is caused not by the third-party API going offline but by botched OAuth token refreshes. Access tokens for Salesforce, HubSpot, Microsoft Graph and most modern APIs are short-lived. Depending on the provider and its session settings, they typically expire after anywhere from 30 minutes to a couple of hours. If you have a high-volume sync job running, multiple threads will eventually try to use an expired token at nearly the same moment.

If your architecture lacks concurrency control, five concurrent API requests will trigger five simultaneous refresh requests to the provider. On providers that rotate refresh tokens, the first request receives a new token pair and the old refresh token is immediately invalidated. The other four requests fail, overwrite the database with invalid credentials and permanently disconnect the user. This is known as a refresh race condition, and it is one of the most common causes of integration failure.

Your audit needs to answer four critical questions:

1. Are tokens refreshed proactively, not reactively?

Reactive refresh (refreshing only when an API call returns 401) creates a thundering herd: every sync job, webhook handler and user request hits the expired token at the same time and stampedes the refresh endpoint. Proactive refresh swaps the token before it expires. Schedule a distributed alarm to fire at a random point 60 to 180 seconds before the known expires_at timestamp. The randomized buffer spreads refresh load across accounts and keeps tokens warm, so callers rarely encounter an expired one.
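The scheduling math is small enough to show directly. This TypeScript sketch picks a random lead time between 60 and 180 seconds before `expires_at`. The in-process `setTimeout` is a placeholder for whatever distributed scheduler or alarm you actually use, and `refresh` is a hypothetical callback.

```typescript
const MIN_LEAD_MS = 60_000; // fire no later than 60s before expiry
const MAX_LEAD_MS = 180_000; // and no earlier than 180s before expiry

// Delay until the refresh alarm should fire; the random lead spreads load across accounts
function refreshDelayMs(expiresAtMs: number, nowMs: number = Date.now(), random: () => number = Math.random): number {
  const leadMs = MIN_LEAD_MS + random() * (MAX_LEAD_MS - MIN_LEAD_MS);
  return Math.max(0, expiresAtMs - leadMs - nowMs); // 0 means refresh immediately
}

// Placeholder scheduler: a production system would persist this as a distributed alarm
function scheduleProactiveRefresh(
  accountId: string,
  expiresAtMs: number,
  refresh: (accountId: string) => Promise<unknown>,
): ReturnType<typeof setTimeout> {
  return setTimeout(() => {
    refresh(accountId).catch((err) => console.error(`proactive refresh failed for ${accountId}`, err));
  }, refreshDelayMs(expiresAtMs));
}
```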

2. Is concurrent refresh serialized per account?

Before any process attempts to refresh a token, it must acquire a distributed mutex lock tied to that specific integrated account ID. Concurrent callers must await the in-progress refresh operation rather than firing duplicate requests.

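A reasonable implementation pattern in TypeScript looks like this (mutex, store and oauthClient stand in for your distributed lock, credential store and OAuth client):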

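async function refreshWithMutex(accountId: string) {
  // Serialize refreshes per integrated account: concurrent callers queue on one lock
  return mutex.acquire(accountId, async () => {
    const account = await store.get(accountId)

    // Re-check inside the lock with a 30s pre-flight expiry buffer
    if (!account.token.expired(30 /* sec buffer */)) {
      return account.token  // someone else already refreshed
    }

    for (let attempt = 0; ; attempt++) {
      try {
        const newToken = await oauthClient.refresh(account.refresh_token)
        await store.update(accountId, newToken)
        return newToken
      } catch (err: any) {
        // Terminal: revoked grant. Flag the account and never retry
        if (String(err?.message).includes('invalid_grant')) {
          await store.update(accountId, { status: 'needs_reauth' })
          throw err
        }
        // Transient (5xx / network): exponential backoff, at most 3 retries
        if (attempt >= 3 || err?.status < 500) throw err
        await new Promise(resolve => setTimeout(resolve, 500 * 2 ** attempt))
      }
    }
  })
}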

Here is how that serialized flow operates structurally:

sequenceDiagram
    participant Client as API Client (x5)
    participant Mutex as Distributed Mutex Lock
    participant Provider as Third-Party OAuth Server
    participant DB as Credential Store

    Client->>Mutex: Request Token (Expired)
    Note over Mutex: Lock Acquired by Request 1
    Mutex->>Provider: Exchange Refresh Token
    Note over Mutex: Requests 2-5 Await Promise
    Provider-->>Mutex: Return New Access Token
    Mutex->>DB: Persist New Credentials
    Note over Mutex: Lock Released
    Mutex-->>Client: Return Fresh Token to All 5 Callers
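Within a single process, you can sketch the same "await the refresh already in flight" behavior with a map of pending promises. This is an in-process single-flight, not the distributed mutex described above. It deduplicates callers within one instance only, so multiple instances still need a shared lock. `doRefresh` is a hypothetical stand-in for the provider call.

```typescript
type Token = { accessToken: string; expiresAt: number };

// One pending refresh per account; concurrent callers share the same promise
const inFlight = new Map<string, Promise<Token>>();

function refreshOnce(accountId: string, doRefresh: (accountId: string) => Promise<Token>): Promise<Token> {
  const pending = inFlight.get(accountId);
  if (pending) return pending; // join the refresh already in progress

  // Clear the slot once the refresh settles, whether it succeeded or failed
  const promise = doRefresh(accountId).finally(() => inFlight.delete(accountId));
  inFlight.set(accountId, promise);
  return promise;
}
```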

3. How are revoked tokens handled?

An invalid_grant response means one of three things: the user manually revoked access, an admin rotated credentials, or the refresh token has expired. When a provider returns invalid_grant, your system must immediately halt retries. Retrying a revoked grant is pointless. It only generates noise in logs and risks tripping the provider's rate limits.

For the audit, this means distinguishing two classes of error and routing them differently:

- Retryable errors: HTTP 5xx and network failures.
- Terminal errors: invalid_grant (which the token endpoint typically returns with HTTP 400) and HTTP 401/403.

On a terminal error, your system must mark the account as needs_reauth and fire a webhook that alerts the customer with a clear re-connect CTA.
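Here is a sketch of that routing decision as a pure function. It assumes the token endpoint's HTTP status and parsed OAuth error body are available. It also goes slightly beyond the rules above by treating 429 as retryable and any other 4xx as a non-retryable bug.

```typescript
type RefreshOutcome = "retry" | "needs_reauth" | "fail";

// status is undefined when no HTTP response arrived (DNS, TLS, connection reset, timeout)
function classifyRefreshError(status: number | undefined, body?: { error?: string }): RefreshOutcome {
  if (body?.error === "invalid_grant") return "needs_reauth"; // revoked or dead grant: never retry
  if (status === undefined) return "retry"; // network failure
  if (status === 401 || status === 403) return "needs_reauth";
  if (status === 429 || status >= 500) return "retry"; // transient: back off and retry
  return "fail"; // other 4xx: likely a bug on our side, alert a human
}
```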

4. Where are tokens stored, and how?

Tokens belong in a column encrypted with AES-GCM (or equivalent), with the key in a managed KMS - never in plain logs, never in error messages, never in OpenTelemetry traces. Your audit log should be able to prove that no engineer can read a customer's access token without an explicit, audited break-glass procedure.

Token Lifecycle Tests: Sample Refresh and Rotation Checks

Auditors want to see that your OAuth handling works under adversarial conditions. Ship these tests in CI so a security reviewer can read the assertions directly instead of taking your word for it. The following examples use Vitest, but the pattern translates to any test runner:

// oauth-lifecycle.test.ts
import { describe, it, expect, beforeEach } from 'vitest'
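import { refreshWithMutex } from './oauth'
import { rotateCustomerKek, decryptToken } from './kms' // KEK rotation helpers (module path assumed)
import { store, seedAccount, providerRefreshCount, resetProvider } from './test-harness'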
 
describe('OAuth token lifecycle', () => {
  beforeEach(async () => {
    await resetProvider()
  })
 
  it('refreshes proactively before expiry (buffer window)', async () => {
    // Token is still valid for 20s, which is inside the 30s pre-flight buffer
    const account = await seedAccount({ expiresInSec: 20 })
    const fresh = await refreshWithMutex(account.id)
 
    expect(fresh.access_token).not.toEqual(account.token.access_token)
    expect(fresh.expires_at).toBeGreaterThan(Date.now() + 25 * 60 * 1000)
    expect(providerRefreshCount(account.id)).toBe(1)
  })
 
  it('serializes 5 concurrent refresh calls into 1 provider hit', async () => {
    const account = await seedAccount({ expired: true })
 
    const results = await Promise.all(
      Array.from({ length: 5 }, () => refreshWithMutex(account.id))
    )
 
    // Only one refresh actually hit the upstream OAuth server
    expect(providerRefreshCount(account.id)).toBe(1)
    // All callers received the same fresh token
    const uniqueTokens = new Set(results.map(r => r.access_token))
    expect(uniqueTokens.size).toBe(1)
  })
 
  it('marks account needs_reauth on invalid_grant and does not retry', async () => {
    const account = await seedAccount({ refreshToken: 'revoked' })
 
    await expect(refreshWithMutex(account.id)).rejects.toThrow(/invalid_grant/)
 
    const updated = await store.get(account.id)
    expect(updated.status).toBe('needs_reauth')
    expect(providerRefreshCount(account.id)).toBe(1) // no retry storm
  })
 
  it('retries on transient 5xx with exponential backoff', async () => {
    const account = await seedAccount({ expired: true, provider503Count: 2 })
 
    const fresh = await refreshWithMutex(account.id)
 
    expect(fresh.access_token).toBeDefined()
    expect(providerRefreshCount(account.id)).toBe(3) // 2 failures + 1 success
  })
 
  it('re-wraps DEKs on KEK rotation without re-issuing OAuth tokens', async () => {
    const account = await seedAccount({ expiresInSec: 3600 })
    const before = await store.getRawRow(account.id)
 
    await rotateCustomerKek(account.customer_id)
 
    const after = await store.getRawRow(account.id)
    // Only the wrapping layer changed: the credential ciphertext is untouched,
    // while the DEK is now wrapped by the new KEK
    expect(after.token_ciphertext).toEqual(before.token_ciphertext)
    expect(after.encrypted_dek).not.toEqual(before.encrypted_dek)
    const plaintext = await decryptToken(account.id)
    expect(plaintext.access_token).toEqual(account.token.access_token)
    // Provider was NOT called - rotation is a metadata operation
    expect(providerRefreshCount(account.id)).toBe(0)
  })
})

The rotation test is the one auditors linger on. It proves that a customer rotating their KEK does not force a re-consent flow with the upstream provider, which is what makes BYOK operationally viable for enterprises with 90-day key rotation policies.

For more details, read our deep-dive on handling OAuth token refresh failures in production.

Step 3: API Logging Best Practices for Compliance

Definition: Compliance-ready API boundary logging is the practice of recording the exact HTTP requests and responses exchanged with third-party APIs while systematically redacting sensitive credentials and Personally Identifiable Information (PII) before the data hits your observability platform.

The common mistake is logging the transformed payload after your integration layer has already normalized it. By then, you've lost the upstream's raw error format, the original headers, and the exact request body that caused the failure. When an integration breaks, your engineers need logs. When an InfoSec auditor reviews your system, they demand proof that those logs do not contain plaintext API keys or unredacted customer data.

The Logging Audit Checklist

Log at the boundary, where requests cross from your system into a third-party API and back. Audit your observability pipeline against these requirements:

Log at the network boundary: Do not log the output of your internal data models. Log the exact HTTP request method, the target URL, the response status and the response body as received from the third-party provider. Redact the body, and sample it for successful calls. This is the only way to prove whether a data corruption issue originated in your code or in the vendor's API.
Enforce aggressive, automated redaction: Your logging middleware must automatically strip Authorization, X-Api-Key, and Cookie headers before the log object is constructed. Never rely on engineers to manually redact secrets in their console.log statements.
Correlate logs with standard identifiers: Every log entry must include an x-request-id, the target environment_id, and the integrated_account_id. When a customer reports a missing record, your engineers should be able to query the exact API transaction in seconds.
Sign outbound webhooks: When your system delivers normalized webhooks to your customers, sign the payload using HMAC-SHA256 and include the signature in an X-Signature header. Log each signed delivery so you can prove what was sent, when and to whom.
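The TypeScript sketch below redacts headers at the middleware level. Sensitive headers are skipped while the log object is being built, so their values never exist in the log record. The deny-list and the record shape, which is modeled on the sample log structure in this section, are assumptions.

```typescript
// Header names are compared case-insensitively; extend per provider as needed
const SENSITIVE_HEADERS = new Set(["authorization", "x-api-key", "cookie", "set-cookie", "proxy-authorization"]);

type OutboundRequest = { method: string; url: string; headers: Record<string, string> };
type LogContext = { correlationId: string; integratedAccountId: string; environmentId: string; provider: string };

function buildBoundaryLog(ctx: LogContext, req: OutboundRequest, status: number, latencyMs: number, now = new Date()) {
  const headers: Record<string, string> = {};
  const headersRedacted: string[] = [];
  for (const [name, value] of Object.entries(req.headers)) {
    // Secret values are dropped here, before the log object exists
    if (SENSITIVE_HEADERS.has(name.toLowerCase())) headersRedacted.push(name);
    else headers[name] = value;
  }
  return {
    timestamp: now.toISOString(),
    correlation_id: ctx.correlationId,
    integrated_account_id: ctx.integratedAccountId,
    environment_id: ctx.environmentId,
    provider: ctx.provider,
    request: { method: req.method, url: req.url, headers, headers_redacted: headersRedacted },
    response: { status, latency_ms: latencyMs },
  };
}
```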

Below is an example of a compliant, heavily redacted boundary log structure:

{
  "timestamp": "2026-10-14T08:12:33Z",
  "correlation_id": "req_987654321",
  "integrated_account_id": "acc_12345",
  "provider": "salesforce",
  "request": {
    "method": "PATCH",
    "url": "https://your-domain.my.salesforce.com/services/data/v60.0/sobjects/Contact/003xx000004abcd",
    "headers_redacted": ["Authorization", "Cookie"],
    "Content-Type": "application/json"
  },
  "response": {
    "status": 429,
    "latency_ms": 412,
    "ratelimit_remaining": "0",
    "ratelimit_reset": "1716804912",
    "upstream_request_id": "a3f8x91...",
    "body": {
      "errorCode": "REQUEST_LIMIT_EXCEEDED",
      "message": "TotalRequests Limit exceeded."
    }
  }
}

For retention, the audit-defensible default is 30 to 90 days for boundary logs, with PII-redacted summaries retained longer for trend analysis. Anything longer needs an explicit legal basis.

Observability Metrics for Audit Trails

Logs tell you what happened. Metrics tell you when something is going wrong before it surfaces in a customer ticket. Track these integration-specific signals to maintain an auditable operational posture:

Metric	What it measures	Alert threshold
`integration.request.latency_p99`	99th percentile response time per provider	> 2x historical baseline
`integration.request.error_rate`	Percentage of non-2xx responses per provider	> 5% over a 5-minute window
`integration.token_refresh.failure_rate`	Percentage of failed OAuth refresh attempts	> 0% (any failure is actionable)
`integration.token_refresh.latency`	Time to complete a token refresh operation	> 10 seconds
`integration.webhook.delivery_success_rate`	Percentage of outbound webhooks acknowledged by customer	< 99% over a 1-hour window
`integration.ratelimit.exhaustion_count`	Number of times a provider's rate limit was hit	> 0 (indicates capacity planning needed)

Every metric should carry these dimensions: provider, environment_id, integrated_account_id, and operation (list, get, create, update, delete). This lets you slice dashboards per-customer and per-provider during an incident.
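To make the dimension requirement concrete, here is an in-memory TypeScript sketch of the `integration.request.error_rate` check. It can be sliced by any subset of the dimensions. In production this would be a query in your metrics backend, and the sample shape is an assumption.

```typescript
type Dimensions = {
  provider: string;
  environment_id: string;
  integrated_account_id: string;
  operation: "list" | "get" | "create" | "update" | "delete";
};
type Sample = { at: number; status: number; dims: Dimensions };

const WINDOW_MS = 5 * 60 * 1000; // 5-minute window
const ERROR_RATE_THRESHOLD = 0.05; // alert above 5%

// Fraction of non-2xx responses in the trailing window, filtered by any subset of dimensions
function errorRate(samples: Sample[], filter: Partial<Dimensions>, now: number): number {
  const keys = Object.keys(filter) as (keyof Dimensions)[];
  const recent = samples.filter(
    (s) => s.at <= now && now - s.at < WINDOW_MS && keys.every((k) => s.dims[k] === filter[k]),
  );
  if (recent.length === 0) return 0;
  const errors = recent.filter((s) => s.status < 200 || s.status >= 300).length;
  return errors / recent.length;
}

function shouldAlert(samples: Sample[], filter: Partial<Dimensions>, now: number): boolean {
  return errorRate(samples, filter, now) > ERROR_RATE_THRESHOLD;
}
```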

For audit trail completeness, your observability pipeline should be able to answer these questions within 60 seconds:

Which integrated account made a specific API call at a specific time?
What was the HTTP status code and response latency for that call?
Did a token refresh occur during that request lifecycle?
Were any PII fields present in the request or response, and were they redacted before log persistence?

Tip

If you log raw request bodies, run a redaction layer before they hit persistent storage. Regex-based PII redaction is fine for known fields (email, ssn, phone), but pair it with allow-listing for high-sensitivity integrations like HRIS and payroll.
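Here is one way to combine the two approaches in TypeScript. Known sensitive keys are masked wherever they appear, and free-text strings are scrubbed for email and SSN patterns. Integrations with an allow-list (here a hypothetical `hris` entry) keep only the listed top-level fields. The key names, patterns and allow-list contents are assumptions to adapt, not a complete PII detector.

```typescript
// Keys whose values are always masked, wherever they appear (assumed list)
const SENSITIVE_KEYS = /^(email|ssn|phone|phone_number|date_of_birth)$/i;
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const US_SSN = /\b\d{3}-\d{2}-\d{4}\b/g;

// High-sensitivity integrations log only allow-listed top-level fields (hypothetical contents)
const ALLOW_LISTS: Record<string, ReadonlySet<string>> = {
  hris: new Set(["id", "department", "employment_status", "updated_at"]),
};

function redactValue(key: string, value: unknown): unknown {
  if (SENSITIVE_KEYS.test(key)) return "[REDACTED]";
  if (typeof value === "string") return value.replace(EMAIL, "[REDACTED]").replace(US_SSN, "[REDACTED]");
  if (Array.isArray(value)) return value.map((item) => redactValue(key, item));
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(Object.entries(value).map(([k, v]) => [k, redactValue(k, v)]));
  }
  return value;
}

function redactBody(integration: string, body: Record<string, unknown>): Record<string, unknown> {
  const allow = ALLOW_LISTS[integration];
  const kept = Object.entries(body).filter(([k]) => allow === undefined || allow.has(k));
  return Object.fromEntries(kept.map(([k, v]) => [k, redactValue(k, v)]));
}
```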

Log Sanitization: Sample Queries That Prove PII Removal

A policy that says "we redact PII from logs" is a claim. A query that returns zero rows against your log store is evidence. Bake these into CI and ship the pass/fail record to the auditor as part of your evidence pack.

The pattern is the same across every backend: define a set of PII-shaped patterns, run them against the log store, and assert zero matches (allowing only known-safe canary values that your own tests inject).

Splunk (SPL): hunt for anything that looks like an email address in the boundary index.

index=integration_boundary earliest=-24h
| regex _raw="[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
| where NOT match(_raw, "@example\.test")
| stats count as leaked_emails
| where leaked_emails > 0

Expected result: zero rows. Any hit means a redactor bypass leaked an email address into a log.

Datadog Logs (search syntax): any log record that still carries a raw request or response body field.

service:integration-proxy (@request.body:* OR @response.body:*)

Expected result: zero events. The allowlist should have dropped both fields at ingestion.

Warehouse SQL: scan the last day of boundary logs for authorization headers or bearer tokens that slipped through.

SELECT COUNT(*) AS leaked_secrets
FROM integration_boundary_logs
WHERE occurred_at > NOW() - INTERVAL '24 hours'
  AND (
    payload::text ILIKE '%Bearer %'
    OR payload::text ILIKE '%X-Api-Key%: %'
    OR payload::text ~ '"access_token"\s*:\s*"[A-Za-z0-9._-]{20,}"'
    OR payload::text ~ '"refresh_token"\s*:\s*"[A-Za-z0-9._-]{20,}"'
  );
-- Expected: 0

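Elastic (Lucene): high-cardinality PII patterns (SSNs, 13- to 16-digit card numbers and email addresses) that the standard-pii preset should have caught. Lucene regular expressions must match a whole term, and they do not support \d, \b or inline comments. These patterns therefore use explicit character classes and assume an unanalyzed keyword sub-field named message.keyword:

message.keyword:/.*[0-9]{3}-[0-9]{2}-[0-9]{4}.*/ OR
message.keyword:/.*([0-9]( |-)?){12,15}[0-9].*/ OR
message.keyword:/.*[A-Za-z0-9._%+\-]+\@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}.*/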

Wrap these in a nightly job that fails loudly when any query returns a non-zero count:

#!/usr/bin/env bash
# assert-log-sanitization.sh - run in CI + nightly cron
set -euo pipefail
 
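run_query() {
  local backend="$1" query="$2" description="$3"
  local count
  # ./scripts/query.sh is a per-backend adapter that prints a single integer count
  count=$(./scripts/query.sh "$backend" "$query")
  # Empty or non-numeric output means the query itself broke; bash arithmetic
  # would otherwise evaluate it as 0 and report a false PASS
  if ! [[ "$count" =~ ^[0-9]+$ ]]; then
    echo "ERROR [$backend]: $description returned non-numeric output: '$count'"
    exit 2
  fi
  if [[ "$count" -gt 0 ]]; then
    echo "FAIL [$backend]: $description returned $count rows"
    exit 1
  fi
  echo "PASS [$backend]: $description"
}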
 
run_query splunk    "$SPLUNK_EMAIL_HUNT"      "emails in boundary logs"
run_query datadog   "$DD_BODY_HUNT"           "raw bodies in proxy logs"
run_query warehouse "$SQL_SECRET_HUNT"        "bearer tokens in warehouse"
run_query elastic   "$ES_PII_HUNT"            "SSN / card / email in Elastic"
 
echo "All sanitization checks passed."

The audit-facing artifact is not the queries. It is the archived run history showing green results across the SOC 2 observation window. When the auditor asks "how do you know PII is not in your logs," you hand them 365 nightly runs of assert-log-sanitization.sh.

Log Retention Defaults and Configuration

Different log categories carry different retention obligations. A single blanket retention window either over-retains sensitive data (increasing your DPIA footprint) or under-retains security evidence (failing your SOC 2 auditor). The defensible pattern is tiered retention, with each tier tied to a compliance justification.

SOC 2 leaves retention duration to the service provider's discretion, but many frameworks provide concrete guidance: NERC specifies six months for log retention and three years for audit records; ISO 27001 recommends keeping at least twelve months of logs to demonstrate control effectiveness; the Sarbanes-Oxley Act requires financial audit logs to be retained for seven years; PCI DSS 4.0 mandates twelve months of history with three months readily available. The common baseline, and what most auditors expect, is 12 months for security-relevant logs: 90 days in hot storage and the remainder in cold storage.

Recommended defaults for integration boundary logging:

Log tier	Contents	Hot retention	Cold archive	Justification
Boundary metadata	Correlation IDs, provider, status, latency, method, path template	90 days	13 months	SOC 2 CC7.2 lookback
OAuth token events	Refresh outcomes, revocation events, `invalid_grant` occurrences (no token values)	90 days	13 months	SOC 2 CC6.1 authentication evidence
MCP tool invocations	Tool name, integrated account ID, timestamp, result status (no arguments, no response bodies)	90 days	13 months	AI access governance and review
Webhook delivery events	Event ID, subscription ID, delivery outcome, retry count	90 days	13 months	Non-repudiation of delivery
Redacted headers only	Header names, response status, size, `Retry-After`	30 days	none	Operational debugging only
Full request/response bodies	Never persisted	0	0	Zero data retention

The rationale for the 13-month cold archive is straightforward: SOC 2 auditors want evidence that covers the observation window - typically twelve months for an annual Type II - plus a reasonable buffer for sample requests that come in after the window closes. In practice this means logs from at least thirteen months ago must be retrievable.

Configuration levers to expose per customer:

Retention override per environment: some tenants require 30-day retention for GDPR minimization; others need 24 months for regulated industries. Make it configurable, not hardcoded.
Legal hold flag: records tagged with legal_hold: true bypass retention deletion until the flag is removed.
Field-level redaction rules: allow customers to add regex patterns for domain-specific PII (case numbers, patient IDs, account numbers).
Tenant-specific storage region: for EU data residency, pin cold archive to an EU region.
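One way to keep the tiered defaults and the per-customer levers in one decision point is a pure function that the deletion job evaluates for each record. This is a sketch under assumptions:

- The tier names, the TenantRetention shape, and the 30-day month approximation are hypothetical.
- Region pinning is left to the storage layer.
- Full request/response bodies have no tier at all, because they are never persisted.

```typescript
type Tier =
  | 'boundary_metadata'
  | 'oauth_token_events'
  | 'mcp_tool_invocations'
  | 'webhook_delivery_events'
  | 'redacted_headers'

type RetentionAction = 'keep_hot' | 'move_to_cold' | 'delete'

const DAY_MS = 24 * 60 * 60 * 1000
const THIRTEEN_MONTHS_DAYS = 13 * 30 // months approximated as 30 days

// Defaults from the retention table: hot window and total retention, in days.
const TIER_DEFAULTS: Record<Tier, { hotDays: number; totalDays: number }> = {
  boundary_metadata: { hotDays: 90, totalDays: THIRTEEN_MONTHS_DAYS },
  oauth_token_events: { hotDays: 90, totalDays: THIRTEEN_MONTHS_DAYS },
  mcp_tool_invocations: { hotDays: 90, totalDays: THIRTEEN_MONTHS_DAYS },
  webhook_delivery_events: { hotDays: 90, totalDays: THIRTEEN_MONTHS_DAYS },
  redacted_headers: { hotDays: 30, totalDays: 30 }, // no cold archive
}

interface TenantRetention {
  totalRetentionDays?: number // per-environment override, e.g. 30 for GDPR minimization
}

interface LogRecordMeta {
  tier: Tier
  occurredAt: Date
  legalHold?: boolean
}

function retentionAction(record: LogRecordMeta, tenant: TenantRetention, now: Date): RetentionAction {
  const ageDays = (now.getTime() - record.occurredAt.getTime()) / DAY_MS
  const defaults = TIER_DEFAULTS[record.tier]
  // The override replaces the total window; the hot window never outlives it
  const totalDays = tenant.totalRetentionDays ?? defaults.totalDays
  const hotDays = Math.min(defaults.hotDays, totalDays)

  if (ageDays > totalDays) {
    // legal_hold: true bypasses deletion until the flag is removed
    return record.legalHold ? 'move_to_cold' : 'delete'
  }
  return ageDays > hotDays ? 'move_to_cold' : 'keep_hot'
}
```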

One trap to avoid: a policy that says "12-month retention" while your log storage automatically purges at 30 days is a finding waiting to happen. Check that the lifecycle rules on every log store match the written policy. Also set up alerts that fire if log ingestion stops unexpectedly. A gap in your logs during the audit period creates questions your auditor will probe, even if a misconfiguration caused it rather than malicious activity. Continuous log ingestion verification is a control your auditor will appreciate seeing.
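A minimal sketch of ingestion-gap detection follows. It assumes your log store can return zero-filled, evenly spaced event counts per time bucket; a store that omits empty buckets would need them filled in first. The findIngestionGaps name and bucket shape are hypothetical.

```typescript
interface Bucket {
  start: number // bucket start, Unix ms
  count: number // events ingested in this bucket
}

interface Gap {
  from: number
  to: number
}

// Return runs of consecutive zero-count buckets lasting at least minGapMs.
// Buckets must be sorted by start and spaced exactly bucketMs apart.
function findIngestionGaps(buckets: Bucket[], bucketMs: number, minGapMs: number): Gap[] {
  const gaps: Gap[] = []
  let runStart: number | null = null
  for (const b of buckets) {
    if (b.count === 0) {
      if (runStart === null) runStart = b.start
    } else if (runStart !== null) {
      if (b.start - runStart >= minGapMs) gaps.push({ from: runStart, to: b.start })
      runStart = null
    }
  }
  // A run still open at the end is a gap that is ongoing right now
  if (runStart !== null && buckets.length > 0) {
    const end = buckets[buckets.length - 1].start + bucketMs
    if (end - runStart >= minGapMs) gaps.push({ from: runStart, to: end })
  }
  return gaps
}
```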

SIEM Export Formats and Integration Examples

Enterprise customers rarely want to log into a proprietary console. They want to pipe integration logs into their existing SIEM (Splunk, Datadog, Sumo Logic, Elastic Security, Chronicle) so their security team can correlate integration events with the rest of their environment.

Three export formats cover the vast majority of enterprise SIEM deployments:

1. JSON Lines (newline-delimited JSON) via HTTPS push or object storage sink. The default. Every log entry is a single JSON object per line. Splunk HEC, Datadog, Elastic, and most modern SIEMs ingest this format natively.

{"timestamp":"2026-10-14T08:12:33Z","event":"integration.request","correlation_id":"req_987","integrated_account_id":"acc_12345","provider":"salesforce","method":"PATCH","path_template":"/sobjects/Contact/{id}","status":429,"latency_ms":412,"upstream_request_id":"a3f8x91"}
{"timestamp":"2026-10-14T08:12:34Z","event":"integration.token_refresh","integrated_account_id":"acc_12345","provider":"salesforce","outcome":"success","latency_ms":180}
{"timestamp":"2026-10-14T08:12:36Z","event":"mcp.tool_call","integrated_account_id":"acc_12345","mcp_server_id":"mcp_abc","tool":"list_all_salesforce_contacts","status":"success","latency_ms":892}

2. Syslog (RFC 5424) over TCP with TLS. Required by legacy SIEMs and many on-premise SOCs. Structured data goes in the [integration@32473 ...] block:

<134>1 2026-10-14T08:12:33Z integrations.example.com integration-boundary - REQ [integration@32473 correlation_id="req_987" integrated_account_id="acc_12345" provider="salesforce" method="PATCH" status="429" latency_ms="412"] Salesforce PATCH returned 429
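A formatter for that line is short, but two details are easy to get wrong:

- PRI is facility * 8 + severity.
- SD-PARAM values must escape the double quote, backslash, and closing bracket characters (RFC 5424, section 6.3.3).

The sketch below is a hypothetical helper, not a specific library's API. PROCID is left as the nil value "-", as in the sample.

```typescript
// Escape an SD-PARAM value per RFC 5424: '"', '\' and ']' get a leading backslash.
function escapeSdValue(value: string): string {
  return value.replace(/["\\\]]/g, (ch) => `\\${ch}`)
}

function formatSyslog5424(opts: {
  facility: number  // e.g. 16 = local0
  severity: number  // e.g. 6 = informational
  timestamp: string // RFC 3339, e.g. 2026-10-14T08:12:33Z
  hostname: string
  appName: string
  msgId: string
  sdId: string      // e.g. integration@32473
  params: Record<string, string | number>
  message: string
}): string {
  const pri = opts.facility * 8 + opts.severity
  const sdParams = Object.entries(opts.params)
    .map(([key, value]) => `${key}="${escapeSdValue(String(value))}"`)
    .join(' ')
  // HEADER fields in order: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID
  return `<${pri}>1 ${opts.timestamp} ${opts.hostname} ${opts.appName} - ${opts.msgId} [${opts.sdId} ${sdParams}] ${opts.message}`
}
```

Called with facility 16 (local0) and severity 6 (informational) and the field values from the sample, it produces the sample line exactly.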

3. Webhook push to a customer-controlled endpoint. For customers who route logs through their own event pipeline (Kafka, EventBridge, Kinesis) before landing in the SIEM. Every batch is HMAC-SHA256 signed, timestamped, and includes an event ID for idempotency - the same webhook hardening rules from Step 5 apply here.

POST https://customer-siem-ingest.example.com/hooks/integration-logs
X-Log-Signature: sha256=<hmac>
X-Log-Timestamp: 1728896553
X-Log-Batch-Id: log_batch_01HFA...
Content-Type: application/json
 
{
  "batch_id": "log_batch_01HFA...",
  "count": 3,
  "events": [
    { "timestamp": "2026-10-14T08:12:33Z", "event": "integration.request", "...": "..." },
    { "timestamp": "2026-10-14T08:12:34Z", "event": "integration.token_refresh", "...": "..." },
    { "timestamp": "2026-10-14T08:12:36Z", "event": "mcp.tool_call", "...": "..." }
  ]
}

Configuration surface to expose per customer:

Setting	Options	Default
Export format	`jsonl`, `syslog-5424`, `webhook`, `cef`	`jsonl`
Transport	HTTPS push, S3/GCS/Azure Blob sink, syslog TCP+TLS	HTTPS push
Batch size	1 - 1000 events	100
Batch flush interval	1 - 60 seconds	5 seconds
Field allowlist	Named allowlist (see below)	`metadata-only`
Redaction preset	`standard-pii`, `hris-strict`, `finance-strict`, `custom`	`standard-pii`
Compression	`none`, `gzip`, `zstd`	`gzip`
At-least-once delivery	On / off	On

For SIEM correlation, every event carries the same required dimensions: correlation_id, integrated_account_id, environment_id, provider, and event. This lets a security analyst pivot from a suspicious authentication event in Okta to the exact integration API call it triggered in seconds.
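An export pipeline can enforce those required dimensions before a batch leaves the platform. A minimal sketch is below, assuming events are plain objects. The missingDimensions name is hypothetical, and whether you drop, quarantine, or alert on a failing event is a policy choice.

```typescript
const REQUIRED_DIMENSIONS = [
  'correlation_id',
  'integrated_account_id',
  'environment_id',
  'provider',
  'event',
] as const

// Return the required SIEM correlation dimensions that are missing or empty on an event.
function missingDimensions(event: Record<string, unknown>): string[] {
  return REQUIRED_DIMENSIONS.filter((key) => {
    const value = event[key]
    return value === undefined || value === null || value === ''
  })
}
```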

Metadata Allowlist and Sample Log Entries

Zero data retention means the log record must be metadata about the request, not the request itself. Enforce this with a strict allowlist at the logging middleware - anything not on the list is dropped before the record is ever constructed.

Default metadata allowlist for boundary logs:

# integration-log-allowlist.yaml
# Fields explicitly permitted in boundary logs. Everything else is dropped.
 
event:
  - timestamp                    # ISO 8601, UTC
  - event_type                   # integration.request | .token_refresh | .webhook_delivery | mcp.tool_call
  - correlation_id               # x-request-id
  - integrated_account_id
  - environment_id
  - team_id
  - provider                     # salesforce | hubspot | ...
 
request:
  - http_method
  - path_template                # /sobjects/{object}/{id} - never the resolved path
  - query_param_names            # names only, never values
  - header_names                 # names only, never values
 
response:
  - status_code
  - latency_ms
  - upstream_request_id          # provider's request ID
  - ratelimit_limit
  - ratelimit_remaining
  - ratelimit_reset
  - error_code                   # normalized code, never the message body
 
mcp:
  - mcp_server_id
  - tool_name
  - tool_result_status           # success | error - never the tool result payload
  - argument_field_names         # names only, never values
 
# Explicit deny list - dropped even if a bug allows them past the middleware
deny:
  - request_body
  - response_body
  - authorization_header
  - cookie_header
  - api_key
  - refresh_token
  - access_token
  - customer_pii.*               # any nested key under customer_pii
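A minimal sketch of the middleware filter follows. It assumes the YAML has been loaded into a set of dotted leaf paths such as request.http_method. The YAML's section names do not map one-to-one onto the nested sample records, so that flattening step is an assumption. Objects are traversed rather than copied whole. Deny-listed keys are dropped at any depth, even if an allowlist entry names them.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }
type JsonObject = { [key: string]: Json }

// Mirrors the YAML deny list; "customer_pii" drops the whole subtree (customer_pii.*).
const DENY_KEYS = new Set([
  'request_body', 'response_body', 'authorization_header', 'cookie_header',
  'api_key', 'refresh_token', 'access_token', 'customer_pii',
])

function isPlainObject(value: Json): value is JsonObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Keep only allow-listed leaf paths; anything else is dropped before the record is built.
function applyAllowlist(record: JsonObject, allowed: ReadonlySet<string>, prefix = ''): JsonObject {
  const out: JsonObject = {}
  for (const [key, value] of Object.entries(record)) {
    if (DENY_KEYS.has(key)) continue // deny wins, even over an allowlist entry
    const path = prefix ? `${prefix}.${key}` : key
    if (isPlainObject(value)) {
      const nested = applyAllowlist(value, allowed, path)
      if (Object.keys(nested).length > 0) out[key] = nested
    } else if (allowed.has(path)) {
      out[key] = value // primitives and name lists such as header_names
    }
  }
  return out
}
```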

Sample compliant log entry (integration boundary):

{
  "timestamp": "2026-10-14T08:12:33Z",
  "event_type": "integration.request",
  "correlation_id": "req_987654321",
  "integrated_account_id": "acc_12345",
  "environment_id": "env_prod",
  "team_id": "team_abc",
  "provider": "salesforce",
  "request": {
    "http_method": "PATCH",
    "path_template": "/services/data/v60.0/sobjects/Contact/{id}",
    "query_param_names": ["fields"],
    "header_names": ["Content-Type", "Authorization", "X-Request-Id"]
  },
  "response": {
    "status_code": 429,
    "latency_ms": 412,
    "upstream_request_id": "a3f8x91-4c2b-4e",
    "ratelimit_remaining": 0,
    "ratelimit_reset": 1728896912,
    "error_code": "REQUEST_LIMIT_EXCEEDED"
  }
}

Sample compliant log entry (MCP tool invocation):

{
  "timestamp": "2026-10-14T08:12:36Z",
  "event_type": "mcp.tool_call",
  "correlation_id": "req_987654322",
  "integrated_account_id": "acc_12345",
  "environment_id": "env_prod",
  "mcp_server_id": "mcp_abc123",
  "tool_name": "list_all_salesforce_contacts",
  "argument_field_names": ["limit", "next_cursor"],
  "tool_result_status": "success",
  "latency_ms": 892,
  "upstream_request_id": "sf_ray_9871"
}

Note what is not there: no argument values, no response payload, no customer names, no email addresses. The log tells you a specific integrated account listed contacts at a specific time and succeeded. It does not tell you which contacts were returned or what filters were applied. That is the whole point.

Step 4: Evaluating Third-Party API SLAs and Fallback Patterns

Definition: Third-party API SLA management is the architectural practice of protecting your core application from upstream latency, unannounced rate limits, and provider downtime by implementing strict timeouts, normalized error handling, and standardized retry semantics.

Here's the brutal reality: your customer-facing 99.9% SLA is mathematically impossible if you depend on five third-party APIs each running at 99.46% with no fallback. The math compounds against you. A request that needs all five succeeds only when all five are up, so their availabilities multiply to roughly 97.3%. If you treat a 200 OK from Salesforce and a 200 OK from a legacy on-premise ERP as equally reliable, your system will eventually suffer catastrophic cascading failures.
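A small helper makes the compounding explicit. For dependencies called in series with no fallback, availabilities multiply. The helper also converts the result into a downtime budget. Treating failures as independent is a simplification; correlated outages make the picture worse, not better.

```typescript
// Availability of dependencies called in series with no fallback: every one must be up.
function seriesAvailability(availabilities: number[]): number {
  return availabilities.reduce((acc, a) => acc * a, 1)
}

// Convert an availability fraction into expected downtime over a window (default 30 days).
function downtimeMinutes(availability: number, windowDays = 30): number {
  return (1 - availability) * windowDays * 24 * 60
}
```

For the five-dependency example, this works out to about 19 hours of expected downtime per 30-day month. A 99.9% SLA allows roughly 43 minutes.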

What your audit needs to capture per integration:

Field	Example (Salesforce)	Example (HubSpot)
Vendor uptime SLA	99.9% (Enterprise)	99.95% (Enterprise hub)
Rate limit model	Per-org, 24h rolling	Per-app, 10-second window
Rate limit error shape	HTTP 403 + `REQUEST_LIMIT_EXCEEDED`	HTTP 429 + `Retry-After`
Webhook delivery guarantee	At-least-once, no order	At-least-once, no order
Breaking-change notice	12 months (REST)	Variable
Support response time	Premier: 1 hour	Enterprise: 2 hours

The SLA and Rate Limit Audit Checklist

To pass an enterprise architecture review, you must prove that your system degrades gracefully when upstream APIs fail:

Normalize rate limit headers: Different APIs express rate limits differently. HubSpot uses X-HubSpot-RateLimit-Remaining, while Zendesk uses RateLimit-Remaining. Your integration layer must intercept these proprietary headers and normalize them into one consistent set based on the IETF RateLimit header fields draft: ratelimit-limit, ratelimit-remaining, and ratelimit-reset.
Pass HTTP 429s back to the caller: Do not attempt to absorb or artificially retry rate limit errors inside the integration middleware. When the upstream provider returns an HTTP 429 (Too Many Requests), pass that 429 directly back to your core application. The caller - armed with the standardized ratelimit-reset header - is responsible for implementing the exponential backoff and retry logic.
Standardize error payloads using JSONata: When an upstream API fails, it will return a proprietary error schema. Use JSONata expressions to evaluate the error response and extract a structured error message. This ensures your application code only ever has to handle one unified error format, regardless of which API failed.
Enforce strict timeout boundaries: Never allow an outbound API request to hang indefinitely. Implement hard timeouts (e.g., 15 seconds for standard REST calls) to prevent upstream latency from exhausting your server's connection pool.
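The header normalization step can be a lookup table applied to the upstream response. The upstream status code, including a 429, passes through untouched. This is a sketch, not a vetted mapping:

- Only X-HubSpot-RateLimit-Remaining comes from this runbook.
- The x-hubspot-ratelimit-max mapping and the Zendesk entries are assumptions to verify against each provider's documentation.
- Providers that express reset as an interval rather than a timestamp need extra handling, which is omitted here.

```typescript
type HeaderBag = Record<string, string>

// Normalized header name -> provider-specific source header (lowercase). Illustrative entries.
const RATE_LIMIT_HEADER_MAP: Record<string, HeaderBag> = {
  hubspot: {
    'ratelimit-limit': 'x-hubspot-ratelimit-max',
    'ratelimit-remaining': 'x-hubspot-ratelimit-remaining',
  },
  zendesk: {
    'ratelimit-limit': 'ratelimit-limit',
    'ratelimit-remaining': 'ratelimit-remaining',
    'ratelimit-reset': 'ratelimit-reset',
  },
}

function normalizeRateLimitHeaders(provider: string, upstream: HeaderBag): HeaderBag {
  // HTTP header names are case-insensitive, so index them in lowercase
  const lower: HeaderBag = {}
  for (const [name, value] of Object.entries(upstream)) lower[name.toLowerCase()] = value

  const out: HeaderBag = {}
  const mapping = RATE_LIMIT_HEADER_MAP[provider] ?? {}
  for (const [normalized, source] of Object.entries(mapping)) {
    if (lower[source] !== undefined) out[normalized] = lower[source]
  }
  // Keep Retry-After so the caller can back off on a passed-through 429
  if (lower['retry-after'] !== undefined) out['retry-after'] = lower['retry-after']
  return out
}
```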

Why shouldn't the integration layer silently absorb rate limits? Because the right backoff depends on context: a user-facing request needs to fail fast, a background sync should exponentially back off, and a bulk import might be better served by switching to a queue. An integration platform that secretly retries on your behalf takes that decision away from you and burns your customer's quota. For more details, see our guide on best practices for handling API rate limits.

Fallback patterns to document

Circuit breaker per upstream: open the circuit after N consecutive 5xx errors, route to a cached read or a graceful error.
Idempotency keys on writes: so retries after a network blip don't create duplicate records.
Webhook + polling hybrid: webhooks are best-effort, so a daily reconciliation sync catches dropped events.
Degradation modes: which features stay online if HubSpot is down? Document them.
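A per-upstream circuit breaker can be sketched in a few dozen lines. This version opens after threshold consecutive 5xx responses. While open, it tells the caller to serve the cached read or graceful error. After cooldownMs it allows a single trial request. The class name, parameters, and the choice to treat only 5xx as failures are assumptions. A 429 is deliberately not counted, because it is passed through to the caller rather than tripped on.

```typescript
type CircuitState = 'closed' | 'open' | 'half_open'

class CircuitBreaker {
  private state: CircuitState = 'closed'
  private consecutiveFailures = 0
  private openedAt = 0

  constructor(private readonly threshold: number, private readonly cooldownMs: number) {}

  // Ask before each upstream call; false means use the fallback path instead.
  canRequest(now: number): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open' // let exactly one trial request through
      return true
    }
    return this.state === 'closed'
  }

  // Record the upstream HTTP status after each call.
  record(status: number, now: number): void {
    if (status >= 500) {
      this.consecutiveFailures++
      // A failed trial reopens immediately; otherwise open once the threshold is hit
      if (this.state === 'half_open' || this.consecutiveFailures >= this.threshold) {
        this.state = 'open'
        this.openedAt = now
      }
    } else {
      this.consecutiveFailures = 0
      this.state = 'closed'
    }
  }

  getState(): CircuitState {
    return this.state
  }
}
```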

SLA Checklist for Procurement

When your customer's procurement team is auditing you (or when you are auditing an API aggregator on behalf of your own procurement team), this is the list of items that needs a documented answer in writing. Every "no" or "we'll get back to you" is a follow-up conversation that stretches the deal timeline.

Availability and reliability

Aggregator uptime SLA stated in the MSA (target percentage, measurement window, exclusions)
Service credits or remedies defined if the SLA is missed
Public status page with 12-month uptime history, subscribable via RSS or webhook
Incident post-mortems published within 5 business days of a customer-impacting event
Historical incident count in the last 12 months, with average time-to-recover

Sub-processor and upstream dependency management

Sub-processor list current within the last 90 days, notified via subscribable channel on change
Upstream SLA disclosure: what SLA does the aggregator inherit from Salesforce, HubSpot, NetSuite, etc.?
Rate limit accounting: is quota per-customer or shared across the aggregator's tenant pool?
Rate limit header visibility: does the aggregator expose upstream ratelimit-* headers, or absorb them silently?
Breaking-change notice period for the aggregator's own API contract (minimum 90 days)
Connector deprecation policy with minimum 12 months of notice

Data handling and residency

Zero data retention stated explicitly in the DPA, not just marketing pages
Data residency guarantees per region (EU / US / APAC) with contractual commitment
Encryption at rest with per-tenant key isolation, BYOK option available on the enterprise tier
Key rotation policy with evidence of last rotation and rotation cadence
Break-glass procedure for engineer access to customer credentials, with audit trail

Security assurance

Current SOC 2 Type II report covering the audit period, not just Type I. A Type I report evaluates whether controls are designed properly at a single point in time. Type II evaluates whether those controls operated effectively over three to twelve months. Enterprise buyers almost always ask for Type II.
ISO 27001, HIPAA BAA, GDPR Article 28 DPA available on request
Penetration test summary available under NDA, dated within the last 12 months
Vulnerability disclosure program with published response SLAs
Right-to-audit clause in the MSA, or acceptance of a customer-driven security questionnaire annually
Complementary user entity controls (CUECs) documented so your team knows what controls you are expected to operate on your side. A vendor's SOC 2 report often assumes you will configure MFA, restrict admins, review access, manage API keys, or monitor activity inside your own account.

Operational transparency

Log export to customer-owned SIEM (JSONL, syslog-5424, or signed webhook push)
Log retention configurability per tenant (minimum 30 days, maximum defined by contract)
Metrics API or customer-facing dashboards for per-integration health
Webhook signing on all outbound deliveries with HMAC-SHA256 and timestamp inclusion

Incident and exit

Incident notification SLA for customer-impacting security events (typically within 72 hours)
Data return and deletion procedure with signed attestation on contract termination
Vendor exit playbook: how does your team migrate off if the aggregator is acquired, changes pricing, or goes out of business?

Hand this list to the aggregator during evaluation. The speed and completeness of their answers is a leading indicator of how the relationship will feel three years in.

Step 5: Webhook Security - Signatures, Replay Protection, and Validation

Definition: Webhook hardening is the practice of securing both inbound (from third-party providers) and outbound (to your customers) webhook endpoints against spoofing, replay attacks, and payload tampering through cryptographic signatures, timestamp validation, and idempotent processing.

Webhooks are publicly accessible HTTP endpoints. Every URL you expose is an attack surface. A forged webhook from a spoofed provider can inject malicious data into your system. A replayed webhook can cause duplicate payments, duplicate records, or unauthorized state changes. Industry surveys indicate that 78% of SaaS platforms now expose webhook endpoints, yet only 30% of organizations implement replay attack protection, and webhook vulnerabilities account for 12% of API-related security incidents.

Inbound Webhook Hardening

When receiving webhooks from third-party providers, verify every request before processing:

HMAC signature verification: Most providers sign payloads using a shared secret. Your integration layer must recompute the HMAC over the raw request body and compare it against the provider's signature header. Regular string equality short-circuits, returning false as soon as it finds a mismatched character. An attacker can time the responses to figure out the signature byte by byte. Use hmac.Equal (Go), crypto.timingSafeEqual (Node), or hmac.compare_digest (Python).

Two common implementation mistakes to avoid: parsing the body before verifying the signature (the signature is computed against the raw bytes, and re-parsed output might have different whitespace or key ordering - always read the raw body first, verify, then parse), and verifying against a single hardcoded secret when the provider supports key rotation.

Timestamp validation for replay protection: A valid signature alone is not enough. A replay attack occurs when an attacker captures a valid, signed webhook request and re-sends it to trigger duplicate processing. Prevent this with timestamp validation (reject requests older than 5 minutes) and idempotent processing. The timestamp must be included in the signed content - otherwise an attacker can replace the timestamp header without invalidating the signature.

function isWebhookTimestampValid(
  timestampHeader: string,
  toleranceSec = 300
): boolean {
  const webhookTime = parseInt(timestampHeader, 10)
  const now = Math.floor(Date.now() / 1000)
  return Math.abs(now - webhookTime) <= toleranceSec
}

Schema validation: After the cryptographic checks pass, validate the payload against a strict JSON schema defined for each webhook event type. Reject requests with unexpected fields, missing required data, or invalid data types before any handler runs. This blocks injection attempts that exploit loose parsing in your event handlers.
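In production this is usually a JSON Schema validator (for example Ajv) with additionalProperties set to false. The dependency-free sketch below shows the three rejection rules on flat payloads. The EventSchema shape and the contact.updated fields in the usage are hypothetical.

```typescript
type FieldType = 'string' | 'number' | 'boolean'

interface EventSchema {
  required: Record<string, FieldType>
  optional?: Record<string, FieldType>
}

// Returns an empty array when the payload is acceptable, otherwise one message per violation.
function validateStrict(payload: unknown, schema: EventSchema): string[] {
  if (typeof payload !== 'object' || payload === null || Array.isArray(payload)) {
    return ['payload must be a JSON object']
  }
  const record = payload as Record<string, unknown>
  // A Map avoids inherited keys such as "toString" slipping through an `in` check
  const allowed = new Map<string, FieldType>([
    ...Object.entries(schema.optional ?? {}),
    ...Object.entries(schema.required),
  ])
  const required = new Set(Object.keys(schema.required))
  const errors: string[] = []

  for (const key of Object.keys(record)) {
    if (!allowed.has(key)) errors.push(`unexpected field: ${key}`)
  }
  for (const [key, type] of allowed) {
    const value = record[key]
    if (value === undefined) {
      if (required.has(key)) errors.push(`missing required field: ${key}`)
    } else if (typeof value !== type) {
      errors.push(`wrong type for ${key}: expected ${type}`)
    }
  }
  return errors
}
```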

Full Inbound Verification Code

The checklist items above map to a single verification function that runs at the very edge of your webhook receiver, before any parsing or business logic. Every step is ordered deliberately: cheap checks first (timestamp), expensive checks last (HMAC), rotation-aware, and never leaking timing signal.

// verify-inbound-webhook.ts
import { createHmac, timingSafeEqual } from 'node:crypto'
 
type VerifyResult =
  | { ok: true }
  | { ok: false; reason: 'bad_timestamp' | 'stale_timestamp' | 'bad_signature_length' | 'signature_mismatch' }
 
export function verifyInboundWebhook(input: {
  rawBody: Buffer         // exact bytes as received - NOT re-serialized JSON
  signatureHeader: string // e.g. "sha256=abc123..."
  timestampHeader: string // e.g. "1728896553" (Unix seconds)
  secrets: string[]       // current + previous, in that order, to support rotation
  toleranceSec?: number   // default 300 (5 minutes)
}): VerifyResult {
  const { rawBody, signatureHeader, timestampHeader, secrets, toleranceSec = 300 } = input
 
  // 1. Timestamp check first - cheap, and prevents replay of an old capture
  const ts = parseInt(timestampHeader, 10)
  if (!Number.isFinite(ts)) return { ok: false, reason: 'bad_timestamp' }
  const now = Math.floor(Date.now() / 1000)
  if (Math.abs(now - ts) > toleranceSec) {
    return { ok: false, reason: 'stale_timestamp' }
  }
 
  // 2. Signed content = "<timestamp>." prefix followed by the raw body bytes
  const signedContent = Buffer.concat([
    Buffer.from(`${ts}.`, 'utf8'),
    rawBody,
  ])
 
  // 3. Extract the hex portion of "sha256=<hex>". Buffer.from(..., 'hex') never throws on
  //    bad input; it silently truncates at the first non-hex character, so check the format first.
  const provided = signatureHeader.replace(/^sha256=/, '')
  if (!/^[0-9a-f]{64}$/i.test(provided)) return { ok: false, reason: 'bad_signature_length' }
  const providedBuf = Buffer.from(provided, 'hex') // exactly 32 bytes after the check above
 
  // 4. Try each secret (rotation window) with a constant-time compare
  for (const secret of secrets) {
    const expected = createHmac('sha256', secret).update(signedContent).digest()
    if (expected.length === providedBuf.length && timingSafeEqual(expected, providedBuf)) {
      return { ok: true }
    }
  }
  return { ok: false, reason: 'signature_mismatch' }
}

And the Express-style handler that wires it up. Note the order: read the raw body, verify, then parse. Never parse first.

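import express from 'express'
import { verifyInboundWebhook } from './verify-inbound-webhook'
// Application-specific dependencies; these module paths are placeholders for your own code
import { logger } from './logger'
import { enqueueEvent } from './queue'

const app = express()

app.post(
  '/webhooks/salesforce',
  express.raw({ type: '*/*', limit: '256kb' }), // raw bytes, NOT express.json()
  (req, res) => {
    // express.raw leaves req.body as {} when the request has no body
    if (!Buffer.isBuffer(req.body)) return res.status(400).end()

    const result = verifyInboundWebhook({
      rawBody: req.body,
      signatureHeader: req.header('x-webhook-signature') ?? '',
      timestampHeader: req.header('x-webhook-timestamp') ?? '',
      // Current secret first, then the previous one during a rotation window
      secrets: [process.env.WEBHOOK_SECRET_CURRENT, process.env.WEBHOOK_SECRET_PREVIOUS]
        .filter((s): s is string => Boolean(s)),
    })
    if (!result.ok) {
      // Log the reason but do not echo it back to the caller
      logger.warn({ reason: result.reason }, 'webhook verification failed')
      return res.status(401).end()
    }

    // Safe to parse now
    const event = JSON.parse(req.body.toString('utf8'))
    return enqueueEvent(event)
      .then(() => res.status(202).end())
      .catch((err: unknown) => {
        // Without this, a queue failure becomes an unhandled rejection and the request hangs
        logger.error({ err }, 'failed to enqueue webhook event')
        res.status(500).end()
      })
  }
)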

Outbound Webhook Hardening

When delivering webhooks to your customers, sign every delivery:

HMAC-SHA256 with timestamp inclusion: Sign the string "<timestamp>.<body>", that is, the current Unix timestamp, a dot, and the serialized payload. Then send both the signature and the timestamp in dedicated headers. Serialize the payload once and send exactly those bytes as the body, so the customer verifies the same content you signed. This gives your customers everything they need to verify authenticity and reject stale deliveries on their end.

import { createHmac } from 'node:crypto'

const body = JSON.stringify(payload) // send this exact string as the request body
const timestamp = Math.floor(Date.now() / 1000).toString()
const signature = createHmac('sha256', webhookSecret)
  .update(`${timestamp}.${body}`)
  .digest('hex')

// Headers sent with the webhook delivery
headers['X-Webhook-Signature'] = `sha256=${signature}`
headers['X-Webhook-Timestamp'] = timestamp

The full outbound header contract customers should expect on every delivery:

POST /customer-endpoint HTTP/1.1
Host: customer.example.com
Content-Type: application/json
User-Agent: Integration-Platform-Webhooks/1.0
X-Webhook-Id: evt_01HFA9K2X8...              # unique per delivery, for idempotency
X-Webhook-Timestamp: 1728896553              # Unix seconds, included in the signed content
X-Webhook-Signature: sha256=<hex>            # HMAC-SHA256 of "<timestamp>.<raw_body>"
X-Webhook-Attempt: 1                         # 1..N for retries
X-Webhook-Event-Type: contact.updated
X-Webhook-Subscription-Id: sub_01HFA...

Idempotency keys: Include a unique event ID (X-Webhook-Id) in every delivery. Customers use this ID to deduplicate events on their end - critical when retries produce duplicate deliveries.
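On the consuming side, deduplication can be as simple as remembering recently seen X-Webhook-Id values. This sketch is in-memory and single-process only, and the class name is hypothetical. A horizontally scaled consumer needs a shared store instead, such as a unique constraint on the event ID in its database. The TTL should comfortably exceed the sender's full retry window.

```typescript
class WebhookDeduplicator {
  // webhook id -> first-seen time (ms); Map keeps insertion order, oldest first
  private readonly seen = new Map<string, number>()

  constructor(private readonly ttlMs: number) {}

  // True the first time an id is seen within the TTL window, false for duplicates.
  shouldProcess(webhookId: string, now: number): boolean {
    this.evictExpired(now)
    if (this.seen.has(webhookId)) return false
    this.seen.set(webhookId, now)
    return true
  }

  private evictExpired(now: number): void {
    for (const [id, firstSeen] of this.seen) {
      if (now - firstSeen < this.ttlMs) break // everything after this is newer
      this.seen.delete(id)
    }
  }
}
```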

TLS enforcement: Only deliver webhooks to HTTPS endpoints. Reject http:// target URLs at subscription creation time. Webhook payloads frequently contain customer data and event metadata that must not transit the network in cleartext.
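Enforcing this at subscription creation time takes only the WHATWG URL parser. The sketch below rejects relative or malformed input and any scheme other than https. The function name and result shape are hypothetical.

```typescript
type UrlCheck = { ok: true; url: URL } | { ok: false; reason: string }

// Reject anything but absolute https:// targets when a subscription is created.
function validateWebhookTargetUrl(raw: string): UrlCheck {
  let url: URL
  try {
    url = new URL(raw) // throws on relative or malformed input
  } catch {
    return { ok: false, reason: 'not a valid absolute URL' }
  }
  // The parser lowercases the scheme, so "HTTPS://..." also passes
  if (url.protocol !== 'https:') {
    return { ok: false, reason: `scheme ${url.protocol} is not allowed; use https:` }
  }
  return { ok: true, url }
}
```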

Webhook Delivery Configuration

Enforce these defaults across all webhook subscriptions to balance reliability with security:

Configuration	Recommended value	Rationale
Signature algorithm	HMAC-SHA256	Industry standard, supported by all major providers
Timestamp tolerance	300 seconds (5 minutes)	Accounts for clock drift while blocking replay attacks
Payload size limit	256 KB	Prevents decompression bombs and queue flooding
Delivery timeout	15 seconds	Prevents slow consumer endpoints from blocking the queue
Max delivery retries	5 with exponential backoff	Balances delivery reliability with resource consumption
Failed delivery auto-disable	After 24 hours of consecutive failures	Protects against retry storms and wasted compute
TLS requirement	HTTPS only	Prevents payload interception in transit
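The table fixes the retry count and the auto-disable window but leaves the backoff curve open. The sketch below uses exponential backoff with full jitter and a cap, which is one common choice and an assumption here, as are the base and cap values a caller would pass. The auto-disable check is tracked per subscription across deliveries, because five retries of a single delivery finish long before 24 hours elapse.

```typescript
// Delay before each retry: random in [0, min(cap, base * 2^attempt)), i.e. "full jitter".
function retryDelaysMs(
  maxRetries: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = []
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
    delays.push(Math.floor(random() * ceiling))
  }
  return delays
}

// Disable a subscription once its consecutive failures have lasted the whole window (24h default).
function shouldAutoDisable(
  firstConsecutiveFailureAt: number | null,
  now: number,
  windowMs = 24 * 60 * 60 * 1000,
): boolean {
  return firstConsecutiveFailureAt !== null && now - firstConsecutiveFailureAt >= windowMs
}
```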

The end-to-end flow from provider to customer looks like this:

sequenceDiagram
    participant P as Third-Party Provider
    participant L as Integration Layer
    participant C as Customer Endpoint

    P->>L: POST /webhook (HMAC-signed payload)
    L->>L: Validate timestamp (reject if > 5 min stale)
    L->>L: Verify provider signature (timing-safe)
    L->>L: Validate schema + transform to unified event
    L->>L: Sign unified payload with customer's secret
    L->>C: POST /customer-endpoint (signed + timestamped)
    C->>C: Verify signature + check timestamp
    C-->>L: 200 OK

Step 6: Proving Zero Data Retention with a CI/QA Test Script

Definition: A retention verification test fires a unique canary value through every ingress surface of the integration platform (REST proxy, MCP tool call, inbound webhook) and asserts that the canary appears in zero durable storage locations after an acceptable buffering window.

Enterprise auditors do not accept "we don't store data" as a statement. They accept evidence. The most defensible evidence is a reproducible test that fires a canary value through every write path and confirms the canary appears in exactly zero durable log or storage locations. Run it in CI, archive the pass/fail record, and hand the auditor a year of green runs.

The Canary Probe Pattern

The test is straightforward:

Generate a unique, high-entropy canary string that could not appear organically anywhere in the system.
Fire a real request through the integration proxy (or MCP server) with the canary embedded in the request body, query params, headers, and tool arguments.
Wait past any acceptable in-flight buffering window (30 to 60 seconds is generous).
Scan every durable store - application logs, boundary audit logs, SIEM export buffers, database tables, warehouse tables, object storage - for the canary.
Fail the build if the canary appears anywhere except the ephemeral in-memory request context.

Below is a reference script customers can run against their own tenant. It is intentionally boring - the point is that it is auditable and reproducible.

// zero-retention-check.ts
// Run in CI after every deploy. Fails the build if any request/response
// payload leaks into a durable store.
 
import { execSync } from 'node:child_process'
import { randomBytes } from 'node:crypto'
import { setTimeout as sleep } from 'node:timers/promises'
 
const CANARY = `zdr-canary-${randomBytes(12).toString('hex')}`
const BUFFER_MS = 60_000
 
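async function fireCanaryRequest() {
  // Ingress surface 1: REST unified API
  const rest = await fetch(`${process.env.API_URL}/unified/crm/contacts`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      'Content-Type': 'application/json',
      'X-Canary-Header': CANARY,
    },
    body: JSON.stringify({
      first_name: CANARY,
      last_name: 'RetentionProbe',
      email: `${CANARY}@example.test`,
    }),
  })
  // A probe rejected at the auth gate (e.g. an expired API token) never reaches the
  // payload path, so the scan would pass vacuously. Fail the run instead.
  if (rest.status === 401 || rest.status === 403) {
    throw new Error(`REST probe rejected before the payload path: HTTP ${rest.status}`)
  }

  // Ingress surface 2: MCP tool invocation
  const mcp = await fetch(`${process.env.MCP_URL}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'tools/call',
      params: {
        name: 'create_a_hub_spot_contact',
        arguments: { first_name: CANARY, email: `${CANARY}@example.test` },
      },
    }),
  })
  if (mcp.status === 401 || mcp.status === 403) {
    throw new Error(`MCP probe rejected before the payload path: HTTP ${mcp.status}`)
  }

  // Ingress surface 3: inbound webhook receiver. The probe is unsigned, so rejection at
  // signature verification is expected; the scan then confirms the rejection path
  // did not persist the body.
  await fetch(`${process.env.WEBHOOK_URL}/probe`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ canary: CANARY, event: 'probe' }),
  })
}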
 
function scan(store: string): string {
  // ./scripts/grep-store.sh is a per-store adapter that queries the
  // backend natively (logs API, warehouse SQL, S3 select, etc.).
  return execSync(`./scripts/grep-store.sh ${store} ${CANARY}`, {
    encoding: 'utf8',
  }).trim()
}
 
async function main() {
  await fireCanaryRequest()
  await sleep(BUFFER_MS)
 
  const durableStores = [
    'application-logs',    // observability platform
    'boundary-audit-logs', // signed audit stream
    'siem-export-buffer',  // outbound SIEM forwarder
    'primary-database',    // application DB tables
    'analytics-warehouse', // downstream analytics sink
    'object-storage',      // ephemeral buffer buckets
    'backup-snapshots',    // encrypted DB snapshots (via hash compare)
  ]
 
  const failures: string[] = []
  for (const store of durableStores) {
    const hits = scan(store)
    if (hits.length > 0) {
      failures.push(`FAIL: canary found in ${store}\n${hits}`)
    }
  }
 
  if (failures.length > 0) {
    console.error(failures.join('\n\n'))
    process.exit(1)
  }
  console.log(`PASS: canary "${CANARY}" not persisted in any durable store`)
}
 
main().catch(err => {
  console.error(err)
  process.exit(1)
})

The subtlety worth calling out: the scan must ignore the metadata records that are supposed to exist. A log entry showing POST /unified/crm/contacts returned 400 in 41ms for integrated_account acc_12345 is compliant. A log entry containing first_name: zdr-canary-abc123 is a fail. Design the scan to match on payload field values, not on metadata identifiers like correlation_id.
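A value-based matcher handles this naturally. Metadata-only records for the probe request never contain the canary string, so they produce no hits, while any leaked payload field is reported with its path. The sketch assumes each durable-store adapter can return records as parsed JSON; canaryHits is a hypothetical name.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Return the path of every string value that contains the canary.
function canaryHits(node: Json, canary: string, path = ''): string[] {
  if (typeof node === 'string') return node.includes(canary) ? [path || '(root)'] : []
  if (Array.isArray(node)) {
    return node.flatMap((item, i) => canaryHits(item, canary, `${path}[${i}]`))
  }
  if (node !== null && typeof node === 'object') {
    return Object.entries(node).flatMap(([key, value]) =>
      canaryHits(value, canary, path ? `${path}.${key}` : key),
    )
  }
  return [] // numbers, booleans, null
}
```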

CI/QA Retention Test Checklist

Use this as the acceptance criteria for the test suite:

Pair this with the daily retention enforcement script from Step 3 (which verifies that data older than the retention window has actually been deleted) and you have two independent, machine-checkable proofs: nothing sensitive gets in, and nothing sensitive lingers past its window.
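The Step 3 script is not reproduced here. As a hypothetical illustration of the kind of check it performs, the sketch below flags any stored record that has outlived its store's retention window. The StoredRecord shape and the window lengths (30 and 365 days) are placeholders, not recommendations. Treating a store with no configured window as a violation follows the zero-data-retention-by-default stance: undocumented persistence is itself a finding.

```typescript
// Sketch: list records that should already have been deleted.
interface StoredRecord {
  id: string
  store: string
  createdAt: Date
}

// Placeholder per-store retention windows in days; use your documented policy.
const RETENTION_DAYS: Record<string, number> = {
  'application-logs': 30,
  'boundary-audit-logs': 365,
}

const DAY_MS = 24 * 60 * 60 * 1000

// Returns every record older than its store's window. Stores without a
// configured window are flagged too, since nothing should persist there.
function findExpiredRecords(records: StoredRecord[], now: Date = new Date()): StoredRecord[] {
  return records.filter(record => {
    const days: number | undefined = RETENTION_DAYS[record.store]
    if (days === undefined) return true
    return now.getTime() - record.createdAt.getTime() > days * DAY_MS
  })
}

// Example run against a fixed clock so the result is deterministic.
const now = new Date('2024-06-30T00:00:00Z')
const sample: StoredRecord[] = [
  { id: 'log-recent', store: 'application-logs', createdAt: new Date('2024-06-20T00:00:00Z') },
  { id: 'log-stale', store: 'application-logs', createdAt: new Date('2024-05-01T00:00:00Z') },
  { id: 'orphan', store: 'unlisted-store', createdAt: new Date('2024-06-29T00:00:00Z') },
]
const expired = findExpiredRecords(sample, now)
console.log(expired.map(r => r.id)) // [ 'log-stale', 'orphan' ]
```

In CI, a non-empty result would fail the job, the same way the canary script exits non-zero when it finds a hit.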

Standardizing Your Integration Posture

Enterprise security audits are designed to expose architectural inconsistencies. If your HubSpot integration uses a modern OAuth flow with strict boundary logging while your legacy NetSuite integration relies on long-lived credentials stored in a database column without concurrency controls, expect to fail the vendor risk assessment.

The audit runbook is not a one-time document. It's a living artifact that gets updated every time you add an integration, every time a vendor changes their API, and every time an incident exposes a gap. Treat it like your SOC 2 evidence pack: kept current, version-controlled, and reviewable on demand.

The practical pattern that holds up under enterprise scrutiny:

One integration architecture for all connectors. Same auth handling, same retention policy, same logging surface, same error normalization. Per-integration snowflakes are a common source of audit findings.
Zero data retention by default. Persist only when there's an explicit, documented reason.
Proactive token refresh with per-account mutex locks. No race conditions, no thundering herds.
Boundary logging with redaction. Raw enough to debug, sanitized enough to retain.
Webhook hardening with signatures and replay protection. Every inbound webhook verified, every outbound delivery signed and timestamped.
Pass-through error semantics. 429s reach the caller. Retry decisions live with whoever owns the business logic.
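The token refresh item can be illustrated with an in-process single-flight sketch: concurrent callers for the same integrated account share one pending refresh instead of each calling the provider. TokenRefresher, RefreshFn, and the five-minute refresh skew are hypothetical names and values. A Map only serializes refreshes within one process; with several workers you would need a distributed lock (for example a Redis SET with the NX and PX options) wrapped around the same logic.

```typescript
// Sketch: proactive, single-flight token refresh keyed by integrated account.
interface TokenSet {
  accessToken: string
  expiresAt: number // epoch milliseconds
}

// Hypothetical provider call; in practice this runs the OAuth refresh_token grant.
type RefreshFn = (accountId: string) => Promise<TokenSet>

// Refresh this long before expiry so callers rarely hold an expired token.
const REFRESH_SKEW_MS = 5 * 60 * 1000

class TokenRefresher {
  private readonly tokens = new Map<string, TokenSet>()
  private readonly inFlight = new Map<string, Promise<TokenSet>>()
  private readonly refresh: RefreshFn
  private readonly now: () => number

  constructor(refresh: RefreshFn, now: () => number = Date.now) {
    this.refresh = refresh
    this.now = now
  }

  // Returns a valid token, starting at most one refresh per account at a time.
  getToken(accountId: string): Promise<TokenSet> {
    const cached = this.tokens.get(accountId)
    if (cached && cached.expiresAt - this.now() > REFRESH_SKEW_MS) {
      return Promise.resolve(cached)
    }
    const pending = this.inFlight.get(accountId)
    if (pending) return pending // join the refresh that is already running

    const started = this.refresh(accountId)
      .then(tokens => {
        this.tokens.set(accountId, tokens)
        return tokens
      })
      .finally(() => this.inFlight.delete(accountId)) // release the "lock" on success or failure
    this.inFlight.set(accountId, started)
    return started
  }
}

// Two concurrent requests for the same account trigger a single provider call.
let providerCalls = 0
const refresher = new TokenRefresher(async accountId => {
  providerCalls++
  return { accessToken: `token-for-${accountId}`, expiresAt: Date.now() + 60 * 60 * 1000 }
})
const first = refresher.getToken('acc_12345')
const second = refresher.getToken('acc_12345')
console.log(first === second, providerCalls) // true 1
```

Keying the in-flight map by account means one slow or failing provider does not block refreshes for other customers.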

Stop rebuilding auth flows, rate limit normalizers, and logging middleware for every new API. By enforcing zero data retention, implementing distributed mutex locks for token refreshes, hardening webhook endpoints with signatures and replay protection, standardizing your boundary logging, and normalizing upstream rate limits, you remove the operational liabilities that most often kill enterprise deals in security review.
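For the webhook hardening piece, here is a hedged sketch of receiver-side verification, assuming the sender signs the string "timestamp.rawBody" with HMAC-SHA256 and sends the timestamp, a hex signature, and a unique event ID with each delivery. That signing scheme, the function names, and the in-memory dedupe set are assumptions, not any particular provider's format. The check runs in order: constant-time signature comparison, then a five-minute timestamp tolerance, then an idempotency check on the event ID so a replayed delivery is processed at most once. A production receiver would keep seen event IDs in a shared store with a TTL of at least twice the tolerance, since timestamps are accepted up to five minutes in either direction.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'
import { Buffer } from 'node:buffer'

// Reject deliveries whose signed timestamp is more than 5 minutes from now
// in either direction (old replays and far-future clocks).
const TOLERANCE_SECONDS = 5 * 60

// Hypothetical in-memory dedupe; production would use a shared store with a TTL.
const seenEventIds = new Set<string>()

// Assumed signing scheme: hex HMAC-SHA256 over "<timestamp>.<raw body>".
function signPayload(secret: string, timestamp: number, rawBody: string): string {
  return createHmac('sha256', secret).update(`${timestamp}.${rawBody}`).digest('hex')
}

type VerifyResult = 'ok' | 'bad_signature' | 'stale' | 'duplicate'

function verifyWebhook(
  secret: string,
  rawBody: string,
  timestamp: number, // seconds since epoch, covered by the signature
  signature: string, // hex digest sent by the provider
  eventId: string, // unique event ID used as the idempotency key
  nowSeconds: number = Math.floor(Date.now() / 1000),
): VerifyResult {
  const expected = Buffer.from(signPayload(secret, timestamp, rawBody), 'hex')
  const received = Buffer.from(signature, 'hex')
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    return 'bad_signature'
  }
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return 'stale'
  if (seenEventIds.has(eventId)) return 'duplicate'
  seenEventIds.add(eventId)
  return 'ok'
}

const secret = 'whsec_example' // placeholder secret
const body = '{"id":"evt_1","type":"contact.created"}'
const ts = 1_700_000_000
const sig = signPayload(secret, ts, body)

console.log(verifyWebhook(secret, body, ts, sig, 'evt_1', ts + 10)) // ok
console.log(verifyWebhook(secret, body, ts, sig, 'evt_1', ts + 20)) // duplicate
console.log(verifyWebhook(secret, body, ts, sig, 'evt_1', ts + 301)) // stale
console.log(verifyWebhook(secret, body + ' ', ts, sig, 'evt_1', ts)) // bad_signature
```

Verifying against the raw request body, before any JSON parsing, matters: re-serializing a parsed object can change whitespace or key order and break the signature.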
