How to Create an Operational Runbook & Monitoring Playbook for SaaS APIs
Build an operational runbook with SLO-to-SLA mapping, error budget policies, incident severity flows, and procurement-ready uptime reporting to guarantee 99.99% uptime for third-party integrations.
You shipped the integration. The sales team celebrated the launch. The enterprise prospect signed the contract. Now it is Tuesday morning, a critical OAuth token just dropped, an undocumented upstream API change is failing silently, and your core engineering team is preparing to burn an entire sprint debugging third-party webhook payloads.
If you have launched a handful of third-party integrations and your on-call rotation is now drowning in OAuth token failures, silent webhook drops, and HTTP 429 errors at 2 AM, you do not need another integration. You need a written operational runbook and a monitoring playbook that turns chaotic firefighting into a repeatable, measurable process.
This guide provides the exact operational framework required to standardize integration maintenance, normalize upstream errors, and monitor API health without draining your core engineering capacity.
Why You Need an Operational Runbook and Monitoring Playbook
Short answer: Because the cost of a single hour of API downtime now exceeds the cost of writing the runbook by two orders of magnitude, and third-party API reliability is actively worsening.
Unplanned API downtime is an incredibly expensive operational failure. ITIC's research found that the average cost of a single hour of downtime now exceeds $300,000 for over 90% of mid-size and large enterprises, exclusive of litigation, civil or criminal penalties. For 41% of those companies, hourly losses fall between $1 million and $5 million.
It is not just expensive—it is getting worse. Between Q1 2024 and Q1 2025, average API uptime fell from 99.66% to 99.46%, resulting in 60% more downtime year-over-year. A 0.1% drop in uptime translates to approximately 10 extra minutes of downtime per week and close to 9 hours across a year. APIs went from around 34 minutes of weekly downtime in Q1 2024 to 55 minutes in Q1 2025.
The driver behind the regression is complexity. As industries lean harder on microservices and third-party integrations, modern APIs have become distributed and interdependent, which means more points of failure beyond your control. If you are a B2B SaaS company with 30+ connectors, you have inherited the failure modes of every vendor in your portfolio. This is a primary reason why SaaS integrations break after launch.
For financial and heavily regulated SaaS platforms, the stakes are even higher. Financial services organizations face downtime costs of $152 million each year according to research from Splunk and Oxford Economics, with companies losing approximately $37 million annually from direct revenue impacts when systems go offline.
A runbook is not a static document you write once and forget. It is the operational contract between your product, your engineering team, and your customers. For broader context on where this fits in the product lifecycle, review our SaaS Product Manager's Integration Rollout Playbook.
The Anatomy of a SaaS Integration Operational Runbook
A production-grade runbook is built around state machines, not free-form prose. Treating a third-party connection as a simple boolean—either connected or disconnected—is an architectural mistake that leads to silent failures and frustrated customers.
Every integrated account a customer connects must exist in exactly one of a small, well-defined set of states, and every state transition must be observable, logged, and recoverable. When you standardize these states, your monitoring tools, customer success dashboards, and automated recovery pipelines all speak the same language.
Here are the five states every integration runbook should standardize:
| State | Meaning | Customer-Facing Action |
|---|---|---|
| connecting | OAuth callback succeeded, post-install actions currently running. | Show loading spinner |
| active | Integration is fully operational. Credentials are valid, and API calls succeed. | Hidden / Green indicator |
| needs_reauth | Refresh token failed or access revoked. Customer must manually intervene. | Show red re-authorize banner |
| validation_error | Credentials accepted, but the initial validation API call failed. | Show specific error message |
| post_install_error | Credentials valid, but a required webhook setup or backfill failed. | Show retry CTA |
The connection flow should be highly deterministic. A customer initiates the OAuth redirect or submits an API key form, your platform persists the credentials, and if validation or post-install actions exist (such as registering webhooks or fetching the customer's workspace ID), the account sits in connecting until they pass.
If they fail, you route to validation_error or post_install_error and fire a webhook to your own product so customer success knows immediately.
```mermaid
stateDiagram-v2
    [*] --> connecting: OAuth callback<br>or API key submitted
    connecting --> active: Post-install<br>actions pass
    connecting --> validation_error: Validation fails
    connecting --> post_install_error: Webhook setup<br>or backfill fails
    active --> needs_reauth: Refresh token<br>rejected
    needs_reauth --> active: API call succeeds<br>after re-auth
    validation_error --> active: Customer fixes<br>and retries
    post_install_error --> active: Retry succeeds
```

By tracking these explicit states, your customer success team can filter for accounts in the needs_reauth state and proactively email customers before they file a support ticket complaining about missing data. This proactive approach is one of the most effective ways to reduce customer churn caused by broken integrations.
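The transitions in the diagram can also be enforced in code, so an illegal state change fails loudly instead of silently corrupting account status. The following is a minimal sketch, not a reference implementation; the names `AccountState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical and chosen for illustration.

```typescript
// Hypothetical names for illustration: AccountState, ALLOWED_TRANSITIONS, transition.
type AccountState =
  | "connecting"
  | "active"
  | "needs_reauth"
  | "validation_error"
  | "post_install_error";

// Allowed transitions, copied from the state diagram.
const ALLOWED_TRANSITIONS: Record<AccountState, AccountState[]> = {
  connecting: ["active", "validation_error", "post_install_error"],
  active: ["needs_reauth"],
  needs_reauth: ["active"],
  validation_error: ["active"],
  post_install_error: ["active"],
};

interface TransitionLog {
  from: AccountState;
  to: AccountState;
  reason: string;
  at: string; // ISO 8601 timestamp
}

// Returns a log entry for a legal transition and throws on an illegal one,
// so every state change is validated and leaves an observable record.
function transition(
  from: AccountState,
  to: AccountState,
  reason: string,
  now: Date = new Date()
): TransitionLog {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return { from, to, reason, at: now.toISOString() };
}
```

Persisting the returned log entry alongside the new state is one way to keep every transition observable and recoverable.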
A second operational pillar is standardized error handling. Every upstream API error—rate limit, auth failure, schema mismatch, server error—should map to a small, normalized error taxonomy before it ever reaches your application code. If your code branches on error.code === 'INVALID_GRANT' for Salesforce but error.error === 'invalid_token' for HubSpot, you have already lost the maintenance battle.
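A normalization layer of this kind might look like the sketch below. The taxonomy, the `UpstreamFailure` shape, and the decision to treat 400 and 422 responses as schema mismatches are all assumptions made for illustration; real vendors differ, and the mapping would normally live in per-connector configuration.

```typescript
// Hypothetical taxonomy and payload shape, for illustration only.
type NormalizedError =
  | "auth_failed"
  | "rate_limited"
  | "schema_mismatch"
  | "upstream_error"
  | "unknown";

interface UpstreamFailure {
  status: number;                // HTTP status returned by the vendor
  body: Record<string, unknown>; // parsed JSON error body, vendor-specific
}

// Vendor-specific codes that all mean "credentials are no longer valid".
const AUTH_CODES = ["INVALID_GRANT", "invalid_grant", "invalid_token"];

function normalizeError(failure: UpstreamFailure): NormalizedError {
  const { status, body } = failure;
  // Vendors put the machine-readable code under different keys.
  const code = String(body["code"] ?? body["error"] ?? "");

  // Auth is checked first because some vendors return 400 for an expired grant.
  if (status === 401 || AUTH_CODES.indexOf(code) !== -1) return "auth_failed";
  if (status === 429) return "rate_limited";
  if (status === 400 || status === 422) return "schema_mismatch"; // assumption
  if (status >= 500) return "upstream_error";
  return "unknown";
}
```

Application code then branches on `NormalizedError` only, and vendor-specific codes never leave the integration layer.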
Building Your API Monitoring Playbook
Most engineering teams monitor their own API endpoints obsessively but treat third-party dependencies as a black box. Your monitoring playbook must extend beyond your own infrastructure to track the real-time health of the SaaS platforms you integrate with.
Short answer: Monitor the gap between what should happen and what does happen. Forget vanity dashboards. Your monitoring playbook should track exactly six categories of metrics:
- Authentication Health: Count of accounts in needs_reauth per integration. If HubSpot accounts in needs_reauth spike from 2 to 40 in an hour, HubSpot rotated something or your refresh logic broke.
- OAuth Token Expiry Drift: Monitor the delta between your database's expires_at timestamp and the actual validity of the token. Upstream providers occasionally revoke tokens before their stated expiration due to security events. Tracking this drift helps identify undocumented provider behavior.
- Webhook Ingestion Lag: Measure the time between a third-party event timestamp and your processing timestamp. A P95 latency above 30 seconds means your queue is falling behind.
- Outbound Webhook Delivery Rate: Your unified account.updated webhook should hit customer endpoints with >99.5% success. Anything less indicates customer-side issues or misconfigured retry logic.
- Normalized Error Rate Per Endpoint: Salesforce /contacts returning 5% 500s is a Salesforce problem. Your /crm/contacts returning 5% 500s when only Pipedrive accounts are affected is a you problem.
- Per-Tenant API Success Rate: Aggregate metrics hide the one enterprise customer whose integration has been broken for 72 hours.
Pro Tip: Alert on derivatives, not absolutes. A 0.5% error rate on an upstream API is normal. A jump from 0.5% to 2% in 15 minutes is an incident. Set thresholds based on the rate of change, not static values.
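As a rough illustration of alerting on rate of change, the sketch below compares the newest error-rate sample with the oldest sample inside a lookback window. The function name, the sample shape, and the default thresholds (15 minutes, one percentage point) are assumptions, and samples are assumed to be sorted oldest first.

```typescript
interface ErrorRateSample {
  minute: number;    // minutes since an arbitrary epoch
  errorRate: number; // fraction, e.g. 0.005 for 0.5%
}

// Fires when the error rate has risen by at least `minIncrease` (absolute,
// as a fraction) between the oldest and newest sample in the lookback window.
// Assumes `samples` is sorted by `minute`, oldest first.
function isRateOfChangeAlert(
  samples: ErrorRateSample[],
  windowMinutes = 15,
  minIncrease = 0.01
): boolean {
  if (samples.length < 2) return false;
  const latest = samples[samples.length - 1];
  const inWindow = samples.filter(s => latest.minute - s.minute <= windowMinutes);
  const oldest = inWindow[0];
  return latest.errorRate - oldest.errorRate >= minIncrease;
}
```

With these defaults, a steady 0.5% error rate never alerts, while a climb from 0.5% to 2% inside 15 minutes does.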
Postman's 2025 State of the API Report found that 60% of teams version their APIs and 57% use Git repositories, but only 26% use semantic versioning, meaning most teams track changes without communicating the impact of those changes effectively. Translation: your upstream vendors are shipping breaking changes without telling you. Your monitoring needs to catch schema drift, not just HTTP errors.
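One simple way to catch schema drift is to compare the fields your mapping expects with the fields a vendor actually returns. The sketch below checks top-level keys only; `detectSchemaDrift` and `DriftReport` are hypothetical names, and a production check would likely also cover nested fields and types.

```typescript
interface DriftReport {
  missing: string[];    // fields the mapping expects but the payload lacks
  unexpected: string[]; // fields in the payload the mapping does not know
}

// Compares the top-level keys of a vendor payload against the expected field list.
function detectSchemaDrift(
  expectedFields: string[],
  payload: Record<string, unknown>
): DriftReport {
  const actual = Object.keys(payload);
  return {
    missing: expectedFields.filter(f => actual.indexOf(f) === -1),
    unexpected: actual.filter(f => expectedFields.indexOf(f) === -1),
  };
}
```

A vendor renaming `first_name` to `firstName` returns HTTP 200 with a valid body, so only a check like this surfaces it: the old name lands in `missing` and the new one in `unexpected`.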
How to Handle Upstream API Rate Limits (HTTP 429)
One of the most persistent myths in the integration ecosystem is that a third-party platform can "absorb" or "handle" all rate limits for you. Engineers assume their unified API or integration platform will queue, throttle, and retry on their behalf.
That assumption is dangerous. It hides the fact that the upstream API is the bottleneck, and silently retrying can amplify a rate limit storm. It is architecturally impossible to absorb limits without introducing massive, unpredictable latency into your data pipeline.
The correct architecture, aligned with the IETF rate limit headers specification, is:
- The integration platform calls the upstream vendor.
- If the vendor returns HTTP 429, the platform passes that status to the caller.
- The platform normalizes upstream rate limit information into standardized headers: ratelimit-limit, ratelimit-remaining, and ratelimit-reset.
- The caller reads those headers and implements exponential backoff with jitter.
Here is a practical example of a caller-side retry helper that respects these normalized headers:
```typescript
async function fetchWithBackoff(url: string, options: RequestInit, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);

    // Anything other than a 429 (success or a non-retriable error) goes straight back to the caller
    if (response.status !== 429) {
      return response;
    }

    if (attempt === maxRetries) {
      throw new Error(`Failed after ${maxRetries} retries due to rate limits.`);
    }

    // Read the normalized rate limit reset header. The IETF draft defines this value
    // as the number of seconds until the window resets, not a Unix timestamp.
    const resetHeader = response.headers.get('ratelimit-reset');
    const resetSeconds = resetHeader === null ? NaN : parseInt(resetHeader, 10);

    // Add jitter to prevent thundering herd problems
    const jitter = Math.random() * 250;
    let waitTimeMs: number;

    if (Number.isFinite(resetSeconds) && resetSeconds >= 0) {
      // Wait until the reset, plus a 1-second buffer
      waitTimeMs = (resetSeconds + 1) * 1000 + jitter;
    } else {
      // Fall back to exponential backoff with jitter if the header is missing or malformed
      waitTimeMs = Math.pow(2, attempt) * 1000 + jitter;
    }

    console.warn(`Rate limit hit. Waiting ${Math.round(waitTimeMs)}ms before retry ${attempt + 1}...`);
    await new Promise(resolve => setTimeout(resolve, waitTimeMs));
  }
  throw new Error('Unreachable');
}
```

This is the model Truto uses. Truto does not retry, throttle, or apply backoff on rate limit errors. When an upstream API returns HTTP 429, Truto passes that error to the caller with the normalized headers. The trade-off is radical honesty: you know exactly when you are being rate-limited, and you control how aggressive your retries are. For more detailed patterns, read our guide on handling API rate limits and retries.
Automating Credential Refresh and Reactivation
OAuth token expiration is the leading cause of silent integration failures. If your runbook dictates that you wait for a token to expire, make an API call, receive an HTTP 401, and then attempt a refresh, you are introducing unnecessary latency and error surface area into your production traffic.
Reactive refresh—waiting until you get a 401 and then refreshing—is the most common reason integrations fail at 3 AM. An end-user-triggered API call hits the expired token, and now you have a customer-visible failure where you should have had a silent background refresh.
The Proactive Refresh Flow
An enterprise-grade platform schedules work to refresh tokens proactively. The production pattern:
- Before every API call, check if the access token is within a small buffer of expiry (e.g., 30 seconds). If yes, refresh first.
- Schedule a proactive refresh independently of API call traffic. A background scheduler fires 60-180 seconds before the token's expires_at, refreshes, and writes the new token before any user action triggers it.
- On refresh failure, the account status transitions immediately from active to needs_reauth. The platform fires an integrated_account:needs_reauth webhook to your product, and you surface a re-authorize banner to the customer.
- On the first successful API call after re-auth, automatically transition back to active and fire integrated_account:reactivated. No manual ops involvement.
```mermaid
sequenceDiagram
    participant Sched as Token Scheduler
    participant Vault as Credential Store
    participant Vendor as Upstream OAuth
    participant App as Your Product
    Sched->>Vault: Token expires in 90s?
    Vault-->>Sched: Yes
    Sched->>Vendor: POST /token (refresh_token)
    alt Refresh succeeds
        Vendor-->>Sched: new access_token
        Sched->>Vault: Store new token + expires_at
    else Refresh fails
        Vendor-->>Sched: 400 invalid_grant
        Sched->>Vault: Mark needs_reauth
        Sched->>App: Webhook: needs_reauth
    end
```

This self-healing architecture drastically reduces operational burden. For specific failure modes, our deep-dive on handling OAuth token refresh failures covers the edge cases like refresh token rotation, single-use tokens, and revocation cascades.
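The timing decisions in the proactive refresh flow reduce to a few small functions. The sketch below is illustrative only: `StoredToken`, the function names, and the default 30-second buffer and 90-second lead time are assumptions picked from the ranges mentioned in the flow, and the actual call to the vendor's token endpoint is left out.

```typescript
interface StoredToken {
  accessToken: string;
  expiresAt: number; // epoch milliseconds
}

// Request-path guard: refresh first if the token is within `bufferMs` of expiry.
function shouldRefreshBeforeCall(token: StoredToken, nowMs: number, bufferMs = 30000): boolean {
  return token.expiresAt - nowMs <= bufferMs;
}

// Background scheduler: the time at which the proactive refresh should fire.
// The 90s default sits inside the 60-180 second range described in the flow.
function nextProactiveRefreshAt(token: StoredToken, leadMs = 90000): number {
  return token.expiresAt - leadMs;
}

type RefreshOutcome = { ok: true } | { ok: false; error: string };

// Account status after a refresh attempt: failure moves the account to needs_reauth.
function stateAfterRefresh(outcome: RefreshOutcome): "active" | "needs_reauth" {
  return outcome.ok ? "active" : "needs_reauth";
}
```

Keeping these decisions as pure functions makes the refresh policy easy to unit test without a live OAuth provider.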
Standardizing Third-Party Webhook Ingestion
Polling third-party APIs for updates is a fast track to exhausting your rate limits. Webhooks are the preferred method for real-time data synchronization, but they introduce massive operational complexity. Webhooks are where most integration platforms quietly drop data.
When a third-party SaaS platform experiences a traffic spike, it will flood your webhook ingestion endpoints. If your server drops the payload, that data is often lost forever, as many legacy APIs do not offer reliable webhook replay mechanisms. As outlined in our guide to redundancy and failover patterns, the runbook must cover four ingestion pillars:
- Signature Verification on Ingestion: Every inbound webhook must validate against the vendor's cryptographic signing secret (HMAC, RSA, etc.) before being processed. Reject and log invalid signatures—do not return a 200 OK for them.
- Buffer Before Processing: Your edge endpoint must accept the payload, persist the raw data, and return an HTTP 200 OK immediately (under a second). Do not perform database lookups or heavy transformations synchronously. Push the verified payload into an asynchronous queue.
- Idempotency by Event ID: Vendors retry payloads. You will receive the exact same event multiple times. Deduplicate by the vendor's event ID, not by a payload hash.
- Map to Unified Events: A background worker pulls the payload, identifies the customer, and maps the provider-specific event (e.g., a HubSpot contact.propertyChange and a Salesforce Contact.updated) to a unified event model (e.g., crm.contact.updated) with the exact same shape.
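The idempotency and mapping pillars can be sketched together as follows. This is an illustration under stated assumptions: `InboundEvent`, `EVENT_MAP`, and `WebhookDeduper` are hypothetical names, and the in-memory `Set` stands in for what would normally be a durable store with a retention window.

```typescript
interface InboundEvent {
  vendor: "hubspot" | "salesforce";
  eventId: string;   // the vendor's own event ID
  eventType: string; // provider-specific event name
  recordId: string;
}

interface UnifiedEvent {
  type: string;
  vendor: string;
  recordId: string;
}

// Hypothetical mapping table from provider-specific events to the unified model.
const EVENT_MAP: Record<string, string> = {
  "hubspot:contact.propertyChange": "crm.contact.updated",
  "salesforce:Contact.updated": "crm.contact.updated",
};

class WebhookDeduper {
  // In-memory for the sketch; assume a durable store in production.
  private seen = new Set<string>();

  // Returns the unified event, or null for a duplicate or an unmapped event type.
  process(event: InboundEvent): UnifiedEvent | null {
    const key = `${event.vendor}:${event.eventId}`;
    if (this.seen.has(key)) return null; // vendor retry, already handled
    this.seen.add(key);

    const type = EVENT_MAP[`${event.vendor}:${event.eventType}`];
    if (type === undefined) return null;
    return { type, vendor: event.vendor, recordId: event.recordId };
  }
}
```

Keying on the vendor's event ID rather than a payload hash means a retried delivery with a slightly different envelope (a new delivery timestamp, for example) is still recognized as a duplicate.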
Webhook health is invisible until it isn't. A vendor disabling your webhook subscription due to too many 5xx responses can look identical to "the integration is working" from your dashboard. Track inbound webhook count per integration per hour. A flat-line is an outage.
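A flat-line check can be very small. The sketch below flags an integration whose inbound webhook count has been zero for the last few hours after a non-trivial baseline; the function name and the defaults (three quiet hours, a baseline of at least one event per hour) are assumptions to tune per integration.

```typescript
// hourlyCounts: inbound webhook counts per hour for one integration, oldest first.
function isFlatLine(hourlyCounts: number[], quietHours = 3, minBaseline = 1): boolean {
  if (hourlyCounts.length <= quietHours) return false;

  const recent = hourlyCounts.slice(-quietHours);
  const earlier = hourlyCounts.slice(0, -quietHours);
  const baseline = earlier.reduce((sum, c) => sum + c, 0) / earlier.length;

  // Only alert if the integration normally receives traffic and has now gone silent.
  return baseline >= minBaseline && recent.every(c => c === 0);
}
```

The baseline guard keeps low-traffic integrations, which are legitimately quiet for hours at a time, from paging anyone.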
Zero Integration-Specific Code: The Ultimate Maintenance Strategy
Here is the uncomfortable truth: the most effective way to maintain an operational runbook is to drastically reduce the amount of custom code you actually have to monitor. If your runbook has separate playbooks for "how to debug Salesforce" and "how to debug HubSpot," you have built a maintenance liability that grows linearly with every new connector.
The architectural alternative is to treat integrations as data, not code. Abstracting integrations into data-only operations is the defining characteristic of a modern integration strategy.
Every connector should be a configuration: auth flow, base URL, endpoint definitions, field mappings, pagination strategy, and webhook signature scheme. The execution engine is generic. There is no Salesforce-specific module, no HubSpot-specific module. There is one pipeline that reads config and calls APIs.
This is the model Truto uses internally. Adding a new integration means writing a JSON manifest, not deploying new code. The operational consequences are massive:
- Bug fixes apply to all integrations. Fix the pagination engine once, and every paginated endpoint benefits.
- Runbook entries are generic. "Refresh token failure" is one standardized playbook, not 80 different vendor-specific procedures.
- No deploy required to add or fix a connector. Configuration changes ship at runtime.
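To make "integrations as data" concrete, the sketch below shows a generic request builder driven entirely by a connector configuration. The `ConnectorConfig` shape, the `{{placeholder}}` syntax, and the `api.example.com` URL are invented for illustration and are not Truto's actual manifest format.

```typescript
// Hypothetical config shape for illustration; not an actual vendor manifest format.
interface ConnectorConfig {
  baseUrl: string;
  authHeader: string; // template, e.g. "Bearer {{access_token}}"
  endpoints: Record<string, { method: "GET" | "POST"; path: string }>;
}

interface BuiltRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
}

// Replaces {{name}} placeholders with values from `vars`; throws on a missing value.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    const value = vars[name];
    if (value === undefined) throw new Error(`Missing template variable: ${name}`);
    return value;
  });
}

// One generic builder serves every connector; only the config differs.
function buildRequest(
  config: ConnectorConfig,
  endpoint: string,
  vars: Record<string, string>
): BuiltRequest {
  const def = config.endpoints[endpoint];
  if (def === undefined) throw new Error(`Unknown endpoint: ${endpoint}`);
  return {
    method: def.method,
    url: config.baseUrl + fillTemplate(def.path, vars),
    headers: { Authorization: fillTemplate(config.authHeader, vars) },
  };
}

const exampleConnector: ConnectorConfig = {
  baseUrl: "https://api.example.com",
  authHeader: "Bearer {{access_token}}",
  endpoints: { getContact: { method: "GET", path: "/contacts/{{id}}" } },
};
```

Adding a second vendor in this model means adding a second `ConnectorConfig` object; `buildRequest` itself does not change.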
Instead of writing custom logic to handle Linear's GraphQL pagination versus Salesforce's SOQL offsets, you interact with a unified API layer. Truto's proxy API allows developers to expose complex GraphQL-backed integrations as RESTful CRUD resources using placeholder-driven request building. You define the mapping configuration once, and the platform handles the execution, normalization, and credential injection automatically.
For a deeper dive into this architectural approach, read our analysis on shipping API connectors as data-only operations.
Guaranteeing 99.99% Uptime: SLO-to-SLA Mapping for Third-Party Integrations
Here is the uncomfortable math. 99.99% uptime allows for 52 minutes and 35 seconds of downtime per year, or about 4 minutes and 23 seconds per month. A single bad deployment or a cascading upstream token revocation can eat that entire monthly budget in one event. Each additional "nine" cuts allowed downtime by a factor of 10. Going from 99.9% to 99.99% means going from roughly 43 minutes per month to roughly 4.3 minutes.
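The downtime figures above follow from one line of arithmetic. A small helper (the name `allowedDowntimeMinutes` is ours) makes it easy to rerun the numbers for any target and window:

```typescript
// Allowed downtime in minutes for a given uptime target over a window of `windowDays`.
function allowedDowntimeMinutes(uptimePercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - uptimePercent / 100);
}

// 99.99% over an average year (365.25 days) is about 52.6 minutes.
const yearlyBudget = allowedDowntimeMinutes(99.99, 365.25);

// 99.99% over a 30-day window is about 4.3 minutes.
const thirtyDayBudget = allowedDowntimeMinutes(99.99, 30);
```

Dropping one nine to 99.9% multiplies both results by ten, which is the gap between a budget you can spend on a rollback and one you cannot.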
An SLO is an internal performance goal that engineering teams use to measure service health over a period of time. It is typically more stringent than the SLA so you have a buffer before contractual penalties kick in. Set your SLO tighter than your public SLA. If your SLA is 99.9%, your SLO should be 99.95% or higher. For a 99.99% SLA, target an internal SLO of 99.995%.
Here is how SLIs, SLOs, and SLAs relate for an integration platform:
| Layer | Definition | Integration Platform Example |
|---|---|---|
| SLI (Service Level Indicator) | The raw metric you measure | % of API proxy calls returning non-5xx in < 500ms |
| SLO (Service Level Objective) | Your internal reliability target | 99.995% availability over a rolling 30-day window |
| SLA (Service Level Agreement) | The contractual commitment to customers | 99.99% monthly uptime, with credits for breaches |
An SLI is a quantitative measure of performance (like success rate or latency) that serves as the "ground truth" for SLOs and SLAs. For an integration platform specifically, your SLIs must distinguish between failures you caused and failures the upstream provider caused. A Salesforce 500 error passed through your proxy is not your downtime - unless your proxy added latency, mangled the request, or failed to route it correctly.
Choosing Your Measurement Window
SLAs are measured over a specific window, and that window matters more than most people realize. A monthly calendar window (the first through the last day of the month) means your error budget resets each month. A yearly 99.9% SLA gives you 8.77 hours of total downtime spread across 12 months. A monthly 99.9% SLA gives you only 43.83 minutes per month - but that resets every cycle.
A rolling 30-day window continuously recalculates, so a single bad incident can affect your compliance measurement for a full month. For integration platforms, rolling windows are the more honest choice. They prevent the gaming behavior where a team burns their budget in the first week and then freezes all changes for the remaining three weeks.
For each integration, your SLIs should track at minimum:
- Availability SLI: Percentage of API proxy calls that return a non-5xx response (excluding upstream 5xx passed through transparently)
- Latency SLI: P95 response time for API proxy calls, excluding upstream response time
- Data freshness SLI: Percentage of webhook events processed within your target window (e.g., under 60 seconds from ingestion to delivery)
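The availability SLI is the one most likely to be computed inconsistently, so here is a sketch of one way to do it. The `ProxyCall` shape is an assumption: it presumes your proxy logs both the status it returned and the status the vendor returned. Upstream 5xx responses passed through transparently are removed from both the numerator and the denominator.

```typescript
interface ProxyCall {
  status: number;                // status the platform returned to the caller
  upstreamStatus: number | null; // status the vendor returned; null if the vendor was never reached
}

// A 5xx counts against the platform unless the vendor itself returned a 5xx
// that was passed through transparently.
function isUpstreamPassthrough(call: ProxyCall): boolean {
  return call.status >= 500 && call.upstreamStatus !== null && call.upstreamStatus >= 500;
}

// Fraction of counted calls that returned a non-5xx response.
function availabilitySli(calls: ProxyCall[]): number {
  const counted = calls.filter(c => !isUpstreamPassthrough(c));
  if (counted.length === 0) return 1;
  const good = counted.filter(c => c.status < 500).length;
  return good / counted.length;
}
```

Logging `upstreamStatus` on every call is also what later lets you prove, with timestamps, that a given failure originated at the vendor.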
Sample SLA Contract Language and SLA Credit Formula
Your SLA document is where engineering commitments become legal obligations. The language below is a starting template - have your legal team review and adapt it before including it in any customer contract.
Uptime commitment clause (template):
"Provider shall ensure that the Integration Platform maintains a Monthly Uptime Percentage of at least 99.99%, measured as the total number of minutes in the calendar month minus the number of minutes of Downtime, divided by the total number of minutes in the calendar month. Downtime excludes: (a) scheduled maintenance windows communicated 72 hours in advance, (b) failures of upstream third-party APIs outside Provider's control, (c) customer-caused errors including misconfigured credentials or revoked OAuth grants, and (d) force majeure events."
SLA credit formula (template):
| Monthly Uptime Percentage | Credit as % of Monthly Fee |
|---|---|
| 99.90% to < 99.99% | 10% |
| 99.00% to < 99.90% | 25% |
| Below 99.00% | 50% |
The credit calculation is straightforward: Credit = Monthly Fee × Credit Percentage. If a customer pays $5,000/month and your platform achieves 99.92% uptime (below the 99.99% SLA but above 99.90%), you owe a $500 credit. When you are writing SLAs with your own clients, understand that this is the standard model - credits are goodwill gestures tied to service fees, not full indemnification.
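Expressed as code, the credit schedule is a tier lookup followed by one multiplication. This sketch assumes that uptime of exactly 99.99% meets the commitment (no credit) and that each tier's lower bound is inclusive; your contract language should state those boundaries explicitly.

```typescript
// Credit tier as a percentage of the monthly fee.
// Assumption: exactly 99.99% meets the SLA; lower bounds of each tier are inclusive.
function slaCreditPercent(monthlyUptimePercent: number): number {
  if (monthlyUptimePercent >= 99.99) return 0;
  if (monthlyUptimePercent >= 99.9) return 10;
  if (monthlyUptimePercent >= 99.0) return 25;
  return 50;
}

// Credit = Monthly Fee x Credit Percentage
function slaCredit(monthlyFee: number, monthlyUptimePercent: number): number {
  return (monthlyFee * slaCreditPercent(monthlyUptimePercent)) / 100;
}

// The worked example from the text: $5,000/month at 99.92% uptime.
const exampleCredit = slaCredit(5000, 99.92); // 500
```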
Two things to note about the exclusion clauses. First, the upstream third-party exclusion is where most integration SLA disputes happen. You need to prove - with logs and timestamps - that the failure originated at the upstream provider, not in your platform. This is why normalized error tracking per vendor (from your monitoring playbook) is non-negotiable. Second, your SLA should define "Downtime" precisely. A common definition: any period of 5 or more consecutive minutes where more than 5% of customer API requests return server errors attributable to the platform.
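The "5 or more consecutive minutes above 5%" definition is mechanical enough to compute directly from per-minute error rates. The sketch below is one reading of that definition; the function name is hypothetical, and the input is assumed to contain only server errors already attributed to the platform.

```typescript
// perMinuteErrorRates: fraction of customer API requests that returned
// platform-attributable server errors, one entry per minute, in order.
function downtimeMinutes(
  perMinuteErrorRates: number[],
  threshold = 0.05,
  minConsecutive = 5
): number {
  let total = 0;
  let run = 0; // length of the current streak of minutes above the threshold

  for (const rate of perMinuteErrorRates) {
    if (rate > threshold) {
      run++;
    } else {
      if (run >= minConsecutive) total += run;
      run = 0;
    }
  }
  // Count a streak that is still open at the end of the series.
  if (run >= minConsecutive) total += run;
  return total;
}
```

Under this reading, a four-minute spike contributes nothing, while a six-minute spike contributes all six minutes.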
Error Budget Policy and Automated Actions
An error budget is a representation of the allowable amount of downtime a service can tolerate while still meeting its SLO. If your SLO is 99.99%, your error budget is 0.01% of total time in the measurement window. For a rolling 30-day window, that is approximately 4.3 minutes of downtime per month.
The error budget is not an abstract concept - it is an operational lever. Error budgets provide a framework for prioritizing reliability work over new feature development, ensuring that system stability is not compromised. When the budget is healthy, teams ship features. When it is burning fast, teams shift to reliability work.
Burn rate measures how fast you are consuming your error budget relative to a sustainable pace. A burn rate of 1.0 means you will exactly exhaust your budget by the window end. A burn rate of 10 means you are consuming budget ten times faster than sustainable.
Set up two tiers of burn rate alerts, following the multi-window strategy recommended in Google's SRE Workbook:
| Alert Tier | Burn Rate | Lookback Window | Action |
|---|---|---|---|
| Fast burn (P1) | > 14.4x | 1 hour | Page on-call immediately. At this rate, the monthly budget exhausts in roughly 2 days. |
| Slow burn (P2) | > 6x | 6 hours | Create a ticket and notify the team lead. Budget will exhaust in roughly 5 days if unchecked. |
A fast-burn alert warns you of a sudden, large change in consumption that, if uncorrected, will exhaust your error budget very soon. "At this rate, we'll burn through the whole month's error budget in two days!" A slow-burn alert warns you of a rate of consumption that, if not altered, exhausts your error budget before the end of the compliance period.
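Burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below applies the two thresholds from the table; the function names and the `BurnAlert` labels are assumptions, and it simplifies the multi-window approach to a single lookback window per tier.

```typescript
// Burn rate = observed error ratio / error ratio allowed by the SLO.
function burnRate(errorRatio: number, sloPercent: number): number {
  const allowedErrorRatio = 1 - sloPercent / 100;
  return errorRatio / allowedErrorRatio;
}

type BurnAlert = "fast_burn_p1" | "slow_burn_p2" | "none";

// Applies the two alert tiers: > 14.4x over 1 hour, then > 6x over 6 hours.
function classifyBurn(
  oneHourErrorRatio: number,
  sixHourErrorRatio: number,
  sloPercent: number
): BurnAlert {
  if (burnRate(oneHourErrorRatio, sloPercent) > 14.4) return "fast_burn_p1";
  if (burnRate(sixHourErrorRatio, sloPercent) > 6) return "slow_burn_p2";
  return "none";
}

// Days until the budget is gone if the current burn rate holds.
function daysToExhaustion(rate: number, windowDays = 30): number {
  return windowDays / rate;
}
```

`daysToExhaustion(14.4)` is roughly 2 days and `daysToExhaustion(6)` is 5 days, which is where the figures in the table come from.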
Automated policy actions based on budget consumption:
| Budget Remaining | Operational Mode | Actions |
|---|---|---|
| > 75% | Normal | Feature deployments proceed as planned. |
| 50% - 75% | Caution | Warn engineering leads. Review recent deployments for correlation with error rate changes. |
| 25% - 50% | Restricted | Freeze non-critical deployments. Prioritize reliability fixes. |
| < 25% | Critical | Full deployment freeze except for reliability hotfixes. Escalate to VP of Engineering. |
| Exhausted | SLO Breach | Trigger P1 incident flow. All hands on reliability restoration. |
More mature teams tie error budgets to automated policies - like deployment freezes, incident escalations, or capacity planning. The key is making these policies explicit and agreed upon by engineering, product, and leadership before the budget starts burning. If you are debating whether to freeze deploys during an active burn, you have already lost time.
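Making the policy explicit can be as simple as encoding the table in a function that your deploy pipeline consults. In this sketch the mode names are ours, and the handling of the shared boundaries (exactly 50% is treated as Caution, exactly 25% as Restricted) is an assumption the table leaves open.

```typescript
type OperationalMode = "normal" | "caution" | "restricted" | "critical" | "exhausted";

// Maps remaining error budget (as a percentage of the window's budget) to a mode.
// Boundary assumption: exactly 50% is "caution", exactly 25% is "restricted".
function operationalMode(budgetRemainingPercent: number): OperationalMode {
  if (budgetRemainingPercent <= 0) return "exhausted";
  if (budgetRemainingPercent < 25) return "critical";
  if (budgetRemainingPercent < 50) return "restricted";
  if (budgetRemainingPercent <= 75) return "caution";
  return "normal";
}

// Whether a non-critical deployment may proceed in the given mode.
function allowsNonCriticalDeploy(mode: OperationalMode): boolean {
  return mode === "normal" || mode === "caution";
}
```

Wiring `allowsNonCriticalDeploy` into CI turns the freeze from a debate into a gate.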
Your customers feel an outage whether it is your fault or not. If an upstream provider is down for 20 minutes and your platform is transparently proxying those errors, that time may be excluded from your contractual SLA, but it still counts against the reliability your customers experience. Track it, and factor upstream provider reliability into your error budget planning from day one.
Incident Runbook: P1/P2/P3 Flows, RACI, and Postmortem Template
A monitoring playbook without an incident response plan is like a smoke detector without a fire exit. You know something is wrong, but nobody knows what to do next. Support levels connect directly to SLAs and MTTR. If your contract guarantees 99.9% uptime, your team needs crystal clear rules about what to fix immediately and what can wait.
Severity Definitions for Integration Incidents
Map each severity level to specific, measurable conditions tied to your SLIs and the telemetry from your monitoring playbook. Ambiguity in severity classification is the fastest way to turn every incident into a P1.
| Severity | Definition | Response Time | Resolution Target | Example |
|---|---|---|---|---|
| P1 - Critical | Complete platform outage or > 5% of all integration API calls failing. Error budget burn rate > 14.4x. | 15 minutes | 1 hour | Platform-wide OAuth refresh failures; token scheduler down. |
| P2 - High | Single integration provider fully degraded, or > 1% of total API calls failing. Error budget burn rate > 6x. | 30 minutes | 4 hours | All HubSpot syncs returning 500s due to a bad connector config. |
| P3 - Medium | Isolated issue affecting < 1% of traffic or a single tenant. No measurable error budget impact. | 4 business hours | 24 hours (next business day) | One enterprise customer's Salesforce webhook subscription silently deactivated. |
P1 and P2 incidents usually run on a 24/7 schedule where the clock never stops. P3 incidents usually run on an 8x5 weekday schedule where the clock pauses on nights and weekends. This distinction matters for SLA measurement - make sure your contracts specify which clock applies to each severity.
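Severity rules written this precisely can be evaluated by the alerting pipeline instead of by whoever is on call. The sketch below encodes the table; the `IncidentSignal` shape is hypothetical, and it assumes that meeting any single condition in a row is enough to assign that severity.

```typescript
interface IncidentSignal {
  platformOutage: boolean;        // complete platform outage
  providerFullyDegraded: boolean; // one integration provider fully degraded
  failingCallRatio: number;       // fraction of all integration API calls failing
  burnRate: number;               // current error budget burn rate
}

type Severity = "P1" | "P2" | "P3";

// Assumption: any one condition in a row is sufficient for that severity.
function classifySeverity(signal: IncidentSignal): Severity {
  if (signal.platformOutage || signal.failingCallRatio > 0.05 || signal.burnRate > 14.4) {
    return "P1";
  }
  if (signal.providerFullyDegraded || signal.failingCallRatio > 0.01 || signal.burnRate > 6) {
    return "P2";
  }
  return "P3";
}
```

An on-call engineer can still override the result, but the default no longer depends on judgment at 2 AM.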
Escalation Path and Paging Rules
P1 flow:
- Burn rate alert fires at > 14.4x
- Auto-page on-call engineer
- On-call acknowledges within 15 minutes
- Incident channel created automatically
- Incident commander assigned
- Status page updated within 20 minutes
- VP Engineering notified within 30 minutes
- Customer-facing communication sent within 45 minutes
P2 flow:
- Burn rate alert fires at > 6x, or single-vendor degradation detected
- Notify on-call engineer via ticket and chat
- On-call triages within 30 minutes
- Incident channel created if multiple engineers needed
- Team lead looped in within 1 hour
- Affected customers notified if SLA impact is likely
P3 flow:
- Alert fires or customer reports issue
- Ticket created in backlog
- On-call reviews during next business day
- Fix prioritized in sprint planning
RACI Matrix for Integration Incidents
| Activity | On-Call Engineer | Incident Commander | Team Lead | VP Engineering | Customer Success | Legal/Sales |
|---|---|---|---|---|---|---|
| Initial triage and diagnosis | R | I | I | - | - | - |
| Declare severity level | R | A | C | I (P1 only) | I | - |
| Implement fix or rollback | R | A | C | I (P1 only) | - | - |
| Customer communication | C | A | C | I | R | I (P1 only) |
| Status page updates | R | A | I | I | C | - |
| Escalate to VP | - | R | C | A | - | - |
| Determine SLA credit impact | - | C | C | A | R | R |
| Postmortem authorship | R | A | C | I | I | - |
R = Responsible, A = Accountable, C = Consulted, I = Informed
Postmortem Checklist
Every P1 and P2 incident requires a postmortem. P3 incidents require one if they recur within 30 days. Conduct postmortems while details are fresh, ideally within 48-72 hours. A review is a meeting and a post-mortem is an artifact, and the artifact should exist before the meeting starts, not get created during it.
Your postmortem document should cover:
- Incident summary: What happened, in two to three sentences.
- Timeline: Key events from detection to full resolution, with UTC timestamps.
- Detection method: How the incident was identified - monitoring alert, customer report, or manual discovery. If a customer found it first, that is a monitoring gap to fix.
- Root cause: The primary technical or process failure. Use the "5 Whys" technique to get past symptoms to underlying causes.
- Impact assessment: Number of affected customers, API calls impacted, error budget consumed, and estimated revenue impact.
- Resolution steps: Actions taken to restore service.
- What went well: Response actions that worked effectively.
- What went poorly: Gaps in detection, communication, or resolution.
- Corrective actions: Specific, measurable action items with assigned owners and deadlines. Turning corrective actions into work items with owners and deadlines helps teams turn lessons learned into real improvements.
- SLA impact: Was the SLA breached? If yes, what credits are owed and to which customers?
The postmortem is blameless, not consequenceless. Focus on system-level failures, not individual mistakes. The goal is organizational learning - turning one team's outage into a shared improvement that prevents the same failure pattern across all integrations.
How to Present Uptime Proof to Procurement (Dashboards and Reports)
Enterprise procurement teams do not trust your marketing page's uptime claim. They want verifiable evidence, exported from systems they can audit. If you cannot produce this evidence on demand, you will lose the deal - or worse, lose the renewal.
What Procurement Wants to See
- Historical uptime reports: Monthly and quarterly uptime percentages for at least the trailing 12 months, broken down by integration category (CRM, HRIS, ATS, etc.). Show the raw numbers, not just a single rolled-up percentage.
- Incident history: A log of every P1 and P2 incident with timestamps, duration, root cause summary, and resolution. Procurement teams look for patterns - three OAuth-related P1 incidents in six months tells a story.
- SLA compliance record: A clear mapping of contracted SLA targets versus actual performance per measurement period. If credits were issued, include them. Hiding past breaches destroys trust when discovered during due diligence.
- Current error budget status: Remaining error budget for the current measurement window. This demonstrates that you actively track reliability, not just retroactively report it.
- MTTD and MTTR: MTTR (Mean Time to Resolve) tells you how long it takes to fix issues. MTTD (Mean Time to Detect) reflects how quickly teams notice something is wrong. A low MTTD proves your monitoring works. A low MTTR proves your runbook works.
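MTTD and MTTR fall straight out of your incident log if each record carries three timestamps. In the sketch below the `IncidentRecord` shape is an assumption, and MTTR is measured from detection to resolution; some teams measure from incident start instead, so state your definition in the report.

```typescript
interface IncidentRecord {
  startedAt: number;  // epoch ms when the failure began
  detectedAt: number; // epoch ms when an alert or report identified it
  resolvedAt: number; // epoch ms when service was fully restored
}

function meanOf(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean Time to Detect, in minutes: failure start to detection.
function mttdMinutes(incidents: IncidentRecord[]): number {
  return meanOf(incidents.map(i => i.detectedAt - i.startedAt)) / 60000;
}

// Mean Time to Resolve, in minutes: detection to resolution (assumed definition).
function mttrMinutes(incidents: IncidentRecord[]): number {
  return meanOf(incidents.map(i => i.resolvedAt - i.detectedAt)) / 60000;
}
```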
Building the Procurement-Ready Report
Your internal monitoring dashboard and your procurement-facing report are different artifacts. The internal dashboard is real-time and granular. The procurement report is periodic, summarized, and accompanied by narrative context.
Structure your monthly uptime report as:
- Executive summary: One paragraph covering overall platform availability, SLA compliance status, and error budget consumption.
- Uptime by integration category: Table showing availability percentage per category with a visual indicator (green/yellow/red) against SLA targets.
- Incident log: Table of incidents with severity, duration, customer impact scope, and whether the root cause was internal or upstream.
- Trend analysis: A 6-month or 12-month chart showing availability trends. Procurement teams care about trajectory as much as absolute numbers.
- Active corrective actions: Open items from recent postmortems that demonstrate continuous improvement.
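The "uptime by integration category" table in the structure above could be generated along these lines. This is a sketch under stated assumptions: the `CategoryUptime` type is hypothetical, the default thresholds reuse this guide's example figures (99.99% SLA, 99.995% internal SLO), and the colour rule (green meets the SLO, yellow meets only the SLA, red misses the SLA) is one reasonable convention rather than a standard.

```python
from dataclasses import dataclass


@dataclass
class CategoryUptime:
    """Hypothetical per-category measurement for one reporting period."""
    category: str
    total_minutes: int        # minutes in the measurement window
    downtime_minutes: float   # minutes counted as unavailable

    @property
    def availability_pct(self) -> float:
        return 100.0 * (1 - self.downtime_minutes / self.total_minutes)


def status(availability_pct: float, sla_pct: float, slo_pct: float) -> str:
    # Assumed convention: green = internal SLO met, yellow = SLA met but
    # SLO missed, red = contractual SLA missed.
    if availability_pct >= slo_pct:
        return "green"
    if availability_pct >= sla_pct:
        return "yellow"
    return "red"


def render_table(rows, sla_pct: float = 99.99, slo_pct: float = 99.995) -> str:
    """Render a plain-text table suitable for pasting into a report."""
    lines = [f"{'Category':<10} {'Availability':>12}  Status"]
    for row in rows:
        pct = row.availability_pct
        lines.append(
            f"{row.category:<10} {pct:>11.3f}%  {status(pct, sla_pct, slo_pct)}"
        )
    return "\n".join(lines)
```

Keeping the raw `downtime_minutes` alongside the percentage lets procurement reconcile the table against the incident log.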
Maintain a public or customer-accessible status page that shows real-time and historical availability per integration category. This page should update automatically from your monitoring infrastructure - no manual editing. Customers and procurement teams should be able to subscribe to incident notifications. A status page that requires an engineer to manually update it during an outage is a status page that lies.
Where to Go From Here
Creating an operational runbook and monitoring playbook is not about predicting every possible failure. It is about building a system that degrades gracefully, signals errors predictably, and gives your engineering team standardized levers to pull when things go wrong.
If you are starting from scratch, do these six things in this order:
- Define your state machine this week. Five states, documented, with explicit transitions. No exceptions.
- Instrument the six monitoring categories. Auth health, webhook lag, outbound delivery, normalized error rate, per-tenant success rate, and token expiry drift.
- Audit your retry logic. If your code silently retries 429s, fix it. Pass them through, read the normalized IETF headers, and let callers own exponential backoff.
- Move proactive token refresh out of the request path. Schedule it. Stop waiting for 401s.
- Set your SLO-to-SLA mapping and error budget policy. Define your SLIs, set an internal SLO tighter than your public SLA, and establish automated burn rate alerts with predefined policy actions.
- Write your incident severity definitions and RACI matrix. Make sure every engineer on the on-call rotation knows the P1/P2/P3 escalation paths before the next incident, not during it.
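For the retry audit in the list above, caller-owned backoff might look like the following sketch. It assumes the platform passes 429s through with either the standard `Retry-After` header or a `RateLimit-Reset` value in seconds (the name used in drafts of the IETF RateLimit header fields document; confirm which headers your platform actually emits). The `send` callable and the function names are hypothetical.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (status code, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def server_hint_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return the wait the server asked for, if it gave one in seconds."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in ("retry-after", "ratelimit-reset"):
        value = lowered.get(name)
        if value is None:
            continue
        try:
            return max(0.0, float(value))
        except ValueError:
            continue  # e.g. Retry-After as an HTTP date; ignored in this sketch
    return None


def backoff_delay(attempt: int, headers: Mapping[str, str],
                  base: float = 1.0, cap: float = 60.0,
                  jitter: Callable[[], float] = random.random) -> float:
    """Prefer the server's hint; otherwise capped exponential backoff with jitter."""
    hint = server_hint_seconds(headers)
    if hint is not None:
        return hint
    return min(cap, base * 2 ** attempt) * jitter()


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Retry only on 429, visibly and in the caller, never silently in the platform."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response[0] != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, response[1]))
    return response  # still rate limited: surface the final 429 to the caller
```

Injecting `sleep` and `jitter` keeps the policy testable without real delays, and returning the final 429 instead of raising leaves the decision with the caller.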
The goal of the runbook isn't to eliminate failure. Upstream APIs will fail, and reliability is trending downward. The goal is to make failure boring: observable, recoverable, and bounded. If your on-call engineer can resolve a HubSpot outage by reading a checklist instead of paging the most senior engineer on the team, the runbook is doing its job.
FAQ
- How do you guarantee 99.99% uptime for third-party integrations?
- You cannot control upstream API availability, but you can guarantee 99.99% uptime for the integration platform itself. Set an internal SLO tighter than your public SLA (e.g., 99.995% SLO for a 99.99% SLA), monitor burn rate on your error budget, automate deployment freezes when budget runs low, and exclude upstream provider failures from your SLA with clear contractual language and log-based proof.
- What is the difference between an SLO and an SLA for integrations?
- An SLO (Service Level Objective) is your internal engineering target, such as 99.995% availability. An SLA (Service Level Agreement) is the contractual commitment to customers, such as 99.99% uptime with credit penalties for breaches. The SLO should always be stricter than the SLA to give your team a buffer before contractual penalties apply.
- How do you calculate an SLA credit for integration downtime?
- Use a tiered credit formula based on actual uptime versus the SLA commitment. For example, against a 99.99% SLA: uptime of at least 99.90% but below 99.99% earns a 10% credit, at least 99.00% but below 99.90% earns 25%, and below 99.00% earns 50%. The credit is calculated as Monthly Fee multiplied by the applicable Credit Percentage.
- What is an error budget and how does it apply to API integrations?
- An error budget is the maximum allowable unreliability before you breach your SLO. For a 99.99% SLO over a 30-day window, your error budget is about 4.3 minutes of downtime per month. Track burn rate to detect when you are consuming budget faster than sustainable, and tie automated actions like deployment freezes to budget thresholds.
- What should an integration incident postmortem include?
- A postmortem should cover: incident summary, timeline with UTC timestamps, detection method, root cause analysis, impact assessment (affected customers, API calls, error budget consumed), resolution steps, what went well, what went poorly, corrective actions with owners and deadlines, and SLA credit impact.
- How do you present uptime proof to enterprise procurement teams?
- Provide historical uptime reports broken down by integration category for at least 12 months, a log of all P1/P2 incidents with root causes, SLA compliance records showing contracted targets versus actual performance, current error budget status, and MTTD/MTTR metrics. Maintain an automatically updated status page that procurement can audit independently.
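The arithmetic behind the error budget and SLA credit answers above can be sanity-checked with a short sketch. The function names are hypothetical, the tier boundaries follow the example figures in the FAQ (with no credit when the SLA is met, and no gap between tiers, both of which are assumptions), and burn rate is taken as budget consumed divided by the fraction of the window elapsed.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime for the window, e.g. about 4.32 min for 99.99% over 30 days."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)


def burn_rate(downtime_minutes: float, elapsed_days: float,
              slo_pct: float, window_days: int = 30) -> float:
    """1.0 means the budget lasts exactly the window; above 1.0 it runs out early."""
    consumed = downtime_minutes / error_budget_minutes(slo_pct, window_days)
    elapsed = elapsed_days / window_days
    return consumed / elapsed


def sla_credit(monthly_fee: float, uptime_pct: float, sla_pct: float = 99.99) -> float:
    """Monthly Fee x Credit Percentage, using the example tiers from the FAQ."""
    if uptime_pct >= sla_pct:
        return 0.0   # SLA met: no credit owed (assumed)
    if uptime_pct >= 99.90:
        rate = 0.10
    elif uptime_pct >= 99.00:
        rate = 0.25
    else:
        rate = 0.50
    return monthly_fee * rate
```

For example, 2.16 minutes of downtime three days into a 30-day window against a 99.99% SLO is half the budget in a tenth of the window, a burn rate of 5.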