Connect Better Stack to AI Agents: Monitor performance, SLAs & uptime
A definitive engineering guide to connecting Better Stack to AI agents using Truto. Learn how to fetch tools, bind them to LLMs, and automate incident response.
You want to connect Better Stack to an AI agent so your system can autonomously read uptime monitors, analyze SLA availability, acknowledge incidents, and orchestrate on-call escalations. Here is exactly how to do it using Truto's /tools endpoint and SDK, bypassing the need to build and maintain a custom REST integration from scratch.
The industry is rapidly shifting from passive observability to active remediation. A monitoring system that only alerts an engineer at 3 AM is a failure of automation. Engineering teams are deploying agentic AI: autonomous systems that execute multi-step workflows across the infrastructure stack. If you want to integrate standard chat interfaces instead of custom programmatic agents, see the sibling guides on connecting Better Stack to ChatGPT and connecting Better Stack to Claude.
Giving a Large Language Model (LLM) read and write access to your Better Stack instance is technically demanding. You either spend cycles building, hosting, and maintaining a custom connector that translates raw JSON into LLM tool schemas, or you use a managed infrastructure layer that handles the translation for you. This guide breaks down exactly how to fetch AI-ready tools for Better Stack, bind them natively to an LLM using LangChain (or any framework like LangGraph, CrewAI, or Vercel AI SDK), and execute complex observability workflows.
The Engineering Reality of Better Stack's API
Building AI agents is trivial in a local script. Connecting them to external SaaS APIs in production is the bottleneck. Giving an LLM access to external data sounds simple: you write a Node.js function that makes a fetch request and wrap it in a tool decorator. When you scale that to production, the approach collapses.
Better Stack's API introduces specific relational complexities that break standard CRUD assumptions. If you build a custom integration, you own the entire lifecycle of these quirks.
The Incident State Machine
In Better Stack, an incident is not just a static record you update. It operates on a strict state machine. You cannot simply PATCH an incident to change its status to 'acknowledged' using a generic update tool. The API requires hitting explicit operational endpoints - /api/v2/incidents/:id/acknowledge, /resolve, or /escalate. LLMs accustomed to flat database models often struggle with this action-oriented architecture unless the tools are explicitly defined as discrete operational methods rather than generic write operations.
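One way to make this distinction concrete for an agent is to expose each transition as its own operation instead of a generic update. The sketch below is illustrative only: the `IncidentAction` type and the `incidentActionRequest` helper are hypothetical names, and the POST method is an assumption. Only the endpoint paths come from the description above.

```typescript
// The three state transitions named above, modeled as a closed set
// so a caller cannot request an arbitrary status string.
type IncidentAction = "acknowledge" | "resolve" | "escalate";

interface ActionRequest {
  method: "POST"; // assumption: transitions are triggered with POST
  path: string;
}

// Hypothetical helper: builds the request for one discrete transition.
function incidentActionRequest(incidentId: string, action: IncidentAction): ActionRequest {
  // Reject ids that could alter the path (slashes, dots, whitespace).
  if (!/^[A-Za-z0-9_-]+$/.test(incidentId)) {
    throw new Error(`Invalid incident id: ${incidentId}`);
  }
  return { method: "POST", path: `/api/v2/incidents/${incidentId}/${action}` };
}
```

Because the action is a union of three literals, a tool built on this helper cannot be asked to set a status that the state machine does not support.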
Heartbeats vs. Monitors
Better Stack uses distinct conceptual models for tracking uptime. A 'Monitor' actively checks an external service (like a ping to your homepage). A 'Heartbeat' is passive: it waits for your cron jobs or background workers to send an HTTP request to Better Stack and alerts if the request is missed. If you give an LLM a generic "check status" tool, it is likely to conflate the two systems. Exposing Better Stack to an agent requires explicitly mapping both Monitor and Heartbeat schemas so the LLM understands exactly which infrastructure component it is interrogating.
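The active/passive split can be sketched as a discriminated union. Every field name here is hypothetical and chosen for illustration; it is not Better Stack's actual schema. The point is only that the two resource types carry different data and answer different questions.

```typescript
// Hypothetical shapes: field names are illustrative, not Better Stack's schema.
interface Monitor {
  kind: "monitor";
  name: string;
  url: string; // the endpoint that is actively checked
}

interface Heartbeat {
  kind: "heartbeat";
  name: string;
  periodSeconds: number; // how often the job is expected to check in
  graceSeconds: number;  // extra time allowed before alerting
  lastPingAt: number;    // epoch milliseconds of the last received ping
}

type UptimeResource = Monitor | Heartbeat;

// A heartbeat is overdue when no ping arrived within period + grace.
function isHeartbeatOverdue(hb: Heartbeat, nowMs: number): boolean {
  return nowMs - hb.lastPingAt > (hb.periodSeconds + hb.graceSeconds) * 1000;
}

// Produces an unambiguous description an agent can include in its reasoning.
function describeResource(r: UptimeResource, nowMs: number): string {
  switch (r.kind) {
    case "monitor":
      return `Monitor "${r.name}" actively checks ${r.url}`;
    case "heartbeat":
      return `Heartbeat "${r.name}" expects a ping every ${r.periodSeconds}s` +
        (isHeartbeatOverdue(r, nowMs) ? " (overdue)" : " (on time)");
  }
}
```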
Rate Limits and 429 Errors
When you connect Better Stack to AI agents, you will hit rate limits. An agent tasked with generating a weekly SLA report might attempt to loop through 50 individual monitors and call the availability metrics endpoint for each one in parallel. Better Stack enforces strict API limits to protect its infrastructure.
Factual note on rate limits: Truto does not retry, throttle, or absorb backoff on rate limit errors. When the upstream Better Stack API returns an HTTP 429 Too Many Requests, Truto passes that error directly back to the caller. What Truto does do is normalize the upstream rate limit information into standardized headers (ratelimit-limit, ratelimit-remaining, ratelimit-reset) following the IETF spec.
Because Truto normalizes these headers uniformly across all integrations, you only have to write your exponential backoff logic once in your agent's execution loop, and it will work for Better Stack, Datadog, PagerDuty, or any other tool. You can find more details in our guide on best practices for handling API rate limits and retries.
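A minimal sketch of the delay calculation such backoff logic might use. It assumes that `ratelimit-reset` carries the number of seconds until the window resets; verify this against the responses you actually receive. The `retryDelayMs` helper and its default values are hypothetical.

```typescript
// Minimal header accessor so the helper works with fetch's Headers or a test double.
interface HeaderReader {
  get(name: string): string | null;
}

// Assumption: ratelimit-reset is the number of seconds until the window resets.
// If the header is missing or unparsable, fall back to capped exponential backoff.
function retryDelayMs(
  headers: HeaderReader,
  attempt: number,
  baseMs = 500,
  capMs = 30000,
): number {
  const raw = headers.get("ratelimit-reset");
  const seconds = raw === null || raw.trim() === "" ? NaN : Number(raw);
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, capMs);
  }
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

Since the function depends only on the normalized header name, the same helper can serve every integration that returns these headers.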
Fetching Better Stack AI Agent Tools
Truto maps external SaaS APIs into proxy resources (similar to how you might build MCP servers for AI agents), handling the authentication routing and schema normalization. For AI agents, Truto takes this a step further by providing a /tools endpoint. This endpoint returns the Better Stack API methods formatted strictly as JSON Schema tool definitions, which LLM frameworks natively ingest.
When you call GET https://api.truto.one/integrated-account/<better-stack-account-id>/tools, Truto returns an array of executable operations. Each operation includes the exact parameters required, preventing the LLM from guessing field names.
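As a rough sketch, the request could be built and sent as shown below. The URL pattern comes from the text above; sending the API token as a Bearer token is an assumption, and `buildToolsRequest`, `fetchTools`, and `FetchLike` are hypothetical names. The response body is left as `unknown` because its exact shape is not described here.

```typescript
interface ToolsRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds the GET request for the /tools endpoint.
// Assumption: the API token is sent as a Bearer token.
function buildToolsRequest(integratedAccountId: string, apiToken: string): ToolsRequest {
  return {
    url: `https://api.truto.one/integrated-account/${encodeURIComponent(integratedAccountId)}/tools`,
    headers: { Authorization: `Bearer ${apiToken}`, Accept: "application/json" },
  };
}

// Minimal fetch-like signature so the function can be exercised without a network.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Returns the raw JSON body; non-2xx responses are surfaced as errors.
async function fetchTools(
  fetchFn: FetchLike,
  integratedAccountId: string,
  apiToken: string,
): Promise<unknown> {
  const req = buildToolsRequest(integratedAccountId, apiToken);
  const res = await fetchFn(req.url, { headers: req.headers });
  if (!res.ok) throw new Error(`Tools request failed with HTTP ${res.status}`);
  return res.json();
}
```

In a runtime with a global `fetch`, it can be passed directly as `fetchFn`; injecting it keeps the function easy to test with a stub.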
Building Multi-Step Workflows
To build a highly reliable integration, you need an execution loop that binds the tools to the LLM, parses the LLM's requests, executes the API calls, and handles the resulting HTTP 429 errors.
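Stripped of any framework, such a loop can be sketched in a few lines. The `ModelTurn`, `Model`, and `Tool` types are hypothetical stand-ins for whatever your LLM client and tool layer provide; the sketch shows only the control flow, not a production implementation.

```typescript
// Hypothetical shapes for a framework-free loop; real SDKs define their own.
type ModelTurn =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "final"; text: string };

type Model = (transcript: string[]) => Promise<ModelTurn>;
type Tool = (args: Record<string, unknown>) => Promise<string>;

async function runAgentLoop(
  model: Model,
  tools: Map<string, Tool>,
  input: string,
  maxIterations = 15,
): Promise<string> {
  const transcript: string[] = [`user: ${input}`];
  for (let i = 0; i < maxIterations; i++) {
    const turn = await model(transcript);
    if (turn.type === "final") return turn.text;

    const tool = tools.get(turn.name);
    if (!tool) {
      // Report unknown tools back to the model instead of crashing the loop.
      transcript.push(`error: unknown tool ${turn.name}`);
      continue;
    }
    try {
      transcript.push(`tool ${turn.name}: ${await tool(turn.args)}`);
    } catch (err) {
      // Surface failures (including 429s) so the model or a retry layer can react.
      transcript.push(`error from ${turn.name}: ${String(err)}`);
    }
  }
  throw new Error(`No final answer after ${maxIterations} iterations`);
}
```

The iteration cap matters: without it, a model that keeps requesting tools would loop indefinitely.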
We provide a dedicated SDK for this pattern. Using the TrutoToolManager from the truto-langchainjs-toolset, you can inject Better Stack capabilities into a LangChain or LangGraph agent in a few lines of code.
Here is a concrete architecture implementation in TypeScript:
import { ChatOpenAI } from "@langchain/openai";
import { TrutoToolManager } from "truto-langchainjs-toolset";
import { AgentExecutor, createOpenAIToolsAgent } from "langchain/agents";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";

async function runBetterStackAgent() {
  // 1. Initialize the LLM
  const llm = new ChatOpenAI({
    modelName: "gpt-4o",
    temperature: 0,
  });

  // 2. Initialize Truto Tool Manager for the Better Stack account
  const toolManager = new TrutoToolManager({
    trutoEnvironmentId: process.env.TRUTO_ENV_ID,
    integratedAccountId: process.env.BETTER_STACK_ACCOUNT_ID,
  });

  // 3. Fetch Better Stack tools (Monitors, Incidents, SLAs, etc.)
  // Note: nothing here retries on HTTP 429; add your own backoff around
  // tool execution as described in the rate-limit section below.
  const tools = await toolManager.getTools();

  // 4. Build the prompt and bind the tools to the agent
  const prompt = ChatPromptTemplate.fromMessages([
    ["system", "You are a senior Site Reliability Engineer. You have access to Better Stack tools to investigate monitors, acknowledge incidents, and check SLAs. If an API call fails due to rate limits, wait and try again."],
    ["human", "{input}"],
    new MessagesPlaceholder("agent_scratchpad"),
  ]);

  const agent = await createOpenAIToolsAgent({
    llm,
    tools,
    prompt,
  });

  const executor = new AgentExecutor({
    agent,
    tools,
    // Ensure the executor does not stop early on complex lookups
    maxIterations: 15,
  });

  // 5. Execute a multi-step query
  const result = await executor.invoke({
    input: "List all our monitors. Find the one named 'Primary Database' and check its SLA availability for the last 30 days. If the availability is under 99.9%, create a new incident.",
  });

  console.log(result.output);
}

runBetterStackAgent().catch(console.error);

Handling Rate Limits in the Execution Loop
Because Truto passes the 429 status code and standardized ratelimit-reset headers back to your application, you must implement a retry wrapper if you expect heavy read volume from your agent.
If you are using a framework like LangGraph, you can build a fallback node that catches tool execution errors, checks the headers, pauses the thread execution until the reset timestamp, and then re-invokes the tool node. This prevents your agent from hallucinating a response when it is temporarily locked out of the Better Stack API.
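A framework-independent version of that retry wrapper might look like the sketch below. It assumes your HTTP layer throws an object carrying `status: 429` and, when available, a `resetSeconds` value parsed from the `ratelimit-reset` header; both field names and the `withRateLimitRetry` helper are hypothetical.

```typescript
// Hypothetical error shape: adapt to however your HTTP layer reports 429s.
interface RateLimited {
  status: 429;
  resetSeconds?: number; // assumption: parsed from the ratelimit-reset header
}

function isRateLimited(err: unknown): err is RateLimited {
  return typeof err === "object" && err !== null &&
    (err as { status?: unknown }).status === 429;
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries only on rate-limit errors; any other error propagates immediately.
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      // Prefer the server-provided reset; otherwise back off exponentially.
      const waitMs = typeof err.resetSeconds === "number"
        ? err.resetSeconds * 1000
        : 500 * 2 ** attempt;
      await sleep(waitMs);
    }
  }
}
```

Injecting `sleep` keeps the wrapper testable without real delays; in production the default timer-based implementation applies.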
Hero Tools
When connecting Better Stack to AI agents, do not dump 100 unused generic endpoints into the LLM context window. Focus on high-leverage operations. Here are the core hero tools that enable meaningful infrastructure automation.
better_stack_incidents_acknowledge
This tool allows the agent to mark a specific Better Stack incident as seen and under review. It moves the incident to the acknowledged state, stopping the escalation chain. It requires the incident_id.
Usage notes: Highly effective when paired with an automated log analysis agent. If the agent identifies the root cause in Datadog, it can immediately acknowledge the Better Stack incident and post the root cause in the incident comments. This approach is central to how teams orchestrate incident response across Datadog, PagerDuty, and Slack.
"Acknowledge incident ID 992834 and stop the current paging rotation. I am taking over the investigation."
better_stack_incidents_escalate
This tool promotes an incident to a higher severity or notifies the next tier of responders based on the escalation policy.
Usage notes: Use this in an agentic triage workflow. If an agent detects that a P3 ticket involves a critical payment gateway, it can autonomously call this tool to escalate the ticket to a P1 without waiting for a human to read the alert.
"This database latency issue affects checkout. Escalate incident 110293 to the Tier 3 database reliability engineering team immediately."
better_stack_monitors_availability
Retrieves SLA availability data and metrics for a specific Better Stack monitor.
Usage notes: Crucial for generating executive reports. The agent must first query the list of monitors to find the correct monitor_id, then pass that ID into this tool to retrieve the actual uptime percentages.
"Pull the availability SLA metrics for the 'Auth Service' monitor for the previous quarter. Did we breach our 99.95% agreement?"
list_all_better_stack_on_call_schedule_rotations
Lists all rotations for a Better Stack on-call schedule. Returns the schedule data, letting the agent know who is currently carrying the pager.
Usage notes: An agent can use this tool to dynamically route non-critical questions in Slack to the correct engineer who is currently on-call for a specific service, rather than pinging a generic team channel.
"Check the Better Stack on-call schedule for the frontend team. Who is currently on rotation right now?"
create_a_better_stack_status_page_report
Creates a new status report (an active incident notice) on a public or private Better Stack status page. Requires the status_page_id.
Usage notes: Perfect for autonomous incident communication. The agent can take a technical incident description, rewrite it into customer-friendly language, and post it directly to the status page.
"Create a new report on our public status page indicating that we are experiencing degraded performance on the image upload service. Set the status to 'Investigating'."
To view the complete schema definitions, required parameters, and the full inventory of available Better Stack tools - including heartbeats, metadata, and escalation groups - visit the Better Stack integration page.
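To keep the context window small, the full tool list can be narrowed to an allowlist before binding it to the LLM. This is a minimal sketch: `HERO_TOOLS` and `selectTools` are hypothetical names, and the only assumption about a tool object is that it exposes a `name` property.

```typescript
// The five tool names highlighted in this section.
const HERO_TOOLS = [
  "better_stack_incidents_acknowledge",
  "better_stack_incidents_escalate",
  "better_stack_monitors_availability",
  "list_all_better_stack_on_call_schedule_rotations",
  "create_a_better_stack_status_page_report",
] as const;

// Keeps only the tools whose name appears in the allowlist, preserving order.
function selectTools<T extends { name: string }>(
  all: T[],
  allowlist: readonly string[],
): T[] {
  const allowed = new Set(allowlist);
  return all.filter((tool) => allowed.has(tool.name));
}
```

Most workflows also need lookup tools to resolve IDs first (for example, a tool that lists monitors or incidents), so extend the allowlist per use case instead of treating these five names as complete.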
Workflows in Action
Connecting Better Stack to AI agents enables autonomous workflows that previously required manual human intervention. Here is how specific personas utilize these tools in production.
1. The P1 Incident Escalation (Site Reliability Engineer)
During a major outage, an SRE needs to cross-reference multiple alerts and ensure the right people are awake.
"Find the critical incident related to 'Redis Timeout'. Escalate it to the Data team, then fetch the current on-call rotation for the platform team so I know who is joining the bridge."
Agent Execution Steps:
- Calls list_all_better_stack_incidents, filtering for the term 'Redis Timeout' to retrieve the exact incident_id.
- Calls better_stack_incidents_escalate, passing the incident_id to trigger the escalation policy.
- Calls list_all_better_stack_on_call_schedules to find the platform team's schedule ID.
- Calls list_all_better_stack_on_call_schedule_rotations to identify the specific engineer currently paged.
Result: The LLM successfully escalates the incident to the appropriate tier and returns the name of the on-call platform engineer directly to the SRE's prompt interface.
2. The Executive Uptime Report (Engineering Manager)
Managers waste hours compiling SLA reports for vendor reviews. An AI agent can pull this instantly.
"Generate an SLA report for our three core API monitors. Get the availability metrics for each one, and if any are below 99.9%, identify which specific incident caused the downtime."
Agent Execution Steps:
- Calls list_all_better_stack_monitors to get the IDs for the core APIs.
- Loops through the IDs, calling better_stack_monitors_availability for each one.
- Evaluates the returned SLA numbers. If one is 99.8%, it moves to the next step.
- Calls list_all_better_stack_incidents for that specific monitor to find the root cause.
Result: The agent returns a neatly formatted markdown table detailing the SLA performance of the three APIs, along with a summary of the specific outage that breached the agreement.
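The threshold check in this workflow is simple arithmetic, and it can help to hand the agent a deterministic helper instead of letting it compute in prose. Over a 30-day window there are 30 × 24 × 60 = 43,200 minutes, so a 99.9% target allows about 43.2 minutes of downtime. The sketch below uses hypothetical names (`allowedDowntimeMinutes`, `findBreaches`, `availabilityPercent`), not fields from the Better Stack response.

```typescript
// Downtime budget implied by an availability target over a window.
function allowedDowntimeMinutes(targetPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - targetPercent / 100);
}

interface AvailabilityResult {
  monitorId: string;
  availabilityPercent: number; // hypothetical field name
}

// Returns the monitors whose availability fell below the target.
function findBreaches(
  results: AvailabilityResult[],
  targetPercent: number,
): AvailabilityResult[] {
  return results.filter((r) => r.availabilityPercent < targetPercent);
}
```

With three monitors at 99.95%, 99.8%, and 99.9% and a 99.9% target, only the 99.8% monitor is returned, which matches the step where the agent goes on to look up the related incident.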
3. Status Page Automation (Customer Support Lead)
Support teams need to keep customers informed without waiting on engineering to write public updates.
"We have an ongoing incident regarding degraded search functionality. Please check the current incident status, and create a new update on our public status page letting customers know we have identified the issue and expect a fix in 30 minutes."
Agent Execution Steps:
- Calls list_all_better_stack_incidents to confirm the internal status of the search issue.
- Calls list_all_better_stack_status_pages to locate the ID for the public-facing page.
- Calls create_a_better_stack_status_page_report using the page ID, translating the technical reality into the requested customer-friendly update.
Result: The public status page is updated immediately without context switching, ensuring customers are informed while engineers focus entirely on the fix.
Connecting Better Stack to AI agents transforms an observability platform into an active participant in your incident response strategy. By leveraging Truto's /tools endpoint, you bypass the friction of writing custom integration code, managing schema updates, and handling raw OAuth configurations.
Your engineering team can focus entirely on refining the agent's logic, prompt structure, and rate-limit backoff handling, leaving the API boilerplate to the infrastructure layer.
FAQ
- Does Truto automatically handle Better Stack rate limits for AI agents?
- No. Truto does not retry, throttle, or apply backoff on rate limit errors. When Better Stack returns an HTTP 429, Truto passes that error to the caller, normalizing the rate limit information into standardized IETF headers. Your agent logic must implement the retry and backoff.
- Can I use Truto's Better Stack tools with frameworks other than LangChain?
- Yes. Truto's /tools endpoint returns standard JSON Schema definitions for every API method. These can be bound to any LLM framework, including LangGraph, CrewAI, Vercel AI SDK, or custom execution loops.
- What is the difference between a Better Stack monitor and a heartbeat tool?
- A monitor in Better Stack actively checks an external endpoint or service for uptime. A heartbeat is a passive endpoint that expects your cron jobs or background workers to ping it on a schedule. Truto provides distinct tools for both resource types.
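As an illustration of the framework-agnostic point in the FAQ above, a JSON Schema tool definition can be wrapped into the function-tool format used by OpenAI's Chat Completions API without any SDK. The `ToolDefinition` shape is an assumption about what one entry from the /tools endpoint looks like; check the actual response before relying on these field names.

```typescript
// Assumed shape of one tool entry (hypothetical field names).
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // a JSON Schema object
}

// OpenAI Chat Completions function-tool format.
interface OpenAITool {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>;
  };
}

// Wraps a JSON Schema tool definition without altering the schema itself.
function toOpenAITool(def: ToolDefinition): OpenAITool {
  return {
    type: "function",
    function: {
      name: def.name,
      description: def.description,
      parameters: def.parameters,
    },
  };
}
```

Other frameworks accept similar structures, so the adapter usually amounts to renaming a few keys around an unchanged schema.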