How to Implement Semantic Routing for AI Agents to Select API Endpoints
Cut AI agent latency from 5,000ms to 100ms with semantic routing. A senior engineer's guide to vector-based intent classification and unified API execution.
You are building a multi-agent system. The reasoning engine works perfectly in your local prototype. Your agent correctly identifies the user's intent, chains function calls, reasons through multi-step workflows, formats the required JSON arguments, and triggers the function call perfectly.
Then you deploy it to a customer's production environment, where your agent needs to act inside Salesforce, Jira, Workday, and 30 other SaaS systems. Suddenly, you are spending your weeks debugging OAuth token refresh failures, wrestling with undocumented pagination quirks, and watching your system choke on rate limits from vendors who haven't updated their developer portals since 2018.
The large language model is not the bottleneck. The integration infrastructure is.
Agentic AI adoption is accelerating rapidly. As of 2025, 79% of organizations report some level of AI agent adoption, with 96% planning to expand their usage in 2026. But demand does not equal success. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
One of the primary drivers of these escalating costs is architectural inefficiency. Specifically, using massive, slow LLMs to make basic routing decisions about which SaaS API endpoint to call. Once production traffic hits, your agent stuffs every tool definition for every API into the context window on every single turn, blowing through tokens, pushing first-token latency to 4-6 seconds, and still picking the wrong endpoint when the catalog gets large.
Semantic routing is the architectural fix: a vector-based decision layer that classifies user intent in roughly 100ms and hands the agent a small, relevant subset of tools, before any expensive LLM inference happens.
This guide is for senior PMs and AI engineering leads who already understand function calling and now need to ship a multi-agent system that does not melt under cost or latency. We will walk through why monolithic LLM routers fail, how semantic routing actually works under the hood, and how to combine it with a unified API layer so a single intent like "create CRM contact" maps cleanly to 50 different SaaS providers instead of 50 brittle code paths.
The Problem: Why LLMs Make Terrible API Routers
The standard approach to giving an AI agent access to external tools is the "monolithic prompt" anti-pattern. Developers define dozens of external API endpoints as JSON schemas and dump all of them into the LLM's system prompt.
Current AI agents send every single tool definition to the LLM on every request. When you ask Claude to "create an issue in my repo," it receives descriptions for all available tools, even though it only needs create_issue. If your customer connects HubSpot, Jira, Slack, NetSuite, Greenhouse, and Zendesk through a unified agent, you can be looking at 200+ tool schemas per turn.
When a user types, "Update the status of my Jira ticket," the LLM has to read the prompt, evaluate 50 to 200 different tool schemas, reason about which one matches the intent, and generate a structured response. Using a monolithic LLM to route every user request to an API endpoint is a massive compute drain that inflates token usage.
This creates three immediate problems in production:
- Token bloat and context limits: Research on iterative tool routing shows that dynamically exposing only the minimal tool subset per step reduces per-step context tokens by 95% and improves correct tool routing by 32%, because monolithic tool catalogs consume up to 90% of available context. The monolithic prompt approach turns a simple 10-word query into a 1,500-token payload before generation even begins.
- Selection accuracy collapses with scale: Studies show performance drops 7-85% when tool count increases from small to large catalogs. The more tools you present to the LLM, the higher the probability it hallucinates arguments or selects the wrong endpoint entirely. The model gets confused between similar endpoints (`update_contact` vs `upsert_contact` vs `merge_contacts`).
- Latency is brutal: Standard LLM inference over a large context window can take roughly 5,000ms to return a structured tool call, and even a lean LLM-only router adds 500-2000ms to every request regardless of complexity. For a system processing thousands of requests per day, that cost is unacceptable. For a multi-step agentic workflow, waiting seconds just to decide which API to call ruins the user experience.
The financial impact compounds. Building custom AI agents that deeply integrate with internal systems and SaaS APIs requires a significant engineering budget. Developing a business AI agent that connects with CRMs and databases costs $25,000 to $60,000, while advanced autonomous agents cost $85,000 to $150,000+. Wasting expensive LLM tokens on basic routing decisions inflates ongoing operational costs unnecessarily. You are paying GPT-4 rates to answer the question "is this a CRM request or a ticketing request?"
LLMs are reasoning engines. They should be used for complex extraction, summarization, and planning. They should not be used as expensive if/else routers.
What Is Semantic Routing for AI Agents?
Semantic routing is a lightweight, vector-based decision layer that sits in front of your LLM. It intercepts user prompts, classifies the intent using vector embeddings, and routes the request to the correct tool or API endpoint in a fraction of a second.
Instead of asking an LLM, "Which of these 50 tools should I use?" a semantic router uses math to find the closest match. Rather than waiting on a slow LLM generation for every tool-use decision, it routes requests by semantic meaning in vector space, making it a very fast decision-making layer for LLMs and agents.
The core trade-off is straightforward. An LLM call to classify intent is flexible but slow and expensive. A vector similarity lookup is fast and cheap but only works if you have well-defined route categories. Lightweight vector math reduces routing latency from roughly 5,000ms to about 100ms. A peer-reviewed evaluation of an intent-aware semantic router on the MMLU-Pro benchmark achieved a 10.2 percentage point improvement in accuracy while reducing response latency by 48.5% compared to direct inference.
Note on industry terminology: Some platforms, like Kong AI Gateway, position semantic routing primarily as a load-balancing mechanism to route prompts between different LLMs (e.g., routing simple queries to Llama 3 and complex queries to GPT-4). While valid, this guide focuses on semantic routing for tool selection - routing user intent to specific external SaaS API endpoints.
In agent architectures, semantic routing sits before the LLM:
flowchart LR
A[User Query] --> B[Embedding Model]
B --> C{Vector Similarity<br>Search}
C -->|CRM intent| D[CRM Tool Subset]
C -->|Ticketing intent| E[Ticketing Tool Subset]
C -->|Accounting intent| F[Accounting Tool Subset]
D --> G[LLM with<br>filtered tools]
E --> G
F --> G
G --> H[Function Call<br>to Unified API]

Three architectural variants you will see in the wild
- Pure semantic routing: Embedding similarity selects exactly one route. It is the cheapest and fastest, but fails on ambiguous multi-domain queries.
- Cascade routing: Rules first, embeddings second, LLM as the last resort. A three-tier cascade tries the cheapest routing method first, escalates to a more expensive one only when the cheap method cannot produce a confident match, and exits as soon as any tier reaches a confidence threshold. Most requests route in milliseconds, and hard requests route accurately, without paying for accuracy on requests that do not need it.
- Decomposed semantic discovery (RAG over tools): Index every tool, parameter, and return type as separate vectors. Semantic discovery uses vector embeddings to match user intent with relevant tools. Embed the user's query, search for semantically similar tools, and send only the top matches to your LLM. Instead of sending all available tools when someone asks "create an issue," you send the most semantically relevant ones: create_issue, update_issue, create_issue_comment.
For most B2B SaaS agent stacks, the cascade pattern wins. Hard-coded rules handle deterministic intents (/billing slash commands, OAuth callbacks), embeddings handle the long tail of natural language queries, and the LLM only intervenes when confidence is low.
How Semantic Routing Works for SaaS Integrations
The mechanics are simpler than the marketing implies. Implementing a semantic router for API selection requires building an index of "routes" and comparing incoming queries against that index. You need four things: an embedding model, a set of route definitions with example utterances, a similarity metric (cosine is standard), and a confidence threshold.
Step 1: Define routes as intents, not endpoints
The biggest mistake teams make is defining routes as raw API endpoints (POST /crm/v3/objects/contacts). Routes should be intents that map to a unified action. The intent layer is what lets you swap providers without touching routing code.
A "route" represents a specific API endpoint or unified action. An "utterance" is an example of what a user might say to trigger that route. You can define these in a simple JSON configuration or using a Python routing library:
from semantic_router import Route
crm_create_contact = Route(
    name="crm.contact.create",
    utterances=[
        "add a new contact for Acme Corp",
        "create a lead named Sarah from Stripe",
        "log this person we met at the conference",
        "new prospect from inbound form",
        "Add a new customer record",
    ],
)

ticketing_get_status = Route(
    name="ticketing.issue.status",
    utterances=[
        "What is the status of my ticket?",
        "Check on issue PROJ-123",
        "Has support replied to my request?",
    ],
)

Step 2: Pre-compute and cache route embeddings
When your agent initializes, you pass these utterances through a fast, lightweight embedding model (like text-embedding-3-small or a local ONNX model like bge-small-en). This converts the human-readable text into dense vector arrays. These arrays represent the semantic meaning of the phrases.
Store the vectors in memory or a vector store like pgvector, Qdrant, or Pinecone. Caching these embeddings at startup means you pay the embedding cost once, not on every request. Adding a new route is as simple as adding an entry with good example utterances. No rule updates, no regex patterns, just semantic matching.
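A minimal sketch of the pre-compute step, assuming an injectable embedding callable (in production, OpenAI's `text-embedding-3-small`; here a toy character-frequency embedder so the example runs offline):

```python
import numpy as np

def build_route_index(routes: dict[str, list[str]], embed_fn) -> dict[str, np.ndarray]:
    # Pre-compute one embedding matrix per route at agent startup.
    # routes maps route names to example utterances; embed_fn is any
    # text -> vector callable.
    return {
        name: np.stack([embed_fn(u) for u in utterances])
        for name, utterances in routes.items()
    }

def toy_embed(text: str) -> np.ndarray:
    # Stand-in embedder: normalized character-frequency vector.
    # Replace with a real embedding model in production.
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

index = build_route_index(
    {"crm.contact.create": ["add a new contact", "create a lead"]},
    toy_embed,
)
print(index["crm.contact.create"].shape)  # (2, 26)
```

Because the matrices are built once at initialization, runtime routing is reduced to a single query embedding plus a matrix similarity computation.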
Step 3: Embed the query and run similarity search
When a user submits a query at runtime, you embed it using the exact same model, calculate the cosine similarity between the query vector and your pre-computed utterance vectors, and route to the intent whose utterances score highest.
Cosine similarity measures the angle between two vectors in a multi-dimensional space. A score of 1.0 means the vectors point in the exact same direction (identical meaning).
import numpy as np
from openai import OpenAI
client = OpenAI()
def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding)

def route(query: str, routes: dict, threshold: float = 0.85) -> str | None:
    q_vec = embed(query)
    best_route, best_score = None, 0.0
    for name, vectors in routes.items():
        # max similarity across the route's utterance vectors
        score = max(
            np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v))
            for v in vectors
        )
        if score > best_score:
            best_route, best_score = name, score
    return best_route if best_score >= threshold else None

Step 4: Handle the long tail with cascade fallbacks
When no route clears the threshold (e.g., 0.85), fall back to LLM-based routing with a curated tool subset. This is the safety net for novel phrasing. Track these escalations - they are your roadmap for new utterances to add.
Set your confidence threshold based on observed false-positive rates, not gut feel. Log every routing decision with the top-3 scores. If your top-2 scores are within 0.05 of each other, escalate to an LLM tiebreaker. That is your ambiguity zone.
Step 5: Hand off to the agent with a filtered tool set
Because the router bypassed the LLM, you still need to extract the actual arguments (like the contact's name or email) to execute the API call. Once you have an intent, fetch only the tools relevant to that intent and pass them to the LLM (or a specialized Named Entity Recognition model) with a strict instruction: "Extract the name and email from this text to match this specific JSON schema."
This is where the cost and latency wins compound. The agent gets 5 tools instead of 200, picks correctly, and emits a function call. This targeted extraction is significantly faster and cheaper than asking a massive LLM to simultaneously determine the intent and extract the variables.
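To make the hand-off concrete, here is a sketch of the extraction request once the router has resolved `crm.contact.create`. The tool schema and field names are illustrative assumptions, not Truto's actual unified schema; the payload follows the OpenAI chat-completions tool-calling shape:

```python
# Illustrative unified schema for the resolved intent. The point is
# that the LLM receives exactly ONE tool, not the full catalog.
CREATE_CONTACT_TOOL = {
    "type": "function",
    "function": {
        "name": "create_crm_contact",
        "description": "Create a contact in the connected CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "first_name": {"type": "string"},
                "last_name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["first_name", "last_name"],
        },
    },
}

def build_extraction_request(user_text: str) -> dict:
    # Build a chat-completions payload with only the routed tool,
    # forcing the model to do argument extraction, not tool selection.
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system",
             "content": "Extract the contact fields from the user's text "
                        "to match the tool's JSON schema exactly."},
            {"role": "user", "content": user_text},
        ],
        "tools": [CREATE_CONTACT_TOOL],
        "tool_choice": {"type": "function",
                        "function": {"name": "create_crm_contact"}},
    }

req = build_extraction_request("Add Sarah Chen, sarah@stripe.com")
print(len(req["tools"]))  # 1
```

Forcing `tool_choice` removes the selection step entirely, so the LLM's only job is filling the schema, which is the cheap, reliable part.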
The Missing Link: Standardized API Schemas
Semantic routing solves the intent classification problem. It tells your agent what to do. But it does not solve the execution problem. Here is where most semantic routing tutorials lie to you. They show a clean demo where intent crm.contact.create triggers a single function.
In production, your customers use Salesforce, HubSpot, Pipedrive, Close, Zoho, and Microsoft Dynamics. Every one of those CRMs models a contact differently.
- Salesforce: flat PascalCase fields (`FirstName`, `LastName`, `Email`), custom objects via `__c` suffix, SOQL for filtering.
- HubSpot: nested `properties` object, `filterGroups` arrays for search, snake_case property names.
- Pipedrive: yet another shape with custom fields keyed by hash IDs.
If your semantic router resolves intent to crm.contact.create and then your agent has to ask the LLM "now figure out the right shape for whichever CRM this customer connected", you have re-introduced the exact token bloat and latency problem you solved. The router becomes a bottleneck that hands off to a worse bottleneck.
If you rely on point-to-point integrations, your routing logic becomes infected with integration-specific code:
if (route === 'crm_create_contact') {
  if (provider === 'salesforce') {
    // Load Salesforce schema, ask LLM to format for Salesforce
  } else if (provider === 'hubspot') {
    // Load HubSpot schema, ask LLM to format for HubSpot
  }
}

Maintaining 50 different schemas for a single semantic route destroys the efficiency gains of the router. This is why semantic routing only delivers its theoretical wins when the destination APIs share a common data model.
With a unified API like Truto, the same intent maps to the same request shape regardless of provider. Truto normalizes wildly different SaaS responses into a single, predictable schema using JSONata-based mappings. Your semantic router targets one intent (crm.contact.create), the agent emits one function call against one schema, and the platform handles the per-provider translation.
Behind the scenes, JSONata expressions handle the translation. A JSONata request mapping for HubSpot might look like this:
{
  "properties": {
    "firstname": body.first_name,
    "lastname": body.last_name,
    "email": body.email
  }
}

While the mapping for Salesforce looks like this:
{
  "FirstName": body.first_name,
  "LastName": body.last_name,
  "Email": body.email
}

Your agent only ever sees the unified `body.first_name` schema. The semantic router selects the route, the agent formats the unified JSON, and the unified API handles the rest. For the deeper mechanics, see our guide on LLM function calling for integrations.
Combining Semantic Routing With Unified APIs and MCP
The end-state architecture for a truly scalable, production-ready multi-agent SaaS integration system has three distinct layers, each doing what it is best at:
- Semantic Routing for high-speed intent classification.
- Model Context Protocol (MCP) for standardized tool discovery.
- Unified APIs for normalized execution.
flowchart TB
A[User Query] --> B[Semantic Router<br>~100ms]
B -->|intent + confidence| C{Confidence<br>> threshold?}
C -->|Yes| D[Filtered Tool Set<br>5-10 schemas]
C -->|No| E[LLM Tiebreaker<br>with top-K tools]
E --> D
D --> F[LLM Function Call]
F --> G[Unified API / MCP Server]
G --> H[Provider 1: HubSpot]
G --> I[Provider 2: Salesforce]
G --> J[Provider N: ...]

Layer 1: Semantic router (intent classification)
Fast, cheap, deterministic. Resolves "what does the user want to do?" in roughly 100ms.
Layer 2: Tool schema provider (MCP or function calling)
Given an intent, fetch the corresponding JSON schema for the LLM to fill. Truto auto-generates MCP tools dynamically from integration documentation, producing properly typed JSON schemas with auto-injected id, limit, and next_cursor properties for list endpoints. The agent receives clean, schema-validated tools without you hand-writing them. More on this pattern in our auto-generated MCP tools guide.
Layer 3: Execution layer (unified API)
The agent emits a function call against a unified schema. The unified API does the per-provider translation: OAuth refresh, pagination quirks, field mapping, error normalization. Because the runtime is data-driven with zero integration-specific code, adding a new SaaS to your routing logic is a configuration change, not a code deploy.
The execution flow operates like this:
- The semantic router intercepts the user query and identifies the `crm.contact.create` intent.
- The agent looks at its available MCP tools and finds the filtered `create_a_crm_contact` tool provided by the platform.
- The agent formats the arguments according to the unified JSON schema provided by the MCP server.
- The agent executes the tool call.
- The platform receives the unified payload, translates it into the provider-specific format via JSONata, handles the OAuth token injection, and executes the HTTP request against the third-party API.
Your customers simply authenticate their preferred SaaS tool via a frontend component, and your agent instantly knows how to route to it, format data for it, and execute against it.
Where the trade-offs actually live
Let's be honest about the costs. Semantic routing is not free:
| Concern | Reality |
|---|---|
| Cold-start tuning | You need 5-10 quality utterances per route. Bad utterances yield bad routing. |
| Threshold tuning | Too high and you miss valid queries. Too low and you misroute. Plan to iterate based on logs. |
| Multi-intent queries | "Create a Jira ticket and notify the AE in Slack" needs splitting. Pure routing breaks here. |
| Embedding cost | One embedding call per query. Cheap but non-zero. Cache at the query level. |
| Drift | New product features mean new utterances. Routes need ownership and review cadence. |
A critical note on rate limits: While Unified APIs handle schema translation, they do not absorb upstream architectural failures. Truto normalizes upstream rate limit info into standardized headers (ratelimit-limit, ratelimit-remaining, ratelimit-reset) per the IETF spec. However, when an upstream API returns an HTTP 429, Truto passes that error directly to the caller. Your agent's execution loop is fully responsible for implementing exponential backoff and retry logic, because semantic routers should not silently absorb upstream errors.
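A minimal backoff sketch for that execution loop, assuming a hypothetical `call` wrapper around your unified-API HTTP client that returns `(status, headers, body)`:

```python
import random
import time

def execute_with_backoff(call, max_retries: int = 5):
    # Retry a tool call on HTTP 429 with exponential backoff plus jitter.
    # Honors the standardized ratelimit-reset header when the provider
    # supplies one; otherwise falls back to 2^attempt seconds.
    for attempt in range(max_retries):
        status, headers, body = call()
        if status != 429:
            return body
        delay = float(headers.get("ratelimit-reset",
                                  2 ** attempt + random.random()))
        time.sleep(delay)
    raise RuntimeError("rate limited after %d retries" % max_retries)
```

Keeping this logic in the agent's execution layer, rather than the router, preserves the separation of concerns: the router decides what to call, the executor decides when it is safe to call it again.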
A Reference Implementation Sketch
If you are starting from scratch, this stack will get you to production:
- Embedding model: `text-embedding-3-small` from OpenAI (cheap, fast, 1536 dims) or `bge-small-en` self-hosted.
- Vector store: pgvector on Postgres if you have under 1M route vectors. Qdrant or Pinecone if you exceed that.
- Router framework: Aurelio Labs' `semantic-router` library, or roll your own with 50 lines of NumPy if your route count is small.
- Tool layer: MCP server pointing at a unified API, or direct function calling with schemas pulled from the unified API's OpenAPI spec.
- Observability: log every routing decision with `query`, `selected_route`, `top_3_scores`, `final_tool_called`, `success`. This dataset is gold for retraining utterances and tuning thresholds.
Next Steps
Semantic routing is the highest-leverage optimization you can make for a multi-tool agent before you start fighting token economics in production. Start by auditing your current tool catalog: count how many tool schemas you currently send per turn, measure your average input token count, and identify your top 5-10 intents from real usage logs. Define routes for those intents, embed 8-10 utterances each, and ship the router behind a feature flag with shadow-mode logging so you can compare its decisions against your existing LLM-based routing before cutting over.
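Shadow mode is simple to wire up. A sketch, where both routers are hypothetical callables returning a route name and `log` is your structured logger:

```python
def shadow_route(query: str, llm_router, semantic_router, log) -> str:
    # The existing LLM router still serves traffic; the candidate
    # semantic router runs alongside it, and every decision (and
    # disagreement) is logged for offline comparison before cutover.
    live = llm_router(query)         # current production path
    shadow = semantic_router(query)  # candidate path, never served
    log({"query": query, "live": live, "shadow": shadow,
         "agree": live == shadow})
    return live
```

A week of this logging tells you exactly which intents the semantic router handles safely and which utterance sets need more examples before the flag flips.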
If you are building agents that touch multiple SaaS APIs and you do not want to spend the next two quarters writing per-provider field mappers, OAuth refresh handlers, and pagination cursors for every CRM, ATS, and ticketing system your customers use, the unified API layer is what makes semantic routing actually deliver in production.
FAQ
- What is semantic routing for AI agents?
- Semantic routing is a decision layer that uses vector embeddings and cosine similarity to classify user intent and route requests to the correct tool or API endpoint, avoiding the latency and cost of using a large LLM for every routing decision. It typically resolves intent in around 100ms versus 500-2000ms for an LLM call.
- Why do LLMs perform worse as the tool catalog grows?
- Research shows tool selection accuracy can drop 7-85% when catalogs grow from small to large, and monolithic tool catalogs can consume up to 90% of the available context window. Long, similar tool descriptions confuse the model and crowd out the actual user query.
- How does semantic routing work with unified APIs?
- The router resolves the user query to a unified intent like `crm.contact.create`, then a unified API translates that single intent into the correct provider-specific call - Salesforce, HubSpot, Pipedrive, etc. - using a normalized schema. Without unification, the agent still has to handle each provider's idiosyncratic field names, which re-introduces the complexity semantic routing was meant to remove.
- Does Truto handle rate limit retries automatically?
- No. Truto normalizes rate limit headers (limit, remaining, reset) per the IETF spec, but passes HTTP 429 errors directly to the caller. Your agent is responsible for implementing retry and exponential backoff logic.