Connect Sarvam to AI Agents: Build Indic Voice and Document Workflows

You want to connect Sarvam to an AI agent so your system can independently transcribe regional Indic audio, execute text translation across distinct language codes, and run batch document digitization pipelines. Here is exactly how to do it using Truto's /tools endpoint and SDK, bypassing the need to hand-code complex async job polling logic for every LLM interaction.

Giving a Large Language Model (LLM) read and write access to advanced speech and language AI endpoints is an engineering headache. You either spend weeks building, hosting, and maintaining a custom connector that handles multipart file uploads and strict retry logic, or you use a managed infrastructure layer that handles the boilerplate for you. If your team uses ChatGPT, check out our guide on connecting Sarvam to ChatGPT, or if you are building on Anthropic's models, read our guide on connecting Sarvam to Claude. For developers building custom autonomous workflows, you need a programmatic way to fetch these tools and bind them to your agent framework.

This guide breaks down exactly how to fetch AI-ready tools for Sarvam, bind them natively to an LLM using LangChain (or any framework like LangGraph, CrewAI, or Vercel AI SDK), and execute complex localized language processing workflows. For a deeper look at the architecture behind this approach, refer to our research on architecting AI agents and the SaaS integration bottleneck.

The Engineering Reality of Custom Sarvam Connectors

Building AI agents is easy. Connecting them to external, specialized AI APIs is hard. Giving an LLM access to external data sounds simple in a prototype: you write a Node.js function that makes a fetch request and register it as a tool. In production, this approach collapses when the tool has to handle heavy processing tasks.

If you decide to build a custom integration for Sarvam, you own the entire API lifecycle. Sarvam's API introduces several highly specific integration challenges that break standard LLM assumptions.

The Asynchronous Batch Job and Polling Trap

Most modern agent frameworks assume a synchronous tool execution loop: the agent calls a tool, the tool hits an API, the API returns data, and the agent continues reasoning. Sarvam handles massive workloads - like transcribing hours of speech or digitizing complex documents - via asynchronous batch jobs.

When you trigger create_a_sarvam_speech_to_text_translate_job, the API does not return a transcript. It returns a job_id. The agent must then repeatedly call get_single_sarvam_speech_to_text_translate_job_by_id to check the status. LLMs have no built-in concept of time or rate limiting. Left unconstrained, an agent can call the status check tool hundreds of times within seconds and trigger rate limit blocks. Sarvam's API requires a delay between consecutive status polling requests (a minimum of 5ms, though in practice you need much longer backoffs). If you hand-code this, you must write polling wrappers that pause the agent loop between checks.
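A polling wrapper of the kind described above might look like the following minimal sketch. The names pollJob, backoffDelayMs, and PollOptions are hypothetical, and the status strings passed to isDone are whatever your job endpoint actually returns; nothing here is Truto's or Sarvam's own API.

```typescript
// Hypothetical shape of a status response; only `status` is assumed.
type JobStatus = { status: string };

interface PollOptions {
  intervalMs: number; // base wait between status checks
  maxIntervalMs: number; // cap for the exponential backoff
  maxAttempts: number; // hard stop so the loop cannot run forever
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Delay after the given attempt (0-based): doubles each time, capped.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const pollSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `check` until `isDone` accepts the result, sleeping between calls.
async function pollJob<T extends JobStatus>(
  check: () => Promise<T>,
  isDone: (job: T) => boolean,
  opts: PollOptions,
): Promise<T> {
  const sleep = opts.sleep ?? pollSleep;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const job = await check();
    if (isDone(job)) return job;
    await sleep(backoffDelayMs(attempt, opts.intervalMs, opts.maxIntervalMs));
  }
  throw new Error(`Job did not finish after ${opts.maxAttempts} status checks`);
}
```

Exposing this wrapper to the agent as a single "wait for job" tool, instead of the raw status check, keeps the timing decisions in code rather than in the model.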

Multipart Form Data vs. LLM JSON Payloads

LLMs generate JSON. They do not generate binary files or multipart/form-data payloads. If you want an agent to pass an audio file to Sarvam's Saaras v3 model for transcription (create_a_sarvam_speech_to_text), the LLM will simply output a JSON string containing a file path or URL. Sarvam's endpoint expects an actual binary file upload.

A custom integration layer must intercept the LLM's JSON payload, download the referenced file into memory or a temporary buffer, construct a valid multipart boundary request, and proxy it to Sarvam. Truto abstracts this away entirely via its Resource methods, translating standard JSON references into the exact wire format Sarvam demands.
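To make the translation step concrete, here is a rough sketch of what a hand-rolled layer would have to do. The TranscribeArgs shape, the form field names, and the toMultipart helper are illustrative assumptions, not the confirmed schema of Sarvam or Truto; the loader is injected so the caller decides how a path or URL becomes bytes.

```typescript
// What the LLM emits: plain JSON with a reference to the audio.
// Field names are illustrative assumptions, not a confirmed schema.
interface TranscribeArgs {
  file: string; // path or URL the model wrote into its tool call
  model: string; // model identifier, passed through untouched
  mode?: string; // e.g. "verbatim"
}

// Resolves the reference to bytes, then builds the multipart body.
function toMultipart(
  args: TranscribeArgs,
  loadAudio: (ref: string) => Blob,
): FormData {
  const form = new FormData();
  const filename = args.file.split("/").pop() || "audio";
  form.append("file", loadAudio(args.file), filename);
  form.append("model", args.model);
  if (args.mode !== undefined) form.append("mode", args.mode);
  return form;
}
```

Passing the resulting FormData as the body of a fetch call lets the runtime generate the multipart boundary, so the wrapper never has to build it by hand.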

Hard HTTP 429 Errors and Rate Limit Headers

Handling rate limits is critical when scaling multi-step translation agents. Truto does not retry, throttle, or apply backoff on rate limit errors. This is an architectural decision. When Sarvam returns an HTTP 429 Too Many Requests error, Truto passes that error directly back to the caller.

However, Truto standardizes the chaos. It normalizes Sarvam's upstream rate limit information into the header names from the IETF RateLimit header fields draft: ratelimit-limit, ratelimit-remaining, and ratelimit-reset. Your agent execution loop is responsible for reading these headers and executing the exact retry/backoff logic required. This prevents silent queue lockups and gives your orchestrator total control over execution timing.
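Reading those three headers can be sketched as below. The helper names are hypothetical, and the sketch assumes ratelimit-reset carries a number of seconds until the window resets, as in the IETF draft; confirm the unit against a real response before relying on it.

```typescript
interface RateLimitInfo {
  limit: number | null;
  remaining: number | null;
  resetSeconds: number | null;
}

// Parses a header value into a finite number, or null when absent or malformed.
function headerToNumber(value: string | null): number | null {
  if (value === null || value.trim() === "") return null;
  const n = Number(value);
  return Number.isFinite(n) ? n : null;
}

function readRateLimit(headers: Headers): RateLimitInfo {
  return {
    limit: headerToNumber(headers.get("ratelimit-limit")),
    remaining: headerToNumber(headers.get("ratelimit-remaining")),
    resetSeconds: headerToNumber(headers.get("ratelimit-reset")),
  };
}

// How long to pause before the next call.
// Assumption: reset is expressed in seconds from now, not as a timestamp.
function rateLimitWaitMs(info: RateLimitInfo, fallbackMs: number): number {
  if (info.resetSeconds === null || info.resetSeconds < 0) return fallbackMs;
  return Math.ceil(info.resetSeconds * 1000);
}
```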

Generating Sarvam Tools with Truto

Instead of writing raw API requests for Sarvam's endpoints, you can use Truto to generate AI-ready tools dynamically. Truto maps Sarvam's API endpoints to Proxy APIs, handling authentication and query parameter processing.

Truto attaches a description and schema to every Method defined on the Sarvam integration. Calling the GET /integrated-account/<id>/tools endpoint on the Truto API returns all of these Proxy APIs with their descriptions and schemas as tool definitions, which your agent framework simply ingests.
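If you are not using an SDK, the request itself is small. The sketch below only assembles the URL and headers for the endpoint named above. The buildToolsRequest helper is hypothetical, the base URL is left as a parameter, and Bearer-token authentication is an assumption to check against your Truto account settings.

```typescript
interface ToolsRequest {
  url: string;
  init: { method: "GET"; headers: Record<string, string> };
}

// Builds the GET /integrated-account/<id>/tools request.
// Assumption: the API token is sent as a Bearer token.
function buildToolsRequest(
  baseUrl: string,
  integratedAccountId: string,
  apiToken: string,
): ToolsRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${root}/integrated-account/${encodeURIComponent(integratedAccountId)}/tools`,
    init: {
      method: "GET",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        Accept: "application/json",
      },
    },
  };
}
```

The result can be passed straight to fetch(url, init); the JSON body is then the list of tool definitions to hand to your framework.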

Hero Tools for Sarvam AI Agents

To build highly capable Indic language agents, you need to expose the right levers to the LLM. Here are the highest-leverage tools available for Sarvam via Truto. Do not overwhelm your agent with 50 tools; select the specific capabilities your workflow requires.

create_a_sarvam_speech_to_text

Transcribe audio using Sarvam's Saaras v3 speech recognition model. This tool submits an audio file via multipart form-data and returns the transcribed output. It supports multiple output modes: transcribe (original language), translate (to English), verbatim, translit (romanization), and codemix.

Contextual usage notes: Best used for synchronous, shorter audio clips where you need immediate reasoning on the spoken text. Ensure the agent provides the required file and model parameters.

Example User Prompt: "Take the customer support audio file at /tmp/call_102.wav, transcribe it using the Saaras model in verbatim mode, and extract the primary user complaint."

create_a_sarvam_text_translation

Translate text from one Indic language to another using Sarvam AI's translation service.

Contextual usage notes: This requires absolute precision on the source_language_code and target_language_code. The agent must be instructed on the exact format of these codes (e.g., hi-IN for Hindi) to prevent validation errors.

Example User Prompt: "Translate the provided legal disclaimer text from English to Hindi (hi-IN) and Marathi (mr-IN), ensuring the formatting remains intact."
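One way to catch malformed codes before they reach the API is a small pre-flight check, sketched below. The checkLanguageCodes helper is hypothetical, and the list of supported codes is passed in by the caller; take that list from Sarvam's documentation, since this guide only mentions a few codes.

```typescript
// Returns a list of problems; an empty list means the pair looks usable.
function checkLanguageCodes(
  source: string,
  target: string,
  supported: ReadonlySet<string>,
): string[] {
  const problems: string[] = [];
  const shape = /^[a-z]{2,3}-[A-Z]{2}$/; // e.g. hi-IN

  const check = (label: string, code: string): void => {
    if (!shape.test(code)) {
      problems.push(`${label} "${code}" is not in the xx-YY form`);
    } else if (!supported.has(code)) {
      problems.push(`${label} "${code}" is not in the supported list`);
    }
  };

  check("source_language_code", source);
  check("target_language_code", target);
  if (problems.length === 0 && source === target) {
    problems.push("source and target codes are identical");
  }
  return problems;
}
```

Returning the problems as text lets the agent read them as a tool error and correct its own arguments on the next attempt.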

create_a_sarvam_text_language_identification

Identify the language of a given text input using Sarvam AI's language identification (LID) endpoint.

Contextual usage notes: Use this as an initial routing step in automated workflows. Before sending text to a targeted processor, the agent uses this tool to return a verified language_code.

Example User Prompt: "Analyze the following user review snippet, identify the exact Indic language code, and return it so we can route the ticket to the correct regional support team."

create_a_sarvam_document_intelligence_job

Initialize a Sarvam Document Intelligence job to begin async document digitization processing.

Contextual usage notes: This tool initiates a complex pipeline. It returns a request_id and status. The agent must be programmed to store this ID and subsequently poll for completion.

Example User Prompt: "Start a document intelligence job for the scanned tax form PDF. Let me know the request ID so I can track the extraction progress."

create_a_sarvam_speech_to_text_translate_job

Initiate a Sarvam speech-to-text translation batch job for asynchronous audio processing.

Contextual usage notes: Critical for processing hours of podcast audio or massive contact center call logs. Remember that polling requires a minimum 5ms delay, and your agent loop must catch HTTP 429 errors and back off using the standardized rate limit headers Truto returns.

Example User Prompt: "Submit the batch of 50 recorded interviews for speech-to-text translation processing. Store the resulting job ID in the database."

get_single_sarvam_speech_to_text_job_by_id

Get the current status of a Sarvam speech-to-text transcription job by id.

Contextual usage notes: The necessary polling tool. It returns job_id, status, and eventually the transcript when the job completes. Your agent framework should handle the retry loop if the status is still 'processing'.

Example User Prompt: "Check the status of transcription job ID 8912-ABCD. If it is finished, summarize the final transcript into three bullet points."

These tools represent just a fraction of the Sarvam capabilities mapped by Truto. You can find the complete tool inventory and granular schema definitions on the Sarvam integration page.
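Because the full inventory is larger than the handful listed here, an allowlist filter is a simple way to keep the agent's tool set small. This is a minimal sketch; selectTools is a hypothetical helper that works on any objects with a name field, whatever shape your framework's tool objects take.

```typescript
// Keeps only the tools named in `allow`, and reports any names that
// were requested but not present in the fetched inventory.
function selectTools<T extends { name: string }>(
  all: readonly T[],
  allow: readonly string[],
): { selected: T[]; missing: string[] } {
  const byName = new Map<string, T>();
  for (const tool of all) byName.set(tool.name, tool);

  const selected: T[] = [];
  const missing: string[] = [];
  for (const name of allow) {
    const tool = byName.get(name);
    if (tool !== undefined) selected.push(tool);
    else missing.push(name);
  }
  return { selected, missing };
}
```

Surfacing the missing names at startup catches typos in tool names before the agent runs without a capability it needs.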

Workflows in Action

Providing an LLM with Sarvam tools unlocks powerful, autonomous localization and media processing pipelines. Here are concrete examples of how an AI agent executes multi-step logic.

Scenario 1: Autonomous Multilingual Customer Support Routing

A regional e-commerce company receives mixed-language support emails and audio memos. They need an agent to identify the language, translate it to English for triage, and prepare a native-language response.

User Prompt: "A new customer audio memo just arrived. Determine what language they are speaking, translate their issue to English for the database, and draft a polite resolution in their original language."

create_a_sarvam_speech_to_text: The agent submits the audio file using the translate output mode to immediately get the English transcript of the issue.
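create_a_sarvam_text_language_identification: The agent feeds in a snippet of native-language transcription to identify the exact language_code (e.g., te-IN for Telugu). Because the translate mode in the previous step returns English, this snippet comes from a second create_a_sarvam_speech_to_text call in transcribe mode.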
Agent Reasoning: The LLM reads the English transcript, identifies the resolution (e.g., refund processed), and drafts the response in English.
create_a_sarvam_text_translation: The agent translates the English response back into Telugu using the identified target_language_code.

Outcome: The human support team sees an English summary, while the customer receives a highly accurate Telugu response, all executed autonomously in seconds.
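The same sequence can also be written as deterministic code, which is useful for testing the flow before handing control to an LLM. In this sketch the SupportTools interface and handleMemo function are hypothetical wrappers around the tools above, the English language code is a caller-supplied parameter, and the second speech-to-text call in transcribe mode is an assumption made so that language identification receives native-language text.

```typescript
// Hypothetical thin wrappers around the Sarvam tools; signatures are assumed.
interface SupportTools {
  speechToText(file: string, mode: "translate" | "transcribe"): Promise<string>;
  identifyLanguage(text: string): Promise<string>;
  translate(text: string, source: string, target: string): Promise<string>;
}

interface MemoResult {
  englishIssue: string;
  languageCode: string;
  nativeReply: string;
}

async function handleMemo(
  file: string,
  tools: SupportTools,
  draftReply: (englishIssue: string) => Promise<string>, // the LLM step
  englishCode: string, // source code for the English draft, supplied by caller
): Promise<MemoResult> {
  const englishIssue = await tools.speechToText(file, "translate");
  const nativeText = await tools.speechToText(file, "transcribe");
  // A short snippet is enough for language identification.
  const languageCode = await tools.identifyLanguage(nativeText.slice(0, 200));
  const englishReply = await draftReply(englishIssue);
  const nativeReply = await tools.translate(englishReply, englishCode, languageCode);
  return { englishIssue, languageCode, nativeReply };
}
```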

Scenario 2: Batch Legal Document Digitization and Translation

A law firm is processing thousands of scanned regional property deeds. They need an agent to digitize the physical scans, extract the text, and translate it.

User Prompt: "Take the scanned property deed folder, run a document intelligence job to extract the text, check the status until complete, and then translate the extracted text into English."

create_a_sarvam_document_intelligence_job: The agent submits the initialization request for the scanned PDFs, receiving a request_id.
get_single_sarvam_document_intelligence_job_by_id: The agent enters a controlled loop, checking the status. It reads the Truto ratelimit-reset headers to back off dynamically if it queries too fast.
Agent Reasoning: Once the job returns a completed status and the extracted raw text, the agent parses the data payload.
create_a_sarvam_text_translation: The agent takes the extracted text chunks and translates them to English for the final database entry.

Outcome: A completely hands-free digitization pipeline that handles asynchronous waiting, rate limit safety, and final translation without human intervention.
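The last step of this scenario translates extracted text in chunks. A minimal chunker might look like the sketch below; chunkText is a hypothetical helper, and the size limit is a parameter because this guide does not state what limit the translation endpoint enforces.

```typescript
// Packs paragraphs into chunks of at most `maxChars` characters.
// Lengths are counted in UTF-16 code units, so a hard split can land
// inside a combining sequence; prefer paragraph-sized limits.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: string[] = [];
  let current = "";

  for (const para of text.split(/\n{2,}/)) {
    const p = para.trim();
    if (p === "") continue;

    // A paragraph that cannot fit on its own is split at fixed offsets.
    if (p.length > maxChars) {
      if (current !== "") {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < p.length; i += maxChars) {
        chunks.push(p.slice(i, i + maxChars));
      }
      continue;
    }

    const joined = current === "" ? p : `${current}\n\n${p}`;
    if (joined.length <= maxChars) {
      current = joined;
    } else {
      chunks.push(current);
      current = p;
    }
  }

  if (current !== "") chunks.push(current);
  return chunks;
}
```

Splitting on paragraph boundaries first keeps sentences intact, which generally gives the translation step more context than fixed-width slices.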

Building Multi-Step Workflows

To build these multi-step workflows, you need to bind Truto's dynamically generated tools to your agent framework. This works regardless of whether you use LangChain, LangGraph, CrewAI, or the Vercel AI SDK.

Here is an architectural view of how the agent orchestrator, Truto, and Sarvam interact during a polling loop:

sequenceDiagram
    participant App as Your Agent App
    participant Truto as Truto API Layer
    participant Sarvam as Sarvam API

    App->>Truto: GET /integrated-account/<id>/tools
    Truto-->>App: Return JSON schemas for Sarvam tools
    App->>App: Bind tools to LLM
    
    App->>Truto: Call create_a_sarvam_speech_to_text_job
    Truto->>Sarvam: POST /speech-to-text/async
    Sarvam-->>Truto: 200 OK (job_id: 123)
    Truto-->>App: Return job_id
    
    loop Polling with 429 Handling
        App->>Truto: Call get_single_sarvam_speech_to_text_job_by_id(123)
        Truto->>Sarvam: GET /jobs/123
        alt Rate Limit Exceeded
            Sarvam-->>Truto: 429 Too Many Requests
            Truto-->>App: 429 Error with ratelimit-reset header
            App->>App: Sleep until ratelimit-reset
        else Job Processing
            Sarvam-->>Truto: 200 OK (status: processing)
            Truto-->>App: Return status
            App->>App: Sleep (framework controlled)
        else Job Complete
            Sarvam-->>Truto: 200 OK (transcript data)
            Truto-->>App: Return transcript
        end
    end

When writing the integration code, handling HTTP 429 errors from Sarvam is the developer's responsibility. Because Truto normalizes the headers according to the IETF spec, you can write clean, predictable retry logic inside your tool manager or execution loop.

Below is a conceptual TypeScript example using LangChain and the @trutohq/truto-langchainjs-toolset. Notice how we specifically catch rate limits.

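import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { TrutoToolManager } from "@trutohq/truto-langchainjs-toolset";
import { AgentExecutor, createOpenAIToolsAgent } from "langchain/agents";
import { pull } from "langchain/hub";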
 
async function runSarvamAgent() {
  // 1. Initialize the LLM
  const llm = new ChatOpenAI({
    modelName: "gpt-4o",
    temperature: 0,
  });
 
  // 2. Fetch Sarvam tools via Truto
  const truto = new TrutoToolManager({
    apiKey: process.env.TRUTO_API_KEY,
  });
  
  // Fetch the tools defined for this Sarvam integrated account
  const tools = await truto.getTools(process.env.SARVAM_ACCOUNT_ID);
 
  // 3. Bind tools to the agent framework
  const prompt = await pull<ChatPromptTemplate>(
    "hwchase17/openai-tools-agent"
  );
  const agent = await createOpenAIToolsAgent({ llm, tools, prompt });
  const executor = new AgentExecutor({ agent, tools });
 
  // 4. Execute with custom Rate Limit handling wrap
  try {
    const result = await executor.invoke({
      input: "Start a document intelligence job for file.pdf and tell me the ID."
    });
    console.log("Agent response:", result.output);
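  } catch (error: unknown) {
    // Truto passes the 429 directly to you with standardized headers.
    // The error shape (status, headers) is assumed here; adapt it to what your client throws.
    const err = error as { status?: number; headers?: Headers };
    if (err?.status === 429) {
      const resetTime = err.headers?.get("ratelimit-reset");
      console.warn(`Rate limit hit. ratelimit-reset: ${resetTime}. Wait before retrying.`);
      // Implement your custom backoff queue here
    } else {
      throw error;
    }
  }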
}
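The example leaves the backoff itself as an exercise. One possible shape is a retry wrapper around any call that may hit a 429, sketched below. The withRateLimitRetry name and the error shape (a status field and a Headers object) are assumptions for illustration, as is reading ratelimit-reset as seconds from now.

```typescript
// Assumed error shape; adapt to what your HTTP client or SDK throws.
interface HttpLikeError {
  status?: number;
  headers?: Headers;
}

const retrySleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `call` on HTTP 429, waiting for the advertised reset each time.
// Any other error, or a 429 after `maxRetries` retries, is rethrown.
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void> = retrySleep,
  fallbackMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error: unknown) {
      const err = error as HttpLikeError | null;
      if (err?.status !== 429 || attempt >= maxRetries) throw error;
      // Assumption: ratelimit-reset is a number of seconds from now.
      const reset = Number(err.headers?.get("ratelimit-reset"));
      const waitMs = Number.isFinite(reset) && reset > 0 ? reset * 1000 : fallbackMs;
      await sleep(waitMs);
    }
  }
}
```

Wrapping each tool invocation, rather than the whole executor.invoke call, avoids replaying earlier agent steps when only one request was throttled.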

This architecture keeps rate limit failures visible to your code, so your agent does not lock up silently. By leaning on Truto to map the resources, translate the payloads, and standardize the rate limit headers, your engineering team can focus on the LLM's cognitive loop rather than deciphering Sarvam's underlying polling mechanics.

Moving Forward with Agentic Integrations

Connecting Sarvam to an AI agent requires navigating strict asynchronous polling rules, managing complex multipart binary data, and properly backing off when hit with HTTP 429 errors. Hand-coding an integration to manage this for a single LLM framework is difficult; maintaining it as APIs evolve is a far larger, ongoing burden.

By leveraging Truto's Proxy APIs and /tools endpoint, you generate framework-agnostic tools that abstract the binary translation and standardize rate limit headers, ensuring your agents execute complex Indic language workflows reliably at scale.