Rishabh Bhargava on Voice Agent Engineering

Rishabh Bhargava of Together AI discusses engineering voice agents, focusing on latency, quality, and scale challenges across STT, LLM, and TTS components.

May 31 at 5:02 PM · 7 min read

Image: Rishabh Bhargava presenting on engineering voice agents, with a slide showing the pipeline architecture. Credit: AI Engineer

Rishabh Bhargava, Director of ML at Together AI and leader of their Voice AI team, recently presented on the critical engineering challenges involved in building effective voice agents. The talk, titled "Engineering voice agents: Latency, quality, and scale," detailed the multifaceted approach required to create voice interfaces that are not only functional but also natural and reliable for users.

Who is Rishabh Bhargava?

Before joining Together AI, Bhargava was the co-founder and CEO of Refuel.ai, a company that Together AI acquired. He has spent over a decade building AI and ML infrastructure, which grounds his perspective in the practical engineering side of AI development.

Why Voice AI Matters

Bhargava began by highlighting the immense scale and potential of voice AI, noting that billions of phone calls are handled by humans annually. He emphasized that voice is a natural interface for humans, learned before reading and writing, and that this naturalness is key to its appeal. The adoption of voice AI in customer support, healthcare appointment scheduling, and even developer interactions with code signifies a shift towards more intuitive human-computer interaction.

Core Challenges in Voice Agent Engineering

Bhargava outlined four core challenges that engineers must overcome to build high-quality voice agents:

Latency: Users notice delays and will disengage if responses exceed certain thresholds (e.g., over 500ms is noticeable, over 1 second is often a failure point). Silence in conversation is also a critical indicator of failure.
Intelligence: Voice agents must be able to handle complex instructions, understand ambiguity, and effectively utilize tools through function calls to perform real-world tasks.
Naturalness: Robotic voices can erode trust. Users expect speech that closely mimics human quality, including tone, emotion, and accent.
Reliability: Agents need high uptime (e.g., 99.9%) and cost-effectiveness at scale, with a particular focus on avoiding silent failures.

He stressed that these are not independent problems but rather an "AND problem," meaning all challenges must be resolved simultaneously for a voice agent to succeed.
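The "AND problem" can be illustrated with a minimal sketch. The 500ms, 1 second, and 99.9% figures come from the list above; the `AgentReport` fields, the function names, and the choice to treat anything over 500ms as a latency miss are hypothetical assumptions made for illustration, not something presented in the talk.

```python
from dataclasses import dataclass

# Thresholds quoted in the article; how they are applied here is an assumption.
NOTICEABLE_MS = 500
FAILURE_MS = 1000


def classify_latency(response_ms: float) -> str:
    """Bucket a response delay using the 500 ms and 1 s thresholds."""
    if response_ms > FAILURE_MS:
        return "failure"
    if response_ms > NOTICEABLE_MS:
        return "noticeable"
    return "ok"


@dataclass
class AgentReport:
    """Hypothetical per-agent scorecard, one field per challenge."""
    response_ms: float     # latency
    task_completed: bool   # stand-in for intelligence
    sounds_natural: bool   # stand-in for naturalness
    uptime: float          # reliability, as a fraction (0.999 == 99.9%)


def passes_all(report: AgentReport, min_uptime: float = 0.999) -> bool:
    """All four checks must hold at once; one miss fails the agent."""
    return all([
        classify_latency(report.response_ms) == "ok",
        report.task_completed,
        report.sounds_natural,
        report.uptime >= min_uptime,
    ])


if __name__ == "__main__":
    fast_but_robotic = AgentReport(300, True, False, 0.9995)
    print(passes_all(fast_but_robotic))  # False: naturalness alone sinks it
```

The point of `all(...)` is that no amount of headroom on one dimension compensates for a miss on another.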

Pipeline Architecture for Voice Agents

The standard architecture for voice agents, Bhargava explained, involves a sequential pipeline:

Audio Chunks: Raw audio input from the user.
Agent Orchestrator: Manages the flow of information and coordinates different components.
Speech-to-Text (STT) Model: Transcribes spoken audio into text.
Turn Detection: Identifies when a user has finished speaking.
Large Language Model (LLM): Processes the text, understands intent, and determines the next action, potentially involving function calls.
Text-to-Speech (TTS) Model: Synthesizes the agent's response into spoken audio.

This pipeline is often augmented with capabilities like WebSocket streaming for real-time interaction and function calling for task execution.
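The flow through these components can be sketched roughly as follows. This is a simplified, synchronous illustration rather than a real implementation: the `Orchestrator` class and the stub functions are hypothetical, real systems stream audio and tokens concurrently, and here an empty chunk stands in for the silence a turn detector would pick up.

```python
from typing import Callable, Iterable, List


class Orchestrator:
    """Hypothetical orchestrator wiring STT, turn detection, LLM, and TTS."""

    def __init__(
        self,
        stt: Callable[[bytes], str],
        turn_done: Callable[[bytes], bool],
        llm: Callable[[str], str],
        tts: Callable[[str], bytes],
    ) -> None:
        self.stt = stt
        self.turn_done = turn_done
        self.llm = llm
        self.tts = tts

    def run(self, chunks: Iterable[bytes]) -> List[bytes]:
        """Consume audio chunks and return one synthesized reply per user turn."""
        transcript_parts: List[str] = []
        replies: List[bytes] = []
        for chunk in chunks:
            text = self.stt(chunk)          # transcribe incrementally
            if text:
                transcript_parts.append(text)
            if self.turn_done(chunk) and transcript_parts:
                user_text = " ".join(transcript_parts)
                transcript_parts = []
                reply_text = self.llm(user_text)     # decide what to say
                replies.append(self.tts(reply_text)) # synthesize the reply
        return replies


# Stub components (assumptions): "audio" is just UTF-8 text in bytes.
def fake_stt(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_turn_done(chunk: bytes) -> bool:
    return chunk == b""  # empty chunk models end-of-turn silence


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    agent = Orchestrator(fake_stt, fake_turn_done, fake_llm, fake_tts)
    print(agent.run([b"book", b"a table", b"", b"thanks", b""]))
    # [b'You said: book a table', b'You said: thanks']
```

Passing the components in as plain callables keeps the orchestrator independent of any particular STT, LLM, or TTS provider, so each stage can be swapped or benchmarked on its own.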

Optimizing the System: Key Components and Strategies

Bhargava then delved into how each component can be optimized:

Speech-to-Text (STT): Key metrics include word error rate (SOTA models are below 6% on open benchmarks) and time-to-completed-transcript (e.g., p90 under 100ms for streaming ASR). Turn detection and barge-in handling are crucial capabilities, as are multilingual support and streaming-native architectures.
LLM: The key performance metric is streaming latency, specifically a low time-to-first-token (TTFT) of 200-300ms. Model size trades latency against intelligence, and 8B-30B parameter models often hit the sweet spot. Required capabilities include instruction following and managing multi-turn conversations.
Text-to-Speech (TTS): Performance metrics focus on time-to-first-audio (TTFA), such as <120ms p90 for models like Cartesia Sonic 3, and a real-time factor (RTF) less than 1. Quality is paramount, with human evaluations being the best measure. Capabilities include naturalness across voices and pronunciation/emotional control.
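For readers unfamiliar with the metrics named above, here is a small sketch of how they are commonly computed. It assumes word error rate is word-level edit distance divided by reference length, p90 is taken by the nearest-rank method, and real-time factor is synthesis time divided by audio duration. The function names and sample values are illustrative, not from the talk.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=90 for p90."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    print(word_error_rate("book a table for two", "book table for too"))  # 0.4
    ttfa_ms = [80, 85, 90, 95, 100, 105, 110, 115, 120, 300]
    print(percentile_nearest_rank(ttfa_ms, 90))  # 120
    print(real_time_factor(0.5, 2.0))  # 0.25
```

Note how the single 300ms outlier does not move the p90 in this sample, which is why tail percentiles such as p99 are often tracked alongside it.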

Beyond individual component optimization, Bhargava highlighted system-level strategies:

Latency and Cost Budget: As a rough allocation across both latency and cost, the LLM often consumes about 60% of the budget, with STT and TTS taking the rest.
Colocation: Placing models in close proximity to each other, ideally within the same data center or even the same building, drastically reduces network latency.
Autoscaling: Seamlessly scaling resources up or down based on demand is critical for both performance and cost efficiency. Scaling down stateful connections, however, presents unique challenges.
Global Deployments: Models need to be deployed closer to end-customers globally to minimize latency and comply with data residency regulations like GDPR.
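A back-of-the-envelope calculation shows why the budget and colocation matter. The sketch below adds up the per-stage figures quoted earlier (100ms STT, 200-300ms LLM TTFT, 120ms TTS) and then adds network hops. The hop count and per-hop delays are made-up assumptions, and summing per-stage p90 values only approximates the end-to-end p90.

```python
from typing import Dict


def first_audio_latency_ms(
    stage_ms: Dict[str, float],
    network_hops: int = 0,
    per_hop_ms: float = 0.0,
) -> float:
    """Rough time to first audio: sequential stage latencies plus network hops."""
    return sum(stage_ms.values()) + network_hops * per_hop_ms


def budget_shares(stage_ms: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the latency budget consumed by each stage."""
    total = sum(stage_ms.values())
    return {name: ms / total for name, ms in stage_ms.items()}


if __name__ == "__main__":
    # Per-stage figures from the article, using the upper end of the TTFT range.
    stages = {"stt": 100, "llm": 300, "tts": 120}

    print(first_audio_latency_ms(stages))  # 520.0

    # Hypothetical: two inter-model hops at 1 ms (colocated) vs 30 ms (remote).
    print(first_audio_latency_ms(stages, network_hops=2, per_hop_ms=1.0))   # 522.0
    print(first_audio_latency_ms(stages, network_hops=2, per_hop_ms=30.0))  # 580.0

    print(round(budget_shares(stages)["llm"], 2))  # 0.58
```

Under these assumptions the model stages alone already sit near the 500ms mark where delays become noticeable, so any network latency added between components comes straight out of the remaining margin.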

Bhargava concluded by noting that while the current pipeline approach is effective, the field is moving towards more integrated speech-to-speech models that can potentially handle more nuanced conversational elements and reduce the complexity of the overall system.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
