Together AI's Speech-to-Text Speed Secret

Together AI reveals the engineering behind its record-breaking speech-to-text performance: optimizing the entire data pipeline, not just the model.

May 29 at 10:01 PM · 7 min read

[Image: Abstract visualization of fast data processing and AI model connections. Caption: Optimizing the entire data pipeline is key to achieving breakthrough AI speeds. Credit: Together AI]

Visual TL;DR: Audio Data Scale → CPU Bottlenecks → TensorRT Encoder Opt. → Decoder Decoupled → Eliminate Data Copying → Evented I/O & GC → Record ASR Speed

Audio Data Scale: audio data is 3 orders of magnitude larger than text prompts
CPU Bottlenecks: CPU handles decoding, resampling, noise filtering, and feature extraction
TensorRT Encoder Opt.: optimizing the encoder with TensorRT for faster processing
Decoder Decoupled: decoupling the decoder from the CPU for efficiency
Eliminate Data Copying: removing unnecessary data copying and CPU hops
Evented I/O & GC: controlling garbage collection and using evented I/O for streaming
Record ASR Speed: achieving record-breaking speech-to-text performance

The race for the fastest AI models often focuses on raw computational power, but Together AI has demonstrated that optimizing the entire system, from data ingestion to final output, is crucial for groundbreaking performance. Its latest advances in speech-to-text, or automatic speech recognition (ASR), detailed in a recent blog post, reveal a meticulous approach to removing the bottlenecks that plague traditional systems.

Unlike large language models (LLMs) where the bulk of computation happens within the GPU, speech-to-text processing involves significant overhead on the CPU for tasks like decoding audio, resampling, noise filtering, and feature extraction. Together AI identified this full-path systems problem as the primary challenge.

Engineering for Audio's Scale

The sheer volume of data differs vastly between text and audio. A 1M-token text prompt is compact, but its audiobook equivalent can be 5 to 10 GB, a three-order-of-magnitude difference. This necessitates efficient preprocessing before data even reaches the GPU.
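The size gap can be checked with a back-of-the-envelope calculation. The sketch below is not from Together AI's post: the bytes-per-token, words-per-token, narration speed, and audio format are all rule-of-thumb assumptions chosen for illustration.

```python
import math

TOKENS = 1_000_000
BYTES_PER_TOKEN = 4        # assumption: roughly 4 bytes of English text per token
WORDS_PER_TOKEN = 0.75     # assumption: common rule of thumb
WORDS_PER_MINUTE = 150     # assumption: typical audiobook narration speed
SAMPLE_RATE_HZ = 16_000    # assumption: 16 kHz mono PCM
BYTES_PER_SAMPLE = 2       # assumption: 16-bit samples


def text_bytes(tokens=TOKENS):
    """Approximate size of the prompt as plain text."""
    return tokens * BYTES_PER_TOKEN


def audio_bytes(tokens=TOKENS):
    """Approximate size of the same words read aloud as uncompressed PCM."""
    seconds = tokens * WORDS_PER_TOKEN / WORDS_PER_MINUTE * 60
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


def orders_of_magnitude(tokens=TOKENS):
    """log10 of the audio-to-text size ratio."""
    return math.log10(audio_bytes(tokens) / text_bytes(tokens))


if __name__ == "__main__":
    print(f"text:  {text_bytes() / 1e6:.1f} MB")
    print(f"audio: {audio_bytes() / 1e9:.1f} GB")
    print(f"gap:   {orders_of_magnitude():.2f} orders of magnitude")
```

Under these assumptions the text is about 4 MB and the audio about 9.6 GB, which lands inside the 5 to 10 GB range quoted above and a little over three orders of magnitude apart.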

Together AI's stack serves two distinct ASR regimes: offline transcription, where throughput is paramount, and streaming transcription, where low latency and minimal jitter are critical. The same system serves both NVIDIA's Parakeet-TDT 0.6B v3 and OpenAI's Whisper Large v3.

Optimizing the Encoder with TensorRT

The encoder, responsible for processing variable-length speech segments, contains about 95% of the Parakeet model's weights. To handle the wide range of audio input lengths efficiently, Together AI employed NVIDIA's TensorRT. By using multi-profile engines, they ensured optimized kernel execution plans tailored to expected input shape distributions, avoiding costly padding for shorter segments.
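In TensorRT, each optimization profile declares a minimum, optimal, and maximum input shape, and the runtime selects a profile per request. The selection step can be sketched in plain Python as follows. The profile boundaries and the names `Profile`, `PROFILES`, and `pick_profile` are hypothetical; Together AI's actual ranges are not given in the source, and this sketch does not call TensorRT itself.

```python
from bisect import bisect_left
from typing import NamedTuple


class Profile(NamedTuple):
    min_frames: int
    opt_frames: int   # the shape the kernels would be tuned for
    max_frames: int


# Hypothetical profiles covering short, medium, and long segments.
PROFILES = [
    Profile(1, 100, 200),
    Profile(201, 500, 800),
    Profile(801, 1500, 3000),
]
_UPPER_BOUNDS = [p.max_frames for p in PROFILES]


def pick_profile(num_frames: int) -> int:
    """Return the index of the profile whose [min, max] range contains the input."""
    if num_frames < PROFILES[0].min_frames:
        raise ValueError("segment is empty")
    idx = bisect_left(_UPPER_BOUNDS, num_frames)
    if idx == len(PROFILES):
        raise ValueError("segment longer than the largest profile")
    return idx


if __name__ == "__main__":
    for frames in (40, 200, 201, 2500):
        print(frames, "->", PROFILES[pick_profile(frames)])
```

A short streaming utterance of 40 frames lands in the first profile and runs with kernels tuned near 100 frames, rather than being padded up to a length chosen for the longest audio.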

This profile-aware optimization within TensorRT provided a significant boost over previous PyTorch-based solutions, especially for short utterances critical in streaming scenarios.

Decoupling the Decoder from the CPU

The decoder's iterative process, predicting tokens from acoustic frames, traditionally involved CPU intervention for conditional logic. This frequent host synchronization prevented the entire loop from being captured as a single CUDA graph, so GPU steps that take only microseconds were stalled by thousands of CPU round trips per request.

By implementing conditional CUDA graph nodes, Together AI moved the decision-making logic onto the GPU. This allows the entire decoder loop to run as a single CUDA graph, cutting decoder latency by a factor of two to three.
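To see where the round trips come from, consider a toy transducer-style greedy loop. This is a plain-Python illustration, not Together AI's decoder: the scripted tokens stand in for the model, and `BLANK`, `greedy_decode`, and `max_symbols` are illustrative names. Each iteration branches on the token just predicted, and that data-dependent branch is the decision a naive implementation reads back on the host.

```python
BLANK = 0  # illustrative blank-token id


def greedy_decode(script, max_symbols=4):
    """Decode a scripted utterance and count data-dependent decisions.

    script[t] lists the tokens a stand-in model would emit at frame t.
    Returns (tokens, decisions).
    """
    tokens, decisions = [], 0
    t = emitted = 0
    while t < len(script):
        # Stand-in for the model: next token for this frame, or BLANK.
        token = script[t][emitted] if emitted < len(script[t]) else BLANK
        # The branch below depends on the model output. Taken on the CPU,
        # it costs one host synchronization per iteration.
        decisions += 1
        if token == BLANK or emitted == max_symbols:
            t += 1          # move on to the next acoustic frame
            emitted = 0
        else:
            tokens.append(token)
            emitted += 1    # stay on this frame and predict again
    return tokens, decisions


if __name__ == "__main__":
    print(greedy_decode([[5], [], [7, 8]]))
```

In this toy the number of decisions equals frames plus emitted tokens, so it grows with audio length. That is the pattern that produces thousands of round trips per request, and it is the branch that conditional graph nodes let the GPU evaluate on its own.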

Eliminating Data Copying and CPU Hops

Further latency gains came from streamlining the CPU path. Instead of relying on a microservice architecture that often involves multiple processes and redundant data copies, Together AI collapsed preprocessing steps into fewer processes.

For inter-process communication, they opted for custom protocols over persistent Unix domain sockets, and crucially, utilized shared memory for large data volumes. This shared memory approach enables zero-copy data transfer between processes, removing hundreds of milliseconds of latency previously spent on copying and serialization.
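The shared-memory handoff can be sketched with Python's standard `multiprocessing.shared_memory` module. This is a minimal illustration under stated assumptions, not Together AI's protocol: both sides run in one process here for brevity, the function names are hypothetical, and in a real deployment only the block name and byte length would travel over the socket.

```python
import array
from multiprocessing import shared_memory


def publish_audio(samples):
    """Producer: write 16-bit PCM samples into a new shared-memory block."""
    pcm = array.array("h", samples)
    nbytes = len(pcm) * pcm.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = pcm.tobytes()  # the single write into shared memory
    return shm, nbytes


def peak_amplitude(name, nbytes):
    """Consumer: attach by name and read samples in place, without copying the buffer."""
    shm = shared_memory.SharedMemory(name=name)
    raw = shm.buf[:nbytes]      # the block may be rounded up to a page size
    view = raw.cast("h")        # reinterpret the bytes as int16, still no copy
    try:
        return max(abs(s) for s in view)
    finally:
        # Views must be released before the mapping can be closed.
        view.release()
        raw.release()
        shm.close()


def roundtrip_peak(samples):
    """Publish, consume, and clean up; only (name, nbytes) crosses the boundary."""
    shm, nbytes = publish_audio(samples)
    try:
        return peak_amplitude(shm.name, nbytes)
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(roundtrip_peak([0, -1200, 900, 32000]))
```

The message that identifies the buffer is a few dozen bytes regardless of how large the audio is, which is where the savings over serializing the payload come from.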

Evented I/O and GC Control for Streaming

Streaming ASR presented unique challenges related to connection management. A move from one thread per connection to a single thread managing thousands of connections via `epoll` drastically reduced scheduler pressure and improved predictability.
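The single-threaded pattern can be sketched with Python's standard `selectors` module, whose `DefaultSelector` uses `epoll` on Linux. The example below is an illustration only: it uses local socket pairs in place of real client connections, and `serve_once` is a hypothetical name.

```python
import selectors
import socket


def serve_once(num_conns=3):
    """Read one message from each connection using a single thread."""
    sel = selectors.DefaultSelector()   # epoll-backed on Linux
    clients, servers, received = [], [], {}
    for i in range(num_conns):
        client, server = socket.socketpair()
        server.setblocking(False)
        # Register interest in readability; data carries the connection id.
        sel.register(server, selectors.EVENT_READ, data=i)
        clients.append(client)
        servers.append(server)
        client.sendall(f"chunk-{i}".encode())

    try:
        while len(received) < num_conns:
            events = sel.select(timeout=1.0)
            if not events:
                break                   # nothing ready; avoid spinning forever
            for key, _mask in events:
                received[key.data] = key.fileobj.recv(4096).decode()
                sel.unregister(key.fileobj)
    finally:
        sel.close()
        for sock in clients + servers:
            sock.close()
    return received


if __name__ == "__main__":
    print(serve_once())
```

The thread sleeps inside `select` until the kernel reports ready sockets, so idle connections cost no scheduling work, unlike a design with one blocked thread per connection.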

A subtle but critical optimization involved Python's garbage collector (GC). Spikes in p95 latency, observed despite healthy p50/p90 metrics, were traced to full GC passes on long-lived preallocated objects. The simple addition of `gc.freeze()` after startup preallocation prevented these objects from being scanned, eliminating the 200ms stalls and smoothing out traffic patterns.
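A minimal sketch of that startup sequence is shown below. The pool contents and the function name are hypothetical, and running `gc.collect()` before freezing is a common choice rather than something the source specifies; `gc.freeze()` and `gc.get_freeze_count()` are standard library calls available since Python 3.7.

```python
import gc


def preallocate_and_freeze(num_buffers=1000, buffer_size=4096):
    """Build long-lived buffers at startup, then exclude them from future GC scans."""
    # Hypothetical preallocated pool: one reusable buffer per expected stream.
    pool = [{"id": i, "buf": bytearray(buffer_size)} for i in range(num_buffers)]

    gc.collect()   # clear startup garbage first so it is not frozen too
    gc.freeze()    # move everything currently tracked to the permanent generation
    return pool


if __name__ == "__main__":
    pool = preallocate_and_freeze()
    print("frozen objects:", gc.get_freeze_count())
```

Objects allocated after the call are still collected normally. Only what existed at freeze time is skipped, which is why the call belongs immediately after preallocation.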

The entire ASR pipeline, from preprocessing to final token emission, demands end-to-end optimization, echoing the need for comprehensive frameworks like the one covered in "NVIDIA Details SMART Framework for AI Inference at Scale."

Together AI's advancements underscore that achieving the world's fastest speech-to-text requires a holistic systems engineering approach, addressing every potential bottleneck from the silicon up to the runtime environment.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
