The race for the fastest AI models often focuses on raw computational power, but Together AI has demonstrated that optimizing the entire system, from data ingestion to final output, is just as important for performance. Its latest work on speech-to-text, also known as automatic speech recognition (ASR), is detailed in a recent blog post and describes a methodical approach to removing the bottlenecks that slow traditional systems.
Unlike large language models (LLMs), where the bulk of computation happens on the GPU, speech-to-text processing carries significant CPU overhead for tasks such as audio decoding, resampling, noise filtering, and feature extraction. Together AI identified this "full-path" systems problem, spanning everything from raw audio input to final transcript, as the primary challenge.
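To make the CPU-side work concrete, here is a minimal sketch of two of the steps named above, resampling and feature extraction, in pure Python. It is an illustration only and not Together AI's pipeline: the function names `resample_linear` and `frame_log_energy`, the 25 ms window, and the 10 ms hop are assumptions chosen for the example. Production systems typically use optimized native libraries, a proper anti-aliasing filter, and richer features such as log-mel spectrograms.

```python
import math


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative; no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def frame_log_energy(samples, rate, win_ms=25, hop_ms=10):
    """Split audio into overlapping frames and return the log energy of each frame."""
    win = int(rate * win_ms / 1000)            # samples per window (assumed 25 ms)
    hop = int(rate * hop_ms / 1000)            # samples between frame starts (assumed 10 ms)
    feats = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        energy = sum(x * x for x in frame) / win
        feats.append(math.log(energy + 1e-10)) # small floor avoids log(0) on silence
    return feats


if __name__ == "__main__":
    # One second of a 440 Hz tone at 48 kHz, brought down to 16 kHz before featurizing.
    tone = [math.sin(2 * math.pi * 440 * n / 48_000) for n in range(48_000)]
    audio_16k = resample_linear(tone, 48_000, 16_000)
    features = frame_log_energy(audio_16k, 16_000)
    print(len(audio_16k), len(features))
```

Both loops run entirely on the CPU and touch every sample, which is why this stage can become the bottleneck for long recordings even when the model itself is fast.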
Engineering for Audio's Scale
The volume of data differs vastly between text and audio. A 1M-token text prompt is compact, typically a few megabytes, but its audiobook equivalent can be 5 to 10 GB, a difference of three orders of magnitude. Preprocessing therefore has to be efficient before the data ever reaches the GPU.
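A back-of-envelope calculation shows where a gap of that size can come from. Every constant in this sketch is an assumption rather than a figure from the blog post: roughly 4 bytes and 0.75 words per token, a narration pace of 150 words per minute, and uncompressed 16 kHz, 16-bit mono audio.

```python
# Rough size comparison between a text prompt and its spoken equivalent.
# All constants are assumptions for illustration, not measured values.
BYTES_PER_TOKEN = 4        # rough average for English text
WORDS_PER_TOKEN = 0.75     # common rule of thumb
WORDS_PER_MINUTE = 150     # assumed audiobook narration pace
SAMPLE_RATE_HZ = 16_000    # assumed ASR input sample rate
BYTES_PER_SAMPLE = 2       # 16-bit mono PCM


def text_bytes(tokens):
    """Approximate size of the prompt as plain text."""
    return tokens * BYTES_PER_TOKEN


def audio_bytes(tokens):
    """Approximate size of the same content read aloud as uncompressed PCM."""
    minutes = tokens * WORDS_PER_TOKEN / WORDS_PER_MINUTE
    return minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


if __name__ == "__main__":
    tokens = 1_000_000
    t, a = text_bytes(tokens), audio_bytes(tokens)
    print(f"text:  {t / 1e6:.1f} MB")
    print(f"audio: {a / 1e9:.1f} GB")
    print(f"ratio: {a / t:.0f}x")
```

Under these assumptions, 1M tokens is about 4 MB of text against about 9.6 GB of audio, a ratio of roughly 2,400 to 1. Different sample rates, bit depths, or compression would shift the figure, but the order of magnitude is consistent with the 5 to 10 GB range quoted above.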
