The race to make AI agents sound less like robots and more like humans just got a new front-runner. AI startup Cartesia has unveiled Sonic-3, a text-to-speech (TTS) model it claims is the fastest and most emotionally expressive on the market, capable of generating laughter and a full range of emotions in real-time conversations.
For anyone who has suffered through a laggy, monotone call with an automated agent, Cartesia’s claims are significant. The company reports an end-to-end latency of just 190 milliseconds, well below the typical threshold for human conversational response. This speed, combined with the ability to generate non-speech sounds like laughter, aims to eliminate the uncanny, stilted nature of most current voice AI. In demos, the voice can sound palpably excited or even "devastatingly sad," a far cry from the neutral tone of typical assistants.
SSMs: The engine behind the emotion
The key differentiator, according to Cartesia, is its underlying architecture. While most of the industry relies on Transformers, Sonic-3 is built on State Space Models (SSMs). In a post on X, the company explained the difference with a simple analogy: Transformers are like re-watching an entire conversation from the start before speaking each new word, which is computationally intensive. SSMs, by contrast, act more like humans, remembering the "topic and vibe" of a conversation to maintain context without constant reprocessing.
This architecture, which Cartesia's co-founders pioneered at the Stanford AI Lab, is what enables the model's low latency. The efficiency creates a performance budget that allows for more complex, emotional rendering without sacrificing speed.
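To make the analogy concrete, here is a minimal toy sketch in Python. It is not Cartesia's implementation, and nothing in it comes from Sonic-3 itself. It assumes a scalar linear recurrence as a stand-in for an SSM (the coefficients `a` and `b` are hypothetical) and a single-head, softmax-weighted average as a stand-in for Transformer attention. The only point is to count how many past items each approach has to read as a conversation grows.

```python
import math


def ssm_generate(inputs, a=0.9, b=0.1):
    """Toy state space recurrence: state_t = a * state_{t-1} + b * x_t.

    The fixed-size `state` is the running summary (the "topic and vibe"),
    so each step reads exactly one new input regardless of history length.
    """
    state = 0.0
    reads = 0
    outputs = []
    for x in inputs:
        state = a * state + b * x
        reads += 1
        outputs.append(state)
    return outputs, reads


def attention_generate(inputs):
    """Toy causal attention: every step scores the whole history so far.

    Each output is a softmax-weighted average of all inputs up to step t,
    so the number of items read grows with the length of the history.
    """
    reads = 0
    outputs = []
    for t in range(len(inputs)):
        query = inputs[t]
        history = inputs[: t + 1]
        scores = [query * key for key in history]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, history)) / total)
        reads += len(history)
    return outputs, reads


if __name__ == "__main__":
    tokens = [1.0] * 100
    _, ssm_reads = ssm_generate(tokens)
    _, attention_reads = attention_generate(tokens)
    print(ssm_reads)        # 100
    print(attention_reads)  # 5050
```

For 100 inputs the recurrence reads 100 items in total, while the attention-style loop reads 5,050, because step t revisits all t items seen so far. Production systems are far more elaborate on both sides, but the linear versus quadratic growth is the intuition behind the company's "re-watching the conversation" comparison.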
Beyond the speed and emotion, Sonic-3 is built for global enterprise use. It supports 42 languages, intelligently handles acronyms, and offers both instant and professional-grade voice cloning. Cartesia is already powering millions of monthly conversations for clients like ServiceNow and Cresta. To back its claims, the company's co-founder issued a bold challenge: if Cartesia can't improve a qualified company's existing voice AI, it will donate $5,000 to a charity of that company's choice.


