Mira Murati's Thinking Machines Lab on Sunday published its first detailed technical post since the company was founded, and the thesis is a direct shot at how OpenAI, Google DeepMind and Anthropic have been productising real-time voice. The lab argues that the prevailing approach to interactive AI, in which a turn-based language model is wrapped in a harness that handles speech detection, interruption and latency tricks, is a dead end. Its alternative is what it calls an "interaction model": a single network in which the ability to listen, speak, see and pause at the right time is trained in, not bolted on.
The post introduces TML-Interaction-Small, a 276 billion-parameter mixture-of-experts model with 12 billion active parameters, and a small suite of new benchmarks on which Thinking Machines says its model scores an order of magnitude higher than OpenAI's GPT Realtime-2. On one test of time-aware speech, called TimeSpeak, the lab reports a macro-accuracy of 64.7 per cent for its model against 4.3 per cent for GPT Realtime-2 minimal. On a temporal action-counting benchmark, the comparison is 35.4 per cent against 1.3 per cent. "No existing model can meaningfully perform any of these tasks," the post claims.
The thesis: real-time scaffolding has hit a wall
Current frontier models, the lab notes, have been optimised primarily for autonomous capability rather than for interactive collaboration. It quotes a recent frontier model card that conceded the model's benefits "were less clear" when used in a "hands-on-keyboard, synchronous" pattern. The post frames this as an architectural problem, not a polish problem. A turn-based model experiences the world as a single thread: the user types or speaks, the model freezes during generation, then a response arrives. The handful of products that feel real-time, including OpenAI's Realtime API and the various Gemini Live offerings, get there by stitching together voice-activity-detection components, interruption emulators and multimodal scaffolding on top of an otherwise turn-based engine.
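A rough pseudocode sketch makes that division of labour concrete. The names below (SimpleVAD, TurnBasedLLM, harness_loop) are illustrative stand-ins rather than any vendor's actual components; the point is that the decision about when a turn has ended lives in the harness, outside the model.

```python
# Illustrative pseudocode only. SimpleVAD, TurnBasedLLM and harness_loop are
# hypothetical stand-ins for the kind of scaffolding the post describes; they
# are not OpenAI's or Google's actual components.
import time
from dataclasses import dataclass


@dataclass
class SimpleVAD:
    """Fixed-threshold voice-activity detector: a heuristic, not a learner."""
    energy_threshold: float = 0.02

    def user_is_speaking(self, frame: list) -> bool:
        energy = sum(x * x for x in frame) / max(len(frame), 1)
        return energy > self.energy_threshold


class TurnBasedLLM:
    """Stand-in for a turn-based model that only answers complete turns."""
    def respond(self, transcript: str) -> str:
        time.sleep(0.1)  # the user waits while the model generates
        return f"(reply to {transcript!r})"


def harness_loop(frames, transcribe, vad=SimpleVAD(), llm=TurnBasedLLM()):
    """The harness decides when a turn ends; the model only acts afterwards."""
    buffered = []
    for frame in frames:
        if vad.user_is_speaking(frame):
            buffered.append(frame)       # keep listening
        elif buffered:
            text = transcribe(buffered)  # turn boundary guessed by the VAD
            print(llm.respond(text))     # swapping in a smarter LLM changes
            buffered = []                # nothing about this timing logic
```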
"For interactivity to scale with intelligence, it must be part of the model itself," the post argues. The scaffolding approach, in this framing, is destined to be outpaced because the bolted-on components do not get smarter as the underlying model gets smarter. Voice-activity-detection is not learning. The turn-boundary predictor is not learning. So as the language model improves, the gap between what the model could do conversationally and what the surrounding harness lets it do widens.
