Thinking Machines Lab Wants to Replace OpenAI Realtime With a Model That Listens While It Speaks

Mira Murati's lab published its first technical paper, arguing that real-time interactivity should be a native model capability rather than scaffolding bolted around turn-based language models — and it ships benchmarks where GPT Realtime-2 scores near zero.

7 min read
Conceptual illustration of real-time, multimodal AI interaction. (Image: Thinking Machines)

Mira Murati's Thinking Machines Lab on Sunday published its first detailed technical post since the company was founded, and the thesis is a direct shot at how OpenAI, Google DeepMind and Anthropic have been productising real-time voice. The lab argues that the prevailing approach to interactive AI, in which a turn-based language model is wrapped in a harness that handles speech detection, interruption and latency tricks, is a dead end. Its alternative is what it calls an "interaction model": a single network in which the ability to listen, speak, see and pause at the right time is trained in, not bolted on.

The post introduces TML-Interaction-Small, a 276 billion-parameter mixture-of-experts model with 12 billion active parameters, and a small suite of new benchmarks on which Thinking Machines reports its model scoring an order of magnitude higher than OpenAI's GPT Realtime-2. On one test of time-aware speech, called TimeSpeak, the lab reports a macro-accuracy of 64.7 per cent versus 4.3 per cent for GPT Realtime-2 minimal. On a temporal action-counting benchmark, the same comparison is 35.4 per cent versus 1.3 per cent. "No existing model can meaningfully perform any of these tasks," the post claims.

The thesis: real-time scaffolding has hit a wall

Current frontier models, the lab notes, have been optimised primarily for autonomous capability rather than for interactive collaboration. It quotes a recent frontier model card that conceded the model's benefits "were less clear" when used in a "hands-on-keyboard, synchronous" pattern. The post frames this as an architectural problem, not a polish problem. A turn-based model experiences the world as a single thread: the user types or speaks, the model freezes during generation, then a response arrives. The handful of products that feel real-time, including OpenAI's Realtime API and the various Gemini Live offerings, get there by stitching together voice-activity-detection components, interruption emulators and multimodal scaffolding on top of an otherwise turn-based engine.

"For interactivity to scale with intelligence, it must be part of the model itself," the post argues. The scaffolding approach, in this framing, is destined to be outpaced because the bolted-on components do not get smarter as the underlying model gets smarter. Voice-activity-detection is not learning. The turn-boundary predictor is not learning. So as the language model improves, the gap between what the model could do conversationally and what the surrounding harness lets it do widens.


What the architecture actually does differently

Three design choices anchor the system. The first is what Thinking Machines calls time-aligned micro-turns. Rather than processing one complete user turn and emitting one complete response, the model treats audio and video as continuous streams and works on 200-millisecond chunks of input and output. Speaking, listening, deciding to interrupt and deciding to wait become token-level decisions inside the model rather than judgements made by an external harness. The lab argues that this is what lets the model interject contextually, talk over a user during live translation, or stay silent for several seconds and then speak when a video reveals the answer to a question asked minutes earlier.
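
To make the idea concrete, here is a rough Python sketch of what a micro-turn loop looks like. The class and method names are illustrative assumptions, not Thinking Machines' API, and the model itself is stubbed out.

```python
# Illustrative sketch of a time-aligned micro-turn loop. InteractionModel,
# step() and the SILENCE sentinel are hypothetical names, not TML's API.
import time

CHUNK_MS = 200  # the post's micro-turn granularity


class InteractionModel:
    """Stub for a streaming model that ingests one 200 ms chunk of audio
    and video and returns either speech output or a decision to stay silent."""

    SILENCE = object()

    def step(self, audio_chunk, video_frame):
        # A real model would append the chunk to its running context and decide,
        # at the token level, whether to speak, interrupt, or keep waiting.
        return self.SILENCE


def run_session(model, mic, camera, speaker):
    """Drive the model in real time: one input chunk in, one decision out."""
    while True:
        started = time.monotonic()
        audio = mic.read(ms=CHUNK_MS)      # input is a continuous stream...
        frame = camera.read_latest()
        out = model.step(audio, frame)     # ...consumed in 200 ms micro-turns
        if out is not InteractionModel.SILENCE:
            speaker.play(out)              # speaking is just another per-chunk output
        # Pace the loop to wall-clock time so input and output stay aligned.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, CHUNK_MS / 1000 - elapsed))
```

The point of the sketch is where the decision lives: whether to stay silent or cut in is made per chunk inside the model, not by an external voice-activity detector.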

The second is encoder-free early fusion. Most multimodal systems route audio through a large standalone encoder, vision through another, and then concatenate the embeddings. Thinking Machines does the opposite: audio signals are converted via a lightweight dMel embedding, video frames are split into 40×40 patches encoded by an hMLP block, and every component is co-trained from scratch alongside the transformer. The system's audio decoder uses a flow head rather than a separate vocoder. The effect is fewer specialised parts and a tighter coupling between modalities.
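
A simplified sketch of what early fusion without large encoders can look like, assuming PyTorch; the dMel-style audio path and hMLP-style patch embed below are rough stand-ins for the components the post names, not their actual implementations.

```python
# Simplified early-fusion sketch, assuming PyTorch. The audio and vision paths
# here approximate the idea, not Thinking Machines' actual dMel or hMLP blocks.
import torch
import torch.nn as nn


class EarlyFusionEmbed(nn.Module):
    def __init__(self, d_model=1024, n_mels=80, n_bins=16, patch=40):
        super().__init__()
        # Audio path: discretise each mel band into a small set of bins and
        # embed the indices directly, instead of running a large audio encoder.
        self.audio_embed = nn.Embedding(n_bins, 16)
        self.audio_proj = nn.Linear(n_mels * 16, d_model)
        # Vision path: flatten 40x40 RGB patches through a small MLP.
        self.patch = patch
        self.vision_embed = nn.Sequential(
            nn.Linear(patch * patch * 3, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, mel_bins, frames):
        # mel_bins: (time, n_mels) integer bin indices; frames: (n, 3, H, W)
        audio_tok = self.audio_proj(self.audio_embed(mel_bins).flatten(1))
        p = self.patch
        tiles = frames.unfold(2, p, p).unfold(3, p, p)        # cut into p x p tiles
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3 * p * p)
        vision_tok = self.vision_embed(tiles)
        # Early fusion: both modalities enter the same transformer stream and
        # are co-trained with it, rather than arriving via frozen encoders.
        return torch.cat([audio_tok, vision_tok], dim=0)
```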

The third is the split between an interaction model and a background model. The interaction model holds the live channel: it talks, listens, and tracks the conversation. The background model does the slower planning, tool use, search and longer-horizon reasoning. The interaction model decides when to delegate and weaves the background model's output back into the conversation when it is ready. This is meant to give the system responsiveness at "non-thinking latency" while still letting an agent loop run in the background.
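
The division of labour is easiest to see as a concurrency pattern. The asyncio sketch below is an assumption about structure, not the lab's implementation: the live loop keeps answering while a slower task runs, and its result is woven back in when ready. The function names (slow_agent_loop, interaction_loop) are illustrative.

```python
# Minimal sketch of an interaction loop delegating to a background worker.
import asyncio


async def slow_agent_loop(task: str) -> str:
    # Stands in for tool use, search, or longer-horizon reasoning.
    await asyncio.sleep(3.0)
    return f"[background result for: {task}]"


async def interaction_loop(user_turns):
    pending: set[asyncio.Task] = set()
    for turn in user_turns:
        if turn.startswith("research:"):
            # Delegate to the background model without blocking the live channel.
            pending.add(asyncio.create_task(slow_agent_loop(turn)))
            print("model> on it; I'll keep talking while that runs.")
        else:
            print(f"model> quick reply to: {turn}")
        # Weave finished background work back into the conversation.
        done = {t for t in pending if t.done()}
        for t in done:
            print(f"model> by the way, {t.result()}")
        pending -= done
        await asyncio.sleep(1.0)  # the live channel keeps its own cadence
    if pending:
        for result in await asyncio.gather(*pending):
            print(f"model> {result}")


asyncio.run(interaction_loop(["hi", "research: compare these two papers", "what's new?"]))
```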

Engineering details that matter to anyone building on top

The post is unusually open about its serving stack. Streaming sessions, in which clients send 200ms chunks as separate requests while the inference server appends them into a persistent GPU sequence, have been upstreamed to SGLang, which means anyone serving on SGLang gets some of the latency benefits for free. The lab also reports a "bitwise" alignment between its trainer and sampler, achieved with batch-invariant kernels for All-Reduce, Reduce-Scatter and split-KV attention, at a sub-5 per cent end-to-end performance cost. The practical effect is determinism: training and inference produce the same outputs given the same inputs, which simplifies debugging in a regime where small numerical drift normally explodes into different model behaviour over a long streaming session.
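
For developers, the client side of a streaming session ends up looking roughly like the sketch below; the endpoint and payload fields are placeholders to show the shape of the interaction, not SGLang's actual interface.

```python
# Sketch of the client side of a streaming session: each 200 ms chunk is sent
# as its own request carrying a session id, and the server appends it to a
# persistent sequence. Endpoint and fields are hypothetical, not SGLang's API.
import base64
import uuid

import requests

SERVER = "http://localhost:30000"          # assumed local inference server
session_id = str(uuid.uuid4())


def send_chunk(audio_bytes: bytes) -> dict:
    payload = {
        "session_id": session_id,          # the server keys the persistent GPU sequence on this
        "audio_chunk": base64.b64encode(audio_bytes).decode(),
        "chunk_ms": 200,
    }
    return requests.post(f"{SERVER}/v1/interaction/stream", json=payload, timeout=1.0).json()


# Usage: stream chunks as they arrive from the microphone; each response may
# carry audio to play immediately, or nothing if the model chose silence.
# for chunk in microphone_chunks():
#     out = send_chunk(chunk)
#     if out.get("audio"):
#         play(base64.b64decode(out["audio"]))
```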

What this means for OpenAI, Anthropic and DeepMind

Thinking Machines has so far been a research curiosity. Murati left OpenAI as chief technology officer in 2024 and raised the largest seed round in venture history at a multibillion-dollar valuation. The lab has hired aggressively from OpenAI, Anthropic and Google. Until now, the only public output has been hiring announcements and Murati's appearances. This post is the first time the company has made a substantive technical claim about what it thinks the field is doing wrong.

For OpenAI specifically, the framing is awkward. The Realtime API has been one of the company's most-deployed surfaces over the last twelve months, with enterprise customers building call centres, tutoring products and voice assistants on top of it. The Thinking Machines paper, with benchmarks in which GPT Realtime-2 minimal scores 2.9 per cent on cued response timing and 0 per cent on temporal action localisation, is implicitly arguing that the entire Realtime stack is the wrong abstraction. Anthropic does not yet have a comparable real-time voice API. DeepMind's Gemini Live is closer to the streaming paradigm than OpenAI's offering but still relies on external speech components.

The post does not hide its caveats: TML-Interaction-Small is small by frontier standards, long sessions need more work on context management, and scaling the architecture to a larger pretrained base remains a 2026 project. Murati's lab is choosing to publish the architecture and the benchmarks before it has the scaled model. That is the same play DeepMind made with AlphaGo and Anthropic made with constitutional AI: establish the paradigm before competitors can frame the conversation around their own.

What to watch

The interesting question is not whether 200-millisecond micro-turns become a standard primitive. It is whether the labs with already-deployed real-time products treat this paper as a credible threat or as research-lab posturing. If OpenAI announces in the next six months that the next Realtime API revision will move away from VAD-based turn boundaries toward native streaming, that is the answer. If the Realtime stack keeps competing on faster TTS and lower transcription latency, Thinking Machines gets to own the native-interactivity framing uncontested.

One indirect signal is already visible: with streaming sessions upstreamed to SGLang, open-source serving infrastructure now has the plumbing for the Thinking Machines paradigm. Anyone wanting to serve an interaction-model-style workload no longer has to build the prefill batching themselves. That is how architectural assumptions become defaults.
