Mira Murati's Thinking Machines Lab on Sunday published its first detailed technical post since the company was founded, and the thesis is a direct shot at how OpenAI, Google DeepMind and Anthropic have been productising real-time voice. The lab argues that the prevailing approach to interactive AI, in which a turn-based language model is wrapped in a harness that handles speech detection, interruption and latency tricks, is a dead end. Its alternative is what it calls an "interaction model": a single network in which the ability to listen, speak, see and pause at the right time is trained in, not bolted on.
The post introduces TML-Interaction-Small, a 276-billion-parameter mixture-of-experts model with 12 billion active parameters, along with a small suite of new benchmarks on which Thinking Machines reports its model scoring more than an order of magnitude higher than OpenAI's GPT Realtime-2. On one test of time-aware speech, called TimeSpeak, the lab reports a macro-accuracy of 64.7 per cent versus 4.3 per cent for GPT Realtime-2 minimal. On a temporal action-counting benchmark, the reported scores are 35.4 per cent versus 1.3 per cent. "No existing model can meaningfully perform any of these tasks," the post claims.
