The inherent unpredictability of large language models, where identical prompts yield varied outputs, has long been a significant hurdle for their widespread adoption in critical applications. This "nondeterminism" undermines trust, complicates debugging, and makes scientific reproducibility a formidable challenge. A recent paper from Thinking Machines Lab, a venture linked to former OpenAI CTO Mira Murati, addresses this fundamental problem head-on, presenting a compelling solution that promises to usher in an era of truly reliable AI.
Matthew Berman, in his recent video commentary, dissects this pivotal research from Thinking Machines Lab, specifically their paper titled "Defeating Nondeterminism in LLM Inference." The core issue, as Berman explains, is that "reproducibility is a bedrock of scientific progress. However, it's remarkably difficult to get reproducible results out of large language models." This isn't merely about models being "creative"; even with the temperature parameter, which controls output randomness, set to zero (in principle collapsing generation to greedy decoding, which should be fully deterministic), LLM responses can still differ.
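Why should temperature 0 be deterministic in the first place? Because it reduces sampling to greedy decoding: take the single highest-probability token at every step. A minimal sketch (NumPy, with hypothetical logits; not any particular inference engine's implementation):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float,
                      rng: np.random.Generator | None = None) -> int:
    """Pick a next-token id from raw logits at the given temperature."""
    if temperature == 0.0:
        # Temperature 0 is greedy decoding: always take the single best token.
        return int(np.argmax(logits))
    scaled = logits / temperature          # higher temperature -> flatter distribution
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

# Hypothetical logits where two tokens are almost tied.
logits = np.array([1.2, 3.700001, 3.7])
print(sample_next_token(logits, temperature=0.0))  # 1, every time -- given these exact logits
```

Greedy argmax is only as deterministic as the logits feeding it. When two tokens sit as close together as in the example above, a rounding difference deep in the forward pass is enough to swap them, and that is exactly where the research points.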
Initially, a common hypothesis attributed this nondeterminism to a combination of floating-point non-associativity and concurrent execution within GPUs. Floating-point numbers represent real values at finite precision, so every operation must round its result, and the order in which concurrent GPU cores perform those operations can subtly alter the final answer. While seemingly minor, these minute variations can cascade, leading to entirely different token selections by the LLM.
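Non-associativity is easy to demonstrate for yourself. A two-minute Python experiment shows grouping alone changing a result, and order changing a large reduction:

```python
import numpy as np

# Grouping alone changes the rounded result:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)  # 0.6000000000000001
print(a + (b + c))  # 0.6

# At scale: summing the same million float32 values in reverse order
# usually yields a slightly different total, because different rounding
# errors accumulate along the way.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)
print(np.sum(x) == np.sum(x[::-1]))  # usually False
```

Every individual operation here is correctly rounded; it is the order of operations that decides which rounding errors survive to the final answer.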
However, Thinking Machines Lab's research reveals a more profound underlying cause: a lack of "batch invariance." Berman simplifies this concept with an analogy: imagine your individual query being placed into a "carpool" (a batch) with other users' requests. The size of this carpool changes dynamically with system load: when the system is busy, the batch is large; when quiet, it's small. This fluctuating batch size, rather than concurrency alone, quietly changes the internal ordering of the model's arithmetic. Subtle shifts in the order of many tiny additions can produce different intermediate totals and, ultimately, a different next-token prediction by the LLM.
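The carpool effect can be simulated in a few lines. This toy sketch (NumPy; real inference kernels partition work differently, but the principle is the same) sums the identical numbers under two different "batch" sizes:

```python
import numpy as np

def chunked_sum(x: np.ndarray, chunk: int) -> np.float32:
    """Reduce fixed-size chunks first, then reduce the partial sums.

    The chunk size stands in for batch size: it changes the order in
    which rounding happens, not the value being computed."""
    partials = [x[i:i + chunk].sum() for i in range(0, len(x), chunk)]
    return np.float32(sum(partials))

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000).astype(np.float32)

print(chunked_sum(x, chunk=32))    # a "quiet" server: small batches
print(chunked_sum(x, chunk=8192))  # a "busy" server: large batches
# The totals typically agree to only a handful of significant digits --
# and an LLM's next-token choice can hinge on less than that.
```

Once one token flips because of such a difference, everything generated after it diverges too, which is how a last-digit rounding change becomes a visibly different answer.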
As Berman highlights, "our request's output does depend on the parallel user requests... it's because our forward pass lacks 'batch invariance', causing our request's output to depend on the batch size of our forward pass." This dependency on the batch size means that even if a model is theoretically deterministic at a granular level, the dynamic nature of inference environments introduces an external source of variability. The problem isn't just internal calculation order; it's how external factors, like the volume of other requests, influence that order.
The proposed solution centers on making kernels "batch-invariant." This means designing the fundamental computational units (kernels) of the transformer architecture to produce identical outputs regardless of the batch size. Thinking Machines identifies three key operations requiring this invariance: RMSNorm, matrix multiplication, and attention. By ensuring that these core operations yield consistent results irrespective of how many other requests are processed concurrently, the model's overall output becomes truly deterministic.
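As a hedged illustration of what batch invariance means in practice (a NumPy sketch of the idea, not Thinking Machines' actual GPU kernels), an RMSNorm can be made batch-invariant by reducing each row in a fixed order that never depends on how many rows the batch contains:

```python
import numpy as np

def rmsnorm_batch_invariant(x: np.ndarray, weight: np.ndarray,
                            eps: float = 1e-6) -> np.ndarray:
    """RMSNorm whose reduction order is fixed per row.

    Each row accumulates its sum of squares strictly left to right, so a
    given input row yields bit-identical output whether the batch holds
    one row or five hundred."""
    out = np.empty_like(x, dtype=np.float32)
    for i in range(x.shape[0]):
        row = x[i].astype(np.float32)
        acc = np.float32(0.0)
        for v in row:                      # fixed left-to-right accumulation
            acc += v * v
        rms = np.sqrt(acc / np.float32(row.size) + np.float32(eps))
        out[i] = (row / rms) * weight
    return out
```

The invariant is simple to state: the set of operations performed for any one request, and their order, must never depend on who else shares the batch. The paper applies the same discipline to matrix multiplication and attention, accepting a fixed rather than adaptively tuned work partitioning, at some cost in peak throughput.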
This technical achievement carries immense implications for the future of AI. The ability to guarantee consistent outputs from LLMs transforms them from powerful but unpredictable tools into reliable, auditable systems. For developers, debugging becomes far more straightforward. For enterprises, particularly in sectors like finance, healthcare, or legal, where precision and accountability are non-negotiable, deterministic LLMs pave the way for trusted deployments. Benchmarking also gains a new level of integrity, allowing for accurate comparisons and reliable progress tracking in AI development.
Thinking Machines Lab validated their approach using the Qwen/Qwen3-235B-A22B-Instruct-2507 model. When prompted with "Tell me about Richard Feynman" (at temperature 0), the baseline model generated 80 unique completions across 1000 samples. However, with their batch-invariant kernels enabled, "all of our 1000 completions are identical." This striking result demonstrates the efficacy of their solution, moving LLMs from a realm of probabilistic guessing, even at their most conservative settings, into one of predictable and verifiable outcomes.
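The determinism check itself is straightforward to reproduce against any OpenAI-compatible inference server. A sketch (the endpoint below is a placeholder; the paper's experiment ran on a vLLM-based setup):

```python
from collections import Counter
from openai import OpenAI

# Placeholder endpoint: point this at the inference server under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

counts = Counter()
for _ in range(1000):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=[{"role": "user", "content": "Tell me about Richard Feynman"}],
        temperature=0,
        max_tokens=1000,
    )
    counts[resp.choices[0].message.content] += 1

print(f"{len(counts)} unique completions out of {sum(counts.values())}")
```

A baseline server prints something like "80 unique completions out of 1000"; a fully batch-invariant stack should print exactly "1 unique completions out of 1000".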
The ability to consistently reproduce LLM outputs fundamentally changes the landscape for AI. It fosters greater trust, enables robust debugging and auditing processes, and provides a stable foundation for scientific advancement and reliable product development. The implications for critical infrastructure and regulated industries are particularly profound, offering a path to integrate advanced AI with confidence and control.

