If you’ve ever asked an AI the same question twice and gotten a different answer, you’ve encountered LLM inference nondeterminism. For years, the tech community has pointed to a seemingly obvious culprit: the chaotic, parallel nature of GPUs. The "concurrency + floating point" hypothesis suggests that since GPUs perform calculations in a slightly different order each time, tiny rounding errors in floating-point math cascade into different results. It’s a plausible theory, but it’s not the whole story.
According to a deep-dive analysis by Thinking Machines Lab, this common explanation misses the true source of the problem. Floating-point non-associativity, the fact that `(a + b) + c` isn't always equal to `a + (b + c)` in floating-point arithmetic, is the underlying mechanism for numerical differences, but on its own it doesn’t explain why results vary from run to run. The researchers point out that the individual operations in an LLM’s forward pass, like a matrix multiplication, are actually deterministic. Run the same `torch.mm(A, B)` operation a thousand times on a GPU with the same inputs, and you’ll get the exact same bit-for-bit result every time. So if the core components are deterministic, why is the final output not?
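
To make the distinction concrete, here is a minimal sketch in plain Python (CPU floats, not GPU code). It shows that regrouping or reordering additions changes the result, while repeating the same order always gives the same bits. The helper `ordered_sum` and the sample values are illustrative assumptions, not taken from the Thinking Machines Lab analysis.

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6


def ordered_sum(values):
    """Add values strictly left to right, i.e. one fixed reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative values with very different magnitudes.
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]

# Same numbers, different order, different answer.
print(ordered_sum(values))     # 1.0 (the first 1.0 is lost to rounding)
print(ordered_sum(reordered))  # 2.0

# Same numbers, same order: identical result on every run.
results = {ordered_sum(values) for _ in range(1000)}
print(len(results))  # 1
```

The takeaway matches the paragraph above: non-associativity explains how two results can differ, but a fixed order of operations is still perfectly repeatable. Something else has to change the order between runs.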
