Failures of AI agents in production are notoriously difficult to reproduce, which makes reliability hard to guarantee. In their presentation "Your Agent Failed in Prod. Good Luck Reproducing It.", Tisha Chawla and Susheem Koul of Microsoft take on this problem directly. They examine the underlying causes of these elusive bugs and offer practical debugging strategies, arguing for a shift in focus from absolute determinism to replayability and observability.
Understanding the Sources of Non-Determinism
Chawla and Koul highlight several factors that make AI agents behave unpredictably, particularly in production environments. A primary culprit is the gap between sampling determinism and system determinism. Setting a model's temperature to zero removes randomness from sampling, but the underlying hardware and software can still introduce variation. Specifically, floating-point addition is not associative, so the order of operations can produce minute differences in calculations that cascade into a different argmax output. A lack of batch invariance has a similar effect: when a model processes requests in batches, the result for a given input can differ subtly depending on the composition of the batch. The presentation also touches on MoE routing jitter: in Mixture-of-Experts models, routing decisions can depend on the batch, so the same input may take different pathways and produce different outputs.
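The floating-point point is easy to see in a few lines of plain Python. The following is a toy illustration, not code from the presentation: the "logits" are made-up numbers and the `argmax` helper is hypothetical. It shows the same three values summing to two different results depending on grouping, and a greedy pick flipping as a consequence.

```python
def argmax(values):
    """Index of the largest value; ties go to the earliest index."""
    return max(range(len(values)), key=lambda i: values[i])

# The same three numbers, added with different grouping.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)

# Toy "logits" (assumed values): token 0 has a fixed score, token 1's
# score is the reduced sum. Only the reduction order differs between runs.
fixed_score = 0.6
pick_a = argmax([fixed_score, left_to_right])
pick_b = argmax([fixed_score, right_to_left])

print(pick_a, pick_b)  # the greedy choice differs between the two runs
```

In a real model the reduction order is chosen by the kernel and can vary with batch size and hardware, which is why temperature zero alone does not guarantee identical outputs.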
The core message is that striving for perfect, bitwise determinism through APIs is often a futile endeavor. Instead, the focus should be on achieving replayability. This means being able to reconstruct the exact sequence of events that led to a failure, even if the underlying computations are not perfectly identical each time.
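One way to picture replayability is a record-and-replay wrapper around every non-deterministic step. The sketch below is a minimal illustration under stated assumptions, not the speakers' implementation: the `Trace` class, the step names, and the toy `run_agent` function are all hypothetical, and `random` stands in for a model call and a tool call.

```python
import json
import random
from typing import Any, Callable, Optional


class Trace:
    """Hypothetical recorder: logs step results live, returns them on replay."""

    def __init__(self, events: Optional[list] = None):
        self.replaying = events is not None
        self.events = list(events) if events is not None else []
        self.cursor = 0

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.replaying:
            event = self.events[self.cursor]
            if event["name"] != name:
                raise RuntimeError(f"expected {event['name']!r}, got {name!r}")
            self.cursor += 1
            return event["result"]  # recorded result; fn is never called
        result = fn()
        self.events.append({"name": name, "result": result})
        return result


def run_agent(trace: Trace, question: str) -> str:
    # Stand-ins for a model decision and a tool call (both non-deterministic).
    plan = trace.step("model:plan", lambda: random.choice(["search", "answer"]))
    if plan == "search":
        docs = trace.step("tool:search", lambda: [f"doc-{random.randint(0, 999)}"])
        return f"{question} -> used {docs[0]}"
    return f"{question} -> answered directly"


# Live run: record every non-deterministic result.
live = Trace()
out_live = run_agent(live, "why did the job fail?")
saved = json.dumps(live.events)  # what would be persisted alongside logs

# Replay: same control flow, results served from the saved trace.
replay = Trace(json.loads(saved))
out_replay = run_agent(replay, "why did the job fail?")

print(out_live == out_replay)  # True
```

The replayed run never touches the model or the tool, so it follows the recorded path even though a fresh live run might not. The name check in `step` is a design choice: it fails loudly if the agent code has changed in a way that no longer matches the recorded sequence.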
