Sandipan Bhaumik, Data & AI Tech Lead at Databricks, shared insights into the critical patterns for managing multi-agent AI systems in production. With over 18 years of experience in building and scaling distributed data systems, Bhaumik highlighted common mistakes and offered best practices for orchestrating AI agents, particularly in regulated industries like financial services and healthcare.
The Problem with Scaling AI Agents
Bhaumik began by illustrating the sharp increase in complexity when moving from a single-agent system to a multi-agent one. He noted that while a single agent is often perceived as a feature, a system with multiple agents becomes a distributed systems problem. The reason is that coordination, state management, and failure recovery all get harder as the number of agent interactions grows.
He presented a "complexity curve" showing how coordination complexity grows disproportionately with the number of agents. For instance, five agents can be 25 times more complex to manage than a single agent. This complexity can lead to numerous failure modes, including race conditions and state synchronization issues.
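One way to read the 25x figure is as quadratic growth, where each agent may need to coordinate with every agent in the system. The sketch below uses that n-squared model purely as an assumption; the talk summary gives only the five-agent data point, and the function names are hypothetical.

```python
def coordination_cost(num_agents: int) -> int:
    """Hypothetical cost model: one unit per ordered (sender, receiver) pair,
    counting an agent's handling of its own state as one pair."""
    return num_agents * num_agents


def relative_complexity(num_agents: int) -> float:
    """Cost of an n-agent system relative to a single agent."""
    return coordination_cost(num_agents) / coordination_cost(1)


for n in (1, 2, 5, 10):
    print(n, relative_complexity(n))
```

Under this assumed model, five agents come out at 25 times the single-agent cost, which matches the figure quoted in the talk.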
A "War Story": The Race Condition
To illustrate these challenges, Bhaumik shared a "war story" about a credit scoring system that failed due to a race condition. In this scenario, a credit score calculation agent successfully wrote a score of 750 to the database. However, a subsequent risk assessment agent, operating with stale data from a cache, read an outdated score of 680, leading to incorrect decisions.
The root cause was a caching layer without proper invalidation: the new score reached the database, but the cached copy was never evicted or refreshed, so readers kept seeing the old value. The incident illustrates a common anti-pattern, which is building a distributed system without embracing distributed systems thinking. The problem wasn't the LLM or the prompt, but the underlying system architecture.
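The stale read can be reproduced in a few lines. The following is a minimal sketch, not the system described in the talk: ScoreStore, run_scenario, and the customer ID are hypothetical names, and two in-memory dictionaries stand in for the database and the cache. It assumes the cache was warmed with the old score before the scoring agent wrote the new one.

```python
class ScoreStore:
    """In-memory stand-in for a database fronted by a read-through cache."""

    def __init__(self, invalidate_on_write: bool) -> None:
        self._db: dict[str, int] = {}
        self._cache: dict[str, int] = {}
        self._invalidate_on_write = invalidate_on_write

    def write_score(self, customer_id: str, score: int) -> None:
        self._db[customer_id] = score
        if self._invalidate_on_write:
            # Evict the cached copy so the next read goes to the database.
            self._cache.pop(customer_id, None)

    def read_score(self, customer_id: str) -> int:
        # Read-through: populate the cache on a miss, then serve from it.
        if customer_id not in self._cache:
            self._cache[customer_id] = self._db[customer_id]
        return self._cache[customer_id]


def run_scenario(invalidate_on_write: bool) -> int:
    store = ScoreStore(invalidate_on_write)
    store.write_score("cust-1", 680)   # earlier score
    store.read_score("cust-1")         # an earlier read warms the cache
    store.write_score("cust-1", 750)   # scoring agent writes the new score
    return store.read_score("cust-1")  # risk assessment agent reads


print(run_scenario(invalidate_on_write=False))  # 680: stale cached value
print(run_scenario(invalidate_on_write=True))   # 750: cache evicted on write
```

Evicting on write is only one possible fix; short time-to-live values or versioned reads are alternatives, and the summary does not say which one the team adopted.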
