Modern software systems are a labyrinth of interconnected components, generating an overwhelming deluge of data. When an outage or performance degradation strikes, identifying the true root cause amidst this cacophony of signals becomes a monumental, often manual, task. This critical challenge, a persistent thorn in the side of enterprise operations, is precisely what Traversal AI, a startup founded by Anish Agarwal and Raaz Dwivedi, aims to tackle with a sophisticated blend of causal machine learning and reinforcement learning.
In a recent Latent Space podcast, co-hosted by Alessio Fanelli of Kernel Labs and Swyx of Smol AI, Anish and Raaz, both veterans of MIT and academia, detailed their journey from deep research to building a product addressing this complex problem. Their backgrounds, rooted in cutting-edge AI research, provided a unique lens through which to view the inefficiencies plaguing incident response.
Anish, whose PhD research at MIT focused on "how do you get these AI systems to pick up cause-and-effect relationships from data, and also reinforcement learning, which is to me fundamentally about how do you search large spaces effectively," saw a convergence point. This specialized expertise, honed over years, directly informed their approach to what co-founder Raaz, a Berkeley PhD who previously worked in observability, termed the "needle in a haystack" problem. Raaz humorously noted, "Correlation isn't causation, and I joke, well, when I say it, then I'm allowed to say it because I have a degree." Their combined academic rigor and practical experience positioned them uniquely to address a challenge that has long defied simple solutions.
The core of the problem, as they articulated, is not merely detecting an anomaly. When a critical system experiences a latency spike, thousands of other metrics might also show unusual behavior. Telling apart a genuine cause, a mere symptom, and a spurious correlation becomes incredibly difficult. Traversal AI seeks to provide clarity in these high-stakes scenarios.
Modern enterprise systems operate at an astounding scale. DigitalOcean, one of Traversal's early customers, manages over 1,300 microservices, generating billions of logs and tens of billions of time-series data points daily. Traditional observability tools struggle to make sense of this volume, often leaving engineers to manually sift through disparate dashboards, logs, and code repositories. Expectations for incident resolution are also high: companies want the root cause identified within minutes.
Traversal's approach moves beyond simple automation or superficial AI wrappers. It leverages sophisticated AI agents to dynamically and adaptively query across an organization's entire observability stack (Elastic, Grafana, Datadog, Splunk, ServiceNow) to build a coherent context. This process involves sifting through petabytes of data within strict rate limits, making it as much an infrastructure challenge as an AI one. Anish candidly admitted, "If I think about the last six months, we've been almost more of an infrastructure company than an AI company, just figuring out how to search these things effectively, the right way to cache this data, so you can do it at real-time."
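To make the infrastructure side of this concrete, the sketch below shows one simple way to combine a rate limit with a cache in front of an observability backend. It is illustrative only and is not a description of Traversal's system: the class name `RateLimitedCachedClient`, the `fetch` callable, and all parameter names are hypothetical, and the sliding-window limiter and time-to-live cache are assumptions chosen for brevity.

```python
import time
from typing import Callable, Dict, List, Tuple


class RateLimitExceeded(Exception):
    """Raised when a backend call would exceed the configured rate limit."""


class RateLimitedCachedClient:
    """Hypothetical wrapper around a slow, rate-limited query backend.

    Repeated queries are served from a TTL cache; only cache misses
    count against the sliding-window rate limit.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],      # assumed backend call, e.g. a log search
        max_calls: int,                   # allowed backend calls per window
        per_seconds: float,               # window length in seconds
        ttl_seconds: float,               # how long a cached result stays fresh
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_calls = max_calls
        self._per_seconds = per_seconds
        self._ttl = ttl_seconds
        self._clock = clock
        self._call_times: List[float] = []
        self._cache: Dict[str, Tuple[float, str]] = {}

    def query(self, q: str) -> str:
        now = self._clock()

        # Serve from cache while the entry is still fresh.
        hit = self._cache.get(q)
        if hit is not None and hit[0] > now:
            return hit[1]

        # Forget backend calls that have left the sliding window.
        self._call_times = [
            t for t in self._call_times if now - t < self._per_seconds
        ]
        if len(self._call_times) >= self._max_calls:
            raise RateLimitExceeded(f"limit reached, query deferred: {q!r}")

        self._call_times.append(now)
        result = self._fetch(q)
        self._cache[q] = (now + self._ttl, result)
        return result
```

In this sketch a caller that hits `RateLimitExceeded` would need to retry later or fall back to a narrower query. A production system would more likely queue or prioritize requests than raise, but that policy is outside what the interview describes.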
The ultimate vision extends to self-healing systems. While companies are understandably cautious about granting AI agents full write access to their infrastructure, a phased approach involving "whitelisted sets of commands" is proving viable. These commands, derived from pre-existing, human-validated scripts, allow the AI to execute targeted fixes like reverting a problematic commit or restarting a service, thereby accelerating resolution times and freeing up engineering resources. Raaz emphasized the non-deterministic nature of complex incidents: "When our AI runs and finds an answer, we know it's a good signal. Like, it is not by fluke, because there is no way we would have guessed it otherwise." This underscores the true intelligence at play, moving beyond mere pattern recognition to genuine problem-solving.
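The idea of restricting an agent to "whitelisted sets of commands" can be sketched as a small allowlist that maps named actions to fixed command templates. This is a minimal illustration under stated assumptions, not Traversal's implementation: the action names, the `ALLOWED_ACTIONS` table, the `build_command` helper, and the parameter validation rule are all hypothetical. The function only builds an argument list; it never executes anything.

```python
import re

# Hypothetical allowlist: action name -> argv template.
# In practice these would come from existing, human-validated scripts.
ALLOWED_ACTIONS = {
    "revert_commit": ["git", "revert", "--no-edit", "{commit}"],
    "restart_service": ["systemctl", "restart", "{service}"],
}

# Assumed rule: values must start with an alphanumeric character (so they
# cannot be read as a flag) and contain only letters, digits, '.', '_', '-'.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


class ActionRejected(Exception):
    """Raised when a requested action or parameter is not permitted."""


def build_command(action, **params):
    """Return the argv list for an allowlisted action, or raise ActionRejected."""
    template = ALLOWED_ACTIONS.get(action)
    if template is None:
        raise ActionRejected(f"action not on allowlist: {action!r}")

    for name, value in params.items():
        if not isinstance(value, str) or not _SAFE_VALUE.fullmatch(value):
            raise ActionRejected(f"unsafe value for {name!r}: {value!r}")

    try:
        return [part.format(**params) for part in template]
    except KeyError as missing:
        raise ActionRejected(f"missing parameter: {missing}") from None
```

Because the result is an argument list rather than a shell string, it can be handed to `subprocess.run` without `shell=True`, so the parameter values are never interpreted by a shell. A human approval step could sit between building the command and running it.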
Traversal AI's competitive advantage stems from its outcome-driven business model. Unlike many observability incumbents that profit from the volume of data stored, Traversal focuses on delivering precise, actionable root cause analysis. This aligns incentives with the customer's need for efficiency, especially in scenarios where "no amount of workflows will suffice for a big enterprise... you have to link together some of the missing pieces, some of the poorly instrumented data... That requires world knowledge and a few iterations with the world knowledge." By tackling these unstructured, complex problems that lack predefined playbooks, Traversal AI is carving out a niche where deep AI expertise translates directly into tangible operational benefits and resilience for the most demanding software systems.

