The complexity of modern enterprise systems has outgrown traditional troubleshooting methods, creating what Traversal AI founders Anish Agarwal and Raaz Dwivedi describe as a "massive search problem." Fragmented telemetry (logs, metrics, traces, code, and even Slack messages) swamps human engineers, and that challenge shaped their approach to incident response. On a recent episode of the Latent Space podcast, Agarwal, an MIT and Columbia professor specializing in causal machine learning and reinforcement learning, and Dwivedi, whose background spans Berkeley and Cornell Tech along with experience in observability startups, discussed their solution: an agentic AI architecture designed to pinpoint the root causes of incidents.

The core of Traversal's approach is going beyond correlation, a limitation of existing observability tools. "Correlation isn't causation, so how do you get these AI systems to pick up cause and effect relationships from data," Agarwal asked, pointing to the central question of causal machine learning that underpins the platform. Modern microservice architectures, with their thousands of services and petabytes of data, generate a deluge of signals. At that volume, simply feeding data into large language models (LLMs) is insufficient; a more intelligent, adaptive search is required to distill actionable insights from the noise.

Traversal addresses this by combining the semantic understanding capabilities of LLMs with sophisticated statistical analysis of time-series data. This hybrid "agentic architecture" dynamically decides which statistical tests to run, sifting through the vast data landscape efficiently and respecting system rate limits. As Dwivedi articulated, "It is the use case where reasoning models are a necessity. You are trying to argue across so many symptoms and such a complex architecture of what the root cause is..." This intelligent reasoning is critical for navigating incidents that defy pre-defined runbooks, which often fail when faced with novel or highly complex issues.
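To make the idea of a budgeted, adaptive search concrete, here is a minimal Python sketch. It is not Traversal's implementation: the metric names, the `priorities` scores (a stand-in for an LLM's semantic judgment about which signals look relevant), and the `query_budget` parameter (a stand-in for a rate limit) are all hypothetical, and the statistical test is a simple standardized mean shift chosen for illustration.

```python
from statistics import mean, pstdev


def shift_score(series, incident_index):
    """Standardized mean shift: distance between the post-incident mean and
    the pre-incident mean, in units of the baseline standard deviation."""
    before, after = series[:incident_index], series[incident_index:]
    spread = pstdev(before) or 1e-9  # avoid dividing by zero on flat baselines
    return abs(mean(after) - mean(before)) / spread


def rank_suspects(metrics, priorities, incident_index, query_budget):
    """Test only the `query_budget` highest-priority metrics, then rank the
    tested metrics by how strongly they shifted at the incident."""
    ordered = sorted(metrics, key=lambda name: priorities.get(name, 0.0), reverse=True)
    tested = ordered[:query_budget]
    scores = {name: shift_score(metrics[name], incident_index) for name in tested}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical telemetry: the incident starts at index 4.
metrics = {
    "checkout.latency_ms": [100, 102, 98, 100, 180, 185, 190],
    "cart.cpu_pct": [40, 42, 41, 39, 41, 40, 42],
    "auth.error_rate": [1, 1, 1, 1, 1, 1, 1],
}
# Hypothetical relevance scores, standing in for an LLM's semantic ranking.
priorities = {"checkout.latency_ms": 0.9, "cart.cpu_pct": 0.6, "auth.error_rate": 0.2}

ranked = rank_suspects(metrics, priorities, incident_index=4, query_budget=2)
for name, score in ranked:
    print(f"{name}: shift score {score:.2f}")
```

With a budget of two, the lowest-priority metric is never queried. That is the trade-off the sketch illustrates: a prior decides where the limited statistical tests are spent.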

The timing for such a solution is right. Organizations are grappling with the "hero engineer" problem, where only a few highly experienced individuals can troubleshoot elusive incidents. Traversal aims to democratize this expertise, turning reactive firefighting into a more proactive process. The shift extends beyond identifying problems; it paves the way for self-healing systems and long-term improvements to code architecture.

The technical underpinnings of Traversal’s approach are particularly sharp. By focusing on causal inference, the platform can distinguish between symptoms and true root causes, preventing engineers from chasing false positives. This ability to intelligently prune the search space is paramount when dealing with data sets comprising billions of logs and millions of time series. The system’s adaptability means it can learn and evolve with the infrastructure, providing relevant insights even when no human expert knows the answer.
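As a rough illustration of separating symptoms from root-cause candidates, the following Python sketch prunes a set of anomalous services using a dependency graph. This is a simple graph heuristic, not a causal inference method and not Traversal's algorithm; the service names and the `depends_on` map are hypothetical, and only direct dependencies are checked.

```python
def root_cause_candidates(depends_on, anomalous):
    """Keep anomalous services whose direct dependencies all look healthy.

    An anomalous service with an anomalous dependency is treated as a likely
    symptom, because its trouble may be inherited from upstream.
    """
    candidates = set()
    for service in anomalous:
        upstream = depends_on.get(service, ())
        if not any(dep in anomalous for dep in upstream):
            candidates.add(service)
    return candidates


# Hypothetical topology: each service maps to the services it calls.
depends_on = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["payments-db"],
}
# Hypothetical set of services currently showing anomalies.
anomalous = {"web", "checkout", "payments", "payments-db"}

candidates = root_cause_candidates(depends_on, anomalous)
print(candidates)
```

Four alerting services collapse to one candidate here, which is the kind of search-space pruning the paragraph above describes. A real system would need to handle cycles, missing edges, and anomalies that merely coincide in time.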

The founders also touched on the ongoing evolution of AI models, noting the continuous improvement in reasoning and tool-calling capabilities in models from providers such as OpenAI and Google. However, they emphasized that relying on models alone is not enough. "The best AI companies will always have to be at the edge of what the models can do... if everything works all the time then you're not really pushing the limit," Agarwal said, underscoring the need for continuous innovation and for robust evaluation pipelines as core intellectual property. By this account, Traversal's aim is to push the limits of intelligent incident resolution rather than settle for automation alone.

Traversal AI: Unraveling Software Incidents with Causal Machine Learning
