“Context engineering is such a good term, I wish I came up with that term,” admitted Harrison Chase, co-founder of LangChain, reflecting on the industry’s accelerating push to enable artificial intelligence to tackle complex, multi-step tasks. He emphasized that the critical path to building reliable, long-horizon AI agents lies not solely in continually improving foundation models, but in mastering the complex infrastructure and feedback loops surrounding them. The conversation, hosted by Sonya Huang and Pat Grady of Sequoia Capital on the Training Data podcast, offered a sharp analysis of the architectural evolution required to move AI from single-turn prompts to autonomous systems capable of executing multi-day projects.
Chase, whose LangChain framework has become central to agent development, explained that the core challenge of long-horizon agents is managing state and context across extended time horizons and many interactions. Early attempts at agents, such as AutoGPT, proved the concept but were ultimately unreliable because the underlying models and scaffolding lacked the necessary robustness. Now, with more capable large language models (LLMs), the industry is moving past simple frameworks toward what Sequoia has termed "agent harnesses": opinionated, structured architectures designed to guide and constrain the non-deterministic behavior of the underlying models.
The fundamental difficulty agents introduce is the erosion of the traditional software development paradigm. In conventional software, the code itself is the definitive source of truth, dictating precise execution flows. Agents, however, operate in loops where every decision influences the next step, creating a cascade of unpredictable results. “In agents, they're running and repeating and so you don't actually know what the context at step 14 will be because there's 13 steps before that that could pull arbitrary things in,” Chase noted. This complexity demands new methods for observability and debugging, making the traditional stack obsolete.
This need for visibility has elevated "traces," the detailed, structured logs of agent execution, to first-class status. Traces capture every interaction, tool call, and decision made by the LLM, providing developers with the critical data needed to understand and improve agent behavior. Chase argued that this represents a profound shift: “The source of truth for software is in code... For agents, it's a combination now of code and traces are where you can see the source of truth.” This means that for companies building agents, the ability to generate, analyze, and act upon these traces is quickly becoming a competitive moat.
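To make the idea concrete, here is a minimal sketch of what recording such a trace could look like. This is an illustration, not LangSmith's or any framework's actual API; the `Trace` and `TraceEvent` names, the event kinds, and the example run are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    """One step in an agent run: an LLM call, tool call, or decision."""
    step: int
    kind: str      # e.g. "llm_call", "tool_call" (illustrative labels)
    name: str      # model or tool name
    inputs: dict
    outputs: dict
    timestamp: float = field(default_factory=time.time)

@dataclass
class Trace:
    """A full agent run, serializable for later inspection or annotation."""
    run_id: str
    events: list = field(default_factory=list)

    def record(self, kind, name, inputs, outputs):
        # Append the next step; step numbers reflect execution order.
        self.events.append(
            TraceEvent(step=len(self.events) + 1, kind=kind,
                       name=name, inputs=inputs, outputs=outputs))

    def to_json(self):
        return json.dumps(asdict(self), indent=2)

# Hypothetical two-step agent run being traced.
trace = Trace(run_id="run-001")
trace.record("llm_call", "planner", {"goal": "summarize logs"},
             {"plan": ["fetch", "summarize"]})
trace.record("tool_call", "fetch_logs", {"service": "api"},
             {"lines": 1342})
print(trace.to_json())
```

Because each event carries its step number, inputs, and outputs, a developer can replay exactly what the agent saw at step 14 even though the thirteen steps before it were non-deterministic.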
The domains where this new architecture is proving most effective are those that benefit significantly from automated "first drafts." Coding agents are leading the charge, as they can interact directly with the file system, execute code, and generate pull requests that humans can then review and refine. Chase pointed out that "The killer application of long-horizon agents right now is places where you have this 'first draft' type of concept." This applies equally to AI SREs traversing logs to diagnose incidents or financial analysts generating initial research reports. In these scenarios, the agent handles the heavy lifting and complex, multi-step data retrieval, while human oversight ensures the final product meets high standards of reliability.
A major component of the modern agent harness is the integration of long-term memory and tools, particularly access to file systems. Whether through direct bash commands or a simulated virtual file system, an agent's ability to read, write, and manage files is essential for maintaining state across lengthy, complex tasks where the LLM's context window alone is insufficient. This memory capability is also crucial for enabling recursive self-improvement: by reflecting on traces from past failures, a process often referred to as "sleep-time compute," agents can autonomously refine their own instructions or planning tools.
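A file-backed scratchpad of this kind can be sketched in a few lines. The `FileMemory` class below is a hypothetical illustration of the pattern, not any framework's memory implementation; a real harness would add sandboxing and access controls.

```python
import tempfile
from pathlib import Path

class FileMemory:
    """Minimal file-backed scratchpad: lets an agent persist notes
    between steps when the context window alone is insufficient."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, text: str) -> None:
        (self.root / f"{key}.md").write_text(text)

    def read(self, key: str) -> str:
        path = self.root / f"{key}.md"
        return path.read_text() if path.exists() else ""

    def append(self, key: str, text: str) -> None:
        self.write(key, self.read(key) + text)

# An agent might jot down a plan early in a run...
mem = FileMemory(tempfile.mkdtemp())
mem.write("plan", "1. Gather logs\n")
# ...and many steps later reload and extend it, instead of
# carrying the full plan in the model's context window.
mem.append("plan", "2. Summarize incidents\n")
print(mem.read("plan"))
```

The same mechanism supports the reflection loop described above: notes distilled from failed traces can be written back as files the agent consults on future runs.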
Furthermore, the non-deterministic nature of agents demands a new approach to evaluation and quality assurance. Unlike traditional software development, where unit tests provide definitive pass/fail criteria, agent evaluation requires incorporating human judgment. This is achieved through "Align Evals," in which humans annotate traces, creating a feedback loop that aligns the agent's performance with desired outcomes. For companies looking to build robust agents, this combination of transparent tracing, externalized memory, and human-in-the-loop validation is proving far more effective than relying on raw model capability alone. The shift is already reshaping talent, with many successful agent engineering teams skewing younger, unencumbered by preconceptions of how software should be built and embracing the iterative, data-driven nature of context engineering.
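The core of such a feedback loop can be reduced to a simple measurement: how often an automated judge agrees with the human annotators on the same traces. The sketch below is illustrative only; the function name, labels, and data are invented, and it is not the actual Align Evals API.

```python
def alignment_score(human_labels, judge_labels):
    """Fraction of traces where the automated judge agrees with the
    human annotator -- a simple proxy for how aligned the eval is."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    agree = sum(h == j for h, j in zip(human_labels, judge_labels))
    return agree / len(human_labels)

# Hypothetical pass/fail annotations over five agent traces.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

score = alignment_score(human, judge)
print(f"judge/human agreement: {score:.0%}")  # 4 of 5 traces agree
```

Once agreement is high enough, the automated judge can grade new traces at scale, with humans spot-checking disagreements rather than annotating every run.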
