When John Yang released SWE-bench in October 2023, the reaction from the AI community was muted. That changed months later, when Cognition Labs launched Devin and validated the benchmark as the de facto standard for evaluating AI coding agents. Yang, a Stanford PhD focused on code evaluations, reflected on the rapid acceleration of the field, calling Devin "an amazing release" that "kicked off the arms race." This arms race has forced the benchmark landscape to evolve far beyond the initial scope of SWE-bench, pushing researchers toward long-horizon, consequential evaluation methods. Yang spoke with Swyx at NeurIPS 2025 about the state of code evals heading into 2026, detailing the proliferation of specialized benchmarks and the shift toward competitive agent environments.
The original SWE-bench, while foundational, was heavily focused on Python and Django repositories. Rapid adoption by major labs, including OpenAI and Anthropic, has driven diversification. Yang and his collaborators quickly expanded the benchmark with Multimodal and Multilingual variants, the latter covering nine languages across 40 repositories, including JavaScript, Rust, Java, C, and Ruby. This expansion addresses the immediate need to test models across the range of languages used in modern software engineering. The success of SWE-bench has also led to a Cambrian explosion of related, independent benchmarks, such as SWE-bench Pro and SWE-bench Live. Yang is optimistic about this decentralization, noting that he is "excited to see how people curate the next sets," particularly as authors move beyond simply adding "more repos" and begin justifying their dataset splits using advanced curation techniques.
SWE-bench verifies a model's patch by running the repository's unit tests, and the shift away from unit tests as the sole form of verification signals a critical philosophical turn in AI evaluation. Yang's new project, CodeClash, aims instead to measure long-horizon development, in which agents maintain codebases and compete in consequential, multi-round programming tournaments.
