The race to build the most capable AI coding assistants is being complicated by a fundamental flaw in the testing grounds themselves. Infrastructure choices, from CPU allocation to memory limits, can swing benchmark results by several percentage points, sometimes by more than the actual gap between leading AI models. Decisions about which AI assistants to deploy may therefore be resting on flawed data.
Benchmarks like SWE-bench and Terminal-Bench 2.0 are the front lines where AI models battle for supremacy in software engineering tasks. Top models often vie for leaderboard positions separated by mere points. However, Anthropic's research reveals that the environment these models run in is far from passive; it's an integral part of the problem-solving process.
The Hidden Variable: Infrastructure
Unlike static benchmarks that simply score output, agentic coding evaluations provide models with a full development environment. The AI writes code, runs tests, installs dependencies, and iterates over multiple turns. When two agents operate with different resource budgets and time limits, they aren't taking the same test.
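To make that setup concrete, here is a minimal sketch of the kind of loop such an evaluation runs. It assumes hypothetical Sandbox and Agent interfaces rather than any real benchmark's harness, and the turn and time budgets are illustrative:

```python
# Minimal sketch of an agentic coding evaluation loop. The Sandbox and Agent
# interfaces and the budgets are hypothetical, not SWE-bench or Terminal-Bench code.
import time
from typing import Protocol


class Sandbox(Protocol):
    def exec(self, command: str) -> str: ...   # run a shell command, return its output
    def tests_pass(self) -> bool: ...          # run the task's test suite


class Agent(Protocol):
    def next_command(self, transcript: list[str]) -> str: ...


def run_episode(agent: Agent, sandbox: Sandbox, instructions: str,
                max_turns: int = 30, time_budget_s: float = 1800.0) -> str:
    """Let the agent act in a live environment until it solves the task or runs out of budget."""
    deadline = time.monotonic() + time_budget_s
    transcript = [instructions]
    for _ in range(max_turns):
        if time.monotonic() > deadline:
            return "timeout"                        # the budget, not the model, decides the outcome
        command = agent.next_command(transcript)    # e.g. edit a file, install a dependency, rerun tests
        transcript.append(sandbox.exec(command))    # can also fail for infrastructure reasons (OOM kill, slow CPU)
        if sandbox.tests_pass():
            return "solved"
    return "unsolved"
```

Two agents evaluated with different `max_turns`, time budgets, or sandbox resource limits are, in effect, sitting different exams.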
Anthropic discovered that its own Terminal-Bench 2.0 setup produced scores that didn't align with the official leaderboard, along with surprisingly high infrastructure error rates: up to 6% of tasks failed because of environmental issues rather than model limitations.
The discrepancy stemmed from how resources were managed. Kubernetes, in this case, treated per-task resource specifications as both a minimum guarantee and a hard kill limit. This left zero headroom for transient spikes in demand, meaning a momentary memory fluctuation could terminate a container that might have otherwise succeeded.
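The following sketch, built with the official Kubernetes Python client, illustrates the two policies in play; the pod names, image, and resource figures are hypothetical, not Anthropic's actual configuration:

```python
# Illustrative contrast between a zero-headroom resource spec and one with headroom.
# Values and names are hypothetical; this is not Anthropic's benchmark configuration.
from kubernetes import client


def task_pod(name: str, resources: client.V1ResourceRequirements) -> client.V1Pod:
    """Build a single-container pod for one benchmark task."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="agent-task",
                    image="benchmark-task:latest",  # hypothetical task image
                    resources=resources,
                )
            ],
        ),
    )


# No headroom: the request is also the limit, so the pod is guaranteed its resources,
# but a transient memory spike past the limit gets the container OOM-killed.
strict = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "4Gi"},
    limits={"cpu": "2", "memory": "4Gi"},
)

# With headroom: the limit sits above the request, so momentary spikes can borrow
# spare node capacity instead of terminating the task.
headroom = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "4Gi"},
    limits={"cpu": "4", "memory": "8Gi"},
)

pods = [task_pod("task-strict", strict), task_pod("task-headroom", headroom)]
```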
The benchmark's official leaderboard uses a more lenient sandboxing provider that allows temporary overallocation, prioritizing stability over strict limits. This difference alone can create significant score variations. Anthropic's experiments showed that increasing resource headroom directly correlated with higher success rates, primarily by reducing infrastructure errors.
