The race to build the smartest AI is getting complicated. Agentic coding benchmarks such as SWE-bench and Terminal-Bench, designed to measure the coding prowess of cutting-edge models, often show top models separated by mere percentage points. These scores are treated as gospel for deciding which AI to deploy, but new research reveals a significant flaw: the underlying infrastructure can move the results by more than the gap between the models themselves.
Internal experiments found that simply changing the server resources allocated to an evaluation run, specifically on Terminal-Bench 2.0, created a 6-percentage-point difference in scores. This gap is wider than the margin separating leading AI models.
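To make that comparison concrete, here is a minimal Python sketch. The two model scores are hypothetical placeholders, not leaderboard figures, and the function name is invented for illustration. Only the 6-point spread comes from the experiments described above.

```python
# Spread in scores attributed to resource configuration alone (from the article).
INFRA_SPREAD_PP = 6.0


def gap_exceeds_infra_spread(score_a: float, score_b: float,
                             infra_spread_pp: float = INFRA_SPREAD_PP) -> bool:
    """Return True only if the gap between two scores is larger than
    the swing that infrastructure settings alone can produce."""
    return abs(score_a - score_b) > infra_spread_pp


# Hypothetical scores (percent of tasks solved) for two models.
model_a, model_b = 58.0, 60.5

# A 2.5-point lead sits well inside a 6-point infrastructure swing,
# so it cannot be attributed to the models with any confidence.
lead_is_meaningful = gap_exceeds_infra_spread(model_a, model_b)
```

Under these assumed numbers, `lead_is_meaningful` is `False`: the apparent lead is smaller than the noise the test environment can introduce.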
Beyond Static Scores
Unlike traditional benchmarks that score output directly, agentic coding evaluations involve complex, multi-turn interactions. AI models write code, run tests, and install dependencies within a dynamic environment. This means the runtime isn't just a passive box; it's an active participant in the problem-solving process.
Two AIs running the same task aren't necessarily taking the same test if their resource budgets and time limits differ. Even benchmark developers are acknowledging this. Terminal-Bench 2.0 now publishes per-task CPU and RAM recommendations, but recommending resources isn't the same as enforcing them consistently, and how they're enforced matters.
The Kubernetes Conundrum
Researchers running Terminal-Bench 2.0 on Google Kubernetes Engine noticed discrepancies between their scores and the official leaderboard, alongside a high rate of infrastructure-related failures: up to 6% of tasks failed due to pod errors that had nothing to do with the AI's coding ability.
The issue stemmed from resource enforcement. Their Kubernetes setup treated resource specifications as both a guaranteed minimum and a hard kill limit. This left no headroom for temporary resource spikes. A momentary surge in memory could crash a container that would have otherwise succeeded.
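The mechanism can be sketched in a few lines of Python. In a Kubernetes container spec, `resources.requests` is the guaranteed minimum and `resources.limits` is the ceiling; a container that exceeds its memory limit is killed. The helper names, the 2048 MiB task size, the 2300 MiB spike, and the 1.5x headroom factor below are all assumptions chosen for illustration, not values from the researchers' setup or from Terminal-Bench.

```python
def build_resources(cpu_cores: float, memory_mib: int, headroom: float = 1.0) -> dict:
    """Build a Kubernetes-style container `resources` mapping.

    headroom=1.0 reproduces the strict setup (requests == limits).
    headroom>1.0 keeps the same guaranteed minimum but raises the ceiling.
    """
    if headroom < 1.0:
        raise ValueError("limits cannot be lower than requests")
    return {
        "requests": {
            "cpu": f"{round(cpu_cores * 1000)}m",
            "memory": f"{memory_mib}Mi",
        },
        "limits": {
            "cpu": f"{round(cpu_cores * headroom * 1000)}m",
            "memory": f"{round(memory_mib * headroom)}Mi",
        },
    }


def survives_memory_spike(resources: dict, peak_memory_mib: int) -> bool:
    """A container is killed once its memory use exceeds the memory limit."""
    limit_mib = int(resources["limits"]["memory"][:-2])  # strip the "Mi" suffix
    return peak_memory_mib <= limit_mib


# Strict setup: the recommendation is both the floor and the kill threshold.
strict = build_resources(cpu_cores=1, memory_mib=2048)

# Hypothetical alternative: same floor, ceiling raised by 50%.
relaxed = build_resources(cpu_cores=1, memory_mib=2048, headroom=1.5)

# A brief spike to 2300 MiB kills the strict container but not the relaxed one.
strict_ok = survives_memory_spike(strict, peak_memory_mib=2300)
relaxed_ok = survives_memory_spike(relaxed, peak_memory_mib=2300)
```

With these assumed values, the same task and the same model produce a failure in one environment and a pass in the other, purely because of how the resource specification is enforced.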
