The performance ceiling for autonomous code generation has just been raised, but not through brute-force model size. A new methodology called TEX (Test-Time Scaling Testing Agents via Execution-based Cross-Validation) leverages peer validation to transform individual Large Language Models into collaborative software engineering agents. This hybrid scaling approach demonstrates that the feedback loop of writing tests, writing code, and learning from peers is essential for achieving state-of-the-art results on complex benchmarks like SWT-Bench and SWE-Bench.
The industry has been grappling with two primary methods for boosting LLM performance at inference time: serial and parallel scaling. Serial scaling, exemplified by OpenAI’s o1 series or Google’s thinking mode, improves reliability by generating extensive reasoning tokens sequentially, but it is inherently slow and difficult to parallelize. Conversely, parallel scaling offers diversity by running multiple models simultaneously, yet it struggles to select the single best candidate from a pool of diverse responses, especially in complex, multi-step tasks like software repair. The challenge has been finding a mechanism that captures the diversity of parallel execution while integrating the deep, iterative refinement of sequential reasoning.
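To make the tradeoff concrete, here is a minimal sketch of the two regimes; the generate and select callables stand in for a model call and a candidate selector, and are assumptions for illustration rather than any vendor's API:

```python
from typing import Callable

# `generate` stands in for a single LLM call and `select` for a candidate
# selector; both are illustrative assumptions, not a specific vendor API.
Generate = Callable[..., str]

def serial_scaling(generate: Generate, problem: str, steps: int) -> str:
    """One reasoning chain refined step by step: reliable, but each step
    must wait for the previous one, so latency grows with depth."""
    draft = generate(problem)
    for _ in range(steps):
        draft = generate(problem, previous=draft)
    return draft

def parallel_scaling(generate: Generate, problem: str, n: int,
                     select: Callable[[list[str]], str]) -> str:
    """n independent samples drawn at once: diverse and parallelizable,
    but some selector must pick a single winner, which is hard for
    multi-step tasks like software repair."""
    candidates = [generate(problem, seed=i) for i in range(n)]
    return select(candidates)
```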
TEX solves this tradeoff by implementing cross-candidate execution feedback as the aggregation strategy. Instead of relying on an LLM-as-a-judge to summarize long, complex agentic traces—a process prone to information loss—TEX uses the fundamental software engineering loop: execution. Multiple agents work in parallel, each generating both a code patch (C) and a test script (T) for a given issue. The crucial step is running every agent's code patch against every other agent's generated test script, providing concrete, execution-based feedback that is fed back into the next round of generation. According to the announcement, this execution-based aggregation avoids the memory overhead and summarization pitfalls that plague other hybrid methods.
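A minimal sketch of what that all-pairs execution step could look like, assuming a local git checkout of the target repository and pytest-style test scripts; the Candidate class, the run_peer_tests helper, and the peer_test.py file name are illustrative placeholders rather than the framework's published interface:

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Candidate:
    patch: str        # code patch C proposed by one agent (unified diff)
    test_script: str  # test script T proposed by the same agent (pytest file)

def run_peer_tests(patch: str, test_script: str, repo: Path) -> bool:
    """Apply one agent's patch, run another agent's test script, report pass/fail."""
    applied = subprocess.run(["git", "apply"], input=patch.encode(), cwd=repo)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    test_file = repo / "peer_test.py"
    test_file.write_text(test_script)
    result = subprocess.run(["python", "-m", "pytest", "peer_test.py", "-q"],
                            cwd=repo, capture_output=True)
    # Reset the working tree so the next pairing starts from a clean state.
    subprocess.run(["git", "checkout", "--", "."], cwd=repo)
    test_file.unlink(missing_ok=True)
    return result.returncode == 0

def cross_validate(candidates: list[Candidate], repo: Path) -> list[list[bool]]:
    """Entry [i][j] answers: does candidate i's patch pass candidate j's tests?"""
    return [[run_peer_tests(c_i.patch, c_j.test_script, repo)
             for c_j in candidates]
            for c_i in candidates]
```

In a production harness each pairing would presumably run in an isolated, time-limited sandbox, but the n-by-n pass/fail matrix is the essential signal that feeds the next round of generation.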
The Power of Peer Validation in Code Generation
The experimental results validate the hypothesis that collaboration is key, particularly in the often-overlooked area of test generation. On SWT-Bench, the benchmark specifically designed for real-world test-case generation, TEX-T achieved state-of-the-art performance, showing a greater than 6% improvement in pass@1 over isolated baselines. This confirms that test generation is not an isolated task; it benefits immensely from the agent simultaneously attempting to solve the underlying coding problem. More importantly, the success of TEX-T's simple pass@1 selection demonstrates that the quality of the generated tests and code is so high that random selection from the ensemble is sufficient for top performance, a significant finding for deployment simplicity.
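For reference, pass@1 over an ensemble is simply the success rate of a uniformly random pick, so "random selection is sufficient" means no selector model is needed at deployment time; a small illustrative sketch (function names are assumptions):

```python
import random

def pass_at_1(resolved: list[bool]) -> float:
    """pass@1 over an n-candidate ensemble: the chance that a uniformly random
    candidate resolves the instance; no learned selector is involved."""
    return sum(resolved) / len(resolved)

def pick_one(candidates: list[str]) -> str:
    """The 'no selector needed' deployment path the result above points to."""
    return random.choice(candidates)
```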
The benefits extend directly to bug fixing, with TEX-C (the code output) also showing performance gains on SWE-Bench. The core finding here is that cross-candidate execution feedback continues to improve performance over successive rounds, whereas agents that only receive feedback from their own generated tests quickly stagnate. While the absolute SWE-Bench scores were constrained by the use of a "simple scaffold" that lacked access to standard developer tools such as bash commands or internet access, the relative improvement over the baseline is clear. This limitation underscores a critical architectural point: even the most sophisticated agentic reasoning is bottlenecked by the quality of its interface and environment, suggesting that future software engineering agents must combine advanced collaboration with robust operating environments.
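A rough sketch of that round structure, reusing the Candidate type and cross_validate matrix from the earlier snippet; the regenerate callable stands in for another agent invocation, and the feedback wording is purely illustrative:

```python
from pathlib import Path
from typing import Callable

# Builds on the Candidate type and cross_validate() sketch above; `regenerate`
# stands in for another agent call, and its signature is an assumption.
Regenerate = Callable[[Candidate, str], Candidate]

def refine_over_rounds(candidates: list[Candidate], repo: Path, rounds: int,
                       regenerate: Regenerate) -> list[Candidate]:
    for _ in range(rounds):
        matrix = cross_validate(candidates, repo)  # [i][j]: patch i vs tests j
        next_round = []
        for i, cand in enumerate(candidates):
            # Cross-candidate feedback: how patch i fared across every candidate's
            # tests, and how every candidate's patch fared on its own tests,
            # rather than a self-check against its own tests alone.
            feedback = (
                f"your patch passed {sum(matrix[i])}/{len(candidates)} candidate test scripts; "
                f"your tests were passed by {sum(row[i] for row in matrix)}/{len(candidates)} candidate patches"
            )
            next_round.append(regenerate(cand, feedback))
        candidates = next_round
    return candidates
```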
The TEX framework represents a crucial evolution in agent design, moving beyond monolithic LLM performance toward distributed, collaborative intelligence modeled after human engineering teams. By replacing abstract natural-language summarization with concrete execution feedback, it establishes a robust, scalable blueprint for building reliable software engineering agents. The takeaway for the industry is that the next frontier of autonomous coding isn’t just better models, but better systems that enforce the verifier-generator loop: agents that write tests, fix code, and learn from their peers are the ones that will truly ship production-ready software.
STRUCTURED DATA: Organization: Salesforce AI Research | Category: AI/Software Engineering Agents | Release: TEX Framework | Impact: 9/10


