The promise of foundation models in automated code generation has outpaced their practical application in realistic software engineering settings. Autonomous agents, while powerful, remain unreliable. This paper challenges the prevailing narrative that these limitations lie solely within the foundation model itself.
The Systemic Nature of Software Engineering Capability
The researchers propose that effective software engineering capability emerges from the interplay between a foundation model, a mediating harness, and the development environment. This AI Harness acts as a critical intermediary, dictating how an agent perceives a project, executes actions, receives feedback, and confirms task completion. This reframes the problem from individual model prowess to the architecture of the entire system. The harness is formalized with eleven key responsibilities, including task specification, context selection, tool access, project memory, and verification.
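The mediating role described above can be sketched as a minimal interface. This is a hypothetical illustration, not the paper's actual API: the class name, fields, and the trivial context-selection policy are all assumptions made for clarity, covering only a few of the eleven responsibilities (task specification, context selection, tool access, project memory, verification).

```python
from dataclasses import dataclass, field

# Hypothetical sketch: names and fields are illustrative, not the paper's API.
@dataclass
class Harness:
    """Mediates between a foundation model and the development environment."""
    task_spec: str                                 # task specification given to the agent
    context: list = field(default_factory=list)    # selected files/snippets the agent perceives
    tools: list = field(default_factory=list)      # tool names exposed for executing actions
    memory: dict = field(default_factory=dict)     # persistent project memory

    def select_context(self, files, budget=2):
        """Pick at most `budget` files as context (placeholder policy)."""
        self.context = files[:budget]
        return self.context

    def verify(self, tests_passed: bool) -> str:
        """Confirm task completion from an external feedback signal."""
        return "done" if tests_passed else "retry"

harness = Harness(task_spec="fix failing unit test")
harness.select_context(["src/app.py", "tests/test_app.py", "README.md"])
print(harness.context)       # the two files within the context budget
print(harness.verify(True))  # done
```

The point of the sketch is the framing: the model never touches the environment directly; every perception, action, and completion signal is routed through the harness.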
A Ladder of Runtime Support for Autonomous Agents
To operationalize this concept, the paper introduces a four-level harness ladder (H0-H3), where each level incrementally exposes more runtime support to the agent. This graduated approach allows for systematic evaluation and development. The framework's evaluation protocol generates auditable 'episode packages,' whose evidence structure varies with the harness level: higher levels yield richer outputs, such as reproduction logs, failure attributions, and structured verification reports, moving beyond simple patch generation toward auditable, end-to-end software engineering with foundation models.
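One way to picture how evidence accumulates up the ladder is a small sketch, under stated assumptions: the mapping of specific evidence fields to specific levels is invented for illustration, since the source only says that higher levels yield richer outputs.

```python
from enum import IntEnum

# Hypothetical sketch: the per-level evidence assignment is illustrative,
# loosely following the H0-H3 ladder described above.
class HarnessLevel(IntEnum):
    H0 = 0  # patch generation only
    H1 = 1  # + reproduction logs
    H2 = 2  # + failure attribution
    H3 = 3  # + structured verification report

def episode_package(level: HarnessLevel, patch: str) -> dict:
    """Assemble an auditable episode package; higher levels add more evidence."""
    package = {"level": level.name, "patch": patch}
    if level >= HarnessLevel.H1:
        package["repro_log"] = []              # commands and outputs used to reproduce
    if level >= HarnessLevel.H2:
        package["failure_attribution"] = None  # suspected faulty component
    if level >= HarnessLevel.H3:
        package["verification_report"] = {}    # structured pass/fail evidence
    return package

print(sorted(episode_package(HarnessLevel.H0, "diff").keys()))
print(sorted(episode_package(HarnessLevel.H3, "diff").keys()))
```

The monotone structure is the key property: an H3 package strictly contains everything an H0 package does, so results at different levels remain comparable while higher levels stay auditable.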