The recent launch party for Terminal-Bench 2.0 and Harbor, hosted by Mike Merrill and Alex Shaw, marked a pivotal shift in how AI agents are evaluated: command-line interface (CLI) interactions are becoming the standard for measuring agent performance. The event, which included a fireside chat with industry leaders Andy Konwinski and Ludwig Schmidt, underscored the need for robust, standardized benchmarks and tools to accurately measure the capabilities of increasingly sophisticated AI agents.
Mike Merrill, co-creator of Terminal-Bench, explained that the initial iteration, Terminal-Bench 1.0, was built on the premise that efficient agent-computer interaction would occur predominantly via the CLI rather than graphical user interfaces (GUIs). He illustrated the point with a compelling example: "If you've ever tried to launch an EC2 instance from the GUI, it's a ton of menus, takes you 20 or 30 clicks. In comparison, there's just a single CLI command that can do this for you." This belief in the CLI's superior efficiency for expert tasks became the bedrock of the team's benchmarking efforts.