Automating routine online tasks remains a significant hurdle to the widespread adoption of AI agents as general-purpose assistants. While an AI agent can already manage an inbox, the broader range of everyday digital interactions still presents a formidable challenge.
Bridging the Gap: The ClawBench Framework
To address this, the researchers introduce ClawBench, an evaluation framework that tests AI agents on 153 real-world tasks across 144 live platforms. Spanning categories from e-commerce to job applications, ClawBench demands capabilities beyond those tested by existing benchmarks, including information extraction from user documents, multi-step cross-platform navigation, and extensive form completion. Crucially, unlike static, offline sandboxes, ClawBench operates on production websites, so it reflects the dynamic and complex nature of real-world web interactions.
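
To make the shape of such a benchmark concrete, here is a minimal Python sketch of how a task record and a coverage summary might be represented. It is an illustration only, not ClawBench's actual schema: the names `Task`, `Capability`, `summarize`, and the `example.com` hosts are all hypothetical, and only the three capability descriptions and the two category names come from the text above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    # The three capabilities named in the text; the enum itself is hypothetical.
    DOCUMENT_EXTRACTION = "information extraction from user documents"
    CROSS_PLATFORM_NAVIGATION = "multi-step cross-platform navigation"
    FORM_COMPLETION = "extensive form completion"


@dataclass(frozen=True)
class Task:
    # Hypothetical record layout, assumed for illustration.
    task_id: str
    category: str
    platforms: tuple[str, ...]  # live sites the agent must visit
    capabilities: frozenset[Capability]


def summarize(tasks: list[Task]) -> dict:
    """Count tasks, distinct platforms, and tasks per category."""
    return {
        "tasks": len(tasks),
        "platforms": len({p for t in tasks for p in t.platforms}),
        "by_category": dict(Counter(t.category for t in tasks)),
    }


# Two made-up tasks with placeholder hosts, not real benchmark entries.
EXAMPLE_TASKS = [
    Task(
        task_id="shop-001",
        category="e-commerce",
        platforms=("shop.example.com",),
        capabilities=frozenset({Capability.FORM_COMPLETION}),
    ),
    Task(
        task_id="jobs-001",
        category="job applications",
        platforms=("jobs.example.com", "docs.example.com"),
        capabilities=frozenset(Capability),  # needs all three capabilities
    ),
]

if __name__ == "__main__":
    print(summarize(EXAMPLE_TASKS))
```

Applied to the full benchmark, a summary of this kind would be expected to report the 153 tasks and 144 platforms cited above. A task may touch more than one site, which is why the sketch counts distinct platforms separately from tasks.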