Unlocking AI Agents with Gym-Anything

Environment Creation as a Multi-Agent Task

Gym-Anything reframes environment creation, normally a slow manual process, as a scalable multi-agent problem. A coding agent generates the setup scripts, acquires data, and configures the software, while an audit agent verifies the resulting setup against quality standards. Automating both the build and the check lowers the barrier to developing sophisticated AI agent environments.
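The division of labor between the two agents can be pictured as a propose-and-verify loop. The sketch below is only an illustration under stated assumptions: the names `build_environment`, `AuditReport`, `toy_coding_agent`, and `toy_audit_agent` are hypothetical and are not the Gym-Anything API, and the toy agents stand in for what would be model calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AuditReport:
    """Hypothetical result of one audit pass over a setup script."""
    passed: bool
    issues: List[str] = field(default_factory=list)


def build_environment(
    propose: Callable[[List[str]], str],
    audit: Callable[[str], AuditReport],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate between a coding agent and an audit agent until the audit passes.

    Returns the accepted setup script and the number of rounds it took.
    """
    feedback: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        script = propose(feedback)   # coding agent drafts or revises the setup
        report = audit(script)       # audit agent checks it against quality standards
        if report.passed:
            return script, round_idx
        feedback = report.issues     # issues are fed into the next draft
    raise RuntimeError(f"setup still failing audit after {max_rounds} rounds")


# Toy stand-ins (assumptions for illustration only).
def toy_coding_agent(feedback: List[str]) -> str:
    steps = ["install_app"]
    if any("data" in issue for issue in feedback):
        steps.append("download_data")
    return "\n".join(steps)


def toy_audit_agent(script: str) -> AuditReport:
    issues = []
    if "download_data" not in script:
        issues.append("missing realistic data")
    return AuditReport(passed=not issues, issues=issues)


script, rounds = build_environment(toy_coding_agent, toy_audit_agent)
```

In this toy run the first draft fails the audit because it lacks a data step, and the second draft passes once the audit feedback has been incorporated.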

CUA-World: A Scalable Benchmark for Real-World AI Agents

Using Gym-Anything, the researchers built CUA-World, a benchmark of over 10,000 long-horizon tasks derived from economically relevant occupations in fields such as medicine, astronomy, and enterprise systems. The tasks come with realistic data and train/test splits. The benchmark also includes CUA-World-Long, a challenging subset whose tasks exceed 500 steps, extending scope and complexity beyond existing benchmarks.

Vision-Language Models Achieve Efficiency and Improved Performance

The research reports that distilling successful agent trajectories from CUA-World into a 2B vision-language model yields performance exceeding that of models twice its size. In addition, using an auditing VLM at test time to give feedback on incomplete trajectories raised Gemini-3-Flash's performance on CUA-World-Long from 11.5% to 14.0%, a gain of 2.5 percentage points, which suggests that guided feedback loops help on complex, long-horizon agent tasks.
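One way to picture test-time auditing is as a loop in which an auditor inspects the trajectory so far and, if the task looks incomplete, returns feedback that the agent uses to continue. The following is a minimal sketch under assumptions, not the researchers' implementation: `run_with_audit`, `toy_agent`, and `toy_auditor` are hypothetical names, and in practice both roles would be played by vision-language models operating on screenshots rather than on action strings.

```python
from typing import Callable, List, Optional

Trajectory = List[str]


def run_with_audit(
    agent: Callable[[str, Trajectory, Optional[str]], Trajectory],
    auditor: Callable[[str, Trajectory], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> Trajectory:
    """Let the agent act, then let the auditor flag incomplete work.

    The auditor returns None when it considers the task complete,
    or a feedback string describing what is still missing.
    """
    trajectory: Trajectory = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # The agent extends the trajectory, optionally guided by feedback.
        trajectory = trajectory + agent(task, trajectory, feedback)
        feedback = auditor(task, trajectory)
        if feedback is None:
            break
    return trajectory


# Toy stand-ins (assumptions for illustration only).
def toy_agent(task: str, so_far: Trajectory, feedback: Optional[str]) -> Trajectory:
    if feedback is None:
        return ["open_file", "edit_cell"]   # first attempt forgets to save
    return ["save_file"]                    # follow-up prompted by the auditor


def toy_auditor(task: str, trajectory: Trajectory) -> Optional[str]:
    if "save_file" not in trajectory:
        return "The file was edited but never saved."
    return None


result = run_with_audit(toy_agent, toy_auditor, "update the spreadsheet")
```

Here the auditor catches the missing save step after the first round, and the agent completes the task in the second.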