The inherent uncertainty in hybrid action spaces, where Computer Use Agents (CUAs) can choose at each step between granular GUI interactions and high-level tool calls, makes optimal execution difficult. The challenge is compounded by the scarcity of high-quality interleaved GUI-Tool trajectories and the difficulty of collecting real-world tool-usage data.
Synthesizing Hybrid Trajectories at Scale
Addressing this gap, the researchers introduce ToolCUA, an end-to-end agent trained with a staged approach. A core innovation is the Interleaved GUI-Tool Trajectory Scaling Pipeline, which repurposes abundant static GUI trajectories and synthesizes a grounded tool library, generating diverse GUI-Tool trajectories without costly manual engineering or reliance on brittle real-world tool-data collection. This enables robust learning in complex hybrid action spaces.
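To make the pipeline idea concrete, here is a minimal sketch of how static GUI trajectories could be rewritten into interleaved GUI-Tool trajectories: subsequences of low-level actions that a grounded tool can accomplish in one call are substituted with that call. All names here (`TOOL_LIBRARY`, `interleave`, the action strings) are illustrative assumptions, not the paper's actual data format or API.

```python
# Hypothetical tool library: a known GUI action subsequence (as a tuple)
# maps to an equivalent high-level tool call.
TOOL_LIBRARY = {
    ("open_app:files", "click:search", "type:report.pdf"):
        "tool:file_search('report.pdf')",
    ("click:menu", "click:save_as", "type:draft.docx"):
        "tool:save_file('draft.docx')",
}

def interleave(gui_trajectory):
    """Greedily replace known GUI subsequences with single tool calls."""
    result, i = [], 0
    while i < len(gui_trajectory):
        for seq, tool_call in TOOL_LIBRARY.items():
            if tuple(gui_trajectory[i:i + len(seq)]) == seq:
                result.append(tool_call)      # one tool call covers the subsequence
                i += len(seq)
                break
        else:
            result.append(gui_trajectory[i])  # keep the raw GUI action
            i += 1
    return result

trajectory = ["click:desktop", "open_app:files", "click:search",
              "type:report.pdf", "click:open"]
print(interleave(trajectory))
# -> ['click:desktop', "tool:file_search('report.pdf')", 'click:open']
```

The same source trajectory can yield multiple interleaved variants (different tool coverage, different substitution points), which is what lets a static GUI corpus scale into diverse hybrid training data.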
Bootstrapping Smarter Switching Decisions
ToolCUA's training proceeds in two phases. First, Tool-Bootstrapped GUI RFT combines supervised fine-tuning (SFT) with single-turn reinforcement learning (RL) to refine decisions at critical GUI-Tool switching junctures; this warmup phase teaches the agent when to transition between action modalities. The agent is then optimized with Online Agentic RL in a high-fidelity GUI-Tool environment. A key element here is the Tool-Efficient Path Reward, which incentivizes not only correct tool utilization but also the discovery of shorter, more efficient execution paths.

Experiments on OSWorld-MCP demonstrate the efficacy of this approach: ToolCUA achieves 46.85% accuracy, a substantial 66% relative improvement over baselines and a 3.9% gain over GUI-only methods, establishing a new state of the art among comparable models.
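A reward of this shape can be sketched as follows: task success gates the reward, and bonuses are added for correct tool use and for paths shorter than a reference (e.g., GUI-only) path. The weights and the exact functional form are assumptions for illustration, not the paper's published formula.

```python
def path_reward(success, correct_tool_calls, total_tool_calls,
                path_len, reference_len, w_tool=0.3, w_eff=0.2):
    """Illustrative 'tool-efficient path' style reward (assumed form).

    Returns 0 on failure; on success, adds a bonus for tool-call accuracy
    and a bonus that grows as the executed path shortens relative to the
    reference path length.
    """
    if not success:
        return 0.0
    tool_acc = (correct_tool_calls / total_tool_calls) if total_tool_calls else 1.0
    # Efficiency bonus in [0, 1]: 0 when the path matches the reference length,
    # approaching 1 as the path gets much shorter.
    efficiency = max(0.0, 1.0 - path_len / reference_len)
    return 1.0 + w_tool * tool_acc + w_eff * efficiency

# A 6-step hybrid path vs. a 12-step GUI-only reference, both tool calls correct:
print(path_reward(True, 2, 2, path_len=6, reference_len=12))
# -> 1.4  (base 1.0 + 0.3 tool accuracy + 0.2 * 0.5 efficiency)
```

Making the efficiency term depend on a reference path is one way to keep the incentive relative rather than absolute, so the agent is pushed toward tool calls only where they genuinely shorten execution.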