The inherent uncertainty in hybrid action spaces, where Computer Use Agents (CUAs) can choose at each step between granular GUI interactions and high-level tool calls, makes optimal execution difficult. The challenge is compounded by the scarcity of high-quality interleaved GUI-Tool trajectories and the difficulty of collecting real-world tool-usage data.
Synthesizing Hybrid Trajectories at Scale
Addressing this gap, the researchers introduce ToolCUA, an end-to-end agent trained with a staged approach. A core innovation is the Interleaved GUI-Tool Trajectory Scaling Pipeline, which repurposes abundant static GUI trajectories and synthesizes a grounded tool library, generating diverse GUI-Tool trajectories without costly manual engineering or reliance on brittle real-world tool-data collection. This enables robust learning in complex hybrid action spaces.
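To make the pipeline idea concrete, here is a minimal sketch of how static GUI trajectories could be rewritten into interleaved GUI-Tool trajectories: subsequences of low-level actions that a grounded tool can accomplish in one call are substituted with that call. All names here (`TOOL_LIBRARY`, `interleave`, the action strings) are illustrative assumptions, not the paper's actual data format or API.

```python
# Hypothetical tool library: a known GUI action subsequence (as a tuple)
# maps to an equivalent high-level tool call.
TOOL_LIBRARY = {
    ("open_app:files", "click:search", "type:report.pdf"):
        "tool:file_search('report.pdf')",
    ("click:menu", "click:save_as", "type:draft.docx"):
        "tool:save_file('draft.docx')",
}

def interleave(gui_trajectory):
    """Greedily replace known GUI subsequences with single tool calls."""
    result, i = [], 0
    while i < len(gui_trajectory):
        for seq, tool_call in TOOL_LIBRARY.items():
            if tuple(gui_trajectory[i:i + len(seq)]) == seq:
                result.append(tool_call)      # one tool call covers the subsequence
                i += len(seq)
                break
        else:
            result.append(gui_trajectory[i])  # keep the raw GUI action
            i += 1
    return result

trajectory = ["click:desktop", "open_app:files", "click:search",
              "type:report.pdf", "click:open"]
print(interleave(trajectory))
# -> ['click:desktop', "tool:file_search('report.pdf')", 'click:open']
```

The same source trajectory can yield multiple interleaved variants (different tool coverage, different substitution points), which is what lets a static GUI corpus scale into diverse hybrid training data.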
Bootstrapping Smarter Switching Decisions
ToolCUA's training proceeds in two phases. First, Tool-Bootstrapped GUI RFT combines supervised fine-tuning (SFT) with single-turn reinforcement learning (RL) to refine decisions at critical GUI-Tool switching junctures; this warmup phase teaches the agent when to transition between action modalities. The agent is then optimized with Online Agentic RL in a high-fidelity GUI-Tool environment. A key element here is the Tool-Efficient Path Reward, which incentivizes not only correct tool utilization but also the discovery of shorter, more efficient execution paths.

Experiments on OSWorld-MCP demonstrate the efficacy of this approach: ToolCUA achieves 46.85% accuracy, a substantial 66% relative improvement over baselines and a 3.9% gain over GUI-only methods, establishing a new state of the art among comparable models.
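A reward of this shape can be sketched as follows: task success gates the reward, and bonuses are added for correct tool use and for paths shorter than a reference (e.g., GUI-only) path. The weights and the exact functional form are assumptions for illustration, not the paper's published formula.

```python
def path_reward(success, correct_tool_calls, total_tool_calls,
                path_len, reference_len, w_tool=0.3, w_eff=0.2):
    """Illustrative 'tool-efficient path' style reward (assumed form).

    Returns 0 on failure; on success, adds a bonus for tool-call accuracy
    and a bonus that grows as the executed path shortens relative to the
    reference path length.
    """
    if not success:
        return 0.0
    tool_acc = (correct_tool_calls / total_tool_calls) if total_tool_calls else 1.0
    # Efficiency bonus in [0, 1]: 0 when the path matches the reference length,
    # approaching 1 as the path gets much shorter.
    efficiency = max(0.0, 1.0 - path_len / reference_len)
    return 1.0 + w_tool * tool_acc + w_eff * efficiency

# A 6-step hybrid path vs. a 12-step GUI-only reference, both tool calls correct:
print(path_reward(True, 2, 2, path_len=6, reference_len=12))
# -> 1.4  (base 1.0 + 0.3 tool accuracy + 0.2 * 0.5 efficiency)
```

Making the efficiency term depend on a reference path is one way to keep the incentive relative rather than absolute, so the agent is pushed toward tool calls only where they genuinely shorten execution.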