OS-Themis: Scalable Rewards for GUI Agents

OS-Themis pairs a scalable, milestone-based multi-agent critic framework for GUI agent training with a new benchmark, OGRBench, reporting a 10.3% gain in online RL training and a 6.9% gain in self-training on AndroidWorld.

Figure: Diagram of the OS-Themis framework with agents, milestones, and review mechanisms.

The efficacy of Reinforcement Learning (RL) for enhancing the robustness of GUI agents in complex, stochastic environments hinges on the quality of the reward function. Current reward-engineering approaches, however, struggle to balance scalability against reward quality. Addressing this challenge, researchers introduce OS-Themis, a multi-agent critic framework designed for both scale and precision.

Decomposing Complexity: Milestone-Based Auditing

OS-Themis moves beyond monolithic reward signals: it decomposes each agent trajectory into a series of verifiable milestones, isolates the critical evidence behind each decision, and applies a review mechanism that audits the chain of evidence before rendering a final verdict. This structured auditing is what lets it overcome the limitations of single-judge reward systems.
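
The paper's components are not reproduced here, but a minimal Python sketch conveys the shape of the idea; `Milestone`, `verify`, and `review` are hypothetical stand-ins for the critic agents and the review stage, not the authors' actual API:

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    """One verifiable sub-goal carved out of an agent trajectory."""
    description: str       # e.g. "the Wi-Fi settings screen is open"
    evidence: list[str]    # screenshots / UI states cited as proof
    verified: bool = False

def audit_trajectory(milestones: list[Milestone], verify, review) -> bool:
    """Check each milestone in order, then have a reviewer re-audit the
    full chain of evidence before the final verdict is rendered."""
    for m in milestones:
        m.verified = verify(m)   # per-milestone critic judgment
        if not m.verified:
            return False         # a broken link in the chain fails the task
    return review(milestones)    # second-stage review of the whole chain
```

The key design choice is that no single judgment is final: the reviewer sees the entire audited chain of evidence, not just the last step.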

OmniGUIRewardBench: A Holistic Evaluation Standard

To enable robust evaluation of such reward mechanisms, the authors also present OmniGUIRewardBench (OGRBench), a comprehensive, cross-platform benchmark tailored specifically to GUI outcome rewards. The benchmark lets the community assess and compare reward strategies on a common footing. Notably, models evaluated under OS-Themis consistently reached their peak performance, underscoring the framework's effectiveness.
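
OGRBench's exact schema and metrics are not detailed in this summary. Assuming the common setup for outcome-reward benchmarks, where each trajectory carries a ground-truth success label, evaluation reduces to verdict accuracy; the `critic` callable and the `trajectory`/`label` field names below are illustrative:

```python
def reward_accuracy(critic, benchmark: list[dict]) -> float:
    """Score an outcome-reward critic against labeled trajectories:
    each item pairs a trajectory with a ground-truth success label,
    and the metric is plain verdict accuracy."""
    correct = sum(
        critic(item["trajectory"]) == item["label"] for item in benchmark
    )
    return correct / len(benchmark)
```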

Empirical Gains: Driving Agent Evolution

Extensive experiments in the AndroidWorld environment demonstrate the practical impact of OS-Themis. Used to support online RL training, the framework yielded a substantial 10.3% improvement; used for trajectory validation and filtering within a self-training loop, it provided a 6.9% gain. These results underscore how OS-Themis rewards can accelerate and refine agent evolution.
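
The two reported integration modes map naturally onto a couple of small adapters. As a hedged sketch, with `critic` standing in for an OS-Themis verdict call (the function names are illustrative, not from the paper):

```python
def outcome_reward(trajectory, critic) -> float:
    """Online RL: the critic's pass/fail verdict becomes a sparse reward."""
    return 1.0 if critic(trajectory) else 0.0

def filter_for_self_training(trajectories, critic) -> list:
    """Self-training: keep only trajectories the critic verifies, so the
    agent is fine-tuned on audited successes rather than raw rollouts."""
    return [t for t in trajectories if critic(t)]
```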