OS-Themis: Scalable Rewards for Robust RL

OS-Themis, a new multi-agent critic framework, revolutionizes GUI agent training by providing scalable, accurate rewards through milestone decomposition and evidence auditing.

Figure: Diagram illustrating the OS-Themis framework with decomposed trajectory milestones. (Image credit: StartupHub.ai)

The inherent stochasticity of graphical user interfaces (GUIs) presents a significant hurdle for reinforcement learning (RL) agents. Traditional reward function design struggles to balance scalability and performance, often leading to brittle agent behavior. Addressing this, researchers have introduced OS-Themis, a novel multi-agent critic framework designed to enhance the robustness of GUI agents.

Decomposing Complexity for Accurate Reward Signals

OS-Themis tackles reward-function sensitivity by moving beyond single-judge paradigms. It decomposes complex agent trajectories into a series of verifiable milestones, isolating the critical evidence at each step and creating a more granular, accurate basis for reward calculation. A review mechanism then audits this evidence chain, ensuring the integrity of the final reward verdict. This structured approach is key to making reinforcement learning rewards for GUI agents more reliable.
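To make the pipeline concrete, here is a minimal Python sketch of milestone decomposition with an evidence audit. Every name in it (`Milestone`, `decompose`, `judge_milestone`, `audit_evidence`) is an illustrative stand-in rather than the OS-Themis API, and the keyword-matching critics substitute for the model-based judges the actual framework would use; treat it as a sketch of the idea, not the authors' implementation.

```python
"""Hypothetical sketch of a milestone-decomposition reward pipeline.

The class/function names are illustrative, not the OS-Themis API.
Critic calls are stubbed with keyword checks where the framework
would use model-based judges.
"""
from dataclasses import dataclass


@dataclass
class Milestone:
    goal: str            # verifiable sub-goal, e.g. "Wi-Fi settings opened"
    evidence: list[str]  # step-level evidence isolated for this milestone
    passed: bool = False


def decompose(trajectory: list[dict], goals: list[str]) -> list[Milestone]:
    """Assign each step's observation to the sub-goal it supports."""
    milestones = [Milestone(goal=g, evidence=[]) for g in goals]
    for step in trajectory:
        for m in milestones:
            if m.goal.lower() in step["observation"].lower():
                m.evidence.append(step["observation"])
    return milestones


def judge_milestone(m: Milestone) -> bool:
    """Per-milestone critic: here, any supporting evidence suffices."""
    return len(m.evidence) > 0


def audit_evidence(milestones: list[Milestone]) -> bool:
    """Reviewer pass: flag an inconsistent evidence chain, i.e. a later
    milestone passing after an earlier prerequisite failed."""
    seen_failure = False
    for m in milestones:
        if not m.passed:
            seen_failure = True
        elif seen_failure:
            return False  # chain is broken; verdict not trustworthy
    return True


def reward(trajectory: list[dict], goals: list[str]) -> float:
    milestones = decompose(trajectory, goals)
    for m in milestones:
        m.passed = judge_milestone(m)
    if not audit_evidence(milestones):
        return 0.0  # the audited verdict overrides raw milestone votes
    return sum(m.passed for m in milestones) / len(milestones)


if __name__ == "__main__":
    traj = [{"observation": "Settings app opened"},
            {"observation": "Wi-Fi toggle switched on"}]
    print(reward(traj, ["Settings app opened", "Wi-Fi toggle"]))  # 1.0
```

The audit step is what distinguishes this from naive milestone counting: a verdict only stands if the evidence supporting it forms a coherent chain from start to finish.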

OmniGUIRewardBench: A Holistic Evaluation Standard

To rigorously assess the efficacy of reward frameworks, the authors also introduce OmniGUIRewardBench (OGRBench), a cross-platform benchmark focused specifically on GUI outcome rewards that provides a standardized environment for evaluating reward quality. The pairing has already demonstrated its value: every evaluated model achieved its peak performance under OS-Themis, highlighting the framework's potential as a foundational component for future GUI agent development.
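As a rough illustration of how a reward framework might be scored on an outcome-reward benchmark like OGRBench, the sketch below measures verdict accuracy against ground-truth labels. The case schema (`task`, `trajectory`, `label`) and the 0.5 decision threshold are assumptions; the article does not describe OGRBench's actual format.

```python
def evaluate(reward_fn, cases: list[dict]) -> float:
    """Fraction of trajectories whose predicted pass/fail verdict
    matches the benchmark's ground-truth outcome label."""
    correct = sum(
        (reward_fn(c["task"], c["trajectory"]) >= 0.5) == c["label"]
        for c in cases
    )
    return correct / len(cases)


# Toy example: a reward_fn that always predicts success is right
# on exactly the successful half of this two-case set.
cases = [{"task": "toggle wifi", "trajectory": [], "label": True},
         {"task": "send email", "trajectory": [], "label": False}]
print(evaluate(lambda task, traj: 1.0, cases))  # 0.5
```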

Quantifiable Performance Gains in Agent Evolution

Extensive experiments on AndroidWorld underscore the practical impact of OS-Themis. Used to supply rewards for online RL training, the framework yielded a 10.3% improvement in agent performance; applied to trajectory validation and filtering within self-training loops, it delivered a further 6.9% gain. These results suggest OS-Themis is not merely an incremental improvement but a meaningful step toward more capable and robust GUI agents.
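The second result, trajectory validation and filtering, amounts to reward-gated data selection in a self-training loop. Here is a minimal sketch of that idea; the function name and threshold are illustrative choices, not the paper's recipe.

```python
def filter_for_self_training(rollouts: list[dict], reward_fn,
                             threshold: float = 1.0) -> list[dict]:
    """Keep only rollouts the critic verifies end to end; the kept
    set becomes supervised fine-tuning data for the next round."""
    return [r for r in rollouts
            if reward_fn(r["task"], r["trajectory"]) >= threshold]
```

The design choice is simple but consequential: a more accurate reward verdict means fewer false positives contaminating the fine-tuning set, which is presumably where the reported 6.9% gain comes from.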