PRIMO R1: Active Critics for Robotic Manipulation

PRIMO R1 transforms video MLLMs into active critics for robotic manipulation via outcome-based RL, achieving SOTA on RoboFail and outperforming larger models.


Long-horizon robotic manipulation has been hampered by the inability of current video MLLMs to actively evaluate task progress. These models, typically trained via Supervised Fine-Tuning (SFT), primarily function as passive observers rather than critical evaluators of the current state against the final task goal. The introduction of PRIMO R1, a 7B framework, marks a pivotal shift by transforming these models into active "Critics".

From Passive Observers to Active Critics

PRIMO R1 leverages outcome-based Reinforcement Learning to explicitly incentivize Chain-of-Thought generation for progress estimation. This approach fundamentally alters the model's role, moving beyond simple event recognition to a more analytical function. The architecture is further enhanced by constructing a structured temporal input, explicitly anchoring the video sequence between initial and current state images, providing crucial temporal context for reasoning.
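The two ideas above can be sketched in a few lines: a structured temporal input that brackets the video frames between the initial-state and current-state images, and an outcome-based reward that scores only the final progress estimate while leaving the chain-of-thought free-form. This is a minimal illustrative sketch; the function names, message schema, and tolerance are assumptions, not the paper's actual API.

```python
# Hypothetical sketch of PRIMO R1's structured temporal input and an
# outcome-based reward signal. All names and fields are illustrative
# assumptions, not the framework's real interface.

def build_temporal_input(initial_image, frames, current_image, task_goal):
    """Anchor the video sequence between the initial and current state images,
    giving the model explicit temporal context for progress reasoning."""
    return [
        {"type": "text", "text": f"Task goal: {task_goal}"},
        {"type": "image", "image": initial_image, "label": "initial_state"},
        *[{"type": "video_frame", "image": f} for f in frames],
        {"type": "image", "image": current_image, "label": "current_state"},
        {"type": "text",
         "text": "Reason step by step, then estimate task progress in [0, 1]."},
    ]

def outcome_reward(predicted_progress, true_progress, tolerance=0.1):
    """Outcome-based reward: only the final progress estimate is scored,
    so the chain-of-thought itself is incentivized, not constrained."""
    return 1.0 if abs(predicted_progress - true_progress) <= tolerance else 0.0
```

Because the reward depends only on the outcome, the RL objective can reinforce whatever intermediate reasoning leads to accurate progress estimates, which is the shift from passive recognition to active critique described above.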

State-of-the-Art Performance and Generalization

The efficacy of PRIMO R1 is validated by extensive experiments on the proposed PRIMO Dataset and Benchmark, demonstrating state-of-the-art performance across diverse in-domain environments and out-of-domain real-world humanoid scenarios. Notably, the 7B PRIMO R1 model achieves a 50% reduction in mean absolute error compared to specialized reasoning baselines, and shows significant relative accuracy improvements over 72B-scale general MLLMs. Its capabilities extend to robust zero-shot generalization on challenging failure detection tasks: on the RoboFail benchmark, PRIMO R1 attains 67.0% accuracy, surpassing closed-source models such as OpenAI o1 by 6.0%, underscoring its advanced progress-critique capabilities in robot manipulation.