Long-horizon robotic manipulation has been hampered by the inability of current video MLLMs to actively evaluate task progress. These models, typically trained via Supervised Fine-Tuning (SFT), primarily function as passive observers rather than critical evaluators of the current state against the final task goal. The introduction of PRIMO R1, a 7B framework, marks a pivotal shift by transforming these models into active "Critics".
From Passive Observers to Active Critics
PRIMO R1 leverages outcome-based Reinforcement Learning to explicitly incentivize Chain-of-Thought generation for progress estimation. This approach fundamentally alters the model's role, moving beyond simple event recognition to a more analytical function. The architecture is further enhanced by constructing a structured temporal input, explicitly anchoring the video sequence between initial and current state images, providing crucial temporal context for reasoning.
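The structured temporal input described above can be sketched as follows. This is a hypothetical illustration only: the anchor tags, the `Frame` placeholder, and `build_progress_prompt` are invented names, not part of any released PRIMO R1 code, and the real system would operate on image tensors rather than string tags.

```python
# Hypothetical sketch: anchor the sampled video frames between an
# initial-state image and a current-state image, so the model has explicit
# temporal context for progress reasoning. All names are illustrative.

from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """Placeholder for one image (a path or tensor in a real pipeline)."""
    tag: str

def build_progress_prompt(initial: Frame, video: List[Frame],
                          current: Frame, instruction: str) -> List[str]:
    """Interleave anchors, frames, and the task text into one sequence."""
    seq = ["<initial_state>", initial.tag, "</initial_state>"]
    seq += ["<video>"] + [f.tag for f in video] + ["</video>"]
    seq += ["<current_state>", current.tag, "</current_state>"]
    seq += [f"Task: {instruction}",
            "Estimate task progress (0-100%) with step-by-step reasoning."]
    return seq

prompt = build_progress_prompt(
    Frame("img_init"),
    [Frame(f"img_{i}") for i in range(4)],
    Frame("img_now"),
    "stack the red block on the blue block",
)
print(prompt[0], prompt[-1])
```

The key design point is that the initial and current states are given as explicit anchors rather than buried inside the frame sequence, so the model can compare "where the task started" against "where it is now" directly.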
State-of-the-Art Performance and Generalization
The efficacy of PRIMO R1 is validated by extensive experiments on the proposed PRIMO Dataset and Benchmark, demonstrating state-of-the-art performance across diverse in-domain environments and out-of-domain real-world humanoid scenarios. Notably, the 7B PRIMO R1 model achieves a 50% reduction in mean absolute error compared to specialized reasoning baselines, and shows significant relative accuracy improvements over 72B-scale general MLLMs. Its capabilities extend to robust zero-shot generalization on challenging failure detection tasks. On the RoboFail benchmark, PRIMO R1 attains 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%, underscoring its advanced capabilities in progress estimation for robot manipulation.
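For readers unfamiliar with the metric, the mean absolute error cited above is simply the average absolute gap between predicted and ground-truth progress values. The sketch below illustrates it with invented numbers; these are not results from the paper.

```python
# Minimal illustration of mean absolute error (MAE) for progress
# estimation. The predicted and ground-truth progress values (in %)
# are invented examples, not PRIMO R1 results.

def mean_absolute_error(preds, targets):
    """Average of |prediction - target| over all examples."""
    assert len(preds) == len(targets)
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

preds   = [10.0, 45.0, 80.0, 100.0]   # model's progress estimates (%)
targets = [ 0.0, 50.0, 75.0, 100.0]   # annotated ground truth (%)

print(mean_absolute_error(preds, targets))  # → 5.0
```

A 50% reduction in this quantity means the model's progress estimates sit, on average, half as far from the annotated ground truth as the baselines' estimates do.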