Long-horizon robotic manipulation has been hampered by the inability of current video MLLMs to actively evaluate task progress. These models, typically trained via Supervised Fine-Tuning (SFT), function primarily as passive observers rather than as critics that evaluate the current state against the final task goal. PRIMO R1, a 7B framework, marks a pivotal shift by transforming these models into active "Critics".
From Passive Observers to Active Critics
PRIMO R1 leverages outcome-based Reinforcement Learning to explicitly incentivize Chain-of-Thought generation for progress estimation. Rewarding the model for the accuracy of its final progress estimate, rather than for imitating annotated outputs, moves it beyond simple event recognition into an evaluative role. The framework further constructs a structured temporal input that explicitly anchors the video sequence between initial- and current-state images, providing the temporal context needed to reason about how far the task has advanced.
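The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not PRIMO R1's actual implementation: the frame-packing scheme (`build_temporal_input`), the context-frame count, and the reward form (`outcome_reward`) are all assumptions chosen to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    index: int  # position of the frame in the original video
    role: str   # "initial", "context", or "current"


def build_temporal_input(num_frames: int, num_context: int = 6) -> list[Frame]:
    """Sketch of the structured temporal input: uniformly sampled
    context frames anchored between the initial- and current-state
    images. (Hypothetical packing; the paper's exact scheme may differ.)"""
    if num_frames < 2:
        raise ValueError("need at least an initial and a current frame")
    step = max(1, (num_frames - 2) // max(1, num_context))
    context = [Frame(i, "context") for i in range(1, num_frames - 1, step)]
    return ([Frame(0, "initial")]
            + context[:num_context]
            + [Frame(num_frames - 1, "current")])


def outcome_reward(predicted_progress: float, true_progress: float) -> float:
    """Assumed outcome-based reward: the chain-of-thought itself is not
    scored; only the accuracy of the final progress estimate is."""
    return max(0.0, 1.0 - abs(predicted_progress - true_progress))


seq = build_temporal_input(num_frames=32, num_context=6)
print([f.role for f in seq])   # initial, six context frames, current
print(outcome_reward(0.7, 0.75))
```

Because the reward depends only on the final estimate, the model is free to discover whatever intermediate reasoning best supports an accurate judgment, which is the core shift from SFT-style imitation.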