The effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in improving Large Language Model (LLM) reasoning is fundamentally constrained by the base model's output distribution. The bottleneck arises because RLVR primarily optimizes the conditional distribution P(y|x) of a response y given a prompt x, leaving the marginal distribution P(y) largely untouched.
Unlocking Reasoning Potential in the Pre-train Space
To address this limitation, the researchers propose optimizing the marginal distribution P(y) within the pre-train space. The aim is to encode reasoning abilities directly during pre-training while preserving the model's broad exploration capacity. Conventional pre-training, however, relies on static corpora, and the resulting distribution shift impedes targeted reasoning improvements. To overcome this, they introduce PreRL (Pre-train Space RL), a method that applies reward-driven online updates directly to P(y). Theoretical analysis and empirical results indicate strong gradient alignment between log P(y) and log P(y|x), which supports PreRL as a viable surrogate for standard RL optimization.
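To make the idea of reward-driven online updates on P(y) and the notion of gradient alignment more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a four-way categorical output in place of an LLM, a hypothetical verifiable reward that pays 1 for a single target output, a plain REINFORCE update with a moving-average baseline as the online update rule, and a fixed, made-up PROMPT_BIAS offset standing in for conditioning on a prompt x. All names (train_marginal, alignment, PROMPT_BIAS, TARGET) are illustrative.

```python
import math
import random

NUM_OUTPUTS = 4
TARGET = 2                            # hypothetical "verifiably correct" output
PROMPT_BIAS = [0.3, -0.2, 0.1, 0.0]   # made-up stand-in for the effect of a prompt x


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - probs."""
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def train_marginal(steps=500, lr=0.5, seed=0):
    """REINFORCE on the prompt-free distribution P(y) = softmax(shared)."""
    rng = random.Random(seed)
    shared = [0.0] * NUM_OUTPUTS      # parameters shared by P(y) and P(y|x)
    baseline = 0.0                    # moving average of past rewards
    for _ in range(steps):
        p_marginal = softmax(shared)
        # Online: sample from the current model rather than a static corpus.
        y = rng.choices(range(NUM_OUTPUTS), weights=p_marginal)[0]
        reward = 1.0 if y == TARGET else 0.0
        grad = grad_log_prob(p_marginal, y)
        advantage = reward - baseline
        shared = [s + lr * advantage * g for s, g in zip(shared, grad)]
        baseline += 0.1 * (reward - baseline)
    return shared


def alignment(shared, y):
    """Cosine between grad log P(y) and grad log P(y|x) w.r.t. the shared logits."""
    p_marginal = softmax(shared)
    p_conditional = softmax([s + b for s, b in zip(shared, PROMPT_BIAS)])
    return cosine(grad_log_prob(p_marginal, y), grad_log_prob(p_conditional, y))


if __name__ == "__main__":
    shared = train_marginal()
    print("P(y) after training:", [round(p, 3) for p in softmax(shared)])
    print("gradient cosine for y = TARGET:", round(alignment(shared, TARGET), 3))
```

In this toy setting both gradients contain the same one-hot term, so their cosine is nonnegative by construction. The sketch therefore illustrates only what quantity is being compared, not the strength of alignment reported for real models.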