Pre-training Space RL for Enhanced LLM Reasoning

The effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in boosting Large Language Model (LLM) reasoning is fundamentally constrained by the base model's inherent output distribution. A significant bottleneck emerges because RLVR primarily optimizes the conditional distribution P(y|x), leaving the marginal distribution P(y) largely untouched.

Unlocking Reasoning Potential in the Pre-train Space

To address this limitation, the authors propose optimizing the marginal distribution P(y) within the pre-train space. The aim is to encode reasoning abilities directly during pre-training while preserving broad exploration capacity. Conventional pre-training, however, relies on static corpora, and the resulting distribution shift impedes targeted reasoning improvements. To overcome this, they introduce PreRL (Pre-train Space RL), a method that applies reward-driven online updates directly to P(y). Theoretical analysis and empirical validation show strong gradient alignment between log P(y) and log P(y|x), which supports PreRL as an effective surrogate for standard RL optimization.
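The gradient-alignment claim can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a four-outcome softmax in which a shared bias `b` defines a prompt-free distribution (a stand-in for P(y)) and a prompt-specific offset `w_x` is added to obtain P(y|x). All names and values (`b`, `w_x`, `grad_log_prob`, `cosine`) are hypothetical. It measures the cosine similarity between the two log-probability gradients with respect to the shared parameters.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy parameters (assumptions, not values from the paper).
b = [0.5, -0.2, 0.1, 0.0]     # shared bias: prompt-free logits, stand-in for P(y)
w_x = [0.3, 0.1, -0.4, 0.2]   # offset contributed by one prompt x
y = 2                         # index of the response being scored

# Both gradients are taken with respect to the shared bias b.
g_marginal = grad_log_prob(b, y)
g_conditional = grad_log_prob([bi + wi for bi, wi in zip(b, w_x)], y)

alignment = cosine(g_marginal, g_conditional)
print(f"cosine alignment: {alignment:.3f}")
```

In this toy setting both gradients raise the logit of `y` and lower all others, so the cosine is always positive; how close it is to 1 depends on how much the prompt offset reshapes the distribution.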

Negative Sample Reinforcement: A Catalyst for Reasoning and Reflection

A key finding within PreRL is that Negative Sample Reinforcement (NSR) is an especially effective driver of reasoning. NSR rapidly prunes incorrect reasoning paths while stimulating endogenous reflective behaviors: transition and reflection thoughts increased by 14.89x and 6.54x, respectively. These results highlight the value of actively shaping the pre-training distribution to foster deeper reasoning capabilities.
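The pruning effect of NSR can be sketched on a small categorical policy. This is a minimal illustration, assuming a reward of -1 for incorrect outcomes and no gradient at all for correct ones; it uses the exact expectation over a four-outcome toy policy rather than sampled rollouts so that the result is deterministic. The function `nsr_expected_step` and all values are hypothetical, and the sketch does not model reflective behavior.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_expected_step(logits, correct, lr):
    """One expected NSR update on softmax logits.

    Every incorrect outcome is penalized with reward -1, weighted by the
    probability that the policy would sample it. Correct outcomes contribute
    no gradient (assumption: NSR uses negative samples only).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p_y in enumerate(probs):
        if y == correct:
            continue  # positive samples are ignored
        reward = -1.0
        for i in range(len(logits)):
            dlogp = (1.0 if i == y else 0.0) - probs[i]  # d log p(y) / d logit_i
            grad[i] += p_y * reward * dlogp
    # Gradient ascent on expected reward.
    return [z + lr * g for z, g in zip(logits, grad)]

# Hypothetical starting policy in which the correct outcome is the least likely.
logits = [1.0, 0.5, 0.0, -0.5]
correct = 3
p_before = softmax(logits)

for _ in range(50):
    logits = nsr_expected_step(logits, correct, lr=0.5)

p_after = softmax(logits)
print("P(correct) before:", round(p_before[correct], 3))
print("P(correct) after: ", round(p_after[correct], 3))
```

Although the correct outcome is never rewarded directly, pushing probability mass away from incorrect outcomes redistributes it, so the probability of the correct outcome rises in this toy setting.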

Dual Space RL: Expanding Horizons Before Refinement

Building on these insights, the authors propose Dual Space RL (DSRL). This strategy employs a Policy Reincarnation approach, initializing models with NSR-PreRL to significantly expand the reasoning horizon. Subsequently, the model transitions to standard RL for fine-grained optimization. Extensive experiments show that DSRL consistently outperforms strong baselines, underscoring the strategic advantage of pre-train space pruning in steering policies toward a refined, correct reasoning subspace.
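The two-stage schedule can be illustrated with the same kind of toy policy. The sketch below rests on several assumptions that are not from the source: a shared bias `b` stands in for the pre-train space, a prompt offset `w_x` stands in for the conditional space, stage 1 applies NSR (reward -1 for incorrect, 0 for correct) to `b` only, and stage 2 applies a standard +1/-1 policy gradient to `w_x`. Exact expected gradients replace sampling, and every name here (`expected_pg_grad`, `train_dual_space`) is hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_pg_grad(logits, rewards):
    """Exact gradient of E[r] w.r.t. softmax logits: p_i * (r_i - E[r])."""
    probs = softmax(logits)
    mean_r = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_r) for p, r in zip(probs, rewards)]

def train_dual_space(b, w_x, correct, nsr_steps, rl_steps, lr):
    """Run NSR on the shared bias, then standard RL on the prompt offset.

    Returns P(correct | x) at the start, after stage 1, and after stage 2.
    """
    n = len(b)
    nsr_rewards = [0.0 if i == correct else -1.0 for i in range(n)]
    rl_rewards = [1.0 if i == correct else -1.0 for i in range(n)]
    b, w_x = list(b), list(w_x)

    def p_correct():
        return softmax([bi + wi for bi, wi in zip(b, w_x)])[correct]

    history = [p_correct()]

    # Stage 1 (toy pre-train space): NSR on the prompt-free logits b.
    for _ in range(nsr_steps):
        g = expected_pg_grad(b, nsr_rewards)
        b = [bi + lr * gi for bi, gi in zip(b, g)]
    history.append(p_correct())

    # Stage 2 (toy conditional space): +1/-1 policy gradient, updating w_x only.
    for _ in range(rl_steps):
        g = expected_pg_grad([bi + wi for bi, wi in zip(b, w_x)], rl_rewards)
        w_x = [wi + lr * gi for wi, gi in zip(w_x, g)]
    history.append(p_correct())
    return history

# Hypothetical initial parameters; outcome 3 is the correct one.
history = train_dual_space(
    b=[1.0, 0.5, 0.0, -0.5],
    w_x=[0.2, 0.0, -0.1, 0.0],
    correct=3,
    nsr_steps=20,
    rl_steps=20,
    lr=0.5,
)
print("P(correct | x) at start, after stage 1, after stage 2:",
      [round(p, 3) for p in history])
```

The toy only shows the handoff between the two stages; it says nothing about how DSRL compares with baselines, which the source establishes experimentally.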