Pre-training Space RL for Enhanced LLM Reasoning

PreRL is a new framework that improves LLM reasoning by applying reward-driven updates directly to the marginal output distribution P(y); it is complemented by Negative Sample Reinforcement and Dual Space RL.

Conceptual overview of the PreRL and DSRL approach for improving LLM reasoning.

The effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in improving Large Language Model (LLM) reasoning is constrained by the base model's existing output distribution. The bottleneck arises because RLVR primarily optimizes the conditional distribution P(y|x), the probability of an output y given a prompt x, and leaves the marginal distribution P(y) largely untouched.

Unlocking Reasoning Potential in the Pre-train Space

To address this limitation, the researchers propose optimizing the marginal distribution P(y) in the pre-train space. The aim is to encode reasoning ability directly into the pre-training distribution while preserving broad exploration capacity. Conventional pre-training relies on static corpora, however, and the resulting distribution shift impedes targeted improvements to reasoning. To overcome this, they introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). The authors report theoretical and empirical evidence of strong gradient alignment between log P(y) and log P(y|x), which supports using PreRL as a surrogate for standard RL optimization.
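
The gradient-alignment claim can be illustrated with a toy softmax model. The sketch below is not the authors' setup: it assumes that P(y) and P(y|x) share one parameter vector (`theta`) and that the prompt contributes only a fixed logit offset (`prompt_offset`). Both names and all values are hypothetical. Under those assumptions it compares the gradient of log P(y) with the gradient of log P(y|x) by cosine similarity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if j == y else 0.0) - p for j, p in enumerate(probs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy values: shared logits and a prompt-specific offset (assumptions).
theta = [0.2, -0.1, 0.5, 0.0]
prompt_offset = [1.0, 0.3, -0.4, 0.1]
y = 0  # index of the output whose likelihood we differentiate

# Marginal: the model scores y without the prompt offset.
grad_marginal = grad_log_prob(theta, y)
# Conditional: the offset is constant, so the gradient with respect to theta
# equals the gradient with respect to the shifted logits.
conditional_logits = [t + o for t, o in zip(theta, prompt_offset)]
grad_conditional = grad_log_prob(conditional_logits, y)

alignment = cosine(grad_marginal, grad_conditional)
print(f"cosine alignment: {alignment:.3f}")
```

In this toy the two gradients always point in broadly the same direction, because both raise the logit of y and lower every other logit. That is a property of the simplified model and says nothing on its own about a full LLM.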

Negative Sample Reinforcement: A Catalyst for Reasoning and Reflection

A key finding within PreRL is that Negative Sample Reinforcement (NSR) is an effective driver of reasoning. NSR rapidly prunes incorrect reasoning paths while stimulating endogenous reflective behaviors: transition thoughts increased 14.89x and reflection thoughts 6.54x. The result suggests that actively shaping the pre-training distribution can foster deeper reasoning capabilities.
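
The pruning effect of NSR can be sketched on a toy categorical distribution over candidate outputs. This is an illustration under stated assumptions, not the authors' implementation: it assumes a single correct output, a reward of -1 for every incorrect output and 0 for the correct one, and it applies the exact expected policy-gradient step instead of sampling. The function name `nsr_expected_step` and all numeric values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_expected_step(theta, correct, lr):
    """One expected-gradient NSR step on P(y) = softmax(theta).

    Only incorrect outputs carry a reward (-1), so the expected gradient is
    sum over wrong y of P(y) * (-1) * grad log P(y).
    """
    probs = softmax(theta)
    wrong_mass = sum(p for j, p in enumerate(probs) if j != correct)
    new_theta = []
    for j, p in enumerate(probs):
        # Wrong outputs are pushed down; the correct one gains the freed mass.
        penalty = p if j != correct else 0.0
        grad_j = p * wrong_mass - penalty
        new_theta.append(theta[j] + lr * grad_j)
    return new_theta

# Hypothetical starting logits: the correct output (index 0) is not the favourite.
theta = [0.0, 1.0, 2.0, 0.5]
correct = 0
p_before = softmax(theta)

for _ in range(50):
    theta = nsr_expected_step(theta, correct, lr=1.0)

p_after = softmax(theta)
print("P(correct) before:", round(p_before[correct], 4))
print("P(correct) after: ", round(p_after[correct], 4))
```

Although the update never rewards the correct output directly, its probability rises because every step lowers the logits of the incorrect outputs and raises the logit of the correct one.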

Dual Space RL: Expanding Horizons Before Refinement

Building on these insights, the authors propose Dual Space RL (DSRL). DSRL follows a Policy Reincarnation strategy: the model is first initialized with NSR-PreRL to expand the reasoning horizon, then transitions to standard RL for fine-grained optimization. In the reported experiments, DSRL consistently outperforms strong baselines, underscoring the advantage of pre-train space pruning in steering policies toward a refined, correct reasoning subspace.
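
The two-stage schedule can be sketched on the same kind of toy softmax policy. The code below is a minimal illustration, not the authors' training loop. It assumes a shared parameter vector `theta`, a fixed logit offset standing in for the prompt, exact expected gradients in place of sampling, and a 0/1 verifiable reward in the second stage. The names `dsrl_train`, `warmup_steps`, and `rl_steps` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_prerl_step(theta, correct, lr):
    """Stage 1: expected NSR step on the marginal P(y) = softmax(theta)."""
    probs = softmax(theta)
    wrong_mass = 1.0 - probs[correct]
    return [
        t + lr * (p * wrong_mass - (p if j != correct else 0.0))
        for j, (t, p) in enumerate(zip(theta, probs))
    ]

def standard_rl_step(theta, prompt_offset, correct, lr):
    """Stage 2: expected policy-gradient step on P(y|x) with reward 1 if correct else 0."""
    probs = softmax([t + o for t, o in zip(theta, prompt_offset)])
    baseline = probs[correct]  # expected reward under the current policy
    return [
        t + lr * p * ((1.0 if j == correct else 0.0) - baseline)
        for j, (t, p) in enumerate(zip(theta, probs))
    ]

def dsrl_train(theta, prompt_offset, correct, warmup_steps, rl_steps, lr):
    """Run NSR-PreRL first, then continue from that policy with standard RL."""
    for _ in range(warmup_steps):
        theta = nsr_prerl_step(theta, correct, lr)
    for _ in range(rl_steps):
        theta = standard_rl_step(theta, prompt_offset, correct, lr)
    return theta

# Hypothetical toy values (assumptions, not taken from the article).
theta0 = [0.0, 1.0, 2.0, 0.5]
prompt_offset = [0.5, 0.0, -0.5, 0.0]
correct = 0

cond_before = softmax([t + o for t, o in zip(theta0, prompt_offset)])[correct]
theta_final = dsrl_train(theta0, prompt_offset, correct,
                         warmup_steps=20, rl_steps=20, lr=1.0)
cond_after = softmax([t + o for t, o in zip(theta_final, prompt_offset)])[correct]
print("P(correct | x) before:", round(cond_before, 4))
print("P(correct | x) after: ", round(cond_after, 4))
```

The design point the sketch tries to capture is the handoff: the second stage starts from the parameters produced by the first instead of from the base model. How many steps each stage should get is not stated in the article, so the step counts here are placeholders.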

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.