Steering LRMs Beyond Output Degradation

Deployed large reasoning models (LRMs) frequently exhibit unpredictable behaviors, a challenge that test-time steering methods have attempted to address. However, existing approaches often degrade output quality because they rely on internal features that detect behavior in already generated text, rather than features that predict what the model will do next.

Visual TL;DR (diagram): LRM Output Degradation -> (problem) Existing Steering Methods -> (flaw) Detection vs Prediction -> (solution) Activation Probes. Activation Probes -> (demonstrates) Predicting Future Behavior. Activation Probes -> (enables) FPCG Method -> (achieves) Minimal Quality Degradation.

LRM Output Degradation: deployed large reasoning models often exhibit unpredictable behaviors
Existing Steering Methods: rely on internal features detecting already generated text
Detection vs Prediction: distinguishing features that signal existing vs future behavior
Activation Probes: trained to forecast future behavior likelihoods from intermediate steps
Predicting Future Behavior: probes predict the most probable behavior with accuracy ranging from 64% to 91%
FPCG Method: Future Probe Controlled Generation samples candidate sentences and selects one using a probe's prediction
Minimal Quality Degradation: LRMs are steered precisely with very little loss in output quality


Unmasking Prediction Features for Control

The core contribution of the arXiv preprint by Kortukov, Komorowski, and colleagues is a distinction between detection features and prediction features in LRM hidden states. Prior steering techniques inadvertently focused on features that signal behavior already present in the generated text, and these proved to be poor indicators of future actions. The paper introduces activation probes trained to forecast the likelihood of future behaviors from intermediate reasoning steps. These probes predict the most probable behavior with accuracy ranging from 64% to 91%, which reveals a distinct set of internal prediction features.
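To make the idea of an activation probe concrete, here is a minimal sketch, assuming the simplest possible setup: a binary logistic-regression probe that maps a hidden-state vector to the probability that a behavior appears later. The toy data generator, the function names (`make_toy_data`, `train_probe`, `probe_predict`, `accuracy`), and all hyperparameters are illustrative assumptions and are not taken from the paper, which may use different probe architectures, labels, and training procedures.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n, dim, seed):
    # Hypothetical stand-in for (hidden state, future behavior) pairs:
    # Gaussian vectors whose first two coordinates are shifted by the label.
    # Requires dim >= 2.
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        shift = 1.5 if y == 1 else -1.5
        x[0] += shift
        x[1] += shift
        xs.append(x)
        ys.append(y)
    return xs, ys


def train_probe(activations, labels, lr=0.5, epochs=200):
    # Full-batch gradient descent on the logistic (cross-entropy) loss.
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    # Estimated probability that the behavior occurs in the future.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def accuracy(w, b, activations, labels):
    hits = sum(
        int((probe_predict(w, b, x) >= 0.5) == (y == 1))
        for x, y in zip(activations, labels)
    )
    return hits / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_toy_data(200, 8, seed=0)
    test_x, test_y = make_toy_data(200, 8, seed=1)
    w, b = train_probe(train_x, train_y)
    print("held-out accuracy:", accuracy(w, b, test_x, test_y))
```

In a real setting the inputs would be activations captured at an intermediate reasoning step, and the labels would record whether the behavior showed up later in the same generation, which is what separates a prediction probe from a detection probe.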

Future Probe Controlled Generation: Precision Steering

Building on these newly identified prediction features, the authors propose Future Probe Controlled Generation (FPCG). This text-level steering method samples multiple candidate sentences and selects the best one according to a probe's prediction of future behavior likelihood. FPCG enables precise steering of large reasoning models with very little degradation in output quality, a significant improvement over previous methods. FPCG also remains effective in steering scenarios where activation steering methods fail, which points to its robustness and broader applicability.
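The sample-then-select loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `sample_sentence` and `probe_score` are hypothetical callables standing in for the LRM's sentence sampler and a trained future-behavior probe, and the toy versions at the bottom exist only so the sketch runs on its own.

```python
import random


def fpcg_step(context, sample_sentence, probe_score, num_candidates=4, rng=None):
    # Sample several candidate next sentences, score each with the probe,
    # and keep the highest-scoring one (steering toward the behavior).
    if rng is None:
        rng = random.Random(0)
    candidates = [sample_sentence(context, rng) for _ in range(num_candidates)]
    scores = [probe_score(context, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


def fpcg_generate(prompt, sample_sentence, probe_score,
                  num_sentences=3, num_candidates=4, seed=0):
    # Build the output one sentence at a time, applying fpcg_step at each step.
    rng = random.Random(seed)
    context = [prompt]
    for _ in range(num_sentences):
        sentence, _ = fpcg_step(
            context, sample_sentence, probe_score, num_candidates, rng
        )
        context.append(sentence)
    return context


# Toy stand-ins (assumptions for illustration only).
TOY_POOL = [
    "Let me double-check the previous step.",
    "So the answer is 12.",
    "I will verify this with a second method.",
    "Moving on.",
]


def toy_sample_sentence(context, rng):
    # A real sampler would decode one sentence from the LRM given the context.
    return rng.choice(TOY_POOL)


def toy_probe_score(context, candidate):
    # A real probe would read hidden states after appending the candidate;
    # here a keyword check plays the role of "likely to verify later".
    text = candidate.lower()
    return 1.0 if ("verify" in text or "double-check" in text) else 0.0


if __name__ == "__main__":
    for line in fpcg_generate("Solve 3 * 4.", toy_sample_sentence, toy_probe_score):
        print(line)
```

To steer away from a behavior instead, the same loop would keep the lowest-scoring candidate. Because selection happens over sampled text and the model's activations are left unmodified, every chosen sentence is one the model could have produced on its own, which is consistent with the small quality loss reported for FPCG.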
