Deployed large reasoning models (LRMs) frequently behave unpredictably, a problem that test-time steering methods attempt to address. Existing approaches, however, often degrade output quality because they rely on internal features that detect behavior already present in the generated text rather than features that predict what the model will do next.
Unmasking Prediction Features for Control
The core contribution of the arXiv preprint by Kortukov, Komorowski, and colleagues is a distinction between detection features and prediction features in LRM hidden states. Prior steering techniques inadvertently targeted features that signal behavior which has already occurred, and these proved to be poor indicators of future actions. The paper instead introduces activation probes trained to forecast the likelihood of future behaviors from intermediate reasoning steps. The probes predict the most probable behavior with accuracy ranging from 64% to 91%, revealing a distinct set of internal prediction features.
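To make the idea of an activation probe concrete, the following is a minimal sketch, not the authors' implementation. It assumes a probe can be modeled as a linear softmax classifier that maps a hidden-state vector to a distribution over future behaviors. The hidden states here are synthetic random vectors rather than real LRM activations, and every name (`make_synthetic_data`, `train_probe`, `predict_proba`, `accuracy`), the dimensions, and the hyperparameters are hypothetical choices for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_synthetic_data(n, dim, num_behaviors, rng):
    """Hypothetical stand-in for (hidden state, future behavior) pairs.

    Each behavior gets a random center; a sample is its center plus noise.
    Real probes would use activations recorded at intermediate reasoning steps.
    """
    centers = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_behaviors)]
    data = []
    for _ in range(n):
        label = rng.randrange(num_behaviors)
        x = [c + rng.gauss(0, 0.5) for c in centers[label]]
        data.append((x, label))
    return data


def predict_proba(W, b, x):
    """Return the probe's predicted distribution over future behaviors."""
    logits = [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(logits)


def train_probe(data, dim, num_behaviors, epochs=30, lr=0.05):
    """Fit a linear softmax probe with plain SGD on cross-entropy loss."""
    W = [[0.0] * dim for _ in range(num_behaviors)]
    b = [0.0] * num_behaviors
    for _ in range(epochs):
        for x, y in data:
            probs = predict_proba(W, b, x)
            for k in range(num_behaviors):
                # Gradient of cross-entropy w.r.t. logit k is p_k - 1[k == y].
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
    return W, b


def accuracy(W, b, data):
    """Fraction of samples whose most probable predicted behavior is correct."""
    correct = 0
    for x, y in data:
        probs = predict_proba(W, b, x)
        if probs.index(max(probs)) == y:
            correct += 1
    return correct / len(data)


rng = random.Random(0)
samples = make_synthetic_data(300, dim=8, num_behaviors=3, rng=rng)
train_set, test_set = samples[:200], samples[200:]
W, b = train_probe(train_set, dim=8, num_behaviors=3)
print("held-out accuracy:", accuracy(W, b, test_set))
```

The held-out accuracy printed here reflects only how separable the synthetic clusters are and says nothing about the 64% to 91% range reported in the preprint. The point of the sketch is the shape of the setup: the label attached to each vector is a behavior that occurs later, so the probe is scored on prediction rather than on detecting text that already exists.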