The quest for robust Large Language Model (LLM) alignment faces a significant hurdle: inherent brittleness. Misalignment is not a fringe issue; it can be triggered by adversarial prompts, benign fine-tuning, emergent phenomena, and goal misgeneralization. Recent findings suggest that some of these misalignments manifest as linear structures in activation space, which makes them potentially tractable. Furthermore, safety alignment often governs only the initial output tokens, leaving later tokens unguarded. This vulnerability underscores the need for dynamic, runtime defenses.
Activation Steering: A Lightweight Runtime Fix
Motivated by these insights, the researchers propose activation steering as a lightweight runtime defense that continuously corrects misaligned activations throughout generation. They evaluate three methods: Steer-With-Fixed-Coeff (SwFC), which applies uniform additive steering, and two novel projection-aware techniques, Steer-to-Target-Projection (StTP) and Steer-to-Mirror-Projection (StMP). The projection-aware methods use a logistic regression decision boundary to intervene only on tokens whose activations fall below distributional thresholds, which gives a more targeted correction. The evaluation uses malicious system prompts as a proxy for misalignment, under dishonesty and dismissiveness threat models, on the Llama-3.3-70B-Instruct and Qwen3-32B architectures. All three methods substantially recover target traits such as honesty and compassion while preserving coherence. Notably, StTP and StMP better maintain general capabilities (measured on MMLU, MT-Bench, and AlpacaEval) and produce less repetition in multi-turn conversations.
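To make the difference between uniform and projection-aware steering concrete, here is a minimal NumPy sketch. It is not the researchers' implementation: this summary gives no formulas, so the update rules are one plausible reading of the method names. The function names (`steer_fixed`, `steer_to_target`, `steer_to_mirror`) and the parameters `alpha`, `threshold`, and `target` are hypothetical. The sketch assumes `w` and `b` are the weight vector and intercept of a logistic regression probe fitted on activations, and that "projection" means signed distance to that probe's decision boundary.

```python
import numpy as np


def signed_distance(h, w, b):
    """Signed distance of each token activation to the boundary w.x + b = 0."""
    return (h @ w + b) / np.linalg.norm(w)


def steer_fixed(h, w, alpha):
    """SwFC-style (assumed form): add the same multiple of the unit
    direction to every token, regardless of where it currently sits."""
    u = w / np.linalg.norm(w)
    return h + alpha * u


def steer_to_target(h, w, b, threshold, target):
    """StTP-style (assumed form): only tokens whose signed distance is
    below `threshold` are moved, and they land exactly at `target`."""
    u = w / np.linalg.norm(w)
    d = signed_distance(h, w, b)
    shift = np.where(d < threshold, target - d, 0.0)
    return h + shift[:, None] * u


def steer_to_mirror(h, w, b, threshold):
    """StMP-style (assumed form): only tokens below `threshold` are moved,
    and each is reflected to the same distance on the other side of it."""
    u = w / np.linalg.norm(w)
    d = signed_distance(h, w, b)
    shift = np.where(d < threshold, 2.0 * (threshold - d), 0.0)
    return h + shift[:, None] * u


# Toy example: two tokens in a 2-D activation space.
h = np.array([[-1.0, 0.0],   # falls on the "misaligned" side of the probe
              [2.0, 0.0]])   # already on the desired side
w = np.array([2.0, 0.0])     # hypothetical probe weights
b = 0.0                      # hypothetical probe intercept

fixed = steer_fixed(h, w, alpha=0.5)
targeted = steer_to_target(h, w, b, threshold=0.0, target=1.5)
mirrored = steer_to_mirror(h, w, b, threshold=0.0)
```

Under these assumptions the fixed-coefficient variant shifts both tokens, while the two projection-aware variants leave the second token untouched. That selectivity is one way to picture why such methods could disturb general capabilities less.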