Researchers are tackling the challenge of understanding and correcting undesirable LLM behavior with a new technique called latent attribution, detailed by Tom Dupre la Tour and Dan Mossing. The method uses sparse autoencoders (SAEs) together with a first-order Taylor approximation to compare a model's behavior on pairs of similar prompts, pinpointing the specific internal "latents" responsible for the difference between desired and misaligned outputs.
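To make the attribution step concrete, here is a minimal, self-contained sketch of a first-order Taylor attribution over SAE latents. Everything in it is illustrative: the toy ReLU SAE, the tensor shapes, and the scalar "behavior metric" are placeholder assumptions, not the authors' code. The score for each latent is (its activation on the misaligned prompt minus its activation on the desired prompt) times the gradient of the metric with respect to that latent.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (hypothetical shapes): residual-stream activations from two
# similar prompts, and a small ReLU SAE with d_sae latents.
d_model, d_sae = 16, 64
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5

resid_A = torch.randn(d_model)   # activation on the "desired" prompt
resid_B = torch.randn(d_model)   # activation on the "misaligned" prompt

def encode(resid):
    # ReLU SAE encoder: latent activations for a residual-stream vector.
    return torch.relu(resid @ W_enc + b_enc)

def metric(latents):
    # Hypothetical scalar behavior metric (e.g. a logit difference),
    # here just a fixed linear readout of the SAE reconstruction.
    readout = torch.arange(d_model, dtype=torch.float32) / d_model
    return (latents @ W_dec) @ readout

lat_A = encode(resid_A)
lat_B = encode(resid_B)

# First-order Taylor approximation of how moving each latent from its value
# on prompt A to its value on prompt B changes the metric:
#   attribution_i ≈ (a_i^B - a_i^A) * d(metric)/d(a_i), evaluated at a^A.
lat = lat_A.clone().requires_grad_(True)
metric(lat).backward()
attribution = (lat_B - lat_A) * lat.grad   # one score per latent

top = torch.topk(attribution.abs(), k=5)
print("top latents by |attribution|:", top.indices.tolist())
```

In a real setting the activations would come from a forward pass of the model and a trained SAE, but the scoring rule is the same: one cheap gradient computation ranks every latent at once.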
Latent attribution establishes a more direct causal link than previous activation-difference approaches, which often missed subtle but causally relevant features. In case studies involving inaccurate health advice and the validation of user biases, the technique successfully identified features correlated with problematic outputs. Notably, the top latent in both scenarios was the same one, dubbed the "provocative" feature for its strong association with extreme language, suggesting a surprisingly unified internal representation for distinct failure modes. The researchers found that steering this single latent could effectively push models toward or away from broad misalignment, and that latents selected via attribution were significantly more effective for causal steering than those selected by simple activation differences.
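For intuition, a hedged sketch of what steering a single latent can look like, assuming the common approach of adding the latent's decoder direction to the residual stream at a chosen layer and position; the decoder matrix, latent index, and steering coefficient below are illustrative placeholders, not the authors' implementation.

```python
import torch

torch.manual_seed(0)

# Toy decoder matrix standing in for a trained SAE's decoder (hypothetical).
d_sae, d_model = 64, 16
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5

def steer(resid, latent_idx, coeff):
    # Nudge the residual-stream activation along the chosen latent's
    # (unit-normalized) decoder direction, scaled by a steering coefficient.
    direction = W_dec[latent_idx] / W_dec[latent_idx].norm()
    return resid + coeff * direction

resid = torch.randn(d_model)                     # activation at some layer/position
steered = steer(resid, latent_idx=7, coeff=4.0)  # positive coeff pushes toward the feature
print(steered.shape)  # same shape; only the direction of the activation changes
```

A negative coefficient pushes the model away from the feature, which is the sense in which a single well-chosen latent can move behavior toward or away from broad misalignment.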
