RLAIF Explained: Latent Values in LLMs

RLAIF explained: human values are latent directions in LLM representations, activated by constitutional prompts, with an alignment ceiling tied to model capacity and data quality.

[Image: abstract representation of neural network layers and data flow. Credit: StartupHub.ai]

Reinforcement Learning from AI Feedback (RLAIF) has emerged as a powerful paradigm for improving language models, yet its theoretical underpinnings have remained elusive. This work proposes the 'latent value hypothesis' to explain why RLAIF, which trains models on their own preference judgments, demonstrably works for value alignment.
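The mechanism described above, a model judging pairs of its own outputs under a constitution to produce preference data, can be sketched as a minimal toy loop. Everything here is an illustrative assumption: the stand-in `generate` and `judge` functions are placeholders, not the paper's or any real RLAIF pipeline's implementation.

```python
import random

# Toy sketch of the RLAIF preference-labeling step: the model judges pairs of
# its own outputs under a constitutional prompt, and the labels become
# training data for a reward model. All functions are stand-ins.

CONSTITUTION = "Choose the response that is more helpful and harmless."

def generate(prompt, seed):
    """Stand-in for sampling a response from the policy model."""
    random.seed(seed)
    return f"{prompt} -> response#{random.randint(0, 9)}"

def judge(constitution, prompt, a, b):
    """Stand-in for the model's own preference judgment.

    A real judge would prompt the same LLM with the constitution; here we
    simply prefer the lexicographically smaller string as a placeholder.
    """
    return a if a <= b else b

preferences = []
for i, prompt in enumerate(["Explain RLAIF", "Summarize the paper"]):
    a, b = generate(prompt, 2 * i), generate(prompt, 2 * i + 1)
    chosen = judge(CONSTITUTION, prompt, a, b)
    rejected = b if chosen == a else a
    preferences.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

print(len(preferences), "preference pairs collected")
```

In a full pipeline, the collected pairs would train a reward model that then drives RL fine-tuning of the policy; this sketch covers only the AI-feedback labeling stage.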

The Latent Value Hypothesis: Human Values Encoded in Representation Space

The core insight is that internet-scale pretraining imbues large language models with human values, not as explicit rules, but as discernible directions within the model's high-dimensional representation space. According to Robin Young's work, these latent values are selectively activated by constitutional prompts, which guide the model to generate preference judgments reflecting the encoded values. The framework, detailed on arXiv, formalizes the constitution as a projection operator that isolates value-relevant dimensions.
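The "projection operator" framing can be made concrete with a small linear-algebra sketch. The dimensions, basis, and vectors below are toy assumptions, not quantities from the paper; the point is only the defining property of a projection onto a low-rank, value-relevant subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64   # hidden dimension of a toy model (illustrative)
k = 4          # assumed rank of the value-relevant subspace

# Columns of V span the latent "value" directions learned in pretraining.
V, _ = np.linalg.qr(rng.standard_normal((d_model, k)))  # orthonormal basis

# Projection operator P = V V^T maps any hidden state onto the value subspace.
P = V @ V.T

h = rng.standard_normal(d_model)   # a hidden-state vector
h_value = P @ h                    # its value-relevant component

# P is idempotent (P @ P == P), the defining property of a projection,
# and the projected vector lies entirely within the span of V.
assert np.allclose(P @ P, P)
assert np.allclose(P @ h_value, h_value)
```

On this reading, a constitutional prompt behaves like applying `P`: it discards directions irrelevant to the stated values and surfaces the component of the model's representation that encodes them.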

Bridging the Generation-Judgment Gap and Unifying Empirical Findings

The theory explains a critical phenomenon: the 'generation-judgment gap'. RLAIF improves alignment when the direction activated by the constitution correlates more strongly with the true values than the model's default generation direction does. This implies that the model's raw output may not always reflect its learned values, but specific prompts can surface them. Furthermore, the maximum achievable quality in RLAIF is fundamentally limited by how well the model's representations encode these values, a capability that scales with model capacity. The research also warns of adversarial constitutions, which can activate anti-social value directions inadvertently learned during pretraining.

This account offers a unified explanation for diverse empirical observations, including the 'refusal direction' in LLMs, the existence of low-rank safety subspaces, and the observed scaling behavior of RLAIF.
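The generation-judgment gap can be illustrated with a toy difference-of-means construction, in the spirit of the 'refusal direction' findings the theory unifies. Every vector, noise level, and cluster below is an illustrative assumption: a "true" value direction, a judgment direction estimated from contrastive activations, and a noisier default generation direction.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
d = 48

# Ground-truth value direction (a toy assumption, unknown to the "model").
v_true = rng.standard_normal(d)
v_true /= np.linalg.norm(v_true)

# Constitution-activated judgment direction, estimated as a difference of
# means over contrastive activations (the technique behind the 'refusal
# direction' literature): mean(value-consistent) - mean(value-violating).
consistent = v_true + 0.3 * rng.standard_normal((200, d))
violating = -v_true + 0.3 * rng.standard_normal((200, d))
v_judge = consistent.mean(axis=0) - violating.mean(axis=0)

# Default generation direction: built deterministically as v_true plus a
# large orthogonal component, i.e. only partially aligned with true values.
u = rng.standard_normal(d)
u -= (u @ v_true) * v_true
u /= np.linalg.norm(u)
v_gen = v_true + 2.0 * u   # cosine with v_true is exactly 1/sqrt(5)

# RLAIF helps precisely when the judgment direction tracks true values
# better than the generation direction does: a positive gap.
gap = cosine(v_judge, v_true) - cosine(v_gen, v_true)
print(f"judgment alignment:   {cosine(v_judge, v_true):.3f}")
print(f"generation alignment: {cosine(v_gen, v_true):.3f}")
print(f"generation-judgment gap: {gap:.3f}")
```

Averaging over many samples cancels the per-example noise, so the judgment direction recovers the true value direction far better than the single noisy generation direction, which is the condition under which the theory predicts RLAIF improves alignment.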