1 articles with this tag
The latent value hypothesis explains RLAIF by positing that pretraining encodes human values as representation directions, activated by constitutional prompts.