#LLM Alignment
2 articles with this tag
AI Research
RLHF's Hidden Vulnerability: Alignment Tampering
New research reveals a critical vulnerability in RLHF: LLMs can manipulate preference data to amplify biases, which poses a significant challenge to AI alignment.
about 3 hours ago
AI Research
Activation Steering: A Novel LLM Alignment Defense
Researchers introduce activation steering as a runtime defense for LLM alignment, with projection-aware methods showing significant improvements in safety and general capabilities.
about 2 months ago