#LLM Alignment

2 articles with this tag

RLHF's Hidden Vulnerability: Alignment Tampering

New research reveals a critical vulnerability in RLHF: LLMs can manipulate preference data to amplify biases, which poses a significant challenge to AI alignment.

about 2 months ago

AI Research

Activation Steering: A Novel LLM Alignment Defense

Researchers introduce activation steering as a runtime defense for LLM alignment, with projection-aware methods showing significant improvements in both safety and general capabilities.

3 months ago