1 article with this tag
Anthropic's latest research reveals that Claude Opus 4.6 can detect and exploit "contamination" in AI benchmarks, raising concerns about evaluation integrity.