Anthropic Reworks AI Safety Rules

Anthropic has unveiled version 3.0 of its Responsible Scaling Policy (RSP), a significant overhaul of its framework for mitigating catastrophic AI risks. The update, announced two and a half years after the original policy, reflects both the successes and the limitations of its prior approach as AI capabilities have rapidly advanced.

The original RSP, introduced in September 2023, aimed to address future AI risks through "if-then" commitments tied to "AI Safety Levels" (ASLs). For example, if a model exceeded certain biological science capabilities, stricter safeguards would be implemented. While early ASLs were detailed, later levels were intentionally left vague, awaiting a clearer picture of advanced AI capabilities.

Assessing Past Successes and Challenges

Anthropic's previous RSP successfully incentivized internal safeguard development, leading to sophisticated input and output classifiers for ASL-3 compliance in May 2025. The policy also spurred similar frameworks from competitors like OpenAI and Google DeepMind, and informed early AI policy globally, including California's SB 53 and the EU AI Act's Codes of Practice.

However, the policy faced significant hurdles. Pre-set capability thresholds proved ambiguous, making it difficult to definitively trigger multilateral action across the industry. For instance, models now exhibit substantial biological knowledge, but conclusive evidence of high risk remains elusive despite extensive wet-lab trials. Government action on AI safety has also lagged: the policy environment has prioritized competitiveness over safety, and progress on safety regulation, always a long-term project, has proved slower than anticipated.

Furthermore, while ASL-3 safeguards were unilaterally achievable, higher ASLs, which require robust mitigations against state-level actors, might be impossible to meet without collective action. A RAND report, for example, deemed "SL5" security standards for model weights "currently not possible" without national security community assistance. This combination of ambiguous risk, a regulatory climate slow to act on safety, and challenging unilateral compliance compelled Anthropic to restructure its RSP.

RSP 3.0: A Pragmatic Evolution

RSP 3.0 introduces three core elements:

Separated Commitments: The policy now distinguishes between mitigations Anthropic will pursue independently and a more ambitious, industry-wide capabilities-to-mitigations map.
Frontier Safety Roadmap: This new requirement mandates the publication of concrete plans for risk mitigation across security, alignment, safeguards, and policy. These are ambitious, publicly graded goals, not hard commitments, designed to maintain internal impetus for progress. Examples include "moonshot R&D" for information security and developing automated red-teaming methods.
Risk Reports and External Review: Anthropic will publish detailed Risk Reports every 3-6 months, explaining model capabilities, threat models, and active mitigations. These reports, while subject to some redactions for sensitive information, will be reviewed by independent third-party experts under specific circumstances, aiming to foster greater public awareness and policy change.

Anthropic views RSP 3.0 as a living document, designed to adapt as AI capabilities evolve. This pragmatic revision enhances transparency, clarifies internal commitments, and continues to advocate for industry-wide action on the most challenging AI safety issues.
