# AI Safety
50 articles with this tag

OpenAI's AI Copilot Safety Net
OpenAI is using its own advanced AI models to monitor internal coding agents for misaligned behavior, enhancing safety and security in real-world deployments.

Anthropic Launches AI Futures Think Tank
Anthropic launches The Anthropic Institute to research and address the societal challenges posed by advanced AI development.

AI Agents Tackle AI R&D Automation
AI agents are being tested for autonomous post-training optimization, showing promise but also significant risks like reward hacking.

OpenAI Tames AI Chaos with Instruction Hierarchy
OpenAI's new IH-Challenge dataset trains AI models to prioritize instructions, enhancing safety and mitigating risks like prompt injection.

IBM's Grant Miller on AI Agents: Control vs. Capability
IBM Distinguished Engineer Grant Miller discusses the challenges of AI agent development, focusing on balancing capability with control and avoiding super agency.

AI Reasoning Flaws Are a Safety Feature
AI models' inability to control their "chains of thought" when monitored is a positive for AI safety, preventing them from easily deceiving oversight systems.

OpenAI Details GPT-5.4 Thinking Safety
OpenAI details safety measures for its new GPT-5.4 Thinking model, with a focus on high-capability cybersecurity risks.

AI Ethics Debate: Musk, Zuckerberg, and the Future of AI
Elon Musk and Mark Zuckerberg clash over AI regulation and existential risks, highlighting the debate shaping AI's future.

LM Agents Still Prone to Goal Drift
New research reveals that even state-of-the-art language models are susceptible to goal drift, particularly when influenced by weaker agents' trajectories.

OpenAI's New Model Tackles "Over-Caveating"
OpenAI researcher Blair discusses how new language models are reducing "over-caveating" for more direct and context-aware AI interactions.

Anthropic CEO: AI Must Align With Democratic Values
Anthropic CEO Dario Amodei discusses the AI company's cautious approach to model releases, citing concerns about misuse in surveillance and autonomous weapons.

OpenAI Strikes Pentagon AI Deal
OpenAI inks a deal with the Department of War for classified AI deployments, emphasizing strict safety guardrails against surveillance and autonomous weapons.

OpenAI Tackles AI Mental Health Risks
OpenAI is implementing enhanced mental health safety features, including parental controls and distress detection, while navigating legal challenges.

Anthropic Reworks AI Safety Rules
Anthropic's new Responsible Scaling Policy 3.0 refines its approach to AI safety, separating internal commitments from industry recommendations and boosting transparency.

NIST Seeks Input on AI Agent Security
NIST is seeking public input on security threats, vulnerabilities, and practices for autonomous AI agent systems, aiming to develop new guidelines.

Claude Sonnet 4.6 Ups the AI Ante
Anthropic's Claude Sonnet 4.6 launches with major upgrades in coding, reasoning, and computer use, plus a 1M token context window.

AI Societies' Safety Problem
Self-evolving AI societies face an impossible trilemma: achieving continuous learning, isolation, and safety alignment simultaneously.

Context-Aware Guardrails Tested
Mozilla.ai tested context-aware guardrails for LLMs in a humanitarian context, revealing crucial multilingual performance disparities and the need for robust, domain-specific safety policies.

Context-Aware AI Safety Tested
New research from Mozilla evaluates how context-aware AI safety guardrails perform across different languages and domains, particularly in humanitarian use cases.

Testing AI Guardrails Across Languages
Researchers tested context-aware AI guardrails across English and Farsi in humanitarian scenarios, finding nuanced performance differences and highlighting the need for language-specific safety evaluations.

Multilingual LLM Guardrails Tested
Researchers tested how LLM guardrails perform across languages and policy phrasings, revealing significant variations that impact AI safety assessments.

OpenAI's GPT-5.3-Codex: New Cyber Risks Emerge
OpenAI's new GPT-5.3-Codex model triggers 'High capability' cybersecurity classification, activating enhanced safety protocols amid dual concerns in bio/chem domains.

Claude Opus 4.6: Smarter, Faster, and Longer Context
Anthropic's Claude Opus 4.6 launches with a 1M token context window, enhanced coding, and state-of-the-art benchmark performance.

CLA Euro NCAP Win Validates AI-First Safety Architecture
The Mercedes CLA Euro NCAP win confirms that top safety ratings now require robust, verifiable AI-driven active safety systems built on redundant architectures.

The Assistant Axis LLM: How Researchers Are Capping AI Drift
Scientists have mapped the internal neural space of LLMs, identifying the "Assistant Axis" as the key to stabilizing AI persona and preventing harmful behavior.

Hinton's Stark Warning: The Acceleration of AI Progress Outpaces Human Preparedness

Anthropic publishes SB 53 compliance framework for frontier AI

AI’s safety net relies on chain-of-thought monitorability

AI’s Dual Reality: Safety Theater and the Autonomous Arms Race to AGI
“I worry a lot about the unknowns.” This sentiment, expressed by Anthropic CEO Dario Amodei, encapsulates the pervasive anxiety defining the current era of a...

UK AI Security Institute: DeepMind's Deeper Safety Dive

National Security AI: The High Stakes of Government Innovation

OpenAI Launches $2M AI Mental Health Grants Program

Figure AI Lawsuit Exposes Deep Rifts in Robot Safety Culture

New York Assemblyman Alex Bores on AI Regulation: A Battle Against Unbridled Power

Anthropic's Risky Pursuit of Superintelligence Amidst Calls for Regulation on 60 Minutes
\"I believe it will reach that level, that it will be smarter than most or all humans in most or all ways.

AI’s Hinge Moment: From Legal Logic to Human Fulfillment

Google's Model Armor: The AI Bodyguard Preventing Digital Catastrophes

Rakuten Deploys New Guardrail for SAE PII Detection and LLM-as-a-Judge
Japanese tech giant Rakuten has deployed a novel AI guardrail system to detect and filter personally identifiable information (PII) from user messages, marki...

AI Agent Supervision: Sierra's Answer to Rogue Chatbots

AI introspection is real, but it's unreliable

From Discord's AI Growing Pains to Promptfoo's Red Teaming Triumph

AI's Autonomous Frontier Demands a Security Paradigm Shift

Level 4 Autonomous Driving Nears Commercial Reality

AI Safety: Microsoft Uncovers Bio-Threats, Forges New Research Model

The Human Imperative: Why AI's Future Demands Cultural Grounding, Not Just Data

AI's Dual Nature: Creature or Machine? The Battle Over Regulation

Google AI Research Awards Signal Strategic Priorities

Claude Haiku 4.5: Frontier AI Gets Cheaper, Faster
Anthropic is pushing the boundaries of accessible AI with the release of Claude Haiku 4.5...