As artificial intelligence systems tackle increasingly complex tasks, ensuring human oversight remains paramount. OpenAI's latest research highlights a surprising benefit of current AI limitations: reasoning models are generally poor at controlling their internal thought processes, or "chains of thought" (CoT), even when instructed to do so. This lack of what researchers term "CoT controllability," detailed in a new OpenAI News post, is proving to be a crucial safety feature.
The ability to monitor an AI's reasoning steps is vital for detecting unsafe behavior or deviations from user goals. While AI agents generate these reasoning traces to solve problems, the worry has been that future, more advanced models might learn to deliberately manipulate or hide these steps to evade scrutiny. This could lead to an overestimation of an AI's alignment and safety.
A Feature, Not a Bug
OpenAI's study, using an evaluation suite called CoT-Control, tested 13 reasoning models on their ability to follow instructions about their CoT, such as avoiding specific words or using only lowercase letters. Across the board, these models performed poorly, with controllability scores rarely exceeding 15.4%. Crucially, models often continued to violate the instructions even after recognizing they were failing to comply.
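To make the idea of a controllability score concrete, here is a minimal sketch of how compliance with the two example instructions (avoid specific words, use only lowercase) might be checked over a set of CoT traces. The function names and scoring are hypothetical illustrations, not OpenAI's actual CoT-Control implementation.

```python
import re

def complies(cot: str, forbidden_words: list[str], lowercase_only: bool = True) -> bool:
    """Return True if a chain-of-thought trace satisfies both constraints."""
    forbidden = {w.lower() for w in forbidden_words}
    words = re.findall(r"[A-Za-z']+", cot)
    if any(w.lower() in forbidden for w in words):
        return False
    if lowercase_only and any(ch.isupper() for ch in cot):
        return False
    return True

def controllability_score(traces: list[str], forbidden_words: list[str]) -> float:
    """Fraction of CoT traces that follow the instructions."""
    passed = sum(complies(t, forbidden_words) for t in traces)
    return passed / len(traces)

# Example: one compliant trace, one that breaks both constraints.
traces = [
    "first, factor the equation, then solve for x.",
    "First, we Calculate the answer.",  # uppercase letters + forbidden word
]
print(controllability_score(traces, forbidden_words=["calculate"]))  # → 0.5
```

A real evaluation would, of course, sample many traces per model across many instruction types and aggregate the results; this sketch only shows the shape of the per-trace check.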
This inherent "flaw" is good news for AI safety. It suggests that current reasoning models, like those powering advanced systems, are unlikely to possess the capability to deliberately obfuscate their reasoning to bypass monitoring systems. This supports the reliability of safety approaches that rely on chain-of-thought monitorability, alongside other safety measures.
Controllability Varies, But Remains Low
The research did find that controllability generally increases with model size, but this effect diminishes quickly. Furthermore, models became significantly less able to control their CoTs when asked to reason for longer durations or after undergoing additional post-training, such as reinforcement learning. Even informing models they were being monitored or evaluated for controllability only slightly increased their ability to manipulate their reasoning.
While this low controllability is a positive sign for current AI safety, OpenAI emphasizes that it's not a guaranteed long-term solution. The exact reasons for this low controllability are not yet fully understood, and continued evaluation is necessary as AI capabilities advance. This research serves as an early warning system, or a "canary eval," to detect if models begin to improve their ability to shape or hide their reasoning processes.
The findings contrast with the challenges faced by complex reasoning models, where subtle errors in a chain of thought can cascade. By focusing on the interpretability of AI's internal workings, OpenAI aims to build more robust safeguards, similar to the layered approach discussed in "UK AI Security Institute: DeepMind's Deeper Safety Dive."