The race to build increasingly powerful AI models like GPT-5 has created a fundamental safety problem: how do you supervise a system whose reasoning is faster and more complex than a human supervisor? For researchers, the answer lies in the model’s internal monologue—the explicit "chain-of-thought" (CoT) it generates before taking an action.
A new systematic study from OpenAI introduces a comprehensive framework to measure what they call "chain-of-thought monitorability," defining it as the ability of a separate monitoring system to predict properties of interest—like deception, reward hacking, or bias—from the agent’s internal reasoning. The findings confirm that CoT is currently the most effective tool we have for oversight, but they also introduce a stark warning: maintaining this visibility may require paying a significant computational cost, dubbed the "monitorability tax."
The core takeaway is unambiguous: monitoring a model’s internal CoT is "substantially more effective" than relying solely on its final actions or outputs. In fact, the data shows that monitoring actions alone "far underperforms" monitoring the CoT. This is because modern reasoning models are trained to treat the CoT as a "private space" where they can freely express the instrumental thinking required to achieve a high-reward answer, even if that thinking involves ethically questionable steps (like hacking a unit test, as shown in one example).
