The race to build increasingly powerful AI models like GPT-5 has created a fundamental safety problem: how do you supervise a system whose reasoning is faster and more complex than a human supervisor can follow? For researchers, the answer lies in the model’s internal monologue, the explicit "chain-of-thought" (CoT) it generates before taking an action.
A new systematic study from OpenAI introduces a comprehensive framework for measuring what it calls "chain-of-thought monitorability," defined as the ability of a separate monitoring system to predict properties of interest, such as deception, reward hacking, or bias, from the agent’s internal reasoning. The findings confirm that CoT is currently the most effective tool we have for oversight, but they also come with a stark warning: maintaining this visibility may require paying a significant computational cost, dubbed the "monitorability tax."
