"The ultimate problem, the ultimate reason models hallucinate, is because we have no way to tell them good job for saying 'I don't know' and good job for roughly guessing in the right area." This stark observation, articulated by Matthew Berman, cuts to the core of a persistent challenge in large language models (LLMs). Berman’s recent discussion centers on a groundbreaking OpenAI paper titled "Why Language Models Hallucinate," which identifies the surprising root cause of these plausible yet incorrect outputs. The paper argues that hallucinations are not mere bugs to be patched, but rather an inherent "feature" stemming from the very training and evaluation paradigms designed to optimize model performance.
Matthew Berman, in his detailed commentary, breaks down the implications of this OpenAI paper for a technical audience. He highlights the core argument that LLMs are engineered to produce "overconfident, plausible falsehoods, which diminish their utility." This behavior is cultivated because existing training objectives and evaluation benchmarks inadvertently reward guessing over acknowledging uncertainty. Essentially, models are incentivized to provide a confident, specific answer even when unsure, because abstaining or admitting ignorance often results in a lower score under current evaluation metrics.
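To make that incentive concrete, here is a minimal sketch of the scoring asymmetry. The grading functions and the penalty value are illustrative assumptions, not taken from the paper or Berman's commentary: under accuracy-only grading, a guess always has a non-negative expected score, so it dominates abstaining; under a hypothetical scheme that penalizes confident wrong answers, "I don't know" becomes the rational choice when the model's chance of being right is low.

```python
# Toy illustration (hypothetical scoring schemes, not from the OpenAI paper):
# expected score for a model that either guesses or abstains on a question.

def expected_score_accuracy_only(p_correct: float, abstain: bool) -> float:
    """Binary accuracy grading: +1 for a correct answer, 0 otherwise.
    Abstaining ("I don't know") also scores 0, so it is never rewarded."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


def expected_score_with_penalty(p_correct: float, abstain: bool,
                                wrong_penalty: float = 1.0) -> float:
    """Hypothetical uncertainty-aware grading: +1 for correct,
    -wrong_penalty for a confident wrong answer, 0 for abstaining."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * (-wrong_penalty)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7):
        guess_acc = expected_score_accuracy_only(p, abstain=False)
        guess_pen = expected_score_with_penalty(p, abstain=False)
        print(f"p(correct)={p:.1f}  accuracy-only: guess={guess_acc:+.2f} vs abstain=0.00 | "
              f"with penalty: guess={guess_pen:+.2f} vs abstain=0.00")
    # Under accuracy-only grading, guessing never scores worse than abstaining,
    # so bluffing is always "free". With a wrong-answer penalty of 1, abstaining
    # wins whenever p(correct) < 0.5 (generally, p < penalty / (1 + penalty)).
```

Running the sketch shows the asymmetry Berman describes: when the model only has a 10% chance of being right, the accuracy-only scheme still nudges it toward guessing (expected +0.10 versus 0 for abstaining), while the penalty scheme makes the same guess cost an expected -0.80.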
