The frontier of AI is increasingly defined by tasks that demand nuanced comprehension, not just pattern recognition. Humor, a uniquely human cognitive feat, is one of the hardest of these tasks, yet existing benchmarks often treat it as a black-box prediction problem. That framing overlooks the intricate reasoning involved in understanding why something is funny.
Decomposing Humor: The Incongruity-Resolution Supervision Framework
Researchers introduce IRS (Incongruity-Resolution Supervision), a novel framework designed to explicitly model the structured reasoning behind humor. IRS breaks down humor comprehension into three core components: identifying the visual incongruity, generating coherent resolutions for that mismatch, and aligning these resolutions with human judgments. This approach, grounded in established humor theory and expert practice, provides structured supervision for the intermediate reasoning steps, making the path from perception to humorous interpretation explicit and trainable.
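To make the three components concrete, here is a minimal sketch of how one supervised example might be laid out. The class names, field names, the `human_score` scale, and the sample cartoon are all hypothetical illustrations, not the schema or data used by the IRS authors.

```python
from dataclasses import dataclass, field


@dataclass
class Resolution:
    """One candidate explanation that reconciles the incongruity."""
    text: str
    human_score: float  # hypothetical: a normalized human rating in [0, 1]


@dataclass
class IRSExample:
    """Hypothetical training record with one field per IRS component."""
    image_id: str
    incongruity: str  # component 1: the visual mismatch
    resolutions: list[Resolution] = field(default_factory=list)  # component 2

    def preferred_resolution(self) -> Resolution:
        """Component 3: return the resolution that humans rated highest."""
        if not self.resolutions:
            raise ValueError("example has no candidate resolutions")
        return max(self.resolutions, key=lambda r: r.human_score)


# Invented sample data, for illustration only.
example = IRSExample(
    image_id="cartoon-001",
    incongruity="A fish sits at an office desk wearing a tie.",
    resolutions=[
        Resolution("The fish is a 'fish out of water' at its new job.", 0.8),
        Resolution("The office has flooded.", 0.3),
    ],
)
best = example.preferred_resolution()
```

Keeping the incongruity and its candidate resolutions as separate fields mirrors the idea that each intermediate step can be supervised on its own, rather than only the final caption choice.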
Scaling Reasoning, Not Just Parameters, for Humor AI
The effectiveness of IRS is demonstrated across models of three sizes (7B, 32B, and 72B parameters) on the New Yorker Cartoon Caption Contest (NYCC) benchmark. The framework significantly outperforms strong multimodal baselines on both the caption matching and caption ranking tasks. Notably, the largest IRS model approaches expert-level performance in caption ranking. IRS also transfers zero-shot to external benchmarks, which indicates that it learns generalizable reasoning patterns. Together, these results suggest that supervising the structure of reasoning, rather than relying on model scale alone, is what drives progress on complex, reasoning-centric tasks such as humor understanding.
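As a rough illustration of how the two tasks could be scored, the sketch below assumes that matching is evaluated as multiple-choice accuracy and ranking as pairwise preference accuracy. The source does not spell out the exact metrics, so the function names and scoring conventions here are assumptions.

```python
from typing import Sequence


def matching_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of cartoons where the chosen caption index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def ranking_accuracy(score_pairs: Sequence[tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred caption gets the higher score.

    Each pair is (model score for the human-preferred caption,
    model score for the other caption). Ties count as errors.
    """
    if not score_pairs:
        raise ValueError("need at least one pair")
    return sum(pref > other for pref, other in score_pairs) / len(score_pairs)


# Invented predictions and scores, for illustration only.
match_acc = matching_accuracy([0, 2, 1, 4], [0, 2, 3, 4])
rank_acc = ranking_accuracy([(0.9, 0.1), (0.2, 0.7), (0.5, 0.4), (0.3, 0.3)])
```

Counting ties as errors is a conservative choice for the ranking metric: a model that cannot separate two captions has not reproduced the human preference.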