The rapid advancement of vision language models (VLMs) for complex medical tasks, such as report generation and visual question answering, has outpaced the development of fundamental safety checks. While these models can produce fluent diagnostic narratives, they often fail to perform basic pre-diagnostic sanity checks, a critical step in clinical practice. This oversight creates a dangerous blind spot: models can confidently generate plausible text even when presented with invalid or inconsistent visual input.
Introducing MedObvious: A Benchmark for Input Validation in Medical VLMs
To address this critical gap, researchers have introduced MedObvious, a novel benchmark designed to isolate and evaluate input validation capabilities in vision language models in medical contexts. This 1,880-task benchmark assesses a model's ability to identify set-level consistency issues across small multi-panel image sets, focusing on whether any panel violates expected coherence. MedObvious progresses through five tiers, from rudimentary orientation and modality mismatches to more complex, clinically relevant checks involving anatomy, viewpoint verification, and triage-style cues. It also employs five distinct evaluation formats to rigorously test model robustness across different interaction paradigms.
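To make the task structure concrete, the following is a minimal Python sketch of how a set-level consistency task and its scoring might look. The field names, tier labels, and format strings are illustrative assumptions, not the benchmark's actual schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MedObviousTask:
    """Hypothetical task record; fields are assumptions for illustration only."""
    task_id: str
    tier: int                            # 1 = orientation/modality mismatch ... 5 = triage-style cues
    eval_format: str                     # one of five evaluation formats (e.g., a binary validity check)
    image_paths: List[str]               # small multi-panel image set
    inconsistent_panel: Optional[int]    # index of the violating panel, or None if the set is coherent

def score_prediction(task: MedObviousTask, predicted_panel: Optional[int]) -> bool:
    """Set-level check: the model must flag the violating panel, or confirm coherence (None)."""
    return predicted_panel == task.inconsistent_panel

# Example: a tier-1 task where panel 2 shows a mismatched imaging modality.
task = MedObviousTask(
    task_id="t0001",
    tier=1,
    eval_format="binary",
    image_paths=["panel_0.png", "panel_1.png", "panel_2.png"],
    inconsistent_panel=2,
)
print(score_prediction(task, predicted_panel=2))  # True: the violating panel was correctly identified
```

Under this sketch, a model is credited only when it localizes the inconsistency (or affirms coherence) rather than producing a fluent narrative around invalid input, which is the failure mode the benchmark targets.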