The rapid advancement of vision language models (VLMs) for complex medical tasks, such as report generation and visual question answering, has outpaced their fundamental safety checks. While these models can produce fluent diagnostic narratives, they often fail to perform basic pre-diagnostic sanity checks, a critical step in clinical practice. This oversight creates a dangerous blind spot: models can confidently generate plausible text even when presented with invalid or inconsistent visual input.
Introducing MedObvious: A Benchmark for Input Validation in Medical VLMs
To address this critical gap, researchers have introduced MedObvious, a novel benchmark designed specifically to isolate and evaluate the input validation capabilities of vision language models in medical contexts. This 1,880-task benchmark assesses a model's ability to identify set-level consistency issues across small multi-panel image sets, focusing on whether any panel violates expected coherence. MedObvious progresses through five tiers, starting from rudimentary orientation and modality mismatches and advancing to more complex, clinically relevant checks involving anatomy, viewpoint verification, and triage-style cues. Furthermore, it employs five distinct evaluation formats to rigorously test model robustness across different interaction paradigms.
Unreliable Sanity Checks Plague Leading VLMs
Evaluations of 17 prominent VLMs using the MedObvious benchmark reveal a sobering reality: pre-diagnostic input verification remains largely unsolved. A significant number of models exhibited concerning behavior, including hallucinating anomalies on normal, negative-control inputs. Performance also degraded markedly as the complexity and size of image sets increased, highlighting scalability issues. Critically, measured accuracy varied substantially with the evaluation format, with multiple-choice questions yielding different results from open-ended assessments. These findings underscore that current vision language models in medical settings, while capable of sophisticated language generation, are not yet safe for deployment without robust input validation mechanisms.
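The format-dependence finding implies that a single headline accuracy number can mislead; results need to be broken out per evaluation format. A minimal sketch of that aggregation, with hypothetical format names and made-up correctness values:

```python
from collections import defaultdict

def accuracy_by_format(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group per-task correctness by evaluation format and compute
    accuracy within each group."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for fmt, correct in results:
        totals[fmt][0] += int(correct)  # hits
        totals[fmt][1] += 1             # attempts
    return {fmt: hits / n for fmt, (hits, n) in totals.items()}

# Illustrative data only; not actual MedObvious numbers.
results = [("multiple_choice", True), ("multiple_choice", True),
           ("open_ended", True), ("open_ended", False)]
print(accuracy_by_format(results))  # {'multiple_choice': 1.0, 'open_ended': 0.5}
```

Comparing these per-format figures is what surfaces the robustness gap: a model that scores well under multiple choice may still fail when it must volunteer the inconsistency unprompted.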