Large Vision-Language Models (LVLMs) exhibit impressive capabilities but remain susceptible to generating outputs not grounded in visual input. A central open question has been how much of this hallucination problem stems from vision backbone limitations versus language dominance. New research from Khayatan et al. introduces HalluScope, a benchmark designed to disentangle the factors driving these visual grounding failures.
Excessive Textual Priors Fuel Hallucinations
Analysis with the HalluScope benchmark reveals a significant finding: LVLM hallucinations largely stem from over-reliance on textual priors and background knowledge rather than from vision backbone limitations. The effect is particularly pronounced when information is introduced through the textual instruction, suggesting that the language component's learned associations can override visual evidence. This sharpens the picture for mitigation: the language side of the model, not the vision encoder, is the clearer target.
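To make the failure mode concrete, the sketch below shows one way a text-prior override could be measured: pair each image with a neutral question and a variant that injects a false textual premise, then count how often the premise flips an otherwise correct answer. This is a minimal illustration, not HalluScope's actual protocol; the `query_lvlm` wrapper, the `ProbeCase` fields, and the example prompts are all assumptions.

```python
"""Minimal sketch of a text-prior override probe. Assumes a generic
query_lvlm(image_path, prompt) wrapper around whatever LVLM is under
test; names and prompts are illustrative only."""

from dataclasses import dataclass


@dataclass
class ProbeCase:
    image_path: str          # image whose content contradicts the injected prior
    neutral_prompt: str      # question with no leading premise
    misleading_prompt: str   # same question with a false textual premise
    visual_answer: str       # answer grounded in the image
    prior_answer: str        # answer the false premise suggests


def query_lvlm(image_path: str, prompt: str) -> str:
    """Placeholder for the model call (e.g. a LLaVA or InstructBLIP wrapper)."""
    raise NotImplementedError


def prior_override_rate(cases: list[ProbeCase]) -> float:
    """Fraction of cases where the injected premise flips a correct answer.

    Only cases the model answers correctly under the neutral prompt are
    counted, so the rate isolates the effect of the textual prior rather
    than mixing in ordinary perception errors.
    """
    eligible, flipped = 0, 0
    for case in cases:
        neutral = query_lvlm(case.image_path, case.neutral_prompt)
        if case.visual_answer.lower() not in neutral.lower():
            continue  # model fails even without the prior; not informative
        eligible += 1
        misled = query_lvlm(case.image_path, case.misleading_prompt)
        if case.prior_answer.lower() in misled.lower():
            flipped += 1
    return flipped / eligible if eligible else 0.0


cases = [
    ProbeCase(
        image_path="red_car.jpg",
        neutral_prompt="What color is the car?",
        misleading_prompt="The car in this photo is blue. What color is the car?",
        visual_answer="red",
        prior_answer="blue",
    ),
]
# With a real model wrapper in place:
# print(f"Prior override rate: {prior_override_rate(cases):.1%}")
```

A high override rate under a probe like this would indicate exactly the failure mode the paper describes: the textual premise, not the pixels, is determining the answer.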