Vision Language Models (VLMs) demonstrate remarkable capabilities but falter when spatial reasoning hinges on information they cannot observe. This limitation hinders applications that require inference about occluded spaces, alternative viewpoints, or the integration of partial observations. A new paper on arXiv introduces a method that gives VLMs 'imaginative perception'.
Externalizing Unseen Spatial Configurations
The core innovation lies in Imaginative Perception Tokens (IPTs), which act as intermediate representations. These tokens externalize what a VLM would perceive under hypothetical spatial arrangements, ensuring consistency with the observed input. This allows models to reason about spatial relationships that are not directly present in the input data, moving beyond the limitations of purely observable information.
A Superior Supervision Signal for Spatial Reasoning
To validate this paradigm, the researchers formulated three new tasks: Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), accompanied by a 20,000-example dataset. When applied to the BAGEL VLM, IPT supervision consistently boosted spatial reasoning performance. Notably, it often surpassed textual chain-of-thought training, and it did so without generating images during inference, which avoids that computational overhead. On the Multiview Counting task, IPT improved accuracy by 3.4%, and on Path Tracing it achieved results competitive with strong closed-source models. The study further suggests that combining IPT with label-only supervision yields additional gains, whereas forcing spatial computation through language (textual chain-of-thought) can degrade performance, indicating a potential modality mismatch.
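One way to picture the combination of IPT and label-only supervision is as a weighted sum of two training losses. The sketch below is illustrative only and is not taken from the paper: the function names, the `ipt_weight` parameter, and the use of plain cross-entropy for both terms are all assumptions made for the example.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def combined_loss(answer_logits, answer_target, ipt_logits, ipt_targets, ipt_weight=1.0):
    """Hypothetical objective: answer-label loss plus a weighted mean loss
    over the imagined-perception token positions (an assumption, not the
    paper's stated formulation)."""
    label_loss = cross_entropy(answer_logits, answer_target)
    if not ipt_targets:
        # Label-only supervision: no imagined tokens to supervise.
        return label_loss
    ipt_loss = sum(
        cross_entropy(logits, target) for logits, target in zip(ipt_logits, ipt_targets)
    ) / len(ipt_targets)
    return label_loss + ipt_weight * ipt_loss
```

Under this assumed setup, setting `ipt_weight` to zero or passing no IPT targets reduces the objective to label-only training. The imagined tokens serve only as a training signal here, which is in line with the article's point that no images need to be generated at inference time.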