Multimodal Large Language Models (MLLMs) have excelled at bridging vision and language, yet they falter in nuanced spatial understanding and viewpoint-dependent reasoning. Addressing this gap, a new framework, Loc3R-VLM, empowers existing 2D Vision-Language Models with advanced 3D comprehension capabilities using only monocular video input.
Grounding Perception in 3D Space
Inspired by human spatial cognition, Loc3R-VLM operates on two synergistic objectives. The first, global layout reconstruction, constructs a holistic representation of the scene's structure. The second, explicit situation modeling, anchors the egocentric perspective. Together, these objectives furnish direct spatial supervision, effectively grounding both perceptual input and linguistic output within a coherent 3D context. This approach moves beyond simply augmenting input with geometric cues to actively teaching spatial reasoning.
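The paper does not include implementation details, but the two supervision signals can be sketched as a weighted training objective. The function names, the point-map representation of the layout, and the position-plus-viewing-direction encoding of the egocentric situation below are all illustrative assumptions, not the authors' actual formulation:

```python
import numpy as np

def layout_reconstruction_loss(pred_points, gt_points):
    """Global layout term: mean squared error between predicted and
    ground-truth scene point maps (hypothetical representation)."""
    return float(np.mean((pred_points - gt_points) ** 2))

def situation_loss(pred_pose, gt_pose):
    """Situation term: penalize egocentric position error plus
    misalignment of the unit viewing direction (hypothetical encoding)."""
    pos_err = float(np.sum((pred_pose["position"] - gt_pose["position"]) ** 2))
    dir_err = 1.0 - float(np.dot(pred_pose["direction"], gt_pose["direction"]))
    return pos_err + dir_err

def total_spatial_loss(pred, gt, w_layout=1.0, w_situation=1.0):
    """Combine both objectives into one spatial supervision signal."""
    return (w_layout * layout_reconstruction_loss(pred["points"], gt["points"])
            + w_situation * situation_loss(pred["pose"], gt["pose"]))
```

The weights `w_layout` and `w_situation` stand in for whatever balancing the actual training recipe uses.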
Metric-Scale Alignment via Foundation Model Priors
To ensure geometric consistency and accurate metric-scale alignment, a critical requirement for real-world applications, Loc3R-VLM leverages lightweight camera pose priors extracted from a pre-trained 3D foundation model. These priors supply the essential structural information without extensive retraining or heavy computational overhead. The framework achieves state-of-the-art performance in language-based localization and surpasses existing 2D and video-based methods on situated and general 3D question-answering benchmarks, validating the efficacy of its spatial supervision strategy.
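One standard way to align an up-to-scale predicted camera trajectory with a metric-scale prior is a closed-form least-squares scale fit. The sketch below illustrates that generic technique, not Loc3R-VLM's actual alignment procedure, which the summary does not specify; both function names are hypothetical:

```python
import numpy as np

def metric_scale_factor(pred_translations, prior_translations):
    """Closed-form least-squares scale: s* = argmin_s ||s * t_pred - t_prior||^2,
    i.e. s* = <t_pred, t_prior> / <t_pred, t_pred>."""
    t_pred = np.asarray(pred_translations, dtype=float).ravel()
    t_prior = np.asarray(prior_translations, dtype=float).ravel()
    denom = float(np.dot(t_pred, t_pred))
    if denom == 0.0:
        raise ValueError("degenerate (all-zero) predicted trajectory")
    return float(np.dot(t_pred, t_prior)) / denom

def align_to_metric_scale(pred_translations, prior_translations):
    """Rescale the predicted camera translations to metric scale."""
    s = metric_scale_factor(pred_translations, prior_translations)
    return s * np.asarray(pred_translations, dtype=float)
```

In this reading, `prior_translations` would come from the pre-trained 3D foundation model's pose estimates, and the rescaled trajectory grounds the model's spatial answers in real-world units.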