Large Vision-Language Models (LVLMs) have demonstrated remarkable semantic understanding, yet their ability to precisely ground language in visual space remains a significant bottleneck. The limitation stems from spatial inference being implicit: the model must reason about scene geometry internally, without ever producing explicit spatial outputs that later reasoning could condition on.
Explicit Spatial Tokenization for Enhanced Grounding
The Perceptio LVLM framework addresses this challenge directly by integrating explicit 2D and 3D spatial reasoning into the model. It generates semantic segmentation tokens (via SAM2) and depth tokens (distilled from a VQ-VAE codebook) directly within the autoregressive sequence, so the model processes and emits spatial information before producing its textual answer, establishing a 'spatial chain-of-thought'.
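To make the token layout concrete, here is a minimal sketch of how such a unified vocabulary and spatial chain-of-thought target sequence might be assembled. All names, sizes, and offsets (TEXT_VOCAB, NUM_SEG_TOKENS, build_spatial_cot_sequence, and the example ids) are illustrative assumptions, not the framework's actual API; the sketch only captures the stated idea of spatial tokens preceding the answer in one autoregressive stream.

```python
import torch

# Assumed vocabulary layout: ordinary text tokens first, then a block of
# segmentation tokens (codes derived from SAM2 masks), then a block of
# depth tokens (indices into a VQ-VAE depth codebook). All sizes are
# hypothetical placeholders.
TEXT_VOCAB = 32000
NUM_SEG_TOKENS = 1024
NUM_DEPTH_TOKENS = 512

SEG_OFFSET = TEXT_VOCAB
DEPTH_OFFSET = TEXT_VOCAB + NUM_SEG_TOKENS
VOCAB_SIZE = DEPTH_OFFSET + NUM_DEPTH_TOKENS


def build_spatial_cot_sequence(seg_ids: torch.Tensor,
                               depth_ids: torch.Tensor,
                               answer_ids: torch.Tensor) -> torch.Tensor:
    """Assemble a 'spatial chain-of-thought' target sequence:
    segmentation tokens, then depth tokens, then the textual answer,
    so the model must emit its spatial estimate before any text."""
    seg = seg_ids + SEG_OFFSET        # shift mask codes into the seg range
    depth = depth_ids + DEPTH_OFFSET  # shift codebook ids into the depth range
    return torch.cat([seg, depth, answer_ids])


# Example: four mask codes, three depth codes, then a short tokenized answer.
seg_ids = torch.tensor([5, 17, 17, 42])       # hypothetical SAM2-derived codes
depth_ids = torch.tensor([7, 7, 300])         # hypothetical VQ-VAE indices
answer_ids = torch.tensor([101, 2057, 2003])  # ordinary text-token ids

sequence = build_spatial_cot_sequence(seg_ids, depth_ids, answer_ids)
assert sequence.max() < VOCAB_SIZE
```

Under a layout like this, standard next-token cross-entropy over the full sequence would be enough to teach the ordering: the model learns to produce its spatial tokens first and only then commit to text, with no architectural change beyond the enlarged vocabulary.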