Perceptio: Spatial Grounding for LVLMs

Perceptio adds explicit spatial tokens (segmentation and depth) to a large vision-language model to overcome LVLM limitations in fine-grained visual grounding, reporting state-of-the-art results with gains on RefCOCO/+/g, HardBLINK, and MMBench.

Large vision-language models (LVLMs) have demonstrated strong semantic understanding, yet precisely grounding language in visual space remains a significant bottleneck. The limitation stems from spatial inference being implicit: the model must deduce complex geometry without ever producing explicit spatial outputs.

Explicit Spatial Tokenization for Enhanced Grounding

The Perceptio framework addresses this challenge by building explicit 2D and 3D spatial reasoning into the model. It generates semantic segmentation tokens (via SAM2) and depth tokens (distilled from a VQ-VAE codebook) directly within the autoregressive sequence. The model therefore outputs spatial information first and only then generates its textual answer, establishing a 'spatial chain-of-thought'.
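
The ordering described above can be sketched in a few lines. This is a minimal illustration, not Perceptio's actual tokenizer: the marker strings (`<seg>`, `<depth_start>`, `<depth_end>`), the `<d_k>` naming for depth codes, the use of a single segmentation placeholder, and both helper functions are hypothetical choices made only for this example.

```python
# Hypothetical special tokens; the article does not give Perceptio's real vocabulary.
SEG_TOKEN = "<seg>"
DEPTH_START = "<depth_start>"
DEPTH_END = "<depth_end>"


def build_target_sequence(depth_codes, answer_tokens):
    """Lay out the spatial tokens first, then the textual answer."""
    # Each depth code is an index into an (assumed) VQ-VAE codebook.
    depth_tokens = [f"<d_{code}>" for code in depth_codes]
    return [SEG_TOKEN, DEPTH_START, *depth_tokens, DEPTH_END, *answer_tokens]


def split_sequence(tokens):
    """Recover the depth codes and the answer from a generated sequence."""
    start = tokens.index(DEPTH_START)
    end = tokens.index(DEPTH_END)
    # Strip the "<d_" prefix and ">" suffix to get the integer code back.
    depth_codes = [int(tok[3:-1]) for tok in tokens[start + 1:end]]
    return depth_codes, tokens[end + 1:]


sequence = build_target_sequence([17, 4, 250], ["The", "mug", "is", "closer", "."])
codes, answer = split_sequence(sequence)
```

Because the spatial tokens precede the answer, every answer token is generated with the segmentation and depth tokens already in its context, which is the sense in which the layout acts as a chain of thought.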

Novel Techniques for Robust Spatial Perception

To keep depth token generation stable and accurate, Perceptio introduces a composite depth-token objective (marker, token, and count losses) together with a soft-merging technique for differentiable reconstruction. The model is also co-trained on multiple tasks across diverse datasets, so that it learns perception tokens that serve several downstream tasks.

Built on the InternVL architecture, Perceptio reports improvements on key benchmarks: a +0.8/+1.4/+1.1 cIoU gain on RefCOCO/+/g for referring expression segmentation, a 10.3% boost in HardBLINK spatial understanding accuracy, and a 1.0% increase in MMBench accuracy.
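
The article names the depth-token losses and the soft-merging step without defining them, so the following is only a sketch of one plausible reading. It assumes that soft merging replaces a hard argmax over the codebook with a probability-weighted average of codebook vectors, that the marker and token terms are cross-entropy losses, and that the count term penalizes a mismatch in the number of depth tokens. All function names, the loss weights, and the exact form of each term are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_merge(logits, codebook):
    """Probability-weighted average of codebook vectors (no hard argmax)."""
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * vec[i] for p, vec in zip(probs, codebook)) for i in range(dim)]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index."""
    return -math.log(softmax(logits)[target])


def mean_cross_entropy(logit_rows, targets):
    """Average cross-entropy over a list of positions."""
    losses = [cross_entropy(row, t) for row, t in zip(logit_rows, targets)]
    return sum(losses) / len(losses)


def composite_depth_loss(marker_logits, marker_targets,
                         token_logits, token_targets,
                         predicted_count, expected_count,
                         weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the assumed marker, token, and count terms."""
    marker = mean_cross_entropy(marker_logits, marker_targets)
    token = mean_cross_entropy(token_logits, token_targets)
    # Assumed count term: relative error in the number of depth tokens.
    count = abs(predicted_count - expected_count) / max(expected_count, 1)
    w_marker, w_token, w_count = weights
    return w_marker * marker + w_token * token + w_count * count


# Toy codebook with two 2-D entries; equal logits give their plain average.
codebook = [[0.0, 0.0], [1.0, 2.0]]
merged = soft_merge([0.0, 0.0], codebook)

loss = composite_depth_loss(
    marker_logits=[[0.0, 0.0]], marker_targets=[0],
    token_logits=[[0.0, 0.0], [0.0, 0.0]], token_targets=[1, 0],
    predicted_count=2, expected_count=2,
)
```

The weighted average is a smooth function of the logits, so in an autograd framework the same expression would let a reconstruction loss on the decoded depth map send gradients back to the token predictions, which a hard argmax lookup would block.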

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.