Sophisticated 3D scene understanding in Vision-Language Models (VLMs) has so far come at a cost: it requires either complex, bespoke geometry encoders or substantial training investment. That trade-off has limited the scalability and accessibility of spatial reasoning in AI.
The Equirectangular Canvas: A Unified Spatial Coordinate System
The OneCanvas approach, detailed by Baranowski et al. on arXiv, fundamentally rethinks how multi-view image patches are integrated into a VLM. Instead of complex fusion mechanisms, it projects patch features into a single equirectangular panoramic canvas. Each patch is unprojected to its 3D world coordinate using its depth and camera pose. Crucially, this 3D position is then mapped to continuous longitude and latitude on the canvas, effectively creating a shared spatial coordinate system without rasterization or cross-view aggregation. A 3D position embedding of the patch's metric coordinates is added to its feature, preserving depth information lost in the angular projection. This representation is directly consumable by pretrained VLMs as if it were a standard image, eliminating the need for major architectural modifications or specialized encoders.
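The projection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the 3x4 camera-to-world pose layout, the z-up world frame, and the longitude/latitude conventions are all assumptions chosen for the example.

```python
import math

def unproject_patch(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a patch centre (u, v) with metric z-depth to a world point.

    Assumed conventions (not from the paper): pinhole camera with x right,
    y down, z forward; cam_to_world is a 3x4 row-major [R | t] matrix.
    """
    xc = (u - cx) / fx * depth
    yc = (v - cy) / fy * depth
    zc = depth
    return tuple(
        row[0] * xc + row[1] * yc + row[2] * zc + row[3] for row in cam_to_world
    )

def world_to_canvas(point, center=(0.0, 0.0, 0.0)):
    """Map a world point to (longitude, latitude, range) around a canvas centre.

    Assumed conventions: world z is up, longitude is measured from +x toward +y.
    The angles alone discard distance, which is why the range is returned too.
    """
    dx, dy, dz = (p - c for p, c in zip(point, center))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    lon = math.atan2(dy, dx)  # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, dz / r))) if r > 0 else 0.0
    return lon, lat, r

def canvas_uv(lon, lat):
    """Continuous, normalized canvas coordinates in [0, 1]; nothing is rasterized."""
    return (lon + math.pi) / (2 * math.pi), (math.pi / 2 - lat) / math.pi
```

Because the angular coordinates drop the distance from the canvas centre, a representation built this way needs the metric position to be carried separately, which is the role the article attributes to the 3D position embedding.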
Enabling Situated Reasoning and Efficient Pretraining
OneCanvas also supports situated reasoning directly. Centering the canvas on any pose of interest lets the same representation be analyzed from that specific viewpoint, a critical capability for robotics and embodied AI applications. The unified representation also enables a novel spatial pretraining curriculum: researchers can procedurally generate supervision by placing object patch features at chosen 3D world positions on an empty canvas. Because samples are generated on the fly, the curriculum can cover a broad range of spatial reasoning tasks while controlling answer distributions to prevent shortcut learning. The method has demonstrated state-of-the-art accuracy on benchmarks such as SQA3D and VSI-Bench and generalizes to out-of-distribution data on SPBench, all while using an order of magnitude less training compute than competing methods. That efficiency lowers the barrier to entry for advanced 3D scene understanding.
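To make the idea of procedurally generated, answer-balanced supervision concrete, here is a small sketch. It is an illustration under stated assumptions rather than the paper's curriculum: the left/right task, the function names, the agent frame (x forward, y left, z up), the scene extent, and the rejection-sampling scheme are all hypothetical choices.

```python
import math
import random

def to_agent_frame(point, agent_pos, agent_yaw):
    """Express a world point in the agent's frame (x forward, y left, z up).

    Assumed convention: yaw is a rotation about world z, with yaw = 0 facing +x.
    """
    dx = point[0] - agent_pos[0]
    dy = point[1] - agent_pos[1]
    c, s = math.cos(agent_yaw), math.sin(agent_yaw)
    return (c * dx + s * dy, -s * dx + c * dy, point[2] - agent_pos[2])

def sample_left_right(rng, target_answer, extent=3.0, min_offset=0.2):
    """Rejection-sample a pose and object position whose answer is target_answer."""
    while True:
        agent_pos = (rng.uniform(-extent, extent), rng.uniform(-extent, extent), 0.0)
        agent_yaw = rng.uniform(-math.pi, math.pi)
        obj = (
            rng.uniform(-extent, extent),
            rng.uniform(-extent, extent),
            rng.uniform(0.0, 2.0),
        )
        _, y_left, _ = to_agent_frame(obj, agent_pos, agent_yaw)
        if abs(y_left) < min_offset:
            continue  # too close to straight ahead or behind: ambiguous
        answer = "left" if y_left > 0 else "right"
        if answer == target_answer:
            return {
                "agent_pos": agent_pos,
                "agent_yaw": agent_yaw,
                "object_pos": obj,
                "question": "Is the object to your left or right?",
                "answer": answer,
            }

def make_batch(n, seed=0):
    """Build n samples with the answers fixed in advance to be balanced."""
    rng = random.Random(seed)
    targets = ["left", "right"] * (n // 2) + ["left"] * (n % 2)
    rng.shuffle(targets)
    return [sample_left_right(rng, t) for t in targets]
```

Fixing the target answer before sampling the geometry keeps the label distribution balanced by construction, so a model cannot score well by always guessing the more frequent answer. The same agent-frame transform is what re-centering on a pose of interest amounts to in this simplified setting.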