OneCanvas: Unified 3D Scene Representation

The pursuit of sophisticated 3D scene understanding within Vision-Language Models (VLMs) has been hampered by a trade-off: existing approaches require either complex, bespoke geometry encoders or substantial training investment. This has limited the scalability and accessibility of spatial reasoning capabilities in AI.

Visual TL;DR. VLM 3D understanding has so far required complex geometry encoders or heavy training. The OneCanvas approach addresses this in two steps: unproject each patch to 3D, then map it to the canvas. The resulting unified spatial system enables situated reasoning and efficient pretraining.

VLM 3D Understanding: sophisticated 3D scene understanding in VLMs has been hampered by a trade-off
Complex Geometry/Training: existing approaches need either complex geometry encoders or substantial training investments
OneCanvas Approach: projects multi-view patch features onto a unified equirectangular canvas
Unproject to 3D: each patch is unprojected to its 3D world coordinate using its depth and camera pose
Map to Canvas: the 3D position is mapped to continuous longitude and latitude on the canvas
Unified Spatial System: a shared spatial coordinate system is created without rasterization
Situated Reasoning: centering the canvas on a pose of interest supports reasoning from that viewpoint
Efficient Pretraining: procedurally generated supervision enables efficient pretraining for spatial reasoning

The Equirectangular Canvas: A Unified Spatial Coordinate System

The OneCanvas approach, detailed by Baranowski et al. on arXiv, fundamentally rethinks how multi-view image patches are integrated into a VLM. Instead of complex fusion mechanisms, it projects patch features into a single equirectangular panoramic canvas. Each patch is unprojected to its 3D world coordinate using its depth and camera pose. Crucially, this 3D position is then mapped to continuous longitude and latitude on the canvas, effectively creating a shared spatial coordinate system without rasterization or cross-view aggregation. A 3D position embedding of the patch's metric coordinates is added to its feature, preserving depth information lost in the angular projection. This representation is directly consumable by pretrained VLMs as if it were a standard image, eliminating the need for major architectural modifications or specialized encoders.
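
The two geometric steps described above (unprojection, then mapping to longitude and latitude) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics fx, fy, cx, cy, a camera-to-world pose given as a rotation R and translation t, and a world frame whose y axis is vertical. The function names and the axis and sign conventions are hypothetical choices made for the example.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with metric depth to a 3D world point.

    Assumes a pinhole camera and a camera-to-world pose (R, t).
    """
    # Back-project the pixel into the camera frame.
    pc = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rotate and translate into the world frame.
    return tuple(
        sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3)
    )

def to_canvas(p, center=(0.0, 0.0, 0.0)):
    """Map a world point to continuous (longitude, latitude, range).

    Angles are measured about `center`; the convention (longitude 0 along
    +z, y vertical) is an assumption of this sketch.
    """
    x, y, z = (p[i] - center[i] for i in range(3))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point coincides with the canvas center")
    lon = math.atan2(x, z)                          # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / r)))     # in [-pi/2, pi/2]
    return lon, lat, r

def canvas_uv(lon, lat):
    """Normalize angles to continuous canvas coordinates in [0, 1]."""
    return (lon + math.pi) / (2 * math.pi), (math.pi / 2 - lat) / math.pi
```

Note that the coordinates stay continuous: nothing is snapped to a pixel grid, which matches the "without rasterization" description. The range r returned by to_canvas is exactly the information the angular projection discards, which is why the article says a 3D position embedding of the metric coordinates is added to each patch feature; the form of that embedding is not given here, so it is left out of the sketch.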

Enabling Situated Reasoning and Efficient Pretraining

A key strategic advantage of OneCanvas is its inherent support for situated reasoning. By centering the canvas on any pose of interest, the same representation can be used to perform analysis from a specific viewpoint, a critical capability for robotics and embodied AI applications. Furthermore, this unified representation unlocks a novel spatial pretraining curriculum. Researchers can procedurally generate supervision by placing object patch features at chosen 3D world positions on an empty canvas. This on-the-fly generation allows for broad coverage of spatial reasoning tasks while controlling answer distributions to prevent shortcut learning. This methodology has demonstrated state-of-the-art accuracy on benchmarks like SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, all while utilizing an order of magnitude less training compute than competing methods. This efficiency dramatically lowers the barrier to entry for advanced 3D scene understanding.
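
To make the ideas of pose-centered canvases and procedurally generated supervision concrete, here is a minimal sketch under stated assumptions. It is not the paper's curriculum: the left/right task, the sampling ranges, and every function name are hypothetical. It assumes the agent's heading is a yaw angle about the vertical y axis, measured from +z toward +x, and that positive relative longitude means "right".

```python
import math
import random

def world_to_pose_frame(p, position, yaw):
    """Express world point p relative to an agent at `position` facing `yaw`.

    After this transform the agent's heading lies along +z, so the canvas
    is centered on the pose of interest.
    """
    dx, dy, dz = (p[i] - position[i] for i in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate about the vertical axis by -yaw.
    return (c * dx - s * dz, dy, s * dx + c * dz)

def relative_longitude(sample):
    """Longitude of the object on a canvas centered on the agent's pose."""
    x, _, z = world_to_pose_frame(sample["object"], sample["agent"], sample["yaw"])
    return math.atan2(x, z)

def make_sample(rng, answer):
    """Place a hypothetical object so it lies on the agent's `answer` side."""
    agent = (rng.uniform(-2.0, 2.0), 0.0, rng.uniform(-2.0, 2.0))
    yaw = rng.uniform(-math.pi, math.pi)
    # Keep a margin from straight ahead / straight behind to avoid ambiguity.
    bearing = rng.uniform(0.2, math.pi - 0.2)
    if answer == "left":
        bearing = -bearing
    dist = rng.uniform(0.5, 3.0)
    ang = yaw + bearing
    obj = (
        agent[0] + dist * math.sin(ang),
        rng.uniform(-0.5, 0.5),
        agent[2] + dist * math.cos(ang),
    )
    return {"agent": agent, "yaw": yaw, "object": obj, "answer": answer}

def make_batch(n, seed=0):
    """Generate n samples with an exactly balanced answer distribution."""
    rng = random.Random(seed)
    answers = ["left", "right"] * (n // 2)
    rng.shuffle(answers)
    return [make_sample(rng, a) for a in answers]
```

The design point is that the answer is chosen first and the scene is constructed to match it, so the label distribution is controlled by construction instead of being whatever a fixed dataset happens to contain. In the approach described above, the object position would then be populated with patch features on an otherwise empty canvas; that feature-placement step is omitted here.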

AI Daily Digest