Vision-language models (VLMs) typically operate on RGB images produced by the camera's image signal processing (ISP) pipeline. That rendering step often discards crucial sensor evidence through clipping, suppression, and quantization, limiting how accurately a model can ground its answers in what the sensor actually measured. A new approach, PRISM-VL, asks whether grounding performance improves when the visual interface is moved closer to the original camera measurement.
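To make the information loss concrete, here is a minimal toy simulation (not PRISM-VL code) of an ISP-style rendering chain: exposure gain, highlight clipping, gamma encoding, and 8-bit quantization. The pipeline stages and parameters are illustrative assumptions; the point is only that distinct RAW values collapse to identical RGB values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Linear scene radiance; values above 1.0 represent highlights beyond the white point.
raw = rng.uniform(0.0, 4.0, size=100_000)

def simple_isp(raw, gain=1.0, gamma=2.2):
    """Toy ISP: exposure gain -> highlight clipping -> gamma encoding -> 8-bit quantization."""
    x = np.clip(raw * gain, 0.0, 1.0)   # clipping: every highlight collapses to 1.0
    x = x ** (1.0 / gamma)              # display gamma compresses tonal resolution
    return np.round(x * 255.0) / 255.0  # quantization: only 256 levels survive

rgb = simple_isp(raw)
print("distinct highlight values, RAW vs. rendered RGB:",
      np.unique(raw[raw > 1.0]).size, "->", np.unique(rgb[raw > 1.0]).size)
```

Running this, tens of thousands of distinct highlight measurements map to a single RGB value of 1.0; no downstream model can recover evidence the rendering already destroyed.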
Bridging the Measurement-to-RGB Gap
The researchers introduce measurement-grounded vision-language learning, instantiated as PRISM-VL, a framework that feeds the model RAW-derived Meas.-XYZ inputs directly. Two components make this practical: a camera-conditioned grounding mechanism and Exposure-Bracketed Supervision Aggregation, which transfers supervision from readily available RGB proxies to the more granular measurement-domain observations. This addresses a fundamental obstacle to training on sensor data: large-scale labels exist for rendered RGB images, not for RAW measurements.
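The article does not spell out the aggregation rule, but the idea can be sketched: render several exposure brackets from one measurement, score each RGB proxy against the RGB-domain labels, and combine the per-bracket signals into a single training target. In the sketch below, `render_bracket`, `proxy_loss`, and the clipping-based weighting are all illustrative assumptions, not the authors' method.

```python
import torch

def render_bracket(meas_xyz: torch.Tensor, ev: float) -> torch.Tensor:
    """Toy RGB proxy: exposure gain of 2**ev, highlight clipping, gamma encoding."""
    return torch.clamp(meas_xyz * 2.0 ** ev, 0.0, 1.0) ** (1.0 / 2.2)

def aggregate_supervision(meas_xyz, labels, proxy_loss, evs=(-2.0, 0.0, 2.0)):
    """Aggregate per-bracket RGB-proxy losses into one measurement-domain target.

    Assumption: brackets dominated by clipped pixels carry little usable
    evidence, so they are down-weighted. `proxy_loss` stands in for whatever
    RGB-domain objective the labels support and must return a scalar tensor.
    """
    losses, weights = [], []
    for ev in evs:
        rgb = render_bracket(meas_xyz, ev)
        weights.append(1.0 - (rgb >= 1.0).float().mean())  # trust unclipped brackets more
        losses.append(proxy_loss(rgb, labels))             # supervision from RGB proxies
    w = torch.stack(weights)
    return (w / w.sum().clamp_min(1e-8) * torch.stack(losses)).sum()
```

The aggregated loss would then supervise the measurement-domain model, letting RGB-only labels train a network that never sees rendered RGB at inference time.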
Quantifiable Gains in Challenging Scenarios
PRISM-VL-8B, trained on a 150K-example instruction-tuning set and evaluated on a benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, reached 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy. Against the RGB-based Qwen3-VL-8B baseline, that is a gain of +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points in LLM-Judge accuracy (implying baseline scores of roughly 0.5046, 0.3500, and 78.20%). These results suggest that a meaningful share of VLM grounding errors stems directly from information lost during standard RGB rendering, and that preserving measurement-domain evidence pays off for multimodal reasoning.