Vision-language models (VLMs) typically operate on RGB images produced by the camera's image signal processing (ISP) pipeline. That rendering step often discards crucial sensor evidence through clipping, suppression, and quantization, limiting how accurately a model can ground its answers in what the sensor actually measured. A new approach, PRISM-VL, asks whether grounding performance improves when the visual interface is moved closer to the original camera measurement.
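To make the information loss concrete, here is a minimal toy simulation (not PRISM-VL code) of an ISP-style rendering chain: exposure gain, highlight clipping, gamma encoding, and 8-bit quantization. The pipeline stages and parameters are illustrative assumptions; the point is only that distinct RAW values collapse to identical RGB values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Linear scene radiance; values above 1.0 represent highlights beyond the white point.
raw = rng.uniform(0.0, 4.0, size=100_000)

def simple_isp(raw, gain=1.0, gamma=2.2):
    """Toy ISP: exposure gain -> highlight clipping -> gamma encoding -> 8-bit quantization."""
    x = np.clip(raw * gain, 0.0, 1.0)   # clipping: every highlight collapses to 1.0
    x = x ** (1.0 / gamma)              # display gamma compresses tonal resolution
    return np.round(x * 255.0) / 255.0  # quantization: only 256 levels survive

rgb = simple_isp(raw)
print("distinct highlight values, RAW vs. rendered RGB:",
      np.unique(raw[raw > 1.0]).size, "->", np.unique(rgb[raw > 1.0]).size)
```

Running this, tens of thousands of distinct highlight measurements map to a single RGB value of 1.0; no downstream model can recover evidence the rendering already destroyed.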
Bridging the Measurement-to-RGB Gap
The researchers introduce measurement-grounded vision-language learning, instantiated as PRISM-VL, a framework that feeds the model RAW-derived Meas.-XYZ inputs directly. Two components make this practical: a camera-conditioned grounding mechanism and Exposure-Bracketed Supervision Aggregation, which transfers supervision from readily available RGB proxies to the more granular measurement-domain observations. This addresses a fundamental obstacle to training on sensor data: large-scale labels exist for rendered RGB images, not for RAW measurements.
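The article does not spell out the aggregation rule, but the idea can be sketched: render several exposure brackets from one measurement, score each RGB proxy against the RGB-domain labels, and combine the per-bracket signals into a single training target. In the sketch below, `render_bracket`, `proxy_loss`, and the clipping-based weighting are all illustrative assumptions, not the authors' method.

```python
import torch

def render_bracket(meas_xyz: torch.Tensor, ev: float) -> torch.Tensor:
    """Toy RGB proxy: exposure gain of 2**ev, highlight clipping, gamma encoding."""
    return torch.clamp(meas_xyz * 2.0 ** ev, 0.0, 1.0) ** (1.0 / 2.2)

def aggregate_supervision(meas_xyz, labels, proxy_loss, evs=(-2.0, 0.0, 2.0)):
    """Aggregate per-bracket RGB-proxy losses into one measurement-domain target.

    Assumption: brackets dominated by clipped pixels carry little usable
    evidence, so they are down-weighted. `proxy_loss` stands in for whatever
    RGB-domain objective the labels support and must return a scalar tensor.
    """
    losses, weights = [], []
    for ev in evs:
        rgb = render_bracket(meas_xyz, ev)
        weights.append(1.0 - (rgb >= 1.0).float().mean())  # trust unclipped brackets more
        losses.append(proxy_loss(rgb, labels))             # supervision from RGB proxies
    w = torch.stack(weights)
    return (w / w.sum().clamp_min(1e-8) * torch.stack(losses)).sum()
```

The aggregated loss would then supervise the measurement-domain model, letting RGB-only labels train a network that never sees rendered RGB at inference time.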
Quantifiable Gains in Challenging Scenarios
PRISM-VL-8B, trained on a 150K-example instruction-tuning set and evaluated on a benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, reached 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy. Against the RGB-based Qwen3-VL-8B baseline, that is a gain of +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points in LLM-Judge accuracy (implying baseline scores of roughly 0.5046, 0.3500, and 78.20%). These results suggest that a meaningful share of VLM grounding errors stems directly from information lost during standard RGB rendering, and that preserving measurement-domain evidence pays off for multimodal reasoning.