Rubric-Driven DPO for Visual Tasks

The efficacy of Direct Preference Optimization (DPO) in multimodal AI hinges on preference data that accurately captures the nuances of visual reasoning. Existing methods often rely on indirect signals or off-policy perturbations, and so fall short of the fine-grained feedback these tasks require. The approach described here addresses that limitation with instance-specific rubrics.

Beyond Coarse Outcomes: Criterion-Level Feedback

The proposed framework, rDPO, generates a detailed, checklist-style rubric for each image-instruction pair, listing both essential and supplementary criteria for evaluating responses. The rubric pool is constructed offline and then used during the on-policy data generation phase. As a result, preference signals are tied directly to the visual reasoning requirements of each instance rather than to broad outcome-based assessments.
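As an illustration of how a checklist-style rubric could turn per-criterion judge verdicts into a preference pair, here is a minimal Python sketch. It is not the authors' implementation: the class and function names, the 2:1 weighting of essential over supplementary criteria, and the `min_margin` threshold are all assumptions made for this example, since the summary does not say how verdicts are aggregated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Criterion:
    """One checklist item in an instance-specific rubric (hypothetical schema)."""
    description: str
    essential: bool  # True = essential criterion, False = supplementary


@dataclass
class ScoredResponse:
    """A sampled on-policy response plus one judge verdict per criterion."""
    text: str
    verdicts: List[bool]


def rubric_score(rubric: List[Criterion], verdicts: List[bool],
                 essential_weight: float = 2.0,
                 supplementary_weight: float = 1.0) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1].

    The weights are illustrative assumptions, not values from the source.
    """
    if len(rubric) != len(verdicts):
        raise ValueError("need exactly one verdict per rubric criterion")
    weights = [essential_weight if c.essential else supplementary_weight
               for c in rubric]
    total = sum(weights)
    if total == 0:
        return 0.0
    earned = sum(w for w, ok in zip(weights, verdicts) if ok)
    return earned / total


def build_preference_pair(rubric: List[Criterion],
                          responses: List[ScoredResponse],
                          min_margin: float = 0.25
                          ) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) texts, or None if the pair is uninformative.

    Pairs whose rubric-score gap is below min_margin (an assumed
    threshold) are filtered out.
    """
    if len(responses) < 2:
        return None
    ranked = sorted(responses,
                    key=lambda r: rubric_score(rubric, r.verdicts),
                    reverse=True)
    best, worst = ranked[0], ranked[-1]
    margin = (rubric_score(rubric, best.verdicts)
              - rubric_score(rubric, worst.verdicts))
    if margin < min_margin:
        return None
    return best.text, worst.text


if __name__ == "__main__":
    rubric = [
        Criterion("Identifies the object asked about", essential=True),
        Criterion("Reads the chart value correctly", essential=True),
        Criterion("Mentions the unit of measurement", essential=False),
    ]
    samples = [
        ScoredResponse("response A", [True, True, False]),
        ScoredResponse("response B", [False, False, True]),
    ]
    print(build_preference_pair(rubric, samples))
```

In this sketch the pair is chosen by criterion-level scores rather than by a single correct/incorrect outcome, which is the distinction the framework draws; a real pipeline would obtain the verdicts from a judge model prompted with the rubric.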

Elevating Judge Accuracy and Downstream Performance

The impact of this rubric-based approach is substantial. On public reward modeling benchmarks, a 30B-A3B judge augmented with rubric-based prompting achieves performance nearing that of GPT-5.4. In downstream benchmark evaluations, rubric-based filtering raises the macro average score to 82.69%, whereas traditional outcome-based filtering causes a drop from 81.14% to 75.82%, underscoring the limitations of coarser evaluation methods. In a scalability assessment on a comprehensive benchmark, rDPO achieves a score of 61.01, outperforming a style-constrained baseline (52.36) and surpassing the base model's score of 59.48. These results highlight the advantage of combining on-policy data construction with instance-specific, criterion-level feedback for multimodal preference optimization.
