The effectiveness of Direct Preference Optimization (DPO) for multimodal models depends on preference data that captures the nuances of visual reasoning. Existing methods often rely on indirect signals or off-policy perturbations, which do not provide the fine-grained feedback these tasks require. Instance-specific rubrics are proposed to address this limitation.
Beyond Coarse Outcomes: Criterion-Level Feedback
The proposed framework, rDPO, builds preference data from instance-specific rubrics. For each image-instruction pair, a checklist-style rubric is generated that lists both essential and supplementary criteria for evaluating responses. The rubric pool is constructed offline and then used during on-policy data generation. As a result, preference signals are tied directly to the visual reasoning requirements of each instance rather than to broad outcome-based assessments.
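To make the idea of criterion-level feedback concrete, here is a minimal Python sketch of how a checklist rubric could turn several sampled responses into one chosen/rejected pair. It is an illustration under stated assumptions, not rDPO's actual procedure: the names `Criterion`, `Rubric`, `rubric_score`, and `build_preference_pair` are hypothetical, as are the 2:1 weighting of essential over supplementary criteria and the `min_margin` threshold. The per-criterion true/false judgments are assumed to come from some external judge that is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Criterion:
    """One checklist item for a single image-instruction pair."""
    text: str
    essential: bool  # True for essential, False for supplementary


@dataclass(frozen=True)
class Rubric:
    """Checklist-style rubric tied to one instance."""
    criteria: Tuple[Criterion, ...]


def rubric_score(
    rubric: Rubric,
    satisfied: Sequence[bool],
    essential_weight: float = 2.0,      # assumed weighting, not from the source
    supplementary_weight: float = 1.0,  # assumed weighting, not from the source
) -> float:
    """Return the weighted fraction of rubric criteria a response satisfies."""
    if len(satisfied) != len(rubric.criteria):
        raise ValueError("expected one judgment per criterion")
    total = 0.0
    earned = 0.0
    for criterion, ok in zip(rubric.criteria, satisfied):
        weight = essential_weight if criterion.essential else supplementary_weight
        total += weight
        if ok:
            earned += weight
    return earned / total if total else 0.0


def build_preference_pair(
    responses: Sequence[str],
    judgments: Sequence[Sequence[bool]],
    rubric: Rubric,
    min_margin: float = 0.2,  # assumed threshold for a usable pair
) -> Optional[Dict[str, str]]:
    """Pick the best- and worst-scoring responses as chosen and rejected.

    Returns None when fewer than two responses are given or when the
    score gap is too small to give a clear preference signal.
    """
    if len(responses) < 2 or len(responses) != len(judgments):
        return None
    scored = sorted(
        ((rubric_score(rubric, j), r) for r, j in zip(responses, judgments)),
        key=lambda pair: pair[0],
    )
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    if high_score - low_score < min_margin:
        return None
    return {"chosen": chosen, "rejected": rejected}
```

In this sketch the responses would be sampled from the current policy, so the resulting pair is on-policy, while the rubric itself is looked up from a pool built in advance. Scoring by criterion rather than by final answer alone means two responses with the same outcome can still be ranked by how well each meets the instance's checklist.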