Automating High-Quality Image Editing Data

Cracking the Code on VLM Instruction Failures

The researchers identified three key failure modes in VLM-generated instructions for image editing: orientation confusion (e.g., swapping left and right), ambiguous viewpoint, and missing fine-grained attribute detail. Together, these errors make a large share of automatically generated training data unusable, which slows the development of instruction-guided image editing models. Fixing these systematic errors is the focus of the proposed pipeline.

EditCaption: A Two-Stage Pipeline for Data Refinement

To address these data quality issues, the authors propose EditCaption, a two-stage post-training pipeline. Stage 1 is supervised fine-tuning (SFT) on a 100K-sample dataset built by combining automatic annotation with GLM, EditScore-based filtering, and human refinement, with the goal of spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs that specifically target the three failure modes and applies direct preference optimization (DPO) to align the model beyond what SFT alone achieves.
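
As a rough illustration of what Stage 2 optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not the authors' training code: the function names, the argument layout, and the value beta = 0.1 are assumptions chosen for illustration, and a real setup would compute the log-probabilities with the fine-tuned model and a frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log(sigmoid(z)) = min(z, 0) - log(1 + exp(-|z|))
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state beta
) -> float:
    """Standard DPO loss for one (chosen, rejected) instruction pair.

    Each argument is the summed log-probability of a full instruction
    under either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each instruction than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the human-preferred instruction.
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


# Hypothetical pair: the chosen instruction says "left" correctly,
# the rejected one confuses the orientation.
loss = dpo_loss(
    policy_chosen_logp=-12.0,
    policy_rejected_logp=-15.0,
    ref_chosen_logp=-13.0,
    ref_rejected_logp=-13.5,
)
print(f"DPO loss for this pair: {loss:.4f}")
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2. It falls below that value as the policy shifts probability toward the preferred instruction, which is the signal the preference pairs provide on top of SFT.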

Qwen3-VL Image Editing Surges Ahead

The effect of the EditCaption pipeline is demonstrated with fine-tuned Qwen3-VL image editing models. On the Eval-400 benchmark, the 235B model scored 4.712, narrowly ahead of Gemini-3-Pro (4.706) and well ahead of GPT-4.1 (4.220). On ByteMorph-Bench, the Qwen3-VL model reached 4.588, again ahead of Gemini-3-Pro (4.522) and GPT-4.1 (3.412). These results suggest that scalable, human-aligned instruction synthesis is a practical way to improve training data for automated image editing.
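
To make the size of these gaps explicit, the short sketch below recomputes the score differences from the numbers reported above. The dictionary layout and the label "Qwen3-VL (fine-tuned)" are assumptions made for illustration. No scores beyond those quoted in the text are used.

```python
# Scores exactly as quoted in the text above.
scores = {
    "Eval-400": {
        "Qwen3-VL (fine-tuned)": 4.712,
        "Gemini-3-Pro": 4.706,
        "GPT-4.1": 4.220,
    },
    "ByteMorph-Bench": {
        "Qwen3-VL (fine-tuned)": 4.588,
        "Gemini-3-Pro": 4.522,
        "GPT-4.1": 3.412,
    },
}

OURS = "Qwen3-VL (fine-tuned)"  # hypothetical label for the fine-tuned model


def margins(table: dict, ours: str = OURS) -> dict:
    """Return, per benchmark, the fine-tuned model's lead over each baseline."""
    return {
        bench: {
            name: round(results[ours] - score, 3)
            for name, score in results.items()
            if name != ours
        }
        for bench, results in table.items()
    }


for bench, gaps in margins(scores).items():
    for name, gap in gaps.items():
        print(f"{bench}: +{gap:.3f} over {name}")
```

The lead over Gemini-3-Pro is 0.006 on Eval-400 and 0.066 on ByteMorph-Bench, while the lead over GPT-4.1 is 0.492 and 1.176 respectively.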