The scalability of instruction-guided image editing hinges on high-quality training data, a bottleneck currently addressed by using Vision-Language Models (VLMs) to generate editing instructions. However, these models exhibit systematic failure modes in image-pair settings, including orientation inconsistency, viewpoint ambiguity, and insufficient attribute description. Human evaluation reveals that over 47% of instructions generated by strong VLMs contain critical errors that render them unusable for downstream training. This underscores the need for robust instruction synthesis pipelines.
Cracking the Code on VLM Instruction Failures
The researchers identified three key failure modes in VLM-generated instructions for image editing: orientation confusion (e.g., swapping left and right), ambiguity in viewpoint, and a lack of fine-grained attribute detail. Together, these issues produce a high rate of unusable training data, which slows the development of automated image editing models. Fixing these systematic errors is therefore a prerequisite for scaling instruction-guided image manipulation.
EditCaption: A Two-Stage Pipeline for Data Refinement
To combat these data quality issues, the authors propose EditCaption, a novel two-stage post-training pipeline. Stage 1 performs supervised fine-tuning (SFT) on a 100K-sample dataset. This dataset is curated by combining automatic annotation with GLM, EditScore-based filtering, and human refinement to ensure spatial, directional, and attribute-level accuracy. Stage 2 further refines the model by collecting 10K human preference pairs that specifically target the identified failure modes, then applying direct preference optimization (DPO) to achieve alignment beyond standard SFT.
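To make Stage 2 more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. This is not the authors' implementation. The function name `dpo_loss`, the argument names, and the value `beta=0.1` are illustrative assumptions, and the summary above does not state which hyperparameters EditCaption uses. Each input is assumed to be the summed log-probability of a full instruction under either the model being trained (the policy) or a frozen reference model, for example the Stage 1 SFT checkpoint.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) instruction pair.

    All arguments are sequence log-probabilities (floats).
    beta is a hypothetical default, not a value taken from the source.
    """
    # How much more the policy likes each instruction than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of margins; positive when the policy favors the
    # human-preferred instruction more strongly than the reference.
    x = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(x)), written as a numerically stable softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and reference agree on both instructions, the loss equals log 2 (about 0.693). It drops below that as the policy shifts probability toward the preferred instruction, for instance one that names the correct left/right orientation, and rises above it when the policy favors the rejected one.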