Automating High-Quality Image Editing Data

A new two-stage pipeline, EditCaption, improves how VLMs synthesize image-editing instructions, lifting fine-tuned Qwen3-VL models past Gemini-3-Pro and GPT-4.1 on two benchmarks and reducing critical errors.

The EditCaption pipeline offers a two-stage approach to refine VLM-generated instructions for image editing.

Instruction-guided image editing scales only as far as its training data allows, and that data is now largely produced by having Vision-Language Models (VLMs) write the editing instructions. In image-pair settings, however, these models fail in systematic ways: orientation inconsistency, viewpoint ambiguity, and insufficient attribute description. Human evaluation finds that over 47% of instructions generated by strong VLMs contain critical errors that make them unusable for downstream training, which is the gap a more robust instruction synthesis pipeline has to close.

Cracking the Code on VLM Instruction Failures

The researchers identify three failure modes in VLM-generated instructions for image editing: orientation confusion (for example, mixing up left and right), viewpoint ambiguity, and missing fine-grained attribute detail. Together these account for the high share of unusable training data, which in turn slows the development of automated image editing models. EditCaption is designed to target these three errors directly.

EditCaption: A Two-Stage Pipeline for Data Refinement

To address these data quality issues, the authors propose EditCaption, a two-stage post-training pipeline. Stage 1 is supervised fine-tuning (SFT) on a 100K-sample dataset, built by combining automatic annotation with GLM, EditScore-based filtering, and human refinement so that instructions are accurate at the spatial, directional, and attribute level. Stage 2 collects 10K human preference pairs that specifically target the identified failure modes and applies direct preference optimization (DPO) to align the model beyond what SFT alone achieves.
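
The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` record, the `filter_by_score` helper, the 0.8 threshold, the example instructions, and `beta=0.1` are all hypothetical, and the article does not say how EditScore is computed or where the cutoff sits. The `dpo_loss` function is the standard per-pair DPO objective, the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability gaps.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    score: float  # stand-in for an EditScore-style quality score (scale is assumed)


def filter_by_score(samples, threshold=0.8):
    """Stage 1 sketch: keep candidates whose score clears a hypothetical threshold."""
    return [s for s in samples if s.score >= threshold]


@dataclass
class PreferencePair:
    # Summed token log-probabilities of the preferred ("chosen") and
    # dispreferred ("rejected") instruction under the policy being trained
    # and under the frozen reference model.
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def dpo_loss(pair, beta=0.1):
    """Stage 2 sketch: standard DPO loss for one human preference pair."""
    margin = (pair.policy_chosen - pair.ref_chosen) - (
        pair.policy_rejected - pair.ref_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical candidates: one specific instruction, one vague one.
candidates = [
    Sample("Rotate the mug so its handle points left.", 0.91),
    Sample("Change the object.", 0.35),
]
kept = filter_by_score(candidates)

# A policy identical to the reference, and one that has shifted
# probability mass toward the human-preferred instruction.
neutral = PreferencePair(-12.0, -12.0, -12.0, -12.0)
improved = PreferencePair(-10.0, -14.0, -12.0, -12.0)
print(len(kept), dpo_loss(neutral), dpo_loss(improved))
```

When the policy matches the reference, the loss equals log 2; it falls as the policy assigns relatively more probability to the preferred instruction, which is the signal the preference pairs provide on top of SFT.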


Qwen3-VL Image Editing Surges Ahead

The impact of the EditCaption pipeline is demonstrated with Qwen3-VL models fine-tuned for instruction synthesis. On the Eval-400 benchmark, the 235B model scored 4.712, narrowly ahead of Gemini-3-Pro (4.706) and well ahead of GPT-4.1 (4.220). On ByteMorph-Bench, the Qwen3-VL model reached 4.588, against 4.522 for Gemini-3-Pro and 3.412 for GPT-4.1. The results point to a practical route to scalable, human-aligned instruction synthesis for automated image editing.
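
To put the reported margins side by side, the short sketch below subtracts each baseline's score from the EditCaption-tuned Qwen3-VL score. The numbers are copied from the article; the dictionary layout and the `margins` helper are illustrative assumptions, and the article does not state the score scale or variance, so the differences should be read as descriptive only.

```python
# Scores as reported in the article; the structure is an assumption for illustration.
SCORES = {
    "Eval-400": {"Qwen3-VL": 4.712, "Gemini-3-Pro": 4.706, "GPT-4.1": 4.220},
    "ByteMorph-Bench": {"Qwen3-VL": 4.588, "Gemini-3-Pro": 4.522, "GPT-4.1": 3.412},
}


def margins(benchmark, model="Qwen3-VL"):
    """Return the model's lead over every other system on one benchmark."""
    table = SCORES[benchmark]
    return {
        name: round(table[model] - score, 3)
        for name, score in table.items()
        if name != model
    }


for benchmark in SCORES:
    print(benchmark, margins(benchmark))
```

The lead over Gemini-3-Pro is 0.006 on Eval-400 and 0.066 on ByteMorph-Bench, while the lead over GPT-4.1 is 0.492 and 1.176 respectively.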

© 2026 StartupHub.ai. All rights reserved.