CoCo: Code Drives Precise Image Generation

CoCo leverages executable code for precise, structured text-to-image generation, outperforming existing methods on complex benchmarks.

Mar 10 at 8:01 PM · 2 min read
Figure: The CoCo framework, from prompt input through code generation and draft image rendering to final image refinement.

Current Unified Multimodal Models (UMMs) for text-to-image (T2I) generation are advanced, but they struggle with the precision that complex spatial layouts and dense textual content require. Existing Chain-of-Thought (CoT) approaches typically plan in abstract natural language, which lacks the specificity needed for intricate visual compositions and limits how structured and detailed the generated images can be.

From Abstract Reasoning to Executable Plans

The proposed CoCo framework represents the reasoning process as executable code, an approach termed Code-as-CoT. Instead of planning in natural language, the model generates explicit, verifiable intermediate code that specifies the structural layout of a scene. Executing that code in a sandboxed environment produces a deterministic draft image, which then serves as the foundation for fine-grained editing into a high-fidelity final image. The method targets the precision gap in current CoT-based T2I systems.
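The flow described above (prompt, generated code, execution, deterministic draft) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Canvas` class, its `rect` and `text` methods, the `render_draft` helper, and the hard-coded `LAYOUT_CODE` string standing in for model output are hypothetical, and the draft is emitted as SVG only to keep the example dependency-free. It is not CoCo's actual drawing API or sandbox.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Canvas:
    """Hypothetical drawing API exposed to the generated layout code."""
    width: int
    height: int
    elements: list = field(default_factory=list)

    def rect(self, x, y, w, h, fill="none"):
        self.elements.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="{fill}"/>'
        )

    def text(self, x, y, content, size=16):
        self.elements.append(
            f'<text x="{x}" y="{y}" font-size="{size}">{escape(content)}</text>'
        )

    def to_svg(self):
        body = "".join(self.elements)
        return (
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{self.width}" height="{self.height}">{body}</svg>'
        )


def render_draft(layout_code, width=512, height=512):
    """Execute layout code against a fresh canvas and return the draft as SVG."""
    canvas = Canvas(width, height)
    # Emptying __builtins__ only limits what the snippet can reach;
    # it is NOT a real security sandbox (a real system would isolate the process).
    exec(layout_code, {"__builtins__": {}}, {"canvas": canvas})
    return canvas.to_svg()


# Stand-in for code a model might emit for "a poster titled 'Grand Opening'".
LAYOUT_CODE = """
canvas.rect(0, 0, 512, 512, fill="#f5f0e6")
canvas.rect(56, 40, 400, 90, fill="#c0392b")
canvas.text(96, 98, "Grand Opening", size=40)
"""

draft = render_draft(LAYOUT_CODE)
```

Because the layout is plain code with explicit coordinates and literal text, running it twice yields the same draft, and the exact string "Grand Opening" is guaranteed to appear in it. That determinism is what a natural-language plan cannot offer.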

CoCo-10K: Training for Structured Visual Synthesis

To support this code-driven approach, the researchers constructed CoCo-10K, a curated dataset of structured draft-final image pairs. It is designed to train models on two tasks: generating coherent structured drafts and performing accurate visual refinements. The pairs teach the model both to translate code-based plans into images and to correct deviations between the draft and the intended result.
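To make the pairing concrete, here is one way a single record could be organized and split into the two training tasks. The field names (`prompt`, `layout_code`, `draft_image`, `final_image`), the file paths, and the `to_training_examples` helper are all hypothetical; the actual CoCo-10K schema is not described in this summary and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftFinalPair:
    """Hypothetical record layout; the real CoCo-10K schema may differ."""
    prompt: str
    layout_code: str   # code that renders the structured draft
    draft_image: str   # path to the rendered draft
    final_image: str   # path to the refined target image


def to_training_examples(pair):
    """Split one record into a draft-generation task and a refinement task."""
    draft_task = {"input": pair.prompt, "target": pair.layout_code}
    refine_task = {
        "input": (pair.prompt, pair.draft_image),
        "target": pair.final_image,
    }
    return draft_task, refine_task


# Illustrative values only; none of these come from the dataset.
pair = DraftFinalPair(
    prompt="A poster titled 'Grand Opening'",
    layout_code='canvas.text(96, 98, "Grand Opening", size=40)',
    draft_image="drafts/0001.png",
    final_image="finals/0001.png",
)
draft_task, refine_task = to_training_examples(pair)
```

Under this assumed layout, one record supervises both stages: the first task maps a prompt to layout code, and the second maps the prompt plus the rendered draft to the final image.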

Unlocking New Benchmarks in Controllable Generation

Empirical evaluations show a substantial effect. Across StructT2IBench, OneIG-Bench, and LongText-Bench, CoCo improved on direct generation by +68.83%, +54.8%, and +41.23%, respectively. It also outperformed other CoT-enhanced generation methods. These results support executable code as an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation.