Code-Driven Reasoning for Precise Image Generation

CoCo (Code-as-CoT) introduces executable code as a reasoning framework for text-to-image generation, achieving superior precision and control.

Mar 10 at 8:01 PM · 2 min read
[Figure: Diagram illustrating the CoCo framework, with code execution leading to image generation.]

Current Unified Multimodal Models (UMMs) excel at text-to-image (T2I) generation, yet struggle with the precision needed for complex spatial layouts, structured visual elements, and dense textual content. Because they rely on abstract natural-language planning, these models fall short when intricate detail is paramount. CoCo (Code-as-CoT), a framework presented on arXiv by Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, et al., addresses this limitation by recasting the reasoning process as executable code.

From Natural Language to Executable Logic

CoCo represents the reasoning process as executable code, enabling explicit, verifiable intermediate planning for image generation. The framework first generates code that precisely defines the structural layout of a scene. This code is then executed in a sandboxed environment to render a deterministic draft image. The draft serves as a structured foundation for subsequent fine-grained image editing, which produces the high-fidelity final result. To train both the draft-construction and corrective-refinement stages, the authors built CoCo-10K, a dataset of structured draft-final image pairs.
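To make the idea concrete, here is a minimal sketch of what "executing layout code in a sandbox" could look like. The scene format, helper names, and the restricted-namespace approach are illustrative assumptions, not details from the paper; the point is only that code-defined layouts are explicit and deterministic, unlike free-form natural-language plans.

```python
# Hypothetical model-generated layout code: an explicit, checkable scene plan.
# The dict-based scene format below is an assumption for illustration.
GENERATED_CODE = """
scene = []
scene.append({"shape": "rect",   "xy": (180, 60),  "size": (150, 150), "color": "red"})
scene.append({"shape": "circle", "xy": (180, 300), "size": (150, 150), "color": "blue"})
scene.append({"shape": "text",   "xy": (250, 20),  "content": "A"})
"""

def run_sandboxed(code: str) -> list:
    """Execute layout code in a bare namespace and return the scene list."""
    namespace = {"__builtins__": {}}  # minimal sandbox: no I/O, no imports
    exec(code, namespace)
    return namespace["scene"]

scene = run_sandboxed(GENERATED_CODE)
# Re-running the same code always yields the same layout, so the draft
# image rendered from `scene` is deterministic by construction.
print(len(scene))  # → 3
```

A renderer (e.g. a rasterizer walking the `scene` list) would then turn this plan into the draft image that the refinement stage edits; that rendering step is omitted here for brevity.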

Quantifiable Leaps in Structured Image Generation

Empirical evaluations demonstrate the efficacy of this code-driven approach. On the StructT2IBench, OneIG-Bench, and LongText-Bench benchmarks, CoCo improved over direct-generation baselines by +68.83%, +54.8%, and +41.23%, respectively. It also significantly outperformed existing generation methods that leverage Chain-of-Thought (CoT) reasoning. These results underscore that executable code is a powerful and reliable paradigm for precise, controllable, and structured text-to-image generation, moving beyond the ambiguities of natural-language planning.