Current Unified Multimodal Models (UMMs) excel at text-to-image (T2I) generation, yet struggle with the precision needed for complex spatial layouts, structured visual elements, and dense textual content. Because they rely on abstract natural-language planning, these models fall short when intricate detail is paramount. CoCo (Code-as-CoT), a novel framework presented on arXiv by Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou et al., addresses this limitation by redefining the reasoning process as executable code.
From Natural Language to Executable Logic
Instead of describing a scene in prose, CoCo writes its plan as code, making the intermediate planning step explicit and verifiable. The framework first generates code that precisely defines the structural layout of a scene, then executes that code in a sandboxed environment to render a deterministic draft image. This structured draft anchors a subsequent fine-grained editing stage that produces the high-fidelity final result. To train both the draft-construction and corrective-refinement stages, the authors built CoCo-10K, a dataset of structured draft-final image pairs.
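To make the pipeline concrete, here is a minimal Python sketch of what a generated layout program and its sandboxed execution might look like. The element schema, the `render_draft` helper, and the Pillow drawing primitives are illustrative assumptions; the paper's actual code format and execution interface are not specified here.

```python
# Hypothetical sketch of the code-as-CoT idea: the model emits layout code
# like this, a sandbox executes it, and the rendered draft image seeds a
# later fine-grained editing stage. The element schema and helper names are
# illustrative assumptions, not the paper's actual interface.
from PIL import Image, ImageDraw

CANVAS = (512, 512)

# Explicit, verifiable scene plan: every element carries exact coordinates,
# so the draft renders deterministically and can be inspected before
# refinement.
LAYOUT = [
    {"kind": "rect", "box": (32, 32, 480, 120), "fill": "#dbeafe"},
    {"kind": "rect", "box": (32, 160, 240, 480), "fill": "#fde68a"},
    {"kind": "text", "xy": (48, 56), "text": "Quarterly Report"},
    {"kind": "text", "xy": (48, 180), "text": "1. Revenue\n2. Costs\n3. Outlook"},
]

def render_draft(layout, size=CANVAS):
    """Execute the layout plan and return a deterministic draft image."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    for el in layout:
        if el["kind"] == "rect":
            draw.rectangle(el["box"], fill=el["fill"], outline="black")
        elif el["kind"] == "text":
            draw.multiline_text(el["xy"], el["text"], fill="black")
    return img

if __name__ == "__main__":
    render_draft(LAYOUT).save("draft.png")  # input to the refinement stage
```

Because the plan is plain code, every coordinate and string can be checked or re-executed before rendering, which is what makes the intermediate plan verifiable in a way free-form natural-language reasoning is not.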