Traditional image tokenization methods, by breaking down images into spatial patches, impose inherent limitations on capturing nuanced visual information. This often leads to a compromise between global structure and fine-grained detail.
Related startups
From Patches to Channels: A New Visual Language
A significant departure from conventional approaches is introduced by Channel-wise Vector Quantization (CVQ). Instead of assigning discrete tokens to feature vectors of image patches, CVQ quantizes each individual channel of a feature map. This fundamental shift allows an image to be represented as a composition of discrete visual detail levels, moving beyond a simple grid-based spatial decomposition. The authors demonstrate that CVQ achieves 100% codebook utilization even with a codebook size exceeding 16K, and substantially enhances reconstruction quality over prior methods.
Sequential Detail Refinement with CAR
Building upon CVQ, the researchers present a novel visual autoregressive framework called Channel-wise Autoregressive (CAR). This model operates on a 'next-channel prediction' principle, generating images by sequentially predicting channels. This process mimics a human artist's workflow, starting with a global structure and progressively refining finer attributes. Empirically, the CAR model achieves a DPG score of 86.7 and a GenEval score of 0.79, signaling its potent effectiveness in text-to-image generation tasks.