The prevailing paradigm for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) relies on textual or interleaved textual-visual reasoning. This work challenges that assumption and proposes using images as the sole medium in which the model reasons.
Optical Reasoning: Visualizing Thought Processes
The core idea, optical reasoning, is that images alone can carry a model's reasoning, with no textual rationale alongside them. The approach comes in two forms. Typographic-based optical reasoning arranges visual elements so that a rationale is displayed compactly. Graphical-based optical reasoning integrates text and graphics into a structured visual rationale. Both are intended as alternatives to text-centric reasoning.
Unlocking Unprecedented Efficiency and Performance
Evaluated across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning matches and often surpasses traditional text-based reasoning methods. It does so while using fewer tokens: an average reduction of 28.57% on language tasks and 16% on multimodal tasks, with token efficiency reported as 1.96 times that of text reasoning. This suggests that a well-structured visual rationale can be more compact than a lengthy textual explanation without losing effectiveness.
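To make the relationship between these figures concrete, here is a minimal sketch of the arithmetic. It assumes that token efficiency is the plain ratio of text-reasoning tokens to optical-reasoning tokens at equal quality; the summary above does not say how the metric is defined, so that definition and the function names `efficiency_ratio` and `reduction_for_ratio` are illustrative assumptions, not the authors' method.

```python
def efficiency_ratio(reduction: float) -> float:
    """Ratio of baseline tokens to reduced tokens for a fractional reduction.

    Assumption: efficiency is a pure token-count ratio (hypothetical definition).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_for_ratio(ratio: float) -> float:
    """Fractional token reduction that a given token-count ratio would imply."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


# Reported average reductions from the text above.
language_ratio = efficiency_ratio(0.2857)    # language tasks
multimodal_ratio = efficiency_ratio(0.16)    # multimodal tasks

# Reduction that a 1.96x pure token-count ratio would require.
implied_reduction = reduction_for_ratio(1.96)

print(f"language:   {language_ratio:.2f}x")
print(f"multimodal: {multimodal_ratio:.2f}x")
print(f"1.96x would need a {implied_reduction:.1%} reduction")
```

Under this assumed definition, the reported reductions correspond to roughly 1.40 times and 1.19 times fewer tokens, while a 1.96 times ratio would require a reduction of about 49%. The 1.96 figure therefore presumably rests on a different definition, for example one that also accounts for accuracy, which would need to be checked against the original work.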