UniDoc-RL: Finer-Grained Visual RAG

Large Vision-Language Models (LVLMs) that draw on external visual knowledge are often limited by generic retrieval signals, which miss the fine-grained semantics that complex reasoning depends on. To bridge this gap, researchers have introduced UniDoc-RL, a unified reinforcement learning framework.

Active Perception for Semantic Precision

UniDoc-RL reframes visual information acquisition as a sequential decision-making process. Its hierarchical action space lets the LVLM agent go beyond simple document retrieval: the agent progressively refines its visual evidence, starting with coarse-grained document retrieval and advancing to fine-grained image selection and active region cropping. This granular control lets the model suppress irrelevant content and focus on information-dense areas, a crucial step for accurate reasoning.
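The coarse-to-fine ordering described above can be illustrated with a small state machine. This is only a sketch under stated assumptions, not the framework's actual interface: the action names (`retrieve`, `select`, `crop`), the `EvidenceState` fields, the image ids, and the pixel box convention are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class EvidenceState:
    """Visual evidence gathered so far (all field names are illustrative)."""
    candidates: List[str] = field(default_factory=list)  # image ids from retrieved documents
    selected: Optional[str] = None                        # image chosen for closer inspection
    region: Optional[Box] = None                          # cropped region of the selected image
    history: List[str] = field(default_factory=list)      # actions taken, in order


def allowed_actions(state: EvidenceState) -> List[str]:
    """Finer-grained actions unlock only after the coarser step has run."""
    if not state.candidates:
        return ["retrieve"]
    if state.selected is None:
        return ["retrieve", "select"]
    return ["retrieve", "select", "crop"]


def step(state: EvidenceState, action: str, arg) -> EvidenceState:
    """Apply one action, rejecting anything that skips a level of the hierarchy."""
    if action not in allowed_actions(state):
        raise ValueError(f"action {action!r} is not available yet")
    if action == "retrieve":
        # A new retrieval invalidates any finer-grained choice made earlier.
        state.candidates, state.selected, state.region = list(arg), None, None
    elif action == "select":
        if arg not in state.candidates:
            raise ValueError(f"{arg!r} was not retrieved")
        state.selected, state.region = arg, None
    else:  # crop
        left, top, right, bottom = arg
        if not (left < right and top < bottom):
            raise ValueError("crop box must have positive width and height")
        state.region = (left, top, right, bottom)
    state.history.append(action)
    return state


state = EvidenceState()
step(state, "retrieve", ["page_3.png", "page_7.png"])
step(state, "select", "page_7.png")
step(state, "crop", (40, 120, 600, 380))
```

Gating each finer action on the coarser one is one simple way to encode the hierarchy; a learned policy could equally be allowed to re-retrieve or re-select at any point, which the sketch also permits.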

End-to-End Training with Dense Multi-Reward Supervision

Training such an agent requires supervision at every step. UniDoc-RL employs a dense multi-reward scheme that provides task-aware feedback for each action the agent takes. Coupled with Group Relative Policy Optimization (GRPO), this allows end-to-end training of the LVLM agent, aligning its behavior with multiple objectives without the need for a separate value network. A comprehensive dataset with fine-grained action annotations was also developed to support this training paradigm.
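To make the role of GRPO concrete: instead of a learned value network, GRPO scores each sampled rollout against the other rollouts drawn for the same query, normalizing rewards by the group mean and standard deviation. The sketch below shows that normalization together with a weighted sum of per-action rewards. The reward component names, the weights, and the sample values are assumptions for illustration, not the reward design reported for UniDoc-RL, and using the population standard deviation is likewise a choice made here.

```python
from statistics import mean, pstdev
from typing import Dict, List


def total_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-action rewards into one scalar (names and weights are hypothetical)."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    The group mean plays the role of the baseline, so no value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and rollouts for a single query.
weights = {"retrieval": 0.2, "selection": 0.2, "crop": 0.2, "answer": 0.4}
rollouts = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.8, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]
rewards = [total_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean receive positive advantages and are reinforced; those below it are discouraged. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.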

Outperforming State-of-the-Art in Visual Reasoning

Experimental results across three benchmarks show that UniDoc-RL consistently surpasses existing state-of-the-art baselines, with gains of up to 17.7% over prior RL-based methods. This improvement underscores the value of fine-grained visual retrieval and active perception in complex reasoning scenarios.

AI Daily Digest