UniDoc-RL: Finer-Grained Visual RAG

UniDoc-RL enhances LVLMs with fine-grained visual RAG via hierarchical RL, active perception, and multi-reward training, achieving state-of-the-art results.

Figure: Diagram of the UniDoc-RL framework, illustrating its hierarchical action space and information refinement. The framework proposes a hierarchical approach to visual RAG.

The efficacy of Large Vision-Language Models (LVLMs) is often capped by their reliance on generic retrieval signals for external visual knowledge. This approach fails to capture the fine-grained semantics critical for sophisticated reasoning tasks. To bridge this gap, researchers have introduced UniDoc-RL, a unified reinforcement learning framework designed to tackle these limitations head-on.

Active Perception for Semantic Precision

UniDoc-RL reframes visual information acquisition as a sequential decision-making process. Its hierarchical action space lets the LVLM agent progressively refine visual evidence: it starts with coarse-grained document retrieval, advances to fine-grained image selection, and finishes with active region cropping. This granular control enables the model to suppress irrelevant content and focus on information-dense areas, a crucial step for accurate reasoning.
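The coarse-to-fine refinement described above can be sketched as a short episode over a toy corpus. Everything here is illustrative: the action names, the nested document → image → region structure, and the keyword-overlap scorer (a stand-in for the LVLM policy that actually chooses each action) are assumptions, not the paper's implementation.

```python
from enum import Enum, auto

class Action(Enum):
    """Hypothetical coarse-to-fine action levels (names are illustrative)."""
    RETRIEVE_DOC = auto()
    SELECT_IMAGE = auto()
    CROP_REGION = auto()
    ANSWER = auto()

# Toy corpus: documents contain page images, which contain text regions.
CORPUS = {
    "doc_a": {
        "img_1": {"header": "quarterly revenue table", "footer": "page 1"},
        "img_2": {"chart": "stock price chart"},
    },
    "doc_b": {
        "img_3": {"body": "employee handbook"},
    },
}

def score(query, text):
    """Stand-in relevance score: keyword overlap (the real agent is an LVLM)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def episode(query):
    """One refinement episode: each step narrows the evidence set."""
    trace = []
    # Coarse: retrieve the most relevant document.
    doc = max(CORPUS, key=lambda d: max(
        score(query, t) for img in CORPUS[d].values() for t in img.values()))
    trace.append((Action.RETRIEVE_DOC, doc))
    # Finer: select the most relevant page image within that document.
    img = max(CORPUS[doc], key=lambda i: max(
        score(query, t) for t in CORPUS[doc][i].values()))
    trace.append((Action.SELECT_IMAGE, img))
    # Finest: crop the information-dense region, suppressing the rest.
    region = max(CORPUS[doc][img], key=lambda r: score(query, CORPUS[doc][img][r]))
    trace.append((Action.CROP_REGION, region))
    trace.append((Action.ANSWER, CORPUS[doc][img][region]))
    return trace

trace = episode("what was quarterly revenue")
for action, target in trace:
    print(action.name, "->", target)
```

Each step discards most of the remaining evidence, which is the point of the hierarchy: by the time the agent answers, only one information-dense region is in view.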

End-to-End Training with Dense Multi-Reward Supervision

Training such a complex agent necessitates sophisticated supervision. UniDoc-RL employs a dense multi-reward scheme that provides task-aware feedback for each action taken by the agent. Coupled with Group Relative Policy Optimization (GRPO), this approach allows for effective end-to-end training of the LVLM agent, aligning its behavior with multiple objectives without the need for a separate value network. A comprehensive dataset with fine-grained action annotations further supports this training paradigm.
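GRPO's key trick is to replace a learned value baseline with group statistics: sample several trajectories per query, score each with the reward, and normalize each reward against the group's mean and standard deviation. A minimal sketch follows; the reward components and their weights are assumptions standing in for the paper's dense multi-reward scheme.

```python
from statistics import mean, pstdev

def multi_reward(rollout):
    """Hypothetical task-aware rewards per action; weights are assumptions."""
    return (1.0 * rollout["retrieval_ok"]      # found the right document
            + 1.0 * rollout["crop_iou"]        # cropped region overlaps target
            + 2.0 * rollout["answer_correct"]) # final answer matches

def group_advantages(rollouts, eps=1e-8):
    """GRPO: normalize each rollout's reward against its group's statistics,
    so no separate value network is needed as a baseline."""
    rewards = [multi_reward(r) for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of four sampled trajectories for the same query.
group = [
    {"retrieval_ok": 1, "crop_iou": 0.8, "answer_correct": 1},
    {"retrieval_ok": 1, "crop_iou": 0.3, "answer_correct": 0},
    {"retrieval_ok": 0, "crop_iou": 0.0, "answer_correct": 0},
    {"retrieval_ok": 1, "crop_iou": 0.9, "answer_correct": 1},
]
advs = group_advantages(group)
print([round(a, 2) for a in advs])
```

Trajectories that beat their group's average get positive advantages and are reinforced; the rest are suppressed, which is how the policy's behavior is aligned with the multiple reward objectives at once.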

Outperforming State-of-the-Art in Visual Reasoning

Experimental results across three benchmarks demonstrate the superiority of UniDoc-RL. The framework consistently surpasses existing state-of-the-art baselines, achieving gains of up to 17.7% over prior RL-based methods. This significant improvement underscores the value of fine-grained visual retrieval and active perception in complex reasoning scenarios, positioning UniDoc-RL as a leading approach to visual RAG.

© 2026 StartupHub.ai. All rights reserved.