Large Vision-Language Models (LVLMs) typically acquire external visual knowledge through generic retrieval signals, and this reliance often caps their effectiveness: such signals fail to capture the fine-grained semantics that sophisticated reasoning tasks depend on. To bridge this gap, researchers have introduced UniDoc-RL, a unified reinforcement learning framework designed to address these limitations.
Active Perception for Semantic Precision
UniDoc-RL reframes visual information acquisition as a sequential decision-making process. Its hierarchical action space allows the LVLM agent to move beyond simple document retrieval: the agent progressively refines its visual evidence, starting with coarse-grained document retrieval and advancing to fine-grained image selection and active region cropping. This granular control lets the model actively suppress irrelevant content and focus on information-dense areas, a crucial step for accurate reasoning.
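To make the coarse-to-fine idea concrete, here is a minimal Python sketch of a three-level action sequence (retrieve, select, crop) applied to a toy corpus. It is an illustration only, not the UniDoc-RL implementation: the names `Level`, `Action`, `crop`, and `run_episode` are hypothetical, retrieval is reduced to substring matching on document names, and images are plain nested lists rather than real pixel data. No policy or reward is modeled; the sketch only shows how each action narrows the evidence produced by the previous one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

Image = List[List[int]]            # toy image: rows of pixel values
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), right/bottom exclusive


class Level(Enum):
    """Hypothetical action levels, ordered from coarse to fine."""
    RETRIEVE = 0   # coarse: gather candidate document images
    SELECT = 1     # finer: choose one image among the candidates
    CROP = 2       # finest: keep only a region of the chosen image


@dataclass
class Action:
    level: Level
    payload: object  # query string, candidate index, or crop box


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_episode(corpus: Dict[str, Image], actions: List[Action]) -> Image:
    """Apply the actions in order and return the final visual evidence.

    Assumption: a finer action is only valid after the coarser one it
    depends on, so selecting before retrieving (or cropping before
    selecting) raises ValueError.
    """
    candidates = None
    evidence = None
    for act in actions:
        if act.level is Level.RETRIEVE:
            # Stand-in for a real retriever: substring match on the name.
            candidates = [img for name, img in corpus.items()
                          if act.payload in name]
        elif act.level is Level.SELECT:
            if candidates is None:
                raise ValueError("SELECT requires a prior RETRIEVE")
            evidence = candidates[act.payload]
        elif act.level is Level.CROP:
            if evidence is None:
                raise ValueError("CROP requires a prior SELECT")
            evidence = crop(evidence, act.payload)
    return evidence


corpus = {
    "invoice_p1": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "memo_p1": [[0, 0], [0, 0]],
}
actions = [
    Action(Level.RETRIEVE, "invoice"),
    Action(Level.SELECT, 0),
    Action(Level.CROP, (1, 0, 3, 2)),
]
print(run_episode(corpus, actions))  # [[2, 3], [5, 6]]
```

In this toy run the evidence shrinks from the whole corpus, to one 3x3 image, to a 2x2 region. In a reinforcement learning setting the action list would be produced step by step by the agent's policy rather than fixed in advance.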