H Company has unveiled Holo2, a new family of large-scale Vision-Language Models (VLMs) engineered to power multi-domain GUI agents. These agents are designed to interpret, reason over, and act within real digital environments, including web, desktop, and mobile interfaces. Holo2 builds on the UI localization and screen understanding capabilities of its predecessor, Holo1.5, and moves beyond static perception to emphasize navigation and multi-step task execution. H Company reports significant advances in policy learning, action grounding, and cross-environment generalization.
Holo2 Models and Capabilities
The Holo2 series comprises four model sizes. Holo2-4B and Holo2-8B are fully open-sourced under the Apache 2.0 license. Holo2-30B-A3B and Holo2-235B-A22B are available under a research-only license; commercial licensing requires contacting H Company directly. H Company positions the models as reliable and efficient foundations for next-generation computer-use agents, such as its Surfer-H agent.

The models are fine-tuned from Qwen/Qwen3-VL-235B-A22B-Thinking. Training follows a multi-stage pipeline that uses proprietary data for UI understanding and action prediction, combined with open-source datasets, synthetic data, and human annotations. Supervised fine-tuning comes first, followed by online reinforcement learning with GRPO, which H Company says yields state-of-the-art performance.
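For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that group-relative normalization step. It is not H Company's training code: the function name, the example rewards, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (GRPO-style sketch).

    Each reward is centered on the group mean and scaled by the group
    standard deviation. `eps` (an assumed constant) avoids division by
    zero when every reward in the group is identical.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same GUI task
# (1.0 = task completed, 0.0 = task failed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative advantages. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.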
