---
license: cc-by-nc-4.0
language:
- en
base_model:
- Qwen/Qwen3-VL-30B-A3B-Thinking
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- agent
- pytorch
- computer use
- gui agents
---

# **Holo2: Foundational Models for Navigation and Computer Use Agents**

[Cookbook on GitHub](https://github.com/hcompai/hai-cookbook/tree/main/holo2)

## **Model Description**

**Holo2** represents the next major step in developing large-scale Vision-Language Models (VLMs) for **multi-domain GUI Agents**. These agents can operate real digital environments, specifically web, desktop, and mobile, by interpreting interfaces, reasoning over content, and executing actions.

Our **Holo2** family emphasizes **navigation and task execution** across diverse real and simulated environments, extending beyond static perception to **multi-step, goal-directed behavior**. It builds upon the strengths of **Holo1.5** in UI localization and screen content understanding, with major improvements in **policy learning**, **action grounding**, and **cross-environment generalization**.

The **Holo2** series comes in three model sizes:

- **Holo2-4B:** fully open under Apache 2.0
- **Holo2-8B:** fully open under Apache 2.0
- **Holo2-30B-A3B:** research-only license (non-commercial). For commercial use, please contact us.

These models are designed to provide reliable, accurate, and efficient foundations for next-generation computer-use (CU) agents, such as Surfer-H.

- **Developed by:** [**H Company**](https://www.hcompany.ai/)
- **Model type:** Vision-Language Model for Navigation and Computer Use Agents
- **Fine-tuned from model:** Qwen/Qwen3-VL-30B-A3B-Thinking
- **Blog Post:** https://www.hcompany.ai/blog/holo2
- **License:** CC BY-NC 4.0 (research-only, non-commercial)

## Get Started with the Model

Please have a look at the [cookbook](https://github.com/hcompai/hai-cookbook/tree/main/holo2) in our repo, where we provide examples for both self-hosting and API use! Minimal illustrative sketches of both are also included at the end of this card.

## **Training Strategy**

Our models are trained on high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, and human-annotated samples.

Training proceeds in two stages: large-scale supervised fine-tuning, followed by online reinforcement learning (GRPO), yielding SOTA performance in interpreting UIs and performing actions on large, complex screens. A generic illustration of GRPO's core computation appears at the end of this card.

## **Results**

### **Holo2: Navigation Performance**

Navigation evaluates an agent's ability to complete real or simulated tasks through multi-step reasoning and action. Holo2 models show significant improvements in navigation efficiency and task completion rates, particularly in unseen and complex environments.

Benchmarks include **WebVoyager**, **WebArena**, **OSWorld**, and **AndroidWorld**, testing the models' abilities across web, operating system, and mobile platforms.
All external model scores are reproduced internally within the Surfer 2 agent to allow for a fair comparison.
*Accuracy of our models and competitors' models on UI localization benchmarks.*
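
## **Usage Sketches**

For self-hosting, below is a minimal sketch of local inference with `transformers`. The repository ID `Hcompany/Holo2-30B-A3B`, the screenshot path, and the instruction are illustrative placeholders; the cookbook documents the exact model IDs and the recommended prompt format.

```python
# Minimal local-inference sketch with transformers.
# Assumption: the checkpoint is published under an ID like "Hcompany/Holo2-30B-A3B"
# (placeholder; see the cookbook) and its processor chat template accepts
# interleaved image/text messages, as for Qwen3-VL-based models.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Hcompany/Holo2-30B-A3B"  # placeholder repo ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",  # select bf16/fp16 automatically where supported
    device_map="auto",   # shard across available GPUs
)

# A GUI screenshot plus a natural-language instruction for the agent.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("screenshot.png")},
            {"type": "text", "text": "Click the 'Sign in' button."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```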
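
For API use, a model served behind an OpenAI-compatible endpoint (for example, with vLLM) can be queried with the standard `openai` client. The base URL, API key, and model name below are placeholders; defer to the cookbook for the supported serving setup.

```python
# Sketch of API use against an OpenAI-compatible server (endpoint and
# model name are placeholders, e.g. a local vLLM deployment).
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the screenshot as a data URL so it can travel in the request body.
with open("screenshot.png", "rb") as f:
    image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Hcompany/Holo2-30B-A3B",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Click the 'Sign in' button."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```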
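
Finally, as background on the training recipe above, the sketch below illustrates the group-relative advantage computation that gives GRPO its name. It is a generic illustration of the algorithm, not H Company's training code: rewards from a group of rollouts sampled for the same task are normalized against the group's own statistics instead of a learned value baseline.

```python
# Generic illustration of GRPO's group-relative advantage (not the actual
# Holo2 training code): each rollout's reward is normalized by the mean and
# standard deviation of its group, replacing a learned value-function baseline.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per rollout of one task.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Five rollouts of the same task: three succeeded (reward 1), two failed.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])))
```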