---
license: cc-by-nc-4.0
language:
- en
base_model:
- Qwen/Qwen3-VL-30B-A3B-Thinking
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- agent
- pytorch
- computer use
- gui agents
---

# **Holo2: Foundational Models for Navigation and Computer Use Agents**

[Cookbook on GitHub](https://github.com/hcompai/hai-cookbook/tree/main/holo2)

## **Model Description**

**Holo2** represents the next major step in developing large-scale Vision-Language Models (VLMs) for **multi-domain GUI Agents**. These agents can operate real digital environments, specifically web, desktop, and mobile, by interpreting interfaces, reasoning over content, and executing actions.

Our **Holo2** family emphasizes **navigation and task execution** across diverse real and simulated environments, extending beyond static perception to **multi-step, goal-directed behavior**. It builds upon the strengths of **Holo1.5** in UI localization and screen content understanding, with major improvements in **policy learning**, **action grounding**, and **cross-environment generalization**.

The **Holo2** series comes in three model sizes:

- **Holo2-4B:** fully open under Apache 2.0
- **Holo2-8B:** fully open under Apache 2.0
- **Holo2-30B-A3B:** research-only license (non-commercial). For commercial use, please contact us.

These models are designed to provide reliable, accurate, and efficient foundations for next-generation computer-use (CU) agents, such as Surfer-H.

- **Developed by:** [**H Company**](https://www.hcompany.ai/)
- **Model type:** Vision-Language Model for Navigation and Computer Use Agents
- **Fine-tuned from model:** Qwen/Qwen3-VL-30B-A3B-Thinking
- **Blog Post:** https://www.hcompany.ai/blog/holo2
- **License:** CC BY-NC 4.0 (research-only, non-commercial)

## Get Started with the Model

Please have a look at the [cookbook](https://github.com/hcompai/hai-cookbook/tree/main/holo2) in our repo, where we provide examples for both self-hosting and API use! Minimal illustrative sketches of both are also included at the end of this card.

## **Training Strategy**

Our models are trained on high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, and human-annotated samples.

Training proceeds in two stages: large-scale supervised fine-tuning, followed by online reinforcement learning (GRPO), yielding SOTA performance in interpreting UIs and performing actions on large, complex screens. A generic illustration of GRPO's core computation appears at the end of this card.

## **Results**

### **Holo2: Navigation Performance**

Navigation evaluates an agent's ability to complete real or simulated tasks through multi-step reasoning and action. Holo2 models show significant improvements in navigation efficiency and task completion rates, particularly in unseen and complex environments.

Benchmarks include **WebVoyager**, **WebArena**, **OSWorld**, and **AndroidWorld**, testing the models' abilities across web, operating system, and mobile platforms.
All external model scores are reproduced internally within the Surfer 2 agent to allow for a fair comparison.
*Accuracy of our models and competitors' models on UI localization benchmarks.*
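
## **Usage Sketches**

For self-hosting, below is a minimal sketch of local inference with `transformers`. The repository ID `Hcompany/Holo2-30B-A3B`, the screenshot path, and the instruction are illustrative placeholders; the cookbook documents the exact model IDs and the recommended prompt format.

```python
# Minimal local-inference sketch with transformers.
# Assumption: the checkpoint is published under an ID like "Hcompany/Holo2-30B-A3B"
# (placeholder; see the cookbook) and its processor chat template accepts
# interleaved image/text messages, as for Qwen3-VL-based models.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Hcompany/Holo2-30B-A3B"  # placeholder repo ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",  # select bf16/fp16 automatically where supported
    device_map="auto",   # shard across available GPUs
)

# A GUI screenshot plus a natural-language instruction for the agent.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("screenshot.png")},
            {"type": "text", "text": "Click the 'Sign in' button."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```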
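
For API use, a model served behind an OpenAI-compatible endpoint (for example, with vLLM) can be queried with the standard `openai` client. The base URL, API key, and model name below are placeholders; defer to the cookbook for the supported serving setup.

```python
# Sketch of API use against an OpenAI-compatible server (endpoint and
# model name are placeholders, e.g. a local vLLM deployment).
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the screenshot as a data URL so it can travel in the request body.
with open("screenshot.png", "rb") as f:
    image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Hcompany/Holo2-30B-A3B",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Click the 'Sign in' button."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```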
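
Finally, as background on the training recipe above, the sketch below illustrates the group-relative advantage computation that gives GRPO its name. It is a generic illustration of the algorithm, not H Company's training code: rewards from a group of rollouts sampled for the same task are normalized against the group's own statistics instead of a learned value baseline.

```python
# Generic illustration of GRPO's group-relative advantage (not the actual
# Holo2 training code): each rollout's reward is normalized by the mean and
# standard deviation of its group, replacing a learned value-function baseline.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per rollout of one task.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Five rollouts of the same task: three succeeded (reward 1), two failed.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])))
```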