GlimpsePrune: Dynamic Visual Token Pruning for Large Vision-Language Models
This repository contains the GlimpsePrune model, a dynamic visual token pruning framework for Large Vision-Language Models (LVLMs), as presented in the paper [A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models](https://arxiv.org/abs/2508.01548).
Abstract from the paper: Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
Official Code: https://github.com/HVision-NKU/GlimpsePrune
GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.
✨ Key Features
- High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead (a rough token-count illustration follows this list).
- Robust Performance: Maintains stable performance on high-resolution images and on complex free-form VQA tasks.
- Lightweight Training: Only a small number of extra parameters (the glimpse token and VIP) need to be trained, and training completes in less than 1 hour on a single A100 GPU.
- Broad Compatibility: Supports single- and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark against other mainstream visual compression methods.
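As a back-of-the-envelope illustration of what such a pruning rate means in practice, the sketch below estimates retained token counts for a hypothetical high-resolution input. The per-image token count assumes Qwen2.5-VL-style 14-pixel patches with 2×2 spatial merging (one visual token per 28×28 pixel block); the image size and resulting numbers are illustrative assumptions, not measurements from the paper.

```python
# Rough illustration of a ~92.6% pruning rate in token counts (assumptions noted above).
image_side = 1344                      # hypothetical high-resolution input (pixels per side)
tokens_per_side = image_side // 28     # 48 visual tokens per side (14-px patches, 2x2 merge)
visual_tokens = tokens_per_side ** 2   # 2304 visual tokens before pruning
prune_rate = 0.926                     # average pruning rate reported in the paper
kept_tokens = round(visual_tokens * (1 - prune_rate))
print(f"{visual_tokens} visual tokens -> {kept_tokens} kept after pruning")
# 2304 visual tokens -> 170 kept after pruning
```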
🖼️ Framework Overview
The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP), which together quickly identify and retain the visual regions most relevant to the text prompt and prune the remaining redundant information.
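To make the idea concrete, here is a minimal, self-contained sketch of query-conditioned token pruning: each visual token is scored against a glimpse vector, and only tokens whose score clears a threshold are kept, so the number of retained tokens adapts to the input. This is a conceptual toy, not the VIP module or training procedure from the paper; the class, projection layer, and threshold are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyVisualTokenPruner(nn.Module):
    """Conceptual sketch of glimpse-guided pruning (not the paper's VIP module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score_proj = nn.Linear(dim, dim)  # stand-in for a lightweight importance predictor

    def forward(self, visual_tokens: torch.Tensor, glimpse: torch.Tensor):
        # visual_tokens: (num_tokens, dim); glimpse: (dim,), a query-dependent summary vector
        scores = self.score_proj(visual_tokens) @ glimpse   # relevance score per visual token
        keep_mask = torch.sigmoid(scores) > 0.5             # per-token decision, so the kept count adapts to the input
        return visual_tokens[keep_mask], keep_mask

# Toy usage: 2304 visual tokens of dimension 64; the number kept varies with the input.
pruner = ToyVisualTokenPruner(dim=64)
kept, mask = pruner(torch.randn(2304, 64), torch.randn(64))
print(f"kept {kept.size(0)} of {mask.numel()} visual tokens")
```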
📊 Performance Results
GlimpsePrune was evaluated on multiple VQA benchmarks. The results demonstrate its ability to achieve a high pruning rate while maintaining performance on par with the original model, often outperforming other visual compression methods.
Free-form VQA Benchmarks
Short-form VQA Benchmarks
▶️ How to Use
You can use GlimpsePrune through the `transformers_gp` package provided in the GitHub repository:
```python
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info

from transformers_gp.models.qwen2_5_vl import (
    Qwen2_5_VL_GP_ForConditionalGeneration,
    Qwen2_5_VL_GP_Processor,
)

# Load the base model and processor, then attach the trained GlimpsePrune modules
base_model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
new_model_name = "ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VL_GP_ForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": "cuda:0"},
)
processor = Qwen2_5_VL_GP_Processor.from_pretrained(base_model_name)
model.load_new_modules(new_model_name)  # load the glimpse token and VIP weights
model.eval()

# Prepare messages (image and text input)
question = "What kind of a tie is the groom wearing?"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "../examples/people.png",  # Placeholder: replace with your image path
            },
            {"type": "text", "text": question},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate output
model.reset_image_tokens_cache()  # NOTE: reset the cache before inference
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=1024, do_selection=True)  # do_selection=True enables glimpse pruning

# Decode and print the response
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(f"User: {question}\nAssistant: {output_text[0]}")
```
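If you want a rough sense of how much pruning saves, the snippet below (reusing `model` and `inputs` from the example above) compares peak GPU memory with pruning enabled and disabled. It assumes that passing `do_selection=False` runs the unpruned baseline; treat it as a sketch and check the repository if the flag behaves differently.

```python
import torch

def peak_memory_gb(do_selection: bool) -> float:
    # Measure peak GPU memory for one generation pass; reuses `model` and `inputs` from above.
    torch.cuda.reset_peak_memory_stats(model.device)
    model.reset_image_tokens_cache()
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=64, do_selection=do_selection)
    return torch.cuda.max_memory_allocated(model.device) / 1024**3

print(f"with pruning:    {peak_memory_gb(True):.2f} GB peak")
print(f"without pruning (assumed do_selection=False): {peak_memory_gb(False):.2f} GB peak")
```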
For more detailed usage, including local Gradio demos, evaluation, and training, please refer to the official GitHub repository and its tutorial notebook.
🖋️ Citation
If you find our work helpful, please consider citing our paper:
```bibtex
@misc{zeng2025glimpseprune,
      title={A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models},
      author={Quan-Sheng Zeng and Yunheng Li and Qilong Wang and Peng-Tao Jiang and Zuxuan Wu and Ming-Ming Cheng and Qibin Hou},
      year={2025},
      eprint={2508.01548},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.01548},
}
```