NyxKrage committed
Commit 2fa4720 · verified · 1 Parent(s): c466d58

Update README.md

Files changed (1): README.md (+31 -22)

README.md CHANGED
@@ -4,24 +4,21 @@ pipeline_tag: image-text-to-text
 license: other
 ---
 
- # Moondream 3 HF
 
- **Moondream 3 HF** is a reimplementation of the [Moondream 3 (Preview)](https://huggingface.co/moondream/moondream3-preview) model using the standard Hugging Face Transformers architecture conventions.
 
 ## Overview
 
 - Multimodal vision-language model with a mixture-of-experts (MoE) text backbone
 - Architecture and weights correspond to Moondream 3 (Preview) (approximately 9B parameters, 2B active)
 - Implemented as standard Transformers components:
- - `Moondream3ForConditionalGeneration`
- - `Moondream3Model`, `Moondream3TextModel`, `Moondream3VisionModel`
- - `Moondream3Processor`, `Moondream3ImageProcessor`, `Moondream3Config`
 
 The purpose of this repository is to make Moondream 3 interoperable with the Hugging Face ecosystem so it can be used directly with the Transformers API, including `generate()`, `Trainer`, and PEFT integrations.
 
 ## Example usage
 
- Example for running multimodal inference with the `moondream3-hf` implementation:
 
 ```python
 import torch
@@ -30,8 +27,8 @@ from transformers import AutoProcessor, AutoModelForCausalLM
 
 DEVICE="cuda:0"
 
- model = AutoModelForCausalLM.from_pretrained("NyxKrage/moondream3-hf", dtype="bfloat16", device_map=DEVICE, trust_remote_code=True)
- processor = AutoProcessor.from_pretrained("NyxKrage/moondream3-hf", use_fast=False, trust_remote_code=True)
 
 
 image1 = Image.open("image1.jpg")
@@ -60,33 +57,45 @@ with torch.no_grad():
 
 The `chat_template` uses Hugging Face’s Jinja format and accepts either a single string or a sequence of messages (`user [, assistant]`).
 
 ### Training
 
 The model can be trained using `trl` and supports `peft` and `bitsandbytes` out of the box.
 The repo also includes an implementation that replaces the MoE layers with a grouped_gemm implementation adapted from [github:woct0rdho/transformers-qwen3-moe-fused](https://github.com/woct0rdho/transformers-qwen3-moe-fused); it can be used by importing `Moondream3ForConditionalGeneration` from `modeling_moondream3_fusedmoe.py` instead.
 
- ## Prompting modes
-
- The chat template supports multiple task types via text prefixes:
-
- | Mode    | Template prefix               | Example input                             |
- | ------- | ----------------------------- | ----------------------------------------- |
- | Query   | `query:`                      | `query: What is happening in this image?` |
- | Caption | `caption:[short/normal/long]` | `caption: long`                           |
- | Detect  | `detect:`                     | `detect: dog`                             |
- | Point   | `point:`                      | `point: red car`                          |
-
- If no prefix is provided, the default mode is `query:`.
-
- This reimplementation aims to provide interoperability and ease of experimentation within the Hugging Face ecosystem. It is not an official release.
 
 ## License
 
 The model weights remain under the **Business Source License 1.1 with an Additional Use Grant (No Third-Party Service)**, identical to the original [Moondream 3 Preview license](https://huggingface.co/moondream/moondream3-preview/blob/main/LICENSE.md).
 
- This allows research, personal, and most commercial use, but prohibits offering hosted or resold access that competes with M87 Labs’ paid services.
-
 All new implementation code in this repository is released under the **Apache 2.0 License**.
 
 ## Credits
 
@@ -4,24 +4,21 @@ pipeline_tag: image-text-to-text
 license: other
 ---
 
+ # Moondream 3 Preview HF
 
+ **Moondream 3 Preview HF** is a reimplementation of the [Moondream 3 (Preview)](https://huggingface.co/moondream/moondream3-preview) model using the standard Hugging Face Transformers architecture conventions.
 
 ## Overview
 
 - Multimodal vision-language model with a mixture-of-experts (MoE) text backbone
 - Architecture and weights correspond to Moondream 3 (Preview) (approximately 9B parameters, 2B active)
 - Implemented as standard Transformers components:
 
 The purpose of this repository is to make Moondream 3 interoperable with the Hugging Face ecosystem so it can be used directly with the Transformers API, including `generate()`, `Trainer`, and PEFT integrations.
 
 ## Example usage
 
+ Example for running multimodal inference with the `moondream3-preview-hf` implementation:
 
 ```python
 import torch
@@ -30,8 +27,8 @@ from transformers import AutoProcessor, AutoModelForCausalLM
 
 DEVICE="cuda:0"
 
+ model = AutoModelForCausalLM.from_pretrained("NyxKrage/moondream3-preview-hf", dtype="bfloat16", device_map=DEVICE, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained("NyxKrage/moondream3-preview-hf", use_fast=False, trust_remote_code=True)
 
 
 image1 = Image.open("image1.jpg")
 
@@ -60,33 +57,45 @@ with torch.no_grad():
 
 The `chat_template` uses Hugging Face’s Jinja format and accepts either a single string or a sequence of messages (`user [, assistant]`).
 
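Concretely, the two accepted input forms might look like the sketch below. The role/content-part layout is an assumption based on the common Transformers chat schema; the exact field names this checkpoint's `chat_template` expects are not shown in the diff.

```python
# Two prompt forms per the README: a single string, or a user message
# optionally followed by an assistant message. The role/content-part
# layout below is an assumed, typical Transformers chat schema; this
# checkpoint's exact schema may differ.

single = "query: What is happening in this image?"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "query: What is happening in this image?"},
        ],
    },
    # An optional assistant turn can prefill the start of the answer:
    {"role": "assistant", "content": [{"type": "text", "text": "The image shows"}]},
]

roles = [m["role"] for m in messages]
print(roles)  # ['user', 'assistant']
```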
+ ## Prompting modes
+
+ The chat template supports multiple task types via text prefixes:
+
+ | Mode    | Template prefix                | Example input                              |
+ | ------- | ------------------------------ | ------------------------------------------ |
+ | Query   | `query:`                       | `query: What is happening in this image?`  |
+ | Reason  | `reason:`                      | `reason: What is happening in this image?` |
+ | Caption | `caption: [short/normal/long]` | `caption: long`                            |
+ | Detect  | `detect:`                      | `detect: dog`                              |
+ | Point   | `point:`                       | `point: red car`                           |
+
+ If no prefix is provided, the default mode is `caption:normal`. `reason:` is the same as `query:`, but makes the model reason before providing a final answer.
+
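As a sketch, the prefix convention in the table can be applied programmatically. The helper below is hypothetical; only the mode names and prefix format come from the table.

```python
# Hypothetical helper that builds a prompt string from a mode prefix,
# following the "mode: text" convention in the table above.
def build_prompt(mode: str, text: str = "") -> str:
    modes = {"query", "reason", "caption", "detect", "point"}
    if mode not in modes:
        raise ValueError(f"unknown mode: {mode}")
    return f"{mode}: {text}".strip()

print(build_prompt("detect", "dog"))    # detect: dog
print(build_prompt("caption", "long"))  # caption: long
```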
+ ## Output format
+
+ For `query:` and `caption:` prompts, the model behaves like a standard Hugging Face causal language model and returns token IDs to be decoded normally.
+
+ For `detect:` prompts, the model does not return text. Instead, it produces a floating-point tensor of shape `[batch_size, max_detections, 4]`. Each non-zero entry represents a bounding box in normalized coordinates `[x_min, y_min, x_max, y_max]`. Zero rows indicate padding.
+
+ For `point:` prompts, the model likewise returns structured coordinates rather than text. The output is a floating-point tensor of shape `[batch_size, max_points, 2]`, where each non-zero row is a point `[x, y]` in normalized image coordinates. Zero rows again represent padding.
+
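A minimal sketch of consuming these structured outputs, using plain nested lists in place of tensors. The zero-row padding and normalized `[x_min, y_min, x_max, y_max]` conventions come from the description above; the helper names are hypothetical.

```python
# Post-process one batch item of detect: output: drop zero-row padding,
# then scale normalized boxes to pixel coordinates.

def strip_padding(rows):
    """Drop all-zero padding rows from one batch item."""
    return [r for r in rows if any(v != 0.0 for v in r)]

def to_pixels(box, width, height):
    """Scale a normalized [x_min, y_min, x_max, y_max] box to pixels."""
    x0, y0, x1, y1 = box
    return [x0 * width, y0 * height, x1 * width, y1 * height]

# One batch item with max_detections = 3, of which 2 are real boxes.
detections = [
    [0.10, 0.20, 0.50, 0.80],
    [0.60, 0.10, 0.90, 0.40],
    [0.00, 0.00, 0.00, 0.00],  # padding row
]

boxes = [to_pixels(b, width=640, height=480) for b in strip_padding(detections)]
print(boxes)  # [[64.0, 96.0, 320.0, 384.0], [384.0, 48.0, 576.0, 192.0]]
```

The same padding-stripping step applies to `point:` outputs, whose rows are `[x, y]` pairs instead of boxes.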
 ### Training
 
 The model can be trained using `trl` and supports `peft` and `bitsandbytes` out of the box.
 The repo also includes an implementation that replaces the MoE layers with a grouped_gemm implementation adapted from [github:woct0rdho/transformers-qwen3-moe-fused](https://github.com/woct0rdho/transformers-qwen3-moe-fused); it can be used by importing `Moondream3ForConditionalGeneration` from `modeling_moondream3_fusedmoe.py` instead.
 
+ ## Limitations
 
+ All images within a batch must yield the same number of vision crops. To ensure this, it is recommended to resize every image to the same resolution before preprocessing; otherwise, differences in aspect ratio or size can cause mismatched crop counts and prevent batching.
 
+ Only a single action type is supported per batch. Every prompt in the batch must use the same mode prefix (such as `query:`, `caption:`, `point:`, or `detect:`). Mixing different modes in a single batch is not allowed.
 
+ Because both the visual crops and text inputs must align across examples, batches must be homogeneous in both image structure and task type. Inputs that differ in crop count or prompt mode should be run separately or grouped into compatible batches.
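Under these constraints, a mixed workload can be partitioned into compatible batches before inference. The helper below is a hypothetical sketch that keys batches on mode prefix and image size (a proxy for crop count, assuming images of equal resolution yield equal crops).

```python
# Hypothetical batching helper: group (prompt, image_size) requests so
# that each batch shares one mode prefix and one image size, per the
# limitations above. Images are represented here by (width, height).
from collections import defaultdict

def group_requests(requests):
    """Group (prompt, image_size) pairs by (mode prefix, image size)."""
    batches = defaultdict(list)
    for prompt, size in requests:
        mode = prompt.split(":", 1)[0]
        batches[(mode, size)].append((prompt, size))
    return dict(batches)

requests = [
    ("query: What is shown here?", (640, 480)),
    ("detect: dog", (640, 480)),
    ("query: Describe the scene.", (640, 480)),
]
groups = group_requests(requests)
print(sorted(groups))  # [('detect', (640, 480)), ('query', (640, 480))]
```

Each resulting group can then be preprocessed and run as one homogeneous batch.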
 
 ## License
 
 The model weights remain under the **Business Source License 1.1 with an Additional Use Grant (No Third-Party Service)**, identical to the original [Moondream 3 Preview license](https://huggingface.co/moondream/moondream3-preview/blob/main/LICENSE.md).
 
 All new implementation code in this repository is released under the **Apache 2.0 License**.
 
 ## Credits