---
library_name: transformers
license: apache-2.0
pipeline_tag: feature-extraction
base_model:
- nomic-ai/nomic-embed-code
tags:
- llmcompressor
- quantized
- FP8
---
# nomic-embed-code-FP8-dynamic
## Model Overview
- **Model Architecture:** Qwen2Model
- **Input:** Text
- **Output:** Embedding vector
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Release Date:** 09/06/2025
- **Version:** 1.0
- **Model Developers:** duydq12 (enhanced by RedHatAI)
### Model Optimizations
This model was obtained by quantizing the activations and weights of [nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
Only the weights and activations of the linear operators within the transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
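
The scheme described above can be illustrated with a small pure-Python sketch. It is not the llm-compressor implementation: it assumes the E4M3 variant of FP8 (largest finite value 448), and the helper names `round_to_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical. Each row gets its own symmetric scale. For a weight matrix a row is one output channel and the scale is computed once and stored (static). For activations a row is one token and the scale is recomputed on every forward pass (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exp = math.frexp(x)  # |x| = m * 2**exp with 0.5 <= m < 1
    # 3 mantissa bits give 8 steps per power of two; values below 2**-6
    # are subnormal and share one fixed step of 2**-9.
    step = 2.0 ** (max(exp - 1, -6) - 3)
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row (channel or token)."""
    q_rows, scales = [], []
    for row in matrix:
        # Map the largest magnitude in the row onto the FP8 maximum.
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0
        scales.append(scale)
        q_rows.append([round_to_e4m3(v / scale) for v in row])
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate real values from FP8 codes and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Weights: rows are output channels; scales are computed once, offline.
weights = [[0.02, -0.5, 0.13], [1.5, 0.25, -3.0]]
w_q, w_scales = quantize_rows(weights)
w_hat = dequantize_rows(w_q, w_scales)

# Activations: rows are tokens; scales are recomputed for every input.
activations = [[0.7, -1.2, 0.05]]
a_q, a_scales = quantize_rows(activations)

print(w_hat)
```

With 3 mantissa bits, the relative rounding error of any value in the normal range stays below about 6.25%, which is why a per-row scale that keeps values away from the subnormal range matters.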
## Deployment
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```python
import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    """Prefix a query with its task instruction."""
    return f'{task_description}: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Represent this query for searching relevant code'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity'),
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
input_texts = queries + documents

model = LLM(model="duydq12/nomic-embed-code-FP8-dynamic", task="embed")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
# Similarity matrix: one row per query, one column per document
scores = embeddings[:2] @ embeddings[2:].T
print(scores.tolist())
```
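
The `scores` matrix in the example holds one row per query and one column per document. Raw dot products equal cosine similarities only when the embeddings have unit length, and whether the returned vectors are normalized depends on the pooling configuration. The sketch below normalizes explicitly and then ranks documents for each query. It uses toy 3-dimensional vectors in place of real embeddings, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_scores(query_vecs, doc_vecs):
    """Return a queries-by-documents matrix of cosine similarities."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return [[sum(a * b for a, b in zip(qv, dv)) for dv in d] for qv in q]


def rank_documents(score_row):
    """Return document indices sorted from most to least similar."""
    return sorted(range(len(score_row)), key=lambda i: score_row[i], reverse=True)


# Toy vectors standing in for real embeddings
toy_queries = [[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]]
toy_docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 1.0]]

toy_scores = cosine_scores(toy_queries, toy_docs)
best_doc_per_query = [rank_documents(row)[0] for row in toy_scores]
print(best_doc_per_query)
```

To apply this to the example above, pass `embeddings[:2].tolist()` and `embeddings[2:].tolist()` as the query and document vectors.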
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
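
As a minimal sketch of the serving path, assume a server has been started with something like `vllm serve duydq12/nomic-embed-code-FP8-dynamic --task embed`. That flag mirrors the `task="embed"` argument used above and may differ between vLLM versions. The client below then posts to the OpenAI-style `/v1/embeddings` endpoint using only the standard library. The base URL and the helper names `build_embedding_request` and `embed` are assumptions for illustration.

```python
import json
import urllib.request

MODEL = "duydq12/nomic-embed-code-FP8-dynamic"
BASE_URL = "http://localhost:8000/v1"  # assumed local server address


def build_embedding_request(texts, model=MODEL, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible embeddings endpoint."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, **kwargs):
    """Send the request and return one embedding per input, in input order."""
    with urllib.request.urlopen(build_embedding_request(texts, **kwargs)) as resp:
        body = json.load(resp)
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With the server running, `embed(["def add(a, b): return a + b"])` would return a list containing one embedding vector. Queries still need the instruction prefix shown in the offline example.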
## Creation
<details>
<summary>Creation details</summary>
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "nomic-ai/nomic-embed-code"
model_name = model_stub.split("/")[-1]
model = AutoModelForCausalLM.from_pretrained(model_stub, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_stub, torch_dtype="auto", device_map="auto")

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
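
One way to sanity-check a saved checkpoint is to look at the quantization metadata in its `config.json`. The sketch below assumes the compressed-tensors layout, in which a `quantization_config` entry contains `config_groups` with `weights` and `input_activations` settings. These key names are an assumption about the saved format and should be checked against the actual file. `read_quantization_config` and `summarize` are hypothetical helpers.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization_config dict from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config")


def summarize(quant_config):
    """Produce one line per quantized tensor role in each config group."""
    lines = []
    groups = (quant_config or {}).get("config_groups", {})
    for name, group in groups.items():
        for role in ("weights", "input_activations"):
            args = group.get(role) or {}
            lines.append(
                f"{name}.{role}: bits={args.get('num_bits')} "
                f"strategy={args.get('strategy')} dynamic={args.get('dynamic')}"
            )
    return lines
```

Running `print("\n".join(summarize(read_quantization_config("nomic-embed-code-FP8-dynamic"))))` against the directory saved above should, under these assumptions, show a per-channel static entry for the weights and a per-token dynamic entry for the input activations.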
## Evaluation
Evaluation results are private.
### Accuracy
Accuracy results are private.