Commit 2b701bb (parent: ed45863): Upload README.md (Add a README file)

README.md  ADDED  @@ -0,0 +1,93 @@

---
tags:
- llama-2
- gptq
- quantization
- code
model-index:
- name: Llama-2-7b-4bit-GPTQ-python-coder
  results: []
license: gpl-3.0
language:
- code
datasets:
- iamtarun/python_code_instructions_18k_alpaca
pipeline_tag: text-generation
library_name: transformers
---

# Llama 2 7B 4-bit GPTQ Python Coder 👩‍💻

This model is the **GPTQ quantization of my Llama 2 7B 4-bit Python Coder**. The base model is available [here](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k).

The quantization parameters for the GPTQ algorithm (see the sketch after this list) are:
- 4-bit quantization
- Group size of 128
- Calibration dataset: C4
- Descending activation order (`desc_act`) set to False

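As a rough illustration of how these settings map onto code, the sketch below quantizes the fine-tuned base model with the GPTQ integration in 🤗 Transformers (via `optimum` and AutoGPTQ). It is not the script used to produce this checkpoint; the package setup and output directory are assumptions.

```py
# Illustrative GPTQ quantization sketch (not the original script for this checkpoint).
# Assumes `optimum` and `auto-gptq` are installed alongside `transformers`.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model_id = "edumunozsala/llama-2-7b-int4-python-code-20k"  # fine-tuned FP16 model

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# The parameters listed above: 4-bit, group size 128, C4 calibration data, desc_act=False
gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=False,
    tokenizer=tokenizer,
)

# GPTQ quantizes the model module by module while it is being loaded,
# so the full-precision weights never have to sit on the GPU all at once.
quantized_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Hypothetical output path for the quantized weights
quantized_model.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
tokenizer.save_pretrained("llama-2-7b-int4-GPTQ-python-code-20k")
```

As the article quoted below notes, the calibration pass over C4 can take upwards of an hour on a consumer GPU.
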

## Model Description

[Llama 2 7B 4-bit Python Coder](https://huggingface.co/edumunozsala/llama-2-7b-int4-python-code-20k) is a fine-tuned version of the Llama 2 7B model, trained with QLoRA in 4-bit using the [PEFT](https://github.com/huggingface/peft) library and bitsandbytes.

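For context, a QLoRA setup of that kind typically looks like the minimal sketch below. The base checkpoint, LoRA hyperparameters, and target modules shown here are illustrative assumptions, not the values used to train the base model.

```py
# Minimal QLoRA setup sketch; hyperparameters are illustrative, not the ones used for this model.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base Llama 2 7B weights quantized to 4-bit with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; only these small low-rank matrices are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```
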
## Quantization

A quick definition, taken from a great Medium article by Benjamin Marie, ["GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2"](https://medium.com/towards-data-science/gptq-or-bitsandbytes-which-quantization-method-to-use-for-llms-examples-with-llama-2-f79bc03046dc) (for Medium subscribers only):

*"GPTQ (Frantar et al., 2023) was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size. GPTQ can lower the weight precision to 4-bit or 3-bit.
In practice, GPTQ is mainly used for 4-bit quantization. 3-bit has been shown very unstable (Dettmers and Zettlemoyer, 2023). It quantizes without loading the entire model into memory. Instead, GPTQ loads and quantizes the LLM module by module.
Quantization also requires a small sample of data for calibration which can take more than one hour on a consumer GPU."*

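To see the size reduction the quote describes in practice, the memory footprints of the two checkpoints can be compared once they are loaded. This is an illustrative check (assuming both models fit in the available memory and that `optimum`/`auto-gptq` are installed), not a benchmark reported for this card:

```py
# Illustrative footprint comparison between the GPTQ checkpoint and the FP16 base model.
# Numbers depend on the environment; this is not a reported benchmark.
import torch
from transformers import AutoModelForCausalLM

gptq_model = AutoModelForCausalLM.from_pretrained(
    "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k", device_map="auto"
)
fp16_model = AutoModelForCausalLM.from_pretrained(
    "edumunozsala/llama-2-7b-int4-python-code-20k",
    torch_dtype=torch.float16,
    device_map="auto",
)

print(f"GPTQ footprint: {gptq_model.get_memory_footprint() / 1e9:.2f} GB")
print(f"FP16 footprint: {fp16_model.get_memory_footprint() / 1e9:.2f} GB")
```
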
### Example of usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading a GPTQ checkpoint through transformers requires the optimum and auto-gptq packages
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

instruction = "Write a Python function to display the first and last elements of a list."
input_text = ""  # optional extra input for the task (empty here)

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input_text}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()

# Generation only; no gradients are needed
with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)

print(f"Prompt:\n{prompt}\n")
print(f"Generated response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
```

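The same prompt can also be run through the `text-generation` pipeline, reusing the model and tokenizer loaded above. This variant is a suggested alternative, not part of the original example:

```py
# Optional variant: generate with the transformers text-generation pipeline,
# reusing the `model`, `tokenizer`, and `prompt` defined in the example above.
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = generator(prompt, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
# The pipeline returns the full text (prompt + completion); strip the prompt to keep only the code
print(result[0]["generated_text"][len(prompt):])
```
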
### Citation

```
@misc{edumunozsala_2023,
  author    = { {Eduardo Muñoz} },
  title     = { llama-2-7b-int4-GPTQ-python-coder },
  year      = 2023,
  url       = { https://huggingface.co/edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k },
  publisher = { Hugging Face }
}
```