Great quantized model

#1
by CHNtentes - opened

Much better than the other AWQ quant that has more downloads :)

I achieve ~70 tokens/s on 4x H20.
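For anyone wanting to reproduce, a minimal vLLM setup along these lines should work. This is just a sketch: the repo id is a placeholder for whichever AWQ quant you pulled, and the sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id -- substitute the actual AWQ quant you downloaded.
llm = LLM(
    model="cyankiwi/GLM-4.5-AWQ",
    tensor_parallel_size=4,  # one shard per H20
    # For AWQ checkpoints, vLLM normally auto-detects the quantization
    # scheme from the checkpoint's quantization_config, so no extra flag
    # should be needed here.
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(out[0].outputs[0].text)
```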

Is there any method to enable MTP for the AWQ model? I copied mtp.safetensors from the original BF16 model, but the accept rate was always 0...
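In case it helps with debugging: an accept rate stuck at 0 could mean the draft weights were never actually loaded, so a quick sanity check is to dump the tensor names and dtypes in the copied file with the safetensors library. A minimal sketch (the path is wherever your copied mtp.safetensors sits):

```python
from safetensors import safe_open

# Path to the mtp.safetensors copied over from the BF16 checkpoint.
path = "mtp.safetensors"

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}")
```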

cyankiwi org

Thank you for your kind words. This quant's great quality could not have been achieved without the llm-compressor team's continuous improvements.

I am working on enabling MTP layers for GLM AWQ models, and it is mostly done. However, there are some compatibility issues with vLLM that are preventing a release.

Last time I checked, vLLM did not support MTP for the GLM series yet, but SGLang did. However, using this quant with SGLang gives me frequent crashes.
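For context, a typical SGLang MTP launch looks roughly like the sketch below, following the EAGLE-style speculative flags documented for the BF16 GLM releases. The repo id is a placeholder, the speculative values are illustrative, and this assumes the offline Engine API accepts the same arguments as the launch_server CLI.

```python
import sglang as sgl

# Placeholder repo id -- substitute the AWQ quant being discussed here.
llm = sgl.Engine(
    model_path="cyankiwi/GLM-4.5-AWQ",
    tp_size=4,
    # GLM's MTP draft layers are driven through SGLang's EAGLE-style
    # speculative decoding; these values mirror the BF16 release docs.
    speculative_algorithm="EAGLE",
    speculative_num_steps=3,
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

out = llm.generate("Hello", {"temperature": 0.7, "max_new_tokens": 64})
print(out["text"])
llm.shutdown()
```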

cyankiwi org

Thank you for letting me know. I have not tried GLM MTP with SGLang yet. Hopefully it will work with my experimental MTP AWQ quant.
