Great quantized model

#1
by CHNtentes - opened

Much better than the other AWQ quant that has more downloads :)

I achieve ~70 tokens/s on 4x H20.
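For anyone wanting to reproduce, a minimal vLLM setup along these lines should work. This is just a sketch: the repo id is a placeholder for whichever AWQ quant you pulled, and the sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id -- substitute the actual AWQ quant you downloaded.
llm = LLM(
    model="cyankiwi/GLM-4.5-AWQ",
    tensor_parallel_size=4,  # one shard per H20
    # For AWQ checkpoints, vLLM normally auto-detects the quantization
    # scheme from the checkpoint's quantization_config, so no extra flag
    # should be needed here.
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(out[0].outputs[0].text)
```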

Is there any method to enable MTP for the AWQ model? I copied mtp.safetensors from the original BF16 model, but the accept rate was always 0...
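In case it helps with debugging: an accept rate stuck at 0 could mean the draft weights were never actually loaded, so a quick sanity check is to dump the tensor names and dtypes in the copied file with the safetensors library. A minimal sketch (the path is wherever your copied mtp.safetensors sits):

```python
from safetensors import safe_open

# Path to the mtp.safetensors copied over from the BF16 checkpoint.
path = "mtp.safetensors"

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}")
```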

cyankiwi org

Thank you for your kind words. This quant's great quality could not have been achieved without the llm-compressor team's continuous improvements.

I am working on enabling MTP layers for GLM AWQ models, and it is mostly done. However, there are some compatibility issues with vLLM that are preventing a release.

Last time I checked, vLLM did not support MTP for the GLM series yet, but SGLang did. However, using this quant with SGLang gives me frequent crashes.
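For context, a typical SGLang MTP launch looks roughly like the sketch below, following the EAGLE-style speculative flags documented for the BF16 GLM releases. The repo id is a placeholder, the speculative values are illustrative, and this assumes the offline Engine API accepts the same arguments as the launch_server CLI.

```python
import sglang as sgl

# Placeholder repo id -- substitute the AWQ quant being discussed here.
llm = sgl.Engine(
    model_path="cyankiwi/GLM-4.5-AWQ",
    tp_size=4,
    # GLM's MTP draft layers are driven through SGLang's EAGLE-style
    # speculative decoding; these values mirror the BF16 release docs.
    speculative_algorithm="EAGLE",
    speculative_num_steps=3,
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

out = llm.generate("Hello", {"temperature": 0.7, "max_new_tokens": 64})
print(out["text"])
llm.shutdown()
```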

cyankiwi org

Thank you for letting me know. I have not tried GLM MTP with SGLang yet. Hopefully it will work with my experimental MTP AWQ quant.
