Is it true llama.cpp always de-quantizes GGUF to FP16?
I read that llama.cpp has no option to convert Q5, Q6 or even Q8, etc. to a GPU-compatible FP8 format. This means small quantized models always get de-quantized to FP16. Doesn't this negate any speed advantage in using a lower precision model? This doesn't seem to be the case with e4m3 .safetensors models, like the ones from Kijai, where the weights can be processed directly without de-quantization. Am I correct in thinking this way or is there more to it?
Hey, yeah, lower quants don't necessarily mean higher speed. Q8, for example, runs slower than F16 because of the overhead from de-quantization. GGUFs aren't really meant to speed things up in the usual sense; they're mostly about lowering VRAM use. If the model is too big to fit into VRAM at full precision, a quant can give a decent speedup simply because it no longer spills over into system RAM (;
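To put some rough numbers on the VRAM part, here's a quick back-of-the-envelope sketch. The 14B parameter count is just an example, and the Q8_0 size assumes the usual block layout of 32 int8 weights sharing one FP16 scale:

```python
# Rough weight-only VRAM estimate; activations, KV cache/latents etc. come on top.
params = 14e9  # hypothetical 14B-parameter model

bytes_per_weight = {
    "FP16": 2.0,
    "Q8_0": 1.0 + 2.0 / 32,  # 32 int8 values + one FP16 scale per block (assumed layout)
    "FP8":  1.0,
}

for fmt, b in bytes_per_weight.items():
    print(f"{fmt:>5}: ~{params * b / 1e9:.1f} GB of weights")
```

So the Q8 quant and an FP8 checkpoint land in roughly the same place memory-wise; the difference is what happens before the math runs.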
Thanks. I figured as much. It seems GGUF uses a cleverer quantization scheme than e4m3/e5m2, but the format in which the GGUF weights are stored isn't directly usable by the GPU. Based on what I read, this slows things down in two ways: there's an extra de-quantize step to get to FP16, and the GPU then has to process FP16 even if it could handle native FP8. e4m3 may not score as well on perplexity as Q8, but the weights are stored in an 8-bit format the GPU can consume directly.
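Just to make that "extra step" concrete, here's a minimal numpy sketch of what Q8_0-style de-quantization amounts to, as I understand it. The block size of 32 and the per-block FP16 scale are my assumptions about the format, and llama.cpp obviously does this in optimized kernels rather than Python:

```python
import numpy as np

BLOCK = 32  # assumed Q8_0-style block: 32 int8 weights sharing one FP16 scale

def quantize_q8_0(w: np.ndarray):
    """Quantize FP16 weights into int8 values plus one FP16 scale per block."""
    blocks = w.reshape(-1, BLOCK).astype(np.float32)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)
    q = np.round(blocks / safe).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """The extra step: expand int8 back to FP16 before the matmuls run."""
    return (q.astype(np.float32) * scales.astype(np.float32)).astype(np.float16)

w = np.random.randn(4096 * BLOCK).astype(np.float16)
q, s = quantize_q8_0(w)
w_back = dequantize_q8_0(q, s)
print("max abs error:", float(np.abs(w.astype(np.float32) - w_back.astype(np.float32)).max()))
```

With a native FP8 format like e4m3/e5m2, that second function simply isn't needed: the stored bytes are already a floating-point type the hardware can multiply.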
Despite all this, I did some tests with Wan 2.2 using both types of models (Q8 and e5m2) on a laptop with a mobile 5090 and didn't really notice a performance difference. That may come down to memory bandwidth rather than actual number crunching, since the mobile 5090 can only push about half of what the desktop version can (something like ~800 GB/s). It could be a different story on GPUs with faster memory.
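For what it's worth, the napkin math I had in mind looks something like this; the parameter count and bandwidth figures are rough assumptions rather than measurements:

```python
# If a step is dominated by streaming the weights from VRAM, two formats that
# are both ~1 byte per weight cost about the same, whether the GPU then
# computes in native FP8 or de-quantizes to FP16 first.
params = 14e9                    # hypothetical parameter count
weight_gb = params * 1.0 / 1e9   # ~1 byte/weight for both Q8 and e5m2

for gpu, bw in [("mobile 5090 (~800 GB/s)", 800), ("desktop (~1600 GB/s)", 1600)]:
    t_ms = weight_gb / bw * 1000
    print(f"{gpu}: ~{t_ms:.1f} ms just to read the weights once")
```

If reading the weights dominates each step, the FP8-vs-FP16 compute difference gets hidden behind the memory time, which would explain why Q8 and e5m2 looked the same on my laptop.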