Can someone please explain the quantisation method differences:
- AWQ
- GPTQ
- llamacpp GGUF quantisation (sorry I do not know the quantisation technique name)
As far as I have researched, only a limited set of AI backends support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.
So:
What exactly is the quantisation difference between the above techniques?
Among other things, GPTQ, GGUF's K-quants, and bitsandbytes FP4 are relatively "easy" quantization methods. Not to discount them… they are very sophisticated, but models can be quantized very quickly with them.
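For intuition, the core of "easy" blockwise quantization is just storing one float scale per small block of weights plus low-bit integers. Here's a toy sketch of that idea; it is not real GGUF/Q4_K_M code (actual K-quants use super-blocks, per-block mins, and packed storage), and the block size and rounding rule are simplifying assumptions:

```python
# Toy blockwise 4-bit round-to-nearest quantization.
# NOT the real GGUF/K-quant format; just the basic scale+int idea.

def quantize_block(block):
    """Map a block of floats to signed 4-bit ints plus one float scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0] * len(block), 0.0
    scale = amax / 7.0                       # signed 4-bit range is [-8, 7]
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.31, 0.02, -0.97, 0.44, 0.08, -0.26]
q, s = quantize_block(weights)
recon = dequantize_block(q, s)
# Round-to-nearest keeps the error within half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

No calibration data, no GPU: each block is processed independently in one pass, which is why these formats quantize a model in minutes.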
EXL2 and AWQ are much more intense. You feed them profiling (calibration) data, text you want the quantization optimized towards. The quantization takes forever and requires a lot of GPU, but the quantized weights you get out of them are very VRAM efficient.
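One concrete thing the profiling data buys you: AWQ, roughly, scales up the input channels that the calibration text shows are most important, so they lose less precision to rounding, and undoes the scaling on the activation side. A toy sketch of that idea follows; the exponent, the numbers, and the helper names are my assumptions, not the real AWQ algorithm (which searches for the scales over calibration data):

```python
# Toy activation-aware scaling (AWQ-inspired sketch, not the real method).

def fake_quant(w, step):
    """Round-to-nearest 4-bit quantize/dequantize with a fixed step."""
    return max(-8, min(7, round(w / step))) * step

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

weights   = [0.9, 0.02, 0.03]   # channel 1 has a tiny weight...
act_means = [0.1, 5.0, 0.1]     # ...but huge activations (it is salient)

# Naive: one step for the whole row; the salient tiny weight rounds to 0.
step = max(abs(w) for w in weights) / 7.0
w_naive = [fake_quant(w, step) for w in weights]
y_exact = dot(weights, act_means)
err_naive = abs(dot(w_naive, act_means) - y_exact)

# AWQ-style: scale salient input channels up, activations down,
# so the math is unchanged but rounding hurts salient channels less.
alpha = 0.5                                  # assumed exponent
s = [m ** alpha for m in act_means]
w_scaled = [w * si for w, si in zip(weights, s)]
x_scaled = [x / si for x, si in zip(act_means, s)]

step2 = max(abs(w) for w in w_scaled) / 7.0
w_q = [fake_quant(w, step2) for w in w_scaled]
err_awq = abs(dot(w_q, x_scaled) - y_exact)  # smaller than err_naive
```

That per-channel search over real text is where the GPU time goes, and why these quants take so much longer to produce than round-to-nearest formats.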
Yeah, EXL2 is awesome. It’s kinda black magic how GPUs that were released way before ChatGPT was a twinkle in anyone’s eyes can run something that can trade blows with it. I still don’t get how fractional bpw is even possible. What the hell, 2.55 bits man 😂 how does it even run after that to any degree? It’s magic, that’s what it is.
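For what it's worth, fractional bpw isn't sub-integer storage per weight; it's an average. The quantizer gives different layers or blocks different integer bit-widths (guided by measured error on the calibration data), and the target like 2.55 is the weighted mean. A toy sketch with made-up numbers:

```python
# Fractional bits-per-weight as a mixture of integer-width blocks.
# Made-up split; real EXL2 also pays overhead for scales and metadata.

def average_bpw(assignments):
    """assignments: list of (num_weights, bits_per_weight) pairs."""
    total_bits = sum(n * b for n, b in assignments)
    total_weights = sum(n for n, _ in assignments)
    return total_bits / total_weights

# 45% of weights at 2 bits, 55% at 3 bits -> 2.55 bpw on average
mix = [(45_000, 2), (55_000, 3)]
print(average_bpw(mix))  # 2.55
```

So every individual weight is still an ordinary 2-bit or 3-bit integer; the "magic" is in deciding which layers can get away with fewer bits.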