Quantisation techniques difference?

No-Belt7582 · 1 year ago

Quantisation techniques difference?

mcmoose1900 · 1 year ago

Among other things, GPTQ, GGUF’s K-Quants, and bitsandbytes FP4 are relatively “easy” quantization. Not to discount them… They are very sophisticated, but models can be quantized very quickly with them.

EXL2 and AWQ are much more intense. You feed them profiling data: text you want to use as a reference, so the quantization gets optimized towards it. The quantization takes forever and requires a lot of GPU, but the quantized weights you get out of them are very VRAM efficient.
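To give a rough feel for what "optimizing towards reference data" can mean, here is a toy sketch. It is not the actual EXL2 or AWQ algorithm. It assumes a single weight vector, random stand-in calibration inputs, and a made-up search over clipping scales, and it picks the scale that minimizes output error on the calibration data instead of just using the largest weight. All function names here are hypothetical.

```python
import random

def quantize(weights, scale, bits=4):
    """Round each weight to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out

def output_error(weights, q_weights, calib_inputs):
    """Squared error between original and quantized outputs on the calibration inputs."""
    err = 0.0
    for x in calib_inputs:
        ref = sum(a * b for a, b in zip(x, weights))
        got = sum(a * b for a, b in zip(x, q_weights))
        err += (ref - got) ** 2
    return err

def naive_scale(weights, bits=4):
    """The 'easy' choice: stretch the grid to cover the largest weight."""
    return max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)

def calibrated_scale(weights, calib_inputs, bits=4, steps=50):
    """Try progressively tighter scales and keep the one with the lowest output error."""
    base = naive_scale(weights, bits)
    best_scale = base
    best_err = output_error(weights, quantize(weights, base, bits), calib_inputs)
    for i in range(1, steps):
        s = base * (1.0 - 0.7 * i / steps)  # shrink the scale down to about 0.3x
        err = output_error(weights, quantize(weights, s, bits), calib_inputs)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err

# Stand-in data: random weights and random "calibration" inputs (assumptions).
random.seed(0)
weights = [random.gauss(0, 1) for _ in range(64)]
calib_inputs = [[random.gauss(0, 1) for _ in range(64)] for _ in range(32)]

naive_err = output_error(weights, quantize(weights, naive_scale(weights)), calib_inputs)
scale, calib_err = calibrated_scale(weights, calib_inputs)
print(f"naive error: {naive_err:.4f}, calibrated error: {calib_err:.4f}")
```

The search starts from the naive scale, so the calibrated result can never be worse on the calibration data. The extra loop over candidates and inputs is also where the extra quantization time comes from, even in this toy version.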

Dead_Internet_Theory · 1 year ago

Yeah, EXL2 is awesome. It’s kinda black magic how GPUs that were released way before ChatGPT was a twinkle in anyone’s eyes can run something that can trade blows with it. I still don’t get how fractional bpw is even possible. What the hell, 2.55 bits man 😂 how does it even run after that to any degree? It’s magic, that’s what it is.
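One way to make sense of a figure like 2.55 bpw, as far as I understand it: no single weight is stored in 2.55 bits. Different parts of the model get different whole-number bit widths, and the headline number is the average. A minimal sketch of the arithmetic, with made-up group sizes and bit choices (not EXL2's real allocation, and ignoring overhead such as scales):

```python
def average_bpw(groups):
    """groups: list of (num_weights, bits_per_weight) pairs."""
    total_bits = sum(n * b for n, b in groups)
    total_weights = sum(n for n, _ in groups)
    return total_bits / total_weights

# Hypothetical mix: 45% of the weights at 2 bits, 55% at 3 bits.
groups = [
    (45_000_000, 2),
    (55_000_000, 3),
]
print(average_bpw(groups))  # (90M + 165M) bits / 100M weights = 2.55
```

So a fractional target is mostly a budgeting question: which weights get the extra bits.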