Can someone please explain the quantisation method differences:
- AWQ
- GPTQ
- llama.cpp GGUF quantisation (sorry, I do not know the name of the quantisation technique)

As far as I have researched, there are only a few AI backends that support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.

So:
What exactly is the difference between the above quantisation techniques?

  • mcmoose1900B · 10 months ago

    Among other things, GPTQ, GGUF’s K-quants, and bitsandbytes FP4 are relatively “easy” quantization methods. Not to discount them… they are very sophisticated, but models can be quantized very quickly with them.
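
    To make “easy” concrete, here is a rough numpy sketch of the round-to-nearest block idea these formats build on. The block size and scale rule here are illustrative, not the actual Q4_K_M or FP4 layout; the point is that it is a single pass over the weights with no calibration data, which is why it finishes so fast.

    ```python
    # Rough sketch of round-to-nearest block quantization, the basic idea
    # the "easy" formats build on. Block size and scale rule are
    # illustrative; real Q4_K_M / FP4 add nested scales, block minimums,
    # and bit packing.
    import numpy as np

    BLOCK = 32  # weights per block (llama.cpp block sizes are in this ballpark)

    def quantize_block(w):
        """Map one float block to 4-bit ints plus a single float scale."""
        scale = np.abs(w).max() / 7.0  # largest weight maps to +/-7
        if scale == 0.0:
            scale = 1.0                # all-zero block: any scale works
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_block(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).normal(size=BLOCK).astype(np.float32)
    q, scale = quantize_block(w)
    print("worst-case error:", np.abs(w - dequantize_block(q, scale)).max())
    ```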

    EXL2 and AWQ are much more intense. You feed them calibration data: text you want to use as a reference so the quantization is optimized towards it. Quantization takes forever and requires a lot of GPU, but the quantized weights you get out of them are very VRAM-efficient.
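
    Here is a toy sketch of that calibration idea, loosely in the spirit of AWQ. The salience rule, exponent, and group size are illustrative guesses; real AWQ grid-searches the scaling per layer and folds the inverse scale into the preceding op rather than keeping float weights around.

    ```python
    # Toy AWQ-flavoured calibration: scale up weight columns whose input
    # activations are large, quantize, scale back down, so "salient"
    # channels keep more effective precision. Illustrative only.
    import numpy as np

    GROUP = 16  # input channels per quantization group (real AWQ uses larger)

    def quantize_rtn(w):
        """Round-to-nearest 4-bit quantization, one scale per group."""
        out_dim, in_dim = w.shape
        g = w.reshape(out_dim, in_dim // GROUP, GROUP)
        step = np.abs(g).max(axis=-1, keepdims=True) / 7.0
        step[step == 0] = 1.0
        return (np.clip(np.round(g / step), -8, 7) * step).reshape(w.shape)

    def quantize_awq_like(w, calib_x, alpha=0.5):
        """Use calibration activations to protect important channels."""
        s = np.abs(calib_x).mean(axis=0) ** alpha  # per-input-channel scale
        s[s == 0] = 1.0
        return quantize_rtn(w * s) / s

    rng = np.random.default_rng(0)
    w = rng.normal(size=(16, 64)).astype(np.float32)   # one small linear layer
    x = rng.normal(size=(512, 64)).astype(np.float32)  # "calibration text"
    x[:, :4] *= 10                                     # a few salient channels

    for name, wq in [("plain RTN", quantize_rtn(w)),
                     ("AWQ-like ", quantize_awq_like(w, x))]:
        err = np.abs(x @ w.T - x @ wq.T).mean()
        print(f"{name}: mean output error {err:.4f}")
    ```

    On this toy layer the AWQ-like variant should come out with visibly lower output error than plain round-to-nearest, because the few salient channels dominate the output and keep more effective precision.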

    • Dead_Internet_TheoryB · 10 months ago

      Yeah, EXL2 is awesome. It’s kinda black magic how GPUs released way before ChatGPT was a twinkle in anyone’s eye can run something that trades blows with it. I still don’t get how fractional bpw is even possible. What the hell, 2.55 bits, man 😂 how does it even run at all after that? It’s magic, that’s what it is.
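
      For what it’s worth, the fractional number is an average, not a per-weight thing: EXL2 quantizes different matrices (and slices within them) at different bit widths, chosen against calibration error, and the mix averages out to the headline bpw. A back-of-the-envelope illustration with made-up numbers:

      ```python
      # Hypothetical per-matrix bit allocation for a 4096-dim model; the
      # specific choices are invented, but mixing widths like this is how
      # the *average* bits-per-weight comes out fractional.
      matrices = [
          ("attn.q_proj", 4096 * 4096,  2.0),   # (name, n_weights, bits)
          ("attn.k_proj", 4096 * 4096,  2.0),
          ("attn.v_proj", 4096 * 4096,  3.0),   # more sensitive -> more bits
          ("mlp.up",      4096 * 11008, 2.5),
          ("mlp.down",    4096 * 11008, 3.0),
      ]
      bits    = sum(n * b for _, n, b in matrices)
      weights = sum(n for _, n, _ in matrices)
      print(f"average: {bits / weights:.2f} bpw")  # -> average: 2.60 bpw
      ```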