So, it was bothering me a bit that the only metric people really had to understand the ‘loss’ of quantization objectively was perplexity.

So, after hacking with koboldcpp’s sampler code to force output probabilities for a predetermined sequence so that I can make a fair comparison…

Mistral 7b Avg Quantization Differences

Ta-da!

These are the popular GGUF quantizations of Mistral 7b, compared to the fp16 base model, as measured by KL divergence. What I’m measuring is how closely each quant’s output probabilities match the fp16 model’s, over a predetermined sequence of ~350 tokens of Wikipedia text.
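
For anyone who wants to see what that comparison looks like in code, here is a minimal sketch in plain Python. It is not the koboldcpp hack itself: the function names are made up for illustration, and it assumes you already have the raw logits from both models at every position of the same forced sequence. It also assumes the fp16 distribution is the reference P in KL(P || Q), which is the usual convention but isn’t spelled out above.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats. P = reference (assumed fp16), Q = quantized."""
    # Terms where p is 0 contribute nothing; q is clamped to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def per_token_kl(ref_logits, quant_logits):
    """One KL value per position of the forced sequence."""
    return [
        kl_divergence(softmax(r), softmax(q))
        for r, q in zip(ref_logits, quant_logits)
    ]

# Toy example with a 3-token vocabulary and 2 positions (made-up numbers).
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.8, 1.1, 0.2], [0.4, 0.9, 2.5]]
kls = per_token_kl(ref, quant)
avg_kl = sum(kls) / len(kls)
```

Averaging `kls` over the whole sequence gives one number per quant, which is the kind of value the list below reports (after rescaling).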

This means (if we adapt the scale for readability):

  • fp16 = ~0 measured KL change from original probabilities (cause it’s the original)
  • Q8_0 = ~0.06 avg. measured KL change from original probabilities
  • Q6_K = ~0.1 avg. measured KL change from original probabilities
  • Q5_K_M = ~0.3 avg. measured KL change from original probabilities
  • Q4_K_M = ~1.0 avg. measured KL change from original probabilities
  • Q3_K_M = ~3.7 avg. measured KL change from original probabilities
  • Q2_K = ~8.2 avg. measured KL change from original probabilities

“Average difference” obscures the bigger problem with low quantization, though. If many tokens are easily predictable or predetermined no matter which quant you use, they barely change and pull the average down. So what happens if, out of the ~350 tokens of text I tested on, we pick the highest reported KL divergence for each quantization and graph that?

Now it becomes clear how big the gap can be for ‘difficult’ tokens!

To make the differences less extreme, let’s take the top ~5% of tokens most affected by quantization for each quant, and graph that.

https://preview.redd.it/3baou5l9mv1c1.png?width=1324&format=png&auto=webp&s=afc4ff00c6b4e14cc86f322e9ccae887bd23b91c

So, if we average solely over the top 5% of tokens that were ‘most affected’ by quantization (we do that to exclude the ‘obvious’ tokens), the scale is significantly more dramatic.
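
The three views above (overall average, single worst token, top ~5% average) are just different summaries of the same list of per-token KL values. A rough sketch, assuming you have that list for one quant; `summarize_kl` and the sample numbers are hypothetical, not my actual script or data:

```python
def summarize_kl(per_token_kl, top_fraction=0.05):
    """Summarize per-token KL values: mean, max, and mean of the worst slice."""
    ranked = sorted(per_token_kl, reverse=True)  # most affected tokens first
    k = max(1, round(len(ranked) * top_fraction))  # size of the top ~5% slice
    top = ranked[:k]
    return {
        "mean": sum(ranked) / len(ranked),
        "max": ranked[0],
        "top_mean": sum(top) / k,
    }

# Made-up values: 19 'obvious' tokens and one 'difficult' token.
example = [0.1] * 19 + [2.0]
stats = summarize_kl(example)
```

With mostly ‘obvious’ tokens, `mean` stays small while `max` and `top_mean` expose the hard tokens, which is why the later graphs look so much steeper.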

I’ll be updating this post with 13b soon enough. I’d also do it for 70b, but since I’m on 12GB VRAM, measuring would be extremely slow as it’d go into the pagefile for every single quant. Is this the part where I should shill a kofi or something?

I hope this helps the sub understand how much quantization really impacts models in a somewhat more objective sense.

  • kpodkanowiczB
    10 months ago

    You are on fire. This is yet another great post. Btw, I changed the perplexity scripts to only measure responses after the instruction, using, for example, the evol dataset. The preset is configured according to the model. I got completely different results than with normal perplexity. Interestingly, when running code instructions on a normal model and, for instance, roleplay instructions on a coding model, not only is perplexity around 1 vs. 3, but they also degrade differently.

  • CardAnarchistB
    10 months ago

    Hi there, you seem like the man to ask on this somewhat related topic to the OP,

    I’ve recently found out that models output different results based on the number of layers loaded into GPU. I’ve been told that more layers loaded in = better output.

    How does the loss associated with layers not in GPU compare to the loss, say, between quants?

      • CardAnarchistB
        10 months ago

        I thought it odd myself. So much so that I thought SillyTavern was bugged but that wasn’t the case.

        It’s pretty easy to test yourself. Just use Koboldcpp to load in, say, 31 layers, generate some output on seed 1, then restart Koboldcpp with 30 layers.

        Example of 31 layers of a 7B vs 30 layers on the same seed.

        Each seed seems to behave the same way if the layer counts are close enough: the output starts exactly the same before branching off.

        It’s worth mentioning that the person who told me the quality was “better” with more layers loaded in simply said it was as far as he recalled.