Quantizing 70b models to 4-bit, how much does performance degrade?

ae_dataviz · 2 years ago

Quantizing 70b models to 4-bit, how much does performance degrade?

semicausal · 2 years ago

In my experience, as you go to lower bit widths, the model:

- hallucinates more (one time I asked Llama2 what made the sky blue and it freaked out and generated thousands of similar questions line by line)

- is more likely to give you an inaccurate response even when it doesn’t hallucinate

- is significantly more unreliable and non-deterministic (seriously, providing the same prompt can cause different answers!)

At the bottom of this post, I compare the two extremes, 2-bit and 8-bit, of the Code Llama Instruct model using the same prompt, so you can see how it played out: https://about.xethub.com/blog/comparing-code-llama-models-locally-macbook
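
For a rough feel of why lower bit widths hurt, here is a toy sketch of plain symmetric round-to-nearest quantization on made-up Gaussian "weights". It is an illustration only: the helper names, the weight distribution, and the single per-tensor scale are all assumptions, and real quantizers (GPTQ, llama.cpp k-quants, etc.) use per-group scales and smarter rounding, so their error is lower than this.

```python
import random

def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax     # one scale for the whole tensor (assumption)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # hypothetical layer, not real model weights

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean abs error = {err:.6f}")
```

The reconstruction error grows as the bit width shrinks. How much of that error shows up as worse answers depends on the model and the quantization method, which this toy does not measure.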

a_beautiful_rhind · 2 years ago

70b 4bit will eat those small models for breakfast.

Secret_Joke_2262 · 2 years ago

A friend told me that for 70b when using q4, performance drops by 10%. The larger the model, the less it suffers from weight quantization.

Herr_Drosselmeyer · 2 years ago

It’s a rule of thumb that, yes, a higher-parameter model at low quant beats a lower-parameter model at high quant (or no quant), but take it with a grain of salt: you may still prefer a lower-parameter model that’s better tuned for your task.
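
The comparison usually comes up because the two options need a similar amount of memory. Here is a back-of-the-envelope sketch of weight size only, assuming exactly `bits_per_weight` bits per parameter. The `weight_gib` helper is made up for illustration. Real quantized files are somewhat larger because they also store scales and keep some tensors at higher precision, and the KV cache and activations are not counted here.

```python
def weight_gib(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Illustrative model sizes and precisions, not benchmarks
configs = [
    ("70B at 4-bit", 70, 4),
    ("13B at 16-bit", 13, 16),
    ("7B at 16-bit", 7, 16),
]

for name, params, bits in configs:
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")
```

So a 4-bit 70B lands in the same ballpark as an unquantized 13B in terms of weight memory, which is why the "bigger model at lower quant" trade-off comes up at all.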