The title, pretty much.
I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at FP16. Would be great to get some insights from the community.
A friend told me that with q4 quantization, a 70B model's performance drops by 10%. The larger the model, the less it suffers from weight quantization.
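For context on why this comparison comes up at all, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone in each setup. The helper name `weight_memory_gb` is made up for illustration, and it assumes exactly 4 or 16 bits per weight. Real 4-bit formats usually store extra scale data, so the effective size is somewhat higher, and the KV cache and activations are not counted. It says nothing about output quality.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores quantization scales, KV cache and activations.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# (label, parameters in billions, bits per weight)
configs = [
    ("70B @ 4-bit", 70, 4),
    ("34B @ FP16", 34, 16),
    ("13B @ FP16", 13, 16),
    ("7B @ FP16", 7, 16),
]

for label, params, bits in configs:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
```

By this estimate the 4-bit 70B weights take roughly half the space of a 34B model at FP16, and somewhat more than a 13B model at FP16, so the memory budgets are in the same ballpark.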