Why are there quantized models on the Hugging Face Hub?

Motylde · 1 year ago

Why are there quantized models on the Hugging Face Hub?

vasileer · 1 year ago

File size, which affects download and load time:

With load_in_4bit, it downloads and parses the full-size checkpoint (about 4x bigger than the 4-bit version if it is bfloat16, or about 8x bigger if it is float32) and then quantizes on the fly.

With pre-quantized files, it downloads only the quantized weights, so expect roughly 4x to 8x less data to fetch and parse, and a correspondingly faster load, for 4-bit quants.
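
To make the 4x and 8x figures concrete, here is a rough back-of-the-envelope sketch. The 7B parameter count is a hypothetical example, and `weight_size_gb` is a helper made up for illustration, not part of any library. It counts raw weight bits only; real quantized files also store scales and other metadata, so actual ratios are usually somewhat lower.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
num_params = 7e9

sizes = {
    "float32": weight_size_gb(num_params, 32),
    "bfloat16": weight_size_gb(num_params, 16),
    "4-bit": weight_size_gb(num_params, 4),
}

for name, size in sizes.items():
    # How many times larger this format is than the 4-bit weights.
    ratio = size / sizes["4-bit"]
    print(f"{name:>8}: {size:5.1f} GB ({ratio:.0f}x the 4-bit size)")
```

Download time scales with these sizes, which is why a pre-quantized file is quicker to fetch than a full-precision checkpoint that is quantized after loading.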