Hi. I’m using Llama-2 for my project in Python with the transformers library. There is an option to enable quantization on any normal model:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,  # quantize the weights to 4-bit on the fly at load time (requires bitsandbytes)
)

If it’s just a matter of a single flag, and nothing is recomputed, why are there so many already-quantized models on the Hub? Are they better than adding this one line?
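For context, my understanding is that the flag is shorthand for passing a quantization_config; the sketch below spells it out with BitsAndBytesConfig, using nf4 settings I’ve seen in examples. Either way, the full-precision weights are downloaded and quantized on the fly at load time:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit form of the same on-the-fly quantization (settings are illustrative)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.float16, # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)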

  • mcmoose1900B · 10 months ago

    Many reasons:

    • AutoModelForCausalLM with on-the-fly bitsandbytes quantization is extremely slow compared to other backends/quantization formats, even with augmentations like BetterTransformer.

    • It also uses much more VRAM than other quantization formats, especially at long context.

    • Its size is inflexible: you only get 4-bit or 8-bit, with no in-between bits-per-weight options.

    • It loads more slowly, since the full-precision checkpoint has to be downloaded and quantized every time.

    • There is no CPU offloading, unlike some other backends (see the sketch after this list).

    • It’s potentially lower quality than other quantization formats at the same bpw (bits per weight).
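
    For contrast, here is a minimal sketch of loading a pre-quantized GGUF checkpoint with llama-cpp-python; the file name and settings are only examples, not a recommendation for any specific quant:

    from llama_cpp import Llama

    # Pre-quantized weights: nothing is converted at load time, and the
    # quantization size (Q4_K_M, Q5_K_M, ...) is chosen by picking a file.
    llm = Llama(
        model_path="llama-2-13b-chat.Q4_K_M.gguf",  # example pre-quantized file
        n_gpu_layers=35,  # offload this many layers to the GPU, keep the rest on CPU
        n_ctx=4096,
    )

    out = llm("Q: Why are pre-quantized models popular? A:", max_tokens=64)
    print(out["choices"][0]["text"])

    The same idea applies to GPTQ, AWQ, and EXL2 checkpoints with their respective loaders.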