Hi. I’m using Llama-2 for my project in Python with the transformers library. There is an option to enable quantization on any normal model:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,  # quantize the weights to 4-bit on the fly at load time (requires bitsandbytes)
)

If it’s just a matter of a single flag, and nothing is recomputed, why are there so many already-quantized models on the Hub? Are they better than adding this one line?
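For context, my understanding is that the flag is shorthand for passing a quantization_config; the sketch below spells it out with BitsAndBytesConfig, using nf4 settings I’ve seen in examples. Either way, the full-precision weights are downloaded and quantized on the fly at load time:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit form of the same on-the-fly quantization (settings are illustrative)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.float16, # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)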

  • mcmoose1900B · 10 months ago

    Many reasons:

    • AutoModelForCausalLM with on-the-fly bitsandbytes quantization is extremely slow compared to other backends/quantization formats, even with augmentations like BetterTransformer.

    • It also uses much more VRAM than other quantization formats, especially at long context.

    • Its size is inflexible: you only get 4-bit or 8-bit, with no in-between bits-per-weight options.

    • It loads more slowly, since the full-precision checkpoint has to be downloaded and quantized every time.

    • There is no CPU offloading, unlike some other backends (see the sketch after this list).

    • It’s potentially lower quality than other quantization formats at the same bpw (bits per weight).
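
    For contrast, here is a minimal sketch of loading a pre-quantized GGUF checkpoint with llama-cpp-python; the file name and settings are only examples, not a recommendation for any specific quant:

    from llama_cpp import Llama

    # Pre-quantized weights: nothing is converted at load time, and the
    # quantization size (Q4_K_M, Q5_K_M, ...) is chosen by picking a file.
    llm = Llama(
        model_path="llama-2-13b-chat.Q4_K_M.gguf",  # example pre-quantized file
        n_gpu_layers=35,  # offload this many layers to the GPU, keep the rest on CPU
        n_ctx=4096,
    )

    out = llm("Q: Why are pre-quantized models popular? A:", max_tokens=64)
    print(out["choices"][0]["text"])

    The same idea applies to GPTQ, AWQ, and EXL2 checkpoints with their respective loaders.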