Hi. I’m using Llama-2 for my project in Python with the transformers library. There is an option to apply quantization when loading any regular model:

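from transformers import AutoModelForCausalLM

# Quantize the weights to 4-bit while loading (handled by bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)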

If it’s just a matter of a single flag, and nothing is recomputed, then why are there so many already-quantized models on the Hub? Are they better than adding this one line?

  • vasileerB
    10 months ago

    File size, which affects load time:

    With load_in_4bit, it downloads and parses the full-precision file (which is 4x bigger than the 4-bit version if it is bfloat16, or 8x bigger if it is float32) and then quantizes on the fly.

    With pre-quantized files, it downloads only the quantized weights, so expect roughly a 4x to 8x faster load time for 4-bit quants.
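To put rough numbers on the file-size point above, here is a back-of-the-envelope sketch. It assumes about 13 billion parameters (taken from the model name) and counts only the raw weights, so it ignores the small overhead that real quantized formats add for scales and metadata. The helper name `approx_weight_size_gb` is made up for illustration.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: ~13 billion parameters, as suggested by "Llama-2-13b".
NUM_PARAMS = 13e9

# Bits per weight for each storage format discussed in the thread.
FORMATS = [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]

sizes = {name: approx_weight_size_gb(NUM_PARAMS, bits) for name, bits in FORMATS}

for name, gb in sizes.items():
    ratio = gb / sizes["4-bit"]
    print(f"{name:>8}: {gb:5.1f} GB ({ratio:.0f}x the 4-bit size)")
```

Under these assumptions the weights come to about 52 GB in float32, 26 GB in bfloat16, and 6.5 GB at 4 bits, which is where the 4x and 8x figures come from.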

  • mcmoose1900B
    10 months ago

    Many reasons why on-the-fly quantization through AutoModelForCausalLM falls short:

    • AutoModelForCausalLM is extremely slow compared to other backends/quantizations, even with augmentations like BetterTransformer.

    • It also uses much more VRAM than other quantization methods, especially at high context lengths.

    • Its size is inflexible.

    • It loads more slowly.

    • It has no CPU offloading.

    • It’s potentially lower quality than other quantization methods at the same bits per weight (bpw).