Hi. I’m using Llama-2 for my project in Python with the transformers library. There is an option to enable quantization on any regular model:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)
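From what I can tell in the docs, that flag is basically shorthand for passing an explicit BitsAndBytesConfig — something along these lines, if I understand it correctly (the quant type and compute dtype below are just my guesses at reasonable settings, not anything I copied from an official example):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit 4-bit quantization config (assumed roughly equivalent to load_in_4bit=True)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
)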
If it’s just a matter of a single flag, and nothing is recomputed, then why are there so many already-quantized models on the Hub? Are they better than just adding this one line?
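For context, by "already quantized" I mean checkpoints like the GPTQ ones, which I've been loading roughly like this (the repo name is just an example I picked, and I believe it needs the extra GPTQ dependencies installed):

from transformers import AutoModelForCausalLM

# Example of loading a pre-quantized GPTQ checkpoint from the Hub
# (repo name used purely for illustration)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device_map="auto",
)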