Quantizing 70B models to 4-bit: how much does performance degrade?

ae_dataviz · 3 years ago

harrro · 3 years ago

Using Q3, you can fit it in 36GB. I have a weird combo of an RTX 3060 (12GB) and a P40 (24GB), and I can run a 70B at 3-bit fully on GPU.
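
For a rough sense of why that fits, here is a back-of-the-envelope sketch of the weight size at different bit widths. The bits-per-weight values are assumptions for illustration; real Q3 and Q4 formats typically average somewhat more than their nominal bit width, and the KV cache and context buffers need extra room on top of the weights.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Assumed average bits per weight (illustrative, not exact format sizes).
for bpw in (3.0, 3.5, 4.0, 4.5):
    print(f"70B at {bpw} bpw: about {weight_gib(70e9, bpw):.1f} GiB")
```

By this estimate the weights stay under 36GB up to roughly 4 bits per weight, but the headroom left for context shrinks quickly as the bit width goes up.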

Dry-Vermicelli-682 · 3 years ago

So you have 2 GPUs on a single motherboard, and llama.cpp knows to use both? Does this work with AMD GPUs too?

harrro · 3 years ago

Yes, llama.cpp will automatically split the model across GPUs. You can also specify what proportion of the model should go on each GPU.

Not sure about AMD support, but for NVIDIA it's pretty easy to do.
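
If you want to set the split by hand, one simple approach is to make it proportional to each card's VRAM. This is a minimal sketch, assuming your llama.cpp build has the `--tensor-split` option taking a comma-separated list of proportions (check `--help` for your version). The `tensor_split` helper is made up for illustration, and the order of the values is assumed to match the GPU device order.

```python
def tensor_split(vram_gb: list[float]) -> str:
    """Build a comma-separated proportion string from per-GPU VRAM sizes."""
    total = sum(vram_gb)
    return ",".join(f"{v / total:.2f}" for v in vram_gb)

# RTX 3060 (12GB) and P40 (24GB), in assumed device order.
print(tensor_split([12, 24]))  # prints 0.33,0.67
```

You may want to give a bit less to whichever GPU also drives your display, since that one already has some VRAM in use.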