I am talking about this particular model:
https://huggingface.co/TheBloke/goliath-120b-GGUF
I specifically use: goliath-120b.Q4_K_M.gguf
I can run it on runpod.io on an A100 instance at a "tolerable" speed, but it is still way too slow for writing long-form text.
These are my settings in text-generation-webui:
Any advice? Thanks
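(In case the settings above don't come through clearly: the llama.cpp loader in the webui boils down to roughly the call below via llama-cpp-python. The values are illustrative, not necessarily my exact settings.)

```python
# Rough equivalent of the webui's llama.cpp loader; values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="goliath-120b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the A100
    n_ctx=4096,        # context length (assumed value)
    n_batch=512,       # prompt-processing batch size (assumed value)
)

out = llm(
    "Write the opening paragraph of a story about a lighthouse keeper.",
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["text"])
```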
Why don’t you use exl2? Assuming it’s the A100 80GB, you can run up to 5 bpw, I think.
I have done quants at 3, 4.5, and 4.85 bpw:
https://huggingface.co/Panchovix/goliath-120b-exl2
https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal
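If you'd rather script it than go through the webui, loading one of these with exllamav2 directly looks roughly like this. It's a sketch from memory of exllamav2's bundled examples, and the model path is a placeholder, so check against the repo's example code for your version:

```python
# Rough sketch based on exllamav2's examples; API details may differ by version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/workspace/goliath-120b-exl2"  # placeholder path to the exl2 quant
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # lazy cache so autosplit can size it
model.load_autosplit(cache)                # splits layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, num_tokens=200))
```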
I have 2x 4090 + 1x 3090; I get 2 t/s on GGUF (all layers on GPU) vs. 10 t/s on exllamav2.
Wait, what? I am getting 2-3 t/s on 3x P40 running Goliath GGUF Q4_K_S.
Thanks, will try this. I have no idea how these really work, which is why I am asking :)
Sorry for a little side-track, but how much context are you able to squeeze into your three GPUs with Goliath’s 4-bit quant?
I’m considering adding another 3090 to my own double-GPU setup just to run this model.
I tested 4K context and it worked fine at 4.5 bpw. The max will probably be about 6K. I didn’t use the 8-bit cache.
That said, 4.5 bpw is kind of overkill; ~4.12 bpw is about equivalent to 4-bit 128g GPTQ, and that would let you use a lot more context.
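The back-of-the-envelope math for why the lower bpw frees up context, assuming roughly 118B parameters for Goliath and the 72 GB total across 2x 4090 + 1x 3090 (both rough assumptions, and some VRAM goes to overhead):

```python
# Back-of-the-envelope VRAM math; parameter count and usable VRAM are rough assumptions.
params = 118e9          # approximate parameter count of Goliath-120B
total_vram_gb = 72      # 2x 24 GB (4090) + 1x 24 GB (3090), before overhead

for bpw in (4.85, 4.5, 4.12):
    weights_gb = params * bpw / 8 / 1e9        # bits -> bytes -> GB
    leftover_gb = total_vram_gb - weights_gb   # room left for KV cache + activations
    print(f"{bpw} bpw: weights ~= {weights_gb:.1f} GB, leftover ~= {leftover_gb:.1f} GB")
```

Roughly: ~71 GB of weights at 4.85 bpw, ~66 GB at 4.5 bpw, and ~61 GB at 4.12 bpw, which is where the extra headroom for context comes from.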
That is awesome. What kind of platform do you use for that 3-GPU setup?
For now, basically an X670E mobo + a Ryzen 7 7800X3D.
But if you want full speed, a server mobo with a ton of PCIe lanes will work better.