I’ve posted a link to the calibration dataset and the measurement file on the goliath rpcal quant page, in case you want to do another quant at different sizes.
Hi there, nice work with Venus. For your next version and its exl2 quants, you may want to use the calibration dataset from https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal (it’s linked in the description).
I checked the one you used first and it’s basically the same dataset, but without any fixes or formatting (so it has weird symbols, etc.).
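For anyone wanting to reproduce this, quantizing against that calibration file is just a matter of pointing exllamav2’s convert.py at the parquet. A rough sketch with placeholder paths; the flag names can differ between exllamav2 versions, so double-check `python convert.py -h`:

```python
# Rough sketch: call exllamav2's convert.py with a custom calibration parquet.
# All paths are placeholders and flag names may vary by exllamav2 version.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/Venus-120b",            # input FP16 model directory
    "-o", "/tmp/exl2-work",                # working / scratch directory
    "-cf", "/models/Venus-120b-4.85bpw",   # output directory for the finished quant
    "-c", "/data/rp_calibration.parquet",  # the calibration dataset from the link above
    "-b", "4.85",                          # target bits per weight
], check=True)
```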
Venus has 139 layers instead of Goliath’s 137, so it weighs a bit more.
Great post, glad you enjoyed both of my Goliath quants :)
On ooba, models without “exl” in the folder name fall back to the Transformers loader by default, so that may be why he got that loader.
In theory, among 24GB cards only the 3090 can be modded, since it has 24x1GB GDDR6X chips; swap them for 2GB chips and you’d have 48GB of GDDR6X.
The 3090 Ti and 4090 have 12x2GB GDDR6X chips on the front of the PCB.
Not sure if 3-4GB GDDR6X chips exist (I think not)
Try FP16, or 8-bit at most; a 13B model probably suffers too much at 4 bits.
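If you want a quick way to test at 8-bit, something like this with transformers + bitsandbytes should do; the model id is just a placeholder:

```python
# Sketch: load a 13B model in 8-bit with transformers + bitsandbytes.
# The model id is a placeholder; swap in whichever 13B you're testing.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread across available GPUs
    load_in_8bit=True,   # bitsandbytes 8-bit quantization
)
```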
It will work, but you’ll be limited to 1070 speeds if you use all 3 GPUs.
Thanks to the hard work of kingbri, Splice86 and turboderp, we have a new API server for LLMs built on the exllamav2 loader! It’s in a very alpha state, so if you want to test it, expect things to change.
TabbyAPI also works with SillyTavern! It needs some extra configuration, but it hooks up fine.
As a reminder, exllamav2 recently added mirostat, TFS and min-p, so if you were using exllama_hf/exllamav2_hf on ooba just for those samplers, those loaders aren’t needed anymore.
Enjoy!
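If you’d rather script against it than use a frontend, TabbyAPI speaks an OpenAI-style API; here’s a minimal sketch, assuming the default local port and that the key goes in an `x-api-key` header (check your TabbyAPI config for the actual port, path and key):

```python
# Sketch: hit TabbyAPI's OpenAI-compatible completions endpoint.
# Port, path and auth header name are assumptions; check your TabbyAPI config.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    headers={"x-api-key": "your-tabby-api-key"},  # placeholder key
    json={
        "prompt": "Once upon a time",
        "max_tokens": 200,
        "temperature": 0.8,
        "min_p": 0.05,   # min-p sampling, now supported natively by exllamav2
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```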
The main reason I use exl2 is speed: on 2x4090 I get 15-20 t/s on 70B depending on the quant size, while with GGUF I get 4-5 t/s at best.
With 3 GPUs (2x4090 + 1x3090) it’s 11-12 t/s at 6.55bpw, versus 2-3 t/s for GGUF Q6_K.
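For reference, splitting a quant across the cards with the exllamav2 Python API looks roughly like this; the model path and the per-GPU split values (in GB) are just examples you’d tune per card:

```python
# Sketch: load an exl2 quant across 2x4090 + 1x3090 with a manual VRAM split.
# Path and split values (GB per device) are placeholders; tune them per card.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2-6.55bpw"  # placeholder
config.prepare()

model = ExLlamaV2(config)
model.load(gpu_split=[22, 22, 22])  # leave headroom on each card for the cache

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
print(generator.generate_simple("Hello,", settings, 100))
```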
Though I agree with you, for model comparisons and such you need to have deterministic results and also the best quality.
If you can at some point, try 70B at 6bpw or more; IMO it’s pretty consistent and doesn’t have the issues that 5bpw/5-bit does.
The performance hit with GGUF on multi-GPU systems is just too big. If the speed ever reaches the same level, I’d probably use it most of the time.
Basically for now an X670E mobo + Ryzen 7 7800X3D.
But if you want full speed, a server mobo with a ton of PCI-E lanes will work better.
Great work!
Will upload some exl2 quants in about 4-5 hours here: https://huggingface.co/Panchovix/opus-v0-70b-exl2 (thinking 2.5, 4.65 and 6bpw for now; I use the latter).
Also, I uploaded a safetensors conversion here, if you don’t mind: https://huggingface.co/Panchovix/opus-v0-70b-safetensors
If you don’t want the safetensors up, I can remove it.
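For anyone who wants to redo the conversion themselves, it’s basically a load-and-resave with transformers (paths are placeholders):

```python
# Sketch: convert a .bin (pickle) checkpoint to safetensors by loading with
# transformers and re-saving with safe serialization. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "/models/opus-v0-70b"              # original .bin checkpoint
dst = "/models/opus-v0-70b-safetensors"  # output directory

model = AutoModelForCausalLM.from_pretrained(
    src, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.save_pretrained(dst, safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)
```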
I tested 4K context and it worked fine at 4.5bpw. The max will probably be about 6K. I didn’t use the 8-bit cache.
That said, 4.5bpw is kind of overkill; ~4.12bpw is about the same as 4-bit 128g GPTQ, and that would let you fit a lot more context.
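If you do want to push the context further, the 8-bit cache is the first thing I’d try; a minimal sketch with the exllamav2 API, assuming a version that has ExLlamaV2Cache_8bit (path and context length are placeholders):

```python
# Sketch: use exllamav2's 8-bit KV cache to fit more context in the same VRAM.
# Model path and max_seq_len are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2-4.5bpw"  # placeholder
config.prepare()
config.max_seq_len = 6144  # push a bit past 4K

model = ExLlamaV2(config)
model.load()                        # single big GPU; pass gpu_split=[...] for multi-GPU
cache = ExLlamaV2Cache_8bit(model)  # 8-bit keys/values, roughly halves cache VRAM
```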
Why don’t you use exl2? Assuming it’s the A100 80GB, you can run up to 5bpw, I think.
I have done quants at 3, 4.5 and 4.85bpw.
https://huggingface.co/Panchovix/goliath-120b-exl2
https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal
I have 2x4090 + 1x3090; I get 2 t/s with GGUF (all layers on GPU) vs 10 t/s with exllamav2.
You can run the 3bpw exl2; I did some quants here: https://huggingface.co/Panchovix/goliath-120b-exl2
You can use alpha scaling to get more context; you lose a bit of quality (perplexity goes up) as you increase the context. If I’m not mistaken, it’s alpha 1.75 for 1.5x context and alpha 2.5 for 2x context. You can experiment freely since you’re on the cloud.
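Concretely, with exllamav2 that’s just two config fields; a sketch using the rule of thumb above (field names assume exllamav2’s ExLlamaV2Config, values are examples):

```python
# Sketch: extend context with NTK alpha (RoPE) scaling in exllamav2.
# Roughly: alpha 2.5 for about 2x the native context; expect slightly higher perplexity.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2-3bpw"  # placeholder
config.prepare()

config.scale_alpha_value = 2.5               # NTK alpha factor
config.max_seq_len = config.max_seq_len * 2  # ~2x the model's native context

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
```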
I guess you’re trying the 4.85bpw one? A single 80GB GPU may do more context but not that much. Now, if it’s 2x48GB then you have more slack.