- 2 Posts
- 16 Comments
panchovixBto
LocalLLaMA@poweruser.forum•🐺🐦⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5English
1·2 years agoI’ve posted the calibration dataset (as a link) and the measurement file on the goliath rpcal quant page, in case you want to make another quant at a different size.
panchovixBto
LocalLLaMA@poweruser.forum•🐺🐦⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5English
1·2 years agoHi there, nice work with Venus. For your next version and its exl2 quants, you may want to use the calibration dataset from https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal
(it is linked in the description).
I checked the one you used first, and it is the same data but without any fixes or formatting (so it has weird symbols, etc.).
panchovixBto
LocalLLaMA@poweruser.forum•🐺🐦⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5English
1·2 years agoVenus has 139 layers instead of Goliath’s 137, so it weighs a bit more.
panchovixBto
LocalLLaMA@poweruser.forum•🐺🐦⬛ **Big** LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5English
1·2 years agoGreat post, glad you enjoyed both of my Goliath quants :)
panchovixBto
LocalLLaMA@poweruser.forum•Venus-120b: A merge of three different models in the style of Goliath-120bEnglish
1·2 years agoModels on ooba without “exl” in the folder name are loaded with transformers by default, so that may be why he got that loader.
panchovixBtoHardware@hardware.watch•Special Chinese Factories are Dismantling NVIDIA GeForce RTX 4090 Graphics Cards and Turning Them into AI-Friendly GPU ShapeEnglish
1·2 years agoIn theory, among 24GB VRAM cards, only the 3090 can be upgraded this way, since it has 24x1GB GDDR6X chips; if you swap them for 2GB chips, you would have 48GB of GDDR6X.
The 3090 Ti and 4090 have 12x2GB GDDR6X chips on the front of the PCB.
Not sure if 3-4GB GDDR6X chips exist (I think not).
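The chip arithmetic above can be checked with a few lines of Python. This is only a sketch of the multiplication in the comment; the function name `total_vram_gb` is made up for illustration, and the chip counts are the ones quoted above.

```python
def total_vram_gb(chip_count: int, gb_per_chip: int) -> int:
    """Total VRAM in GB for a card with `chip_count` memory chips of equal size."""
    return chip_count * gb_per_chip


# Chip layouts as quoted in the comment above.
rtx_3090_stock = total_vram_gb(24, 1)    # 24 chips of 1GB
rtx_3090_modded = total_vram_gb(24, 2)   # same 24 pads, 2GB chips
rtx_4090_stock = total_vram_gb(12, 2)    # 12 chips of 2GB (same layout as the 3090 Ti)

print(rtx_3090_stock, rtx_3090_modded, rtx_4090_stock)
```

The 3090 Ti and 4090 already use 2GB chips, so they would only gain capacity if larger GDDR6X chips existed.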
panchovixBto
LocalLLaMA@poweruser.forum•Discrepancy between TheBloke_Orca-2-13B-GPTQ and the original one with the tested logic questionEnglish
1·2 years agoTry FP16, or 8-bit at the lowest; a 13B model probably suffers too much at 4 bits.
It will work, but you will be limited to 1070 speeds if using all 3 GPUs.
panchovixOPBto
LocalLLaMA@poweruser.forum•TabbyAPI released! A pure LLM API for exllama v2.English
1·3 years agoThanks to the hard work of kingbri, Splice86 and turboderp, we have a new API loader for LLMs using the exllamav2 loader! It is in a very early alpha state, so if you want to test it, expect things to change.
TabbyAPI also works with SillyTavern, though it needs some special configuration.
As a reminder, exllamav2 added mirostat, tfs and min-p recently, so if you used the exllama_hf/exllamav2_hf loaders on ooba for those samplers, the HF loaders are no longer needed.
Enjoy!
panchovixBto
LocalLLaMA@poweruser.forum•🐺🐦⬛ LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)English
1·3 years agoThe main reason I use exl2 is speed: on 2x4090 I get 15-20 t/s at 70b depending on the size, but with GGUF I get 4-5 t/s at most.
When using 3 gpus (2x4090+1x3090), it is 11-12 t/s at 6.55bpw vs GGUF Q6_K that runs at 2-3 t/s.
Though I agree with you, for model comparisons and such you need to have deterministic results and also the best quality.
If you get the chance, try 70b at 6bpw or more; IMO it is pretty consistent and doesn’t have the issues that 5bpw/5-bit has.
The performance hit is too much on multigpu systems when using GGUF. I guess if in the future the speed gets to the same level, I would use it most of the time.
panchovixBto
LocalLLaMA@poweruser.forum•Where and how to run Goliath 120b GGUF with good performance?English
1·3 years agoBasically for now an X670E mobo + Ryzen 7 7800X3D.
But if you want the full speed, a server mobo with a ton of PCI-E lanes will work better.
panchovixBto
LocalLLaMA@poweruser.forum•DreamGen Opus 70B — Uncensored model for story telling and chat / roleplayEnglish
1·3 years agoGreat work!
Will upload some exl2 quants in about 4-5 hours here: https://huggingface.co/Panchovix/opus-v0-70b-exl2 (thinking for now about 2.5, 4.65 and 6bpw; I use the last one).
Also, I uploaded a safetensors conversion here, if you don’t mind: https://huggingface.co/Panchovix/opus-v0-70b-safetensors
If you don’t want the safetensors up, I can remove it.
panchovixBto
LocalLLaMA@poweruser.forum•Where and how to run Goliath 120b GGUF with good performance?English
1·3 years agoI tested 4K and it worked fine at 4.5bpw. The max will probably be about 6K. I didn’t use the 8-bit cache.
That said, 4.5bpw is kind of overkill; ~4.12bpw is comparable to 4-bit 128g GPTQ, and it would let you use a lot more context.
panchovixBto
LocalLLaMA@poweruser.forum•Where and how to run Goliath 120b GGUF with good performance?English
1·3 years agoWhy don’t you use exl2? Assuming it’s the A100 80GB, you can run up to 5bpw, I think.
I have done quants at 3, 4.5 and 4.85bpw:
https://huggingface.co/Panchovix/goliath-120b-exl2
https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal
I have 2x4090+1x3090, I get 2 t/s on GGUF (all layers on GPU) vs 10 t/s on exllamav2.
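As a rough sanity check on which bpw fits in a given amount of VRAM, the weight size can be estimated as parameters × bits per weight ÷ 8. The sketch below is a back-of-the-envelope estimate only: the parameter count of about 118B is an assumption (it is not stated in the thread), the helper name `exl2_weight_gb` is hypothetical, and the result ignores the KV cache, activations and loader overhead, which all need extra memory on top.

```python
def exl2_weight_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 10^9 bytes).

    Weights only: context (KV cache) and runtime overhead are not included.
    """
    return n_params * bpw / 8 / 1e9


# Assumption: a "120b" merge has roughly 118 billion parameters.
ASSUMED_PARAMS = 118e9

for bpw in (3.0, 4.5, 4.85, 5.0):
    size = exl2_weight_gb(ASSUMED_PARAMS, bpw)
    print(f"{bpw:>4} bpw -> ~{size:.1f} GB of weights")
```

Whatever is left after the weights is what remains for context, which is why a lower bpw leaves room for a longer context on the same card.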
You can run the 3bpw exl2; I did some quants here: https://huggingface.co/Panchovix/goliath-120b-exl2

You can use alpha scaling to get more context. Perplexity gets a bit worse as you increase the context. Use 1.75 alpha for 1.5x context and 2.5 alpha for 2x context, if I’m not wrong. You can try freely since you’re on the cloud.
I guess you’re trying the 4.85bpw one? A single 80GB GPU may do more context but not that much. Now, if it’s 2x48GB then you have more slack.
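To make the alpha figures above concrete, here is a small sketch. The lookup table holds only the two values quoted in the comment (given there with "if I’m not wrong"). The base-scaling formula is the commonly used NTK-aware form, `base * alpha ** (head_dim / (head_dim - 2))`; treating it as what the loader does is an assumption, as are the native context of 4096, the RoPE base of 10000, the head dimension of 128, and all function names.

```python
# Alpha values quoted in the comment above: context multiplier -> alpha.
ALPHA_FOR_CONTEXT_MULTIPLIER = {1.5: 1.75, 2.0: 2.5}

# Assumptions for illustration (typical Llama 2 style values).
NATIVE_CONTEXT = 4096
ROPE_BASE = 10000.0
HEAD_DIM = 128


def scaled_rope_base(alpha: float, base: float = ROPE_BASE, head_dim: int = HEAD_DIM) -> float:
    """NTK-aware scaling of the rotary embedding base (assumed formula)."""
    return base * alpha ** (head_dim / (head_dim - 2))


def extended_context(multiplier: float, native: int = NATIVE_CONTEXT) -> int:
    """Target context length for a given multiplier of the native context."""
    return int(native * multiplier)


for multiplier, alpha in ALPHA_FOR_CONTEXT_MULTIPLIER.items():
    print(
        f"{multiplier}x context ({extended_context(multiplier)} tokens): "
        f"alpha={alpha}, rope base ~{scaled_rope_base(alpha):.0f}"
    )
```

An alpha of 1.0 leaves the base unchanged, so the same code path covers the unscaled case.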