🐺🐦‍⬛ Big LLM Comparison/Test: 3x 120B, 12x 70B, 2x 34B, GPT-4/3.5

WolframRavenwolf · 1 year ago

nero10578 · 1 year ago

What kind of token/s do you get with 2x3090 for the 70B models?

WolframRavenwolf · 1 year ago

koboldcpp-1.50\koboldcpp.exe --contextsize 4096 --debugmode --foreground --gpulayers 99 --highpriority --usecublas mmq --model TheBloke_lzlv_70B-GGUF/lzlv_70b_fp16_hf.Q4_K_M.gguf

ContextLimit: 3815/4096, Processing:25.07s (7.1ms/T), Generation:43.74s (145.8ms/T), Total:68.80s (4.36T/s)
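For anyone comparing numbers: the 4.36 T/s in that line is generated tokens divided by total time, prompt processing included. Here is a rough sketch of how the figures relate. The regex and the `summarize` helper are my own assumptions about the line format shown above, not part of koboldcpp.

```python
import re

# Timing line copied from the koboldcpp output above.
LOG_LINE = (
    "ContextLimit: 3815/4096, Processing:25.07s (7.1ms/T), "
    "Generation:43.74s (145.8ms/T), Total:68.80s (4.36T/s)"
)

# Hypothetical pattern: assumes the exact layout of the line above.
PATTERN = re.compile(
    r"Processing:(?P<proc_s>[\d.]+)s \((?P<proc_ms>[\d.]+)ms/T\), "
    r"Generation:(?P<gen_s>[\d.]+)s \((?P<gen_ms>[\d.]+)ms/T\), "
    r"Total:(?P<total_s>[\d.]+)s"
)


def summarize(line):
    """Recover token counts and throughput from one timing line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized timing line")
    v = {key: float(val) for key, val in match.groupdict().items()}
    # seconds / (milliseconds per token) gives the token count for each phase
    prompt_tokens = round(v["proc_s"] * 1000 / v["proc_ms"])
    generated_tokens = round(v["gen_s"] * 1000 / v["gen_ms"])
    return {
        "prompt_tokens": prompt_tokens,
        "generated_tokens": generated_tokens,
        # speed while actually generating, ignoring prompt processing
        "generation_only_tps": generated_tokens / v["gen_s"],
        # speed over the whole request, which is what the log reports
        "overall_tps": generated_tokens / v["total_s"],
    }


stats = summarize(LOG_LINE)
print(stats)
```

So this run generated about 300 tokens. Generation alone works out to roughly 6.9 T/s, and the reported 4.36 T/s is lower because it also counts the 25 seconds spent processing a nearly full context.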

nero10578 · 1 year ago

Huh, it’s not really faster than Tesla P40s then, for some reason.

WolframRavenwolf · 1 year ago

Yeah, GGUF is rather slow for me, that’s why I’ve begun to use ExLlamav2_HF which lets me run even 120B models at 3-bit with nice quality at around 20 T/s.

Monkey_1505 · 1 year ago

I dislike Frankenstein models. The 20b, the 120b, they are all the same: major confusion, can’t follow logic or instructions properly. Great prose, but pretty useless for that reason.

Someone would have to invest some major training on one of them before it’d be any good.

SomeOddCodeGuy · 1 year ago

The results for the 120b continue to absolutely floor me. Not only is it performing that well at 3bpw, but it’s an exl2 as well, which your own tests have shown performs worse than gguf. So imagine what a q4 gguf does if a q3 equivalent exl2 can do this.

WolframRavenwolf · 1 year ago

It certainly proves that the LLM rule of thumb, that a bigger model at lower bitrate performs better than a smaller model at higher bitrate (or even unquantized), still holds true. At least in the situations I tested.

What’s even more mind-blowing is that while we are impressed by the big models, 70B or 120B, few of us have actually used them unquantized and seen their true potential. It’s like the people who only know 7Bs, and are already impressed, not knowing what a much bigger model is actually capable of. I guess we’re in the same boat, as even 48 GB VRAM are hardly enough. Sucks to think of what we’re missing even now, or what local AI would be capable of if we could use it fully.

Distinct-Target7503 · 1 year ago

That’s great work!

Just a question… Has anyone tried to fine-tune one of those “Frankenstein” models? Even on a small dataset…

Some time ago (when one of the first experimental “Frankenstein” models came out, a ~20B model) I read here on reddit that lots of users agreed that a fine-tune on those merged models would have “better” results, since it would help to “smooth” and adapt the merged layers. I probably lack the technical knowledge needed to understand, so I’m asking…

panchovix · 1 year ago

Great post, glad you enjoyed both of my Goliath quants :)

WolframRavenwolf · 1 year ago

Thanks for making them! :) Keep up the great work!

Evening_Ad6637 · 1 year ago

O.M.G. What an incredibly huge work! Wtf?! I am speechless.

You are the most angel-like wolf I know so far and you really, really deserve a prize, dude!

Again: WTH?!

nsfw_throwitaway69 · 1 year ago

Hi, I’m the creator of Venus-120b.

Venus has SynthIA 1.5 mixed in with it, which as you noted performs pretty badly on RP. I’m currently working on a trimmed down version of Venus that has 100b parameters and I’m using SynthIA 1.2b for that, which I believe scored much better in your last RP tests. I’ll probably also make a 1.1 version of Venus-120b that uses SynthIA 1.2b as well to see if that helps fix some of the issues with it.

Monkey_1505 · 1 year ago

IMO don’t bother with Frankenstein models unless you plan to seriously train them with a broad dataset. They just tend towards getting confused, not following instructions etc. You’d probably need to run an orca dataset at it, and then some RP on top.

nsfw_throwitaway69 · 1 year ago

I don’t think this is true. Goliath wasn’t fine-tuned or trained at all and it outperforms every 70b I’ve ever used.

Distinct-Target7503 · 1 year ago

Still really curious about a full fine tune on one of those Frankenstein models… What are the vram requirements?

Monkey_1505 · 1 year ago

I think that’s where the real performance will be. Not sure about vram, but probably would make sense to start with mistral 11b, or llama-2 20b splices. Proof of concept.

panchovix · 1 year ago

Hi there, nice work there with Venus. For your next version and exl2 quants, you may want to use the calibration dataset from this https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal

(On the description)

I checked the one that you used first and it’s basically the same, but without any fixes or formatting (so it has weird symbols etc.)

WolframRavenwolf · 1 year ago

Hey, thanks for chiming in, and I’m happy to hear that feedback and glad my review didn’t discourage you. I firmly believe you’re doing a great thing there and wish you all the best for these experiments. Looking forward to your upcoming models!

BalorNG · 1 year ago

Did you do post-merge training and how much?

nsfw_throwitaway69 · 1 year ago

None at all, it’s just a merge. I’m not even really sure where to begin training it lol.

BalorNG · 1 year ago

That explains why Goliath worked and yours - not so much, I guess…

nsfw_throwitaway69 · 1 year ago

Goliath wasn’t fine-tuned at all, it’s just a merge.

bullerwins · 1 year ago

Hi! I have a similar setup: 5950X, 64 GB RAM and 2x3090s. How did you manage to load an exl2 120B model?

WolframRavenwolf · 1 year ago

oobabooga’s text-generation-webui, ExLlamav2_HF loader, gpu-split 22,22, 4K max seq length, 8-bit cache.
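As a back-of-the-envelope check on why that fits, here is a minimal sketch. It assumes a nominal 120 billion parameters, treats the gpu-split values as gigabytes per card, and counts weights only (no cache, activations or loader overhead), so take it as a rough estimate, not a measurement.

```python
def weight_size_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: nominal 120B parameters at 3.0 bits per weight.
needed = weight_size_gib(120, 3.0)

# Assumption: gpu-split 22,22 means roughly 22 GB allotted on each 3090.
budget = 22 + 22

print(f"weights: {needed:.1f} GiB, split budget: {budget} GB")
print(f"left for cache and overhead: {budget - needed:.1f}")
```

That leaves only about 2 GB of slack, which is consistent with running at 4K max seq length and with the 8-bit cache to keep the context small.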

alchemist1e9 · 1 year ago

Wow! This post is inspiring. The attention to detail is amazing. You are a true hero for everyone studying this topic. Thank you.

CardAnarchist · 1 year ago

Thanks as always for the detailed tests!

I recently learned that Goliath makes spelling errors and I see you noticed it too.

I was wondering if you noticed spelling errors when you tested some other smaller frankenmerges or if you think it’s not to do with frankenmerges but a low quant issue?

Also, I wrote a sort of guide / sharing of my settings for some people that asked. One thing you may be interested in is the Misted 7B model results I posted at the bottom of that post.

It’s the best 7B model amongst the ones I tested in its ability to respond to my “quality jailbreak” whilst producing interesting, non-dry dialogue. If you get around to testing 7Bs again, I can highly recommend it!

Link to model

WolframRavenwolf · 1 year ago

Already saw and read your post, saved it, and added Misted-7B to the top of my 7B TODO list. :)

I’m not sure about what causes the misspellings, probably both low quant and the frankenmerging combined.

I do see misspellings and grammar mistakes when using the English models in German, even the biggest ones, but it’s worse with smaller models. They understand full well what is said but can’t write it as perfectly as English. And that’s apparently at any quant. Probably because there’s less quality German in the training data compared to English, and the fewer parameters a model has, the less its (language) understanding and knowledge, so it makes more mistakes.

Models tested:

Testing methodology

1st test series: 4 German data protection trainings

2nd test series: Chat & Roleplay