Hi everyone, I’d like to share something that I’ve been working on for the past few days: https://huggingface.co/nsfwthrowitaway69/Venus-120b-v1.0
This model is the result of interleaving layers from three different models: Euryale-1.3-L2-70B, Nous-Hermes-Llama2-70b, and SynthIA-70B-v1.5, resulting in a model that is larger than any of the three used for the merge. I have branches on the repo for exl2 quants at 3.0 and 4.85 bpw, which will allow the model to run in 48GB or 80GB of VRAM, respectively.
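For anyone curious what "interleaving layers" means in practice, here's a minimal sketch of how a passthrough frankenmerge stacks layer slices from several donor models. The slice ranges below are made up for illustration; they are not the actual Venus-120b recipe.

```python
# Hypothetical sketch of a passthrough "frankenmerge": stack overlapping
# layer slices from donor models into one deeper model. Slice ranges here
# are illustrative only, not the real Venus-120b recipe.

def build_layer_plan(slices):
    """Each slice is (model_name, start_layer, end_layer_exclusive).
    Returns a flat list of (model, layer_idx) pairs in stacking order."""
    plan = []
    for model, start, end in slices:
        plan.extend((model, i) for i in range(start, end))
    return plan

# Three 80-layer Llama-2-70B donors, interleaved in overlapping chunks:
slices = [
    ("Euryale-1.3-L2-70B",      0, 20),
    ("Nous-Hermes-Llama2-70b", 10, 30),
    ("SynthIA-70B-v1.5",       20, 40),
    # ...further overlapping chunks continuing up the stack...
]
plan = build_layer_plan(slices)
print(len(plan))  # 60 layers so far; a full 120b-class stack lands around 140
```

In a real merge the tool (e.g. mergekit) copies the actual weight tensors for each slice; this only shows how the layer ordering is assembled.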
I love using LLMs for RPs and ERPs and so my goal was to create something similar to Goliath, which is honestly the best roleplay model I’ve ever used. I’ve done some initial testing with it and so far the results seem encouraging. I’d love to get some feedback on this from the community! Going forward, my plan is to do more experiments with merging models together, possibly going even larger than 120b parameters to see where the gains stop.
I will set this to run overnight on Hellaswag 0-shot like I did here on Goliath when it was new: https://old.reddit.com/r/LocalLLaMA/comments/17rsmox/goliath120b_quants_and_future_plans/k8mjanh/
Thanks for the model! I started investigating some approaches to combine models and see if the result can be better than its individual parts. Just today I finished code to use a genetic algorithm to pick out parts and frankenstein 7B models together (trying to prove that there is merit to this approach using smaller models…but we’ll see).
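In case it helps anyone, the rough shape of that GA is something like this. The fitness function is a stand-in (in practice it would build the merged model and score it on a benchmark like Hellaswag), and the donor names are placeholders:

```python
# Minimal sketch of a genetic algorithm over layer-merge recipes.
# fitness() is a placeholder -- real code would assemble the merged
# model and score it on an actual benchmark.
import random

N_LAYERS = 32                     # layers in a 7B Llama-style model
MODELS = ["model_a", "model_b"]   # hypothetical donor names

def random_genome():
    # one donor choice per layer position
    return [random.choice(MODELS) for _ in range(N_LAYERS)]

def fitness(genome):
    # stand-in score: rewards using both donors; replace with a benchmark
    return len(set(genome))

def crossover(a, b):
    cut = random.randrange(1, N_LAYERS)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.05):
    return [random.choice(MODELS) if random.random() < rate else g
            for g in genome]

def evolve(pop_size=20, generations=10):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

The expensive part is obviously the fitness evaluation, since every candidate recipe means materializing and benchmarking a merged model.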
I’ll report back on the Hellaswag results on this model.
Thanks! I’m eager to see the results :)
Any tips/attempts on frankensteining 2 yi-34b models together to make a ~51B model?
We need 2 or 3 yi stacked together and then face them off vs 70b.
Exactly what I was thinking. I just fail miserably each time I merge the layers.
possibly going even larger than 120b parameters
I didn’t know that was possible, have people made a 1T model yet?
Sadly doesn’t work on 48gb like the other 120b. It can only fit sub 2048 context otherwise it goes OOM.
Crap, what’s your setup? I tested it with a single 48GB card but if you’re using 2x 24 then it might not work. I’ll have to make a 2.8 bpw quant (or get someone else to do it) so that it’ll work with card splitting.
I have 2x3090 for exl2. I have tess and goliath and both fit with ~3400 context so somehow your quant is slightly bigger.
Venus-120b is actually a bit bigger than Goliath-120b. Venus has 140 layers while Goliath has 136 layers, so that would explain it.
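A quick back-of-envelope check shows why those extra layers matter at 48GB: exl2 weights take roughly params × bpw / 8 bytes before KV cache and overhead. The parameter counts below are approximations, not exact figures for either model:

```python
# Rough VRAM estimate for exl2 quants: weights alone take about
# params * bpw / 8 bytes, before KV cache and runtime overhead.
def weight_gb(params_b, bpw):
    return params_b * 1e9 * bpw / 8 / 1e9  # decimal GB

# Approximate parameter counts (136 vs 140 layers of ~70B-class blocks):
goliath = weight_gb(118, 3.0)  # ~44 GB of weights
venus   = weight_gb(122, 3.0)  # ~46 GB -- a couple GB less cache headroom
```

With both sitting that close to 48GB, a few extra layers is enough to shrink the context you can fit before OOM.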
Makes sense… it’s doing pretty well. Like the replies. Set the limit to 3400 in tabby, no oom yet but using 98%/98%. I assume this means I can bump up the other models past 3400 too if I’m using tabby and autosplit.
That’s great work!
Just a question… Has anyone tried to fine-tune one of these “Frankenstein” models? Some time ago (when the first “Frankenstein” came out, it was a ~20B model) I read here on reddit that lots of users agreed that a fine-tune on those merged models would have “better” results, since it would help to “smooth” and adapt the merged layers. Probably I lack the technical knowledge needed to understand, so I’m asking…
Tess-XL-1.0… so far I didn’t like the results.
Is that a LORA or a full fine tune?
Hell yea! No Xwin. I hate that model. I’m down for the 3 bit. I didn’t like Tess-XL so far, so hopefully you made a David here.
I used this dataset for the quants: https://huggingface.co/datasets/jasonkstevens/pippa-llama2-chat/tree/refs%2Fconvert%2Fparquet/default/train
I still have this feeling in my gut that closedai have been doing this for a while. It seems like a free lunch.
I don’t think so, this is something you do when you’re GPU poor, closedai would just not undertrain their models in the first place.