- 3 Posts
- 7 Comments
BayesMind (OP) to LocalLLaMA@poweruser.forum • Dear Model Mergers, Have You Solved Merger of Different Model Families? • 1 point · 2 years ago
Re-basin! I was trying to recall this, thank you. Can it mix model families, do you know? I thought it was just for identical architectures.
BayesMind (OP) to LocalLLaMA@poweruser.forum • Dear Model Mergers, Have You Solved Merger of Different Model Families? • 1 point · 2 years ago
Not for the kind of merging I've seen. But I remember a paper back in the day suggesting you could find high-dimensional axes within different models, and if you rotated the weights so those axes align, you could merge the models and keep knowledge from both seeds. That included models trained from different initializations.
I think the only reason this franken-merging works is that people are mostly just merging finetunes of the same base, so these high-d vectors are already aligned well enough that the merges work.
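For intuition, here's a rough Python/NumPy sketch of that alignment idea on a single layer (toy matrices standing in for real checkpoints, and only one layer's permutation shown; a full re-basin also has to carry each permutation into the next layer's input weights):

```python
# Minimal "re-basin"-style sketch: match hidden units of one layer across two
# same-architecture models with an optimal permutation, then interpolate.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(W_a, W_b, alpha=0.5):
    """W_a, W_b: (hidden, in) weight matrices of the same layer in two models."""
    cost = -W_a @ W_b.T                    # negative similarity of every unit pair
    _, cols = linear_sum_assignment(cost)  # optimal unit-to-unit matching
    W_b_aligned = W_b[cols]                # permute B's units to line up with A's
    return alpha * W_a + (1 - alpha) * W_b_aligned

# Toy example: B is a permuted, slightly noised copy of A.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 4))
W_b = W_a[rng.permutation(8)] + 0.01 * rng.normal(size=(8, 4))
print(align_and_merge(W_a, W_b).shape)  # (8, 4)
```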
BayesMind (OP) to LocalLLaMA@poweruser.forum • Dear Model Mergers, Have You Solved Merger of Different Model Families? • 1 point · 2 years ago
Reading the README, it sounds like they're running some attention heads that were either already the same dimension across both models, or they added a linear projection layer to bridge them. They then trained on 10M tokens to "settle in the transplant", which doesn't sound like enough to me, and they concur the model isn't useful until further training.
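If I'm reading that right, the projection bridge would look something like this PyTorch sketch; every name and size here is made up for illustration, not taken from their repo. The down/up projections are presumably most of what the short "settling in" run has to fit.

```python
# Hypothetical sketch: graft a donor block with hidden size d_donor into a host
# model with hidden size d_host by wrapping it in learned linear projections.
import torch
import torch.nn as nn

class TransplantAdapter(nn.Module):
    def __init__(self, donor_block: nn.Module, d_host: int, d_donor: int):
        super().__init__()
        self.down = nn.Linear(d_host, d_donor)   # host width -> donor width
        self.block = donor_block                 # e.g. a transplanted attention block
        self.up = nn.Linear(d_donor, d_host)     # donor width -> host width

    def forward(self, x):
        return x + self.up(self.block(self.down(x)))  # residual around the graft

# Toy usage with a stand-in donor block.
donor = nn.Sequential(nn.Linear(512, 512), nn.GELU())
adapter = TransplantAdapter(donor, d_host=768, d_donor=512)
print(adapter(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```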
BayesMind (OP) to LocalLLaMA@poweruser.forum • Dear Model Mergers, Have You Solved Merger of Different Model Families? • 1 point · 2 years ago
> This doesn't seem cost-effective for what you'd get.
I agree, which is why I'm bearish on model merges unless you're mixing model families (i.e., Mistral + Llama).
These franken-merges just interleave finetunes of the same base model; it would make more sense to me to collapse all the params into a same-sized model via element-wise interpolation. So merging weights makes sense, but running params in parallel like these X-120B models has no payoff I can see beyond what collapsing the weights would already give you.
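By "collapse the params" I mean plain element-wise interpolation of matching tensors, roughly like this sketch (toy modules standing in for two finetunes of the same base):

```python
# Element-wise interpolation of two same-architecture checkpoints: the result
# has the same parameter count as either parent, unlike a stacked X-120B merge.
import torch
import torch.nn as nn

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Blend every parameter element-wise."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy stand-ins for "two finetunes of the same base model".
model_a, model_b = nn.Linear(16, 16), nn.Linear(16, 16)
merged = nn.Linear(16, 16)
merged.load_state_dict(interpolate_state_dicts(model_a.state_dict(), model_b.state_dict()))
```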
BayesMind to LocalLLaMA@poweruser.forum • 100B, 220B, and 600B models on huggingface! • 1 point · 2 years ago
We need a different flair for *New Model*s vs. *New Merge/Finetune*s.
BayesMind to LocalLLaMA@poweruser.forum • Running full Falcon-180B under budget constraint • 1 point · 2 years ago
If you want to benchmark the largest open-source model, Google recently released a 1.6T model: https://huggingface.co/google/switch-c-2048
Try out the 3B model, it’s great: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2