So far I’ve only seen merges between models that share the same upstream pretrained model at the same size.

At the very least, you should be able to merge any two models that share a tokenizer by adding their log-probs element-wise just before sampling (and renormalizing, since the sum is no longer a valid distribution). This would also unlock creative new samplers: instead of adding log-probs, maybe one model’s log-probs could constrain the other’s in interesting ways.
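Here is a rough sketch of what I mean, using NumPy and made-up logits over a tiny four-token vocabulary. The function names (`merge_logprobs`, `constrain_top_k`) and the top-k constraint are just my own illustration, not anything from an existing library, and the top-k mask is only one possible reading of "one model constrains the other".

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D array of logits.
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def merge_logprobs(logprobs_a, logprobs_b):
    # Element-wise addition of log-probs (a product of the two
    # distributions), renormalized so it sums to 1 again.
    return log_softmax(np.asarray(logprobs_a) + np.asarray(logprobs_b))

def constrain_top_k(logprobs_a, logprobs_b, k=2):
    # Hypothetical "constraint" sampler: model B picks the k allowed
    # tokens, model A decides how to rank them.
    logprobs_a = np.asarray(logprobs_a, dtype=float)
    allowed = np.argsort(logprobs_b)[-k:]
    masked = np.full_like(logprobs_a, -np.inf)
    masked[allowed] = logprobs_a[allowed]
    return log_softmax(masked)

# Toy next-token logits from two models with the SAME tokenizer (assumed).
lp_a = log_softmax([2.0, 1.0, 0.1, -1.0])
lp_b = log_softmax([0.5, 2.5, 0.0, -2.0])

merged = merge_logprobs(lp_a, lp_b)
constrained = constrain_top_k(lp_a, lp_b, k=2)

# Sample the next token id from the merged distribution.
rng = np.random.default_rng(0)
next_token = rng.choice(len(merged), p=np.exp(merged))
print(np.exp(merged), np.exp(constrained), next_token)
```

Both models have to be run on every step, so this costs roughly the sum of their inference budgets; nothing about the weights is actually merged.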

But two models with the same architecture and the same dataset will be heavily biased in the same direction, even if they are two different fine-tunes, so this approach seems to have a low ceiling.

Also, if you’re just doing a linear interpolation of same-dimensioned weights, why not collapse them into a normal-sized model? That is, 70B + 70B should still == 70B.
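To make the 70B + 70B == 70B point concrete, here is a minimal sketch with toy weight dictionaries standing in for real checkpoints. The names `lerp_weights` and `count_params`, and the `layer.weight` / `layer.bias` keys, are placeholders I made up for the example.

```python
import numpy as np

def lerp_weights(weights_a, weights_b, alpha=0.5):
    # Linear interpolation of two same-shaped weight dictionaries:
    # merged = (1 - alpha) * A + alpha * B, tensor by tensor.
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if a.shape != b.shape:
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = (1.0 - alpha) * a + alpha * b
    return merged

def count_params(weights):
    # Total number of scalar parameters across all tensors.
    return sum(w.size for w in weights.values())

# Two toy "fine-tunes" with identical shapes (assumed for illustration).
model_a = {"layer.weight": np.ones((2, 3)), "layer.bias": np.zeros(2)}
model_b = {"layer.weight": 3.0 * np.ones((2, 3)), "layer.bias": np.ones(2)}

merged = lerp_weights(model_a, model_b, alpha=0.5)
print(count_params(model_a), count_params(model_b), count_params(merged))
```

The merged dictionary has exactly as many parameters as either input, which is why a plain interpolation never needs to produce a bigger model.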

That said, you would get much more interesting models if you allowed merges across different architectures, trained from different initializations and on different datasets. I would think that the research on “token healing” would let you merge any two models, even if they have different tokenizers.

This seems like a cool way forward.

  • mcmoose1900B
    1 year ago

    Git Re-Basin claims to do this.

    But it’s untested on large models. There is a branch for it in mergekit, as well as a Stable Diffusion implementation (which works fantastically as a regular merger).

    • BayesMindOPB
      1 year ago

      Re-Basin! I was trying to recall this, thank you. Do you know whether it can mix model families? I thought it was just for identical architectures.