A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has now released quantized versions. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out good is completely unexpected. Every LM researcher I’ve spoken to about this in the past few days has been completely baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D
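For anyone wondering what "stitched layers" means in practice, here is a minimal sketch. It assumes the merge interleaves overlapping layer ranges from two donor models; the post does not spell out the recipe, so the slice boundaries and the donor names `A` and `B` below are made up for illustration. It only tracks layer indices, not weights.

```python
# Hypothetical slice plan: (donor, first_layer, last_layer_exclusive).
# These boundaries are invented for illustration, not Goliath's real recipe.
SLICES = [
    ("A", 0, 16),
    ("B", 8, 24),
    ("A", 16, 32),
    ("B", 24, 40),
]


def stitch(slices):
    """Return the stacked layer order as (donor, layer_index) pairs."""
    return [(donor, i) for donor, start, end in slices for i in range(start, end)]


def seams(stack):
    """Positions where consecutive layers come from different donors."""
    return [pos for pos in range(1, len(stack)) if stack[pos][0] != stack[pos - 1][0]]


stack = stitch(SLICES)
print(len(stack))    # 64 layers built from two 40-layer donors
print(seams(stack))  # [16, 32, 48]
```

The seams are the spots where a layer receives activations from a model it was never trained alongside, which is why a finetune after stitching seems like the natural next step.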

On a related note, I’ve been working on LLM-Shearing lately, which would essentially let us shear a transformer down to much smaller sizes while preserving accuracy. goliath-120b came to be as an experiment in moving in the opposite direction of shearing. I’m now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we’d need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.
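To give a rough feel for what shearing does, here is a toy sketch. It is a crude stand-in, not the LLM-Shearing method: it only drops whole layers, and it ranks them by an importance score that is simply given. The function name `shear_layers` and the scores are hypothetical. An actual shearing method would decide what to prune through training and is not limited to whole layers.

```python
def shear_layers(scores, keep):
    """Keep the `keep` highest-scoring layers, preserving their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Made-up importance scores for an 8-layer toy model.
scores = [0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.95]
kept = shear_layers(scores, keep=5)
print(kept)  # [0, 2, 4, 6, 7]
```

The surviving layers no longer see the inputs they were trained on, which is where the continued pretraining cost comes from.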

If anyone has suggestions, please let me know. Cheers!

  • FPhamB
    1 year ago

    I suspect that it behaves roughly as if you had (fictitious) Xwin and Euryale adapters and applied them as catsum, which sums the ranks (so 2x256 rank would become 512 rank!) but improves the response only a tiny bit.

    But in this case we are summing the “virtual” rank of two 70b models. The model could be a smidgen smarter, but not by much, because huge chunks of the weights overlap. We are probably wasting 80b parameters :) that do not contribute.

    A proper test has to be done between the Sum and both Xwin and Euryale to see the actual result. I’ve seen it many times with fine-tuning: I attributed the good response to the fine-tune, but when I A/B tested, it was mostly due to the prior model and the fine-tune really added only a tiny bit.

    I’m honestly more interested in the opposite direction: making models smaller while losing maybe only a smidgen of knowledge.
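The rank-summing analogy in the comment above can be made concrete with a small sketch. It assumes standard LoRA-style adapters, where each adapter adds a low-rank update B·A to a weight matrix. Concatenating two rank-r adapters gives one adapter of rank at most 2r whose update equals the sum of the two separate updates. The sizes and random values are arbitrary, and the helper names are made up for this example.

```python
import random


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def rand(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, rank = 6, 2  # toy sizes; the comment's example uses rank 256

B1, A1 = rand(d, rank, rng), rand(rank, d, rng)  # first adapter
B2, A2 = rand(d, rank, rng), rand(rank, d, rng)  # second adapter

# Concatenate: B along columns (d x 2*rank), A along rows (2*rank x d).
B_cat = [row1 + row2 for row1, row2 in zip(B1, B2)]
A_cat = A1 + A2

delta_cat = matmul(B_cat, A_cat)
delta_sum = add(matmul(B1, A1), matmul(B2, A2))

max_diff = max(
    abs(a - b) for r1, r2 in zip(delta_cat, delta_sum) for a, b in zip(r1, r2)
)
print(len(A_cat))        # 4, i.e. 2 * rank
print(max_diff < 1e-12)  # True
```

The combined rank is only an upper bound: if the two adapters learned overlapping directions, the effective rank is lower, which matches the point about overlapping weights not contributing much.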