Goliath-120B - quants and future plans

AlpinDale · 1 year ago

Illustrious_Sand6784 · 1 year ago

I’m quite impressed with Goliath so far, so thank you and everyone who helped you make it.

The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D

My suggestion would be to try a 4-bit QLoRA fine-tune first and see how it performs before spending the money and compute required for a full fine-tune of such a massive model, only to have it possibly turn out mediocre.
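For a rough sense of why the 4-bit route is so much cheaper to try first, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than a measurement: the parameter count is just the nominal 120B from the model name, the 16 bytes per parameter is the commonly cited figure for mixed-precision Adam training, and activations, LoRA adapter weights, and quantization overhead are all ignored.

```python
# Back-of-the-envelope memory estimate. Every constant here is an assumption,
# not a measured value for Goliath-120B.

PARAMS = 120e9  # nominal parameter count, taken from the model name


def weights_gb(params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def full_finetune_gb(params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + Adam optimizer state under mixed precision.

    16 bytes/param = 2 (fp16 weights) + 2 (fp16 gradients)
                   + 12 (fp32 master weights, momentum, variance).
    Activation memory is not included.
    """
    return params * bytes_per_param / 1e9


qlora_base_gb = weights_gb(PARAMS, bits_per_param=4)  # frozen 4-bit base model
full_gb = full_finetune_gb(PARAMS)                    # full fine-tune state

print(f"4-bit frozen base weights: ~{qlora_base_gb:.0f} GB")
print(f"full fine-tune state:      ~{full_gb:.0f} GB")
print(f"ratio:                     ~{full_gb / qlora_base_gb:.0f}x")
```

The real QLoRA footprint will be somewhat higher than the first number once adapters, their optimizer state, and activations are added, but the order-of-magnitude gap is the point.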

On a related note, I’ve been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to a much smaller size while preserving accuracy. Goliath-120B came to be as an experiment in moving in the opposite direction of shearing. I’m now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we’d need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.
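To put rough numbers on the continued-pretraining cost, here is a sketch using the common approximation that training compute is about 6 × parameters × tokens. The 1.3B and 100B-token figures come from the post above; the sustained per-GPU throughput is a purely hypothetical value picked for illustration, so treat the GPU-day figure as an order-of-magnitude guess at best.

```python
# Rough training-compute estimate using the common rule of thumb
# FLOPs ~= 6 * parameters * tokens. This is an approximation, not a
# statement about how LLM-Shearing actually budgets its compute.

SECONDS_PER_DAY = 86_400


def train_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * params * tokens


def gpu_days(flops: float, sustained_flops_per_gpu: float) -> float:
    """GPU-days needed at a given sustained per-GPU throughput."""
    return flops / sustained_flops_per_gpu / SECONDS_PER_DAY


# Figures from the post: a ~1.3B model, continued pretraining on ~100B tokens.
small_flops = train_flops(params=1.3e9, tokens=100e9)

# Hypothetical sustained throughput (assumption, not a benchmark).
ASSUMED_FLOPS_PER_GPU = 150e12

print(f"1.3B model, 100B tokens: ~{small_flops:.2e} FLOPs")
print(f"at the assumed throughput: ~{gpu_days(small_flops, ASSUMED_FLOPS_PER_GPU):.0f} GPU-days")

# For the same token budget, cost scales linearly with parameter count,
# so a ~70B target costs roughly 70 / 1.3 times as much.
print(f"70B vs 1.3B at equal tokens: ~{train_flops(70e9, 100e9) / small_flops:.0f}x")
```

Under this approximation the 70B case costs about 54 times as much as the 1.3B case for the same number of tokens, which is consistent with calling it prohibitively expensive.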

I do think small models have a lot of good uses already and still have plenty of potential, especially ones that were created or fine-tuned for a specific task. I’m also sure we’ll get a general-purpose 1-10B parameter model that’s GPT-4 level within 5 years, but I really don’t see any 7B parameter model outperforming a good Llama-2-70B fine-tune before the second half of 2024 unless there’s some big breakthrough. So I’d really encourage you to do more research on large models instead: plenty of others are already working on improving small models, but barely anyone is doing research on improving large ones.

I know that it takes a lot of money and compute to fine-tune large models, that the time and cost grow with model size, and that most people can’t run large models locally. I know those are the main reasons why large models don’t get as much attention as smaller ones. But come on, there’s a new small base model every week now, while I was stuck with LLaMA-65B for like half a year, and then stuck with Llama-2-70B for months. The only better model, which was released only very recently (and might not actually be much better, still waiting for the benchmarks…), isn’t really even a base model, as it’s a merge of two fine-tuned Llama-2-70B models. Mistral-70B may not even be available to download and won’t be under a free license, and Yi-100B will be completely proprietary and unavailable to download. That leaves Llama-3-70B as the only upcoming model likely to outperform Llama-2-70B.