I wanted to try finetuning a model in Swedish, since so few Swedish models are available. Here is my first attempt: bellman-7b. The name comes from a famous Swedish singer and poet who lived in the 1700s: https://huggingface.co/neph1/bellman-7b-1k
So far it has been tuned for one epoch on https://huggingface.co/datasets/jeremyc/Alpaca-Lora-GPT4-Swedish using a Google Colab V100. The dataset is machine-translated and, as you might expect, not perfect.
The model has picked up Swedish really well, though. I didn’t expect one epoch to make it that good. It’s based on NousResearch/Llama-2-7b-chat-hf, mainly because it allowed me to try out finetuning on the free tier of Colab. The knowledge quality of the model is so-so, though. It usually gets the first sentence right and then starts to hallucinate wildly. I expect more training would help, but I’m not sure whether to continue or start over with a Mistral base instead.
The repetition bug is also prevalent, to the point where it would be hilarious if I hadn’t spent time and money on this. :) I don’t see anyone talking about it anymore, so I assume it has been solved in more recent models?
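To put a rough number on how repetitive an output is, something like the sketch below might help when comparing checkpoints. It is not part of the bellman-7b training code; the function name `repeated_ngram_ratio` and the choice of 3-word n-grams are assumptions made for illustration only.

```python
def repeated_ngram_ratio(text, n=3):
    """Share of word n-grams that are repeats of an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean the text loops.
    Hypothetical helper: the metric and n=3 are illustrative assumptions.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


normal = "Carl Michael Bellman var en svensk skald"
looping = "Bellman var skald Bellman var skald Bellman var skald"

print(repeated_ngram_ratio(normal))
print(repeated_ngram_ratio(looping))
```

Running this over a fixed set of prompts before and after more training would give a crude way to see whether the repetition is getting better or worse.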
For future finetuning, I’ve made a number of fixes to the dataset: removing some obvious mistakes, pruning some odd generations, and hand-refining the first 100 rows (out of 52000).
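For anyone wanting to do a similar cleanup pass, here is a minimal sketch of the kind of pruning I mean. It assumes Alpaca-style rows with `instruction`, `input` and `output` fields; the specific heuristics, the `max_len` threshold and the function names are illustrative assumptions, not the exact fixes applied to this dataset.

```python
def is_suspect(row, max_len=2000):
    """Return True if an Alpaca-style row looks like a bad generation.

    The checks below are illustrative assumptions, not the actual rules
    used for the Swedish dataset.
    """
    instruction = row.get("instruction", "").strip()
    output = row.get("output", "").strip()
    if not instruction or not output:
        return True  # empty instruction or empty answer
    if output == instruction:
        return True  # the answer just echoes the instruction
    if len(output) > max_len:
        return True  # suspiciously long, often a runaway generation
    return False


def prune(rows):
    """Keep only rows that pass the heuristic checks."""
    return [row for row in rows if not is_suspect(row)]


rows = [
    {"instruction": "Översätt till engelska: hej", "input": "", "output": "hello"},
    {"instruction": "Skriv en dikt", "input": "", "output": ""},
    {"instruction": "Vad är 2+2?", "input": "", "output": "Vad är 2+2?"},
]
cleaned = prune(rows)
print(len(cleaned))
```

Heuristics like these only catch the mechanical failures. Translation errors still need a human pass, which is why the first rows were refined by hand.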
I think I’ll also try to produce an additional small dataset (let’s call it ‘alignment’) to apply afterwards. This would include some more knowledge in the Swedish language, etc. And some RLHF. So if anyone tries it out, feel free to send me your chat logs. If they’re corrected, all the better, but anything would help.
Overall, it’s been a fun learning experience so far, since this was the first time I used Google Colab for anything, and the first time I’ve quantized anything.
Would you advise me to start over with a better base and a better dataset, or continue for more epochs with what I have?
Try finetuning a 13B model instead, which has a way better command of Swedish than the 7B. In my experience it also tends to have fewer issues with becoming repetitive, etc.
I will. I also like 13B models. They seem like the perfect balance for us GPU-starved people. But I’d rather fail a few times on 7B models first, since it’s quicker to iterate on them.