TL;DR

When fine-tuning for instruction following in a low-resource language, would you start from a base or a chat model? Would you include text corpora in that language as a first fine-tuning step? Or would you rather train a model from scratch?

What’s up?

I am in the process of trying to figure out how best to train a model for a low-resource language that is only marginally supported by open-source models such as Llama-2-7/13B and Mistral-7B.

I thought I would pick your collective brains about which strategy to use.

Material

I have an extensive corpus of texts (not prompt-response pairs, but books, articles, etc.), a large amount of translated instruction datasets (instruction, input, output à la Alpaca), and a small corpus of native instruction sets (mostly from Q&A web resources, as well as Wikipedia).
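For reference, the standard Alpaca template renders each such record into a single training string roughly like this (a minimal sketch; the field names match the datasets described above):

```python
# Standard Alpaca prompt templates; the record fields (instruction,
# input, output) match the translated/native datasets described above.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def format_example(example: dict) -> str:
    """Render one {instruction, input, output} record as a training string."""
    template = ALPACA_WITH_INPUT if example.get("input") else ALPACA_NO_INPUT
    return template.format(**example)
```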

Methods

I see at least four overarching approaches (each with a lot of choices as to the details):

  1. take a pre-trained base model such as Llama-2-13B-hf, ignore the texts, and go straight to supervised fine-tuning with the translated and native instruction sets.

  2. take a pre-trained chat model such as Llama-2-13B-chat-hf, ignore the texts, and go straight to supervised fine-tuning with the translated and native instruction sets.

  3. take a pre-trained base model such as Llama-2-13B-hf, fine-tune it on the texts (see the sketch after this list), and afterwards run supervised fine-tuning with the translated and native instruction sets.

  4. train a base model from scratch on the texts using an architecture such as Llama-2, and afterwards run supervised fine-tuning with the translated and native instruction sets.
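To make the "fine-tune it on the texts" step in 3) concrete: the books and articles would simply be tokenized and packed into fixed-length blocks for the usual causal-LM objective, along the lines of the standard Hugging Face language-modeling recipe (a sketch; the corpus path, model name, and block size are placeholders):

```python
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
block_size = 2048

# Placeholder path: one or more plain-text files from the raw corpus.
corpus = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate everything, then split into block_size chunks;
    # labels are the input_ids themselves (standard causal-LM objective).
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = corpus.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = lm_dataset.map(group_texts, batched=True)
```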

Except for 4, I would be training a sizable LoRA adapter on top of a 4-bit quantized base model on a single GPU; for 4, I would train the unquantized model on 4 GPUs.
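For 1–3, the single-GPU setup I have in mind looks roughly like this (a QLoRA-style sketch using transformers, peft, and bitsandbytes; the rank and target modules are illustrative, not fixed choices):

```python
# 4-bit quantized base model with a sizable LoRA adapter on top
# (QLoRA-style). Rank and target modules are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,                 # "sizable" adapter: relatively high rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```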

Current considerations

So far, I have tried 1), with somewhat promising results after 3 epochs of training on a translated Alpaca-GPT4 dataset.

I am wondering whether it is worth starting from a model already fine-tuned on English instruction sets (as in 2), or whether this might be detrimental.

Would you expect fine-tuning a model on a large amount of text data (as in 3) to improve the results of later fine-tuning on instruction sets in that language?

And what advantages would you expect from a model trained from scratch in the language? Beyond better-adapted tokenization, of course.

Any input is welcome :-)