Train Smarter, Not Harder? - MiniSymposium 7b

https://huggingface.co/kalomaze/MiniSymposium-Demo

MiniSymposium is an experimental model that I created based on Mistral 7b. I created it attempting to test these goals:

Demonstrate the untapped potential of using a small, focused dataset of handwritten examples instead of training on a large amount of synthetic GPT outputs, by lowering the learning rate and doing many passes over the small dataset
Create a dataset that allows the model to explore different possible answers from multiple perspectives before reaching a final conclusion (‘Socratic prompting’?)
Develop a model that performs well across various pseudo-markdown prompt formats, rather than overfitting to a specific kind of format such as ChatML, which should naturally benefit other general purpose use cases

The current trend in QLora/Lora-based finetuning (and finetuning in general for local LLMs) is to use large synthetic datasets. These are typically GPT-generated datasets trained with higher learning rates.

However, I believe there is a lot of potential in using small, hand-written datasets with low learning rates, even if it’s for general-purpose instruction following, as long as you train it for many epochs on a learning rate low enough to avoid overfitting.

This approach, I hypothesize, helps the model to learn the deeper patterns of instruction following , including the small details. This should help to avoid shallow data biases (like “As an AI made by OpenAI” and other GPT-isms) that are irrelevant to deeper instruction following patterns, especially in long context and multiturn scenarios.

My initial configuration for this QLora model used a constant learning rate of 1e-6 (0.000001), which resulted in obvious, massive overfitting after about 100 epochs. The model started reproducing the original dataset almost verbatim, and exhibited poor generalization across different prompt formats, including obvious hallucinations & also Chinese language outputs for some reason.

However, turning down the learning rate to 1/10th of (1e-7, which is 0.0000001) significantly improved the model with the same exact small dataset. I trained for about ~10 hours on my RTX 3060 to 600 epochs; I think it’s still a little undertrained, but I encourage people to try the demo model out in the meantime.

https://preview.redd.it/54imvd09ee2c1.png?width=1561&format=png&auto=webp&s=a0e603f5f5a960189b0d225ab5581f2a0339d12d

https://preview.redd.it/al6gmpuaee2c1.png?width=1132&format=png&auto=webp&s=5704aa41e87a5555664405d2f0178287bd7bde35

https://preview.redd.it/7fs90ictee2c1.png?width=1140&format=png&auto=webp&s=7f94c1d76493673d83e0d066efe9f43e21205fe7

It’s designed to be very adaptable to different prompt formats and playing roles, and I’ve gotten some fun and sometimes surprisingly good outputs so far.

A few samples of the training data are formatted like this to help avoid blatant overconfidence in its outputs, to serve as a sort of self-correction mechanism:

https://preview.redd.it/vlmyw1smfe2c1.png?width=2448&format=png&auto=webp&s=4c2cfea77188b9529c2c0c1c1fe29af9d152f0bf

Let me know how this model goes. There’s lots of merges of models that are all sort of doing the same thing, so I figured a more experimental approach would be appreciated. I think there is still more optimization for LR/epoch balance, and I’ll probably add some more examples of specific tasks like Summarization in the dataset so that it’s not *too* small (but still lightweight enough to generalize well).