Hi All,
I previously discussed [Phibrarian Alpha] here, a model which was fine-tuned over Mistral 7b with a 32k context window. The fine-tune ran for four epochs on over 1 billion tokens of high-quality synthetic data plus high-quality instruction data (OpenOrca / WizardCoder). This model is now fully trained.
However, we encountered issues regarding its accuracy. Although it was fine-tuned with diverse educational sources, which gave it an informative tone, it often generated inaccurate yet detailed information. To address this, I began working with RAG to enhance the model’s accuracy. Fortunately, a promising approach called self-rag was introduced recently.
I further fine-tuned the SciPhi model on the self-rag data as well as some of the RAG-instruct data which I had previously prepared. The result, SciPhi-Self-RAG-Mistral-7B-32k, is a model that is more powerful and more accurate. The downside is that this model requires access to a RAG database to be run, so I set out to provide free and open access to the one cited in the paper. This is now online, and you can read about it in the documentation here.
Here are the eval comparisons -
Running this model is slightly more complicated than other LLMs because of the RAG integration, so one other goal was to build a turn-key open source solution. Below is what the API I cooked up looks like.
The SciPhi API for RAG + LLM Eval
With this pipeline you can use your own local model, and with sciphi-infra you can host your own embedding db, if desired.
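For anyone who wants a feel for the retrieve-then-generate flow behind a RAG pipeline like this, here is a minimal Python sketch. It is not the SciPhi API: every name in it (`retrieve`, `build_prompt`, `answer`, the `generate` callable) is hypothetical, and the retriever is a toy bag-of-words cosine similarity standing in for a real embedding db. The model is passed in as a plain function so you could swap in a local LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, k=2):
    """Return up to k passages most similar to the query.

    Toy stand-in for an embedding db lookup (assumption: a real
    pipeline would use dense embeddings, not word counts).
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(query, context):
    """Put the retrieved passages ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


def answer(query, passages, generate, k=2):
    """Retrieve context, build the prompt, and call the model.

    `generate` is any callable mapping a prompt string to a completion
    string (hypothetical interface, e.g. a wrapper around a local model).
    """
    context = retrieve(query, passages, k=k)
    return generate(build_prompt(query, context))


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "Mistral 7B is a language model.",
        "The mitochondria is the powerhouse of the cell.",
    ]
    # Echo "model" that just returns the prompt, so the retrieved
    # context is visible without loading an actual LLM.
    print(answer("What is the capital of France?", docs, lambda p: p, k=1))
```

Passing the model in as a callable keeps the retrieval step independent of whichever LLM you run, which is roughly the separation described above between the model and the embedding db.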
Some notes - the model still struggles with chat in some ways, as the current fine-tuning dataset is not fully optimized for this. This is something that I am still working on, but I think there is an opportunity here for the greater community to work on improving pipelines around RAG - so I’m hoping to see some cool models get built on top of this database.
Further, I’m working on extending the data sources in the RAG db well beyond those quoted in the self-rag paper, as it appears to be an incredibly promising approach.
Lastly, here is a random example output I just produced -
Please take a look and let me know your thoughts. I’ve appreciated all the valuable feedback thus far.
Yes, I’m very bullish on trying to just use small LLMs to reason and then augment them with additional data streams - I think this is our best chance at competing w/ OAI and others.