The new chat model released by Intel is now at the top of the Open LLM Leaderboard (among the 7B models).
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
The model seems cool and all, but the paper is better.
Intel eliminated the need to collect labeled preference data for direct preference optimization (DPO). Preference data is expensive and collecting it is a hassle, so this is a big deal. Best of all, it looks like their no-preference DPO actually performs better.
The trick is sampling rejects from a smaller, weaker model. Let’s say you have a dataset of GPT-4 completions. You mark those as good (“preferred”). You then send the same prompts to Llama 2 13B and mark its responses as rejects.
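Here is a rough sketch of that pairing step, to make the idea concrete. It is not Intel's code: the function name build_preference_pairs, the weak_generate callback, and the toy stand-in model are all made up for illustration, and the prompt/chosen/rejected keys just follow a common convention for DPO datasets (for example, the one Hugging Face TRL's DPOTrainer expects).

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    strong_completions: List[str],
    weak_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each strong-model completion ("chosen") with a weak-model one ("rejected").

    No human ever ranks anything: the label comes purely from which model
    produced the text.
    """
    if len(prompts) != len(strong_completions):
        raise ValueError("prompts and strong_completions must be the same length")

    pairs = []
    for prompt, chosen in zip(prompts, strong_completions):
        # In practice this would be a call into the smaller model
        # (Llama 2 13B in the example above).
        rejected = weak_generate(prompt)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical stand-in for the small model so the sketch runs without weights.
def toy_weak_model(prompt: str) -> str:
    return "I am not sure about: " + prompt


pairs = build_preference_pairs(
    prompts=["What is 2 + 2?"],
    strong_completions=["2 + 2 = 4."],  # pretend this came from GPT-4
    weak_generate=toy_weak_model,
)
print(pairs[0])
```

The resulting list of dicts is the only thing DPO training needs on the data side, which is why the extra complexity is so small: one more inference pass over prompts you already have.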
TL;DR: This could boost the performance of nearly every model with a minimal increase in complexity (though obviously the compute cost is non-zero, since you still have to generate the rejects).