NeuralChat 7B: Intel’s Chat Model Trained with DPO

aminedjeghri · 2 years ago

NeuralChat 7B: Intel’s Chat Model Trained with DPO

georgejrjrjr · 2 years ago

The model seems cool and all, but the paper is better.

Intel eliminated the preference data from direct preference optimization. Preference data is expensive and collecting it is a hassle, so this is a big deal. Best of all, it looks like their no-preference DPO actually performs better.

The trick is sampling rejects from a small model. Let’s say you have a dataset of GPT-4 completions. You mark those as good (“preferred”). You prompt Llama 2 13B and mark its responses as rejects.

Tl;dr This could boost the performance of nearly every model with a minimal increase in complexity (though obviously it’s non-zero compute).