NeuralHermes-2.5: Boosting SFT models' performance with DPO

mlabonne · 2 years ago

kpodkanowicz · 2 years ago

Really cool! What do you think about using GPT-3.5 outputs as the worse (rejected) answers, in the hope of squeezing out some extra edge?

mlabonne · 2 years ago

Yes, I’d say it would probably work better than the current approach. If you look at the reward plots on wandb, it feels like the problem is too easy for the model, which would explain why the improvement is only slight.

https://preview.redd.it/xhuyiquojg3c1.png?width=2398&format=png&auto=webp&s=67725747e6cd9254e38728149fb6cea3ba85d71e
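
One way to read the "too easy" remark, sketched with the standard DPO loss: when the rejected answer is already much less likely than the chosen one, the implicit reward margin is large, the loss is close to zero, and each pair contributes very little gradient. The function name `dpo_terms`, the log-probabilities, and `beta=0.1` below are illustrative assumptions, not values from the actual training run.

```python
import math

def dpo_terms(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss, reward margin, and gradient weight.

    Hypothetical helper for illustration; inputs are summed
    log-probabilities of each answer under the policy and the
    frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))

    # The gradient of the loss is scaled by sigmoid(-margin):
    # pairs the model already separates well barely move the weights.
    grad_weight = 1.0 / (1.0 + math.exp(margin))
    return loss, margin, grad_weight

# Made-up numbers: an "easy" pair (rejected answer is clearly worse)
# versus a "hard" pair (both answers are about equally likely).
easy = dpo_terms(-10.0, -60.0, -12.0, -20.0)
hard = dpo_terms(-10.0, -11.0, -10.0, -11.0)
print("easy pair: loss=%.4f margin=%.2f grad_weight=%.4f" % easy)
print("hard pair: loss=%.4f margin=%.2f grad_weight=%.4f" % hard)
```

Under this reading, swapping in stronger rejected answers (for example GPT-3.5 outputs, as suggested above) would shrink the margin and keep more pairs in the regime where the gradient weight is far from zero. Whether that translates into better benchmark scores is something only a run can show.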