NeuralHermes-2.5: Boosting SFT models' performance with DPO

mlabonne · 2 years ago

Yes, I’d say it would probably work better than the current approach. If you look at the reward plots on wandb, it looks like the problem is too easy for the model, which would explain why the improvement is only slight.

https://preview.redd.it/xhuyiquojg3c1.png?width=2398&format=png&auto=webp&s=67725747e6cd9254e38728149fb6cea3ba85d71e
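
For anyone wondering what "too easy" looks like in those plots, here is a rough sketch of how DPO reward curves are usually derived, assuming the plots are the standard implicit-reward metrics (chosen reward, rejected reward, margin, accuracy). The function name `dpo_reward_stats`, the `beta=0.1` default, and the toy log-probabilities are made up for illustration and are not taken from the actual training run.

```python
import math

def dpo_reward_stats(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute DPO implicit-reward statistics for one batch.

    Each argument is a list of summed sequence log-probabilities, one per
    preference pair. beta=0.1 is an assumed value, not the run's setting.
    """
    # Implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x))
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen, rejected)]

    n = len(margins)
    # Accuracy: fraction of pairs where the chosen answer gets the higher reward
    accuracy = sum(m > 0 for m in margins) / n
    # DPO loss: -log(sigmoid(margin)) == log(1 + exp(-margin))
    loss = sum(math.log1p(math.exp(-m)) for m in margins) / n

    return {
        "reward_chosen": sum(chosen) / n,
        "reward_rejected": sum(rejected) / n,
        "margin": sum(margins) / n,
        "accuracy": accuracy,
        "loss": loss,
    }

# Toy batch (made-up numbers): the policy already prefers every chosen answer.
stats = dpo_reward_stats(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-20.0, -25.0],
    ref_chosen=[-11.0, -12.5],
    ref_rejected=[-15.0, -18.0],
)
print(stats)
```

If accuracy hits 1.0 and the margin keeps growing after only a few steps, the pairs are not giving the model much left to learn, which is what I mean by the problem being too easy.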