Identity-PO: DeepMind takes the ELO out of DPO

georgejrjrjr · 1 year ago

https://www.reddit.com/r/LocalLLaMA/comments/183d0t6/comment/kap6r1c/?utm_source=share&utm_medium=web2x&context=3

Since it’s already been integrated into Hugging Face’s trainer (per the linked comment above), you should be able to follow the Hugging Face Alignment Handbook, with one (or two) small modifications:
* Optionally: instead of using preference data from UltraChat or whoever, you can use Intel’s trick and take the rejected samples from a weaker model (perhaps the model you’re finetuning, or Llama 2 13b as Intel did). This just means you label your original training set examples (or some subset of them) as ‘preferred’ and the weaker model’s completions of the same prompts as ‘rejected’.
* Instead of using the DPO option in TRL (Hugging Face’s training library), use the IPO option. That’s it.
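
To make the two modifications concrete, here is a rough sketch in plain Python, not the TRL code itself. The helper names (`build_preference_pairs`, `weak_generate`, `log_ratio_margin`) and the `completion` field are made up for illustration; the `prompt` / `chosen` / `rejected` keys follow the column layout TRL's DPO trainer expects as far as I know, and the IPO loss is the squared-error form from the paper. In TRL itself the switch should just be `loss_type="ipo"` on the DPO trainer, but check that against the version you have installed.

```python
import math
from typing import Callable, Dict, List


def build_preference_pairs(
    examples: List[Dict[str, str]],
    weak_generate: Callable[[str], str],  # hypothetical: prompt -> weak model completion
) -> List[Dict[str, str]]:
    """Label reference completions 'chosen' and weak-model completions 'rejected'."""
    pairs = []
    for ex in examples:
        rejected = weak_generate(ex["prompt"])
        # Skip degenerate pairs where the weak model reproduces the reference.
        if rejected.strip() == ex["completion"].strip():
            continue
        pairs.append(
            {"prompt": ex["prompt"], "chosen": ex["completion"], "rejected": rejected}
        )
    return pairs


def log_ratio_margin(
    policy_chosen: float, policy_rejected: float,
    ref_chosen: float, ref_rejected: float,
) -> float:
    """Policy log-prob gap (chosen minus rejected) minus the reference model's gap."""
    return (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """DPO: -log(sigmoid(beta * margin)), computed as a stable softplus(-x)."""
    x = beta * margin
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss(margin: float, beta: float = 0.1) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


if __name__ == "__main__":
    data = [{"prompt": "What is 2 + 2?", "completion": "4"}]
    # Stand-in for a real weaker model (assumption: any callable works here).
    pairs = build_preference_pairs(data, lambda prompt: "5")
    print(pairs)

    # Made-up sequence log-probs, just to compare the two losses.
    m = log_ratio_margin(-10.0, -14.0, -11.0, -12.0)
    print(m, dpo_loss(m), ipo_loss(m))
```

The thing to notice is that the DPO loss keeps shrinking as the margin grows, so it rewards pushing the pair apart without limit, while the IPO loss is minimized at a fixed margin of 1 / (2 * beta) and penalizes overshooting it.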