I came across this new fine-tuned model based on OpenChat 3.5, which is apparently trained using Reinforcement Learning from AI Feedback (RLAIF).
https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha
Check out this tweet: https://twitter.com/bindureddy/status/1729253715549602071
If there is something inherently superior about keeping a separate reward model, that should be teased out.
For that reason, it would be nice to see stronger baselines / ablations. I realize it's nigh impossible to keep up with the unrelenting pace of advances, so I don't fault the authors here. That said, if there isn't a compelling reason to keep the separate preference model, community people-hours are probably best spent sticking with DPO/IPO and avoiding the hyper-parameter tuning rabbit hole.
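For anyone who hasn't looked closely, the appeal is that DPO folds the reward model into the policy itself. A minimal sketch of the loss in PyTorch (names are my own, and it assumes you've already summed per-token log-probs into per-sequence log-probs for the chosen/rejected completions under the policy and a frozen reference model):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # The "reward" is implicit: beta * log(pi_theta / pi_ref) per completion,
        # so there is no separately trained reward model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the log-sigmoid of the reward margin over preference pairs.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # e.g. on a dummy batch of 4 preference pairs:
    # loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))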
My guess: the way things are going, we'll soon see a rough consensus emerge around a sane default DPO or Identity-PO recipe for fine-tunes (the same way we've seen gradual convergence around decoder-only transformer + rotary positional embeddings + grouped-query attention + FlashAttention-2), to be applied absent a compelling reason to use a different reward signal.
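For completeness, the IPO variant is just a sketch away under the same assumptions as above: it keeps the same implicit-reward margin but regresses it toward a fixed target instead of pushing it through a log-sigmoid, which is the part that's supposed to make it less prone to overfitting the preference data:

    def ipo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, tau=0.1):
        # Same log-ratio margin as DPO...
        margin = (policy_chosen_logps - ref_chosen_logps) \
                 - (policy_rejected_logps - ref_rejected_logps)
        # ...but regressed toward 1/(2*tau) rather than driven to infinity.
        return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()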
No matter what, preference datasets like this are helpful. Pity about the license being claimed here; it's hard to imagine it would hold up, but the specter of it is a bit of a hindrance.