
  • If there is something inherently superior about having a separate reward model, that should be teased out.

    It would be nice to see stronger baselines / ablations for this reason. I realize it’s nigh impossible to keep up with the unrelenting pace of advances, so I don’t fault the authors here. That said, if there isn’t a compelling reason to keep the separate preference model, community people-hours will probably be best spent sticking with DPO/IPO to avoid the hyper-parameter tuning rabbit hole.

    My guess: the way things are going, we’ll soon see a rough consensus emerge around a sane default DPO or Identity-PO recipe for fine-tunes (the same way we’ve seen gradual convergence around decoder-only transformer + rotary positional embeddings + grouped-query attention + FlashAttention-2) to be applied absent a compelling reason to use a different reward signal.
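
    For reference, a minimal sketch of the core objective such a recipe would optimize. The DPO loss needs nothing beyond per-sequence log-probabilities from the policy and a frozen reference model; variable names here are illustrative, not taken from any particular library:

    ```python
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss over a batch of (prompt, chosen, rejected) triples.

        Each argument is a tensor of summed token log-probabilities for the
        completions under the policy or the frozen reference model.
        """
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Widen the margin between chosen and rejected completions; no separate
        # reward model is ever trained or queried.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    ```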

    No matter what, preference datasets like this are helpful. Pity about the license being claimed here; it’s hard to imagine it would hold up, but the specter is a bit of a hindrance.


  • Sort-of.

    Refuel.ai fine-tuned a 13B Llama 2 for data labeling; it’s not hard to imagine applications for that here if the data volume were reasonable. Simplest thing that might work: take a paragraph at a time and have a data labeling model answer “Is this boilerplate or content?”

    Another possibility is using the TART classifier head from Hazy Research: find as many as 256 pairs of boilerplate vs. content, and use only as large a model as you need to get good classification results. If your data volume is large, you would do this for a while, get a larger corpus of content vs. boilerplate, and train a more efficient classifier with fastText or something similar (probably bigram-based).
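
    A minimal sketch of that last fastText step, assuming you’ve already dumped labeled paragraphs to disk in fastText’s one-example-per-line __label__ format (the file name is made up):

    ```python
    import fasttext

    # train.txt lines look like:
    #   __label__content The quarterly results show ...
    #   __label__boilerplate Subscribe to our newsletter ...
    model = fasttext.train_supervised(
        input="train.txt",
        wordNgrams=2,  # bigram features, as suggested above
        epoch=25,
        lr=0.5,
    )

    labels, probs = model.predict("Click here to accept cookies and subscribe.")
    print(labels, probs)  # e.g. ('__label__boilerplate',) [0.97...]
    ```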




  • The broad outline:
    * You would need an easy way for people to throw their GPU idle time at a cluster, and a reason to do so (i.e., what do your hosts get out of the deal?).

    * You need an easy way to ingest datasets for training LoRAs.

    * You’d need an automated pipeline to turn those fine-tuning datasets into aligned LoRAs, to be propagated to your inference nodes.

    * You’d probably want to think about retrieval, and whether you would like that to be part of the story (and whether it puts you at additional legal risk).

    * You’d need a fast inference server with S-LoRA (or whatever the leading method for batch inference with LoRAs is next week).

    * You would need an HTTPS server on the front end that terminates TLS for all your endpoints, and routes API requests to the appropriate LoRA (a rough sketch of this piece follows the list).

    * You need a way to keep those certificates and inference server addresses up to date in spite of churn.

    * You need to figure out your cost model, and a revenue-sharing model for your hosting providers if applicable; ideally one that doesn’t involve a cryptocurrency unless you have a limitless legal budget, are based in El Salvador, and are personal friends with the Bukele family.
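
    Here’s a toy sketch of just the routing piece. The adapter table, backend addresses, and request shape are all invented for illustration; in a real deployment they would come from whatever discovery layer handles the churn mentioned above:

    ```python
    # Toy front-end router: map an API "model" name to a LoRA adapter on some backend.
    # All names and addresses are placeholders; TLS termination would sit in front of this.
    import httpx
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    # In practice this table is kept fresh by your discovery/registration layer.
    ADAPTERS = {
        "acme-support-bot": {"backend": "http://10.0.0.12:8000", "lora": "acme-support-v3"},
        "legal-summarizer": {"backend": "http://10.0.0.17:8000", "lora": "legal-sum-v1"},
    }

    class CompletionRequest(BaseModel):
        model: str
        prompt: str
        max_tokens: int = 256

    @app.post("/v1/completions")
    async def completions(req: CompletionRequest):
        entry = ADAPTERS.get(req.model)
        if entry is None:
            raise HTTPException(status_code=404, detail=f"unknown model {req.model}")
        async with httpx.AsyncClient() as client:
            # Forward to the inference node, telling it which LoRA to apply.
            resp = await client.post(
                f"{entry['backend']}/generate",
                json={"prompt": req.prompt, "max_tokens": req.max_tokens,
                      "lora_adapter": entry["lora"]},
                timeout=120.0,
            )
        return resp.json()
    ```

    The interesting work is everything around a stub like this: keeping the adapter table fresh, batching requests per backend, and deciding which node hosts which adapter.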

    From the generality of your question, your best bet would probably be to hire me ;-).




  • Edits aren’t working for me somehow, so here’s my update:

    First, as I mentioned on Twitter but failed to address here, this is at least excellent PR. So that may be all it is: basically a more sophisticated “AGI achieved internally” troll. I would suggest taking Q* discourse with all due salt.

    From context and the description, it looks like OpenAI published about the technique in question here: https://openai.com/research/improving-mathematical-reasoning-with-process-supervision

    The result is pretty unsurprising: given process supervision (i.e., help from a suitably accurate model of a particular process), models perform better.

    Well…yeah. It’s probably an impactful direction for AI as people find ways to build good process models, but it isn’t an especially novel finding, nor is it a reason to blow up a company. This updates me further in the direction of, “Q* discourse was a brilliant PR move to capitalize on the controversy and direct attention away from the board power struggle.”

    Which doesn’t mean it can’t also be a good intuition pump for the open source world. Every big lab seems to be thinking about model-based supervision; it would be a little bit silly if we weren’t. So coming back to the original question:

    How might we use this?

    I think the question reduces to, “What means of supervision are available?”

    Once you have a supervisor to play “warmer / colder” with the model, the rest is trivial.
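
    To make “warmer / colder” concrete, a toy step-level loop: a generator proposes candidate next steps, the supervisor scores each one, and you keep the warmest. `propose_steps` and `score_step` are stand-ins for whatever generator and process supervisor you actually have (an arithmetic checker, a test runner, a learned process reward model, etc.):

    ```python
    from typing import Callable, List

    def guided_solve(problem: str,
                     propose_steps: Callable[[str, int], List[str]],
                     score_step: Callable[[str, str], float],
                     n_candidates: int = 8,
                     max_steps: int = 10) -> str:
        """Greedy step-level search driven by a process supervisor ("warmer / colder")."""
        solution = ""
        for _ in range(max_steps):
            candidates = propose_steps(solution, n_candidates)  # e.g., sampled LLM continuations
            if not candidates:
                break
            # Keep whichever candidate step the supervisor says is warmest.
            best = max(candidates, key=lambda step: score_step(problem, solution + step))
            solution += best
        return solution
    ```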

    I’m curious what models you all expect to come online to supervise LLMs. Arithmetic has already been reported. Code, too.

    What else?


  • Uh, it’s ~great if you have a model of a problem domain that can solve that kind of problem, and you want an LLM to talk about the solution.

    If you read the paper, you’ll see they cut the LLM calls way down by calling a domain-specific model to do the actual problem solving. They have an ablation where they let the LLM do the very last step of a multi-step problem and performance plummets.

    I think the presentation is a little bit deceptive. The MCTS is not really helping the LLM work through the problem; the LLM is essentially just talking about the solution found by the other model.
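
    To make the division of labor concrete, a toy version (SymPy standing in for the domain-specific solver, the LLM call left as a comment since any chat model would do, and the prompt wording made up):

    ```python
    import sympy as sp

    # The domain-specific model does the actual problem solving...
    x = sp.symbols("x")
    roots = sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x)  # [2, 3]

    # ...and the LLM only has to narrate the solution it was handed.
    prompt = (
        "Explain, step by step, why the solutions to x^2 - 5x + 6 = 0 "
        f"are {roots}."
    )
    # explanation = llm(prompt)  # the LLM never has to find the roots itself
    ```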




  • It’s better than that, imo, when you look at it in context.

    Particularly in light of Intel’s finding the other day that DPO works well (probably better) without preference data.

    “Alignment” methods are getting simpler, easier, and more effective.

    RLHF was a huge pain, because there were a ton of hyperparameters to tweak, and it’s expensive to get human data.

    Constitutional AI (RLAIF) dealt with some of the cost and difficulty by using AI preference data, but still left the need to collect preference data, and all the hyperparameter tweaking, intact.

    DPO eliminated the superfluous reward model, simplifying things greatly, and making overfitting less pernicious.

    Intel got rid of preference data altogether.

    IPO claims to fix overfitting altogether, while simplifying further.
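
    Concretely, IPO’s change to the objective is small: instead of pushing the policy-vs-reference log-ratio margin through a log-sigmoid (which rewards making the margin arbitrarily large), it regresses that margin toward a fixed target of 1/(2β), which is where the overfitting fix comes from. A sketch, using the same log-probability inputs a DPO loss would take:

    ```python
    def ipo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Same policy-vs-reference margin DPO uses...
        margin = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        # ...but regressed toward a fixed target instead of pushed through a sigmoid,
        # so there is nothing to gain from driving the margin to infinity.
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    ```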

    I figure within a month, Axolotl will grow a flag that means “and also IPO this,” with no additional cognitive overhead or hyperparameter tuning required, and, yes, the water line for model quality is going to go up.





  • That’s awesome, and I could see it being pretty useful for synthetic data generation with more compute intensity.

    90s/t is serial decoding, right? I guess your CPU utilization is approaching zero. What happens when you push the batch size until you’re > 50% CPU utilization? (At some point it might make sense to dedicate a core to tokenization).

    The potential gains from speculative decoding here seem likely to be big, too, since you’d only be running the big model once every several tokens. I imagine sticking Mistral in VRAM, after fine-tuning with the same instruction tuning corpus as your Falcon (though there are fancier ways to do sketch model / big model alignment, too).
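
    If you want to try the speculative decoding part without writing the verification loop yourself, recent versions of Hugging Face transformers expose it as assisted generation via generate(assistant_model=...). One caveat: the vanilla implementation expects the draft model to share the big model’s tokenizer, so the placeholder below uses a Falcon-family draft; using Mistral as the draft, as floated above, is one of those fancier alignment cases. Checkpoint names and the CPU/VRAM split are illustrative, not a tested config:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    big_id, draft_id = "tiiuae/falcon-40b-instruct", "tiiuae/falcon-7b-instruct"

    tok = AutoTokenizer.from_pretrained(big_id)
    big = AutoModelForCausalLM.from_pretrained(big_id, torch_dtype=torch.bfloat16)  # stays on CPU
    draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16).to("cuda")  # sits in VRAM

    inputs = tok("Write three synthetic customer-support questions.", return_tensors="pt")
    # The draft proposes several tokens per round; the big model only runs to verify them.
    out = big.generate(**inputs, assistant_model=draft, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```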

    Total aside: I don’t know if you saw the sub-1-bit compression of mixture-of-experts models, but it might be up your alley. Fun if we ever get weights for a big mixture model (https://github.com/IST-DASLab/qmoe).


  • The model seems cool and all, but the paper is better.

    Intel eliminated the preference data from direct preference optimization. Preference data is expensive and collecting it is a hassle, so this is a big deal. Best of all, it looks like their no-preference DPO actually performs better.

    The trick is sampling rejects from a small model. Let’s say you have a dataset of GPT-4 completions. You mark those as good (“preferred”). You prompt Llama 2 13B and mark its responses as rejects.

    Tl;dr: this could boost the performance of nearly every model with a minimal increase in complexity (though obviously it’s non-zero compute).
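
    A sketch of the data-building step, assuming you already have a JSONL file of prompts with GPT-4 completions. The keys match the prompt/chosen/rejected format DPO trainers (e.g., TRL’s DPOTrainer) expect; file and checkpoint names are just examples:

    ```python
    import json
    from transformers import pipeline

    # Small model used only to produce the "rejected" side of each pair.
    reject_sampler = pipeline(
        "text-generation", model="meta-llama/Llama-2-13b-chat-hf", device_map="auto"
    )

    pairs = []
    with open("gpt4_completions.jsonl") as f:  # one {"prompt": ..., "completion": ...} per line
        for line in f:
            row = json.loads(line)
            rejected = reject_sampler(
                row["prompt"], max_new_tokens=512, do_sample=True, return_full_text=False
            )[0]["generated_text"]
            pairs.append({
                "prompt": row["prompt"],
                "chosen": row["completion"],  # GPT-4 output, marked preferred
                "rejected": rejected,         # small-model output, marked rejected
            })

    with open("preference_pairs.jsonl", "w") as out:
        for p in pairs:
            out.write(json.dumps(p) + "\n")
    ```

    From there it’s ordinary DPO: the pairs say, in effect, “be more like GPT-4 and less like a 13B.”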