
  • If there is something inherently superior about having a separate reward model, that should be teased out.

    It would be nice to see stronger baselines / ablations for this reason. I realize it’s nigh impossible to keep up with the unrelenting pace of advances, so I don’t fault the authors here. That said, if there isn’t a compelling reason to keep the separate preference model, community people-hours will probably be best spent sticking with DPO/IPO to avoid the hyper-parameter tuning rabbit hole.

    My guess: the way things are going, we’ll soon see a rough consensus emerge around a sane default DPO or Identity-PO recipe for fine-tunes (the same way we’ve seen gradual convergence around decoder-only transformer + rotary positional embeddings + grouped-query attention + FlashAttention-2) to be applied absent a compelling reason to use a different reward signal.
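
    For reference, a minimal sketch of the core objective such a recipe would optimize. The DPO loss needs nothing beyond per-sequence log-probabilities from the policy and a frozen reference model; variable names here are illustrative, not taken from any particular library:

    ```python
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss over a batch of (prompt, chosen, rejected) triples.

        Each argument is a tensor of summed token log-probabilities for the
        completions under the policy or the frozen reference model.
        """
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Widen the margin between chosen and rejected completions; no separate
        # reward model is ever trained or queried.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    ```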

    No matter what, preference datasets like this are helpful. Pity about the license being claimed here; it’s hard to imagine it would hold up, but the specter is a bit of a hindrance.


  • Sort-of.

    Refuel.ai fine-tuned a 13B Llama 2 for data labeling; it’s not hard to imagine applications for that here if the data volume were reasonable. Simplest thing that might work: take a paragraph at a time and have a data labeling model answer “Is this boilerplate or content?”

    Another possibility is using the TART classifier head from Hazy Research: find as many as 256 pairs of boilerplate vs. content, and use only as large a model as you need to get good classification results. If your data volume is large, you would do this for a while, get a larger corpus of content vs. boilerplate, and train a more efficient classifier with fastText or something similar (probably bigram-based).
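
    A minimal sketch of that last fastText step, assuming you’ve already dumped labeled paragraphs to disk in fastText’s one-example-per-line __label__ format (the file name is made up):

    ```python
    import fasttext

    # train.txt lines look like:
    #   __label__content The quarterly results show ...
    #   __label__boilerplate Subscribe to our newsletter ...
    model = fasttext.train_supervised(
        input="train.txt",
        wordNgrams=2,  # bigram features, as suggested above
        epoch=25,
        lr=0.5,
    )

    labels, probs = model.predict("Click here to accept cookies and subscribe.")
    print(labels, probs)  # e.g. ('__label__boilerplate',) [0.97...]
    ```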




  • The broad outline:
    * You would need an easy way for people to throw their GPU idle time at a cluster, and a reason to do so (i.e., what do your hosts get out of the deal?).

    * You need an easy way to ingest datasets for training LoRAs.

    * You’d need an automated pipeline to turn those fine-tuning datasets into aligned LoRAs, to be propagated to your inference nodes.

    * You’d probably want to think about retrieval, and whether you would like that to be part of the story (and whether it puts you at additional legal risk).

    * You’d need a fast inference server with S-LoRA (or whatever the leading method for batch inference with LoRAs is next week).

    * You would need an HTTPS server on the front end that terminates TLS for all your endpoints, and routes API requests to the appropriate LoRA (a rough sketch of this piece follows the list).

    * You need a way to keep those certificates and inference server addresses up to date in spite of churn.

    * You need to figure out your cost model, and a revenue-sharing model for your hosting providers if applicable; ideally one that doesn’t involve a cryptocurrency unless you have a limitless legal budget, are based in El Salvador, and are personal friends with the Bukele family.
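
    Here’s a toy sketch of just the routing piece. The adapter table, backend addresses, and request shape are all invented for illustration; in a real deployment they would come from whatever discovery layer handles the churn mentioned above:

    ```python
    # Toy front-end router: map an API "model" name to a LoRA adapter on some backend.
    # All names and addresses are placeholders; TLS termination would sit in front of this.
    import httpx
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    # In practice this table is kept fresh by your discovery/registration layer.
    ADAPTERS = {
        "acme-support-bot": {"backend": "http://10.0.0.12:8000", "lora": "acme-support-v3"},
        "legal-summarizer": {"backend": "http://10.0.0.17:8000", "lora": "legal-sum-v1"},
    }

    class CompletionRequest(BaseModel):
        model: str
        prompt: str
        max_tokens: int = 256

    @app.post("/v1/completions")
    async def completions(req: CompletionRequest):
        entry = ADAPTERS.get(req.model)
        if entry is None:
            raise HTTPException(status_code=404, detail=f"unknown model {req.model}")
        async with httpx.AsyncClient() as client:
            # Forward to the inference node, telling it which LoRA to apply.
            resp = await client.post(
                f"{entry['backend']}/generate",
                json={"prompt": req.prompt, "max_tokens": req.max_tokens,
                      "lora_adapter": entry["lora"]},
                timeout=120.0,
            )
        return resp.json()
    ```

    The interesting work is everything around a stub like this: keeping the adapter table fresh, batching requests per backend, and deciding which node hosts which adapter.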

    From the generality of your question, your best bet would probably be to hire me ;-).




  • Edits aren’t working for me somehow, so here’s my update:

    First, as I mentioned on Twitter but failed to address here, this is at least excellent PR. So that may be all it is: basically a more sophisticated “AGI achieved internally” troll. I would suggest taking Q* discourse with all due salt.

    From context and the description, it looks like OpenAI published about the technique in question here: https://openai.com/research/improving-mathematical-reasoning-with-process-supervision

    The result is pretty unsurprising: given process supervision (i.e., help from a suitably accurate model of a particular process), models perform better.

    Well…yeah. It’s probably an impactful direction for AI as people find ways to build good process models, but it isn’t an especially novel finding, nor is it a reason to blow up a company. This updates me further in the direction of, “Q* discourse was a brilliant PR move to capitalize on the controversy and direct attention away from the board power struggle.”

    Which doesn’t mean it can’t also be a good intuition pump for the open source world. Every big lab seems to be thinking about model-based supervision; it would be a little bit silly if we weren’t. So coming back to the original question:

    How might we use this?

    I think the question reduces to, “What means of supervision are available?”

    Once you have a supervisor to play “warmer / colder” with the model, the rest is trivial.
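
    To make “warmer / colder” concrete, a toy step-level loop: a generator proposes candidate next steps, the supervisor scores each one, and you keep the warmest. `propose_steps` and `score_step` are stand-ins for whatever generator and process supervisor you actually have (an arithmetic checker, a test runner, a learned process reward model, etc.):

    ```python
    from typing import Callable, List

    def guided_solve(problem: str,
                     propose_steps: Callable[[str, int], List[str]],
                     score_step: Callable[[str, str], float],
                     n_candidates: int = 8,
                     max_steps: int = 10) -> str:
        """Greedy step-level search driven by a process supervisor ("warmer / colder")."""
        solution = ""
        for _ in range(max_steps):
            candidates = propose_steps(solution, n_candidates)  # e.g., sampled LLM continuations
            if not candidates:
                break
            # Keep whichever candidate step the supervisor says is warmest.
            best = max(candidates, key=lambda step: score_step(problem, solution + step))
            solution += best
        return solution
    ```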

    I’m curious what models you all expect to come online to supervise LLMs. Arithmetic has already been reported. Code, too.

    What else?


  • Uh, it’s ~great if you have a model of a problem domain that can solve that kind of problem, and you want an LLM to talk about the solution.

    If you read the paper, you’ll see they cut the LLM calls way down by calling a domain-specific model to do the actual problem solving. They have an ablation where they let the LLM do the very last step of a multi-step problem and performance plummets.

    I think the presentation is a little bit deceptive. The MCTS is not really helping the LLM work through the problem; the LLM is essentially just talking about the solution found by the other model.
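
    To make the division of labor concrete, a toy version (SymPy standing in for the domain-specific solver, the LLM call left as a comment since any chat model would do, and the prompt wording made up):

    ```python
    import sympy as sp

    # The domain-specific model does the actual problem solving...
    x = sp.symbols("x")
    roots = sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x)  # [2, 3]

    # ...and the LLM only has to narrate the solution it was handed.
    prompt = (
        "Explain, step by step, why the solutions to x^2 - 5x + 6 = 0 "
        f"are {roots}."
    )
    # explanation = llm(prompt)  # the LLM never has to find the roots itself
    ```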




  • It’s better than that, imo, when you look at it in context.

    Particularly in light of Intel’s finding the other day that DPO works well (probably better) without preference data.

    “Alignment” methods are getting simpler, easier, and more effective.

    RLHF was a huge pain, because there were a ton of hyperparameters to tweak, and it’s expensive to get human data.

    Constitutional AI (RLAIF) dealt with some of the cost and difficulty by using AI preference data, but still left the need to collect preference data, and all the hyperparameter tweaking, intact.

    DPO eliminated the superfluous reward model, simplifying things greatly, and making overfitting less pernicious.

    Intel got rid of preference data altogether.

    IPO claims to fix overfitting altogether, while simplifying further.
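
    Concretely, IPO’s change to the objective is small: instead of pushing the policy-vs-reference log-ratio margin through a log-sigmoid (which rewards making the margin arbitrarily large), it regresses that margin toward a fixed target of 1/(2β), which is where the overfitting fix comes from. A sketch, using the same log-probability inputs a DPO loss would take:

    ```python
    def ipo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Same policy-vs-reference margin DPO uses...
        margin = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        # ...but regressed toward a fixed target instead of pushed through a sigmoid,
        # so there is nothing to gain from driving the margin to infinity.
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    ```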

    I figure within a month, Axolotl will grow a flag that means “and also IPO this,” with no additional cognitive overhead or hyperparameter tuning required, and, yes, the water line for model quality is going to go up.





  • That’s awesome, and I could see it being pretty useful for synthetic data generation with more compute intensity.

    90s/t is serial decoding, right? I guess your CPU utilization is approaching zero. What happens when you push the batch size until you’re > 50% CPU utilization? (At some point it might make sense to dedicate a core to tokenization).

    The potential gains from speculative decoding here seem likely to be big, too, since you’d only be running the big model once every several tokens. I imagine sticking Mistral in VRAM, after fine-tuning with the same instruction tuning corpus as your Falcon (though there are fancier ways to do sketch model / big model alignment, too).
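
    If you want to try the speculative decoding part without writing the verification loop yourself, recent versions of Hugging Face transformers expose it as assisted generation via generate(assistant_model=...). One caveat: the vanilla implementation expects the draft model to share the big model’s tokenizer, so the placeholder below uses a Falcon-family draft; using Mistral as the draft, as floated above, is one of those fancier alignment cases. Checkpoint names and the CPU/VRAM split are illustrative, not a tested config:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    big_id, draft_id = "tiiuae/falcon-40b-instruct", "tiiuae/falcon-7b-instruct"

    tok = AutoTokenizer.from_pretrained(big_id)
    big = AutoModelForCausalLM.from_pretrained(big_id, torch_dtype=torch.bfloat16)  # stays on CPU
    draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16).to("cuda")  # sits in VRAM

    inputs = tok("Write three synthetic customer-support questions.", return_tensors="pt")
    # The draft proposes several tokens per round; the big model only runs to verify them.
    out = big.generate(**inputs, assistant_model=draft, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```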

    Total aside: I don’t know if you saw the sub-1-bit compression of mixture-of-experts models, but it might be up your alley. Fun if we ever get weights for a big mixture model (https://github.com/IST-DASLab/qmoe).


  • The model seems cool and all, but the paper is better.

    Intel eliminated the preference data from direct preference optimization. Preference data is expensive and collecting it is a hassle, so this is a big deal. Best of all, it looks like their no-preference DPO actually performs better.

    The trick is sampling rejects from a small model. Let’s say you have a dataset of GPT-4 completions. You mark those as good (“preferred”). You prompt Llama 2 13B and mark its responses as rejects.

    Tl;dr: this could boost the performance of nearly every model with a minimal increase in complexity (though obviously it’s non-zero compute).
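
    A sketch of the data-building step, assuming you already have a JSONL file of prompts with GPT-4 completions. The keys match the prompt/chosen/rejected format DPO trainers (e.g., TRL’s DPOTrainer) expect; file and checkpoint names are just examples:

    ```python
    import json
    from transformers import pipeline

    # Small model used only to produce the "rejected" side of each pair.
    reject_sampler = pipeline(
        "text-generation", model="meta-llama/Llama-2-13b-chat-hf", device_map="auto"
    )

    pairs = []
    with open("gpt4_completions.jsonl") as f:  # one {"prompt": ..., "completion": ...} per line
        for line in f:
            row = json.loads(line)
            rejected = reject_sampler(
                row["prompt"], max_new_tokens=512, do_sample=True, return_full_text=False
            )[0]["generated_text"]
            pairs.append({
                "prompt": row["prompt"],
                "chosen": row["completion"],  # GPT-4 output, marked preferred
                "rejected": rejected,         # small-model output, marked rejected
            })

    with open("preference_pairs.jsonl", "w") as out:
        for p in pairs:
            out.write(json.dumps(p) + "\n")
    ```

    From there it’s ordinary DPO: the pairs say, in effect, “be more like GPT-4 and less like a 13B.”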