Reuters is reporting that OpenAI achieved an advance with a technique called Q* (pronounced Q-Star).

So what is Q*?

I asked around the AI researcher campfire and…

It’s probably Q-learning combined with MCTS: a reinforcement learning algorithm built on Monte Carlo tree search.
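Nobody outside OpenAI has confirmed what Q* is, so what follows is only a generic illustration of the idea and not OpenAI's method. It is a minimal Monte Carlo tree search in which every node keeps a running-mean Q value of the returns that passed through it, and selection uses the UCT rule (mean value plus an exploration bonus). The toy task (build a 4-character digit string), `TARGET`, and `reward` are hypothetical stand-ins for a language model's continuations and a learned reward or value model.

```python
import math
import random

TOKENS = "0123456789"
LENGTH = 4
TARGET = "3141"  # hypothetical goal; stands in for a learned reward/value model


def reward(seq):
    """Toy terminal reward: fraction of positions that match TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / LENGTH


class Node:
    def __init__(self, prefix, parent=None):
        self.prefix = prefix
        self.parent = parent
        self.children = {}  # token -> Node
        self.visits = 0
        self.q = 0.0  # running mean of returns backed up through this node

    def is_terminal(self):
        return len(self.prefix) == LENGTH

    def untried(self):
        return [t for t in TOKENS if t not in self.children]


def uct_child(node, c):
    # UCT: pick the child maximizing Q plus an exploration bonus.
    return max(
        node.children.values(),
        key=lambda ch: ch.q + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(prefix, rng):
    # Complete the sequence uniformly at random.
    while len(prefix) < LENGTH:
        prefix += rng.choice(TOKENS)
    return prefix


def mcts(iterations, seed=0, c=1.0):
    rng = random.Random(seed)
    root = Node("")
    best_seq, best_ret = "", -1.0
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.is_terminal() and not node.untried():
            node = uct_child(node, c)
        # 2. Expansion: add one untried child.
        if not node.is_terminal():
            tok = rng.choice(node.untried())
            child = Node(node.prefix + tok, parent=node)
            node.children[tok] = child
            node = child
        # 3. Simulation: random rollout to a complete sequence, then score it.
        seq = rollout(node.prefix, rng)
        ret = reward(seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
        # 4. Backpropagation: update visit counts and incremental-mean Q values.
        while node is not None:
            node.visits += 1
            node.q += (ret - node.q) / node.visits
            node = node.parent
    return best_seq
```

Returning the best complete sequence seen is one simple readout; following the most-visited children from the root is the other common choice.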

That’s right in line with the strategy DeepMind has (vaguely) said it’s taking with Gemini.

Another corroborating data point: an early GPT-4 tester mentioned on a podcast that they are working on ways to trade inference compute for smarter output. MCTS is probably the most promising method in the literature for doing that.
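Best-of-N sampling against a scorer is the crudest way to see this trade, far simpler than MCTS but with the same knob: draw more samples, keep the top-scored one. In this sketch, `generate` and `score` are hypothetical. The sampler is right only 30% of the time and the scorer is noisy, yet in this toy setup the accuracy of the top-scored sample climbs well above that 30% base rate as N grows.

```python
import random


def generate(rng):
    """Hypothetical sampler: returns True (a correct answer) 30% of the time."""
    return rng.random() < 0.3


def score(is_correct, rng):
    """Hypothetical noisy scorer: correct answers score higher on average."""
    return (1.0 if is_correct else 0.0) + rng.gauss(0.0, 0.5)


def best_of_n(n, rng):
    # Spend n samples of inference compute, keep the one the scorer likes best.
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(c, rng))


def accuracy(n, trials=2000, seed=0):
    # Fraction of trials in which the top-scored sample is correct.
    rng = random.Random(seed)
    return sum(best_of_n(n, rng) for _ in range(trials)) / trials


for n in (1, 4, 16):
    print(n, accuracy(n))
```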

So how do we do it? Well, the closest thing I know of that’s available today is Weave, which lives in minihf, a concise, readable, Apache-licensed package for MCTS-based RL fine-tuning.

https://github.com/JD-P/minihf/blob/main/weave.py

I’ll update the post when I have more info, about Q-learning in particular and about how it differs from Weave.

  • georgejrjrjr (OP) · 1 year ago

    Edits aren’t working for me for some reason, so here’s my update:

    First, as I mentioned on Twitter but failed to address here, this is at least excellent PR. That may be all it is: basically a more sophisticated “AGI achieved internally” troll. I would suggest taking Q* discourse with all due salt.

    From context and the description, it looks like OpenAI published on the technique in question here: https://openai.com/research/improving-mathematical-reasoning-with-process-supervision

    The result is pretty unsurprising: given process supervision (i.e., feedback on each intermediate reasoning step from a suitably accurate model of the process, rather than only on the final answer), models perform better.
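    To make the outcome-vs-process distinction concrete, this sketch scores two short arithmetic chains for (3 + 4) * 5. Outcome supervision only checks the final number, so it accepts a chain whose first step is wrong; process supervision scores every step. As I understand the linked paper, a solution's process score is the probability that every step is correct, computed as the product of per-step probabilities. Here an exact arithmetic checker with made-up 0.99 / 0.01 confidences stands in for a learned process reward model; `step_prob` and the example chains are hypothetical.

```python
import math
import re

# Matches one "a op b = c" step, e.g. "3 + 4 = 7".
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def step_prob(step):
    """Toy process model: probability that a single step is correct.
    An exact checker stands in for a learned PRM; confidences are made up."""
    m = STEP.match(step)
    if m is None:
        return 0.5  # unparseable step: the toy model is unsure
    a, op, b, c = m.groups()
    return 0.99 if OPS[op](int(a), int(b)) == int(c) else 0.01


def process_score(steps):
    # Probability that every step is correct = product of per-step probabilities.
    return math.prod(step_prob(s) for s in steps)


def outcome_score(steps, expected):
    # Outcome supervision only looks at the final number.
    return 1.0 if steps[-1].split("=")[-1].strip() == str(expected) else 0.0


good = ["3 + 4 = 7", "7 * 5 = 35"]
lucky = ["3 + 4 = 8", "7 * 5 = 35"]  # wrong first step, right final answer

print(outcome_score(good, 35), outcome_score(lucky, 35))
print(process_score(good), process_score(lucky))
```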

    Well…yeah. It’s probably an impactful direction for AI as people find ways to build good process models, but it isn’t an especially novel finding, nor is it a reason to blow up a company. This updates me further in the direction of, “Q* discourse was a brilliant PR move to capitalize on the controversy and direct attention away from the board power struggle.”

    Which doesn’t mean it can’t also be a good intuition pump for the open source world. Every big lab seems to be thinking about model-based supervision; it would be a little silly if we weren’t. So, coming back to the original question:

    How might we use this?

    I think the question reduces to, “What means of supervision are available?”

    Once you have a supervisor to play “warmer / colder” with the model, the rest is trivial.

    I’m curious what models you all expect to come online to supervise LLMs. Arithmetic has already been reported. Code, too.

    What else?
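    Code is probably the easiest of these supervisors to sketch: run each sampled candidate against a test suite and use the pass rate as the warmer / colder signal. The `fib` task, its test cases, and the candidate strings (standing in for model samples) are hypothetical, and calling `exec` on untrusted model output would need a real sandbox.

```python
def pass_rate(src, fn_name, cases):
    """Warmer/colder signal for code: fraction of test cases a candidate passes.
    WARNING: exec runs arbitrary code; a real setup must sandbox this."""
    namespace = {}
    try:
        exec(src, namespace)
        fn = namespace[fn_name]
    except Exception:  # syntax errors, missing function, etc. score zero
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply counts as a failure
    return passed / len(cases)


CASES = [((0,), 0), ((1,), 1), ((10,), 55)]  # hypothetical tests for fib

candidates = [
    "def fib(n):\n    return n",  # wrong, but passes two of three tests
    "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    "def fib(n) return n",  # syntax error
]
scores = [pass_rate(c, "fib", CASES) for c in candidates]
best = candidates[scores.index(max(scores))]
print(scores)
```

    Here the pass rates come out to 2/3, 1.0 and 0.0, so the loop keeps the correct iterative candidate; the same score could just as well serve as the reward inside a tree search.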