Yeah, I've also noticed there are two levels at which tree search is being researched.
One is at the sequence / thought level, like tree of thoughts / chain of thought, where the model talks to itself to find the best solution. The other is at the decoding / token level, where the tree is used to search for the best next set of tokens. In principle you could combine the two and have nested trees.
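To make the nesting concrete, here's a toy Python sketch: the inner tree is token-level beam search, and the outer tree expands whole thoughts by calling it. `token_logprob` is just a random stand-in for real LLM logits (so repeated calls give diverse thoughts), and all the widths/depths are arbitrary:

```python
import heapq
import math
import random

VOCAB = ["a", "b", "c", "<eos>"]

def token_logprob(prefix, token):
    """Hypothetical stand-in for the LLM's next-token log-prob.
    Random here, so each decoding call explores differently."""
    return math.log(random.random() + 1e-9)

def decode_thought(prompt, beam_width=2, max_len=4):
    """Inner tree: token-level beam search producing one candidate thought."""
    beams = [(0.0, tuple(prompt))]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1:] == ("<eos>",):
                candidates.append((score, seq))  # finished beams carry over
            else:
                for tok in VOCAB:
                    candidates.append((score + token_logprob(seq, tok), seq + (tok,)))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])

def search(prompt, branching=3, depth=2, keep=2):
    """Outer tree: tree-of-thoughts-style search where every node expansion
    runs the inner token-level beam search."""
    frontier = [(0.0, tuple(prompt))]
    for _ in range(depth):
        children = []
        for value, node in frontier:
            for _ in range(branching):
                step_score, seq = decode_thought(node)
                children.append((value + step_score, seq))
        frontier = heapq.nlargest(keep, children, key=lambda c: c[0])
    return max(frontier, key=lambda c: c[0])

print(search(["solve:"]))
```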
But yeah, I think the AlphaGo-style self-learning is what is really missing here. In principle, even without a tree, nothing stops us from putting an LLM in an environment where it gets positive feedback from rewards (like solving math problems) and just letting it rip.
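Here's a dumb-but-runnable sketch of that loop, with a tabular softmax "policy" standing in for the LLM. The single-digit-addition environment and the logit lookup table are obviously toys (a real setup would be a shared-parameter model), but the reward-weighted REINFORCE update is the same basic idea:

```python
import math
import random

random.seed(0)

# Toy environment: single-digit addition, reward 1 for a correct answer.
def sample_problem():
    a, b = random.randint(0, 4), random.randint(0, 4)
    return (a, b), a + b

# Toy policy: per-problem logits over answers 0..8 (an LLM in a real setup).
logits = {}

def get_logits(problem):
    return logits.setdefault(problem, [0.0] * 9)

def sample_answer(problem):
    ls = get_logits(problem)
    z = sum(math.exp(l) for l in ls)
    probs = [math.exp(l) / z for l in ls]
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a, probs
    return len(probs) - 1, probs

# REINFORCE: push up the log-prob of sampled answers, weighted by reward.
LR = 0.5
for step in range(2000):
    problem, target = sample_problem()
    answer, probs = sample_answer(problem)
    reward = 1.0 if answer == target else 0.0
    ls = get_logits(problem)
    for a in range(9):
        grad = (1.0 if a == answer else 0.0) - probs[a]  # d log pi / d logit
        ls[a] += LR * reward * grad

# After training, the policy should answer mostly correctly.
correct = sum(sample_answer(p)[0] == t
              for p, t in (sample_problem() for _ in range(200)))
print(f"accuracy: {correct / 200:.2f}")
```

The "let it rip" part is just this loop run at scale: sample attempts, check them against something verifiable, and nudge the policy toward whatever got rewarded.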
How does that work? Like, I'm really confused at a conceptual level about how you merge models with different numbers of layers of different sizes.