Can LLMs stack more layers than the largest ones currently have, or is depth bottlenecked? Is it because the gradients can't propagate properly back to the beginning of the network? Or because inference would be too slow?

If anyone could point me to a paper that talks about layer-stacking scaling, I would love to read it!
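For context on the gradient-propagation concern: backprop through a deep stack multiplies per-layer Jacobians, so if each layer shrinks the signal the gradient decays geometrically with depth, while residual connections change each factor to roughly (I + J), keeping it near 1. A toy sketch of that intuition (the function and scale values are hypothetical, purely for illustration):

```python
# Toy illustration of gradient flow through a deep stack.
# Backprop multiplies one Jacobian factor per layer; we model each
# factor as a single scalar `jac_scale` (hypothetical simplification).

def grad_norm(depth, jac_scale, residual):
    g = 1.0  # gradient magnitude flowing back from the loss
    for _ in range(depth):
        # Plain layer: factor is jac_scale.
        # Residual layer: factor is (1 + jac_scale), i.e. identity + layer.
        g *= (1.0 + jac_scale) if residual else jac_scale
    return g

plain = grad_norm(depth=100, jac_scale=0.1, residual=False)  # vanishes
res = grad_norm(depth=100, jac_scale=0.1, residual=True)     # survives

print(plain)  # astronomically small: 0.1 ** 100
print(res)    # well above 1: 1.1 ** 100
```

This is why very deep transformers rely on residual connections and normalization: without them the gradient signal reaching the first layers would be negligible, so depth alone is not the hard limit.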

  • cstein123 (OP) · 1 year ago

    Exactly the answer I was looking for, thank you!