Thinking about what people ask for in llama 3

vatsadev · 1 year ago

Thinking about what people ask for in llama 3

dogesator · 1 year ago

MoE

You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.

So many people have a 16 or 24GB GPU, or even 64GB+ MacBooks, that aren’t being fully utilized.

Sure, people can load a 30B Q5 model into their 24GB GPU, or a 70B Q5 model into the 48GB+ of memory in a MacBook, but the main reason we don’t is that it’s so much slower, because it takes so many more FLOPs per token…

People are definitely willing to sacrifice VRAM for speed, and that’s what MoE allows you to do.

You can have a 16-subnetwork MoE with 100B parameters loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 subnetworks activated (25B params) for any given token.

Done right, this would benchmark significantly higher than current 33B dense models and act much smarter than a 33B model, while also being around the same speed as a 33B model.
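
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes Q5 averages about 5.5 bits per weight (my assumption, the exact figure depends on the quant variant) and that the parameters are split evenly across the subnetworks, ignoring any shared layers. The function names are just for illustration.

```python
def quantized_size_gb(params_billions, bits_per_weight=5.5):
    """Approximate weight memory in GB for a quantized model.

    bits_per_weight=5.5 is an assumed average for a Q5-style quant.
    """
    return params_billions * bits_per_weight / 8


def active_params_billions(total_params_billions, num_experts, active_experts):
    """Parameters touched per token, assuming evenly sized subnetworks
    and no shared (always-on) layers. This is a simplification."""
    return total_params_billions * active_experts / num_experts


moe_total = 100.0  # 100B-parameter MoE from the example above
print(f"MoE weights at Q5: ~{quantized_size_gb(moe_total):.1f} GB")
print(f"Active per token:  {active_params_billions(moe_total, 16, 4):.0f}B params")
print(f"Dense 33B at Q5:   ~{quantized_size_gb(33.0):.1f} GB")
```

Under those assumptions the full MoE needs far more memory than a dense 33B model, but the per-token compute is that of roughly 25B parameters, which is why the speed should land in the same range.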

It’s all-around more smarts for the same speed, and the only downside is that it uses extra VRAM that you probably weren’t using before anyway.