Thinking about what people ask for in llama 3

vatsadev · 3 years ago

Thinking about what people ask for in llama 3

FPham · 3 years ago

Not making it 180B, which I wouldn’t be able to run, would be great for starters…

mcmoose1900 · 3 years ago

Massive ctx len

There is a happy middle ground between the current 4K context and 5000K context.

GPUs can handle ~32K-64K inference in the existing architecture just fine.

vatsadev · 3 years ago

Well, the 5 million was just an example of the OP stuff out there.

Jean-Porte · 3 years ago

Even 200m would be great (among others)

Monkey_1505 · 3 years ago

it takes up more vram than a dense model.

If you are using QLoRA, it’s not by much. The main issue is that you need another model to parse the prompt. But I could see this being useful sometimes. Maybe as an option though, rather than the default.

That’s useful, though it’s gonna be mixed with real data for model robustness.

I actually really don’t like synthetic data. It’s a great method for filtering large datasets, and perhaps augmenting them, but if you use purely synthetic data you are replicating inaccuracies and prose from the origin model that will only be exaggerated by the target model. I’d rather this was a quality control step, not a dataset producer.

Multimodality

I’m personally very eh about this. It has its uses, and I’ve used it. But LLM intelligence has a long way to go, and this could take focus away from that. Let that be a separate project IMO. I’m sure it has its uses and its fans, not knocking it. I just think open source is necessarily already behind proprietary models, and mixed focus could just make that worse.

Massive ctx len

Because of the accuracy issues involved, I’d rather they worked on smarter data retrieval like OpenAI has (it doesn’t really have the context sizes quoted, it grabs out the relevant bits). Generally speaking, for prompts, relevancy beats quantity.
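To make "relevancy beats quantity" concrete, here is a minimal sketch of picking only the most relevant chunks for a prompt instead of stuffing everything into context. It uses plain bag-of-words cosine similarity; that scoring choice, the function names, and the sample chunks are all illustrative assumptions, and this is not a description of how OpenAI or anyone else actually does retrieval.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query (hypothetical helper)."""
    q = bow(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)
    return ranked[:k]


# Made-up example data, purely for illustration.
chunks = [
    "The KV cache grows linearly with context length.",
    "Our office is closed on public holidays.",
    "Long context windows increase memory use during inference.",
]
selected = top_k_chunks("How does context length affect memory?", chunks, k=2)
print(selected)
```

Only the selected chunks would go into the prompt, so the model sees a short, on-topic context rather than the whole document.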

dogesator · 3 years ago

MoE

You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.

So many people have a 16 or 24GB GPU, or even 64GB+ MacBooks, that aren’t being fully utilized.

Sure, people can load a 30B Q5 model into their 24GB GPU or a 70B Q5 model into their 48GB+ of memory in a MacBook, but the main reason we don’t is because it’s so much slower, because it takes so many more FLOPs…

People are definitely willing to sacrifice VRAM for speed, and that’s what MoE allows you to do.

You can have a 16 sub-network MoE with 100B parameters loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 sub-networks activated (25B params) for any given token.

This would benchmark significantly higher than current 33B dense models when done right, and act much smarter than a 33B model while also being around the same speed as a 33B model.

It’s all-around more smarts for the same speed, and the only downside is that it’s just using the extra VRAM that you probably weren’t using before anyways.
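A rough back-of-the-envelope check of the numbers above, as a sketch only. It assumes Q5 means a flat 5 bits per weight, that all parameters are split evenly across the experts, and it ignores shared attention/embedding weights, KV cache, and quantization overhead; the function name is made up for illustration.

```python
def moe_footprint(total_params_b, n_experts, active_experts, bits_per_weight):
    """Estimate weight memory (GB) and active parameters (billions) per token.

    Simplifying assumptions: every parameter belongs to an expert, experts are
    equal in size, and quantization costs exactly bits_per_weight per weight.
    """
    weight_gb = total_params_b * bits_per_weight / 8  # billions of bytes
    active_params_b = total_params_b * active_experts / n_experts
    return weight_gb, active_params_b


# The MoE described above: 100B total, 16 sub-networks, 4 active, Q5.
moe_gb, moe_active_b = moe_footprint(100, 16, 4, 5)

# A 33B dense model at the same quantization, for comparison.
dense_gb, dense_active_b = moe_footprint(33, 1, 1, 5)

print(f"MoE:   {moe_gb:.1f} GB of weights, {moe_active_b:.0f}B params active per token")
print(f"Dense: {dense_gb:.1f} GB of weights, {dense_active_b:.0f}B params active per token")
```

Under these assumptions the MoE needs about 62.5 GB for weights (which fits in 96GB) versus roughly 20.6 GB for the dense 33B, while touching 25B parameters per token instead of 33B. That is the trade: more memory, similar or lower per-token compute.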