Thinking about what people ask for in llama 3

So I was looking at some of the things people ask for in llama 3, kinda judging them over whether they made sense or were feasible.

Mixture of Experts - Why? This literally is useless to us. MoE helps with Flops issues, it takes up more vram than a dense model. OpenAI makes it work, it isn’t naturally superior or better by default.

Synthetic Data - That’s useful, though its gonna be mixed with real data for model robustness. Though the real issue I see is here is collecting that many tokens. If they ripped anything near 10T for openai, they would be found out pretty quick. I could see them splitting the workload over multiple different accounts, also using Claude, calling multiple model AI’s (GPT-4, gpt-4-turbo), ripping data off third party services, and all the other data they’ve managed to collect.

More smaller models - A 1b and 3b would be nice. TinyLlama 1.1B is really capable for its size, and better models at the 1b and 3b scale would be really useful for web inference and mobile inference

More multilingual data - This is totally Nesc. I’ve seen RWKV world v5, and its trained on a lot of multilingual data. its 7b model is only half trained, and it already passes mistral 7b on multilingual benchmarks. They’re just using regular datasets like slimpajama, they havent even prepped the next dataset actually using multilingual data like CulturaX and Madlad.

Multimodality - This would be really useful, also probably a necessity if they want LLama 3 to “Match GPT-4”. The Llava work has proved that you can make image to text work out with llama. Fuyu Architecture has also simplified some things, considering you can just stuff modality embeddings into regular model and train it the same. it would be nice if you could use multiple modalities in, as meta already has experience in that with imagebind and anymal. It would be better than GPT 4 is it was multimodality in -> multimodal out

GQA, sliding windows - Useful, the +1% architecture changes, Meta might add them if they feel like it

Massive ctx len - If they Use RWKV, they may make any ctx len they can scale to, but they might do it for a regular transformer too, look at Magic.devs (not that messed up paper MAGIC!) ltm-1: https://magic.dev/blog/ltm-1, the model has a context len of 5,000,000.

Multi-epoch training, Dr. Vries scaling laws - StableLM 3b 4e 1t is still the best 3b base out there, and no other 3b bases have caught up to it so far. Most people attribute it to the Dr Vries scaling law, exponential data and compute, Meta might have really powerful models if they followed the pattern.

Function calling/ tool usage - If they made the models come with the ability to use some tools, and we instruction tuned to allow models to call any function through in context learning, that could be really OP.

Different Architecture - RWKV is good one to try, but if meta has something better, they may shift away from transformers to something else.