So Mistral-7b is a pretty impressive 7B param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?
It’s simply the time bonus of coming after all the big models:
- better filtering: kill outright junk
- you can lean on already-existing big models (OpenAI and Llama) for data tuning and filtering (see the sketch after this list)
- use available synthetic data
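To make the "use a big model for filtering" point concrete, here is a minimal sketch of a two-stage curation pass: cheap heuristics to kill outright junk, then an existing LLM as a judge to score what survives. This is purely illustrative and assumes the OpenAI Python client; we have no insight into Mistral's actual pipeline, and the judge model name, prompt, and thresholds below are my own placeholders.

```python
# Hypothetical data-curation sketch, NOT Mistral's (unpublished) pipeline.
# Stage 1: cheap heuristics to drop outright junk.
# Stage 2: an existing big model scores the remaining documents.

import re
from openai import OpenAI  # assumes the official `openai` package (>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def looks_like_junk(doc: str) -> bool:
    """Cheap filters: too short, mostly non-alphabetic, or highly repetitive."""
    if len(doc) < 200:
        return True
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    if alpha_ratio < 0.6:
        return True
    words = doc.split()
    if words and len(set(words)) < 0.3 * len(words):  # heavy repetition
        return True
    return False


def llm_quality_score(doc: str) -> int:
    """Ask a judge model to rate a document 1-5 for pretraining usefulness."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable judge model works
        messages=[
            {
                "role": "system",
                "content": "Rate the following text 1-5 for usefulness as LLM "
                           "pretraining data. Reply with a single digit only.",
            },
            {"role": "user", "content": doc[:4000]},  # truncate to control cost
        ],
    )
    digits = re.findall(r"[1-5]", resp.choices[0].message.content)
    return int(digits[0]) if digits else 1


def filter_corpus(docs: list[str], min_score: int = 4) -> list[str]:
    """Keep documents that pass both the heuristics and the LLM judge."""
    return [
        doc
        for doc in docs
        if not looks_like_junk(doc) and llm_quality_score(doc) >= min_score
    ]
```

In practice you would only send the heuristic survivors (or a sampled subset) to the judge, since scoring an entire pretraining corpus with an API model is the expensive part.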