Look at this: apart from Llama 1, all the other “base” models will likely answer “language” after “As an AI”. That means Meta, Mistral AI and 01-ai (the company that made Yi) likely trained their “base” models on GPT instruct datasets to inflate the benchmark scores and make it look like the “base” models had a lot of potential. We got duped hard on that one.
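If anyone wants to reproduce the probe themselves, here's a minimal sketch assuming a HuggingFace-style causal LM; the model name is just an example, swap in whichever base model you want to test.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example; use any base model you want to check
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "As an AI"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Logits for the token that would come right after the prompt
    next_token_logits = model(**inputs).logits[0, -1]

# Print the top 5 continuations; a base model contaminated with GPT-style
# instruct data tends to rank " language" at or near the top here.
top = torch.topk(next_token_logits.softmax(dim=-1), k=5)
for prob, tok_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([tok_id])!r}: {prob:.3f}")
```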
Shouldn’t the proof be in the pudding?
If Mistral 7B is better than most other 7B models, then they did something right, no?
I understand that the base model can then inherit some biases - but it’s on them that they didn’t clean those “As an AI…” answer strings from their dataset. So despite this, it performs better.