Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methods

Covid-Plannedemic_ · 2 years ago

Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methods
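
For context on why rephrasing slips past "traditional methods": those are usually some form of n-gram overlap check between training data and test items. Here is a minimal sketch of that idea, not the method from the linked post. The function names, the whitespace tokenizer, and the sample strings are all made up for illustration.

```python
def ngrams(text, n=3):
    """Return the set of word-level n-grams in text (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_text, test_text, n=3):
    """Fraction of the test item's n-grams that also appear in the training text."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


# Hypothetical benchmark item and two hypothetical training samples.
test_item = "What is the capital city of France and when was it founded"
verbatim_copy = "What is the capital city of France and when was it founded"
rephrased_copy = "Name France's capital and state the year of its founding"

verbatim_score = overlap_ratio(verbatim_copy, test_item)
rephrased_score = overlap_ratio(rephrased_copy, test_item)
print(verbatim_score, rephrased_score)
```

The verbatim copy shares every 3-gram with the test item, while the rephrased one shares none, even though it asks for the same information. A check like this would flag the first and pass the second.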

its_just_andy · 2 years ago

If you’re interested in running your own models for any reason, you really should build your own evaluation dataset for the scenarios you care about.

At this point, all the public benchmarks are such a mess. Do you really care if the model you select has the highest MMLU? Or do you care only that it’s the best-performing model for the scenarios you actually need?
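
A private eval set doesn't have to be fancy. Something like the sketch below is enough to start with. Everything in it is an assumption for illustration: the `evaluate` helper, the substring-match scoring, the example cases, and `stub_model`, which stands in for whatever function actually calls your model.

```python
from typing import Callable, List, Tuple

# Each case is (prompt, substring the answer is expected to contain).
EvalCase = Tuple[str, str]


def evaluate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected substring."""
    if not cases:
        return 0.0
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in model(prompt).lower()
    )
    return passed / len(cases)


# Hypothetical cases drawn from your own use case, never published anywhere.
my_cases: List[EvalCase] = [
    ("Convert 2 hours to minutes.", "120"),
    ("What HTTP status code means Not Found?", "404"),
    ("Which Python keyword defines a function?", "def"),
]


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; returns canned answers."""
    canned = {
        "Convert 2 hours to minutes.": "2 hours is 120 minutes.",
        "What HTTP status code means Not Found?": "That would be 404.",
    }
    return canned.get(prompt, "I am not sure.")


score = evaluate(stub_model, my_cases)
print(f"pass rate: {score:.2f}")
```

Swap `stub_model` for each candidate model and compare pass rates on the same cases. Substring matching is crude, so for open-ended tasks you'd want a stricter or task-specific scorer.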

ambient_temp_xeno · 2 years ago

To be fair, it’s pretty clear that OpenAI also updates its models with every kind of test people throw at them.

DreamGenX · 2 years ago

It’s inevitable people will game the system when it’s so easy, and the payoff can be huge. Not so long ago people could still get huge VC checks for showing off GitHub stars or benchmark numbers.

Monkey_1505 · 2 years ago

The problem isn’t the training data, it’s the benchmarks.

Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org