I noticed I never posted this before. While experimenting with various merges, after merging Phind v2, the Speechless finetune and WizardCoder-Python-34B at 33% each (averaged) and then adding the Airoboros PEFT on top, I consistently get:
Base
{'pass@1': 0.7926829268292683}
Base + Extra
{'pass@1': 0.7073170731707317}
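
For reference, the merge was basically a plain weighted average of the full weights, with the PEFT adapter merged on top afterwards. A minimal sketch of that kind of merge with transformers + peft (the repo IDs and the adapter path below are placeholders/assumptions, not necessarily what I used):

```python
# Rough sketch of a 33/33/33 weight-space average plus PEFT-on-top merge.
# Repo IDs and the adapter path are placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

BASES = [
    "Phind/Phind-CodeLlama-34B-v2",
    "uukuguy/speechless-codellama-34b-v2.0",   # Speechless finetune (assumed ID)
    "WizardLM/WizardCoder-Python-34B-V1.0",
]
WEIGHTS = [1 / 3, 1 / 3, 1 / 3]

# Load the first model and accumulate the weighted average of all state dicts.
merged = AutoModelForCausalLM.from_pretrained(BASES[0], torch_dtype=torch.float16)
avg = {k: v * WEIGHTS[0] for k, v in merged.state_dict().items()}

for repo, w in zip(BASES[1:], WEIGHTS[1:]):
    other = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)
    for k, v in other.state_dict().items():
        avg[k] += v * w
    del other

merged.load_state_dict(avg)

# Apply the Airoboros PEFT (LoRA) adapter on top and bake it into the weights.
merged = PeftModel.from_pretrained(merged, "path/to/airoboros-peft-adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("phind-speechless-wizard-airo-34b")
```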

Instruct prompt, greedy decoding, seed=1, 8-bit.
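
Roughly these generation settings, as a sketch (the prompt template and model path are illustrative; the real run went through an eval harness):

```python
# Sketch of the eval generation settings: instruct-style prompt,
# greedy decoding, seed=1, 8-bit weights. Prompt template is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(1)

tok = AutoTokenizer.from_pretrained("phind-speechless-wizard-airo-34b")
model = AutoModelForCausalLM.from_pretrained(
    "phind-speechless-wizard-airo-34b",
    load_in_8bit=True,        # 8-bit weights via bitsandbytes
    device_map="auto",
)

prompt = (
    "### Instruction:\nComplete the following Python function.\n\n"
    "### Response:\n"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy
print(tok.decode(out[0], skip_special_tokens=True))
```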
Phind and Wizard each score around 72%, Speechless around 75%, and Airoboros around 60%.

(That would have been SOTA back then; it is also the current score of Deepseek-33B.)

The model is otherwise rather broken - it has not passed any of my regular test questions. In my opinion, that would mean that by a lucky stroke I broke the model in a way that let some of its former training data resurface. Let me know what you think.

If someone is very interested I can push it to HF, but it's a waste of storage.