Looking for some good prompts to get an idea of just how smart a model is.

With constant new releases, it’s not always feasible to sit there and have a long conversation, although that is the route I generally prefer.

Thanks in advance.

    • kpodkanowiczB
      10 months ago

      Started asking it as well - seems to be very hard for 34B models to get it fully right.

  • naptasticB
    10 months ago

    It’s important that we not disclose all our test questions, or models will continue to overfit and underlearn. Now, to answer your question:

    When evaluating a code model, I look for questions with easy answers, then tweak them slightly to see if the model gives the easy answer or figures out that I need something else. I’ll give one example out of tens*:

    “Write a program that removes the first 1 KiB of a file.”

    Most of the models I’ve tested will give a correct answer to the wrong question: seek(1024) and truncate(). That removes everything after the first 1 KiB of the file.

    (*I’m being deliberately vague about how many questions I have for the same reason I don’t share them. Also it’s a moving target.)
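
    For reference, a minimal Python sketch of what the question is actually asking for: shift the remaining bytes toward the start of the file, then truncate the leftover tail. The helper name and chunk size here are just illustrative.

    ```python
    # Illustrative sketch: drop the FIRST n bytes of a file in place by
    # copying the tail forward in chunks, then truncating the duplicate end.
    def drop_head(path: str, n: int = 1024, chunk: int = 64 * 1024) -> None:
        with open(path, "r+b") as f:
            read_pos, write_pos = n, 0
            while True:
                f.seek(read_pos)
                block = f.read(chunk)
                if not block:           # reached end of file
                    break
                f.seek(write_pos)
                f.write(block)          # move this chunk toward the front
                read_pos += len(block)
                write_pos += len(block)
            f.truncate(write_pos)       # cut off the now-stale tail
    ```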

  • ntn8888B
    10 months ago

    I’ve used GPT-4 to help write articles for my blog. So I just picked some of the good articles it wrote (e.g. one on the Lutris game manager), prompted the model under test to write the same piece (~800 words), and compared the two. This has worked really well for me. Vicuna 33B was the best alternative I’ve found in my small creative-writing tests… although I can’t host it locally on my PC :/

  • AnomalyNexusB
    10 months ago

    More of an adjacent observation than an answer, but I was stunned by how many flagship models at a decent size/quant get this wrong.

    Grammar constrained to Yes/No:

    Is the earth flat? Answer with yes or no only. Do not provide any explanation or additional narrative.

    Especially with non-zero temp, the answer seems like a near coin toss. idk, maybe the training data is polluted by flat earthers lol
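
    For anyone who wants to reproduce this, here is a rough sketch of that kind of setup, assuming llama-cpp-python and its GBNF grammar support (LlamaGrammar.from_string); the model path and sampling settings are placeholders.

    ```python
    # Rough sketch (assumed llama-cpp-python API): constrain sampling so the
    # model can only emit "Yes" or "No", then ask the flat-earth question.
    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path="model.Q4_K_M.gguf")  # placeholder: any local GGUF model
    yes_no = LlamaGrammar.from_string('root ::= "Yes" | "No"')

    out = llm(
        "Is the earth flat? Answer with yes or no only. "
        "Do not provide any explanation or additional narrative.",
        grammar=yes_no,     # sampler restricted to tokens matching the grammar
        temperature=0.7,    # non-zero temp, as in the observation above
        max_tokens=4,
    )
    print(out["choices"][0]["text"])
    ```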