Looking for some good prompts to get an idea of just how smart a model is.

With constant new releases, it’s not always feasible to sit there and have a long conversation, although that is the route I generally prefer.

Thanks in advance.

    • kpodkanowiczB
      10 months ago

      Started asking it as well - seems to be very hard for 34B models to get it fully right.

  • naptasticB
    10 months ago

    It’s important that we not disclose all our test questions, or models will continue to overfit and underlearn. Now, to answer your question:

    When evaluating a code model, I look for questions with easy answers, then tweak them slightly to see if the model gives the easy answer or figures out that I need something else. I’ll give one example out of tens*:

    “Write a program that removes the first 1 KiB of a file.”

    Most of the models I’ve tested will give a correct answer to the wrong question: seek(1024) and truncate(). That removes everything after the first 1 KiB of the file.

    (*I’m being deliberately vague about how many questions I have for the same reason I don’t share them. Also it’s a moving target.)
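
    For reference, a minimal Python sketch of what the question is actually asking for: shift the remaining bytes toward the start of the file, then truncate the leftover tail. The helper name and chunk size here are just illustrative.

    ```python
    # Illustrative sketch: drop the FIRST n bytes of a file in place by
    # copying the tail forward in chunks, then truncating the duplicate end.
    def drop_head(path: str, n: int = 1024, chunk: int = 64 * 1024) -> None:
        with open(path, "r+b") as f:
            read_pos, write_pos = n, 0
            while True:
                f.seek(read_pos)
                block = f.read(chunk)
                if not block:           # reached end of file
                    break
                f.seek(write_pos)
                f.write(block)          # move this chunk toward the front
                read_pos += len(block)
                write_pos += len(block)
            f.truncate(write_pos)       # cut off the now-stale tail
    ```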

  • ntn8888B
    10 months ago

    I’ve used GPT-4 to help write articles for my blog. So I just picked some of the good articles it wrote (e.g. one on the Lutris game manager), prompted the model under test to write the same piece (~800 words), and compared the two. This has worked really well for me. Vicuna 33B was the best alternative I’ve found in my small creative-writing tests… although I can’t host it locally on my PC :/

  • AnomalyNexusB
    10 months ago

    More of an adjacent observation than an answer, but I was stunned by how many flagship models at a decent size/quant get this wrong.

    Grammar constrained to Yes/No:

    Is the earth flat? Answer with yes or no only. Do not provide any explanation or additional narrative.

    Especially with non-zero temp, the answer seems like a near coin toss. idk, maybe the training data is polluted by flat earthers lol
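
    For anyone who wants to reproduce this, here is a rough sketch of that kind of setup, assuming llama-cpp-python and its GBNF grammar support (LlamaGrammar.from_string); the model path and sampling settings are placeholders.

    ```python
    # Rough sketch (assumed llama-cpp-python API): constrain sampling so the
    # model can only emit "Yes" or "No", then ask the flat-earth question.
    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path="model.Q4_K_M.gguf")  # placeholder: any local GGUF model
    yes_no = LlamaGrammar.from_string('root ::= "Yes" | "No"')

    out = llm(
        "Is the earth flat? Answer with yes or no only. "
        "Do not provide any explanation or additional narrative.",
        grammar=yes_no,     # sampler restricted to tokens matching the grammar
        temperature=0.7,    # non-zero temp, as in the observation above
        max_tokens=4,
    )
    print(out["choices"][0]["text"])
    ```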