As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying out duds. However, I’ve noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff: whether the LLM acts as a coherent agent, follows instructions, and grasps context in any given situation. That is often lacking in the LLMs I’ve tried so far, including the board’s leader for 30B models, 01-ai/Yi-34B. I suspect something similar is going on to what used to happen with GPU benchmarks: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?
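For anyone who does want a repeatable "battery of questions," here is a minimal sketch of how such a harness could look. Everything in it is hypothetical: the prompts, the pass/fail checks, and the `ask_model` callable (a stand-in for whatever backend you use, e.g. a llama.cpp or OpenAI-compatible server) are all illustrative, not anyone's actual methodology.

```python
# Minimal sketch of a personal prompt-battery harness.
# `ask_model` is a hypothetical stand-in for your inference backend;
# swap in a real client (llama.cpp, an OpenAI-compatible API, etc.).

from typing import Callable

# Each case: (prompt, check). The check returns True when the reply
# shows the behaviour being probed (instruction following, precision).
BATTERY = [
    ("Answer with exactly one word: what is the capital of France?",
     lambda reply: reply.strip().rstrip(".").lower() == "paris"),
    ("Repeat the word 'apple' three times, separated by commas.",
     lambda reply: reply.lower().count("apple") == 3),
    ("Reply only with the digit 7.",
     lambda reply: reply.strip() == "7"),
]

def run_battery(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of battery cases the model passes."""
    passed = 0
    for prompt, check in BATTERY:
        reply = ask_model(prompt)
        if check(reply):
            passed += 1
    return passed / len(BATTERY)
```

The point is less the score itself than consistency: running the same fixed battery against every model you download makes the "coherent agent" comparison apples-to-apples, which public leaderboards can't guarantee.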

  • WolframRavenwolfB · 1 year ago

    I test and compare models in depth and am still hard at work on my 70B-120B evaluation. Take a look at one of my recent posts, where I explain my testing methodology in detail.

    • No-Belt7582B · 11 months ago

      You are famous everywhere for those comparisons.