As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying failures. However, I noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.
I mean really basic stuff of how the LLM acts as a coherent agent, can follow instructions and grasp context in any given situation. Which is often lacking in LLMs I am trying so far, like the boards leader for 30B models 01-ai/Yi-34B for example. I guess there is something similar going on like it used to with GPU benchmarks: dirty tricks and over-optimization for the tests.
I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?
When I look at the leaderboard, I mostly pay attention to TruthfulQA, as it seems most predictive of models which are good for my use-case. YMMV of course.
Once I’ve downloaded a model, I’ll fiddle around with different llama.cpp parameters and prompt templates, figuring out what works best for it, and then send it through my test framework, which has it infer five times on each of several prompts.
Evaluation of test results are fairly subjective, but there are some obvious problems which recur, like not inferring an answer, or inferring itself a new user prompt to answer.
I just finished a compare-and-contrast of Marx-3B vs Marx-3B-v3 using that test framework, which you can see (along with raw test results) here: https://old.reddit.com/r/LocalLLaMA/comments/17xsliz/marx_3b_v3_and_akins_3b_gguf_quantizations/ka2fd19/
I’ve been meaning to add some simple assessment logic to my test framework, which tries to guess at the quality of inferred replies, but haven’t made it a priority.
What are the top 3 best open source LLMs in your opinion?
I test and compare models in-depth, still hard at work on my 70B-120B evaluation. Take a look at one of my recent posts, where I explain my testing methodology in detail.
You are famous everywhere for those comparisons.
Technically, you can somewhat automate the testing process by creating a script that makes that model aswer a series of questions that are relevant to YOU and are unique (so cannot be gamed by training for benchmarks) and evaluate those yourself.
Make sure you experiment using different sampling methods and run several tests due to inherent randomness of output.