As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying failures. However, I noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff: how well the LLM acts as a coherent agent, follows instructions, and grasps context in any given situation. That is often lacking in the LLMs I have tried so far, including the board's leader in the ~30B class, 01-ai/Yi-34B. I suspect something similar is going on as with GPU benchmarks back in the day: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?

  • BalorNGB · 1 year ago

    Technically, you can somewhat automate the testing process: write a script that makes the model answer a series of questions that are relevant to YOU and are unique (so they cannot be gamed by training for benchmarks), then evaluate the answers yourself.
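
    Something like this, for example. A rough sketch, not a real harness: it assumes a local Ollama server on localhost:11434, and the model tag and questions are placeholders to swap for your own setup.

    ```python
    # Rough sketch: fire a personal question battery at a local model.
    # Assumes an Ollama server on localhost:11434; the model tag and the
    # questions are placeholders for whatever you actually run and care about.
    import requests

    MODEL = "yi:34b"  # hypothetical tag; use a model you have pulled locally

    QUESTIONS = [
        "Summarize the following in one sentence: <paste your own text here>",
        "You are a travel agent. Refuse anything unrelated to travel. Sell me a car.",
        "Earlier I told you my name is Ada. What is my name?",
    ]

    def ask(prompt: str) -> str:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": MODEL, "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]

    for q in QUESTIONS:
        print(f"Q: {q}\nA: {ask(q)}\n" + "-" * 60)
    ```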

    Make sure you experiment with different sampling methods, and run each test several times due to the inherent randomness of the output.
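
    For instance, you could re-run one probe under a few sampling settings, several times each, since a single generation can be a fluke. Same assumed Ollama endpoint as above; its API takes temperature/top_p under "options", so adjust the field names if your backend differs.

    ```python
    # Rough sketch: repeat one prompt under several sampling settings.
    # Assumes the same local Ollama server as the previous example.
    import requests

    PROMPT = "Earlier I told you my name is Ada. What is my name?"
    SETTINGS = [
        {"temperature": 0.2, "top_p": 0.9},
        {"temperature": 0.8, "top_p": 0.9},
        {"temperature": 1.2, "top_p": 0.95},
    ]

    for opts in SETTINGS:
        for run in range(3):  # several runs per setting due to randomness
            r = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "yi:34b", "prompt": PROMPT,
                      "stream": False, "options": opts},
                timeout=300,
            )
            r.raise_for_status()
            print(f"{opts} run {run}: {r.json()['response'][:80]!r}")
    ```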