Goliath-120B - quants and future plans

AlpinDale · 1 year ago


noeda · 1 year ago

I’ve done a bunch of D&D character sheets with this and yeah, I think it’s pretty good. (Still not sure if that’s just Euryale, though, which looks like it has been trained on that kind of data.)

I would love to see where Goliath ranks in the traditional benchmarks, Hellaswag, Winogrande etc. (has anyone run them yet?) Very curious if this model is strictly better than the two models it was made out of in a more rigorous test.

I’m really hoping it can be shown that the frankensteining method really does improve the smarts compared to the models it is made out of.

I’ve been using a Q6 gguf quant I made myself on day 1 and it works well: 1.22 tokens per second on pure CPU + DDR5 memory, using (I think) around 90GB of memory.
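As a rough sanity check on that memory figure: the size of quantized weights is approximately parameter count times bits per weight, divided by 8. This is a minimal sketch. The parameter count (about 118B) and the 6.5625 bits per weight for a Q6_K-style quant are assumptions, not numbers from this thread, and the KV cache and runtime buffers come on top of the weights.

```python
def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Assumed values, NOT taken from the thread:
ASSUMED_PARAMS = 118e9      # approximate parameter count of a "120B" merge
ASSUMED_Q6_BPW = 6.5625     # assumed effective bits per weight for a Q6_K-style quant

if __name__ == "__main__":
    size = quantized_weight_bytes(ASSUMED_PARAMS, ASSUMED_Q6_BPW)
    print(f"weights only: {to_gib(size):.1f} GiB")
```

Under those assumptions the estimate lands in the same ballpark as the "around 90GB" observed above, which suggests the figure is dominated by the weights themselves.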

FPham · 1 year ago

I suspect that it behaves sort of as if you had (fictitious) Xwin and Euryale adapters and applied them as catsum, which sums the rank (so 2x256 rank would become 512 rank!) but improves the response only a tiny bit.

But in this case we are summing the “virtual” rank of two 70b models. The model could be a smidgen smarter, but not by much, because huge chunks of the weights overlap. We are probably wasting 80b parameters :) that do not contribute.
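The adapter analogy can be made concrete with a toy example. Concatenating two low-rank adapters gives an update equal to the sum of the two, with rank at most the sum of the individual ranks; when the adapters overlap, the effective rank is lower. The sketch below uses tiny hypothetical rank-1 adapters in pure Python. It only illustrates the linear algebra and says nothing about how Goliath itself was merged.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(m):
    """Matrix rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def cat_adapters(b1, a1, b2, a2):
    """Concatenate adapters: B = [B1 | B2] (columns), A = [A1 ; A2] (rows)."""
    b = [r1 + r2 for r1, r2 in zip(b1, b2)]
    a = a1 + a2
    return b, a


# Hypothetical rank-1 adapters for a 3x3 weight matrix (delta = B @ A).
B1, A1 = [[1], [0], [0]], [[1, 2, 3]]
B2, A2 = [[0], [1], [0]], [[4, 5, 6]]

if __name__ == "__main__":
    b, a = cat_adapters(B1, A1, B2, A2)
    print("distinct adapters, rank:", rank(matmul(b, a)))
    b, a = cat_adapters(B1, A1, B1, A1)  # fully overlapping adapters
    print("overlapping adapters, rank:", rank(matmul(b, a)))
```

The second case is the point being made above: stacking two things that largely overlap adds parameters without adding much effective rank.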

A correct test has to be done between the Sum and both Xwin and Euryale to see the actual result. I’ve seen it many times with fine-tuning: I attributed the good response to the fine-tune, but when I A/B tested, it was mostly due to the prior model and the fine-tune really added only a tiny bit.

I’m honestly more interested in the opposite direction: making models smaller while maybe losing only a smidgen of knowledge.

noeda · 1 year ago

Just finished the Hellaswag trial runs. First, here’s a table from best to worst:

The euryale and xwin models are the ones used to Frankenstein together the Goliath model.

The Goliath .gguf was quantized by myself, as was the Yi model. The rest are downloaded from TheBloke.

Even though Goliath shows up as the top model, here is why I don’t think you should run off and tell everyone Goliath’s the best model ever:

The trials ran 400 random tests from the Hellaswag set, so there is a random element in the final score. When I plugged in the Goliath and Euryale results for 400 trials to compute the probability that Goliath is better at 0-shot Hellaswag than Euryale, I got 84% (97.83% vs. Xwin). 84% is good but I was hoping it would be more like 99%. In other words, it’s possible I got a better result for Goliath simply because it got lucky in the choice of which Hellaswag tests it was asked to complete.
This was the first time I ever tried running more rigorous tests on LLMs rather than eyeballing it so I may have made mistakes.
The numbers can’t be compared with the OpenLLM leaderboard (they use N-shot Hellaswag, I forgot what N was), and I noticed they also don’t line up with the numbers in the llama.cpp discussion. I expected them to differ from the OpenLLM leaderboard, but I can’t explain why they don’t match the llama.cpp discussion.
Hellaswag is just one benchmark. I looked at the examples inside the tests to see what they actually ask the models, and I think 0-shot testing is a bit brutal, maybe even a bit unfair, for these models. I thought the Yi model, for example, was supposed to be really good.
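One way to compute a "probability that model A is better than model B" from two 400-trial scores is sketched below. The exact method used for the numbers above is not stated, so this is an assumption: each model's accuracy gets a Beta posterior with a uniform prior, and the probability is estimated by Monte Carlo sampling. The counts in the example are hypothetical placeholders, not the actual results.

```python
import random


def prob_a_better(correct_a, correct_b, n_trials, n_samples=20_000, seed=0):
    """Monte Carlo estimate of P(acc_A > acc_B).

    Each accuracy gets an independent Beta(1 + correct, 1 + wrong) posterior,
    i.e. a uniform prior updated with the observed right/wrong counts.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        pa = rng.betavariate(1 + correct_a, 1 + n_trials - correct_a)
        pb = rng.betavariate(1 + correct_b, 1 + n_trials - correct_b)
        wins += pa > pb
    return wins / n_samples


if __name__ == "__main__":
    # Hypothetical counts out of 400, for illustration only.
    print(prob_a_better(correct_a=340, correct_b=330, n_trials=400))
```

This treats the two runs as independent. If both models were scored on the same 400 tasks, a paired comparison (counting only the tasks where the models disagree) would likely give a tighter answer.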

I would wait until people with more resources can run proper benchmarks on this. I don’t plan on updating these numbers myself.

BUT. It does look promising. I’m hoping more rigorous benchmarks will give some more confidence.

AlpinDale · 1 year ago

Makes sense the benchmark results would be surprisingly low for goliath. After playing around with it for a few days, I’ve noticed two glaring issues:

it tends to make slight spelling mistakes
it hallucinates words

They happen rarely, but frequently enough to throw off benchmarks. I’m quite confident this can be solved by a quick full finetune over 100 or so steps, which would align the layers to work together better.

noeda · 1 year ago

Not sure if you misread, but it’s actually high, i.e. it’s better than the Xwin and Euryale models it’s made out of (in this particular quick test).

It beat all the 70B models I tested there, although the gap is not super high.

AlpinDale · 1 year ago

Yes well it should perform much higher than that. Turboderp ran MMLU at 3.25bpw and it was performing worse than other 70B models. I assume quantization further degrades the spelling consistency.

a_beautiful_rhind · 1 year ago

Surprise!.. xwin doing poorly among the 70b. It does badly when I test it vs chat logs too. It shows higher perplexity than yi-34b and a gaggle of other models, including base.
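For reference, perplexity over a text such as a chat log is the exponential of the negative mean log-probability the model assigns to each token, so lower means the model finds the text less surprising. A minimal sketch, with the per-token log-probabilities as hypothetical inputs that would normally come from the inference backend:

```python
import math


def perplexity(token_logprobs):
    """exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Hypothetical log-probs: every token predicted with probability 0.5.
    print(perplexity([math.log(0.5)] * 4))
```

Perplexity is only comparable between models that share a tokenizer, which is worth keeping in mind when comparing Llama-2-based 70b models against yi-34b.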