I say:
- It has a performance hit, but it remains to be seen if going with a much larger model can compensate for that.
- The model needs to be trained from scratch; apparently you cannot finetune an existing model for this…
I say:
I mean, you can jailbreak/browbeat ChatGPT/Claude into going against their guardrails relatively easily, so I smash “X” for doubt that Grok is going to be any different. If it is different, now THAT is going to be huge, if not in a way we’d like, I guess…
That explains why Goliath worked and yours - not so much, I guess…
“Prompt Template: Alpeca” Wut?
Looks like a scam, to be fair. I bet if you apply, you’ll get “Just send us $100 for access!”
Did you do post-merge retraining? Without at least some, the results are going to be poor…
Did you do post-merge training and how much?
10 s/tok and a couple kilowatts of power… OK, if it were as smart as Einstein and as unerring as an oracle it might make sense, but you can use it for free on Petals at 3 tok/sec and it is most certainly not…
Technically, you can somewhat automate the testing process by creating a script that makes the model answer a series of questions that are relevant to YOU and are unique (so they cannot be gamed by training for benchmarks), and then evaluate the answers yourself; see the sketch below.
Make sure you experiment with different sampling methods and run several tests due to the inherent randomness of the output.
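Something like this minimal sketch would do. It assumes a local OpenAI-compatible completion endpoint (e.g. a llama.cpp or text-generation-webui server); the URL, questions, and sampling settings are all placeholders to swap for your own:

```python
# Minimal personal-benchmark runner. Assumption: a local
# OpenAI-compatible completion endpoint; ENDPOINT, QUESTIONS and
# SAMPLING are placeholders -- substitute your own.
import json
import requests

ENDPOINT = "http://localhost:5000/v1/completions"  # assumed local server

QUESTIONS = [  # your own, unpublished questions -- can't be gamed
    "Summarize the plot of a story where ...",
    "What is wrong with this SQL query: ...",
]

# Several sampling settings, several runs each, because output is random.
SAMPLING = [
    {"temperature": 0.7, "top_p": 0.9},
    {"temperature": 1.0, "top_p": 0.95},
]
RUNS_PER_SETTING = 3

results = []
for question in QUESTIONS:
    for params in SAMPLING:
        for run in range(RUNS_PER_SETTING):
            resp = requests.post(ENDPOINT, json={
                "prompt": question,
                "max_tokens": 256,
                **params,
            }, timeout=300)
            answer = resp.json()["choices"][0]["text"]
            results.append({"question": question, "params": params,
                            "run": run, "answer": answer})

# Dump everything so you can grade the answers side by side later.
with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)
```

Dumping everything into one file lets you evaluate the answers side by side afterwards instead of eyeballing them as they stream in.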
Please, dear Tzeentch, have someone leak GPT-4 in the general confusion, I MUST know if this is really 10 7b models in a trench coat :)
My name is Mensch. Uber Mensch.
A quick question: does it imply that it has 160 layers as a result? Afaik, Falcon has 80 layers (like Llama), and original GPT3 had 96. “Stack more layers” ©
Sooo… is “stacking 1000 Phi 1.3b together” a recipe for AGI? :)
He MUST become a CEO of Uber, too! :))))
Yea, I’ve had my “honeymoon effect” with some new/large models like, say, Falcon and even Claude: they are inherently random, and that affects quality, too. I’ve had great outputs from Falcon, for instance (on Petals), but also long stretches of mediocre and some outright bad… and also sometimes really great and creative output from 7b Mistral, especially with enough prompt tinkering and with sampling set “just right”. Objective evaluation of LLMs is extremely hard and time-consuming!
Can we have some non-cherry-picked examples of writing?
It does not have to be highly nsfw/whatever, but a comparison of Goliath’s writing with output from its constituent models at the same settings and the same (well-crafted) prompts would be very interesting to see, preferably with at least 3 examples per model due to the inherent randomness of model output…
If you say the difference is “night and day”, it should be apparent… I’m not sceptical per se, but “writing quality” is highly subjective, and the model’s style may simply mesh better with your personal preferences? A blind comparison (see the sketch below) would help rule that out.
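For what it’s worth, here is a rough sketch of how such a blind comparison could be scripted. This is only an illustration under assumptions: the model names, endpoint URL, and prompt are placeholders, and it presumes an OpenAI-compatible completion API that routes on the “model” field:

```python
# Blind comparison sketch: same prompt, same sampling settings,
# several samples per model, shuffled so you rate each text without
# knowing which model wrote it. Model names, endpoint, and prompt
# are placeholders -- none of this is tied to a specific backend.
import random
import requests

MODELS = ["goliath-120b", "constituent-model-a", "constituent-model-b"]
PROMPT = "Write the opening scene of ..."  # your well-crafted prompt
SAMPLES_PER_MODEL = 3  # several runs, since output is inherently random

def get_completion(model: str, prompt: str) -> str:
    # Assumption: an OpenAI-compatible endpoint that routes on "model".
    resp = requests.post("http://localhost:5000/v1/completions", json={
        "model": model, "prompt": prompt,
        "max_tokens": 300, "temperature": 0.8, "top_p": 0.95,
    }, timeout=300)
    return resp.json()["choices"][0]["text"]

samples = [(m, get_completion(m, PROMPT))
           for m in MODELS for _ in range(SAMPLES_PER_MODEL)]
random.shuffle(samples)  # hide which model wrote what

ratings = []
for i, (model, text) in enumerate(samples, 1):
    print(f"--- Sample {i} ---\n{text}\n")
    ratings.append((model, int(input("Rate 1-5: "))))

# Reveal per-model averages only after all ratings are in.
for model in MODELS:
    scores = [score for m, score in ratings if m == model]
    print(model, sum(scores) / len(scores))
```

Shuffling the samples and revealing the per-model averages only at the end keeps personal bias toward a favorite model out of the ratings.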
There is no way it has “undiluted” 100k context. https://news.ycombinator.com/item?id=36374936
But yea, it IS impressive.
Given how good 7b Mistral is in my personal experience, it no longer seems implausible that a model just 3x its size could BE GPT3.5 Turbo.
EXTERMINATE!