Your settings are (probably) hurting your model - Why sampler settings matter

kindacognizant · 3 years ago

Your settings are (probably) hurting your model - Why sampler settings matter

Monkey_1505 · 3 years ago

I use Tail Free Sampling all the time, exclusively and I never touch anything else.

kindacognizant · 3 years ago

What frontends do you use?

Monkey_1505 · 3 years ago

Mostly silly tavern.

a_beautiful_rhind · 3 years ago

I used to use that “Shortwave” preset a lot but now with minP and dynamic temperature, it’s all the sampling I need.

I actually miss the latter in other implementations besides exllama, hopefully it also gets merged.

This combo is much better than mirostat for me. When I used it, miro would either make the model dry or drunk depending on how much it was turned up. These two have been set and forget.

Aaaaaaaaaeeeee · 3 years ago

“why is it constantly repeating”

Would you have some ideas why models throw repeating lines and segments in the first place? Are finetuned instruction models simply trained to respond with something similar to what the user said?

this would still be a big mystery to me unfortunately…

(I bring this up since I know very little about sampling optimizations, on a post dedicated to explaining the tech - Amazing ! :D)

In my experience, repetition in the output is an everyday occurrence with “greedy decoding”. This sampling, used in speculative decoding, generates unusable output 2-3x faster. With adjustments to temperature and repetition penalty, the speedup drops to 1.5x (exl2) or 1.3x (llama.cpp).

kindacognizant · 3 years ago

Large language models learn deep patterns. Most notably, they pick up on patterns that are not immediately obvious to humans reading the text they create. If the long-term context implies that the text tends to be repetitive or high-confidence in the abstract, because deterministic / greedy sampling is being used, the model will slowly drift towards that repetition over time. Eventually, it becomes so focused on this deeper pattern that it is unable to find a way out.

Aaaaaaaaaeeeee · 3 years ago

So the main goal of sampling optimization is to offset that drifting behavior (present in all LLMs?), breaking the repetition loops that normally form under the OG sampling (greedy decoding).

If we assume the reasoning abilities of a model depend on it not going into repetition loops, maybe this is why larger-parameter models are better: each sampling step has a larger, more diverse pool of tokens to choose from.

hibbity · 3 years ago

So uh, any quick how to use min_p with koboldcpp? I’m sold, you converted me. Tell me how to turn it on, preferably through the api so I can set up a good default in my custom front end.

It can’t be as easy as just {prompt: “text”, min_p: 0.1, }

is it?
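For reference, a minimal sketch of what such a request could look like from a custom front end. The endpoint path (/api/v1/generate), the default port 5001, and the field names (prompt, max_length, temperature, min_p) are assumptions based on the KoboldAI-style API and should be checked against the API docs of your KoboldCpp build; min_p only has an effect on builds that include the sampler. build_generate_request and generate are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt, min_p=0.1, base_url="http://localhost:5001"):
    """Build (but do not send) a POST request carrying a min_p setting.

    Endpoint path and field names are assumptions; verify them locally.
    """
    payload = {
        "prompt": prompt,
        "max_length": 200,    # assumed field: number of tokens to generate
        "temperature": 1.0,
        "min_p": min_p,       # assumed field name for the Min P sampler
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.load(resp)
```

If those assumptions hold, it really is about that easy: the sampler setting is just one more key in the JSON body.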

ambient_temp_xeno · 3 years ago

I assumed everyone was using minp apart from deterministic type testing.

For example I have temp of 4.59, rep pen and everything else off with minp of 0.05 and nous-capybara-34b.Q4_K_M.gguf is happily writing a little story, no problems at all.
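A rough sketch of why a setup like this can stay coherent, assuming the Min P cutoff is applied before temperature (sampler order differs between backends, so treat that order as an assumption). Min P keeps only tokens whose probability is at least min_p times the top token's probability; the high temperature then flattens only the survivors. The function names are illustrative, not taken from any library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_keep(probs, min_p):
    """Return indices of tokens with probability >= min_p * top probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def sample_min_p(logits, min_p=0.05, temperature=1.0, rng=random):
    """Filter with Min P on the unscaled distribution, then sample with temperature."""
    keep = min_p_keep(softmax(logits), min_p)
    weights = softmax([logits[i] for i in keep], temperature)
    return rng.choices(keep, weights=weights, k=1)[0]
```

With logits of [5.0, 4.0, 0.0, -3.0] and min_p of 0.05, only the first two tokens survive the cutoff, so even a temperature of 4.59 can only choose between those two.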

AbsorbingCrocodile · 3 years ago

Does this help with the output or the speed?

CardAnarchist · 3 years ago

Hi thanks a lot for this, I haven’t seen a good guide to these settings until now.

As someone who always runs Mistral 7B models, I have two questions:

For a general default for all mistral models would you recommend a Repetition Penalty setting of 1.20?
I run Mistral models at 8192 context. What should I set the Repetition Penalty Range at?

Thanks again for the great info and of course for making Min P!
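For context on what these two settings control, here is a minimal sketch of the common CTRL-style repetition penalty with a range window. It assumes the usual convention of dividing positive logits and multiplying negative logits by the penalty; treating a range of 0 as "whole context" is also an assumption, since frontends differ on this. apply_repetition_penalty is a hypothetical name.

```python
def apply_repetition_penalty(logits, context_ids, penalty=1.2, penalty_range=0):
    """Penalize tokens that appear in the last `penalty_range` context tokens.

    penalty_range <= 0 is treated here as "use the whole context" (assumption).
    """
    window = context_ids[-penalty_range:] if penalty_range > 0 else context_ids
    out = list(logits)
    for tok in set(window):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits
        else:
            out[tok] *= penalty   # push non-positive logits further down
    return out
```

The range only decides how far back the penalty looks: a short window penalizes recent repeats, while a window as long as the context penalizes every token seen so far.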

ProperShape5918 · 3 years ago

Needed to use a language model just to read this.

FPham · 3 years ago

Proof is in the pudding - blind tests, just like ooba did a while ago with the older samplings.

Language is way too complex to approach it from the math side and assert “this should work better”. In theory yes, but we need blind tests.

nsfw_throwitaway69 · 3 years ago

min P seems similar to tail free sampling. I think the difference is that TFS tries to identify the “tail” by computing the derivative of the token probability function.
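To make the comparison concrete, here is a simplified sketch of Tail Free Sampling as it is commonly implemented: sort the probabilities, take the absolute second differences of the sorted curve, normalize them, and cut the tail once their running sum exceeds z. Min P, by contrast, uses a single threshold relative to the top token. This is an illustration under those assumptions, not any particular backend's code, and tail_free_filter is a hypothetical name.

```python
def tail_free_filter(probs, z=0.95):
    """Zero out the 'tail' of a distribution and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_probs = [probs[i] for i in order]
    if len(sorted_probs) < 3:
        return list(probs)  # too few tokens for a second difference

    # First and (absolute) second differences of the sorted probabilities.
    d1 = [sorted_probs[i + 1] - sorted_probs[i] for i in range(len(sorted_probs) - 1)]
    d2 = [abs(d1[i + 1] - d1[i]) for i in range(len(d1) - 1)]
    total = sum(d2)
    if total == 0:
        return list(probs)  # flat curve, no identifiable tail

    # Keep tokens until the normalized second differences accumulate past z.
    keep_count = len(sorted_probs)
    running = 0.0
    for i, weight in enumerate(d2):
        running += weight / total
        if running > z:
            keep_count = i + 1
            break

    kept = set(order[:keep_count])
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
```

For a distribution like [0.1, 0.5, 0.02, 0.3, 0.05, 0.03] with z of 0.9, this keeps the three most likely tokens and drops the rest.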

Dead_Internet_Theory · 3 years ago

OP, this post is fantastic.

I wonder, is this a case of the community doing free R&D for OpenAI, or do they truly have a good reason for using naive sampling?

Also the graph comes from here, a bunch of other graphs there too.

kindacognizant · 3 years ago

I posted that GitHub issue. That original Top K vs Top P graph wasn’t made by me, and I can’t find the original source, but I made the Min P one and the others.