Proposed Alternative to Repetition Penalty - Noisy Sampling

kindacognizant · 1 year ago

WolframRavenwolf · 1 year ago

Imagine a language model that was tasked with trivial math problems, and a user always involved the number 3 in his first 5 questions. After a certain amount of context, it will bias against using the number 3 in the solution even if it is correct.

I used to think that, but one of the Transformers devs (Joao Gante from HF) told me that it is “only applied at most once per token” within the repetition penalty range, so it doesn’t matter how often the number 3 appears in the first 5 questions, as long as the repetition penalty is a “reasonable value (e.g. 1.2 or 1.3)”, it won’t have a negative impact on tokens the model is reasonably sure about. So for trivial math problems, and other such situations, repetition penalty is not a problem.

Same with other tokens like EOS, newlines, punctuation, etc. - if the repetition penalty would affect them negatively, we’d quickly see lots of problems. So it’s not preventing the output of tokens the model is sure about, it’s trying to prevent repetition in cases the token isn’t that predetermined.
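To make the "at most once per token" point concrete, here is a minimal sketch of that style of penalty. It is a simplified illustration, not the Transformers code itself: the function name, the plain-list logits, and the example values are all made up for this comment.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    """Penalize every token id seen in the context, once each (illustrative sketch)."""
    penalized = list(logits)
    # set() collapses repeats, so a token seen five times is penalized like one seen once.
    for token_id in set(previous_tokens):
        score = penalized[token_id]
        # Dividing a positive logit and multiplying a negative one both push it down.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Hypothetical 4-token vocabulary; token 1 plays the role of "3".
logits = [1.0, 6.0, -2.0, 0.5]
once = apply_repetition_penalty(logits, [1])
many = apply_repetition_penalty(logits, [1, 1, 1, 1, 1])
```

Here `once` and `many` come out identical, and token 1 is still the highest-scoring token after the penalty, which is the situation described above: a confident token survives a moderate penalty no matter how often it appeared.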

Just something non-obvious to keep in mind.

a_beautiful_rhind · 1 year ago

Hope you can do another patch for exllamav2, with tabbyAPI it kicks.

FPham · 1 year ago

On a somewhat similar note of adding noise during finetuning to help with generalization: if you are using oobabooga, you can look at Training PRO

https://github.com/FartyPants/Training_PRO

And then experiment with NEFtune noise scale.

It is a somewhat similar idea, but on the other end (finetuning); I assume you are talking about adding noise at inference, in the sampler. Worth pursuing for sure. The results, however, are unpredictable before trying it…

EvokerTCG · 1 year ago

Aside from repetition, isn’t this effectively a new sampling method? You could call it Fuzzed Greedy Sampling.
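A rough sketch of what "fuzzed greedy" could look like, assuming the noise is independent Gaussian noise added to each logit before taking the argmax. The noise type, the `noise_scale` parameter, and the function name are my assumptions for illustration; the actual test build may do something different.

```python
import random


def fuzzed_greedy(logits, noise_scale=1.0, rng=random):
    """Add Gaussian noise to every logit, then pick the highest one (illustrative sketch)."""
    noisy = [x + rng.gauss(0.0, noise_scale) for x in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)


rng = random.Random(0)  # seeded so the experiment is repeatable

# One clearly dominant token: the noise almost never changes the winner.
confident = [10.0, 0.0, 0.0]
confident_picks = {fuzzed_greedy(confident, 1.0, rng) for _ in range(200)}

# Two near-tied tokens: the noise decides between them, the bad token stays out.
close = [1.0, 0.9, -8.0]
close_picks = {fuzzed_greedy(close, 1.0, rng) for _ in range(200)}
```

With `noise_scale=0` this reduces to plain greedy decoding, so the noise scale works as the single randomness knob.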

nuvalab · 1 year ago

Thanks for writing this. It’s an interesting idea and very relevant to the issue I am trying to solve too: creative writing, which definitely hates repetition. I’m very interested to try out what you proposed once it’s available :)

One technical question about this approach: wouldn’t it change the original distribution of the training data / output, especially in cases where there is one obviously good next token to choose from? I can see the value when multiple next tokens are all considered great with close probabilities, but I’m curious how it would behave otherwise in terms of consistency and correctness.
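One way to reason about the confident case, as a back-of-the-envelope model and not a description of the actual implementation: assume independent Gaussian noise with the same standard deviation σ is added to each logit, and look only at the top two candidates, whose logits differ by Δ. The runner-up overtakes the leader when the noise difference exceeds Δ:

```latex
\[
P(\text{flip})
  = P(\varepsilon_2 - \varepsilon_1 > \Delta)
  = \Phi\!\left(-\frac{\Delta}{\sigma\sqrt{2}}\right),
\qquad \varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, \sigma^2)
\]
```

Here Φ is the standard normal CDF, and the √2 appears because the difference of two independent noise terms has variance 2σ². With σ = 1, a gap of Δ = 5 gives a flip probability of about 0.0002, while a gap of Δ = 0.1 gives about 0.47. Under these assumptions the output distribution does change, but a clearly dominant token is almost never displaced. A real vocabulary has many more than two candidates, so treat this as a lower-bound intuition only.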

CardAnarchist · 1 year ago

So do you think this approach is better than Dynatemp?

Or are you planning to put forward both modifications, leaving Dynatemp out of this Kobold build to better test just the noise modification?

kindacognizant · 1 year ago

DynaTemp is still available in the test build.

I’m not sure which method is superior yet; it needs more testing and opinions, but it looks promising because it scales well.

Noisy Sampling

- Context Free

- Scales with Confidence