Noisy Sampling
Temperature makes LLMs more or less deterministic by changing the scale at which the tokens are ‘scored’, and that certainly works. But there’s still an issue: greedy sampling (which *only* picks the most likely token at every step) will eventually degenerate into repetitive nonsense because of slight biases.
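As a rough illustration of the point about temperature and greedy sampling, here is a minimal sketch using only the standard library. The function name `softmax_with_temperature` and the example scores are made up for illustration; they are not taken from any particular backend.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw token scores into probabilities after dividing by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate tokens.
logits = [2.0, 1.9, 0.5]

sharp = softmax_with_temperature(logits, temperature=0.5)  # more deterministic
flat = softmax_with_temperature(logits, temperature=2.0)   # more random

# Greedy sampling ignores temperature entirely: it always takes the argmax,
# and dividing every score by the same positive number never changes the ranking.
greedy = max(range(len(logits)), key=logits.__getitem__)
```

Temperature changes how much probability the top token gets, but the ranking never moves, which is why greedy sampling keeps picking the same slightly favored token.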
Mistral 7b, for example, seems to be better than Llama 2 13b for a variety of tasks, but has a tendency to repeat itself significantly more often (especially in the context of greedy sampling).
The typical fix for this is the Repetition Penalty, which biases the model against repeating tokens it has already seen, but this has issues with ‘false positives’. Imagine a language model that was tasked with trivial math problems, and a user whose first 5 questions all involved the number 3. After a certain amount of context, it will bias against using the number 3 in the solution even if it is correct. This is obviously incorrect behavior.
One possible solution to this problem is to add a bit of controlled noise to the model’s scores to prevent it from slowly accumulating determinism bias. In the case where all the scores are roughly the same, this will allow for a lot of randomness (as you’d expect); in the case where the scores are extremely different (e.g. 3,000 for the top token and 500 for the second most likely), the noise would instead have a negligible effect, so the randomness isn’t applied uniformly.
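A minimal sketch of the idea, under stated assumptions: the post does not specify the noise distribution, so this uses zero-mean Gaussian noise with a fixed `noise_scale` (a hypothetical parameter), added to each score before taking the argmax. The function names and example scores are illustrative only, not the planned koboldcpp implementation.

```python
import random

def noisy_greedy_pick(logits, noise_scale=1.0, rng=None):
    """Add zero-mean Gaussian noise to every score, then pick the argmax.

    noise_scale is a hypothetical knob; the noise distribution is an assumption.
    """
    rng = rng or random.Random()
    noisy = [score + rng.gauss(0.0, noise_scale) for score in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)

def pick_frequencies(logits, trials=10_000, noise_scale=1.0, seed=0):
    """Estimate how often each token wins under noisy greedy picking."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(trials):
        counts[noisy_greedy_pick(logits, noise_scale, rng)] += 1
    return [c / trials for c in counts]

# Two near-tied tokens and one clearly worse token: the top two trade places often.
close = pick_frequencies([5.0, 4.9, -3.0])

# Scores far apart (as in the 3,000 vs 500 example): the noise never changes the winner.
far = pick_frequencies([3000.0, 500.0, 100.0])
```

With these made-up numbers, the two near-tied tokens each win roughly half the time, the low-scoring token essentially never wins, and the 3,000-score token wins every trial.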
I’ve realized that my Dynamic Temp sampler experiment… basically performs in a similar fashion, albeit indirectly, which is probably why people seem to like it in the first place.
When I made that, I was thinking, “why not make the model more random when there’s a high opportunity to be random”, but my DynaTemp still always assumes the original token rankings. Paradoxically, it may be more natural to just add random noise to the token scores to begin with. That way, in cases where the top two tokens are both close to 20%, for example, but the rest are 0.001%, it’ll randomly choose one of those two 20% tokens instead of just selecting the one with the slightly higher score (which is a statistically biased choice rather than a natural one).
I will be working on an implementation of this for koboldcpp soon, and then I will look into adding it to text-generation-webui (it’s more popular, but I’m more experienced with kobold’s codebase).
This method has two potential advantages:
- Context Free
Instead of analyzing past context like Repetition Penalty, it stands independently as a way to prevent individual tokens from creating biased generations in the first place rather than as a hacky solution that must factor in the past context before it makes a decision.
- Scales with Confidence
This should, in theory, apply randomness that scales proportionally to the model’s confidence. That means it will not give disproportionate weight to low-quality token choices (which will naturally have much lower scores and should, in theory, stay just as unlikely).
imagine a language model that was tasked to do trivial math problems, and a user always involved the number 3 in his first 5 questions. After a certain amount of context, it will bias against using the number 3 in the solution even if it is correct.
I used to think that, but one of the Transformers devs (Joao Gante from HF) told me that it is “only applied at most once per token” within the repetition penalty range. So it doesn’t matter how often the number 3 appears in the first 5 questions: as long as the repetition penalty is a “reasonable value (e.g. 1.2 or 1.3)”, it won’t have a negative impact on tokens the model is reasonably sure about. So for trivial math problems, and other such situations, repetition penalty is not a problem.
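To make the “at most once per token” point concrete, here is a simplified sketch modeled on the common rule of dividing positive scores by the penalty and multiplying negative ones. The function name `apply_repetition_penalty` and the toy scores are hypothetical, and the sketch ignores the penalty range window; it is not the library’s actual code.

```python
def apply_repetition_penalty(logits, context_token_ids, penalty=1.2):
    """Penalize each token that appears in the context, at most once per token."""
    penalized = list(logits)
    # set() collapses repeats, so five occurrences cost the same as one.
    for token_id in set(context_token_ids):
        score = penalized[token_id]
        # Positive scores shrink, negative scores are pushed further down.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

# Toy vocabulary of four tokens; token 1 plays the role of the number "3",
# which the model is very confident about.
logits = [2.0, 12.0, 3.0, -1.0]

once = apply_repetition_penalty(logits, [1])
many = apply_repetition_penalty(logits, [1, 1, 1, 1, 1, 3])

# Token 1 is still the top choice even though it appeared five times in context.
best = max(range(len(many)), key=many.__getitem__)
```

Here the confident token gets the same penalty whether it appeared once or five times, and it still wins by a wide margin at a penalty of 1.2.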
Same with other tokens like EOS, newlines, punctuation, etc. - if the repetition penalty would affect them negatively, we’d quickly see lots of problems. So it’s not preventing the output of tokens the model is sure about, it’s trying to prevent repetition in cases the token isn’t that predetermined.
Just something non-obvious to keep in mind.
Hope you can do another patch for exllamav2, with tabbyAPI it kicks.
On a somewhat similar note, about adding noise during finetuning to help with generalization: if you are using oobabooga, you can look at Training PRO
https://github.com/FartyPants/Training_PRO
And then experiment with NEFtune noise scale.
It is a somewhat similar idea, but on the other end (training); I assume you are talking about adding noise at inference time, in the sampler. Worth pursuing for sure. The results, however, are unpredictable before trying it…
Aside from repetition, isn’t this effectively a new sampling method? You could call it Fuzzed Greedy Sampling.
Thanks for writing this, it’s an interesting idea and very relevant to the issue that I am trying to solve too: creative writing, which definitely hates repetition. I’m very interested to try out what you proposed once it’s available :)
One technical question for this approach: wouldn’t it change the original distribution of the training data / output, especially in the case where there is one obviously good next token to choose from? I can see the value when multiple next tokens are all considered great with close probabilities, but I’m curious how it would behave otherwise in terms of consistency and correctness.
So do you think this approach is better than Dynatemp?
Or are you planning to put forward both modifications, leaving Dynatemp out of this Kobold build to better test just the noise modification?
DynaTemp is still available in the test build.
I’m not sure which method is superior yet; it needs more testing and opinions, but it looks promising because it scales well.