So I have an experimental build of Koboldcpp that allows for Dynamic Temperature sampling. Some people tell me that my dynamic temp has become a mainstay of their configurations. Now this poses an obvious question:
Why would you need a Dynamic Temperature?
- Raising the temperature to make a language model more varied and creative might not work as you’d expect, because higher temperatures disproportionately impact high confidence token generations. This is especially a problem for weaker language models that have less of an innate ability to ‘course correct’ when a ‘bad’ token is chosen, or for cases where ‘course correction’ just isn’t good enough (like in programming languages).
- As a consequence, higher temperature values (past ~1.2) are rather difficult to use if you want your language model to output coherent and creative generations. A specific example of how higher temperature can introduce difficulties is in the case of adhering to programming language syntax, as programming languages all have strictly defined rules. This can be an issue if you want an LLM to try a more ‘creative’ solution to a specific programming problem while still consistently adhering to the rules of the language; a static temperature, therefore, wouldn’t be the most effective way to scale the language model’s creativity.
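To make the first point concrete, here is a small Python sketch of what a static temperature does to a token the model is very sure about. The logits are made up for illustration (one strong candidate and 999 weak ones) and do not come from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits (hypothetical): one clearly "correct" token, such as a
# closing bracket in code, followed by 999 unlikely alternatives.
confident_logits = [10.0] + [0.0] * 999

for temp in (1.0, 1.5, 2.0):
    top_prob = softmax_with_temperature(confident_logits, temp)[0]
    print(f"temperature={temp}: top-token probability = {top_prob:.3f}")
```

With these toy numbers the top token goes from roughly 0.96 at temperature 1.0 to roughly 0.13 at temperature 2.0, so the "obviously right" token is now usually not the one that gets picked.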
For example, here’s how the Dynamic Temperature mapping looks, assuming you use the “HHI” dynamic temp method (which measures how concentrated the model’s probabilities are at any given point in time).
Red = Closer to maximum temperature, Grey = Closer to minimum temperature
The idea is, we turn temperature into a range, where only the highly randomizable tokens get mapped a high temperature, and a non-randomizable token stays near-deterministic.
This sounds great on paper. Except, there are 3 different versions of it that each measure a different metric in an attempt to create a better sampler, not just the HHI version. As they say, perfect is the enemy of good… Because of this, it’s hard to create a ‘standard’ that I can propose to any of the LLM hosting backends, and therefore Dynamic Temperature hasn’t been implemented anywhere people can use it beyond my test builds.
This, of course, has made it difficult for me to settle on the ‘best method’.
So! To determine the most effective method, I need the community’s help in testing and documenting the effects of this experimental sampler on various models. The lack of a standardized approach has hindered widespread implementation, so your feedback on the best method or even just the ‘best values’ for each method is crucial.
How to Test: I’ve provided a custom build of Koboldcpp for testing: Link to the experimental build. You can modify the values in the generated .txt file for quick testing. There are also overrides for different dynamic temperature sampling methods.
These overrides include:
- 1.84 Temp
This value switches to Entropy Sampling, which uses a power function and the SamplerTemp.txt file to control the values.
It measures the entropy (uncertainty) of the probability distribution before sampling. This means that if the model is highly confident in a particular token, it will use values closer to the minimum temperature. If it is highly uncertain, it will increase the temperature (to avoid repetition / determinism issues in a more natural fashion).
This is probably really difficult for this sub to understand but maybe it makes sense.
It has minTemp (minimum temperature), maxTemp (maximum temperature), and the exponent value (which controls how aggressively it scales the mapping of temperature.)
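Here is a rough Python sketch of how an entropy-to-temperature mapping like this could look. It is not the code from the build. It assumes the entropy is normalized by the largest possible entropy for the number of candidates, and that the power function is applied to that normalized value. The function name and the exponent default of 1.0 are made up for illustration.

```python
import math

def entropy_dynamic_temperature(probs, min_temp=0.0, max_temp=2.0, exponent=1.0):
    """Map the normalized Shannon entropy of `probs` into [min_temp, max_temp].

    Assumption: entropy is divided by log(N), the entropy of a uniform
    distribution over the N candidates, so the result lies between 0 and 1.
    """
    if len(probs) < 2:
        return min_temp  # a single candidate carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    normalized = min(max(entropy / max_entropy, 0.0), 1.0)
    # The exponent bends the curve: above 1.0 keeps the temperature low
    # for longer, below 1.0 raises it sooner.
    return min_temp + (max_temp - min_temp) * normalized ** exponent

confident = [0.97, 0.01, 0.01, 0.01]   # toy distribution, nearly deterministic
uncertain = [0.40, 0.30, 0.20, 0.10]   # toy distribution, fairly spread out
print(entropy_dynamic_temperature(confident))
print(entropy_dynamic_temperature(uncertain))
```

The confident distribution lands near minTemp and the spread-out one lands much closer to maxTemp, which is the behaviour described above.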
UNIQUE OBSERVATIONS ABOUT THIS SAMPLER:
- I’m able to turn off all truncation samplers (Min P, Top P, etc) and it still functions coherently within the default range of values (from 0.0 minTemp to 2.0 maxTemp).
- I’m guessing the reason why that happens is because it’s really difficult to achieve maximum entropy on a 32,000 token vocabulary model. However, you can turn up the maxTemp to even 5.0 and get some really weird but still pretty coherent results.
- 2.0 Temp
This form of DynaTemp is HHI Sampling, which uses a power function and the SamplerTemp.txt file to control the values. I misnamed this as Gini sampling before, but it is measuring HHI.
The ‘HHI’ value (the Herfindahl-Hirschman Index, i.e. the sum of the squared probabilities) measures how concentrated the probabilities are. If the distribution is highly concentrated on just one token, the temperature is reduced to a strong degree. If it is more spread out or evenly divided, the temperature is increased towards the maxTemp.
It has minTemp (minimum temperature), maxTemp (maximum temperature), and the exponent value (which controls how aggressively it scales).
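A rough Python sketch of an HHI-based mapping, under some assumptions: HHI is taken as the sum of squared probabilities, it is rescaled so that a uniform distribution gives 0 and a single dominant token gives 1, and the power function is applied to the "spread" (1 minus that value). The function name, the rescaling, and the defaults are illustrative guesses, not the code from the build.

```python
def hhi_dynamic_temperature(probs, min_temp=0.0, max_temp=2.0, exponent=1.0):
    """Map concentration (HHI) of `probs` into [min_temp, max_temp].

    Assumption: HHI is rescaled from its natural range [1/N, 1] to [0, 1].
    """
    n = len(probs)
    if n < 2:
        return min_temp  # one candidate is fully concentrated by definition
    hhi = sum(p * p for p in probs)  # 1/n when uniform, 1.0 when one token has all the mass
    normalized = (hhi - 1.0 / n) / (1.0 - 1.0 / n)
    normalized = min(max(normalized, 0.0), 1.0)
    spread = 1.0 - normalized  # high spread means a more randomizable token
    return min_temp + (max_temp - min_temp) * spread ** exponent

concentrated = [0.97, 0.01, 0.01, 0.01]  # toy distribution
spread_out = [0.40, 0.30, 0.20, 0.10]    # toy distribution
print(hhi_dynamic_temperature(concentrated))
print(hhi_dynamic_temperature(spread_out))
```

Because squaring shrinks small probabilities a lot, a long tail of very unlikely tokens adds almost nothing to the sum.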
UNIQUE OBSERVATIONS ABOUT THIS SAMPLER:
- The measurements of concentration (via the HHI measurement) seem pretty consistent with or without removing ‘bad tokens’ (e.g. via Min P, Top P, and other truncation samplers). This is unlike Entropy, which is sensitive to whether or not you have those truncation samplers on.
- For reference, here’s how the HHI (concentration) measurements look for a prompt that’s more deterministic vs. an open ended prompt:
- 1.91 Temp
Greedy Dynamic Temp (aka DynaTemp), the original implementation. This uses a sigmoid function and bases the temperature on the top token. I am not confident that this is useful or interesting compared to the HHI and Entropy versions of Dynamic Temp, as it does not measure the entire distribution; this was my first trial run, but you can test it if you want.
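For completeness, a rough Python sketch of what a top-token sigmoid mapping could look like. The post only says that a sigmoid of the top token is used, so the exact shape here is a guess: `steepness` and `midpoint` are hypothetical parameters, and the function name and defaults are made up for illustration.

```python
import math

def greedy_dynamic_temperature(probs, min_temp=0.0, max_temp=2.0,
                               steepness=10.0, midpoint=0.5):
    """Pick a temperature from the top token's probability alone.

    Assumption: a logistic curve centred on `midpoint` turns the top
    probability into a confidence score between 0 and 1.
    """
    top_prob = max(probs)
    confidence = 1.0 / (1.0 + math.exp(-steepness * (top_prob - midpoint)))
    # High confidence pulls the temperature down towards min_temp.
    return max_temp - (max_temp - min_temp) * confidence

print(greedy_dynamic_temperature([0.99, 0.01]))              # near min_temp
print(greedy_dynamic_temperature([0.25, 0.25, 0.25, 0.25]))  # near max_temp
```

Only `max(probs)` is read, so two distributions with the same top probability but very different tails get the same temperature. That is the limitation mentioned above.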
This is really interesting work!!! I’m doing research on Contrastive Decoding and have pretty good results so far. Reading this post, I realized it might also fix my issues with picking the right alpha.
I have a suggestion to make to OP and people reading this post - could we start collecting “go-to” questions that this community uses for testing? It would be easier to automate, publish all the outputs at once, and let people rank whether they like each output or not.
This way it will be much easier for small teams and individuals to make meaningful progress.