Hi, I have searched for a long time on this subreddit, in Ooba’s documentation, Mistral’s documentation and everything, but I just can’t find what I am looking for.

I see everyone claiming Mistral can handle up to a 32k context size. While it technically won’t refuse to generate above roughly 8k, the output is just not good. I have it loaded in Oobabooga’s text-generation-webui and am using the API through SillyTavern. I loaded the plain Mistral 7B just to check, but with my current 12k-token story, all it generates is gibberish if I give it the full context. I saw the same thing with other fine-tunes of Mistral.

What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is the 32k figure just the point at which it won’t crash, or am I doing something wrong that keeps me from getting coherent output above 8k? I did experiment with the alpha value, and while that does eliminate the gibberish, I get the impression the quality suffers somehow.

  • SomeOddCodeGuyB · 1 year ago

    I don’t believe messing with alpha values is a good idea, but I’ve never done it on this model. My Mistral 7B instance in chat mode had no trouble with a conversation extending past 9k tokens.

    This is the part that threw me off, and why I’m interested in the answers to this post.

    Normally, on a Llama 2 model for instance, I’d use alpha to extend the context past the regular cap. For example, on XWin 70b with a max seq length of 4096, I run it at 1.75 alpha and a rope base around 17000 to push the context to 6144.
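    The alpha-and-rope-base pairing above isn’t arbitrary: in exllama-style loaders the alpha value is typically converted to an effective RoPE base via the NTK-aware scaling formula. A minimal sketch, assuming the usual Llama-family head dimension of 128 and default base of 10000 (verify against your loader’s source):

```python
# NTK-aware "alpha" scaling: rather than compressing position indices,
# it raises the RoPE frequency base so the rotary embeddings stretch
# to cover a longer context.
def rope_base_from_alpha(alpha: float,
                         base: float = 10000.0,
                         head_dim: int = 128) -> float:
    """Effective RoPE base for a given alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))

# alpha = 1.75 lands right around the ~17000 base mentioned above
print(round(rope_base_from_alpha(1.75)))  # -> 17656
```

    This is why setting either alpha or rope base (but not both) is usually enough: one is just a reparameterization of the other.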

    Codellama is a little different. I don’t need to touch the alpha for it to use 100,000 tokens, but the rope base has to be 1,000,000. So it’s 1 alpha, 1,000,000 rope base, 1 compress == 100,000 tokens.

    But then there’s Mistral. Mistral loads up and is like “I can do 32,000 tokens!” with 1 alpha, 0 rope base, 1 compress. Yet the readme files on the models keep showing “4096” tokens. So I’ve been staring at it, scratching my head, unsure whether it can do 32k or 4k, whether it needs rope scaling, etc.
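    One way to resolve the 32k-vs-4096 confusion is to check the model’s config.json rather than the readme. For Mistral-7B-v0.1, the relevant fields look roughly like the excerpt below (values from memory, not fetched live — worth verifying against the actual file on Hugging Face):

```python
import json

# Illustrative excerpt of Mistral-7B-v0.1's config.json.
config = json.loads("""
{
  "max_position_embeddings": 32768,
  "sliding_window": 4096,
  "rope_theta": 10000.0
}
""")

# The positional range is 32k, but attention in v0.1 uses a
# 4096-token sliding window -- plausibly where both the "32k"
# and "4096" numbers come from.
print(config["max_position_embeddings"], config["sliding_window"])
```

    If that reading is right, both numbers are “true”: 32k is the trained positional range, while 4096 is the per-layer attention window.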

    I just keep loading it in 4096 until I have a chance to look it up lol