Maybe anecdotal but I have very high hopes for Yi 34b finetunes.

Herr_Drosselmeyer · 1 year ago

Maybe anecdotal but I have very high hopes for Yi 34b finetunes.

bullerwins · 1 year ago

Does the 200K mean that it has up to 200k context size? Is the context limited by the model or can you just set it to whatever a long as you have enough VRAM. Also, if a GGUF model for example takes 20GB vram for example. That’s with the “default” context size? Can it be less if you decrease the context or more if you increase it ?

Herr_Drosselmeyer · 1 year ago

The base Yi can handle 200k. The version I used can do 48k (though I only tested 16k so far). Larger context size requires more VRAM.

The size that TheBloke like gives for GGUF is the minimum size at 0 context. As context increases, VRAM use increases.

bullerwins · 1 year ago

thanks a lot! I was not sure about how context affected VRAM usage. So each model has a maximum context size and using more will take more vram, thanks!

mcmoose1900 · 1 year ago

Another thing to note is that the exllamav2 backend is “special” because its context takes up less vram than the context in other backends. So lets say the weights take 18GB, and your context takes up 6GB for a gguf model. In exllama thats only 3GB taken up by the context with the 8 bit cache.

There are other complications like the prompt processing batch size, but thats the jist of it.

This makes a dramatic difference when the context gets huge. I’d prefer to use koboldcpp myself, but I just can’t really squeeze it on my 3090 without excessive offloading.

bullerwins · 1 year ago

Interesting! I had more succeed for some reason with gguf models, as those work everywhere using koboldcpp and ooba’s. I didn’t know that exllamasv2 was better for context. I will try it. That backend is for EXL2 formats right? I had the impression it was better for speed, I didn’t know about the context takes up less vram