Well, not a total n00b, as I've been playing with LLMs for almost a year and a half now, but with local LLMs only since the summer. Although I have a lot of experience with local image generators, I thought I could apply some of that knowledge to setting up LLMs, but it doesn't seem to be that easy ;)
Any input that will shed some light on the problems I have will be greatly appreciated :)
Hardware:
Ryzen 9 3900X, 48GB RAM, RTX 4090
Oobabooga startup params:
--load-in-8bit --auto-devices --gpu-memory 23 --cpu-memory 42 --auto-launch --listen
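When testing GPTQ/EXL2 models I've also tried switching the loader straight from the command line, something like this (I'm not sure I have the flag names exactly right, so treat it as a sketch):
--loader exllamav2_hf --max_seq_len 8192 --alpha_value 2.5 --listen --auto-launch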
I'm still running into some issues, likely caused by improper loader settings.
I'm looking for tips on how to set them optimally. I use the oobabooga UI as it's the most comfortable for me and lets me test models before deploying them elsewhere (i.e. to company UIs; I'm working on a chatbot connected to a vector DB for local document storage, and I thought about using ooba as a backend for quickly loading models, setting parameters, and exposing them via its API). However, its documentation is vague, and I have a feeling the parameter names and so on aren't standardized either.
Which loader is optimal, ExLlamav2_HF or AutoGPTQ? The latter pretty much always gives me issues :( and with ExLlamav2, when I try to set a longer context length and adjust alpha_value or compress_pos_emb, it starts having trouble, especially with repeating numbers, e.g. it will say 190 instead of 1990 or 3137 instead of 31337 (but sometimes also with words, shortening them in a strange way). Is that expected behaviour?
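My rough understanding of those two settings, in Python terms (this is an assumption on my part, not something I've verified against the source):
head_dim = 128                      # typical head dimension for Llama-style models
alpha_value = 2.5
rope_base = 10000 * alpha_value ** (head_dim / (head_dim - 2))
# alpha_value raises the RoPE base (NTK-style scaling) to stretch the usable context,
# while compress_pos_emb instead divides the position ids linearly (e.g. 2 => 8k positions squeezed into 4k),
# which as far as I know only works well with models fine-tuned for that scaling.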
I would like to use a longer context length (4k or even 8k hardly cuts it), and I would also like the LLM to generate longer replies. It's not always necessary, but sometimes it's desired (e.g. for code generation). Usually instructing the model to "continue" helps, but longer answers would be nice.
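Right now I just raise max_new_tokens in the UI when I need longer output; over the API I assume the equivalent would look roughly like this (the port, endpoint and field names are guesses on my part, assuming the OpenAI-compatible API is enabled with --api):
import requests
resp = requests.post(
    "http://localhost:5000/v1/completions",
    json={"prompt": "Generate a long function...", "max_tokens": 2048, "temperature": 0.7},
)
print(resp.json()["choices"][0]["text"])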
BTW, is "max_position_embeddings" in the model's config the same as "max_seq_len" in the ExLlamav2 loader settings?
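For what it's worth, this is how I check what the model's config.json actually declares (the repo name below is just a placeholder):
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("TheBloke/some-model-GPTQ")  # placeholder repo name
print(cfg.max_position_embeddings)  # the context length declared in config.json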
Or maybe you can just point me to some more advanced tutorial discussing these things? All the stuff I find doesn't delve into them (just basic tutorials on how to run oobabooga or another UI, and they always use the default configs).
Thanks for the informative answer. I will take a look at GGUF models, although I'm not sure yet how to split them between CPU and GPU (I'll take a look at the llama.cpp parameters).
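From a quick look at the llama.cpp readme, the split seems to be controlled mainly by how many layers get offloaded to the GPU, something like this (the layer count and filename are just placeholders):
./main -m model.Q5_K_M.gguf -c 8192 -ngl 35 -t 12   # -ngl = layers offloaded to the GPU, -c = context size, -t = CPU threads
I believe the llama.cpp loader in oobabooga exposes the same thing as n-gpu-layers, but correct me if I'm wrong.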