Is this normal behavior?
I’m still learning, but I noticed that if I load a regular (unquantized) LLM like https://huggingface.co/teknium/OpenHermes-2-Mistral-7B, it takes all the VRAM available (I have a 3080 with 10GB).
But when I load a quantized model like https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF, it uses almost none of the VRAM, maybe 1GB?
Is this normal behavior?
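For scale, a rough back-of-the-envelope sketch of what the weights alone need (parameter count is approximate and the numbers are illustrative, not taken from either model card):

```python
# Approximate weight-only memory for a 7B model (illustrative estimate).
params = 7.24e9  # Mistral 7B has roughly 7.2B parameters

fp16_gb = params * 2 / 1024**3    # 2 bytes per weight  -> ~13.5 GB
q4_gb   = params * 0.5 / 1024**3  # ~4 bits per weight  -> ~3.4 GB (plus some overhead)

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB")
```

So the full-precision model can’t even fit in 10GB of VRAM, while a 4-bit GGUF comfortably can.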
For CPU-only runs the usage isn't visible because the model is mmap-loaded, which saves time during startup: the OS maps the file into memory instead of copying it, so the weights show up as file cache rather than process memory. To see the real usage, run with --no-mmap.
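A minimal sketch of the same idea, assuming llama-cpp-python as the loader (the model path is hypothetical); its use_mmap=False option corresponds to the --no-mmap CLI flag:

```python
from llama_cpp import Llama

# With mmap (the default) the OS lazily pages the file in, so the process's
# resident memory looks tiny right after load. Disabling mmap forces the whole
# model into RAM, so tools like `top` show the real footprint.
llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=0,   # CPU only, as in the original setup
    use_mmap=False,   # equivalent of llama.cpp's --no-mmap
)
```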
Update: I just saw that I had the GPU layers set to 0, so it was running entirely on the CPU then?
The slider goes from 0 to 128; how do I know how many to pick?
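If it helps, here is a hedged sketch, again assuming llama-cpp-python (model path hypothetical). Mistral 7B has 32 transformer blocks, and a Q4_K_M 7B file is around 4.4 GB, so it fits entirely in 10 GB of VRAM; the slider's 128 is presumably just a UI maximum meant to cover much larger models.

```python
from llama_cpp import Llama

# A ~4.4 GB 4-bit 7B model fits fully in 10 GB of VRAM, so offloading every
# layer is a reasonable starting point. n_gpu_layers=-1 means "all layers";
# if you run out of VRAM, lower it (or reduce n_ctx) until the load succeeds.
llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=4096,       # the context buffer also uses VRAM
)
print(llm("Q: Why is the sky blue? A:", max_tokens=64)["choices"][0]["text"])
```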