Is this normal behavior?
I'm still learning, but I've noticed that if I load a full, unquantized model like https://huggingface.co/teknium/OpenHermes-2-Mistral-7B, it takes up all the available VRAM (I have a 3080 with 10 GB).
But when I load a quantized model like https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF, it uses almost none of the VRAM, maybe 1 GB at most. Is this normal behavior?
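For context, I tried a rough back-of-the-envelope estimate of the unquantized weights (assuming fp16 storage and ignoring KV cache and other overhead, so this is just the weights):

```python
# Rough estimate of weight memory for the full-precision model.
# Assumptions: ~7.2B parameters stored as fp16 (2 bytes each);
# KV cache and activations add more on top of this.
params = 7.24e9
gib = params * 2 / 1024**3
print(f"~{gib:.1f} GiB of weights alone")  # ~13.5 GiB, well over 10 GB
```

If that's right, it would explain why the full model overflows a 10 GB card, while a 4-bit GGUF of the same model is closer to ~4 GB.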
Update: I just saw that I had the GPU layers slider set to 0, so it was running entirely on the CPU, then?
The slider goes from 0 to 128; how do I know what value to pick?
Screenshot of the GPU layers slider: https://preview.redd.it/snrkzjg43v1c1.png?width=1442&format=png&auto=webp&s=b356f72d5deaa5a49e19fbf3e91d0c22e2bc333b
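In case it helps anyone answer: my understanding is that the slider maps to llama.cpp's `n_gpu_layers` setting. Here's a minimal sketch of the same thing in code, assuming llama-cpp-python as the backend and a hypothetical filename for the download:

```python
# A minimal sketch, assuming llama-cpp-python is installed and the
# filename below matches your GGUF download (both are assumptions).
from llama_cpp import Llama

llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=32,   # how many layers to offload to the GPU; 0 = pure CPU
    # n_gpu_layers=-1 offloads every layer the model has, if it fits in VRAM
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

Mistral 7B only has 32 transformer layers, so I assume values past ~33 wouldn't change anything and the 128 is just the slider's UI maximum, but I'd appreciate confirmation.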