tl;dr: AutoAWQ seems to completely ignore the multi-GPU VRAM allocation sliders in text-generation-webui?!?
I’ve got a 3090 and added in the old 2070S for some temporary experimentation.
It's not particularly stable and a lot slower than the 3090 alone, but 32 GB opens up some higher-quant 34Bs.
llama.cpp mostly seems to run fine split across them.
Puzzled, though, by text-generation-webui's AutoAWQ loader. Regardless of what I do with the sliders it always runs out of memory on the 8 GB card. Even if I tell it to use only 1 GB on the 2070S, it still fills it until it OOMs. The maximums the sliders go to are the expected amounts (24 & 8), so I'm pretty sure I've got them the right way round…
Anybody know what’s wrong?
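For reference, I haven't dug into the webui code, but my understanding is those sliders end up as a per-GPU max_memory map that Accelerate uses when building the device map. A minimal sketch to test whether the split itself works outside the UI, loading the AWQ model directly through transformers (the model ID and memory limits below are just placeholders for whatever you're actually running):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID - substitute the AWQ quant you're actually loading.
model_id = "TheBloke/some-34B-AWQ"

# Roughly what the webui sliders should translate to: cap the 3090 (device 0)
# and the 2070S (device 1), leaving some headroom below the physical VRAM.
max_memory = {0: "22GiB", 1: "6GiB", "cpu": "16GiB"}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # let Accelerate place layers under the caps above
    max_memory=max_memory,
    torch_dtype=torch.float16,
)

# Check where the layers actually landed.
print(model.hf_device_map)
```

If the split respects the caps here but not through the UI, the problem is presumably in how the webui passes the slider values to the loader rather than in Accelerate itself.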
Accelerate isn't splitting the model right. Are you sure GPU 0 and 1 aren't flipped? I just moved some cards around, and the device numbers reported by CUDA_VISIBLE_DEVICES, nvtop, and llama.cpp don't really agree on which card is which.
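One thing that trips people up here: CUDA enumerates GPUs "fastest first" by default, while nvidia-smi/nvtop order them by PCI bus ID, so index 0 isn't necessarily the same card in both. A quick sanity check (just a sketch; the env var has to be set before anything initialises CUDA):

```python
import os

# Make CUDA enumerate devices in the same order as nvidia-smi/nvtop.
# Must be set before torch (or anything else CUDA) is imported.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch

# Print what each CUDA index actually resolves to on this machine.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```

If cuda:0 turns out to be the 2070S, then the 24/8 split in the webui is effectively applied to the wrong cards.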