tl;dr: AutoAWQ seems to completely ignore the multi-GPU VRAM allocation sliders in text-generation-webui?!?


I’ve got a 3090 and added in the old 2070S for some temporary experimentation.

It's not particularly stable and a lot slower than the 3090 alone, but 32 GB opens up some higher-quant 34Bs.

llama.cpp mostly seems to run fine split across them.

I'm puzzled, though, by text-generation-webui's AutoAWQ loader. Regardless of what I do with the sliders, it always runs out of memory on the 8 GB card. Even if I set the 2070S to only 1 GB, it still fills it until it OOMs. The sliders max out at the expected amounts (24 and 8), so I'm pretty sure I've got them the right way round…

Anybody know what’s wrong?
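
For reference, my understanding is that the sliders are just supposed to become a max_memory map handed to accelerate when the model is loaded. A minimal sketch of forcing the split by hand from Python (the repo name is a placeholder, and the limits are just what I'd expect the 24/8 sliders to roughly translate to):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder AWQ repo; swap in whatever quantized 34B you're actually loading.
model_id = "TheBloke/some-34B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# max_memory caps what accelerate is allowed to place on each GPU.
# Device indices follow PyTorch's ordering: 0 = 3090, 1 = 2070S (in theory).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "6GiB", "cpu": "32GiB"},
)

print(model.hf_device_map)  # shows which layers landed on which device
```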

  • a_beautiful_rhind · 1 year ago

    Accelerate isn't splitting the model right. Are you sure GPU 0 and 1 aren't flipped? I just moved some cards around, and the numbering from CUDA_VISIBLE_DEVICES, nvtop, and llama.cpp doesn't really match as to which card is which.
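
    One quick way to check whether the indices are flipped (a small sketch, assuming PyTorch is installed in the same environment): pin the ordering to the PCI bus and print what each index actually maps to.

    ```python
    # Run with the same environment text-generation-webui uses, e.g.:
    #   CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 python check_gpus.py
    # CUDA_DEVICE_ORDER=PCI_BUS_ID makes the numbering match nvidia-smi/nvtop
    # instead of CUDA's default fastest-card-first ordering.
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
    ```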