tl;dr: AutoAWQ seems to completely ignore the multi-GPU VRAM allocation sliders in text-generation-webui?!?
I’ve got a 3090 and added in the old 2070S for some temporary experimentation.
It's not particularly stable and a lot slower than the 3090 alone, but 32 GB opens up some higher-quant 34Bs.
llama.cpp mostly seems to run fine split across them.
Puzzled, though, by text-generation-webui's AutoAWQ loader. Regardless of what I do with the sliders it always runs out of memory on the 8 GB card. Even if I tell it to use only 1 GB on the 2070S, it still fills it until it OOMs. The maximums the sliders go to are the expected amounts (24 & 8), so I'm pretty sure I've got them the right way round…
Anybody know what’s wrong?
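For reference, I haven't dug into the webui code, but my understanding is those sliders end up as a per-GPU max_memory map that Accelerate uses when building the device map. A minimal sketch to test whether the split itself works outside the UI, loading the AWQ model directly through transformers (the model ID and memory limits below are just placeholders for whatever you're actually running):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID - substitute the AWQ quant you're actually loading.
model_id = "TheBloke/some-34B-AWQ"

# Roughly what the webui sliders should translate to: cap the 3090 (device 0)
# and the 2070S (device 1), leaving some headroom below the physical VRAM.
max_memory = {0: "22GiB", 1: "6GiB", "cpu": "16GiB"}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # let Accelerate place layers under the caps above
    max_memory=max_memory,
    torch_dtype=torch.float16,
)

# Check where the layers actually landed.
print(model.hf_device_map)
```

If the split respects the caps here but not through the UI, the problem is presumably in how the webui passes the slider values to the loader rather than in Accelerate itself.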
Accelerate isn't splitting the model right. Are you sure GPU 0 and 1 aren't flipped? I just moved some cards around, and the device numbers reported by CUDA_VISIBLE_DEVICES, nvtop, and llama.cpp don't really agree on which card is which.
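One thing that trips people up here: CUDA enumerates GPUs "fastest first" by default, while nvidia-smi/nvtop order them by PCI bus ID, so index 0 isn't necessarily the same card in both. A quick sanity check (just a sketch; the env var has to be set before anything initialises CUDA):

```python
import os

# Make CUDA enumerate devices in the same order as nvidia-smi/nvtop.
# Must be set before torch (or anything else CUDA) is imported.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch

# Print what each CUDA index actually resolves to on this machine.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```

If cuda:0 turns out to be the 2070S, then the 24/8 split in the webui is effectively applied to the wrong cards.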