LocalLLaMA@poweruser.forum · English · 1 year ago

Fitting 70B models in a 4GB GPU: the whole model, no quants or distillation or anything!


Found out about air_llm (https://github.com/lyogavin/Anima/tree/main/air_llm), which loads one layer at a time, so only one layer has to fit on the GPU at once: about 1.6GB per layer for a 70B model with 80 layers. There's about 30MB for the KV cache, and I'm not sure where the rest of the 4GB goes.
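For a rough sanity check of that 1.6GB figure, here's a minimal sketch. It assumes fp16 weights (2 bytes per parameter) and that the 70B parameters are split evenly across the 80 layers, ignoring embeddings and the output head; those assumptions are mine, not taken from the air_llm repo. The toy loop underneath only illustrates the one-layer-at-a-time idea using made-up scalar "layers" saved to disk. It is not air_llm's actual code, and `per_layer_gib`, `run_layer_by_layer`, and `demo` are hypothetical names.

```python
import os
import pickle
import tempfile


def per_layer_gib(n_params=70e9, n_layers=80, bytes_per_param=2):
    """Approximate size of one layer in GiB, assuming an even split (assumption)."""
    return n_params * bytes_per_param / n_layers / 2**30


def run_layer_by_layer(layer_paths, x):
    """Toy stand-in for layer-wise inference: one layer in memory at a time."""
    for path in layer_paths:
        with open(path, "rb") as f:
            layer = pickle.load(f)  # load only this layer's (toy) weights
        x = layer["scale"] * x + layer["bias"]  # apply the layer
        del layer  # drop it before the next layer is loaded
    return x


def demo():
    # Hypothetical 3-layer "model", stored as one file per layer.
    layers = [
        {"scale": 2.0, "bias": 1.0},
        {"scale": 0.5, "bias": 0.0},
        {"scale": 3.0, "bias": -1.0},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, layer in enumerate(layers):
            path = os.path.join(tmp, f"layer_{i}.pkl")
            with open(path, "wb") as f:
                pickle.dump(layer, f)
            paths.append(path)
        return run_layer_by_layer(paths, 1.0)


if __name__ == "__main__":
    print(f"{per_layer_gib():.2f} GiB per layer")
    print(demo())
```

Under those assumptions the per-layer size comes out to roughly 1.6 GiB, which lines up with the number above. Peak weight memory is one layer rather than the whole model, at the cost of re-reading every layer from disk on each pass.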

Works with HF out of the box too, apparently. The weaknesses appear to be context length, and it's going to be slow, but anyway, anyone want to try Goliath 120B unquantized?


fallingdowndizzyvrB
1 year ago
There's no point to it. If the model is too big to fit in RAM, disk I/O becomes the bottleneck, so it wouldn't matter whether you had 400GB/s of memory bandwidth or 40GB/s.
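To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes every weight has to be re-read from disk for each generated token (nothing cached in RAM), that disk reads and memory traffic overlap perfectly so the slower one sets the pace, a 140GB fp16 model, and a hypothetical 3GB/s SSD. None of those figures come from the post; `seconds_per_token` is just an illustrative helper.

```python
def seconds_per_token(model_bytes, disk_bytes_per_s, mem_bytes_per_s):
    """Bottleneck estimate: the slower of disk and memory sets the time per token."""
    disk_time = model_bytes / disk_bytes_per_s
    mem_time = model_bytes / mem_bytes_per_s
    return max(disk_time, mem_time)


if __name__ == "__main__":
    model_bytes = 140e9  # assumed: 70B params at 2 bytes each
    disk_bw = 3e9        # assumed: hypothetical 3 GB/s SSD
    for mem_bw in (400e9, 40e9):
        t = seconds_per_token(model_bytes, disk_bw, mem_bw)
        print(f"memory {mem_bw / 1e9:.0f} GB/s -> {t:.1f} s/token")
```

With these assumed numbers, both memory speeds give the same time per token, because the disk term dominates either way.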