multiverse_fan to LocalLLaMA@poweruser.forum • Question about GGUF, gpu offload and performance • 2 years ago

I have an older 6GB 1660 and get about 0.3 t/s on a q2 quant of Goliath 120B. I guess I'm just thinking that, comparatively, your setup with a 20B model should be faster than that, but I'm sure I'm missing something. With offloading, the CPU plays a role as well. How many cores ya got?

If I had the money, I'd go with the CPU.

Also, I'm not sure a 4090 could run 33B models at full precision. Wouldn't that require something like 70GB of VRAM?
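As a rough sanity check on those numbers, here's a minimal back-of-the-envelope sketch in plain Python. The bits-per-weight figures for the quant formats are approximations, and the 1.2x overhead factor (for KV cache, activations, and runtime buffers) is an assumption on my part, not a measured value:

```python
# Back-of-the-envelope memory estimate: parameters * bytes per parameter.
# Real usage is higher (KV cache, activations, runtime buffers), so the
# 1.2x overhead below is an assumed fudge factor, not a measured value.

BITS_PER_PARAM = {
    "fp32": 32.0,
    "fp16": 16.0,
    "q8_0": 8.5,  # GGUF quants store per-block scales, so a bit over N bits
    "q4_0": 4.5,
    "q2_k": 2.6,
}

def approx_gb(n_params_billion: float, fmt: str, overhead: float = 1.2) -> float:
    """Approximate GB needed to hold the weights in the given format."""
    weight_bytes = n_params_billion * 1e9 * BITS_PER_PARAM[fmt] / 8
    return weight_bytes * overhead / 1e9

for size_b, fmt in [(33, "fp16"), (120, "q2_k"), (20, "q4_0")]:
    print(f"{size_b}B @ {fmt}: ~{approx_gb(size_b, fmt):.0f} GB")

# 33B @ fp16 comes out around 79 GB (~66 GB for the weights alone), which is
# why full precision won't fit on a single 24 GB 4090. A q2 120B is ~47 GB,
# so with 6 GB of VRAM most of it sits in system RAM, hence the low t/s.
```

On the offload side, llama.cpp splits work by how many layers you put on the GPU (`-ngl` / `--gpu-layers`) and how many CPU threads handle the rest (`-t` / `--threads`), so core count matters for whatever doesn't fit in VRAM.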
multiverse_fan to LocalLLaMA@poweruser.forum • Goliath-120B - quants and future plans • 2 years ago

Goliath was created by merging layers of Xwin and Euryale. From their model card:
The layer ranges used are as follows:

- range 0, 16: Xwin
- range 8, 24: Euryale
- range 17, 32: Xwin
- range 25, 40: Euryale
- range 33, 48: Xwin
- range 41, 56: Euryale
- range 49, 64: Xwin
- range 57, 72: Euryale
- range 65, 80: Xwin
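If it helps to see the interleaving, here's a small sketch in plain Python that expands those ranges into a flat layer map. The ranges are the ones from the model card above; I'm assuming they're half-open, [start, end), which is how mergekit-style passthrough configs read them:

```python
# Expand the Goliath-120B passthrough ranges into a flat list showing
# which source model each output layer is copied from.
# Assumes half-open ranges [start, end), as in mergekit's layer_range.

RANGES = [
    (0, 16, "Xwin"), (8, 24, "Euryale"), (17, 32, "Xwin"),
    (25, 40, "Euryale"), (33, 48, "Xwin"), (41, 56, "Euryale"),
    (49, 64, "Xwin"), (57, 72, "Euryale"), (65, 80, "Xwin"),
]

layer_map = [
    (src, layer)
    for start, end, src in RANGES
    for layer in range(start, end)
]

print(len(layer_map))    # 137 output layers, vs 80 in each 70B donor
print(layer_map[:3])     # [('Xwin', 0), ('Xwin', 1), ('Xwin', 2)]
print(layer_map[16:18])  # seam: jumps back to Euryale layers 8 and 9
```

So nothing is removed; overlapping ranges from the two 70B donors stack up to 137 layers, and 137/80 of 70B is roughly 120B parameters.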
I'm not sure how the model could be reduced to 70B unless it's through removing layers. Is that what "shearing" is? I don't understand what's being pruned in that; is it layers?
multiverse_fan to LocalLLaMA@poweruser.forum • For roleplay purposes, Goliath-120b is absolutely thrilling me • 2 years ago

Cool, sounds like a good model to download and keep for the future, when I can get access to better hardware.
TheBloke/MonadGPT-GGUF