I have an older 6GB 1660 and get like 0.3 t/s on a q2 quant of Goliath 120B. I'm just thinking that, comparatively, your setup with a 20B model should be faster than that, but I'm sure I'm missing something. With offloading, the CPU plays a role as well. How many cores ya got?
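For reference, here's roughly what that GPU/CPU split looks like with llama-cpp-python. This is just a sketch; the model path, layer count, and thread count are placeholders you'd tune for your own hardware:

```python
# Sketch of partial GPU offload with llama-cpp-python; the path and
# numbers below are placeholders, not a recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="./goliath-120b.Q2_K.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers kept in VRAM; the rest run on the CPU
    n_threads=8,      # CPU threads handle the non-offloaded layers
)

out = llm("Q: What is 2+2? A:", max_tokens=8)
print(out["choices"][0]["text"])
```

More CPU threads generally helps when most layers stay on the CPU, which is why core count matters so much at these sizes.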
If I had the money, I'd go with the CPU.
Also, I’m not sure a 4090 could run 33B models at full precision. Wouldn’t that require something like 70GB of VRAM? At 16-bit, 33B parameters is already ~66GB for the weights alone.
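Here's the rough arithmetic I'm using; real usage adds KV cache, activations, and runtime overhead on top of these numbers:

```python
# Rough weights-only memory estimate: params x bytes per param.
# Ignores KV cache, activations, and runtime overhead.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params x 1 byte ~ 1 GB

for label, bpp in [("fp32", 4.0), ("fp16", 2.0), ("q2 (~0.35 B/param)", 0.35)]:
    print(f"33B @ {label}: ~{weights_gb(33, bpp):.0f} GB")
# fp32 ~ 132 GB, fp16 ~ 66 GB, q2 ~ 12 GB
```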
Goliath was created by merging layers of Xwin and Euryale (from their model card; there's a quick size sanity check after the list).
The layer ranges used are as follows:
- range 0, 16 Xwin
- range 8, 24 Euryale
- range 17, 32 Xwin
- range 25, 40 Euryale
- range 33, 48 Xwin
- range 41, 56 Euryale
- range 49, 64 Xwin
- range 57, 72 Euryale
- range 65, 80 Xwin
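If those ranges are half-open [start, end), which is how I read mergekit-style configs (that assumption is mine, not from the model card), the stack works out to 137 layers, and scaling Llama-2-70B's 80 layers accordingly lands right around 120B:

```python
# Layer math for the merge above, assuming half-open [start, end) ranges.
ranges = [(0, 16), (8, 24), (17, 32), (25, 40), (33, 48),
          (41, 56), (49, 64), (57, 72), (65, 80)]

total_layers = sum(end - start for start, end in ranges)
print(total_layers)  # 137

# Llama-2-70B has 80 layers, so scale its ~70B params by layer count:
print(f"~{70 * total_layers / 80:.0f}B params")  # ~120B
```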
I’m not sure how the model could be reduced to 70B unless it’s through removing layers. Is that what “shearing” is? I don’t understand what’s actually being pruned there. Is it whole layers?
Cool, sounds like a good model to download and keep around for the future, when I can get access to better hardware.
TheBloke/MonadGPT-GGUF