Question about GGUF, GPU offload and performance

Jokaiser2000 · 1 year ago

multiverse_fan · 1 year ago

I have an older 6 GB 1660 and get about 0.3 t/s on a Q2 quant of Goliath 120B. Comparatively, I’d expect your setup with a 20B model to be faster than that, but I’m sure I’m missing something. With offloading, the CPU plays a role as well. How many cores ya got?

Desm0nt · 1 year ago

By loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests) I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Sounds suspicious. I use Yi-Chat-34b-Q4_K_M on an old 1080 Ti (11 GB VRAM) with 20 layers offloaded and get around 2.5 t/s. But that is on a Threadripper 2920 with 4-channel RAM (also 3200). However, I don’t think it would make that much difference. Of course, with 4 channels I have twice your RAM bandwidth, but I run a 34B and load only 20 layers onto the GPU…
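
For a rough sanity check of the bandwidth argument, here is a back-of-envelope sketch in Python. It uses the common rule of thumb that generating each token has to stream all CPU-resident weights from RAM once, so RAM bandwidth divided by the size of the non-offloaded part gives a ceiling on t/s. The function names are made up for illustration. The model sizes (about 12 GB for a 20B Q4_K_M, about 20 GB for a 34B Q4_K_M), the 60 layers for the 34B, and dual-channel RAM on the other setup are all assumptions, not measured values. It also treats all layers as equal in size and ignores GPU time, KV cache and compute, so real numbers will be lower.

```python
def ddr_bandwidth_gbs(mt_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s.

    Each channel is 64 bits wide, i.e. 8 bytes per transfer.
    """
    return mt_per_s * bus_bytes * channels / 1000.0


def cpu_bound_tps(model_gb: float, total_layers: int, gpu_layers: int,
                  ram_gbs: float) -> float:
    """Ceiling on tokens/s from RAM bandwidth alone.

    Assumes equal-size layers and that every generated token reads
    all CPU-resident weights from RAM exactly once.
    """
    cpu_gb = model_gb * (total_layers - gpu_layers) / total_layers
    return ram_gbs / cpu_gb


# Assumed: dual-channel DDR-3200 for the 20B setup, quad-channel for the 34B one.
dual = ddr_bandwidth_gbs(3200, channels=2)
quad = ddr_bandwidth_gbs(3200, channels=4)

# Assumed file sizes in GB; 50/65 layers comes from the quoted post.
tps_20b = cpu_bound_tps(model_gb=12.0, total_layers=65, gpu_layers=50, ram_gbs=dual)
# Assumed 60 layers for the 34B model; 20 offloaded as in the post above.
tps_34b = cpu_bound_tps(model_gb=20.0, total_layers=60, gpu_layers=20, ram_gbs=quad)

print(f"20B, 50/65 on GPU, dual channel: ceiling ~{tps_20b:.1f} t/s")
print(f"34B, 20/60 on GPU, quad channel: ceiling ~{tps_34b:.1f} t/s")
```

Under these assumptions the 20B setup has the higher ceiling of the two, because far less of the model is left in RAM, so a measured 0.65 t/s looks like something other than memory bandwidth is holding it back.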