Hi. I’m currently running a 3060 12Gb | R7 2700X | 32gb 3200 | Windows 10 w/ latests nvidia drivers (vram>ram overflow disabled). By loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests) i currently get arround 0.65 t/s with a low context size of 500 or less, and about 0.45t/s nearing the max 4096 context.
Are these values what is expected of my setup? Or is there something i can do to improve speeds without changing the model?
Its pretty much unusable at this state, and since it’s hard to find information about this topic i figured i would try to ask here.
EDIT: running the model on the latest version of the text-generation-webui
I have an older 6GB 1660 and get like 0.3 t/s on a q2 quant of Goliath 120B. I guess I’m just thinking that comparatively your setup with a 20B model should be faster than that but I’m sure I’m missing something. I guess with offloading, the CPU plays a role as well. How many cores ya got?