I’m thinking of upgrading to 64GB of RAM so I can load larger models alongside my RTX 3090.
If I want to run tigerbot-70b-chat-v2.Q5_K_M.gguf
which has a max RAM usage of 51.61GB, and I load 23GB worth of layers into VRAM, that leaves 51.61 - 23 = 28.61GB to hold in system RAM. My operating system already uses up to 9.2GB of RAM, which means I need 37.81GB of RAM in total (hence 64GB).
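Here’s the same budget as a quick sanity check (just a sketch of my own numbers above; the 9.2GB OS figure is from my system and will vary):

```python
# Rough memory budget for partially offloading tigerbot-70b-chat-v2.Q5_K_M.gguf.
model_ram_max_gb = 51.61   # max RAM usage quoted for the Q5_K_M quant
vram_offload_gb = 23.0     # layers offloaded to the RTX 3090's 24GB VRAM
os_overhead_gb = 9.2       # what my OS already uses; varies per system

cpu_ram_needed_gb = model_ram_max_gb - vram_offload_gb    # 28.61
total_ram_needed_gb = cpu_ram_needed_gb + os_overhead_gb  # 37.81

print(f"model in system RAM: {cpu_ram_needed_gb:.2f} GB")
print(f"total RAM needed:    {total_ram_needed_gb:.2f} GB -> a 64GB kit fits")
```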
How many tokens/s can I expect with 23GB of the 51.61GB loaded in VRAM and the remaining 28.61GB in system RAM on an RTX 3090? I’m mostly curious about the Q5_K_M quant, but I’m still interested in other quants.
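For anyone answering: my own back-of-envelope is that token generation is memory-bandwidth bound, so per-token time is roughly (bytes read from VRAM / GPU bandwidth) + (bytes read from RAM / CPU memory bandwidth). A sketch under assumed numbers (~936GB/s for the 3090 per its spec sheet, ~50GB/s for dual-channel DDR4; the DDR4 figure especially is a guess, not a measurement):

```python
# Back-of-envelope tokens/s for split GPU/CPU inference, assuming generation
# is memory-bandwidth bound and every weight byte is read once per token.
vram_gb = 23.0      # portion of the model resident in VRAM
ram_gb = 28.61      # portion of the model resident in system RAM
gpu_bw_gbs = 936.0  # RTX 3090 memory bandwidth (spec sheet)
cpu_bw_gbs = 50.0   # assumed dual-channel DDR4 bandwidth; varies per system

seconds_per_token = vram_gb / gpu_bw_gbs + ram_gb / cpu_bw_gbs
print(f"~{1 / seconds_per_token:.1f} tokens/s upper bound")  # ~1.7 t/s
```

If that reasoning holds, the RAM-resident portion dominates completely, so the answer would mostly depend on memory bandwidth rather than the GPU. Happy to be corrected by anyone with real numbers.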