I have a server with 512gb RAM and 2x Intel Xeon 6154. It has spare 16x pcie 3.0 slot once I get rid of my current gpu.
I’d like to add a better gpu so I can generate paper summaries (the responses can take a few minutes to come back) that are significantly better than the quality I get now with 4bit Llama2 13b. Anyone know whats the minimum gpu I should be looking at with this setup to be able to upgrade to the 70b model?Will hybrid cpu+gpu inference with RTX 4090 24GB be enough?
Last I checked, 38t/s is minimum prompt processing speeds with zero layers offloaded on a 3090 for 70B q4_k_m
I’m sure its way higher now. When you offload layers, you can do more, but I think you have to have pre knowledge of the max length, so that your gpu doesnt OOM towards the end.
I think your supposed to adjust the prompt processing batch size settings also.
I highly recommend checking the nvidia PRs in llama.cpp for the prompt processing speeds, for differences between GPUs. If they have double or triple that will tell you something and you could calculate the amount of time for processing your text.
Used first generation (non-Ada) 48GB A6000 is an option. Kinda slow, but also the only card in its VRAM-density-per-dollar niche.