  • Last I checked, 38 t/s was the minimum prompt processing speed with zero layers offloaded on a 3090 for a 70B q4_k_m model.

    I'm sure it's way higher now. When you offload layers you can get more, but I think you have to know the max context length in advance so that your GPU doesn't OOM towards the end.

    I think you're also supposed to adjust the prompt processing batch size setting.

    I highly recommend checking the NVIDIA PRs in llama.cpp for prompt processing speeds and the differences between GPUs. If another GPU gets double or triple that number, that tells you something, and you can calculate how long it would take to process your text.
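    For the time calculation, here is a rough sketch in Python. It just divides the prompt length by the prompt processing speed. The 38 t/s baseline is the figure quoted above; the 4096-token prompt and the 2x/3x multipliers are made-up values for illustration, so plug in whatever numbers you actually find for your GPU.

```python
def prompt_processing_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to process a prompt at a given speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# Baseline quoted above: 3090, zero layers offloaded, 70B q4_k_m.
BASELINE_TPS = 38.0

# Hypothetical prompt length, for illustration only.
PROMPT_TOKENS = 4096

# Compare the baseline against GPUs that are (hypothetically) 2x or 3x faster.
for multiplier in (1, 2, 3):
    speed = BASELINE_TPS * multiplier
    seconds = prompt_processing_seconds(PROMPT_TOKENS, speed)
    print(f"{speed:.0f} t/s -> {seconds:.1f} s for {PROMPT_TOKENS} tokens")
```

    This only covers prompt processing, not generation, and it assumes the speed stays constant over the whole prompt, which may not hold in practice.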