I am going to build an LLM server very soon, targeting 34B models (specifically phind-codellama-34b-v2 at 4-bit, in GGUF, GPTQ, or AWQ format).

I am stuck between these two setups:

  1. i5-12400 + DDR5-6000 CL30 + 4060 Ti 16GB (GGUF; split the workload between CPU and GPU; see the sketch below the list)
  2. 3090 (GPTQ/AWQ model fully loaded on the GPU)
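
For option 1, the CPU/GPU split is just llama.cpp's layer-offload setting. A minimal sketch using llama-cpp-python (the file name and n_gpu_layers value are placeholders I made up; on a 16GB card you would raise n_gpu_layers until the model no longer fits):

```python
from llama_cpp import Llama

# Placeholder path and layer count; tune n_gpu_layers to your VRAM.
llm = Llama(
    model_path="./phind-codellama-34b-v2.Q4_K_M.gguf",
    n_gpu_layers=30,   # layers kept on the 4060 Ti; the rest run on the CPU
    n_ctx=4096,
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```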

Not sure if the speed bump from the 3090 is worth the hefty price increase. Does anyone have benchmarks/data comparing these two setups?

BTW: Alder Lake CPUs run DDR5 in gear 2 (while AM5 runs DDR5 in gear 1). AFAIK gear 1 offers lower latency. Would this give AM5 a big advantage when it comes to LLMs?

  • mcmoose1900 · 1 year ago

    Here’s a 7B llama.cpp bench on a 3090 and 7800X3D, with DDR5-6000 CL28 RAM.

    All layers offloaded to GPU:

    Generation:5.94s (11.6ms/T), Total:5.95s (86.05T/s)

    And here is with just 2 of 35 layers offloaded to the CPU:

    Generation:7.59s (14.8ms/T), Total:7.75s (66.10T/s)

    As you can see, the moment you offload even a little to the CPU, performance takes a real hit; with more than a few layers on the CPU the hit becomes severe.
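
    A back-of-envelope, bandwidth-bound estimate shows why (my own rough numbers, not measurements): generation mostly streams the weights once per token, and dual-channel DDR5-6000 has roughly a tenth of a 3090's VRAM bandwidth, so every layer on the CPU costs disproportionately more time per token.

    ```python
    # Rough, assumed numbers: ~4 GB of Q4 weights for a 7B model,
    # ~936 GB/s VRAM bandwidth (3090), ~96 GB/s dual-channel DDR5-6000.
    # Ignores compute and overhead, so these are upper bounds.
    MODEL_GB = 4.0
    GPU_BW, CPU_BW = 936.0, 96.0   # GB/s
    LAYERS = 35

    def tokens_per_s(layers_on_cpu: int) -> float:
        gpu_gb = MODEL_GB * (LAYERS - layers_on_cpu) / LAYERS
        cpu_gb = MODEL_GB * layers_on_cpu / LAYERS
        # Per token, the GPU part and the CPU part run one after the other.
        return 1.0 / (gpu_gb / GPU_BW + cpu_gb / CPU_BW)

    for n in (0, 2, 10):
        print(f"{n:2d} layers on CPU -> ~{tokens_per_s(n):.0f} T/s upper bound")
    ```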

    Here is exllamav2 for reference, though the time also includes prompt processing, so it’s actually faster than indicated:

    3.91 seconds, 512 tokens, 130.83 tokens/second (includes prompt eval.)
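
    If you want to reproduce the exllamav2 number, here is a minimal sketch along the lines of the project's example scripts (class names and call signatures are from memory and may differ between versions; the model directory is a placeholder for a GPTQ/EXL2 quant):

    ```python
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/Phind-CodeLlama-34B-v2-GPTQ"  # placeholder path
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)      # the whole model stays in VRAM

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8

    print(generator.generate_simple("def reverse_string(s):", settings, 512))
    ```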

    • regunakyle (OP) · 1 year ago

      Thanks for your data! Can you do the test again with the phind codellama 34B model?