Hi all,

Just curious if anybody knows what kind of hardware/compute power is required to build a llama server that can serve multiple users at once.

Any discussion is welcome :)

  • a_beautiful_rhind · 10 months ago

    You would have to benchmark batched generation speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each of them gets per request.
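
    As a very rough sketch of that arithmetic (all numbers below are made-up placeholders, not real benchmark results), it's just measured throughput divided by concurrency:

    ```python
    # Turn a measured batched-decoding benchmark into a per-user estimate.
    # Plug in whatever your llama.cpp / exllamav2 benchmark actually reports;
    # the values here are hypothetical.

    batched_tokens_per_sec = 400.0   # total decode throughput at your batch size (measured)
    concurrent_users = 8             # simultaneous requests you expect to serve
    avg_response_tokens = 300        # typical length of one reply

    per_user_tps = batched_tokens_per_sec / concurrent_users
    seconds_per_reply = avg_response_tokens / per_user_tps

    print(f"~{per_user_tps:.1f} tok/s per user, "
          f"~{seconds_per_reply:.0f}s for a {avg_response_tokens}-token reply")
    ```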

    There are also other backends like MLC, TGI, or vLLM that are better suited to this, but they have much worse quantization support.

    The “minimum” is one GPU that can completely hold the model at the size and quant you are serving.
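
    As a back-of-the-envelope way to check that (assumed numbers only; real usage also depends on context length, KV cache, and backend overhead):

    ```python
    # Rough check that a quantized model fits on a single GPU.
    # The figures here are illustrative assumptions, not exact requirements.

    params_billions = 13       # e.g. a 13B model
    bits_per_weight = 4.5      # assumed effective bits per weight for a ~4-bit quant
    overhead_gb = 2.0          # rough allowance for KV cache, activations, runtime

    weights_gb = params_billions * bits_per_weight / 8   # 1e9 params * (bits/8) bytes ~= GB
    total_gb = weights_gb + overhead_gb

    print(f"~{weights_gb:.1f} GB of weights, ~{total_gb:.1f} GB total "
          f"-> pick a GPU with at least that much VRAM")
    ```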

    People serve lots of users through Kobold Horde using only single- and dual-GPU setups, so this isn’t something you’ll need tens of thousands for.