Hi all,

Just curious if anybody knows the hardware required to build a llama server that can serve multiple users at once.

Any discussion is welcome:)

  • Tiny_Arugula_5648 · 10 months ago

    Unless you’re doing this as a business, it’s going to be massively cost-prohibitive: hundreds of thousands of dollars of hardware. If it is a business, you’d better get talking to cloud vendors, because GPUs are an incredibly scarce resource right now.

  • seanpuppy · 10 months ago

    It depends a lot on the details, tbh. Do they all share one model? Do they each use a different LoRA? If it’s the latter, there’s some cool recent research on efficiently hosting many LoRAs on one machine.
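
    To make that concrete (the research is likely along the lines of S-LoRA / punica), here’s a minimal sketch of many LoRAs sharing one base model using vLLM’s multi-LoRA support; the model name, adapter names, and paths below are just placeholders:

    ```python
    # Hedged sketch: two requests hit the same base model but different LoRA adapters.
    # The base weights are loaded once; each adapter is tiny by comparison.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Hypothetical adapters; swap in whatever you have actually trained.
    support_lora = LoRARequest("support_adapter", 1, "/adapters/support")
    sql_lora = LoRARequest("sql_adapter", 2, "/adapters/sql")

    out_a = llm.generate(["Summarize this support ticket: ..."], params,
                         lora_request=support_lora)
    out_b = llm.generate(["Write a SQL query that ..."], params,
                         lora_request=sql_lora)

    print(out_a[0].outputs[0].text)
    print(out_b[0].outputs[0].text)
    ```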

    • Appropriate-Tax-9585 (OP) · 10 months ago

      At the moment I’m just trying to grasp the basics, like what kind of GPUs I will need and how many. This is mostly for comparison against SaaS options; in reality I only need to set up a server for testing with just a few users. I’m going to research it myself, but I like this community and want to hear others’ views on the case, since I imagine many here have tried running their own servers :)

  • a_beautiful_rhind · 10 months ago

    You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each of them gets per request.
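
    To put rough numbers on that, the division really is that simple (the figures below are placeholders; plug in whatever your own benchmark reports):

    ```python
    # Toy math for "batched throughput divided by concurrent users". Numbers are made up.
    aggregate_tps = 400.0      # total generated tokens/sec at batch size 8, from a benchmark run
    concurrent_users = 8

    per_user_tps = aggregate_tps / concurrent_users
    print(f"~{per_user_tps:.0f} tokens/sec per user")   # ~50 tok/s each

    # People read at very roughly 5-10 tokens/sec, so ~50 tok/s per user still feels
    # snappy; once per-user throughput drops toward reading speed, the queueing shows.
    ```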

    There are also some other backends like MLC/TGI/vLLM that are better suited to this, but they have way worse quant support.
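
    For example, vLLM’s offline API takes a whole list of prompts and schedules them with continuous batching, which is the multi-user case in miniature (the model name below is just an example, and note the caveat above about quant support):

    ```python
    # Sketch: hand a batch of prompts to vLLM and let its scheduler interleave them.
    from vllm import LLM, SamplingParams

    prompts = [f"User {i}: suggest a name for a pet llama." for i in range(16)]

    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))

    for o in outputs:
        print(o.prompt[:20], "->", o.outputs[0].text.strip()[:60])
    ```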

    The “minimum” is one GPU that completely fits the size and quant of the model you are serving.
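
    A back-of-the-envelope way to sanity-check that “fits on one GPU” rule (the KV-cache and overhead constants here are rough assumptions, not measurements):

    ```python
    # Rough VRAM estimate: weights at the quantized bit-width, plus some KV cache
    # and runtime overhead. Treat the constants as guesses to be tuned.
    def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                         kv_cache_gb: float = 2.0, overhead: float = 1.1) -> float:
        weights_gb = params_billion * bits_per_weight / 8   # 1B params at 8 bpw ~ 1 GB
        return (weights_gb + kv_cache_gb) * overhead

    # Example: a 13B model at ~4.5 bits per weight (typical 4-bit quant with scales)
    print(f"{estimate_vram_gb(13, 4.5):.1f} GB")   # ~10 GB, so a single 16 GB card works
    ```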

    People serve lots of users through Kobold Horde using only single- and dual-GPU configurations, so this isn’t something you’ll need tens of thousands of dollars for.