Hi all,

Just curious if anybody knows the hardware required to build a llama server that can serve multiple users at once.

Any discussion is welcome:)

  • Tiny_Arugula_5648 · 10 months ago

    Unless you’re doing this as a business, it’s going to be massively cost-prohibitive: hundreds of thousands of dollars of hardware. If it is a business, you’d better get talking to cloud vendors, because GPUs are an incredibly scarce resource right now.

  • seanpuppy · 10 months ago

    It depends a lot on the details, tbh. Do they all share one model? Do they each use a different LoRA? If it’s the latter, there’s some cool recent research on efficiently hosting many LoRAs on one machine.
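
    To make that concrete (the research is likely along the lines of S-LoRA / punica), here’s a minimal sketch of many LoRAs sharing one base model using vLLM’s multi-LoRA support; the model name, adapter names, and paths below are just placeholders:

    ```python
    # Hedged sketch: two requests hit the same base model but different LoRA adapters.
    # The base weights are loaded once; each adapter is tiny by comparison.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Hypothetical adapters; swap in whatever you have actually trained.
    support_lora = LoRARequest("support_adapter", 1, "/adapters/support")
    sql_lora = LoRARequest("sql_adapter", 2, "/adapters/sql")

    out_a = llm.generate(["Summarize this support ticket: ..."], params,
                         lora_request=support_lora)
    out_b = llm.generate(["Write a SQL query that ..."], params,
                         lora_request=sql_lora)

    print(out_a[0].outputs[0].text)
    print(out_b[0].outputs[0].text)
    ```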

    • Appropriate-Tax-9585 (OP) · 10 months ago

      At the moment I’m just trying to grasp the basics, like what kind of GPUs I will need and how many. This is mostly for comparison against SaaS options; in reality I only need to set up a server for testing with just a few users. I’m going to research it myself, but I like this community and want to hear others’ views on the case, since I imagine many here have tried running their own servers :)

  • a_beautiful_rhind · 10 months ago

    You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each of them gets per request.
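
    To put rough numbers on that, the division really is that simple (the figures below are placeholders; plug in whatever your own benchmark reports):

    ```python
    # Toy math for "batched throughput divided by concurrent users". Numbers are made up.
    aggregate_tps = 400.0      # total generated tokens/sec at batch size 8, from a benchmark run
    concurrent_users = 8

    per_user_tps = aggregate_tps / concurrent_users
    print(f"~{per_user_tps:.0f} tokens/sec per user")   # ~50 tok/s each

    # People read at very roughly 5-10 tokens/sec, so ~50 tok/s per user still feels
    # snappy; once per-user throughput drops toward reading speed, the queueing shows.
    ```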

    There are also some other backends like MLC/TGI/vLLM that are better suited to this, but they have way worse quant support.
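
    For example, vLLM’s offline API takes a whole list of prompts and schedules them with continuous batching, which is the multi-user case in miniature (the model name below is just an example, and note the caveat above about quant support):

    ```python
    # Sketch: hand a batch of prompts to vLLM and let its scheduler interleave them.
    from vllm import LLM, SamplingParams

    prompts = [f"User {i}: suggest a name for a pet llama." for i in range(16)]

    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))

    for o in outputs:
        print(o.prompt[:20], "->", o.outputs[0].text.strip()[:60])
    ```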

    The “minimum” is one GPU that completely fits the size and quant of the model you are serving.
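
    A back-of-the-envelope way to sanity-check that “fits on one GPU” rule (the KV-cache and overhead constants here are rough assumptions, not measurements):

    ```python
    # Rough VRAM estimate: weights at the quantized bit-width, plus some KV cache
    # and runtime overhead. Treat the constants as guesses to be tuned.
    def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                         kv_cache_gb: float = 2.0, overhead: float = 1.1) -> float:
        weights_gb = params_billion * bits_per_weight / 8   # 1B params at 8 bpw ~ 1 GB
        return (weights_gb + kv_cache_gb) * overhead

    # Example: a 13B model at ~4.5 bits per weight (typical 4-bit quant with scales)
    print(f"{estimate_vram_gb(13, 4.5):.1f} GB")   # ~10 GB, so a single 16 GB card works
    ```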

    People serve lots of users through Kobold Horde using only single- and dual-GPU configurations, so this isn’t something you’ll need tens of thousands of dollars for.