Hi all,
Just curious if anybody knows what kind of hardware it takes to run a llama server that can serve multiple users at once.
Any discussion is welcome:)
You would have to benchmark batched throughput in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each of them gets per request.
There are some other backends like MLC/TGI/vLLM that are better suited to multi-user serving, but they have way worse quant support.
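Roughly, the arithmetic looks like this (my own sketch, the numbers are made up for illustration):

```python
def per_user_tps(batched_tokens_per_sec: float, concurrent_users: int) -> float:
    """Estimate the tokens/sec each user sees if requests are served in one batch."""
    return batched_tokens_per_sec / concurrent_users

# Example: a benchmark that measures 600 tok/s total at batch size 8
print(per_user_tps(600.0, 8))  # -> 75.0 tok/s per user
```

In practice throughput doesn't scale perfectly with batch size, so benchmark at the batch size you actually expect rather than extrapolating from batch 1.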
The “minimum” is one GPU that completely fits the size and quant of the model you are serving.
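A back-of-the-envelope VRAM check (my own sketch, with approximate bits-per-weight): weights take roughly parameter count × bits per weight ÷ 8, and you need extra headroom for context/KV cache on top of that.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for the weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a 13B model at ~4.5 bits/weight (a Q4-style quant):
print(round(weight_vram_gb(13, 4.5), 1), "GB for weights")  # ~7.3 GB
# Budget a few extra GB for KV cache, and more if you batch many users.
```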
People serve lots of users through Kobold Horde using only single and dual GPU configurations, so this isn't something you'll need to spend tens of thousands on.