Hey guys, as the title suggests, I'd like some advice on the best way to serve LLMs with support for GBNF or similar grammar constraints, so I can guarantee the output conforms to a fixed structure. I've been using text-generation-webui locally, and from there I can add my grammar; however, I'd like to be able to do this across a cluster that can run inference at high throughput. Any suggestions on how best to accomplish this?
A naive solution would be running multiple instances of text-generation-webui in a cluster and distributing requests across them, along the lines of the sketch below. My gut says there's a more ideal method that I can use.
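Something like this round-robin dispatcher is what I have in mind. The instance URLs are placeholders, and the `grammar_string` parameter is my guess at text-generation-webui's OpenAI-compatible API, so treat it as untested:

```python
# Naive approach: round-robin requests over several text-generation-webui
# instances. URLs and payload fields are assumptions; adjust to your setup.
import itertools
import requests

INSTANCES = [
    "http://node1:5000",
    "http://node2:5000",
    "http://node3:5000",
]
_rotation = itertools.cycle(INSTANCES)

def generate(prompt: str, grammar: str) -> str:
    base = next(_rotation)  # pick the next instance in rotation
    resp = requests.post(
        f"{base}/v1/completions",
        json={
            "prompt": prompt,
            "grammar_string": grammar,  # assumed GBNF parameter name
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```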
Llama.cpp's example server supports batching and per-request custom grammars.
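You pass the GBNF grammar in the body of each `/completion` request. A minimal sketch (model path, port, and slot flags are placeholders, and flag names vary a bit by version):

```python
# Grammar-constrained request to a llama.cpp example server.
# Assumes the server was started with parallel slots enabled, e.g.:
#   ./server -m model.gguf --port 8080 -np 4 -cb
# (-np = number of parallel slots, -cb = continuous batching)
import requests

# GBNF grammar restricting the completion to "yes" or "no"
GRAMMAR = r'''
root ::= "yes" | "no"
'''

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is the sky blue? Answer yes or no: ",
        "grammar": GRAMMAR,  # tokens that violate the grammar are masked out
        "n_predict": 4,
        "temperature": 0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"])
```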
It's a work in progress for Aphrodite: https://github.com/PygmalionAI/aphrodite-engine/issues/36#issuecomment-1747429134