Hey guys, as the title suggests, I'd like some advice on the best way to serve LLMs with support for GBNF or similar grammar constraints, so I can guarantee the output conforms to a fixed structure. I've been using text-generation-webui locally, and from there I can add my grammar; however, I'd like to be able to do this across a cluster that can run inference at high throughput. Any suggestions on how best to accomplish this?
A naive solution would be running multiple instances of text-generation-webui in a cluster and distributing requests across them, along the lines of the sketch below. My gut says there's a more ideal method that I can use.
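Something like this round-robin dispatcher is what I have in mind. The instance URLs are placeholders, and the `grammar_string` parameter is my guess at text-generation-webui's OpenAI-compatible API, so treat it as untested:

```python
# Naive approach: round-robin requests over several text-generation-webui
# instances. URLs and payload fields are assumptions; adjust to your setup.
import itertools
import requests

INSTANCES = [
    "http://node1:5000",
    "http://node2:5000",
    "http://node3:5000",
]
_rotation = itertools.cycle(INSTANCES)

def generate(prompt: str, grammar: str) -> str:
    base = next(_rotation)  # pick the next instance in rotation
    resp = requests.post(
        f"{base}/v1/completions",
        json={
            "prompt": prompt,
            "grammar_string": grammar,  # assumed GBNF parameter name
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```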
Llama.cpp's example server supports batching and per-request custom grammars.
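You pass the GBNF grammar in the body of each `/completion` request. A minimal sketch (model path, port, and slot flags are placeholders, and flag names vary a bit by version):

```python
# Grammar-constrained request to a llama.cpp example server.
# Assumes the server was started with parallel slots enabled, e.g.:
#   ./server -m model.gguf --port 8080 -np 4 -cb
# (-np = number of parallel slots, -cb = continuous batching)
import requests

# GBNF grammar restricting the completion to "yes" or "no"
GRAMMAR = r'''
root ::= "yes" | "no"
'''

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is the sky blue? Answer yes or no: ",
        "grammar": GRAMMAR,  # tokens that violate the grammar are masked out
        "n_predict": 4,
        "temperature": 0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"])
```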
It's a work in progress for Aphrodite: https://github.com/PygmalionAI/aphrodite-engine/issues/36#issuecomment-1747429134