So I was looking over the recent merges to llama.cpp’s server and saw that they’ve more or less brought it in line with OpenAI-style APIs – natively – obviating the need for e.g. api_like_OAI.py, or one of the bindings/wrappers like llama-cpp-python (+ooba), koboldcpp, etc. (not that those and others don’t provide great/useful platforms for a wide variety of local LLM shenanigans).
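
For a sense of what that looks like from the client side, here’s a rough sketch of hitting the OpenAI-style chat endpoint on a locally running server. I’m assuming the default localhost:8080 here; adjust host/port (and sampling params) to taste:

```ts
// Rough sketch (Node 18+ / TypeScript): OpenAI-style chat request against a
// local llama.cpp server. Assumes the server is up on the default
// localhost:8080; adjust for your setup.
const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Hello there!" },
    ],
    temperature: 0.7,
  }),
});

const data = await res.json();
// The response follows the usual OpenAI shape: choices[0].message.content
console.log(data.choices[0].message.content);
```

Same request body you’d send to the hosted OpenAI API, just pointed at your own box.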

As of a couple days ago (can’t find the exact merge/build), it seems as if they’ve implemented – essentially – the old ‘simple-proxy-for-tavern’ functionality (for lack of a better way to describe it) but *natively*.

As in, you can connect SillyTavern (and numerous other clients, notably Hugging Face’s chat-ui, *with local web search*) without a layer of Python in between. Or, I guess, you’re trading the Python layer for a pile of Node (typically), but sitting just above bare metal (if we consider compiled C++ to be ‘bare metal’ in 2023 ;).

Anyway, it’s *fast*, or at least not noticeably slower than it needs to be? Prompt-processing and generation times in the front-ends I’ve tried are similar to ‘main’ and to the server’s own skeletal JS UI.

It seems like ggerganov and co. are getting serious about the server side of llama.cpp, perhaps even over/above ‘main’ or the notion of a pure lib/api. You love to see it. apache/httpd vibes 😈

Couple links:

https://github.com/ggerganov/llama.cpp/pull/4198

https://github.com/ggerganov/llama.cpp/issues/4216

But seriously, just try it! /models, the OpenAI-style /v1 routes, and /completion are all there now as native endpoints (compiled C++ with all the GPU features + other goodies). Boo-ya!
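
And for completeness, the server’s own native /completion endpoint (the non-OpenAI one) is just as easy to poke at. Another rough sketch, again assuming the default localhost:8080 and the request/response fields as I understand them from the server README, so double-check against your build:

```ts
// Rough sketch: llama.cpp server's native /completion endpoint.
// Field names ("prompt", "n_predict", response "content") are as I
// understand them from the server docs; verify against your build.
const res = await fetch("http://localhost:8080/completion", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "Building a website can be done in 10 simple steps:",
    n_predict: 128,
  }),
});

const out = await res.json();
console.log(out.content); // generated text comes back in "content"
```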