Around 1.5 months ago, I started https://github.com/michaelfeil/infinity. With the hype around Retrieval-Augmented Generation, this topic has become important over the last month, in my view, and this repo is the only option under an open license.
I have now implemented everything from faster attention, ONNX / CTranslate2 / torch inference, and caching to better Docker images and better queueing strategies. I am pretty much running out of ideas now, so if you have some, feel free to open an issue. It would be very welcome!