I want to know the tools and methods you use for the observability and monitoring of your ML (LLM) performance and responses in production.
If you’re open to using an open source library, you can use LangCheck to monitor and visualize text quality metrics in production.
For example, you can compute & plot the toxicity of user prompts and LLM responses from your logs, as sketched below. (A very simple example here.)
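Here's a minimal sketch of what that could look like, based on the `langcheck.metrics.toxicity` API shown in the project's README. The `prompts` and `responses` lists are stand-ins for records you'd pull from your own production logs:

```python
import langcheck

# Stand-in log data -- in practice, load these from your production logs
prompts = [
    "How do I reset my password?",
    "You are a useless bot.",
]
responses = [
    "You can reset it from the account settings page.",
    "Sorry to hear that. How can I help?",
]

# Compute toxicity scores (roughly 0 = not toxic, 1 = toxic)
prompt_toxicity = langcheck.metrics.toxicity(prompts)
response_toxicity = langcheck.metrics.toxicity(responses)

# Visualize the score distribution interactively
response_toxicity.scatter()

# Or enforce a quality threshold (fails if any response scores >= 0.25)
assert response_toxicity < 0.25
```

You could run something like this as a periodic batch job over recent logs and alert when the threshold assertion fails.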
(Disclaimer: I’m one of the contributors to LangCheck.)