I’m currently working on RAG-based tooling for some non-profits and am having difficulty with the following. What are people using for this?

  1. Tracking model performance across experiments and productized pipelines
    1. Changes in test or finetuning data sets
    2. Changes in chunking strategy
    3. Changes in RAG tooling (e.g. RAG-Fusion or RA-DIT)
    4. Changes in underlying models and/or finetuning strategies
  2. Tracking pipeline performance (e.g. speed, throughput, latency) as we change the items listed above
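
For concreteness, here is roughly the kind of per-run record I have in mind. This is only a minimal sketch using the Python standard library. Every name in it (`RunConfig`, `measure`, `record_run`, the field names and the metric keys) is hypothetical and made up for illustration, not taken from any existing tool.

```python
import hashlib
import json
import statistics
import time
from dataclasses import asdict, dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical fields, one per item in the list above.
    dataset_version: str     # test or finetuning data set
    chunking: str            # e.g. "fixed-512"
    rag_variant: str         # e.g. "baseline" or "rag-fusion"
    model: str               # underlying model identifier
    finetune_strategy: str   # e.g. "none" or "lora"

    def run_id(self) -> str:
        # Stable hash of the config, so identical setups share an ID.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


def measure(pipeline: Callable[[str], str], queries: Sequence[str]) -> dict:
    """Time each query (expects at least one) and summarise latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        pipeline(query)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return {
        "n_queries": len(queries),
        "latency_p50_s": statistics.median(latencies),
        "latency_max_s": max(latencies),
        "throughput_qps": len(queries) / total if total > 0 else float("inf"),
    }


def record_run(config: RunConfig, quality: dict, perf: dict) -> dict:
    """Merge config, quality metrics and performance metrics into one flat record."""
    return {"run_id": config.run_id(), **asdict(config), **quality, **perf}
```

Appending one such record per run to a JSON Lines file would let me diff runs by any config field, but I would rather use an existing product than maintain this myself.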

What products do you use and how do you choose them?