Hi,

Having developed several AI-based products and run them in production, I keep running into the same problems again and again, and I wonder whether others have found a solution.

I have multiple prompts that are expected to return JSON with a given structure. Even now that you can hand the expected output type to OpenAI, I see no difference in result quality. You have to be very careful with prompt construction and model choice: a single word placed differently can lead to broken JSON output. The same goes for switching models, e.g. going from 3.5 Turbo preview to 4 with otherwise identical inputs can reduce output quality.
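For now my workaround is to never trust the raw output: every response gets parsed and checked against a schema before it touches business logic. A minimal sketch of what I mean, assuming the OpenAI Python SDK (1.x) and the jsonschema package; the field names and model are just placeholders:

```python
# Sketch: validate model output against a schema before using it.
# Assumes openai>=1.0 and jsonschema; EXPECTED_SCHEMA fields are placeholders.
import json

from jsonschema import validate
from openai import OpenAI

client = OpenAI()

EXPECTED_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

def call_and_validate(prompt: str, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: the reply is valid JSON, but the structure is not guaranteed
        response_format={"type": "json_object"},
    )
    raw = resp.choices[0].message.content
    data = json.loads(raw)            # raises if the JSON itself is broken
    validate(data, EXPECTED_SCHEMA)   # raises ValidationError on wrong structure
    return data
```

That catches the broken outputs, but it only tells me something went wrong per request; it doesn't tell me whether a prompt or model change made things better or worse overall.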

And that’s just one issue. At volume I sometimes see random hallucinations in the language of the output, and multi-language handling is tricky in general.

That led me to think that what’s needed at this early stage of AI building is a rigorous testing environment. More than just a playground: an environment that can mass-test prompts on an ongoing basis, monitor the results, suggest improvements and so on. Just what’s needed to seriously build business products rather than toy use cases for single-user environments.
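To make it concrete, here is a rough sketch of the kind of harness I keep rebuilding by hand: replay a fixed test set against each model on a schedule and track the pass rate over time. It reuses the `call_and_validate` helper from the sketch above; the cases, model names and threshold are made-up placeholders.

```python
# Sketch of an ongoing prompt regression suite; depends on call_and_validate above.
from dataclasses import dataclass

from jsonschema import ValidationError

@dataclass
class Case:
    prompt: str
    required_keys: list[str]

CASES = [
    Case("Summarise this support ticket as JSON: ...", ["title", "tags"]),
    Case("Fasse dieses Ticket als JSON zusammen: ...", ["title", "tags"]),  # multi-language case
]

def run_suite(model: str, repeats: int = 5) -> float:
    """Return the fraction of calls that came back as valid, complete JSON."""
    passed = total = 0
    for case in CASES:
        for _ in range(repeats):  # repeat to surface non-deterministic failures
            total += 1
            try:
                data = call_and_validate(case.prompt, model=model)
            except (ValueError, ValidationError):
                continue  # broken JSON or wrong structure counts as a failure
            if all(key in data for key in case.required_keys):
                passed += 1
    return passed / total

if __name__ == "__main__":
    for model in ("gpt-3.5-turbo", "gpt-4o"):
        rate = run_suite(model)
        print(f"{model}: {rate:.0%} valid responses")
        assert rate >= 0.95, f"{model} fell below the 95% threshold"
```

What I’d want from a real tool is this, but at scale: large test sets, scheduled runs, history per prompt and model, and alerts when a change regresses quality.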

Does anyone know of such a tool out there? Is anyone else having these problems with their commercial products?