Prototype to Production: an evals-driven approach to building reliable systems

Sahil Sinha · llms, evals, human in the loop

It's now clear that taking your LLM applications from prototype to production is its own beast, with its own set of challenges that are distinct from the prototype phase. Some teams with more risk tolerance may be open to a 'ship and wing it' approach, building their LLMOps plane as it flies, so to speak. For example, this could be teams operating in less regulated industries or not handling particularly sensitive data.

But some teams may not be ready to build the plane as it flies. For those, here are some ways to take a more analytical approach to turning your prototypes into reliable, production-ready systems.

1. Human in the loop (HITL)

Human in the loop involves employing human-agents (typically subject matter experts) to review model outputs, either before the user sees them (acting as a guardrail) or after the user has seen them (acting as a post-hoc evaluator). HITL can be especially useful for use cases with vast and/or dense knowledge bases and big consequences for getting things wrong.
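To make the two modes concrete, here's a minimal sketch of that routing decision. The helper names (`generate_reply`, `request_human_review`, `enqueue_for_review`) are illustrative stubs, not any particular product's API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    approved: bool
    corrected_output: Optional[str] = None


# Placeholder stubs: swap in your real model call and review tooling.
def generate_reply(user_input: str) -> str:
    return f"(model draft for: {user_input})"


def request_human_review(draft: str) -> Review:
    # Guardrail mode: this blocks until a subject matter expert responds.
    return Review(approved=True)


def enqueue_for_review(user_input: str, draft: str) -> None:
    # Post-hoc mode: the pair is queued and reviewed after delivery.
    pass


def handle_request(user_input: str, blocking: bool) -> str:
    draft = generate_reply(user_input)
    if blocking:
        # Guardrail: the user only ever sees output a human has signed off on.
        review = request_human_review(draft)
        if review.approved:
            return review.corrected_output or draft
        return "Sorry, I can't help with that right now."  # or escalate to a human channel
    # Post-hoc: respond immediately, review afterwards.
    enqueue_for_review(user_input, draft)
    return draft
```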

A few considerations when building your HITL flow:

  1. You'll want to track the data you collect from your human-agents. This will help you build your own evaluation models, and also help you understand what kinds of errors your models frequently make that your human-agents need to correct (a minimal logging sketch follows these considerations).

  2. Given the costs of human-agents, you'll want to experiment with automated evaluation models as well. Use the training data you're collecting to start building your own evaluation models. While you don't need to fully rely on these yet, it is important to deploy and start gathering data early.

This will help you understand what types of errors your automated models can catch, and where (if anywhere) you need your human-agents to step in.

  3. In the future you're going to have a mix of automated evaluations and human-agent evaluations. Here are some considerations for where automated vs. human-agent evaluations may make the most sense:

Automated evaluations are going to be faster and cheaper than human-agents, making them great for catching routine errors and performance tracking.

However, they're not perfect and may misclassify errors, especially early versions trained on limited data. Committing to future-proofing these models by regularly updating their training data (or the examples in their k-shot prompts) will be critical. Additionally, automated evaluations may struggle to catch errors they haven't seen before, or errors that are less prevalent in the training data.

Human-agents will have a firmer grasp of the underlying knowledge base and will be able to catch new kinds of errors without relying on historic data. However, employing human-agents is no doubt expensive, and will pose challenges as your system scales. You also have latency considerations: human-agents will always take longer to respond than your automated evaluations.
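As a concrete example of the first consideration (tracking what your human-agents do), here's a minimal logging sketch. The record fields and file path are assumptions to adapt to your own stack:

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReviewRecord:
    session_id: str
    model_input: str
    model_output: str
    reviewer_verdict: str              # e.g. "approved", "corrected", "rejected"
    error_category: Optional[str]      # e.g. "hallucinated_citation", "wrong_policy"
    corrected_output: Optional[str]
    automated_verdict: Optional[str]   # what your automated evaluator said, if it ran
    timestamp: float


def log_review(record: ReviewRecord, path: str = "hitl_reviews.jsonl") -> None:
    """Append one human review to a JSONL file."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")


log_review(ReviewRecord(
    session_id="abc123",
    model_input="What is our refund window?",
    model_output="Refunds are available for 90 days.",
    reviewer_verdict="corrected",
    error_category="wrong_policy",
    corrected_output="Refunds are available for 30 days.",
    automated_verdict="pass",
    timestamp=time.time(),
))
```

Every row doubles as a labelled example for your automated evaluation models, and keeping the automated verdict next to the human verdict is what lets you later measure which error types the automated models can safely own.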

2. Start your evaluations layer early

Evaluations can feel intimidating because developers feel they have to get everything right the first time. In reality, your evaluations layer will be just like your product - it’ll start a little rough, but iterate over time based on the data you’re seeing.

That means just like your product, you should start early and iterate over time! A few tactical tips on how to get started:

  1. Consider off-the-shelf evaluations vs. building something custom: there are great evaluation frameworks (such as RAGAs) that you can use right off the shelf with no manual work needed. These are great if you're using more standard workflows (e.g. RAGAs is well suited to workflows using RAG). HuggingFace, Reddit, GitHub and HackerNews are all great sources of potential evaluation models.
  2. If the open source community hasn't already built the evaluation that works for you, then you can quickly build something custom.

I'd recommend starting with k-shot prompting, given it's quick to implement, leverages prompt engineering best practices, and sets you up for future-proofing your models (by adding more examples as you go).
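For illustration, a minimal k-shot evaluation prompt could look like the sketch below. It assumes the `openai` Python client and an illustrative model name, and the graded examples are made up; in practice you'd pull them from sessions your human-agents have already graded:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable provider works similarly

# k-shot examples drawn from sessions your human-agents have already graded.
EXAMPLES = [
    {"input": "What is our refund window?",
     "output": "Refunds are available for 90 days.",
     "grade": "FAIL - contradicts the 30-day refund policy"},
    {"input": "How do I reset my password?",
     "output": "Use the 'Forgot password' link on the login page.",
     "grade": "PASS"},
]


def grade_output(model_input: str, model_output: str) -> str:
    """Ask an LLM to grade an output, anchored by previously graded examples."""
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}\nGrade: {e['grade']}"
        for e in EXAMPLES
    )
    prompt = (
        "You are an evaluator. Grade the output as PASS or FAIL with a "
        "one-line reason, following the graded examples.\n\n"
        f"{shots}\n\nInput: {model_input}\nOutput: {model_output}\nGrade:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Future-proofing then amounts to appending new human-graded pairs to `EXAMPLES` as you find cases the evaluator gets wrong.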

3. Iterating over time

I’d employ the following process to iterate on your evaluations layer:

  1. I'd start with one evaluation model, checking for critical errors and edge cases.

Periodically spot-check your evaluation model, particularly on sessions marked as especially good or bad. I'd verify that the grade is appropriate, and if it's not, bring that input/output pair into my eval prompt as a k-shot example.

  2. Split into multiple models.

As I'm getting a better sense of what kinds of errors I'm seeing, what they look like and how frequent they are, I'd start experimenting with breaking my single evaluation model into multiple models.

You may have one large evaluation model tracking overall performance and flagging any edge cases, and two or three others looking for specific types of errors that are high frequency and/or high impact. You'll get more reliable coverage by dedicating some evaluation models to only one type of error.

  3. Graduate from k-shot prompts to finetuned models.

As your system starts to scale, you'll want to think about the scalability of your evaluations layer as well. Your evaluations layer could start to get slow and expensive, depending on how many evaluations you have running, how many examples they have in the prompt, and what they are evaluating. You can see that having n models, each with its context window stuffed with as many examples as you can fit, will quickly blow up your costs.

As you start to get more data, finetuning your own evaluation models will become more efficient and cheaper than using k-shot prompts. For discrete and specific tasks, such as narrow-scoped evaluations, small language models can actually be more efficient and cheaper than large language models.
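When you make that jump, the reviews you've been logging can be converted straight into training data. A minimal sketch, assuming the JSONL review log (and its field names) from the earlier HITL sketch, and a chat-style finetuning format such as OpenAI's:

```python
import json

SYSTEM = "You are an evaluator. Reply with PASS or FAIL and a one-line reason."


def to_finetune_rows(reviews_path: str = "hitl_reviews.jsonl",
                     out_path: str = "eval_finetune.jsonl") -> None:
    """Turn logged human reviews into chat-format finetuning rows."""
    with open(reviews_path) as src, open(out_path, "w") as dst:
        for line in src:
            r = json.loads(line)
            grade = ("PASS" if r["reviewer_verdict"] == "approved"
                     else f"FAIL - {r.get('error_category') or 'see review'}")
            row = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user",
                 "content": f"Input: {r['model_input']}\nOutput: {r['model_output']}"},
                {"role": "assistant", "content": grade},
            ]}
            dst.write(json.dumps(row) + "\n")
```

The same rows work as a starting point for finetuning a small language model dedicated to one narrow evaluation.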

4. Track your true north metric over time

Alongside any LLM-driven evaluations, it's critical to make sure you're monitoring your end goal (i.e. particular conversion events, retention, etc.). LLM evaluations will be great proxies for your end goal, but they're not a direct replacement, especially early on, when you may not quite have your evaluation models correctly configured.

Analysing product data can also be a great source of new dimensions to evaluate. Comparing all sessions of users who churned vs. those who didn't can help you identify which qualities of your LLM output are actually moving the needle (or failing to), and therefore make sense to track.
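One lightweight way to do this is to join your eval grades with product outcomes and compare them directly. A small sketch with pandas; the column names and the `converted` outcome are placeholders for whatever your true north metric is:

```python
import pandas as pd

# One row per session: automated eval grades joined with the business outcome you care about.
sessions = pd.DataFrame([
    {"session_id": "a1", "faithfulness_pass": True,  "tone_pass": True,  "converted": True},
    {"session_id": "a2", "faithfulness_pass": False, "tone_pass": True,  "converted": False},
    {"session_id": "a3", "faithfulness_pass": True,  "tone_pass": False, "converted": True},
    # ...in practice, joined from your eval logs and product analytics
])

# Conversion rate split by each eval dimension: a quick check of whether the
# things your evaluators grade actually move the metric you care about.
for col in ["faithfulness_pass", "tone_pass"]:
    print(sessions.groupby(col)["converted"].mean().rename(f"conversion by {col}"))
```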

Happy evaluating and building!

At lytix, we are passionate about getting LLM systems out of prototypes and into production. While you can use these best practices with whatever tool you'd like, we've incorporated them all into Custom Evaluations by Lytix - here's how you can get started.

📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.
