Data-Maxxing in LLMops - how to get more out of your observability layer

Sahil Sinha · llms, evals, human in the loop

Base observability tools are key components of any LLM stack - but many teams stop at using them for manual log review, and don't realize they can get so much more out of this data.

Here are 3 ways to get more out of your observability layer.

1. Building your true north KPI

A True North KPI is a single metric that summarizes the performance of your LLM system: one number representing how well your system is doing. You can use it to validate whether changes you've made are 'good' or 'bad', and to monitor the impact of any exogenous changes.

Many teams feel like they're not mature enough for a metric like this, yet still struggle to get off 'vibes'-based evaluations and to validate whether the changes they've made 'worked'.

A lot of developers are intimidated by the thought of building a metric from scratch, thinking it has to be perfect the first time. But just like your product, shipping early and iterating is the best approach.

Your observability tool will already have enough data to build the first version of your true north metric. You can start with a k-shot LLM-as-judge prompt and refine from there.
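For instance, a first version might simply average an LLM-as-judge score over a sample of recent sessions exported from your logs. The sketch below assumes sessions come out of your observability tool as dicts with `input` and `output` fields; the judge prompt, example slots, and model name are placeholders to adapt, not a prescribed setup.

```python
# Minimal sketch of a "true north" score: average an LLM-as-judge rating
# over a sample of recent sessions pulled from your observability tool.
# Assumes `sessions` is a list of dicts with "input" and "output" fields;
# the judge prompt and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's response.
Score it from 1 (unusable) to 5 (perfect) for correctness and helpfulness.

Example (score 5): ...
Example (score 2): ...

User input: {input}
Assistant output: {output}

Reply with a single integer from 1 to 5."""


def judge_session(session: dict) -> int:
    # Ask the judge model for a 1-5 score on a single logged session.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**session)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())


def true_north_kpi(sessions: list[dict]) -> float:
    """Single number summarizing system quality: mean judge score (1-5)."""
    scores = [judge_session(s) for s in sessions]
    return sum(scores) / len(scores)
```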

2. Evaluation models

Evaluations are similar to a True North KPI in that they help you monitor the performance of your LLM system. However, unlike a True North KPI, it's wise to have multiple evaluations running on your system, each tracking a different aspect of quality or performance.

These can be great for catching regressions, identifying edge cases, and validating if a change you made 'worked' or not.

A few approaches to consider:

  1. K-shot prompting

This approach is a form of LLM-as-Judge, where you provide the model with examples of good and bad outputs and ask it to classify new outputs.

The data in your observability dashboard is a great starting point for finding good examples to bring into your k-shot prompt. And as new examples hit your system, you can iteratively add them to your k-shot prompt to improve its coverage.
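As a rough illustration, the judge below builds its prompt from a handful of labeled sessions copied out of an observability dashboard; appending new labeled sessions to `EXAMPLES` is how coverage grows over time. Field names, labels, and the model are assumptions, not a prescribed setup.

```python
# Minimal sketch of a k-shot LLM-as-judge eval that classifies sessions
# as PASS or FAIL. EXAMPLES holds labeled sessions copied out of your
# observability tool; add new ones over time to improve coverage.
from openai import OpenAI

client = OpenAI()

# Labeled examples pulled from reviewed production sessions (illustrative).
EXAMPLES = [
    {"input": "Cancel my subscription", "output": "I can't help with that.", "label": "FAIL"},
    {"input": "What's your refund policy?", "output": "Refunds are available within 30 days of purchase...", "label": "PASS"},
]


def build_prompt(session: dict) -> str:
    # Render each labeled example as an input/output/verdict block.
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}\nVerdict: {e['label']}"
        for e in EXAMPLES
    )
    return (
        "Classify whether the assistant's output is acceptable (PASS) or not (FAIL).\n\n"
        f"{shots}\n\n"
        f"Input: {session['input']}\nOutput: {session['output']}\nVerdict:"
    )


def evaluate(session: dict) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": build_prompt(session)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()  # "PASS" or "FAIL"
```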

  2. Finetuned SLM evals

You may even have enough data to finetune your own SLM eval model. SLMs can be more cost-effective and efficient than heavily prompted large language models, especially for narrowly scoped tasks at scale. Here's an example of how to do this for building a custom evaluation model using human-in-the-loop training data.
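As a rough sketch of what that finetune might look like (not the linked walkthrough itself), the snippet below trains a small encoder as a pass/fail classifier on human-reviewed sessions exported to a JSONL file. The model choice, hyperparameters, and file names are placeholders.

```python
# Rough sketch: finetune a small classifier as an eval model, assuming
# human-reviewed sessions (text + pass/fail label) have been exported from
# your observability tool into train.jsonl.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "distilbert-base-uncased"  # any small encoder is a reasonable start

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Each row of train.jsonl looks like: {"text": "<input + output>", "label": 0 or 1}
dataset = load_dataset("json", data_files={"train": "train.jsonl"})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")


tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-eval", num_train_epochs=3),
    train_dataset=tokenized["train"],
)
trainer.train()
trainer.save_model("slm-eval")  # reuse this model as a cheap, fast evaluator
```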

3. Building your own data moats

Finally, teams that have reached some scale may also be ready to start thinking about building moats: reducing their dependency on foundation models and leveraging their proprietary data to build their own LLMs.

The idea here is to take your production data on what 'good' sessions look like and use it as training data for finetuning your own LLMs.

To do this, evaluations will be critical for ensuring you're finetuning your LLMs with 'good' data. If you finetune your model on a mix of good and bad data, you're leaving performance gains on the table. Accurately identifying which datapoints are good enough to finetune with is key to the performance of your own models.

One approach is to automatically tag 'perfect' sessions and set them aside in their own dataset. You can periodically download this dataset and re-finetune your model with it. I'd recommend running your new model on a few prompts (ideally real user prompts) and using your evaluations to confirm the model is actually better than what you had before.
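Putting that loop together might look something like the sketch below: keep only sessions the judge scores as perfect, export them in a chat-style JSONL format for finetuning, and only promote the new model if it beats your current true north KPI. It reuses the illustrative `judge_session` and `true_north_kpi` helpers sketched earlier; the threshold and file name are assumptions.

```python
# Sketch of the data-moat loop: tag perfect sessions, export them for
# finetuning, and gate promotion of the new model on your evaluations.
import json


def tag_perfect_sessions(sessions: list[dict], threshold: int = 5) -> list[dict]:
    """Keep only sessions the judge scores as 'perfect' (threshold is an assumption)."""
    return [s for s in sessions if judge_session(s) >= threshold]


def export_finetune_dataset(sessions: list[dict], path: str = "finetune.jsonl") -> None:
    """Write perfect sessions in a chat-style JSONL format for finetuning."""
    with open(path, "w") as f:
        for s in sessions:
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": s["input"]},
                    {"role": "assistant", "content": s["output"]},
                ]
            }) + "\n")


def should_promote(candidate_sessions: list[dict], baseline_kpi: float) -> bool:
    """Only ship the finetuned model if it beats the current true north KPI."""
    return true_north_kpi(candidate_sessions) > baseline_kpi
```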

Happy building!

📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.
