Why you may need a family of evaluators
One of the learnings about LLMs that came in 2024 was that models don't necessarily just get better as they get larger. The future - more likely - looks like systems of multiple models, each with different capabilities and specializations.
Sam Altman has said that future models will be "better in other ways", and that a future of "one large model" that can do everything is unlikely. There are research papers that have come to the same conclusion.
It looks like the LLMOps community is converging on the same idea - that a family of evaluators may be better than a few large evaluation models.
I'll go over what the literature says, and what we've observed in the real world, building evaluations at lytix.
The studies 📜
In "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", Verga et al tested how a "Panel of LLm evaluators" (or PoLL) approach, would compare against a single large "evaluator" judge (you can learn more about their methodology here (opens in a new tab)).
They found that evaluations from "panels" of evaluators were closer to human judgement and had less variance across datasets than relying on a single LLM evaluator. On top of the reliability benefits, they also found the PoLL approach to be more practical: cheaper and lower latency.
In prod 🔧
This definitely aligns with what we've seen at lytix, particularly for developers with relatively mature systems that are starting to see a lot of traction.
There are 2 more tactical benefits to breaking down your evaluation into distinct components:
- Balancing prioritisation, speed and cost
The reality is that not all elements you're evaluating for are equally important. Some qualities may be absolute show-stoppers - PII, names of competitors, data leaks, etc. You'd want these to be caught 100% of the time, and you'd want to do this as fast as possible. Cost may not be a concern, given how critical it is to get this right.
Others may be more subjective qualities, such as whether the tone and vocabulary of your output is in keeping with your brand. You may not need to catch 100% of these, but you may want to catch most of them. You may be willing to accept a certain level of failure, and you may be willing to wait longer for the evaluation to complete.
It's important to keep these trade-offs in mind, because the reality of instrumenting evaluators is the balance between reliability, speed and cost. By splitting your evaluator into distinct components, you can make trade-offs based on the priority of what you're evaluating.
For example - you may have a lightweight LLM-as-judge test, trained on your brand's tone and style, evaluating a certain % of your output. Since you need some reasoning, you might be okay using a larger model. And you might throttle cost by running it on a small % of your output.
You may have a faster, more lightweight model catching instances of "LLM laziness", refusals, or mentions of competitors. You may run this on 100% of your output, but use NLP models (ROUGE or BLEU) or regex, which are cheaper and faster. A sketch of this kind of tiered setup follows after this list.
- Clarity when iterating
Another benefit of splitting up your evaluations is more clarity when iterating on your system. It'll be clearer which aspects of your system you need to improve, and what kinds of errors are occurring. And as you make changes, you'll be able to validate that they've had the desired effect.
This is because if you're using one large model that checks for several criteria, it won't be immediately clear what specifically is bringing the score up or down. You may be able to get around this by adding an explainability component, but that still requires you to manually read through every log to understand trends.
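To make this concrete, here's a minimal sketch of the kind of tiered setup described above. The patterns, the sample rate, and the judge_tone stub are all hypothetical placeholders - swap in your own blocklists, budgets, and judge model. Because results are keyed by evaluator, it's also immediately clear which criterion is moving when you iterate.

```python
import random
import re

# Hypothetical patterns and thresholds - swap in your own blocklists and budgets.
COMPETITOR_PATTERN = re.compile(r"\b(acme|globex)\b", re.IGNORECASE)
REFUSAL_PATTERN = re.compile(r"\b(as an ai|i can't help with)\b", re.IGNORECASE)
LLM_JUDGE_SAMPLE_RATE = 0.05  # only send ~5% of outputs to the expensive judge


def cheap_checks(output: str) -> dict:
    """Deterministic checks run on 100% of outputs: fast, cheap, reliable."""
    return {
        "mentions_competitor": bool(COMPETITOR_PATTERN.search(output)),
        "looks_like_refusal": bool(REFUSAL_PATTERN.search(output)),
        "over_token_budget": len(output.split()) > 1000,  # crude conciseness proxy
    }


def judge_tone(output: str) -> int:
    """Stub for an LLM-as-judge call scoring brand tone from 1 to 5 (slow, costly).

    Replace this with a real call to whichever judge model you use.
    """
    return 3


def evaluate(output: str) -> dict:
    """Run every evaluator, keyed by name, so trends stay visible per criterion."""
    results = {"deterministic": cheap_checks(output)}
    if random.random() < LLM_JUDGE_SAMPLE_RATE:
        results["brand_tone"] = judge_tone(output)
    return results
```

The key point is the split: the deterministic checks gate every output, while the expensive judge is sampled and budgeted separately.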
Ok great - What other kinds of models can I use?
Here are a few common ones we've seen, with the pros and cons of each.
Proxies
These are code-driven, deterministic checks that apply a set of rules or functions to discrete datapoints, such as token counts or the presence of certain strings.
Examples:
- Evaluating model 'conciseness' via % of logs with token-size > 1000
- Evaluating the presence of competitor mentions or laziness via regex
Pros: Fast, cheap, reliable
Cons: Can be challenging to get right and iterate on; limited scope of what they can evaluate for, especially in more generative AI use cases
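As a small, hypothetical example of the first bullet above, a conciseness proxy can be a few lines of code. Whitespace tokenisation here is a stand-in for your model's real tokeniser.

```python
def pct_over_token_budget(outputs: list[str], budget: int = 1000) -> float:
    """Share of logged outputs exceeding a token budget - a rough conciseness proxy."""
    if not outputs:
        return 0.0
    over = sum(1 for text in outputs if len(text.split()) > budget)
    return over / len(outputs)
```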
Natural Language Algorithms
These are algorithms that use NLP techniques to evaluate more semantic or structural qualities of the language in your output. For example, models like ROUGE or BLEU can produce scores measuring your output's similarity to some reference. They can also measure more subjective qualities, like the 'readability' of your output for a given audience (using Flesch-Kincaid or Gunning-Fog scores).
Examples:
- Evaluating the proximity to a target style or tone via ROUGE or BLEU scores
- Evaluating how 'typical' your output is via a Perplexity score
Pros: Fast, can be cheap, lets you start to evaluate more 'subjective' qualities at scale
Cons: Can be tricky to iterate on; provides directional data at best, and is not always the most reliable or actionable (especially for high-priority errors)
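As an illustrative sketch (not a prescription), similarity and readability scores like these can be computed with off-the-shelf packages. This assumes the third-party rouge-score and textstat libraries, plus a reference answer you've written in your target style.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
import textstat                       # pip install textstat

# Example reference answer in your target style (hypothetical).
REFERENCE = "Our plans start at $20 per seat and include email support."


def similarity_to_reference(output: str, reference: str = REFERENCE) -> float:
    """ROUGE-L F1 between the output and a reference written in your target style."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, output)["rougeL"].fmeasure


def readability(output: str) -> dict:
    """Grade-level readability scores - directional signals, not hard gates."""
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(output),
        "gunning_fog": textstat.gunning_fog(output),
    }
```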
LLM as judge (or panels)
This involves using another LLM to evaluate your output. This typically starts as a single prompt passed to a large model. Depending on your use case, you may want to experiment with k-shot prompts or even fine-tune your own small language models as evaluators.
Pros: Can be very flexible; can evaluate subjective qualities and apply reasoning
Cons: Reliability is lower than more code-driven, discrete algorithms; expensive and slow
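Here's a rough sketch of what a small panel (PoLL-style) could look like. The model names and the call_judge function are placeholders for whichever judge models and client you actually use; the paper itself discusses different ways of pooling the panel's scores.

```python
from statistics import mean

JUDGE_PROMPT = (
    "You are grading a customer-support answer for brand tone.\n"
    "Reply with a single integer from 1 (off-brand) to 5 (on-brand).\n\n"
    "Answer:\n{output}"
)

# Hypothetical panel of smaller, diverse judge models.
PANEL = ["judge-model-a", "judge-model-b", "judge-model-c"]


def call_judge(model: str, prompt: str) -> int:
    """Placeholder: send the prompt to the given judge model and parse its score."""
    raise NotImplementedError("wire this up to your model provider of choice")


def panel_score(output: str) -> float:
    """Average the panel's scores; a median or majority vote also works."""
    prompt = JUDGE_PROMPT.format(output=output)
    return mean(call_judge(model, prompt) for model in PANEL)
```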
Human in the loop
This involves humans reviewing and rating model outputs to identify biases, inaccuracies, or low-quality responses. This can be great for validating domain-specific errors in your model's output, especially for niche knowledge bases and/or high-stakes industries.
Pros: Extremely reliable, great for niche knowledge bases
Cons: Challenging to scale, expensive, slow
Note - you can combine Human in the Loop with other methods to start building an automated and scalable LLMOps stack. Here's an example of using human-in-the-loop data to finetune your own LLM-as-judge model.
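As a hedged illustration of that combination, human review labels can be converted into training examples for your own judge model. The chat-style JSONL shape below is just one common format; the records and file name are hypothetical, so adapt them to whatever fine-tuning pipeline you use.

```python
import json

# Hypothetical human review records: a model output plus a reviewer's 1-5 score.
human_reviews = [
    {"output": "Sure - our Pro plan starts at $20 per seat.", "score": 5},
    {"output": "I'm sorry, I can't help with that.", "score": 1},
]

# Write chat-style fine-tuning examples for a judge model.
with open("judge_finetune.jsonl", "w") as f:
    for review in human_reviews:
        example = {
            "messages": [
                {"role": "system", "content": "Score this answer for quality from 1 to 5."},
                {"role": "user", "content": review["output"]},
                {"role": "assistant", "content": str(review["score"])},
            ]
        }
        f.write(json.dumps(example) + "\n")
```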
Happy hacking!
If you're still worried about shipping your LLM products without strong LLMOps, feel free to grab some time with us to chat through your concerns.
📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.