Custom Multi-Modal Evaluations

Sid Premkumar

At lytix we've been exploring the use of LLMs for evaluating LLMs. We've found success with the LLM-as-a-judge approach, where we prompt one LLM to evaluate another LLM's output. We're happy to launch our latest feature, which expands this evaluation to multi-modal inputs (video + images).

LLM-as-a-Judge

To give the reader some context, let's define what LLM-as-a-judge means. The approach uses a second LLM to evaluate the output of a primary LLM interaction. There are some clear benefits and drawbacks to this approach:


Benefits

  1. Simplicity: It's relatively easy to implement. There's no need for a golden dataset or 'source of truth'; you can just use natural language to define what good means to you.
  2. Flexibility: This approach works across any domain or task. Since it's natural language, you can define what good means in your specific use case rather than having to find the right dataset.

Drawbacks

  1. Subjective: LLM evaluations are inherently subjective, since they are based on natural language. This can be a pro or con depending on what you are trying to achieve. If you want an evaluation that is highly aligned with human preferences, this is a good approach. However, if you want an evaluation that is more objective, this may not be the best approach.
  2. Cost: This approach can be more expensive than alternatives, since every evaluation adds an extra LLM call (the judge) on top of the original interaction.

Despite these drawbacks, we've found that the LLM-as-a-judge approach is a powerful tool for evaluating LLMs, and it's the foundation of the multi-modal support we're launching today.
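To make the pattern concrete, here is a minimal sketch of a text-only judge, assuming the official OpenAI Python SDK. The judge model, rubric, and helper names are illustrative placeholders, not lytix's actual implementation:

```python
import json

from openai import OpenAI  # assumes the openai>=1.0 SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A natural-language rubric: you define what "good" means, no golden dataset needed.
RUBRIC = """Score the assistant's answer from 1 (poor) to 5 (excellent) on:
- Accuracy: is the answer factually correct?
- Tone: is it polite and professional?
Respond with JSON: {"score": <int>, "reason": "<one sentence>"}"""


def judge(user_prompt: str, model_output: str) -> dict:
    """Ask a judge model to grade a primary model's output against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt: {user_prompt}\n\nAnswer: {model_output}"},
        ],
    )
    return json.loads(response.choices[0].message.content)


print(judge("What is the capital of France?", "Paris, of course!"))
# e.g. {'score': 5, 'reason': 'The answer is correct and friendly.'}
```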

Lytix Custom Evaluations

At lytix, we've found that evaluating even a percentage of your production traffic gives you a very good signal on what's working and what's not. Not only that, it surfaces edge cases in your system that you may never have considered before.
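One simple way to implement that sampling is sketched below. It assumes a `judge()` helper like the one above; the sample rate and stub functions are illustrative, not the lytix SDK:

```python
import random

SAMPLE_RATE = 0.05  # evaluate 5% of traffic; tune to your volume and budget


def call_primary_llm(prompt: str) -> str:
    # Stand-in for your existing primary LLM call.
    return "Paris, of course!"


def log_evaluation(prompt: str, output: str, verdict: dict) -> None:
    # Stand-in for persisting results to your observability store.
    print(prompt, output, verdict)


def handle_request(user_prompt: str) -> str:
    model_output = call_primary_llm(user_prompt)
    if random.random() < SAMPLE_RATE:
        # Reuses the judge() helper from the sketch above; in production you
        # would typically run this asynchronously so it never blocks the user.
        verdict = judge(user_prompt, model_output)
        log_evaluation(user_prompt, model_output, verdict)
    return model_output
```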

Here is a quick demo of what this looks like:

To summarize: this approach allows you to quantify what would otherwise be a subjective interaction. You bring your own domain knowledge and priorities to the evaluation, and lytix takes care of the rest.

Multi-Modal LLM-as-a-Judge

A huge benefit of this approach is that it translates naturally to multi-modal interactions as well, and we're happy to announce that we've added support for exactly that.

Just as with text interactions, you can do the same with video or image inputs. The video or image is fed into the LLM-as-a-judge alongside the rest of the interaction, and the judge evaluates it against your custom rubric.
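As a sketch of what a multi-modal judge call can look like (again illustrative rather than lytix's internals), a vision-capable model receives the image alongside the rubric and the output under evaluation:

```python
import json

from openai import OpenAI

client = OpenAI()

IMAGE_RUBRIC = """You are judging an image-captioning model.
Score the caption from 1 (poor) to 5 (excellent) for how accurately
it describes the attached image.
Respond with JSON: {"score": <int>, "reason": "<one sentence>"}"""


def judge_image_caption(image_url: str, caption: str) -> dict:
    """Grade a caption against the actual image with a vision-capable judge."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for any vision-capable judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": IMAGE_RUBRIC},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Caption to evaluate: {caption}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Video inputs are commonly handled the same way by sampling a handful of frames and passing them as multiple images.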

We're excited to see what people build with this, and we truly believe it's the fastest way to iterate while still shipping high-quality LLM applications.


📣 If you are building with AI/LLMs, please check out our project lytix. It's an observability + evaluation platform for everything that happens in your LLM application.

📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.
