Handling Hallucinations - and other LLM error cases
Since ChatGPT was released in 2022, developers have been eager to bring the power of LLMs to their own products. It's typically easy enough to whip up a proof of concept that runs on your local machine, and these prototypes work well enough against a small number of test cases.
But when developers take these features live, they encounter a whole host of problems they were not expecting. With enough users hitting your product and using it in unexpected ways, LLMs are bound to perform unexpectedly. Your users may hit errors you haven't seen before, or have a subpar product experience because of an edge case you weren't aware of. Unfortunately, tooling to solve these problems has not developed as fast as the LLMs themselves have. This leaves developers scratching their heads, thinking of ways to keep tabs on their users' experience.
Lytix just went through Y Combinator's Winter 2024 batch, where over half the companies were building on AI. We've written this post based on our experience learning from and working with companies on the frontlines of building and scaling AI-based product experiences.
Here are a few ways to think about and diagnose LLM “Hallucinations”.
First - what is a hallucination?
In the most direct sense - hallucinations are responses from LLMs that are factually inaccurate. Depending on the model and subject matter, LLMs can hallucinate anywhere from 3% to 25% of the time.
But in a broader sense, we can think about hallucinations as the LLM providing responses that ignore user or system instructions. Tangibly, this may look like:
- Responses that are factually inaccurate
- Responses that ignore user or system prompts
- Responses formatted incorrectly (e.g. JSON that fails to parse - see the sketch after this list)
- Responses that breach guidelines or guardrails (e.g. "as a Large Language Model I cannot provide legal, health, insurance, etc. advice")
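To make the formatting failure concrete, here is a minimal Python sketch of how a malformed JSON response typically surfaces in application code; the `call_llm` helper and its hard-coded response are purely illustrative:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical helper standing in for a real completion call; here it
    # returns the kind of response that breaks strict JSON parsing.
    return 'Sure! Here is the JSON you asked for: {"name": "Ada", "age": 36}'

raw = call_llm("Return the user's name and age as a JSON object.")

try:
    data = json.loads(raw)
except json.JSONDecodeError:
    # The model ignored the formatting instruction (it wrapped the JSON in
    # prose), so downstream code can't use the response as-is.
    data = None  # fall back, retry, or surface a graceful error instead
```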
Unfortunately, handling these errors is not as easy as it was in the pre-LLM world. There are no unit tests you can run to see if your latest change produces more or fewer hallucinations. Traditional error or crash alerts are also not reliable at catching these kinds of LLM behaviours.
So what do teams do today? In our research, we found two approaches that teams have converged on:
- Manually logging as much as possible, and reviewing those logs by hand. For chatbots, this might mean logging each input from your users and each output from your bots, and stitching conversations together manually. For search-based tools, this might mean a report of each user query, what went on behind the scenes, and the final output.
- The downsides of this are obvious - ideally you wouldn't sink hours into manually reviewing each of your users' product sessions. And as your user base increases, this gets increasingly unscalable. It's also prone to human error - how do you know you haven't missed an interaction that deserves more attention?
- Relying on users to report errors as they find them. On paper, this has the advantage of being slightly more scalable - simply review the errors your users have reported. In practice, teams are moving away from this for a few reasons.
- First - ideally you would catch an error before it reaches the user, so you can handle it gracefully and protect your users' experience. Second - in our experience, users report very few errors. In most cases, they just stop using your product (or gradually use it less over time). So relying on user reports means you're only seeing a small percentage of the total errors your user base is facing.
So then what are some useful tools for reliably and gracefully handling LLM errors? Here are a few solutions we've seen teams have success with:
Development frameworks
Building your gen-AI features through a development framework can help limit the number of errors you'll hit to begin with. Frameworks are also an extremely beginner-friendly way to get started with LLMs. Frameworks can help reduce hallucinations in a few ways:
- Built-in observability: typically, these platforms have built-in observability and evaluation tools. These can help you keep tabs on what your LLMs are doing, and assess each output against a set of vetted metrics. Vellum, for example, gives developers metrics like semantic similarity (is the LLM output similar in meaning to a given 'target' or ground truth?) and JSON validity (to confirm the LLM output matches a predefined shape).
- This can help you identify what kinds of errors your users are facing most frequently, and iterate accordingly.
- Reliable output parsing: oftentimes, you need your LLM to return a response in a particular format (e.g. JSON, .csv files, pandas dataframes, etc.), based on how you're handling the response downstream. In theory, this should be as simple as asking your LLM to "return the response in (your desired format)". In practice, we see that getting LLMs to do this reliably takes some seriously savvy prompt engineering. As you use more of your context window, tight prompt engineering gets even more important.
Instead of wasting your time fiddling with prompts, platforms like Langchain let you use their output parsers right off the shelf, while they handle all the prompt engineering under the hood. This can collapse 10-20 lines of code into just 3-4 lines, with prompts that produce reliable results. These parsers often have 'retry' logic built in, which automatically re-attempts to parse the output when it hits an error. This lets developers focus on building a great product, and not waste time trying to win an argument with an LLM.
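As a concrete illustration, here is a minimal sketch of the output-parsing pattern using LangChain's PydanticOutputParser. The SupportTicket schema and the hard-coded model response are illustrative, and exact import paths can differ between LangChain versions:

```python
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Illustrative schema - replace with whatever shape your product needs.
class SupportTicket(BaseModel):
    category: str = Field(description="one of: billing, bug, feature_request")
    summary: str = Field(description="a one-sentence summary of the issue")

parser = PydanticOutputParser(pydantic_object=SupportTicket)

# The parser writes the formatting portion of the prompt for you...
prompt = (
    "Classify the following support message.\n"
    f"{parser.get_format_instructions()}\n"
    "Message: My invoice was charged twice this month."
)

# ...and validates/structures whatever the model sends back. The string below
# stands in for a real model response.
llm_output = '{"category": "billing", "summary": "Customer was double-charged."}'
ticket = parser.parse(llm_output)  # raises if the output doesn't match the schema
print(ticket.category)  # "billing"
```

Wrapping a parser like this in LangChain's OutputFixingParser is one way to get the 'retry on parse failure' behaviour mentioned above; that variant needs a live LLM to ask for a corrected response, so it is omitted from this sketch.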
Cons:
While frameworks can be a great way to get your foot in the door, we've found developers quickly outgrow them as their needs evolve and get more complex. Developers may find themselves over-engineering their way around opinionated frameworks to get what they want. Folks who are familiar with working with LLMs may find frameworks to be too abstracted and awkward to work with. Additionally, the base models are always expanding their tooling, reducing the need for certain framework features (for example, OpenAI now lets users receive their outputs as JSON objects of a particular shape).
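For instance, the JSON-output feature mentioned above looks roughly like the following (a sketch using the openai Python client; the model name is illustrative and you need an API key for it to actually run):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any JSON-mode-capable model works
    # Ask the API to constrain the output to valid JSON. Note that the prompt
    # itself must still mention JSON explicitly.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object containing 'category' and 'summary'."},
        {"role": "user", "content": "My invoice was charged twice this month."},
    ],
)
print(response.choices[0].message.content)  # valid JSON, no prose wrapper
```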
Recommended tools: Langchain, Vellum
Evaluation metrics
The open source community has built some very useful (and free) solutions that developers can use without adopting a development framework. These are machine-learning or LLM-powered techniques that provide metrics representing the quality of an LLM's output along a particular dimension. For example, the BLEU score (bilingual evaluation understudy) measures how closely an LLM's output matches a reference text. Similarly, toxicity models on HuggingFace provide a numerical score representing how toxic a given response is.
OSS eval tools can help developers in a couple ways:
- Real-time checks to prevent users from ever seeing an undesirable experience. These libraries are often really flexible and easy to work with, so developers can bake them into their product experiences (for example, if an LLM's response scores poorly, try again until you receive a response that passes - a minimal sketch of this pattern follows this list).
- Augmenting their existing dashboards with a custom set of metrics they care about. This can give your team an 'observability platform' comparable to what you'd get with a development framework, entirely for free!
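Here is a minimal sketch of the real-time gating pattern from the list above, using the toxicity measurement from HuggingFace's open-source evaluate library rather than BLEU (toxicity needs no reference text). The call_llm helper, threshold, and retry count are all illustrative assumptions:

```python
import evaluate

# Loads an open-source toxicity classifier via the `evaluate` library.
toxicity = evaluate.load("toxicity", module_type="measurement")

def call_llm(prompt: str) -> str:
    # Hypothetical helper standing in for your real completion call.
    return "Sure - here's a friendly summary of your account activity."

def safe_completion(prompt: str, max_attempts: int = 3, threshold: float = 0.5) -> str | None:
    for _ in range(max_attempts):
        candidate = call_llm(prompt)
        score = toxicity.compute(predictions=[candidate])["toxicity"][0]
        if score < threshold:
            return candidate  # passes the gate; safe to show the user
    return None  # every attempt failed the check; fall back gracefully

print(safe_completion("Summarise my account activity."))
```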
Recommended tools: HuggingFace, WhyLabs
There are also OSS monitoring and eval platforms, which give developers many of these metrics right out of the box (instead of manually configuring the ones you want). While some of these are not free, they may be worth the saved time.
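As a rough sketch of how drop-in these platforms aim to be, here is what a proxy-style integration looks like with the openai client, based on our understanding of Helicone's documented setup; treat the base URL and header as assumptions and verify them against the current docs before relying on this:

```python
import os
from openai import OpenAI

# Point the OpenAI client at the observability provider's gateway instead of
# api.openai.com. The base URL and auth header below reflect our understanding
# of Helicone's documented setup and may change - check their docs.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

# From here on, calls look exactly like normal OpenAI calls, but every
# request/response pair is logged to the observability dashboard.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```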
Recommended tools: Helicone, Phosphor
Regex
While not as programmatic as some of the other approaches, simple Regex matching can help identify particular edge cases your product might care about, but that may not be universal enough for the open source community to have built a solution for. We've seen developers get surprisingly far with simple regex functions, catching common phrases or words that represent failure cases. For example, teams building in more regulated industries (such as health, insurance or law) may find that they frequently get LLM 'refusals' (i.e. "as a large language model I cannot advise on legal/health related matters"), due to safety guardrails. While manual, this approach can give you even finer control over your users' experience than the open source evaluation tools provide.
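Here is a small sketch of what this can look like in practice. The refusal phrases below are illustrative; in a real product you would grow the pattern list from your own logs:

```python
import re

# Illustrative refusal phrases - grow this list from your own logs over time.
REFUSAL_PATTERNS = re.compile(
    r"(as a (large )?language model"
    r"|i (cannot|can't|am unable to) (provide|give|offer) "
    r"(legal|medical|health|insurance|financial) advice"
    r"|i'm sorry, but i can't)",
    re.IGNORECASE,
)

def looks_like_refusal(llm_output: str) -> bool:
    # Returns True if the output matches any known refusal phrase.
    return REFUSAL_PATTERNS.search(llm_output) is not None

print(looks_like_refusal("As a large language model, I cannot provide legal advice."))  # True
print(looks_like_refusal("Here is a summary of the policy you asked about."))           # False
```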
Cons:
- Regex is notoriously finicky to work with.
- Especially in the beginning, it may take several attempts to land on the right strings to match for your failure cases.
- You're still relying on manual logs or user reports to find the strings that represent failure cases. This means users will still hit failure cases you have not implemented regex matching for.
Recommended tools: regex101.com
Conclusion
There's no doubt LLMs are a powerful addition to any developer's toolkit. But this is also a brand new space, with very few best practices and very little tooling. While manually logging and reviewing your users' sessions may work initially, you'll find this quickly gets out of hand as your number of users and sessions increases. We hope these tips help developers bring their ideas to production faster, and build with more confidence and reliability. If you need some help thinking through alerting for your gen-AI applications, feel free to contact our team for a free consultation! founders@lytix.co
Good luck, and happy building! 🛠️👨‍💻
📣 If you are building with AI/LLMs, please check out our project lytix. It's an observability + evaluation platform for all the things that happen in your LLM.
📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.