Building (+ evaluating) a multi-agent e-commerce bot, powered by Stripe
Last week, Stripe released its tools for AI agents. These let developers give their AI agents access to financial tools, such as creating payment links to help customers pay, or virtual cards so agents can make business purchases themselves. This unlocks a whole set of use cases: e-commerce agents that can not only answer users' questions about products but also process the final transaction (via a payment link), and procurement agents that can not only find the right vendors for your use case but actually buy a subscription (using a virtual credit card).
On paper, this is a huge step toward making agents ready for production. But like most advancements in AI, the devil is in the details when it comes to usefulness in real-world settings. The very blog post announcing Stripe’s new tools calls this out:
Now that agents can move money, strong observability and evaluations are critical. You’d hate to upset your customers by overcharging them for a purchase. Or even worse - have your agent authorize a payment that’s way out of budget!
Here, I’ll take the Stripe agentic tools for a spin, creating an e-commerce agent that can not only answer questions about products but also process the final transaction. I’ll also go over how I’d think about setting up an evaluations layer on top of that to catch the most mission-critical errors.
First - getting a Stripe agent that can create a payment link
Stripe's Quickstart guide goes through a few examples for getting started with their agent toolkit. I went ahead with the LangChain/LangGraph examples. Here’s how you can set up an agent to use Stripe’s tools.
# setting up an agent with the Stripe tools
import os

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from stripe_agent_toolkit.langchain.toolkit import StripeAgentToolkit

llm = ChatOpenAI(
    model="gpt-4o-mini",
    # Route requests through the lytix proxy so I have observability
    base_url="https://api.lytix.co/proxy/v1/openai",
    # Update your env file to use your own API keys
    api_key=os.getenv("LYTIX_API_KEY"),
    default_headers={
        # Move your OpenAI key to the default headers
        "openaiKey": os.getenv("OPENAI_API_KEY")
    },
)

# initialize the Stripe agent toolkit with only the actions it needs
stripe_agent_toolkit = StripeAgentToolkit(
    secret_key=os.getenv("STRIPE_SECRET_KEY"),
    configuration={
        "actions": {
            "payment_links": {
                "create": True,
            },
            "products": {
                "create": True,
            },
            "prices": {
                "create": True,
            },
        }
    },
)

tools = []
tools.extend(stripe_agent_toolkit.get_tools())

# using the LangGraph framework to create my agent
langgraph_agent_executor = create_react_agent(llm, tools)
That lets you programmatically create payment links using prompts like this:
input_state = {
    "messages": """
        Create a payment link for a new product called 'test' with a price
        of $100. Come up with a funny description about buy bots,
        maybe a haiku.
    """,
}

output_state = langgraph_agent_executor.invoke(input_state)

# The last message holds the agent's final answer
print(output_state["messages"][-1].content)
Outputs something like this:
I've created a new product called "test" with a funny description about buy bots:
**Description:**
Buy bots in a box,
They'll dance and sing for you,
But watch out for bugs!
The price is set at $100.
You can access the payment link [here](https://buy.stripe.com/4gw8wI8ed7DCfLy14k).
(and here’s what the payment link looks like)
Next, integrating it into a multi-agent framework
Next, I wanted to try a more realistic use case. I set up one agent that only answers questions for prospective customers. I set up another, using the Stripe agent, to create payment links from a given product catalog. I made a third agent to route between them (similar to the pattern OpenAI describes in its best practices and showcases in Swarm).
from flask import jsonify


# The original snippet is the body of a Flask view; `handle_query` is a
# placeholder name and `user_query` is the visitor's message.
def handle_query(user_query):
    router_prompt = f"""
Given the following user query, determine which agent would be best suited to handle it:
Query: {user_query}
Available agents:
- Stripe Agent: For visitors who are ready to make a purchase. Here are the products and prices: black t-shirts are $20, white t-shirts are $25, and blue t-shirts are $30. Use this to make a payment link, depending on what product the user wants.
- New order agent: For new visitors asking questions about products before they buy. Based on this information: all shirts come in small, medium, and large. black t-shirts are $20, white t-shirts are $25, and blue t-shirts are $30.
Respond with just the agent name, nothing else.
"""
    routing_decision = llm.invoke(router_prompt).content.strip()

    # Handle routing and return response
    if "stripe" in routing_decision.lower():
        output_state = langgraph_agent_executor.invoke({
            "messages": "You are an ecommerce agent, for visitors who are ready to make a purchase. Here are the products and prices: black t-shirts are $20, white t-shirts are $25, and blue t-shirts are $30. Use this to make a payment link, depending on what product the user wants. User Query - " + user_query
        })
        response = output_state["messages"][-1].content
    elif "new order" in routing_decision.lower():
        response = llm.invoke(
            "You are an ecommerce agent, for new visitors asking questions about products before they buy. Answer questions based on this information: all shirts come in small, medium, and large. black t-shirts are $20, white t-shirts are $25, and blue t-shirts are $30. User Query - " + user_query
        ).content
    else:
        response = "Sorry, I don't have an agent available for that query."
    return jsonify({"response": response, "agent": routing_decision})
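The substring checks in the routing code work as long as the router replies with something close to an agent name. Here is a minimal sketch of a stricter alternative, assuming you would rather map the reply onto a fixed set of agents and fall back explicitly when the reply is ambiguous. `Agent` and `parse_routing_decision` are hypothetical names for illustration, not part of LangChain or Stripe's toolkit.

```python
from enum import Enum


class Agent(Enum):
    # Values mirror the agent names used in the router prompt, lowercased.
    STRIPE = "stripe agent"
    NEW_ORDER = "new order agent"
    UNKNOWN = "unknown"


def parse_routing_decision(raw: str) -> Agent:
    """Map the router model's free-text reply onto a known agent."""
    # Lowercase and drop punctuation or quotes the model may wrap around the name.
    cleaned = "".join(ch for ch in raw.lower() if ch.isalnum() or ch.isspace())
    cleaned = " ".join(cleaned.split())
    matches = [a for a in (Agent.STRIPE, Agent.NEW_ORDER) if a.value in cleaned]
    # Exactly one match is a clear decision; zero or two is treated as unknown.
    return matches[0] if len(matches) == 1 else Agent.UNKNOWN


print(parse_routing_decision("Stripe Agent"))
print(parse_routing_decision('"New order agent."'))
print(parse_routing_decision("Either Stripe Agent or New order agent"))
```

Treating an ambiguous reply as unknown keeps a confused router from silently creating a payment link, and the unknown case is easy to count in your logs.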
Here's how the multi-agent setup works in action:
Finally, setting up my evaluations.
I see three evaluation agents I need to set up:
- Given the prompt, was the right agent chosen?
- For the new product agent (the "New order agent" in the router prompt) - did it answer questions well?
- For the Stripe agent - did it create the payment links with the right information (price, item information, etc.)?
And since this is mission-critical, I’ll set up a custom alert so I can stay on top of any instances where the payment link information was wrong.
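The check behind an alert like this does not have to be another model. Here is a minimal sketch of a deterministic version, assuming the parameters the Stripe agent used for the payment link (product name, unit price in cents, quantity, total) can be read from your traces or tool-call logs. `CATALOG_CENTS`, `LinkDetails`, and `check_payment_link` are hypothetical names for illustration.

```python
from dataclasses import dataclass

# Catalog from the router prompt, in cents to avoid float rounding.
CATALOG_CENTS = {
    "black t-shirt": 2000,
    "white t-shirt": 2500,
    "blue t-shirt": 3000,
}


@dataclass
class LinkDetails:
    product: str
    unit_amount_cents: int
    quantity: int
    total_cents: int


def check_payment_link(link: LinkDetails) -> list[str]:
    """Return a list of problems; an empty list means the link looks right."""
    expected = CATALOG_CENTS.get(link.product.lower())
    if expected is None:
        return [f"unknown product: {link.product!r}"]
    problems = []
    if link.unit_amount_cents != expected:
        problems.append(f"unit price {link.unit_amount_cents} != catalog {expected}")
    if link.quantity < 1:
        problems.append(f"invalid quantity: {link.quantity}")
    if link.total_cents != expected * link.quantity:
        problems.append(f"total {link.total_cents} != {expected * link.quantity}")
    return problems


good = LinkDetails("Black t-shirt", 2000, 2, 4000)
bad = LinkDetails("Blue t-shirt", 2000, 1, 2000)
print(check_payment_link(good))
print(check_payment_link(bad))
```

Any non-empty result is a candidate for an alert, whichever tool you use to deliver it.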
Great, now if my payment link information is ever wrong, I'll instantly get an alert.
Here's how you can use your evaluators in action. Here, I can see a case where the agent router may have chosen the wrong agent: the user was asking to make a purchase, but the router chose the new product agent when it should have routed to the Stripe agent to create a new payment link.
Great! Now I have my tracking in place. Here’s how I would monitor each evaluator agent over time and iterate accordingly.
1. Agent router
If I find that the agent router is choosing the wrong agent too often, I know I need to improve my routing logic. K-shot prompting is really helpful here. Not only has it been shown to improve prompt reliability in general, but it's also a helpful framework for improving your routing logic over time. As you identify edge cases that trip up your routing logic, you can add them to your prompt (alongside the correct routing decision).
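A minimal sketch of what that could look like, assuming the labeled examples live in a plain list that you extend as misroutes show up. `ROUTER_INSTRUCTIONS` is a shortened stand-in for the full router prompt, and the example queries and the `build_router_prompt` helper are hypothetical.

```python
ROUTER_INSTRUCTIONS = (
    "Given the following user query, determine which agent would be best "
    "suited to handle it. Respond with just the agent name, nothing else.\n"
    "Available agents: Stripe Agent, New order agent."
)

# Hypothetical labeled examples; add each misrouted edge case with its correct agent.
ROUTING_EXAMPLES = [
    ("What sizes do the shirts come in?", "New order agent"),
    ("I'd like to buy two black t-shirts.", "Stripe Agent"),
    ("Ok, I'll take the blue one. How do I pay?", "Stripe Agent"),
]


def build_router_prompt(user_query: str, examples=ROUTING_EXAMPLES) -> str:
    """Prepend k labeled examples to the query so the model can imitate them."""
    shots = "\n\n".join(f"Query: {q}\nAgent: {a}" for q, a in examples)
    return f"{ROUTER_INSTRUCTIONS}\n\n{shots}\n\nQuery: {user_query}\nAgent:"


print(build_router_prompt("Can I get a white shirt in medium?"))
```

Ending the prompt with `Agent:` nudges the model to complete the pattern with just the agent name, in the same format as the examples.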
2. New product agent
This evaluator tracks the accuracy of the content delivered by the new product agent. Use cases like this usually go wrong in one of two ways. Either the agent doesn't answer questions using the information provided, which can look like an LLM refusal (e.g. "As a model, I'm not sure xyz"), or the agent confidently adds details that were not in the initial information.
Depending on what you're seeing, you could iterate in the following ways:
- limiting your model's reasoning and scope for responses (via parameters like temperature and top_p) or using smaller models
- using techniques like RAG to give your agent access to more information
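To tell which of the two failure modes you're seeing, a cheap heuristic pass can run before (or alongside) a model-based evaluator. The sketch below is an assumption of mine, not part of any evaluation product: it flags refusal-like phrasing and any dollar amount in the answer that does not appear in the provided information. `REFUSAL_MARKERS` and `flag_answer` are hypothetical names, and the price comparison is purely textual, so "$20.00" would not match "$20".

```python
import re

PRODUCT_INFO = (
    "All shirts come in small, medium, and large. Black t-shirts are $20, "
    "white t-shirts are $25, and blue t-shirts are $30."
)

# Hypothetical refusal markers; tune these to what you see in your own logs.
REFUSAL_MARKERS = ("as a model", "as an ai", "i'm not sure", "i cannot")

# Matches amounts such as $20 or $19.99.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")


def flag_answer(answer: str, info: str = PRODUCT_INFO) -> list[str]:
    """Return heuristic flags for an answer; an empty list means nothing stood out."""
    flags = []
    lowered = answer.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        flags.append("possible refusal")
    known_prices = set(PRICE_PATTERN.findall(info))
    invented = sorted(set(PRICE_PATTERN.findall(answer)) - known_prices)
    if invented:
        flags.append(f"prices not in the provided information: {invented}")
    return flags


print(flag_answer("Blue t-shirts are $30 and come in small, medium, and large."))
print(flag_answer("As a model, I'm not sure about pricing."))
print(flag_answer("Blue t-shirts are $30, or $45 with express shipping."))
```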
3. Stripe agent
This evaluator tracks the accuracy of the payment links created by the Stripe agent. Were details like the units purchased, per-unit price, and final amount correct?
You could iterate on this part of your flow in a few ways:
- Splitting your agent into multiple models: one that pre-processes the user's query to ensure we have all the information we need, and another that creates the payment link.
- Adding a model after the Stripe agent to double-check that the information is correct before sending it to the user.
- Switching to a smaller model, to limit the scope of what the agent can do.
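The pre-processing idea can be sketched as a gate in front of the Stripe agent. This assumes an upstream model has already extracted the order from the user's query into a dict; the field names and the `missing_order_fields` helper are hypothetical, and the valid products and sizes come from the catalog in the router prompt.

```python
REQUIRED_FIELDS = ("product", "size", "quantity")
VALID_PRODUCTS = {"black t-shirt", "white t-shirt", "blue t-shirt"}
VALID_SIZES = {"small", "medium", "large"}


def missing_order_fields(order: dict) -> list[str]:
    """Return the fields that are absent or invalid in an extracted order."""
    # Absent or empty fields first, in a fixed order.
    missing = [f for f in REQUIRED_FIELDS if not order.get(f)]
    # Present but invalid values are reported under the same field name.
    if order.get("product") and order["product"].lower() not in VALID_PRODUCTS:
        missing.append("product")
    if order.get("size") and order["size"].lower() not in VALID_SIZES:
        missing.append("size")
    quantity = order.get("quantity")
    if quantity and (not isinstance(quantity, int) or quantity < 1):
        missing.append("quantity")
    return missing


order = {"product": "Blue t-shirt", "quantity": 2}
gaps = missing_order_fields(order)
if gaps:
    print(f"Ask the user for: {', '.join(gaps)}")
else:
    print("Order is complete; hand it to the Stripe agent.")
```

Only orders with no gaps reach the agent that can create payment links, so a vague request turns into a follow-up question instead of a guessed price or quantity.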
Happy Hacking!
Stripe’s agentic toolkit offers even more functionality than I went through here. Here’s their blog post going through more of their tools, and here’s the GitHub link to their Python and TypeScript libraries if you want to play around yourself.
📖 If you liked this, sign up for our newsletter below. You can check out our other posts here. 📣 If you are building with AI/LLMs, please check out our project lytix. It's an observability + evaluation platform for all the things that happen in your LLM.
© lytix