Productionising GenAI Agents: Evaluating Tool Selection with Automated Testing


Introduction

Generative AI Agents are changing the landscape of how businesses interact with their users and customers. From personalised travel search experiences to virtual assistants that simplify troubleshooting, these intelligent systems help companies deliver faster, smarter, and more engaging interactions. Whether it's Alaska Airlines reimagining customer bookings or ScottsMiracle-Gro offering tailored gardening advice, AI agents have become essential.

However, deploying these agents in dynamic environments brings its own set of challenges. Frequent updates to models, prompts, and tools can unexpectedly disrupt how these agents operate. In this blog post, we'll explore how businesses can navigate these challenges to ensure their AI agents remain reliable and effective.

What is this blog post about?

This post focuses on a practical framework for one of the most crucial tasks in getting GenAI agents into production: ensuring they can select tools effectively. Tool selection is at the heart of how generative AI agents perform tasks, whether retrieving weather data, translating text, or handling error cases gracefully.

We'll introduce a testing framework designed specifically for evaluating GenAI agents' tool selection capabilities. This framework includes datasets for various scenarios, robust evaluation methods, and compatibility with leading models such as Gemini and OpenAI's GPT models. By exploring this approach, you'll gain actionable insights into how to test, refine, and confidently deploy GenAI agents in dynamic production environments.
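To make the idea concrete, here is a minimal sketch of what a single entry in such a test dataset might look like. The field names (`user_query`, `expected_tool`, `expected_params`) are illustrative assumptions, not the framework's actual schema:

```python
# Hypothetical test case; field names are illustrative assumptions,
# not the framework's actual schema.
weather_test_case = {
    # The user input fed to the agent under test
    "user_query": "What's the weather in Tokyo tomorrow?",
    # The tool the agent is expected to select
    "expected_tool": "get_weather",
    # The arguments the agent is expected to extract from the query
    "expected_params": {"location": "Tokyo", "date": "tomorrow"},
}
```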

All the code for this framework can be found on GitHub!

Why should we care?

In production, even the most advanced GenAI agents are only as good as their ability to pick and use the right tools for the task at hand. If an agent fails to call the correct API for weather information or mishandles an unsupported request, it can undermine user trust and disrupt business operations.

Tool selection is central to an agent's functionality, but it's also highly sensitive to model updates and prompt changes. Without rigorous testing, even minor tweaks can introduce regressions, causing agents to fail in unpredictable ways.

That is why a structured testing framework is critical. It allows businesses to detect issues early, validate changes systematically, and ensure that their agents remain reliable, adaptable, and robust – no matter how the underlying components evolve. For companies looking to deploy AI agents at scale, investing in such a framework is essential for long-term success.


What are GenAI agents?

GenAI agents are systems powered by large language models (LLMs) that can perform actions – not just generate text. They process natural language inputs to understand user intentions and interact with external tools, APIs, or databases to accomplish specific tasks. Unlike traditional AI systems with predefined rules, GenAI agents dynamically adapt to new contexts and user needs.

At their core, these agents combine natural language understanding with functional execution. This makes them highly versatile, whether they're responding with a direct answer, requesting clarification, or calling an external service to complete a task.

Real-world use cases

GenAI agents are already transforming industries, proving their value across a wide range of applications. Here are some examples, taken directly from Google's blog post:

  1. Customer Support: Alaska Airlines is developing natural language search, providing travelers with a conversational experience powered by AI that's akin to interacting with a knowledgeable travel agent. This chatbot aims to streamline travel booking, enhance customer experience, and reinforce brand identity.
  2. Automotive Assistance: Volkswagen of America built a virtual assistant in the myVW app, where drivers can explore their owners' manuals and ask questions, such as, "How do I change a flat tire?" or "What does this digital cockpit indicator light mean?" Users can also use Gemini's multimodal capabilities to see helpful information and context on indicator lights simply by pointing their smartphone cameras at the dashboard.
  3. E-commerce: ScottsMiracle-Gro built an AI agent on Vertex AI to provide tailored gardening advice and product recommendations for consumers.
  4. Healthcare: HCA Healthcare is testing Cati, a virtual AI caregiver assistant that helps to ensure continuity of care when one caregiver shift ends and another begins. They are also using gen AI to improve workflows on time-consuming tasks, such as clinical documentation, so physicians and nurses can focus more on patient care.
  5. Banking: ING Bank aims to offer a superior customer experience and has developed a gen AI chatbot for workers to enhance self-service capabilities and improve answer quality on customer queries.

These examples show how GenAI agents are becoming central to improving productivity, automating workflows, and delivering highly personalised user experiences across industries. They are no longer just supporting systems – they're active participants in business operations.

How GenAI agents work

GenAI agents operate by combining natural language understanding with task execution, enabling them to perform a variety of actions based on user queries. When a user inputs a request, the agent determines the intent behind it and decides on the appropriate course of action. This may involve directly responding using its internal knowledge, asking for clarification if key details are missing, or taking an action via an external tool.

The agent's workflow is dynamic and highly context-aware. For instance, if the query requires accessing real-time data or performing a calculation, the agent will usually integrate with external tools or APIs. If a request is ambiguous, like "Book me a table," it may prompt the user to specify details like the restaurant or time before proceeding.

Once the agent has figured out how to act, it either generates a natural language response or prepares inputs for tool execution. After completing the task, the agent processes the results to deliver an output that's clear, actionable, and aligned with the user's intent.

This entire flow, from understanding user intent to executing tasks, makes GenAI agents capable of handling complex, multi-step interactions in a natural and user-friendly manner.
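The sketch below illustrates this flow in miniature. It assumes a hypothetical `Decision` object representing the agent's choice for a single turn and a stub `get_weather` tool; a real agent would derive the decision from an LLM call and route tool names to actual API clients:

```python
from dataclasses import dataclass, field

# Stub tool standing in for a real weather API client (assumption).
def get_weather(location: str, date: str) -> dict:
    return {"location": location, "date": date, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

@dataclass
class Decision:
    kind: str                  # "answer", "clarify", or "tool_call"
    content: str = ""          # direct answer or clarification question
    tool_name: str = ""
    tool_args: dict = field(default_factory=dict)

def handle_turn(decision: Decision) -> str:
    """Dispatch a single agent turn: answer, clarify, or call a tool."""
    if decision.kind in ("answer", "clarify"):
        return decision.content
    # Invoke the selected tool and turn its structured output into prose.
    result = TOOLS[decision.tool_name](**decision.tool_args)
    return f"The forecast for {result['location']} {result['date']} is {result['forecast']}."

print(handle_turn(Decision(kind="tool_call", tool_name="get_weather",
                           tool_args={"location": "Tokyo", "date": "tomorrow"})))
```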

Tool selection for GenAI agents

Tool selection is one of the most critical capabilities for GenAI agents, enabling them to bridge user inputs with external functions to perform tasks effectively. The process involves identifying the most suitable tool based on the query's intent and the agent's repository of tools. For instance, a request like "Translate this text into French" prompts the agent to select a translation tool, while "Set a reminder for tomorrow at 3 PM" would call a calendar tool.

Once the tool is selected, the agent extracts the relevant parameters from the query and formats them according to the tool's specifications. For example, in a weather-related query like "What's the weather in Tokyo tomorrow?", the agent identifies "Tokyo" as the location and "tomorrow" as the date, then structures these inputs for the weather API. After invoking the tool, the agent processes the response to ensure it meets the user's expectations. Structured data like JSON is transformed into natural language, and errors such as invalid inputs or unavailable data are communicated to the user, often with suggestions for refinement.
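In practice, tools are typically described to the model as structured function declarations. The snippet below sketches how the weather tool might be declared, using the JSON-schema style shared by OpenAI's and Gemini's function-calling APIs; the exact field names vary by provider, and `get_weather` is an illustrative name:

```python
# Illustrative function declaration for the weather tool, in the
# JSON-schema style used by most function-calling APIs.
get_weather_declaration = {
    "name": "get_weather",
    "description": "Get the weather forecast for a location on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            "date": {"type": "string", "description": "Forecast date, e.g. 'tomorrow'"},
        },
        "required": ["location", "date"],
    },
}

# For "What's the weather in Tokyo tomorrow?", a correct selection yields:
expected_call = {
    "name": "get_weather",
    "arguments": {"location": "Tokyo", "date": "tomorrow"},
}
```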

By dynamically selecting the right tool and handling outputs precisely, the agent ensures it can execute tasks accurately and efficiently. This capability is foundational to its ability to deliver seamless user experiences.

The importance of tool selection

Tool selection is what differentiates GenAI agents from simple conversational chatbots and allows us to build powerful, action-oriented systems. Understanding user queries and generating responses is essential, but identifying and utilising the correct tools ensures agents can take action in real-world scenarios. Missteps, such as choosing the wrong tool or poorly formatting inputs, can frustrate users and make them lose trust in the agent's capabilities. To be truly effective, tool selection must be robust, precise, and adaptable. It's this mechanism that ensures GenAI agents are not only responsive but genuinely capable of accomplishing tasks in dynamic environments.

Continuous Testing to ensure agent reliability

Production environments are always changing, which makes reliability one of the biggest challenges for GenAI agents. Model updates, prompt adjustments, or changes to the tool catalogue can cause workflows to fail, produce incorrect outcomes, and ultimately undermine user trust.

Continuous testing is what ensures agents remain reliable and functional, even as these changes happen. It systematically checks core functionalities, like tool selection, to catch issues early. For instance, a workflow involving a weather tool might stop working after a model update or after tweaking the system prompt. Automated testing can identify such failures before they affect users, giving teams the chance to address them quickly. This way, agents can continue delivering good results without interruption.

Real-world scenarios change too, and continuous testing helps developers adapt their agentic systems to new tasks and use cases. By using and amending datasets that represent realistic situations, teams can ensure the agent performs well across a range of user needs. Automated pipelines make this process scalable and consistent, integrating directly into development workflows, as the sketch below illustrates. This approach allows teams to keep improving and expanding their agents without sacrificing reliability.
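One way to wire such checks into a pipeline is to parametrise tool-selection cases as a pytest suite. The `agent` fixture and its `select_tool` method are assumptions standing in for the real LLM-backed agent; the `FakeAgent` stub exists only so the example runs as-is:

```python
import pytest

class FakeAgent:
    """Stand-in for the real LLM-backed agent (assumption)."""
    def select_tool(self, query: str) -> tuple[str, dict]:
        # A real implementation would prompt the model with the tool
        # declarations and parse its function call from the response.
        if "weather" in query.lower():
            return "get_weather", {"location": "Tokyo", "date": "tomorrow"}
        return "translate_text", {"text": "good morning", "target_language": "French"}

@pytest.fixture
def agent() -> FakeAgent:
    return FakeAgent()

# Each case pairs a user query with the tool and arguments we expect.
CASES = [
    ("What's the weather in Tokyo tomorrow?",
     "get_weather", {"location": "Tokyo", "date": "tomorrow"}),
    ("Translate 'good morning' into French",
     "translate_text", {"text": "good morning", "target_language": "French"}),
]

@pytest.mark.parametrize("query, expected_tool, expected_args", CASES)
def test_tool_selection(agent, query, expected_tool, expected_args):
    tool_name, args = agent.select_tool(query)
    assert tool_name == expected_tool
    assert args == expected_args
```

Running such a suite on every model, prompt, or tool-catalogue change turns tool-selection regressions into failing tests rather than production incidents.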


Tools Selection Testing Framework for GenAI Agents

Enough theory, let's dive into the code!

Tags: Agents Genai Hands On Tutorials Machine Learning Testing
