Researchers reveal flaws in AI agent benchmarking

As AI agents have made their way into the mainstream for everything from customer service to fixing software code, it’s increasingly important to determine which agents are best for a given application, and which criteria beyond functionality to consider when selecting one. That’s where benchmarking comes in.

Benchmarks don’t reflect real-world applications

However, a new research paper, AI Agents That Matter, points out that current agent evaluation and benchmarking processes contain a number of shortcomings that hinder their usefulness in real-world applications. The authors, five Princeton University researchers, note that those shortcomings encourage development of agents that do well in benchmarks, but not in practice, and propose ways to address them.

“The North Star of this field is to build assistants like Siri or Alexa and get them to actually work — handle complex tasks, accurately interpret users’ requests, and perform reliably,” said a blog post about the paper by two of its authors, Sayash Kapoor and Arvind Narayanan. “But this is far from a reality, and even the research direction is fairly new.”

This, the paper said, makes it hard to distinguish genuine advances from hype. And agents are sufficiently different from language models that benchmarking practices need to be rethought.

What is an AI agent?

In traditional AI, an agent is defined as an entity that perceives and acts upon its environment, but in the era of large language models (LLMs), the definition is more complex. There, the researchers view agency as a spectrum of “agentic” factors rather than an either-or property.

They said that three clusters of properties make an AI system agentic:

Environment and goals – AI systems that operate in more complex environments are more agentic, as are systems that pursue complex goals without instruction.

User interface and supervision – AI systems that act autonomously or accept natural language input are more agentic, especially those requiring less user supervision.

System design – Systems that use tools such as web search, or planning (such as decomposing goals into subgoals), or whose flow control is driven by an LLM are more agentic.

Key findings

Five key findings came out of the research, all supported by case studies:

AI agent evaluations must be cost-controlled – Since calling the models underlying most AI agents repeatedly (at an additional cost per call) can increase accuracy, researchers can be tempted to build extremely expensive agents so they can claim top spot in accuracy. But the paper described three simple baseline agents developed by the authors that outperform many of the complex architectures at much lower cost.

Jointly optimizing accuracy and cost can yield better agent design – Two factors determine the total cost of running an agent: the one-time costs involved in optimizing the agent for a task, and the variable costs incurred each time it is run. The authors show that by spending more on the initial optimization, the variable costs can be reduced while still maintaining accuracy.

Analyst Bill Wong, AI research fellow at Info-Tech Research Group, agrees. “The focus on accuracy is a natural characteristic to draw attention to when comparing LLMs,” he said. “And suggesting that including cost optimization gives a more complete picture of a model’s performance is reasonable, just as TPC-based database benchmarks attempted to provide, which was a performance metric weighted with the resources or costs involved to deliver a given performance metric.”
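The trade-off between one-time optimization costs and per-run costs comes down to simple arithmetic. The sketch below is illustrative only: the function names and the dollar figures are hypothetical and do not come from the paper. It assumes total cost is the one-time cost plus the per-run cost multiplied by the number of runs, and finds the number of runs after which a more heavily optimized agent becomes the cheaper choice.

```python
import math
from typing import Optional


def total_cost(one_time_usd: float, per_run_usd: float, runs: int) -> float:
    """Total cost of an agent: fixed optimization cost plus variable cost per run."""
    return one_time_usd + per_run_usd * runs


def break_even_runs(
    baseline_one_time: float,
    baseline_per_run: float,
    optimized_one_time: float,
    optimized_per_run: float,
) -> Optional[int]:
    """Smallest number of runs at which the optimized agent's total cost
    is no higher than the baseline's. Returns None if that never happens."""
    extra_upfront = optimized_one_time - baseline_one_time
    saving_per_run = baseline_per_run - optimized_per_run
    if extra_upfront <= 0:
        # No extra upfront spend, so the optimized agent is no worse from the start.
        return 0
    if saving_per_run <= 0:
        # Costs more upfront and saves nothing per run: it never pays off.
        return None
    return math.ceil(extra_upfront / saving_per_run)


# Hypothetical figures: $100 of extra optimization halves a $0.50 per-run cost.
runs_needed = break_even_runs(0.0, 0.50, 100.0, 0.25)
print(runs_needed)
```

Under these made-up numbers, the extra optimization spend is recovered once the agent has been run often enough for the per-run savings to add up to the upfront difference. An agent that is run only rarely may never reach that point.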

Model developers and downstream developers have distinct benchmarking needs – Researchers and those who develop models have different benchmarking needs from downstream developers, who are choosing an AI system to use in their applications. Model developers and researchers don’t usually consider cost during their evaluations, while for downstream developers, cost is a key factor.

“There are several hurdles to cost evaluation,” the paper noted. “Different providers can charge different amounts for the same model, the cost of an API call might change overnight, and cost might vary based on model developer decisions, such as whether bulk API calls are charged differently.”

The authors suggest making evaluation results customizable, for example by giving users the option to adjust the cost of input and output tokens for their provider of choice, so that they can recalculate the trade-off between cost and accuracy. Downstream evaluations of agents should report input/output token counts in addition to dollar costs, so that anyone looking at the evaluation in the future can recalculate the cost using current prices and decide whether the agent is still a good choice.
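To see why reporting token counts is useful, consider a minimal sketch of the recalculation. It assumes a provider that charges separate per-million-token rates for input and output; the class names, token counts, and prices below are all hypothetical and are not taken from the paper or from any real provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts reported alongside a benchmark result (hypothetical schema)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices for one provider (assumed pricing model)."""
    usd_per_million_input: float
    usd_per_million_output: float


def run_cost_usd(usage: TokenUsage, pricing: Pricing) -> float:
    """Recompute the dollar cost of an evaluation from its token counts."""
    return (
        usage.input_tokens * pricing.usd_per_million_input
        + usage.output_tokens * pricing.usd_per_million_output
    ) / 1_000_000


# Made-up numbers: the same reported usage priced under two different rate cards.
reported = TokenUsage(input_tokens=2_000_000, output_tokens=500_000)
old_prices = Pricing(usd_per_million_input=1.0, usd_per_million_output=4.0)
new_prices = Pricing(usd_per_million_input=0.5, usd_per_million_output=2.0)

print(run_cost_usd(reported, old_prices))
print(run_cost_usd(reported, new_prices))
```

Because the token counts stay fixed while prices change, a reader can plug in the current rates from the provider they use and get an up-to-date cost without rerunning the benchmark.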

Agent benchmarks enable shortcuts – Benchmarks are only useful if they reflect real-world accuracy, the paper noted. Shortcuts such as overfitting, in which a model is so closely tailored to its training data that it can’t make accurate predictions or draw conclusions from any other data, produce benchmark scores that don’t translate to the real world.

“This is a much more serious problem than LLM training data contamination, as knowledge of test samples can be directly programmed into the agent as opposed to merely being exposed to them during training,” the report said.

Agent evaluations lack standardization and reproducibility – The paper pointed out that, without reproducible agent evaluations, it is difficult to tell whether there have been genuine improvements, and this may mislead downstream developers when selecting agents for their applications.

However, as Kapoor and Narayanan noted in their blog, they are cautiously optimistic that reproducibility in AI agent research will improve because there’s more sharing of code and data used in developing published papers. And, they added, “Another reason is that overoptimistic research quickly gets a reality check when products based on misleading evaluations end up flopping.”

The way of the future

Despite the lack of standards, Info-Tech’s Wong said, companies are still looking to use agents in their applications.

“I agree that there are no standards to measure the performance of agent-based AI applications,” he noted. “Despite that, organizations are claiming there are benefits to pursuing agent-based architectures to drive higher accuracy and lower costs and reliance on monolithic LLMs.”

The lack of standards and the focus on cost-based evaluations will likely continue, he said, because many organizations are looking at the value that generative AI-based solutions can bring. However, cost is one of many factors that should be considered. Organizations he has worked with rank factors such as skills required to use, ease of implementation and maintenance, and scalability higher than cost when evaluating solutions.

And, he said, “We are starting to see more organizations across various industries where sustainability has become an essential driver for the AI use cases they pursue.”

That makes agent-based AI the way of the future, because it uses smaller models, reducing energy consumption while preserving or even improving model performance.
