AI agents – autonomous systems that can plan, decide, and act – are rapidly moving from hype to reality. Unlike a static chatbot that just answers questions, an agentic AI can use tools, browse the web, operate software, and perform multi-step tasks on our behalf. But with this newfound autonomy comes a pressing question: How do we evaluate these AI agents? Measuring an agent’s abilities is far more complex than scoring a single-question answer. We need benchmarks and evals (evaluations) that put agents through realistic scenarios – from navigating websites and desktops to calling APIs – and objectively assess their success, failures, and everything in between. This comprehensive guide will dive deep into the top benchmarks of 2025 for agentic AI. We’ll explore different categories of agent evals (web browsing vs. operating system control vs. tool use), highlight key platforms and use-cases, compare what’s working (and what isn’t), and discuss where the field is headed. Whether you’re a curious newcomer or an insider tracking the latest research, this guide will equip you with a clear understanding of how we’re testing AI agents in 2025, why it matters, and what’s next.
Contents
Understanding Agentic AI and Why Benchmarks Matter
Web and Browser-Based Agent Benchmarks
Operating System and Desktop Agent Benchmarks
Function-Calling and Tool-Use Benchmarks
Cross-Domain and Specialized Agent Benchmarks
Industry Landscape: Platforms, Players, and Use Cases
Challenges and Future Outlook for Agent Evaluations
1. Understanding Agentic AI and Why Benchmarks Matter
AI agents are not your typical AI models. Unlike a single large language model (LLM) that passively replies to a prompt, an agentic AI is goal-driven and interactive – it can make decisions, hold context over time, take actions (like clicking a webpage or executing a function), and adapt its strategy based on what happens. This fundamental difference means evaluating an AI agent requires a new approach. Traditional LLM benchmarks (e.g. answering trivia or writing an essay) don’t capture what an agent does. Instead, we must test how well an agent performs tasks in uncertain, dynamic environments. For example, can it book a flight via a web browser? Fix a formatting error in a spreadsheet? Call the correct API to fetch data? The evaluation isn’t just about right or wrong answers – it’s about whether the agent achieves a goal through a sequence of actions in a realistic setting (techtarget.com) (techtarget.com).
Agent vs. Model – What’s Different to Measure? To clarify why specialized benchmarks are needed, consider how an AI agent contrasts with a static AI model:
Autonomy and Decision-Making: An agent actively makes independent decisions on what actions to take next towards a goal, whereas a normal model only responds when asked (techtarget.com). Benchmarking an agent means examining its whole decision process and how it handles unexpected situations, not just checking one response.
Context and Memory: Agents maintain longer-term memory and context, often handling multi-turn interactions or complex state. A good eval for an agent might involve a lengthy scenario with many steps, tracking if it remembers relevant details. In contrast, static model evaluation is usually one-shot or limited context.
Dynamic Output (Actions): Instead of just outputting text, agents produce actions in an environment – clicking buttons, entering text, calling functions, etc. Evaluations must therefore run the agent in an environment (like a simulated browser or OS) and see if its actions succeed (techtarget.com). This is a big change: it introduces variability (the agent might take 5 steps or 50 steps) and requires measuring outcomes in a realistic context (webpage state, program result) rather than comparing to a fixed answer string.
Unbounded Interactions and Cost: Because an agent can loop or explore until it finishes (or fails) a task, the cost and length of an evaluation can be unbounded (techtarget.com). A static model’s test is a fixed input-output pair, but an agent might keep generating actions – which is both computationally expensive and harder to judge. Good agent benchmarks carefully account for time and cost (e.g. capping the number of steps or API calls) in their design – see the evaluation-loop sketch after this list.
Task-Specific Benchmarks: Language model benchmarks are often general (if a model answers one knowledge question, it likely can answer similar ones). But agent performance tends to be very task-specific – being great at web browsing doesn’t guarantee being good at file editing (techtarget.com). This has led to many domain-focused benchmarks (as we’ll see below) and some holistic ones that cover multiple task types.
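To make the “dynamic output” and “unbounded interactions” points above concrete, here is a minimal sketch of the evaluation loop most agent benchmarks implement in some form: reset an environment to a known task state, let the agent act step by step under a step budget, and score the final environment state rather than a text answer. The names (Agent, Environment, run_episode) are illustrative, not taken from any specific benchmark.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...


class Environment(Protocol):
    def reset(self, task_id: str) -> Any: ...      # set up the task's initial state
    def step(self, action: Any) -> Any: ...        # apply an action, return the next observation
    def check_success(self) -> bool: ...           # inspect the final state (page, files, DB), not text


@dataclass
class EpisodeResult:
    task_id: str
    success: bool
    steps_used: int


def run_episode(agent: Agent, env: Environment, task_id: str, max_steps: int = 50) -> EpisodeResult:
    """Run one benchmark task under a hard step budget (the cost-control knob)."""
    obs = env.reset(task_id)
    steps = 0
    for steps in range(1, max_steps + 1):
        action = agent.act(obs)
        if action == "STOP":            # the agent declares it is finished (or gives up)
            break
        obs = env.step(action)
    # Success is judged from the environment's final state, not by string-matching the output.
    return EpisodeResult(task_id, env.check_success(), steps)
```

Real harnesses layer per-step logging, timeouts, and token accounting on top of this loop.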
Why Do We Need Agent Benchmarks? First, to measure progress. The field of AI agents is evolving incredibly fast, and benchmarks give us a yardstick. For instance, on one popular web agent benchmark, early GPT-4 based agents could only complete about 14% of the tasks successfully, whereas humans achieved ~78% – a huge gap (medium.com). Within two years, new agent designs and training methods have boosted that success rate to roughly 60% on the same benchmark – a massive improvement, but still short of human-level performance (medium.com). Without consistent evals, we wouldn’t even know this progress (from 14% to 60%) was happening or understand which innovations made a difference. Benchmarks let researchers and developers pinpoint what techniques actually work (or don’t) and how far we are from robust, reliable agents. Second, good evaluations help identify failure modes and risks. Agents are powerful but can make mistakes – sometimes costly ones. (One infamous incident involved an AI agent integrated with a developer tool that accidentally deleted an entire production database – a harsh reminder of why thorough testing matters! (techtarget.com)) By simulating realistic tasks, benchmarks reveal where agents might go wrong – e.g. misunderstanding an instruction, getting stuck by a pop-up dialog, or misusing a tool – so these issues can be addressed before deploying agents in the real world. Finally, as agentic AI becomes more central to software (and even society), benchmarks provide an objective way to compare solutions. Businesses can ask: which AI agent performs best on tasks that matter to us? E.g. which one has the highest success rate automating web workflows, or the safest behavior on an operating system? In a rapidly growing market (agentic AI is projected to balloon to $10+ billion by 2026 and nearly $200B by 2034) (techtarget.com), benchmarks are critical for cutting through the hype and ensuring reliability as adoption grows.
In the rest of this guide, we will break down the landscape of agent benchmarks into categories based on how the agent interacts with its environment. Broadly, agent evals can be grouped by the type of environment or interface the agent operates in:
Web/Browser-based environments (navigating websites through a browser interface)
Operating System/Desktop environments (controlling apps, GUIs, files on a PC)
Function Calling or API environments (using tools via structured function calls)
Cross-domain and specialized environments (mixing multiple interfaces or focusing on specific domains like coding, gaming, or physical tasks)
Each category comes with different challenges and popular benchmarks. Let’s explore each in depth, highlighting the most important eval suites as of 2025, what they test, and what we’ve learned from them.
2. Web and Browser-Based Agent Benchmarks
One of the earliest and most active areas for agent evaluation is web browsing tasks. These benchmarks ask AI agents to carry out tasks on websites – just as a human would use a browser. For example: “Find and purchase a red dress under $50 on an online store,” or “Book a hotel in New York for next weekend,” or “Update the README on the project’s GitLab page.” Web environments are complex: they involve reading page content (often requiring vision or parsing HTML), clicking links or buttons, filling forms, handling multi-step navigation, and sometimes even juggling multiple browser tabs. The agent needs a mix of skills: language understanding (to interpret the instruction and page text), planning (figuring out the sequence of clicks or inputs to reach the goal), tool use (using browser actions like click/type), and even some memory (remembering information across pages).
Web-based benchmarks typically provide a controlled yet realistic web environment for the agent. Rather than letting the agent roam the entire internet (which is uncontrolled and hard to evaluate), these benchmarks use either simulated websites or constrained sets of real websites so that success criteria can be defined. A common approach is hosting copies of real websites (or stylized versions of them) locally, so the environment is stable and the agent’s actions can be checked for correctness.
Key Web/Browser Benchmarks (2025):
WebArena: The flagship benchmark for autonomous web agents. WebArena provides a fully self-hosted web environment with interactive replicas of popular site types – including an e-commerce store (with tens of thousands of products), a social media forum, a collaborative coding platform (GitLab-like), and a content management system (emergentmind.com) (emergentmind.com). Agents are given tasks phrased in natural language (like a user intent: e.g. “find when your last order was shipped” or “post a comment in the forum about X”) and must use a simulated browser to complete them (emergentmind.com) (emergentmind.com). WebArena tracks whether the end goal is achieved correctly – for instance, did the agent actually post the comment or retrieve the shipping date? – while allowing flexibility in how the agent got there (multiple action paths can count as success) (emergentmind.com) (emergentmind.com). This benchmark has been pivotal in measuring progress: initially, even strong LLM-based agents struggled badly here (GPT-4 agents managed only ~14% task success) (emergentmind.com). Through 2024, researchers introduced better strategies – high-level planners, memory modules, and specialized training data – pushing success rates to over 60% by 2025 (emergentmind.com). (For context, humans performing the same tasks achieve about 78% success, so agents are closing the gap but still have work to do. (medium.com)) WebArena’s rich scenarios have revealed common failure modes like agents getting confused by pop-up dialogs or CAPTCHAs, or “hallucinating” nonexistent page content when they get stuck (arxiv.org) (arxiv.org). The community has extended WebArena with spin-offs like WebChoreArena – a set of 500+ especially tedious, long-horizon web tasks (massive form-filling, multi-page workflows, etc.) to further stress-test agent memory and stamina (emergentmind.com). Another extension focuses on safety and trust aspects (ensuring agents don’t violate user policies while browsing) (emergentmind.com). Overall, WebArena remains a foundational benchmark: if you want to know how “smart” a web-browsing agent is in 2025, you see how it scores on WebArena’s leaderboard. (As of early 2025, top agents like IBM’s “CUGA” reached ~61.7% success, while many others lag well behind that – a sign of how challenging full web autonomy still is (emergentmind.com).)
MiniWoB++ (Mini World of Bits): Before complex benchmarks like WebArena, researchers developed MiniWoB++ as a collection of over 100 bite-sized web tasks on synthetic web pages (github.com) (github.com). These are simplified web UIs – think toy login forms, search boxes, dropdown menus – designed to test basic web manipulation skills. Each MiniWoB task has a specific goal (e.g., “click the button labeled 5”, “fill the form with name and submit”). While not “real websites,” the advantage is that performance can be measured exactly (the correct button either was clicked or not) and the environment is lightweight. MiniWoB++ helped pioneer early agent strategies and is still used as a training ground. However, it lacks the rich language understanding component (tasks are very straightforward), so newer benchmarks incorporate more realistic content and instructions.
Mind2Web: This benchmark takes realism up a notch by using live websites across 31 domains to create tasks (techtarget.com). Mind2Web offers 2,350 tasks collected from 137 real websites – covering everything from booking travel, to using social media, to navigating maps (arxiv.org) (arxiv.org). It gives agents truly real-world scenarios (with all the unpredictability of live web content). Agents are evaluated on whether they successfully complete the task on the live site (and there are also intermediate checkpoints to see if they did sub-steps correctly) (techtarget.com) (techtarget.com). Because it uses real websites, Mind2Web is great for testing generalization – can an agent handle a site it’s never seen before? Early results show that even strong models struggle: for example, a GPT-4 based agent reached only about 23% strict success on Mind2Web’s full tasks (with partial credit up to ~48% when intermediate steps are counted) (arxiv.org) (arxiv.org). This indicates a lot of headroom for improvement. Mind2Web also introduced tests where agents are evaluated on entirely new websites (domains not seen in training) – a tough measure of true general-purpose web skill (arxiv.org). Mind2Web’s scale and diversity make it a valuable “stress test” for web agents, although the reliance on live websites means it’s less standardized than WebArena (where everyone’s tested on the same fixed pages).
BrowserArena: Not to be confused with WebArena, BrowserArena is an evaluation platform that pairs up agents head-to-head on user-submitted web tasks. Inspired by the idea of an Arena (tournament style comparison), it randomly assigns two different agents the same task (for example, “find the weather in Tokyo for tomorrow”) and then has humans or a judging model pick which agent did better (arxiv.org) (arxiv.org). This pairwise comparison approach (similar to how Chatbot Arena compares chatbots) allows evaluation of open-ended tasks without needing a predefined “ground-truth” answer for each step. Users can even provide step-by-step feedback, marking where an agent’s action went wrong, to uncover specific failure modes (arxiv.org) (arxiv.org). BrowserArena is more of a community-driven eval: its goal is to continuously accept new tasks and rank agents by human preference. It’s a newer concept but highlights a trend toward “reference-free” evaluation, where success isn’t a binary pass/fail but “which agent is more helpful/effective” in a given scenario (arxiv.org). This can capture nuances like whether an agent’s intermediate reasoning was sensible, not just whether the final state was correct.
BrowserGym and WorkArena: Many web benchmarks are built on top of the BrowserGym framework – essentially a universal simulation environment for web tasks (github.com). BrowserGym was developed to make it easier to create and run web-based agent tasks. It includes MiniWoB, WebArena, and also WorkArena, which is a set of tasks simulating common enterprise web workflows (like ordering a laptop through a ServiceNow portal) (github.com). The WorkArena benchmark (released around 2024) featured 682 tasks focusing on “knowledge work” scenarios – think of it as business-oriented web tasks to test planning and reasoning in an office context. BrowserGym abstracts the browser actions (click, type) and observations (page DOM, or even screenshots for visual tasks) so researchers can plug in different agents and evaluate them across these tasks uniformly. If you’re experimenting with a new web agent, BrowserGym is likely the toolkit you’d use to measure it on MiniWoB, WorkArena, WebArena, etc., all in one place. It’s part of the broader push to standardize agent evaluations so that progress in labs translates to comparable results.
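To give a feel for what running an agent through BrowserGym looks like, here is a short sketch using gymnasium’s standard reset/step loop. The import path, environment ID, observation contents, and action-string format below are assumptions to verify against the BrowserGym documentation, not guaranteed API details.

```python
# Sketch only: environment ID, import path, and action format are assumptions.
import gymnasium as gym
import browsergym.miniwob  # noqa: F401  (importing is assumed to register the MiniWoB tasks)


def trivial_agent(obs) -> str:
    # BrowserGym agents typically emit small action strings referencing element IDs
    # from the observation, e.g. "click('42')" or "fill('7', 'hello')" (assumption).
    return "noop()"


env = gym.make("browsergym/miniwob.click-button")   # hypothetical task name
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = trivial_agent(obs)
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
print("episode reward:", reward)   # MiniWoB-style tasks reward 1.0 on success (assumption)
```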
What Web Benchmarks Teach Us: Web-based evals have been an eye-opener for the AI community. They taught us that the size of the language model alone isn’t enough – a naïve GPT-4 agent with no special training or planning can still flounder on a complicated website. Success came from orchestration: using an LLM as a high-level planner, but pairing it with an execution module that understands the web DOM, and giving it some form of memory or scratchpad to avoid looping errors (medium.com) (medium.com). In fact, many top agents converged on a similar architecture: a Planner (an LLM that decides what high-level step to do), an Executor (a model or code that carries out the step on the web interface), and a Memory to store what’s been done or learned (medium.com) (medium.com). This modular design, plus lots of specialized training data (e.g. fine-tuning on demonstration trajectories of web tasks), allowed even medium-sized models to do well, compensating for limited raw horsepower with skill-specific knowledge (medium.com) (emergentmind.com). We also learned about common failure modes: for example, agents without vision might mis-read graphical elements (one agent infamously claimed it had closed a pop-up when it hadn’t, simply because it lacked the ability to “see” the pop-up image (arxiv.org) (arxiv.org)!). These benchmarks continue to evolve – adding longer tasks (WebChoreArena) to test endurance, adding safety checks (ensuring the agent doesn’t, say, reveal a user’s private info or perform disallowed actions), and even proposing tournament-style evaluations where agents compete and are ranked by Elo scores instead of absolute metrics (emergentmind.com) (emergentmind.com). All of this is geared toward making web agents truly reliable for real-world use, where they could automate browsers for us in business, research, or personal tasks.
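To make the planner/executor/memory pattern concrete, here is a schematic sketch of that loop. It is not any particular team’s implementation: the llm() call, the browser interface, and the prompt wording are all placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Scratchpad of what has been done, to avoid looping on the same action."""
    history: list[str] = field(default_factory=list)

    def summarize(self) -> str:
        return "\n".join(self.history[-10:])   # keep only a recent window in the prompt


def llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM backs the agent."""
    raise NotImplementedError


def plan_next_step(goal: str, page_text: str, memory: Memory) -> str:
    # Planner: high-level decision ("open the orders page", "fill the search box", ...)
    return llm(f"Goal: {goal}\nDone so far:\n{memory.summarize()}\nPage:\n{page_text}\nNext step:")


def execute_step(step: str, page_text: str) -> str:
    # Executor: grounds the step into a concrete browser action on a DOM element
    return llm(f"Step: {step}\nPage:\n{page_text}\nEmit one browser action (click/type/stop):")


def run_agent(goal: str, browser, max_steps: int = 30) -> None:
    # `browser` is an assumed interface exposing observe() and act(action).
    memory = Memory()
    for _ in range(max_steps):
        page_text = browser.observe()          # DOM text or accessibility tree
        step = plan_next_step(goal, page_text, memory)
        action = execute_step(step, page_text)
        memory.history.append(f"{step} -> {action}")
        if action.strip().startswith("stop"):
            break
        browser.act(action)                    # apply the action in the environment
```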
3. Operating System and Desktop Agent Benchmarks
Another frontier for agentic AI is having agents that can operate a computer’s OS and desktop applications like a human user. Imagine an AI that can control your Windows or Linux machine: opening apps, clicking buttons, editing files, sending emails, etc. This is even more challenging than web browsing in some ways, because the agent often only “sees” the screen pixels (a graphical user interface) and must handle a wide variety of programs – from text editors and spreadsheets to terminals and web browsers – using mouse and keyboard inputs. We’re basically asking the AI to be a general office assistant on a computer. Benchmarks in this category create virtual desktop environments and assign tasks that a typical user might do on a PC.
A key difference in OS/desktop benchmarks vs. web benchmarks is the observation and action space. In a web task, an agent might be able to read the page’s HTML or get a structured DOM object. In a desktop environment, the agent usually gets a screenshot (pixel input) of the screen and must interpret it (much like we do visually), since underlying code structures aren’t accessible for arbitrary applications (arxiv.org). Actions are things like moving a cursor to coordinates, clicking, typing keystrokes, or using keyboard shortcuts. This requires a form of vision-language-action model (often a multimodal model that can process images and output actions). It’s akin to an AI trying to replicate what a human does with eyes and hands on a computer. The complexity here is enormous – modern GUIs have infinite possible layouts and sequences.
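Concretely, the interface a desktop agent works with boils down to “screenshot in, low-level input events out.” The sketch below is an illustrative encoding of that action space, not taken from OSWorld or any specific benchmark.

```python
from dataclasses import dataclass
from typing import Union


# Minimal pixel-level action vocabulary an OS/desktop agent has to work with.
@dataclass
class MoveTo:
    x: int
    y: int


@dataclass
class Click:
    button: str = "left"        # "left" | "right" | "double"


@dataclass
class TypeText:
    text: str


@dataclass
class Hotkey:
    keys: tuple[str, ...]       # e.g. ("ctrl", "s")


Action = Union[MoveTo, Click, TypeText, Hotkey]


def agent_step(screenshot_png: bytes, instruction: str) -> Action:
    """A vision-language-action model maps raw pixels plus the task instruction to one
    low-level action. There is no DOM to parse here; the model must 'see' the UI."""
    raise NotImplementedError   # backed by a multimodal model in a real agent
```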
Key OS/Desktop Benchmarks:
OSWorld: Introduced in 2024 (NeurIPS 2024), OSWorld is a groundbreaking benchmark providing a full-fledged virtual computer environment for agents (arxiv.org). It includes 369 diverse tasks on Ubuntu Linux and Windows operating systems (arxiv.org). These tasks are very practical: e.g. “Send an email with the subject X using Outlook,” “Edit a cell in an Excel spreadsheet and apply a formula,” “Download a file from the web and open it,” or even multi-application workflows like “Take data from a website and plot it in an Excel chart.” Each task defines an initial state (which applications/files are open or available) and has an automated script to check if the agent achieved the correct end state (arxiv.org). The agent runs inside a VM (Virtual Machine) – essentially a sandboxed OS – and is evaluated on a simple success/fail basis for each task (did it accomplish everything exactly?) (arxiv.org). The results from OSWorld were eye-opening: human testers can solve about 72% of the tasks, but the best AI agent at the time could only solve 12.2%! (arxiv.org) Even after some improvements, the best reported AI success rose to around 38% on OSWorld, still far below human-level (arxiv.org). This stark gap highlights how much harder the general computer-use domain is – agents struggle with things like interpreting UIs, dealing with unexpected pop-up windows, and carrying out long sequences reliably (arxiv.org). OSWorld earned a reputation as extremely challenging and has become a go-to benchmark to test the limits of multimodal agents. However, it also exposed some practical issues: running a full OS VM for each test is resource-intensive and tricky to set up (originally it required VMware or VirtualBox, making it non-trivial to automate at scale) (arxiv.org). Additionally, OSWorld’s initial design assumed a particular agent architecture (a certain kind of ReAct prompt style), which made it inflexible to test new agent designs without a lot of tweaking (arxiv.org). Despite these hurdles, OSWorld was the first to truly integrate web and desktop tasks in one suite, and it pushed the community to start tackling “real computer” operation.
OSUniverse: Announced in 2025, OSUniverse is a follow-up effort aiming to address OSWorld’s limitations and broaden the evaluation. It describes itself as a benchmark of complex, multimodal desktop tasks for GUI agents, with an emphasis on ease of use, extensibility, and comprehensive coverage (arxiv.org). OSUniverse tasks are organized in increasing difficulty levels – from basic things like precise clicking on a single app, up to multi-step workflows that involve coordinating between multiple applications (arxiv.org). One design principle is that they calibrated tasks so that SOTA agents (as of 2025) get at most ~50% success on the easiest levels (to ensure room for growth), whereas an average human can do them all perfectly (arxiv.org). OSUniverse also introduces an automated validation method with very low error (so they can score agent runs without manual checking) (arxiv.org). It’s built to be more flexible: supporting different agent architectures, making it easier to plug in new environments (they use a system called AgentDesk that can run virtual desktops in Docker containers, simplifying setup) (arxiv.org) (arxiv.org). OSUniverse explicitly supports multiple operating systems or platforms, and even includes tasks that might require switching between, say, an Android phone interface and a PC (resembling how modern work often spans devices) (crab.camel-ai.org) (crab.camel-ai.org). A notable feature is their evaluation metrics – beyond simple success/fail, they explore graph-based evaluation where each task is broken into a graph of sub-goals, so you can get partial credit or see which part of a complex task failed (crab.camel-ai.org) (crab.camel-ai.org). This fine-grained analysis is helpful because an agent might, for example, successfully open the correct apps (partial success) but then input the wrong data (fail the final goal). By mid-2025, OSUniverse is the cutting-edge academic benchmark to test new desktop agents, and it’s pushing for more robust testing (covering more apps, multi-platform, etc.). It essentially complements OSWorld with a more modern, modular approach.
AgentBench (OS tasks): We’ll cover AgentBench in depth as a cross-domain suite later, but it’s relevant here because it includes an Operating Systems environment as one of its testbeds (techtarget.com). It uses a simulated OS (similar in concept to OSWorld) to see how agents perform typical OS tasks. While not as extensive as OSWorld’s 369 tasks, AgentBench provides a slice of OS challenges within a larger evaluation (more on AgentBench in section 5).
Other GUI/desktop evals: There are a few others worth noting. GUI Tasks Benchmark by Bonatti et al. (2024) and Xie et al. (2024) – these were early attempts at having LLMs control graphical interfaces. VNC Simulated Desktop tasks, for example, where an agent uses VNC remote desktop to do things like open a paint application and draw something or organize files into folders. These were more experimental and often not standardized into a large benchmark like OSWorld, but they contributed ideas. Also, some industry efforts, like Microsoft’s internal agent tests (for their AutoGPT-style prototypes), reportedly include lots of desktop scenarios (since products like Office 365 could be automated by agents). However, those aren’t publicly documented as benchmarks, so OSWorld and OSUniverse remain the reference points in literature.
Findings from OS Benchmarks: If web tasks were a challenge, desktop tasks are perhaps the ultimate test of an AI agent’s generality. Early findings show agents are brittle – a single misread icon or a slight UI layout change can throw them completely off. Things that humans consider trivial (like dragging a window or coping with a slow application load time) can break an agent’s script easily. Interestingly, benchmark authors noted that many OS tasks were described with somewhat ambiguous instructions (to mimic a human giving a natural request). A person could clarify or try different approaches if unsure, but a current agent cannot ask clarifying questions once it’s in action – if the prompt is vague, the agent may interpret it incorrectly and fail (arxiv.org) (arxiv.org). This points to a limitation in how we set up agent tasks: we might need to give clearer instructions or allow agents to query for clarification in the future. The performance gap between humans and agents on OS tasks remains huge – even larger than in web browsing. On OSWorld, the best AI was <40% vs. humans at ~72% (arxiv.org). On newer tasks that require fluidly moving across applications (like copying a chart from Excel to Word), agents are only beginning to be tested. There’s optimism though: by incorporating vision (using multimodal models that can “see” the UI) and better planning, some agents (e.g. using GPT-4 with vision and a fine-tuned UI policy) have improved. For instance, on OSWorld some teams reported raising success from 12% to 30+% by leveraging image recognition of icons combined with LLM planning (arxiv.org) (arxiv.org). But we also see the need for more training data – there is no massive dataset of “how humans use computers” readily available, so researchers are starting to create synthetic data or logs to help agents learn these skills.
In summary, OS-level benchmarks highlight the importance of multimodal understanding, precise action execution, and error recovery. A web agent might get away with reading underlying HTML, but a true desktop agent has to interpret a visual layout (e.g., find the “File” menu on a screenshot) and deal with uncertainties like pop-up dialogs, loading spinners, or system notifications. It’s a big ask. The work in 2025 is laying the groundwork – if one day we have a reliable “AI assistant on your PC”, it will be thanks to the lessons learned from these early OS agent benchmarks.
4. Function-Calling and Tool-Use Benchmarks
Not all AI agents operate via a browser or GUI. Another major paradigm is agents that use tools and APIs directly by calling functions. This is often the case when an AI is deployed in a controlled software environment – for example, an AI developer assistant might call a compiler API, or an AI scheduling assistant might call a calendar API. In 2023–2024, with the advent of LLMs that can do function calling (like OpenAI’s function call interface), a lot of attention turned to evaluating how well models can decide to use a tool and produce the correct API call. Essentially, these benchmarks test an agent’s ability to interface with external functions: parsing a user’s request, choosing the right function, and supplying correct arguments to achieve the goal.
Function-calling evals are a bit different in style from web/OS tasks – they often resemble a conversation or set of instructions where at some point the model is expected to invoke a function (with proper JSON or code format). The evaluation then checks if the function was called correctly and if the subsequent result was handled properly. These benchmarks are crucial for tool-augmented AI systems, where an LLM isn’t just generating text but also orchestrating other services (e.g., searching the web via an API call, retrieving data from a database, or executing code).
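In practice, a single-turn function-calling check usually reduces to comparing the model’s emitted call against a gold call: same function name, equivalent arguments, valid format. Below is a minimal sketch of such a checker, using an illustrative JSON schema rather than any benchmark’s exact format.

```python
import json
from typing import Optional


def parse_call(model_output: str) -> Optional[dict]:
    """Expect the model to emit JSON like {"name": "get_weather", "arguments": {...}}."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None                     # malformed JSON counts as a failed call
    if isinstance(call, dict) and "name" in call and "arguments" in call:
        return call
    return None


def call_matches(model_output: str, gold: dict) -> bool:
    call = parse_call(model_output)
    if call is None or call["name"] != gold["name"]:
        return False
    # Exact-match arguments; real benchmarks often add type coercion or
    # order-insensitive comparison for list-valued arguments.
    return call["arguments"] == gold["arguments"]


gold = {"name": "get_distance", "arguments": {"city1": "San Francisco", "city2": "Los Angeles"}}
output = '{"name": "get_distance", "arguments": {"city1": "San Francisco", "city2": "Los Angeles"}}'
assert call_matches(output, gold)
```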
Key Benchmarks for Function Calling & Tool Use:
BFCL (Berkeley Function-Calling Leaderboard): One of the most prominent benchmarks in this area, BFCL evaluates an LLM’s ability to accurately call functions (a.k.a. tools) across a wide range of scenarios (gorilla.cs.berkeley.edu). It started around 2024 and by 2025 is in version 4, expanding from simple single-step API calls to more holistic agentic evaluations of multi-step tool use (gorilla.cs.berkeley.edu). BFCL provides a dataset of real-world function call tasks – about 2,000 question-answer pairs in early versions – where the model needs to use functions like a calculator, weather API, or knowledge lookup to produce the answer (evidentlyai.com). For example, a task might ask, “What’s the distance between San Francisco and Los Angeles?” and the model should decide to call a get_distance(city1, city2) function with appropriate arguments, instead of guessing. The leaderboard tracks accuracy (did the model get the function calls right) and even factors like cost (how expensive in terms of API usage or tokens) and latency (gorilla.cs.berkeley.edu), which is very practical for real applications. Interestingly, BFCL’s latest version explicitly moves toward agentic evaluation – meaning it’s not just one call, but possibly sequences of calls and decision-making steps, approximating a full agent scenario (gorilla.cs.berkeley.edu). Top models on BFCL are often specialized or fine-tuned for tool use. The Gorilla model from Berkeley (BFCL grew out of the same Gorilla project) is one such example, optimized for using a wide array of tools through API calls.
HammerBench: Introduced in late 2024, HammerBench is a benchmark focusing on fine-grained function calling in multi-turn dialogues, especially simulating mobile phone assistant scenarios (arxiv.org) (arxiv.org). Developed by a team at OPPO and SJTU, it models complex user interactions that require calling phone APIs (like booking tickets, setting reminders, etc.) where the user might not give all information up front, forcing the AI to ask follow-ups or handle imperfect instructions (arxiv.org) (arxiv.org). HammerBench built its dataset via a pipeline of GPT-generated dialogues and human validation, ensuring that things like argument shifts (the user changes their mind on a parameter) and imperfect instructions are included (arxiv.org) (arxiv.org). The evaluation breaks down how well an AI handles each turn of the conversation and whether each function call was correct. The name “hammer” suggests hitting the functions hard – it’s very granular. One finding was that many models made mistakes in parameter naming or usage, which was a major cause of failure in dialogues (arxiv.org). This benchmark is useful to see how resilient an AI is in a realistic chat where it has to invoke multiple functions in sequence to satisfy a user query (think of booking a flight: searchFlights -> chooseFlight -> bookFlight functions, with lots of parameter passing).
NoisyToolBench: Another research benchmark (referenced in HammerBench) that addresses how models handle tools when instructions are incomplete or “noisy”. It introduced scenarios where the prompt might be missing details and the model has to decide to call a tool to fill in the gaps or avoid hallucinating a tool output (arxiv.org). While not as widely known as BFCL, it’s part of this family of tool-use tests trying to stress test robustness.
ProLLM Function Benchmarks: There are also community-driven collections like ProLLM (an open repository of prompts and evals) which include sections for function calling. These may not be formal papers, but they compile tasks such as “use the given calculator function to add these numbers” and check if the model does use the function or just computes itself (prollm.ai). They help quickly compare model capabilities (for example, testing OpenAI’s GPT-4 function calling vs. an open-source model’s ability to follow a JSON function signature).
Databricks’ Analysis (API vs. User-Aligned): A blog from Databricks in 2024 discussed evaluating function calling by comparing models that had API-schema-based function definitions vs. those given more natural instructions (databricks.com). While not a benchmark per se, it highlights evaluation considerations – like should a model strictly output a JSON for the API (and risk formatting errors), or can it reason in a looser way? The conclusion was that having the model explicitly align to API specs was effective, but the evaluation needed to catch where models would go wrong (like dropping required parameters, or calling functions unnecessarily).
GAIA: Though it also fits the cross-domain category covered in the next section, GAIA is a dataset testing tool-use and reasoning in answering questions (techtarget.com). It provides tasks that seem simple for humans but require an AI to use tools or combine modalities (like looking at an image and then using a calculator). GAIA has multiple difficulty levels, each adding complexity (more steps or tools required) (techtarget.com). It’s a good evaluation of an AI assistant’s ability to coordinate tools with reasoning. For our purposes, GAIA overlaps the tool-use category, since it explicitly measures how well assistants use tools and multimodal inputs to solve problems.
MINT: Another cross-category framework, MINT evaluates an LLM’s ability to solve tasks with multi-turn interactions involving external tools and dynamic feedback (techtarget.com). It gives models access to tools via Python code and even simulates a user giving feedback or additional info (using GPT-4 as the “user”) (techtarget.com). This is like placing the model in an interactive loop where it can try a tool, see the result, and then possibly get a hint or correction from a user prompt if it’s going off track. MINT includes decision-making tasks, reasoning puzzles, and coding challenges. It measures not just final success, but how the model navigates the process – effectively checking if the model can learn from feedback and use tools effectively. This kind of evaluation is important for agents that are meant to collaborate with humans or adjust on the fly.
From Tools to True Agents: The function-calling benchmarks have shown that even very advanced LLMs can have trouble reliably using tools without fine-tuning. For example, early GPT-4 models sometimes would hallucinate a tool usage or format it incorrectly. Benchmarks like BFCL demonstrated the value of having native function calling support (distinguishing models that have an API calling feature vs. those that do it via prompt tricks) (gorilla.cs.berkeley.edu). They also revealed that latency and cost can vary widely – an agent that takes 10 steps calling various APIs vs. one that directly answers might both get the job done, but one is slower or more expensive. Thus, these evals sometimes include cost as a metric to reflect practicality (gorilla.cs.berkeley.edu). One interesting outcome: specialized models or augmented systems (like Gorilla from Berkeley, which was trained on API documentation) tend to significantly outperform general models on these benchmarks (gorilla.cs.berkeley.edu). This suggests that for tool use, having knowledge of the tool semantics and practicing calls is crucial. We also saw that for multi-turn tool use (HammerBench style), the model’s dialogue management is tested – it needs to remember what’s been asked, what parameters are already provided, and what’s still needed. It’s not purely a tool skill but a conversation+tool skill.
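One way to picture the “conversation + tool” skill is slot tracking: across turns, the agent (and the eval harness) must know which required parameters have been collected, which are still missing, and when it is finally safe to call the API. The illustration below uses hypothetical function and parameter names; it is not HammerBench’s actual scoring code.

```python
REQUIRED = {"origin", "destination", "date"}         # parameters searchFlights needs


def update_slots(slots: dict, new_info: dict) -> dict:
    """Merge info from the latest user turn; later turns may overwrite earlier ones
    (the 'argument shift' case where the user changes their mind)."""
    return {**slots, **new_info}


def next_action(slots: dict) -> str:
    missing = REQUIRED - slots.keys()
    if missing:
        return f"ask_user({sorted(missing)})"        # don't call the API yet
    return f"call searchFlights({slots})"            # all slots filled -> make the call


# Simulated dialogue: the user gives info piecemeal, then changes the date.
slots: dict = {}
for turn in [{"origin": "SFO"}, {"destination": "JFK", "date": "2025-06-01"}, {"date": "2025-06-02"}]:
    slots = update_slots(slots, turn)
    print(next_action(slots))
```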
Looking forward, function-call evals are blending into broader agent evals. The latest BFCL update explicitly mentions “agentic evaluation,” indicating scenarios where the model might have to choose whether and when to call a function in a longer chain of reasoning (gorilla.cs.berkeley.edu). This moves closer to full agent behavior (as opposed to a one-off API call). As AI agents are deployed in products (like scheduling assistants, customer support bots that use databases, etc.), these benchmarks will be key to ensure that function calls are made accurately and safely. After all, an AI that misuses an API could be as dangerous as one that clicks the wrong button – imagine an agent calling a “delete_user(account_id)” API instead of “get_user_info”! The meticulous testing via benchmarks helps catch such issues.
In summary, function/tool-use benchmarks hone the precision of agentic AI: making sure that when an agent reaches for an external tool, it does so correctly, uses the right tool for the job, and handles the results properly. It’s about teaching our AI agents to RTFM (“read the function manual”) and not improvise when it comes to tools – a habit that benchmarks like BFCL reinforce by rewarding exact correctness.
5. Cross-Domain and Specialized Agent Benchmarks
The field of agentic AI is so broad that many benchmarks focus on specific domains or attempt to cover multiple domains. We call these cross-domain or specialized benchmarks. Some are meant to evaluate an agent’s general flexibility across different environments (web, OS, games, etc.), while others drill into a particular niche (like household robotics or cybersecurity). For a comprehensive guide, it’s worth knowing the major ones, especially since they often introduce unique evaluation methods or insights.
Holistic Multi-Environment Benchmarks:
AgentBench: AgentBench is an ambitious suite introduced to test autonomous LLM-based agents across a variety of environments in a holistic way (techtarget.com) (techtarget.com). Think of it as a “general exam” for agents. It includes eight different environments covering a spectrum of tasks: operating systems, databases, knowledge graphs, card games, puzzles, household tasks, web shopping, and web browsing (techtarget.com). By doing so, AgentBench doesn’t focus on one narrow skill but rather on decision-making and adaptability in different contexts (techtarget.com). An agent is scored on each, and the idea is to get a well-rounded view of its capabilities. For example, in the database environment, an agent might have to query or update a database given some goal; in card games, perhaps play a simplified game like poker with strategy; in puzzles, solve logic puzzles, etc. AgentBench emphasizes evaluating the agent’s reasoning quality, accuracy, and multi-turn consistency in each setting (techtarget.com). The evaluation looks at both the final outcome (did it succeed in the task) and aspects like consistency of steps (but it doesn’t nitpick each step if the final goal is reached) (techtarget.com). This is a pragmatic approach – ultimately, “did the agent do the job?” is what matters, but tracking the process helps diagnose issues. AgentBench is valuable because it mirrors a realistic scenario: a good general agent should handle a bit of everything (from browsing to playing a game to using a tool). If one model does great on web but fails completely on a simple puzzle, AgentBench will highlight that, guiding where improvement is needed. It’s also platform-agnostic; as long as an environment can be interfaced (OS, web, etc.), it can be part of AgentBench. This makes it a step toward an “AGI test suite” in some sense.
CRAB (Cross-environment Agent Benchmark): CRAB is a framework and benchmark from Camel-AI aiming for general-purpose agent evaluation across multiple platforms (crab.camel-ai.org) (crab.camel-ai.org). Its initial release (v0 in late 2024) included 120 tasks across 2 environments: Ubuntu and Android (crab.camel-ai.org). The interesting twist is that CRAB can test agents under different communication settings (for instance, whether the agent uses tool APIs vs. directly generating actions in a structured format) (crab.camel-ai.org). It set up a leaderboard comparing models like GPT-4, Claude, Google’s Gemini, etc., on these tasks – showing GPT-4 (with certain configurations) achieving the highest success rate (~14.17% on their metric, which might be weighted by some completion ratio) and others like Claude or Gemini struggling with very low success (crab.camel-ai.org) (crab.camel-ai.org). The tasks require things like interacting between a PC and a phone environment (e.g., get info on PC, then send a message on phone) (crab.camel-ai.org) (crab.camel-ai.org). CRAB’s evaluation is graph-based and fine-grained, meaning they break tasks into checkpoints and can score partial progress, plus they automatically generate many tasks by composing sub-tasks (crab.camel-ai.org) (crab.camel-ai.org). Essentially, CRAB tries to cover adaptability: can one agent handle both a Linux server and an Android phone tasks, which have very different interfaces? The results so far showed that even top models have a hard time (success rates under 15% overall), again underlining the challenge of generality (crab.camel-ai.org) (crab.camel-ai.org). CRAB is quite new, but it’s representative of efforts to create a unified framework where new environments can be added easily and a variety of agent abilities can be benchmarked in one place.
Domain-Specific and Task-Specific Benchmarks:
ALFWorld: A benchmark focusing on household tasks in a simulated environment. ALFWorld combines a text-based environment (from an earlier interactive fiction benchmark called ALFRED) with logical reasoning tasks. The goal is to evaluate an agent’s ability to understand and plan physical actions like “pick up a pan from the stove and put it in the sink” in a simulated house (techtarget.com). It tests an agent’s planning and object manipulation reasoning in a purely textual simulation of a home (techtarget.com). This is important for bridging to robotics: while ALFWorld doesn’t have real robots, it uses text descriptions to represent the physical world, so an agent has to interpret descriptions (“there is a fridge in the kitchen”) and issue actions (“open fridge”). Success is measured by completing the task through correct action sequences, and tasks often involve ambiguity or needing to reason about ordering (you can’t pour water if you haven’t filled the cup, etc.) (techtarget.com). ALFWorld showed that household reasoning is tough for agents – it requires a form of common-sense understanding of everyday activities. It’s a relatively niche benchmark unless you’re working on embodied or robotics-related agents, but it’s a piece of the puzzle for full general agents.
ColBench: A collaborative coding benchmark where an AI agent works with a simulated human partner to complete software development tasks (techtarget.com) (techtarget.com). Think of a scenario: the AI and a (simulated) human are co-workers chatting about building a feature. The agent must clarify requirements, propose code, refine it based on feedback – essentially engaging in a multi-turn dialogue to produce something like a web page or a piece of code. ColBench tests an agent’s ability to handle long conversations, clarify instructions, and produce correct outputs in a development context (techtarget.com) (techtarget.com). Evaluation is based on whether the final product meets the expected result (and possibly the quality of interactions). It’s a unique spin because it treats the agent as a collaborator, not just an autonomous solver. This reflects real use cases of AI pair programmers or project assistants. It emphasizes how well the AI can manage context over many turns and respond to a human’s partial instructions or corrections.
CyBench: A specialized benchmark for cybersecurity tasks, where an agent is evaluated on how it identifies vulnerabilities or performs exploits in various scenarios (techtarget.com). It sets up challenges in domains like web security, digital forensics, and cryptography, each with a contained environment (like a vulnerable website or a piece of encrypted text) (techtarget.com). The agent’s job might be to find a security flaw, exploit it to get some flag, etc. CyBench maintains a leaderboard to rank models on these tasks (techtarget.com). This is particularly interesting because it crosses into a highly practical and sensitive domain – it requires the agent to have some “hacking” knowledge and reasoning. It’s also a domain where mistakes could be dangerous (you don’t want a reckless AI hacker!), so benchmarking here is both about capability and controlled behavior. CyBench’s existence shows that people are keen to know if AI agents can handle domain-specific expert tasks (like a cybersecurity analyst’s job) and do so effectively.
LiveSWEBench: This benchmark zeroes in on software development tasks for AI agents, evaluating them in three categories: autonomous coding tasks (agent is given a high-level goal like “implement feature X”), targeted code editing tasks (agent must modify a given file per instructions), and code autocompletion tasks (techtarget.com). It basically simulates a coding workflow where an AI agent is doing the programming. What’s notable is that LiveSWEBench evaluates both the process and the final outcome (techtarget.com). It checks the individual decisions the agent makes – for instance, did it run tests after coding? Did it follow instructions step by step? – as well as whether the final code works or meets the spec. This dual evaluation is important because in coding, how you get there (not introducing bugs along the way, responding to errors) is as crucial as the end result. With the rise of tools like GitHub Copilot and others adding “agent” capabilities (like auto-fixing code, etc.), having a benchmark like this helps measure which AI agents can actually replace or assist human developers in complex tasks beyond just one function completion.
Others (General Knowledge & Multimodal): We already touched on GAIA (tools and multimodal Q&A) and MINT (multi-turn interactive tasks with feedback). Additionally, Mind2Web (covered in web section) could also be seen as cross-domain in the sense that its tasks span many different website types (travel, social media, etc.). There’s also FieldWorkArena (by Fujitsu, 2025) which is quite specialized: it targets real-world field work scenarios – things like monitoring factory equipment via camera feeds and reporting issues (arxiv.org) (arxiv.org). It introduces multimodal tasks (video + documents) to simulate an agent helping in a manufacturing or warehouse setting, divided into stages like planning, perception, and action (arxiv.org) (arxiv.org). This is an example of an industry-specific benchmark acknowledging that current ones don’t cover, say, using vision on surveillance footage combined with reading PDFs (a very real use-case for workplace AI agents) (arxiv.org) (arxiv.org). FieldWorkArena is quite cutting-edge and shows how new domains are being brought into agent evals.
Insight from Specialized Benchmarks: Each specialized benchmark teaches lessons relevant to its domain, but they all underscore a common theme: agents need both general intelligence and domain-specific knowledge to excel. An agent great at web browsing might still fail at a coding task if it lacks programming knowledge. Conversely, a coding agent might not understand a household task described in plain language. This is why research is branching into these areas – eventually, we’d like agents that can learn new domains efficiently, but in the meantime, benchmarks ensure we don’t overfit our assessment of “agent intelligence” to just one or two domains.
We also see that some benchmarks innovate on evaluation methodology: e.g., ColBench evaluating quality of collaboration, CRAB using graph-based scoring, LiveSWEBench looking at intermediate decisions. These innovations often then feed back into how more general benchmarks might measure things. For instance, the idea of giving partial credit for intermediate correct steps (rather than all-or-nothing) is becoming more popular, because it helps differentiate an agent that almost got it right from one that was completely off track (crab.camel-ai.org).
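As a concrete illustration of graph-based partial credit, a task can be decomposed into checkpoint sub-goals with prerequisites, and the score becomes the fraction of checkpoints whose verifier passes. This is a toy reconstruction of the idea, not CRAB’s or OSUniverse’s actual code.

```python
from typing import Callable

# Each checkpoint: (name, prerequisite checkpoints, verifier over the final environment state)
Checkpoint = tuple[str, set[str], Callable[[dict], bool]]

CHECKPOINTS: list[Checkpoint] = [
    ("opened_spreadsheet", set(),                  lambda s: s.get("spreadsheet_open", False)),
    ("entered_data",       {"opened_spreadsheet"}, lambda s: s.get("cell_A1") == 42),
    ("saved_file",         {"entered_data"},       lambda s: s.get("saved", False)),
]


def graph_score(final_state: dict) -> float:
    """Fraction of sub-goals achieved, counting a checkpoint only once its prerequisites passed."""
    passed: set[str] = set()
    for name, prereqs, verify in CHECKPOINTS:      # assumes checkpoints are topologically ordered
        if prereqs <= passed and verify(final_state):
            passed.add(name)
    return len(passed) / len(CHECKPOINTS)


# The agent opened the file and entered the data but never saved: partial credit of 2/3.
print(graph_score({"spreadsheet_open": True, "cell_A1": 42, "saved": False}))
```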
Finally, specialized benchmarks often align with industry interests: cybersecurity (CyBench), software dev (LiveSWEBench), field operations (FieldWorkArena), etc. They highlight where AI agents could be most immediately useful and thus need evaluation. As agents start to be deployed in these roles, having a benchmark is useful for vendors and users to gauge if a given AI agent is ready for prime time in that area.
6. Industry Landscape: Platforms, Players, and Use Cases
The rapid development of agent benchmarks hasn’t happened in a vacuum – it’s driven by (and in turn drives) intense interest from both research labs and industry players. Let’s step back and look at the bigger picture of AI agents in 2025: who are the major players, what platforms are emerging, how are these evals used in practice, and where are agents actually being successful (or not) in the real world.
Major Players & Models: In terms of foundation models powering agents, a few names dominate: OpenAI’s GPT-4 (and newer variants), Anthropic’s Claude, and as of late 2024, Google’s Gemini model, are frequently at the top of agent performance charts. For example, on the CRAB benchmark’s leaderboard, an OpenAI GPT-4 variant achieved the highest success rate by a clear margin, outperforming Claude 3 and Google’s Gemini 1.5 on the same tasks (crab.camel-ai.org) (crab.camel-ai.org). Similarly, on WebArena tasks, many of the leading agent implementations use GPT-4 (sometimes with vision, GPT-4V) as the core reasoning engine (emergentmind.com) (emergentmind.com). This isn’t surprising, as these are state-of-the-art models with the largest knowledge and reasoning capacities. However, size isn’t everything – specialized or fine-tuned systems often do better on specific benchmarks. A prime example is IBM’s CUGA agent, which set a record on WebArena (~61.7% success) by using a combination of techniques tailored for web tasks (emergentmind.com). Or Berkeley’s Gorilla model fine-tuned for API calls, which excels in function-calling benchmarks. We’re also seeing open-source models catching up: for instance, teams have adapted open 13B–34B parameter models with vision front-ends for OS tasks. Some open models (like Meta’s Llama 2 with fine-tuning, or newer entrants like Mistral) appear on leaderboards like CRAB but currently trail the big proprietary models in performance (crab.camel-ai.org). The gap is often particularly wide in complex agent tasks, possibly due to the training data advantages and fine-tuning capabilities of the big players.
Platforms and Frameworks: On the development side, there’s been an explosion of agent frameworks – tools to build and orchestrate agents – such as LangChain, Haystack, Transformers Agents, and enterprise platforms like Microsoft’s Autonomous Agents framework or Salesforce’s AI agent platform. While these are more about building agents than evaluating them, they increasingly incorporate benchmarking tools. For example, LangChain introduced evaluation modules (LangSmith) where you can simulate tasks and measure success rates of your agent flows. Companies providing AI services also highlight benchmarks as proof of capability. OpenAI, for instance, has its own Evals platform (open-sourced) which allows users to contribute evaluation scripts, and some of those are multi-step or agentic in nature. Hugging Face has an Evaluation hub and leaderboards for models on certain tasks (though not yet a specific “agent” leaderboard, they host things like the GAIA dataset for Q&A with tool use (techtarget.com)).
Interestingly, there are emerging commercial platforms focused on agent testing. For instance, some startups offer “AI agent testing as a service,” where companies can plug in their custom agent and run a battery of benchmark tasks to get a report on performance, cost, and more. This is because businesses want to validate an AI agent before deploying it in production (nobody wants a rogue agent email-scraping their CRM unsupervised!). Some platforms also integrate monitoring – for example, TruLens and EvidentlyAI provide monitoring and evaluation for AI systems over time, which could include agent behavior in production. While not benchmarks in themselves, these tools rely on benchmark scenarios to continuously test that an agent hasn’t regressed after an update.
Use Cases and Successes: Where are AI agents actually working well so far? One area is customer support bots that can use tools: for example, an agent that can look up your order in a database, then issue a refund via an API. Function-calling evals directly support this use-case by ensuring the bot calls the right APIs. There have been successful pilots of AI agents as IT assistants (navigating troubleshooting flows on a computer), or data analysts (querying a database and producing a chart). These narrow but valuable applications often come with custom benchmarks internally – e.g., a company might create 50 typical support scenarios and measure the AI’s success rate at handling them fully. Public benchmarks like AgentBench’s database and OS tasks are proxies for these. Another domain is web automation for business workflows – some companies have agents that update spreadsheets, scrape competitor prices, or post on social media automatically. They often test those agents on WebArena-like tasks first to gauge reliability. The fact that WebArena is self-hosted and reproducible is a big plus here – companies can simulate their own websites or apps in a similar environment.
Failures and Limitations in Practice: Despite excitement, there have been well-publicized failures. The Replit database deletion incident (techtarget.com) we mentioned shows the stakes: an agent given too much freedom can do very wrong things if not properly evaluated and constrained. Other limitations include cost – running a complex agent with GPT-4 can be expensive. Benchmarks like BFCL explicitly track an estimated dollar cost for completing the whole suite (gorilla.cs.berkeley.edu), and some entries cost significantly more than others. If one model takes twice as many actions or has a longer output, that might rack up more token usage. Companies care about this because an agent that dithers could burn through API credits. Speed is another factor: in real usage, you might need an agent to complete a task in, say, 30 seconds. Some benchmark tasks allow measuring wall-clock time or number of steps to reflect this (e.g., OSWorld might record how many steps until success or failure, not just success). Many current agents are a bit slow and sometimes require human oversight to correct mistakes, which limits where they can be deployed right now. They’re also often brittle to changes – an update in a website’s layout or a software UI can break an agent that was hardcoded or fine-tuned for the previous version. So maintaining agents is an issue.
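Cost and speed metrics of the kind BFCL and OS-level harnesses report can be collected with a thin wrapper that records steps, tokens, and wall-clock time per episode. A generic sketch follows; the per-token price is a placeholder, not any provider’s real rate.

```python
import time
from dataclasses import dataclass

PRICE_PER_1K_TOKENS = 0.01          # placeholder rate, not any provider's real pricing


@dataclass
class EpisodeMetrics:
    task_id: str
    success: bool
    steps: int
    tokens: int
    seconds: float

    @property
    def dollars(self) -> float:
        return self.tokens / 1000 * PRICE_PER_1K_TOKENS


def timed_run(run_task, task_id: str) -> EpisodeMetrics:
    """run_task(task_id) should return (success, steps, tokens); timing wraps around it."""
    start = time.perf_counter()
    success, steps, tokens = run_task(task_id)
    return EpisodeMetrics(task_id, success, steps, tokens, time.perf_counter() - start)


def report(results: list[EpisodeMetrics]) -> None:
    n = len(results)
    print(f"success rate: {sum(r.success for r in results) / n:.1%}")
    print(f"avg steps:    {sum(r.steps for r in results) / n:.1f}")
    print(f"avg latency:  {sum(r.seconds for r in results) / n:.1f}s")
    print(f"total cost:   ${sum(r.dollars for r in results):.2f}")
```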
Upcoming Players & Different Approaches: On the horizon, there are new players focusing specifically on agentic AI. For example, the startup Adept AI (founded by former OpenAI and Google researchers) is working on an agent that can use computers and APIs very fluidly; they haven’t published benchmarks yet, but their internal evals likely influence the field. Elsewhere, RPA companies like Automation Anywhere and UiPath are integrating LLMs – they are used to benchmarking automation flows (which are essentially agent scripts), so they bring that expertise with them, potentially creating new eval standards for enterprise process automation by AI. We also see academic teams like Camel-AI (behind CRAB) or Kentauros AI (behind OSUniverse) entering the space, bringing in multi-disciplinary approaches. Each might emphasize different things: Camel-AI’s approach was multi-agent collaboration (their name comes from having AIs “role-play” roles to help each other), so CRAB might evolve to test multi-agent systems. Meanwhile, Kentauros focuses on broad GUI tasks and practical integration (their AgentDesk and SurfKit tools).
Some upcoming trends in approach: Agent orchestration (the idea of not just one LLM agent, but a hive mind or a manager/worker structure) is being explored. Benchmarks might soon have scenarios that explicitly require multiple agents to cooperate. For instance, a future eval could require an AI “CEO” agent delegating subtasks to an AI “analyst” agent – how do we measure the outcome then? It’s complex, but work is starting on frameworks for multi-agent eval.
Another difference in approach is simulation vs. real-world. Most benchmarks we discussed are simulation-based or use static environments. But some companies argue for testing agents in more live environments (with careful monitoring). OpenAI, for example, allowed GPT-4 to browse the live web (with certain safeguards) and evaluated qualitatively how well it did, even if not formally benchmarked. There’s a recognition that eventually agents have to face the messy real world. We might see “deployment evaluations” where an agent is run in a real setting for a period and its performance (errors per hour, tasks completed) is measured much as one would A/B test software. It’s not a traditional benchmark with a static dataset, but more of an ongoing evaluation.
Lastly, pricing and access: Many of the benchmarks we discussed are open and free for research. But a few come with platforms that have pricing – for example, if you use a service like WebArena’s hosted version or a commercial evaluation suite, there could be fees. Companies like Emergent Mind compile benchmark results and analysis (they had a detailed WebArena analysis) as a paid service for insights (emergentmind.com) (emergentmind.com). So an ecosystem is forming where raw benchmarks are free, but interpreted results or easy-to-use dashboards might cost money.
For most readers, the key takeaway is that agent benchmarks are not just academic exercises – they directly influence product development. When an enterprise AI vendor says “our agent platform achieved X% success on Y benchmark,” it provides credibility. Conversely, if an open-source model climbs the leaderboard, it will attract attention and likely more adoption. In 2025, we are seeing the first benchmark-driven competition in the agent space, similar to how ImageNet spurred vision model progress a decade ago. The players who consistently perform well on these evals (OpenAI, perhaps Google, etc.) are perceived as leaders, and newcomers try to demonstrate progress via the same metrics.
7. Challenges and Future Outlook for Agent Evaluations
As comprehensive as current benchmarks are, evaluating AI agents remains a fast-moving target with many open challenges. Both the benchmarks and the agents themselves are evolving. In this final section, we’ll discuss the broader challenges in agent evaluation and where things are likely headed in the near future.
Challenge: Dynamic and Long-Horizon Tasks. One big issue is that real-world tasks are often open-ended and can’t be neatly scripted with a single “success” condition. Agents might have to operate continuously or handle unpredictable goals from users. Current benchmarks like WebArena and OSWorld, while complex, still consist of finite, well-defined tasks. A frontier challenge is evaluating agents on truly long-running tasks or multi-goal missions (imagine an agent that works for 8 hours doing various things – how do we score that?). Researchers are exploring “tournament” or competition-style evaluations where, instead of a static success measure, agents are pitted against each other in scenarios and ranked by human preference or outcomes (emergentmind.com) (emergentmind.com). This is akin to how we evaluate human employees or game AI – over many rounds, who performs better overall. It introduces its own complexities (needing many human evaluations or a reliable AI judge), but it might capture long-horizon performance better. In the near future, we might see benchmarks that run an agent in a sandbox for an extended period and measure things like consistency, efficiency, and adaptability over time rather than a single task completion.
Challenge: Evaluation Precision and Fairness. As agents improve, benchmarks must be sensitive enough to distinguish them. For example, when top agents climb from 60% to 70% success on WebArena, we need confidence that the difference is real and not noise. Issues like ambiguous tasks or slightly inconsistent scoring criteria can mis-rank agents (emergentmind.com) (emergentmind.com). Benchmark creators are working to refine evaluation scripts, add more rigorous checks, and even use reference-free methods (like Elo ratings from pairwise battles) to get a more robust measure (emergentmind.com) (emergentmind.com). Fairness is another concern: some agents might have access to different tools (like vision vs. no vision). Should benchmarks segregate those categories or let them all compete together? For now, many leaderboards note the agent’s capabilities (e.g., whether it used a vision model) and even have separate tracks. In the future, as multi-modal models become standard, this will normalize, but currently it’s a consideration.
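For readers unfamiliar with the Elo idea mentioned above, the standard update rule fits in a few lines of Python; the K-factor of 32 is a common convention from chess rating, not something any agent benchmark prescribes.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (A, B) ratings after one head-to-head agent comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: agent A (1500) beats agent B (1500) in a pairwise task battle.
print(update_elo(1500, 1500, a_won=True))  # -> (1516.0, 1484.0)
```

The appeal for agent evals is that no ground-truth answer is needed per task, only a judgment (human or AI) about which of two runs was better.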
Safety and Reliability Evaluations: An important direction is the integration of safety benchmarks for agents. An agent that completes a task but in doing so violates a policy or causes a side effect (like deleting data or revealing private info) is not truly successful for real-world use. We saw early efforts like ST-WebAgentBench, which adds policy compliance checks on top of WebArena tasks (emergentmind.com). Going forward, expect benchmarks to include criteria like “did the agent avoid unsafe actions?” or “did it ask for confirmation before destructive actions?”. Already, some benchmarks include unachievable task categories to see if the agent will wisely do nothing instead of doing something harmful (emergentmind.com) (emergentmind.com). Future evals might simulate adversarial conditions (like prompt injections for an agent with a browser, or malicious inputs) to test robustness. The Alignment Research Center (ARC) developed some evals for power-seeking or risky behavior in advanced agents – these are more hypothetical, but they might become relevant if agent capabilities approach levels where misuse is a concern. So, a full evaluation of an agent might soon include a “safety score” alongside its task performance score.
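As a simplified illustration of that policy-compliance idea, an eval harness could gate each proposed agent action through a rule check; the action names and rules below are invented for this sketch.

```python
# Hypothetical policy gate for agent actions; rules and action names are invented.

DESTRUCTIVE_ACTIONS = {"delete_database", "drop_table", "send_payment"}

def policy_gate(action: dict) -> str:
    """Return 'allow', 'confirm', or 'block' for a proposed agent action."""
    if action["name"] in DESTRUCTIVE_ACTIONS:
        # Destructive actions require explicit human confirmation.
        return "confirm"
    if action.get("target", "").startswith("/private/"):
        # Anything touching a private path is refused outright.
        return "block"
    return "allow"

# An eval can then score two things per episode: task success AND how many
# actions the gate had to escalate (a rough ingredient for a "safety score").
escalations = sum(
    policy_gate(a) != "allow"
    for a in [{"name": "lookup_order"}, {"name": "delete_database"}]
)
print(f"policy escalations: {escalations}")  # -> 1
```

A benchmark built this way can report a safety score alongside task success, for example by counting episodes where the agent pushed through a destructive action without confirmation.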
Human-in-the-Loop and Adaptability: Right now, most benchmarks treat the agent as fully autonomous once the task begins (no human help). But many real deployments will have a human supervising or collaborating. We might see evaluations for how well an agent works with a human. For instance, an agent that can defer to a human when unsure (rather than acting blindly) could be safer and more efficient. How to benchmark that? Possibly by having simulated user interruptions or queries. There’s a concept of mixed-initiative evaluations where sometimes the AI should take initiative, and sometimes wait – measuring that balance could be another frontier.
Scaling and Cost of Eval: As tasks get more complex, running a benchmark suite can itself be costly. For example, OSWorld tasks with GPT-4 can require a large number of API calls, and setting up VMs for hundreds of tasks is resource-intensive. There’s a push to optimize evaluation – maybe by using cheaper models as judges, by selective testing (not running every agent on every single task if not needed), or by community-driven eval (like BrowserArena’s crowd-sourced tasks). Some propose modular evals: test basic skills separately (vision recognition, text understanding, etc.) and then infer agent performance, to avoid running full end-to-end tests for every change. But ultimately, nothing beats seeing the agent do the full task. So, the field may need to invest in infrastructure (possibly shared cloud platforms or funding) to make large-scale agent testing accessible to researchers without deep pockets.
Future Benchmarks on the Horizon: We can expect new benchmarks in new domains. For example, perhaps social agents – evaluating how an AI agent navigates social interactions or negotiations. Or educational tutors – where the agent has to teach or guide a student through a problem (measuring pedagogical ability). Also, as robotics and AI converge, benchmarks that involve controlling robots (real or simulated) could gain prominence, bridging physical and digital agent capabilities. OpenAI’s rumored “Operator” agent (mentioned in papers) suggests they might be internally testing an agent that can do a variety of tasks – if that or others come to light, they might release some benchmark or challenge to the community. We’re already seeing companies host agent challenges – for instance, a hackathon where teams build agents to solve a set of surprise tasks, and the results are compared. These are one-offs, but lessons from them often feed into new benchmarks.
Metrics Beyond Success Rate: Future evals might consider more qualitative metrics: efficiency (time/steps taken), robustness (performance variance across slight changes), generalization (how well an agent trained on one set of tasks does on novel but related tasks). Some of these are touched on by existing benchmarks (like Mind2Web testing new domains, or CRAB mixing platforms), but they could be formalized further. For instance, a benchmark might come with a training set of tasks and a secret test set of novel tasks – testing an agent’s ability to learn or adapt, not just execute a static repertoire. This moves towards evaluating learning ability, which is key if agents are to continually improve online.
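One way such metrics could be computed from per-episode run logs, assuming a hypothetical log format invented purely for this sketch:

```python
from statistics import mean, pstdev

# Hypothetical per-episode logs: a success flag, steps taken, and a variant tag
# marking slight perturbations of the same underlying task.
runs = [
    {"task": "book_flight", "variant": "v1", "success": True,  "steps": 12},
    {"task": "book_flight", "variant": "v2", "success": True,  "steps": 18},
    {"task": "book_flight", "variant": "v3", "success": False, "steps": 30},
]

success_rate = mean(r["success"] for r in runs)
# Efficiency: average steps on successful episodes only.
efficiency = mean(r["steps"] for r in runs if r["success"])
# Robustness proxy: dispersion of success across task variants (lower is steadier).
success_dispersion = pstdev(float(r["success"]) for r in runs)

print(f"success={success_rate:.2f}, avg steps (successes)={efficiency:.1f}, "
      f"success std across variants={success_dispersion:.2f}")
```

Generalization would be measured the same way but split by whether the task came from the seen or held-out set, so the report becomes a small table rather than a single headline number.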
Closing Thoughts: The landscape of agentic AI evals in 2025 is rich and growing, reflecting the rapid progress and the high stakes. Benchmarks are a driving force: they motivate improvements and ensure some level of accountability (we can’t just claim an agent is great – we have to show it on benchmarks). At the same time, we must remember that benchmarks are approximations of reality. There’s a saying: “All benchmarks are wrong, but some are useful.” Each eval has limitations – maybe it’s too narrow, maybe it can be gamed by overfitting – so the key is an agent that does well across many benchmarks and real tests. The ultimate “benchmark” will be real-world deployment: when AI agents can be trusted to handle a wide range of tasks safely, effectively, and consistently. Until then, the best we can do is keep crafting challenging benchmarks that push these systems to their limits and highlight where they fall short. By doing so, we uncover the next problem to solve, the next innovation to make – bringing truly helpful AI agents closer to everyday use.