As agentic AI rapidly transforms the enterprise adoption landscape, MLCommons is convening experts from across academia, civil society, and industry to better understand the risks associated around practical, near-term deployments of AI agents. 

The challenge we posed to our community was: Imagine you are the decision maker on a product launch in 2026 for an agent that could take limited actions, such as placing an order or issuing a refund – what evaluations would you need to make an informed decision?

Last month, MLCommons co-hosted the Agentic Reliability Evaluation Summit (ARES) with the Oxford Martin School’s AI Governance Initiative, Schmidt Sciences, Laboratoire National de Metrologie et D’Essais (LNE), and the AI Verify Foundation. 34 organizations attended the convening — including AI labs, large software companies, AI safety institutes, startups, and civil society organizations — to discuss new and emerging evaluation methods for agentic AI. 

Today, MLCommons is announcing a new collaboration with contributors from Advai, AI Verify Foundation, Anthropic, Arize AI, Cohere, Google, Intel, LNE, Meta, Microsoft, NASSCOM, OpenAI, Patronus AI, Polytechnique Montreal, Qualcomm, QuantumBlack – AI by McKinsey, Salesforce, Schmidt Sciences, ServiceNow, University of Cambridge, University of Oxford, University of Illinois Urbana-Champaign, and University of California, Santa Barbara to co-develop an open agent reliability evaluation standard to operationalize trust in agentic deployments. This collaboration brings together a diverse ecosystem spanning AI builders, major enterprise deployers, specialized testing and safety providers, and important global policy organizations to ensure that this standard is both technically sound and practically implementable in the near-term. By uniting these different perspectives, this collaboration will create frameworks and benchmarks that are grounded in technical reality while addressing the practical needs of the global AI ecosystem.

We have outlined a set of principles to evaluate the reliability of near-term agentic use-cases, focusing on scenarios with limited actions within structured contexts. The evaluation framework will be organized into four key categories: Correctness, Safety, Security, and Control. The outcome from this group will comprise: 1) Design principles for agentic development 2) Benchmarks that are relevant to business for correctness, safety & control, and security. 

We welcome collaboration with a diverse range of enterprises, technology leaders, and business builders who share our vision. If you are interested in joining, please email [email protected]