-
The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims
Authors:
Kiana Jafari Meimandi,
Gabriela Aránguiz-Dias,
Grace Ra Kim,
Lana Saadeddin,
Mykel J. Kochenderfer
Abstract:
As industry reports claim agentic AI systems deliver double-digit productivity gains and multi-trillion dollar economic potential, the validity of these claims has become critical for investment decisions, regulatory policy, and responsible technology adoption. However, this paper demonstrates that current evaluation practices for agentic AI systems exhibit a systemic imbalance that calls into question prevailing industry productivity claims. Our systematic review of 84 papers (2023–2025) reveals an evaluation imbalance where technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic assessments (30%) remain peripheral, with only 15% incorporating both technical and human dimensions. This measurement gap creates a fundamental disconnect between benchmark success and deployment value. We present evidence from healthcare, finance, and retail sectors where systems excelling on technical metrics failed in real-world implementation due to unmeasured human, temporal, and contextual factors. Our position is not against agentic AI's potential, but rather that current evaluation frameworks systematically privilege narrow technical metrics while neglecting dimensions critical to real-world success. We propose a balanced four-axis evaluation model and call on the community to lead this paradigm shift because benchmark-driven optimization shapes what we build. By redefining evaluation practices, we can better align industry claims with deployment realities and ensure responsible scaling of agentic systems in high-stakes domains.
Submitted 1 June, 2025;
originally announced June 2025.
-
More than Marketing? On the Information Value of AI Benchmarks for Practitioners
Authors:
Amelia Hardy,
Anka Reuel,
Kiana Jafari Meimandi,
Lisa Soder,
Allie Griffith,
Dylan M. Asmar,
Sanmi Koyejo,
Michael S. Bernstein,
Mykel J. Kochenderfer
Abstract:
Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks, even those developed internally for specific tasks, were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.
Submitted 6 December, 2024;
originally announced December 2024.
-
Responsible AI in the Global Context: Maturity Model and Survey
Authors:
Anka Reuel,
Patrick Connolly,
Kiana Jafari Meimandi,
Shekhar Tewari,
Jakub Wiatrak,
Dikshita Venkatesh,
Mykel Kochenderfer
Abstract:
Responsible AI (RAI) has emerged as a major focus across industry, policymaking, and academia, aiming to mitigate the risks and maximize the benefits of AI, both on an organizational and societal level. This study explores the global state of RAI through one of the most extensive surveys to date on the topic, surveying 1000 organizations across 20 industries and 19 geographical regions. We define a conceptual RAI maturity model for organizations to map how well they implement organizational and operational RAI measures. Based on this model, the survey assesses the adoption of system-level measures to mitigate identified risks related to, for example, discrimination, reliability, or privacy, and also covers key organizational processes pertaining to governance, risk management, and monitoring and control. The study highlights the expanding AI risk landscape, emphasizing the need for comprehensive risk mitigation strategies. The findings also reveal significant strides towards RAI maturity, but we also identify gaps in RAI implementation that could lead to increased (public) risks from AI systems. This research offers a structured approach to assess and improve RAI practices globally and underscores the critical need for bridging the gap between RAI planning and execution to ensure AI advancement aligns with human welfare and societal benefits.
Submitted 13 October, 2024;
originally announced October 2024.
-
Text Classification Algorithms: A Survey
Authors:
Kamran Kowsari,
Kiana Jafari Meimandi,
Mojtaba Heidarysafa,
Sanjana Mendu,
Laura E. Barnes,
Donald E. Brown
Abstract:
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is presented. This overview covers different text feature extraction techniques, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application to real-world problems are discussed.
Submitted 20 May, 2020; v1 submitted 16 April, 2019;
originally announced April 2019.
-
An Improvement of Data Classification Using Random Multimodel Deep Learning (RMDL)
Authors:
Mojtaba Heidarysafa,
Kamran Kowsari,
Donald E. Brown,
Kiana Jafari Meimandi,
Laura E. Barnes
Abstract:
The exponential growth in the number of complex datasets every year requires more enhancement in machine learning methods to provide robust and accurate data classification. Lately, deep learning approaches have achieved surpassing results in comparison to previous machine learning algorithms. However, finding the suitable structure for these models has been a challenge for researchers. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. In short, RMDL trains multiple randomly generated models of Deep Neural Network (DNN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) in parallel and combines their results to produce better results than any of those models individually. In this paper, we describe the RMDL model and compare the results for image and text classification as well as face recognition. We used the MNIST and CIFAR-10 datasets as ground truth datasets for image classification and the WOS, Reuters, IMDB, and 20newsgroup datasets for text classification. Lastly, we used the ORL dataset to compare the model performance on the face recognition task.
Submitted 22 August, 2018;
originally announced August 2018.
-
RMDL: Random Multimodel Deep Learning for Classification
Authors:
Kamran Kowsari,
Mojtaba Heidarysafa,
Donald E. Brown,
Kiana Jafari Meimandi,
Laura E. Barnes
Abstract:
The continually increasing number of complex datasets each year necessitates ever improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept a variety of data as input, including text, video, images, and symbolic data. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.
Submitted 31 May, 2018; v1 submitted 3 May, 2018;
originally announced May 2018.
-
HDLTex: Hierarchical Deep Learning for Text Classification
Authors:
Kamran Kowsari,
Donald E. Brown,
Mojtaba Heidarysafa,
Kiana Jafari Meimandi,
Matthew S. Gerber,
Laura E. Barnes
Abstract:
The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of these traditional classifiers has degraded as the number of documents has increased. This is because along with this growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.
Submitted 6 October, 2017; v1 submitted 24 September, 2017;
originally announced September 2017.