AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Authors:
Shaona Ghosh,
Heather Frase,
Adina Williams,
Sarah Luger,
Paul Röttger,
Fazl Barez,
Sean McGregor,
Kenneth Fricklas,
Mala Kumar,
Quentin Feuillade--Montixi,
Kurt Bollacker,
Felix Friedrich,
Ryan Tsang,
Bertie Vidgen,
Alicia Parrish,
Chris Knotz,
Eleonora Presani,
Jonathan Bennion,
Marisa Ferrara Boston,
Mike Kuniavsky,
Wiebke Hutiri,
James Ezick,
Malek Ben Salem,
Rajat Sahay,
Sujata Goswami,
et al. (77 additional authors not shown)
Abstract:
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation.
In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
Submitted 18 April, 2025; v1 submitted 19 February, 2025;
originally announced March 2025.
A Study on Domain Generalization for Failure Detection through Human Reactions in HRI
Authors:
Maria Teresa Parreira,
Sukruth Gowdru Lingaraju,
Adolfo Ramirez-Aristizabal,
Manaswi Saha,
Michael Kuniavsky,
Wendy Ju
Abstract:
Machine learning models are commonly tested in-distribution (same dataset); performance almost always drops in out-of-distribution settings. For HRI research, the goal is often to develop generalized models. This makes domain generalization - retaining performance in different settings - a critical issue. In this study, we present a concise analysis of domain generalization in failure detection models trained on human facial expressions. Using two distinct datasets of humans reacting to videos in which errors occur, one from a controlled lab setting and another collected online, we trained deep learning models on each dataset. When testing these models on the alternate dataset, we observed a significant performance drop. We reflect on the causes of the observed model behavior and offer recommendations. This work emphasizes the need for HRI research focusing on improving model robustness and real-life applicability.
Submitted 10 March, 2024;
originally announced March 2024.
The Bystander Affect Detection (BAD) Dataset for Failure Detection in HRI
Authors:
Alexandra Bremers,
Maria Teresa Parreira,
Xuanyu Fang,
Natalie Friedman,
Adolfo Ramirez-Aristizabal,
Alexandria Pabst,
Mirjana Spasojevic,
Michael Kuniavsky,
Wendy Ju
Abstract:
For a robot to repair its own error, it must first know it has made a mistake. One way that people detect errors is from the implicit reactions of bystanders -- their confusion, smirks, or giggles clue us in that something unexpected occurred. To enable robots to detect and act on bystander responses to task failures, we developed a novel method to elicit bystander responses to human and robot errors. Using 46 different stimulus videos featuring a variety of human and machine task failures, we collected a total of 2452 webcam videos of human reactions from 54 participants. To test the viability of the collected data, we used the bystander reaction dataset as input to a deep-learning model, BADNet, to predict failure occurrence. We tested different data labeling methods and learned how they affect model performance, achieving precisions above 90%. We discuss strategies to model bystander reactions and predict failure and how this approach can be used in real-world robotic deployments to detect errors and improve robot performance. As part of this work, we also contribute the "Bystander Affect Detection" (BAD) dataset of bystander reactions, supporting the development of better prediction models.
Submitted 8 March, 2023;
originally announced March 2023.