AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

Authors: Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade--Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, et al. (77 additional authors not shown)

Abstract: The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions.
This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.

Submitted 18 April, 2025; v1 submitted 19 February, 2025; originally announced March 2025.

Comments: 51 pages, 8 figures and an appendix

arXiv:2008.13707 [pdf, other]

Connecting Web Event Forecasting with Anomaly Detection: A Case Study on Enterprise Web Applications Using Self-Supervised Neural Networks

Authors: Xiaoyong Yuan, Lei Ding, Malek Ben Salem, Xiaolin Li, Dapeng Wu

Abstract: Recently, web applications have been widely used in enterprises to assist employees in providing effective and efficient business processes. Forecasting upcoming web events in enterprise web applications can be beneficial in many ways, such as efficient caching and recommendation. In this paper, we present a web event forecasting approach, DeepEvent, in enterprise web applications for better anomaly detection. DeepEvent includes three key features: web-specific neural networks to take into account the characteristics of sequential web events, self-supervised learning techniques to overcome the scarcity of labeled data, and sequence embedding techniques to integrate contextual events and capture dependencies among web events. We evaluate DeepEvent on web events collected from six real-world enterprise web applications. Our experimental results demonstrate that DeepEvent is effective in forecasting sequential web events and detecting web-based anomalies. DeepEvent provides a context-based system for researchers and practitioners to better forecast web events with situational awareness.

Submitted 7 September, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: accepted at EAI SecureComm 2020

arXiv:2008.06612 [pdf, other]

doi 10.1109/TrustCom50675.2020.00184

Are Smart Home Devices Abandoning IPV Victims?

Authors: Ahmed Alshehri, Malek Ben Salem, Lei Ding

Abstract: Smart home devices have brought us many benefits such as advanced security, convenience, and entertainment. However, these devices have also had unintended consequences, such as giving device owners ultimate power over their intimate partners in the same household, which might lead to tech-facilitated domestic abuse (tech-abuse), as recent research has shown. In this paper, we systematize findings on tech-abuse in smart homes. We show that domestic abuse and Intimate Partner Violence (IPV) in smart homes are more effective and less risky for abusers. Victims find it more harmful and more challenging to protect themselves against. We articulate a comprehensive analysis of all the phases of abuse in smart homes and categorize risks and needs in each phase. Technical analysis of current smart home technologies is conducted to shed light upon their limitations. We also summarize recent recommendations to combat tech-abuse in smart homes and focus on their potential and shortcomings. Unsurprisingly, we find that many recommendations conflict with each other due to a lack of understanding of phases of abuse in smart homes. Desirable properties to design abuse-resistant smart home devices are proposed for all the phases of abuse. The research community benefits from our analysis and recommendations to move forward with a focus on filling the blind spots of existing smart home devices' safety measures and building appropriate safety measures that consider tech-abuse threats in smart homes.

Submitted 14 August, 2020; originally announced August 2020.

arXiv:1512.07560 [pdf, other]

Universal Prediction Distribution for Surrogate Models

Authors: Malek Ben Salem, Olivier Roustant, Fabrice Gamboa, Lionel Tomaso

Abstract: The use of surrogate models instead of computationally expensive simulation codes is very convenient in engineering. Roughly speaking, there are two kinds of surrogate models: the deterministic and the probabilistic ones. The latter are generally based on Gaussian assumptions. The main advantage of the probabilistic approach is that it provides a measure of uncertainty associated with the surrogate model in the whole space. This uncertainty is an efficient tool to construct strategies for various problems such as prediction enhancement, optimization or inversion. In this paper, we propose a universal method to define a measure of uncertainty suitable for any surrogate model, either deterministic or probabilistic. It relies on Cross-Validation (CV) sub-model predictions. This empirical distribution may be computed in much more general frames than the Gaussian one. It is therefore called the Universal Prediction distribution (UP distribution). It allows the definition of many sampling criteria. We give and study adaptive sampling techniques for global refinement and an extension of the so-called Efficient Global Optimization (EGO) algorithm. We also discuss the use of the UP distribution for inversion problems. The performance of these new algorithms is studied both on toy models and on an engineering design problem.

Submitted 23 December, 2015; originally announced December 2015.

Showing 1–4 of 4 results for author: Salem, M B