-
What Has Been Lost with Synthetic Evaluation?
Authors:
Alexander Gill,
Abhilasha Ravichander,
Ana Marasović
Abstract:
Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.
Submitted 2 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
International AI Safety Report
Authors:
Yoshua Bengio,
Sören Mindermann,
Daniel Privitera,
Tamay Besiroglu,
Rishi Bommasani,
Stephen Casper,
Yejin Choi,
Philip Fox,
Ben Garfinkel,
Danielle Goldfarb,
Hoda Heidari,
Anson Ho,
Sayash Kapoor,
Leila Khalatbari,
Shayne Longpre,
Sam Manning,
Vasilios Mavroudis,
Mantas Mazeika,
Julian Michael,
Jessica Newman,
Kwan Yee Ng,
Chinasa T. Okolo,
Deborah Raji,
Girish Sastry,
Elizabeth Seger, et al. (71 additional authors not shown)
Abstract:
The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report's Chair, these independent experts collectively had full discretion over the report's content.
Submitted 29 January, 2025;
originally announced January 2025.
-
Agile System Development Lifecycle for AI Systems: Decision Architecture
Authors:
Asif Q. Gill
Abstract:
Agile system development life cycle (SDLC) focuses on typical functional and non-functional system requirements for developing traditional software systems. However, Artificial Intelligence (AI) systems are different in nature and have distinct attributes such as (1) autonomy, (2) adaptiveness, (3) content generation, (4) decision-making, (5) predictability and (6) recommendation. Agile SDLC needs to be enhanced to support AI system development and ongoing post-deployment adaptation. The challenge is: how can agile SDLC be enhanced to support AI systems? The scope of this paper is limited to AI system enabled decision automation. Thus, this paper proposes the use of decision science to enhance the agile SDLC to support AI system development. Decision science is the study of decision-making, which seems useful to identify, analyse and describe decisions and their architecture subject to automation via AI systems. Specifically, this paper discusses the decision architecture in detail within the overall context of agile SDLC for AI systems. The application of the proposed approach is demonstrated with the help of an example scenario of insurance claim processing. This initial work indicates the usability of decision science for enhancing the agile SDLC for designing and implementing AI systems for decision automation. This work provides an initial foundation for further work in this new area of decision architecture and agile SDLC for AI systems.
Submitted 16 January, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
AI-based Identity Fraud Detection: A Systematic Review
Authors:
Chuo Jun Zhang,
Asif Q. Gill,
Bo Liu,
Memoona J. Anwar
Abstract:
With the rapid development of digital services, a large volume of personally identifiable information (PII) is stored online and is subject to cyberattacks such as identity fraud. Most recently, the use of Artificial Intelligence (AI) enabled deepfake technologies has significantly increased the complexity of identity fraud. Fraudsters may use these technologies to create highly sophisticated counterfeit personal identification documents, photos and videos. These advancements in the identity fraud landscape pose challenges for identity fraud detection and society at large. There is a pressing need to review and understand identity fraud detection methods, their limitations and potential solutions. This research aims to address this important need by using the well-known systematic literature review method. This paper reviewed a selected set of 43 papers across 4 major academic literature databases. In particular, the review results highlight the two types of identity fraud prevention and detection methods, in-depth and open challenges. The results were also consolidated into a taxonomy of AI-based identity fraud detection and prevention methods including key insights and trends. Overall, this paper provides a foundational knowledge base to researchers and practitioners for further research and development in this important area of digital identity fraud.
Submitted 15 January, 2025;
originally announced January 2025.
-
On Evaluating Explanation Utility for Human-AI Decision Making in NLP
Authors:
Fateme Hashemi Chaleshtori,
Atreya Ghosal,
Alexander Gill,
Purbid Bambroo,
Ana Marasović
Abstract:
Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations help people in situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle this. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies.
To aid with this, we first review existing metrics suitable for application-grounded evaluation. We then establish criteria to select appropriate datasets, and using them, we find that only 4 out of over 50 datasets available for explainability research in NLP meet them. We then demonstrate the importance of reassessing the state of the art to form and study human-AI teams: teaming people with models for certain tasks might only now start to make sense, and for others, it remains unsound. Finally, we present the exemplar studies of human-AI decision-making for one of the identified tasks -- verifying the correctness of a legal claim given a contract. Our results show that providing AI predictions, with or without explanations, does not cause decision makers to speed up their work without compromising performance. We argue for revisiting the setup of human-AI teams and improving automatic deferral of instances to AI, where explanations could play a useful role.
Submitted 4 November, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Large Language Model Watermark Stealing With Mixed Integer Programming
Authors:
Zhaoxi Zhang,
Xiaomei Zhang,
Yanjun Zhang,
Leo Yu Zhang,
Chao Chen,
Shengshan Hu,
Asif Gill,
Shirui Pan
Abstract:
The Large Language Model (LLM) watermark is a newly emerging technique that shows promise in addressing concerns surrounding LLM copyright, monitoring AI-generated text, and preventing its misuse. The LLM watermark scheme commonly includes generating secret keys to partition the vocabulary into green and red lists, applying a perturbation to the logits of tokens in the green list to increase their sampling likelihood, thus facilitating watermark detection to identify AI-generated text if the proportion of green tokens exceeds a threshold. However, recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks, such as token editing, synonym substitution, and paraphrasing, with robustness declining as the number of keys increases. Therefore, the state-of-the-art watermark schemes that employ fewer or single keys have been demonstrated to be more robust against text editing and paraphrasing. In this paper, we propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme and systematically examine its vulnerability to this attack. We formalize the attack as a mixed integer programming problem with constraints. We evaluate our attack under a comprehensive threat model, including an extreme scenario where the attacker has no prior knowledge, lacks access to the watermark detector API, and possesses no information about the LLM's parameter settings or watermark injection/detection scheme. Extensive experiments on LLMs, such as OPT and LLaMA, demonstrate that our attack can successfully steal the green list and remove the watermark across all settings.
Submitted 30 May, 2024;
originally announced May 2024.
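The green/red list mechanism summarized in the abstract above can be illustrated with a minimal sketch. This is not the scheme attacked in the paper: the keyed SHA-256 partition, the function names (`is_green`, `bias_logits`, `green_fraction`, `detect`), and the values of `gamma`, `delta`, and `threshold` are all assumptions chosen only to make the idea concrete.

```python
import hashlib


def is_green(token: str, key: str, gamma: float = 0.5) -> bool:
    """Assign a token to the green list via a keyed hash (assumed scheme).

    Roughly a fraction `gamma` of the vocabulary ends up green.
    """
    digest = hashlib.sha256((key + "\x00" + token).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a value in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < gamma


def bias_logits(logits: dict, key: str, delta: float = 2.0, gamma: float = 0.5) -> dict:
    """Add `delta` to the logit of every green token, leaving red tokens unchanged."""
    return {
        tok: score + delta if is_green(tok, key, gamma) else score
        for tok, score in logits.items()
    }


def green_fraction(tokens: list, key: str, gamma: float = 0.5) -> float:
    """Proportion of tokens in a text that fall in the green list."""
    if not tokens:
        return 0.0
    return sum(is_green(t, key, gamma) for t in tokens) / len(tokens)


def detect(tokens: list, key: str, threshold: float = 0.75, gamma: float = 0.5) -> bool:
    """Flag text as watermarked when the green proportion exceeds the threshold."""
    return green_fraction(tokens, key, gamma) > threshold
```

In this sketch, anyone who recovers which tokens are green (the attack goal described in the abstract) could rewrite text to lower `green_fraction` below the threshold, which is why the secrecy of the partition matters.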
-
A Study on Visual Perception of Light Field Content
Authors:
Ailbhe Gill,
Emin Zerman,
Cagri Ozcinar,
Aljosa Smolic
Abstract:
The effective design of visual computing systems depends heavily on the anticipation of visual attention, or saliency. While visual attention is well investigated for conventional 2D images and video, it is nevertheless a very active research area for emerging immersive media. In particular, visual attention of light fields (light rays of a scene captured by a grid of cameras or micro lenses) has only recently become a focus of research. As they may be rendered and consumed in various ways, a primary challenge that arises is the definition of what visual perception of light field content should be. In this work, we present a visual attention study on light field content. We conducted perception experiments displaying them to users in various ways and collected corresponding visual attention data. Our analysis highlights characteristics of user behaviour in light field imaging applications. The light field data set and attention data are provided with this paper.
Submitted 7 August, 2020;
originally announced August 2020.
-
Provable Emergent Pattern Formation by a Swarm of Anonymous, Homogeneous, Non-Communicating, Reactive Robots with Limited Relative Sensing and no Global Knowledge or Positioning
Authors:
Mario Coppola,
Jian Guo,
Eberhard K. A. Gill,
Guido C. H. E. de Croon
Abstract:
In this work, we explore emergent behaviors by swarms of anonymous, homogeneous, non-communicating, reactive robots that do not know their global position and have limited relative sensing. We introduce a novel method that enables such severely limited robots to autonomously arrange in a desired pattern and maintain it. The method includes an automatic proof procedure to check whether a given pattern will be achieved by the swarm from any initial configuration. An attractive feature of this proof procedure is that it is local in nature, avoiding as much as possible the computational explosion that can be expected with increasing robots, states, and action possibilities. Our approach is based on extracting the local states that constitute a global goal (in this case, a pattern). We then formally show that these local states can only coexist when the global desired pattern is achieved and that, until this occurs, there is always a sequence of actions that will lead from the current pattern to the desired pattern. Furthermore, we show that the agents will never perform actions that could a) lead to intra-swarm collisions or b) cause the swarm to separate. After an analysis of the performance of pattern formation in the discrete domain, we also test the system in continuous time and space simulations and reproduce the results using asynchronous agents operating in unbounded space. The agents successfully form the desired patterns while avoiding collisions and separation.
Submitted 18 April, 2018;
originally announced April 2018.
-
Recurrent neural networks based Indic word-wise script identification using character-wise training
Authors:
Rohun Tripathi,
Aman Gill,
Riccha Tripati
Abstract:
This paper presents a novel methodology of Indic handwritten script recognition using Recurrent Neural Networks and addresses the problem of script recognition in poor data scenarios, such as when only character level online data is available. It is based on the hypothesis that curves of online character data comprise sufficient information for prediction at the word level. Online character data is used to train RNNs using BLSTM architecture which are then used to make predictions of online word level data. These prediction results on the test set are on par with prediction results of models trained with online word data, while the training of the character level model is much less data intensive and takes much less time. Performance for binary-script models and then 5 Indic script models is reported, along with comparison with HMM models. The system is extended for offline data prediction. Raw offline data lacks the temporal information that is available in online data and required for prediction using models trained with online data. To overcome this, stroke recovery is implemented and the strokes are utilized for predicting using the online character level models. The performance on character and word level offline data is reported.
Submitted 27 December, 2018; v1 submitted 10 September, 2017;
originally announced September 2017.