Showing 1–32 of 32 results for author: Bartolo, M

Searching in archive cs.

  1. arXiv:2505.15795  [pdf, ps, other]

    cs.CL

    Reverse Engineering Human Preferences with Reinforcement Learning

    Authors: Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo

    Abstract: The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework--known as LLM-as-a-judge--is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate…

    Submitted 21 May, 2025; originally announced May 2025.

  2. arXiv:2504.00698  [pdf]

    cs.CL cs.AI cs.LG

    Command A: An Enterprise-Ready Large Language Model

    Authors: Team Cohere: Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, Walter Beller-Morales, Alexandre Bérard, Andrew Berneshawi, Anna Bialas, Phil Blunsom, et al. (205 additional authors not shown)

    Abstract: In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Genera…

    Submitted 14 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: 55 pages

  3. arXiv:2503.05731  [pdf, other]

    cs.CY cs.AI

    AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

    Authors: Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade--Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, et al. (77 additional authors not shown)

    Abstract: The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance…

    Submitted 18 April, 2025; v1 submitted 19 February, 2025; originally announced March 2025.

    Comments: 51 pages, 8 figures and an appendix

  4. arXiv:2502.08550  [pdf, other]

    cs.CL cs.AI

    No Need for Explanations: LLMs can implicitly learn from mistakes in-context

    Authors: Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Tan Yi-Chern, Marek Rei, Max Bartolo

    Abstract: Showing incorrect answers to Large Language Models (LLMs) is a popular strategy to improve their performance in reasoning-intensive tasks. It is widely assumed that, in order to be helpful, the incorrect answers must be accompanied by comprehensive rationales, explicitly detailing where the mistakes are and how to correct them. However, in this work we present a counterintuitive finding: we observ…

    Submitted 21 May, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

  5. arXiv:2501.17195  [pdf, other]

    cs.CL cs.AI

    Atla Selene Mini: A General Purpose Evaluation Model

    Authors: Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park

    Abstract: We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong…

    Submitted 27 January, 2025; originally announced January 2025.

  6. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  7. arXiv:2411.12580  [pdf, other]

    cs.CL cs.LG

    Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

    Authors: Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo

    Abstract: The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume o…

    Submitted 6 March, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

    Comments: Published at ICLR 2025

  8. arXiv:2411.02844  [pdf, other]

    cs.CV cs.AI cs.LG

    Correlation of Object Detection Performance with Visual Saliency and Depth Estimation

    Authors: Matthias Bartolo, Dylan Seychell

    Abstract: As object detection techniques continue to evolve, understanding their relationships with complementary visual tasks becomes crucial for optimising model architectures and computational resources. This paper investigates the correlations between object detection accuracy and two fundamental visual tasks: depth prediction and visual saliency prediction. Through comprehensive experiments using state…

    Submitted 5 November, 2024; originally announced November 2024.

    Comments: Code Available at: https://github.com/mbar0075/Object-Detection-Correlation-Saliency-vs-Depth

  9. arXiv:2410.11677  [pdf, other]

    cs.CL cs.AI cs.LG

    Understanding Likelihood Over-optimisation in Direct Alignment Algorithms

    Authors: Zhengyan Shi, Sander Land, Acyr Locatelli, Matthieu Geist, Max Bartolo

    Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as Proximal Policy Optimisation (PPO) for aligning language models to human preferences, without the need for explicit reward modelling. These methods generally aim to in…

    Submitted 18 October, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: Preprint Version

  10. arXiv:2408.06804  [pdf, other]

    cs.SD cs.AI eess.AS

    Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation

    Authors: Matthias Bartolo

    Abstract: In the fields of security systems, forensic investigations, and personalized services, the importance of speech as a fundamental human input outweighs text-based interactions. This research delves deeply into the complex field of Speaker Identification (SID), examining its essential components and emphasising Mel Spectrogram and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. Mo…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Resultant work from Assignment, Department of AI, University of Malta. Code available at: https://github.com/mbar0075/Speech-Technology

  11. arXiv:2408.06803  [pdf, other]

    cs.CV cs.AI cs.LG

    Integrating Saliency Ranking and Reinforcement Learning for Enhanced Object Detection

    Authors: Matthias Bartolo, Dylan Seychell, Josef Bajada

    Abstract: With the ever-growing variety of object detection approaches, this study explores a series of experiments that combine reinforcement learning (RL)-based visual attention methods with saliency ranking techniques to investigate transparent and sustainable solutions. By integrating saliency ranking for initial bounding box prediction and subsequently applying RL techniques to refine these predictions…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Resultant work from Dissertation, Department of AI, University of Malta. Code available at: https://github.com/mbar0075/SaRLVision

  12. arXiv:2405.20850  [pdf, other]

    cs.CL

    Improving Reward Models with Synthetic Critiques

    Authors: Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé

    Abstract: Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unse…

    Submitted 18 October, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  13. arXiv:2405.15032  [pdf, other]

    cs.CL

    Aya 23: Open Weight Releases to Further Multilingual Progress

    Authors: Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

    Abstract: This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modelin…

    Submitted 31 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  14. arXiv:2405.05417  [pdf]

    cs.CL

    Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

    Authors: Sander Land, Max Bartolo

    Abstract: The disconnect between tokenizer creation and model training in language models allows for specific inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour. Although such 'glitch tokens', tokens present in the tokenizer vocabulary but that are nearly or entirely absent during model training, have been observed across various models, a reliable method to identify an…

    Submitted 27 September, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: 16 pages, 6 figures. Accepted at EMNLP 2024, main track. For associated code, see https://github.com/cohere-ai/magikarp/

    MSC Class: 68T50

  15. arXiv:2404.16019  [pdf, other]

    cs.CL

    The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

    Authors: Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale

    Abstract: Human feedback is central to the alignment of Large Language Models (LLMs). However, open questions remain about methods (how), domains (where), people (who) and objectives (to what end) of feedback processes. To navigate these questions, we introduce PRISM, a dataset that maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual prefere…

    Submitted 3 December, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Journal ref: The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024)

  16. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  17. arXiv:2403.12075  [pdf, other]

    cs.CY cs.AI cs.CR cs.CV cs.LG

    Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

    Authors: Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo

    Abstract: With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on "implicitly adversarial" prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativit…

    Submitted 13 May, 2024; v1 submitted 14 February, 2024; originally announced March 2024.

    Comments: 10 pages, 6 figures

  18. arXiv:2402.06619  [pdf, other]

    cs.CL cs.AI

    Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

    Authors: Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, et al. (8 additional authors not shown)

    Abstract: Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets.…

    Submitted 9 February, 2024; originally announced February 2024.

  19. arXiv:2311.13028  [pdf, other]

    cs.LG cs.AI cs.DC eess.SP

    DMLR: Data-centric Machine Learning Research -- Past, Present and Future

    Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, et al. (13 additional authors not shown)

    Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow…

    Submitted 1 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) at https://data.mlr.press/assets/pdf/v01-5.pdf

  20. arXiv:2309.16349  [pdf, other]

    cs.CL

    Human Feedback is not Gold Standard

    Authors: Tom Hosking, Phil Blunsom, Max Bartolo

    Abstract: Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single 'preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback f…

    Submitted 16 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at ICLR 2024

  21. arXiv:2305.14384  [pdf, other]

    cs.LG cs.AI cs.CR cs.CV

    Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

    Authors: Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Max Bartolo, Oana Inel, Juan Ciro, Rafael Mosquera, Addison Howard, Will Cukierski, D. Sculley, Vijay Janapa Reddi, Lora Aroyo

    Abstract: The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors…

    Submitted 22 May, 2023; originally announced May 2023.

    MSC Class: 14J68 (Primary)

  22. arXiv:2207.10062  [pdf, other]

    cs.LG

    DataPerf: Benchmarks for Data-Centric AI Development

    Authors: Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo, Bilge Acun, Lingjiao Chen, Mehul Smriti Raje, Max Bartolo, Sabri Eyuboglu, Amirata Ghorbani, Emmett Goodman, et al. (20 additional authors not shown)

    Abstract: Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing datase…

    Submitted 13 October, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  23. arXiv:2204.03162  [pdf, other]

    cs.CV cs.CL

    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

    Authors: Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

    Abstract: We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annot…

    Submitted 22 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: CVPR 2022

  24. arXiv:2204.01906  [pdf, other]

    cs.CL cs.AI

    Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

    Authors: Tristan Thrush, Kushal Tirumala, Anmol Gupta, Max Bartolo, Pedro Rodriguez, Tariq Kane, William Gaviria Rojas, Peter Mattson, Adina Williams, Douwe Kiela

    Abstract: We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models, as well as for conducting model in the loop data collection with crowdworkers. Dynatask is integrated with Dynabench, a research platform for rethinking benchmarking in AI that facilitates human a…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: ACL System Demos 2022

  25. arXiv:2112.09062  [pdf, other]

    cs.CL

    Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

    Authors: Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, Douwe Kiela

    Abstract: In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more cost…

    Submitted 17 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

  26. arXiv:2109.04385  [pdf, other]

    cs.CL

    Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification

    Authors: Maximilian Mozes, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg, Lewis D. Griffin

    Abstract: Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of wheth…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  27. arXiv:2104.14337  [pdf, other]

    cs.CL cs.AI

    Dynabench: Rethinking Benchmarking in NLP

    Authors: Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, Adina Williams

    Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary model…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: NAACL 2021

  28. arXiv:2104.08786  [pdf, other]

    cs.CL cs.AI

    Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

    Authors: Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, Pontus Stenetorp

    Abstract: When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are…

    Submitted 3 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: ACL 2022

  29. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation

    Authors: Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, Douwe Kiela

    Abstract: Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive which limits the scale of the collected data. In this work, we are the first to use synthetic ad…

    Submitted 15 March, 2022; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021

    Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p.8830-8848. Association for Computational Linguistics

  30. arXiv:2003.04808  [pdf, other]

    cs.CL cs.LG stat.ML

    Undersensitivity in Neural Reading Comprehension

    Authors: Johannes Welbl, Pasquale Minervini, Max Bartolo, Pontus Stenetorp, Sebastian Riedel

    Abstract: Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where…

    Submitted 15 February, 2020; originally announced March 2020.

    Comments: 15 pages, 4 figures

  31. Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

    Authors: Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, Pontus Stenetorp

    Abstract: Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, col…

    Submitted 22 September, 2020; v1 submitted 1 February, 2020; originally announced February 2020.

    Journal ref: Transactions of the Association for Computational Linguistics, Volume 8, 2020 p.662-678

  32. arXiv:1809.01494  [pdf, other]

    cs.CL cs.LG stat.ML

    Interpretation of Natural Language Rules in Conversational Machine Reading

    Authors: Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, Sebastian Riedel

    Abstract: Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regul…

    Submitted 28 August, 2018; originally announced September 2018.

    Comments: EMNLP 2018