-
Skewed Score: A statistical framework to assess autograders
Authors:
Magda Dubois,
Harry Coppock,
Mario Giulianelli,
Timo Flesch,
Lennart Luettgau,
Cozmin Ududec
Abstract:
The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
Submitted 9 July, 2025; v1 submitted 4 July, 2025;
originally announced July 2025.
-
Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language
Authors:
Christopher Summerfield,
Lennart Luettgau,
Magda Dubois,
Hannah Rose Kirk,
Kobi Hackenburg,
Catherine Fist,
Katarina Slama,
Nicola Ding,
Rebecca Anselmetti,
Andrew Strait,
Mario Giulianelli,
Cozmin Ududec
Abstract:
We examine recent research that asks whether current AI systems may be developing a capacity for "scheming" (covertly and strategically pursuing misaligned goals). We compare current research practices in this field to those adopted in the 1970s to test whether non-human primates could master natural language. We argue that there are lessons to be learned from that historical research endeavour, which was characterised by an overattribution of human traits to other agents, an excessive reliance on anecdote and descriptive analysis, and a failure to articulate a strong theoretical framework for the research. We recommend that research into AI scheming actively seeks to avoid these pitfalls. We outline some concrete steps that can be taken for this research programme to advance in a productive and scientifically rigorous fashion.
Submitted 4 July, 2025;
originally announced July 2025.
-
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
Authors:
Lennart Luettgau,
Harry Coppock,
Magda Dubois,
Christopher Summerfield,
Cozmin Ududec
Abstract:
As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.
Submitted 8 July, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
MOSAIC: Multiple Observers Spotting AI Content
Authors:
Matthieu Dubois,
François Yvon,
Pablo Piantanida
Abstract:
The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities, has made it easier for all to produce harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a binary classification problem. Early approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. More recent systems instead consider two LLMs and compare their probability distributions over the document to further discriminate when perplexity alone cannot. However, using a fixed pair of models can induce brittleness in performance. We extend these approaches to the ensembling of several LLMs and derive a new, theoretically grounded approach to combine their respective strengths. Our experiments, conducted with various generator LLMs, indicate that this approach effectively leverages the strengths of each model, resulting in robust detection performance across multiple domains. Our code and data are available at https://github.com/BaggerOfWords/MOSAIC .
Submitted 11 June, 2025; v1 submitted 11 September, 2024;
originally announced September 2024.
-
Streetlight Effect in Post-Publication Peer Review: Are Open Access Publications More Scrutinized?
Authors:
Abdelghani Maddi,
Emmanuel Monneau,
Catherine Gaspare,
Floriana Gargiulo,
Michel Dubois
Abstract:
The Streetlight Effect is an observation bias that occurs when individuals search for something only where it is easiest to look. Despite the significant development of Post-Publication Peer Review (PPPR) in recent years, facilitated in part by platforms such as PubPeer, existing literature has not examined whether PPPR is affected by this type of bias, that is, whether PPPR mainly concerns publications to which researchers have direct access (e.g., to analyze image duplications). In this study, we compare the Open Access (OA) structures of publishers and journals among 51,882 publications commented on PubPeer to those indexed in the OpenAlex database (156,700,177 publications). Our findings indicate that OA journals are 33% more prevalent in PubPeer than in the global total (52% for the most commented journals). This result can be attributed to disciplinary bias in PubPeer, with overrepresentation of medical and biological research (which exhibits higher levels of openness). However, after normalization, the results reveal that PPPR does not exhibit a Streetlight Effect, as OA publications, within the same discipline, are on average 16% less prevalent in PubPeer than in the global total. These results suggest that the process of scientific self-correction operates independently of publication access status.
Submitted 23 October, 2023;
originally announced November 2023.
-
Epistemic integration and social segregation of AI in neuroscience
Authors:
Sylvain Fontaine,
Floriana Gargiulo,
Michel Dubois,
Paola Tubaro
Abstract:
In recent years, Artificial Intelligence (AI) has shown a spectacular ability to insert itself into a variety of disciplines, which use it for scientific advancement and sometimes improve it to meet their conceptual and methodological needs. According to the transverse science framework originally conceived by Shinn and Joerges, AI can be seen as an instrument which is progressively acquiring a universal character through its diffusion across science. In this paper we address empirically one aspect of this diffusion, namely the penetration of AI into a specific field of research. Taking neuroscience as a case study, we conduct a scientometric analysis of the development of AI in this field. In particular, we study the temporal egocentric citation network around the articles included in this literature, the journals in which they appear, and their authors, linked together by a temporal collaboration network. We find that AI is driving the constitution of a particular disciplinary ecosystem in neuroscience which is distinct from other subfields, and which is gathering atypical scientific profiles coming from within or outside neuroscience. Moreover, we observe that this AI community in neuroscience is socially confined in a specific subspace of the neuroscience collaboration network, which also publishes in a small set of dedicated journals that are mostly active in AI research. According to these results, the diffusion of AI in a discipline such as neuroscience did not really challenge its disciplinary orientations but rather induced the constitution of a dedicated socio-cognitive environment inside this field.
Submitted 6 March, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
The STOIC2021 COVID-19 AI challenge: applying reusable training methodologies to private data
Authors:
Luuk H. Boulogne,
Julian Lorenz,
Daniel Kienzle,
Robin Schon,
Katja Ludwig,
Rainer Lienhart,
Simon Jegou,
Guang Li,
Cong Chen,
Qi Wang,
Derik Shi,
Mayug Maniparambil,
Dominik Muller,
Silvan Mertes,
Niklas Schroter,
Fabio Hellmann,
Miriam Elia,
Ine Dirks,
Matias Nicolas Bossa,
Abel Diaz Berenguer,
Tanmoy Mukherjee,
Jef Vandemeulebroucke,
Hichem Sahli,
Nikos Deligiannis,
Panagiotis Gonidakis
, et al. (13 additional authors not shown)
Abstract:
Challenges drive the state-of-the-art of automated medical image analysis. The quantity of public training data that they provide can limit the performance of their solutions. Public access to the training methodology for these solutions remains absent. This study implements the Type Three (T3) challenge format, which allows for training solutions on private data and guarantees reusable training methodologies. With T3, challenge organizers train a codebase provided by the participants on sequestered training data. T3 was implemented in the STOIC2021 challenge, with the goal of predicting from a computed tomography (CT) scan whether subjects had a severe COVID-19 infection, defined as intubation or death within one month. STOIC2021 consisted of a Qualification phase, where participants developed challenge solutions using 2000 publicly available CT scans, and a Final phase, where participants submitted their training methodologies with which solutions were trained on CT scans of 9724 subjects. The organizers successfully trained six of the eight Final phase submissions. The submitted codebases for training and running inference were released publicly. The winning solution obtained an area under the receiver operating characteristic curve for discerning between severe and non-severe COVID-19 of 0.815. The Final phase solutions of all finalists improved upon their Qualification phase solutions.
Submitted 25 June, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Hacking of the AES with Boolean Functions
Authors:
Michel Dubois,
Eric Filiol
Abstract:
One of the major issues of cryptography is the cryptanalysis of cipher algorithms. Cryptanalysis is the study of methods for obtaining the meaning of encrypted information, without access to the secret information that is normally required. Some mechanisms for breaking codes include differential cryptanalysis, advanced statistics and brute-force.
Recent works also attempt to use algebraic tools to reduce the cryptanalysis of a block cipher algorithm to the resolution of a system of quadratic equations describing the ciphering structure.
In our study, we also use algebraic tools but in a new way: by using Boolean functions and their properties. A Boolean function is a function $F_2^n \to F_2$ with $n>1$, characterized by its truth table. The arguments of Boolean functions are binary words of length $n$. Any Boolean function can be represented, uniquely, by its algebraic normal form, an equation that contains only additions modulo 2 (the XOR function) and multiplications modulo 2 (the AND function).
Our aim is to describe the AES algorithm as a set of Boolean functions and then calculate their algebraic normal forms using the Möbius transform. We then use a specific representation of these equations to facilitate their analysis, in particular to attempt a combinatorial analysis. Through this approach we obtain a new kind of system of equations. This system is more easily implementable and could open new avenues for cryptanalysis.
Submitted 13 September, 2016;
originally announced September 2016.
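The abstract above relies on converting a truth table into its algebraic normal form (ANF) with the Möbius transform. The following is a minimal Python sketch of that general technique, not the authors' implementation. The function names `mobius_transform` and `anf_terms`, and the convention that bit `i` of a truth-table index holds input `x_i`, are assumptions made for illustration.

```python
def mobius_transform(truth_table):
    """Return the ANF coefficients of a Boolean function.

    Assumed convention (illustrative): truth_table[x] is f(x), where bit i
    of the index x is the input variable x_i. The length must be 2**n.
    """
    coeffs = list(truth_table)
    size = len(coeffs)
    if size == 0 or size & (size - 1):
        raise ValueError("truth table length must be a power of two")
    step = 1
    while step < size:
        # Butterfly pass: XOR the lower half of each block into the upper half.
        for start in range(0, size, 2 * step):
            for i in range(start, start + step):
                coeffs[i + step] ^= coeffs[i]
        step *= 2
    return coeffs


def anf_terms(coeffs):
    """Render ANF coefficients as a readable XOR-of-monomials string."""
    n = (len(coeffs) - 1).bit_length()
    terms = []
    for mask, coeff in enumerate(coeffs):
        if coeff:
            # The monomial for this mask multiplies every x_i whose bit is set.
            variables = [f"x{i}" for i in range(n) if (mask >> i) & 1]
            terms.append("*".join(variables) if variables else "1")
    return " + ".join(terms) if terms else "0"


# Two-variable OR has truth table [0, 1, 1, 1] under the convention above.
or_coeffs = mobius_transform([0, 1, 1, 1])
or_anf = anf_terms(or_coeffs)
```

Here `+` stands for addition modulo 2 (XOR) and `*` for multiplication modulo 2 (AND). The transform is its own inverse, so applying `mobius_transform` to the coefficients recovers the truth table.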
-
Using n-grams models for visual semantic place recognition
Authors:
Mathieu Dubois,
Frenoux Emmanuelle,
Philippe Tarroux
Abstract:
The aim of this paper is to present a new method for visual place recognition. Our system combines global image characterization and visual words, which allows the use of efficient Bayesian filtering methods to integrate several images. More precisely, we extend the classical HMM model with techniques inspired by the field of Natural Language Processing. This paper presents our system and the Bayesian filtering algorithm. The performance of our system and the influence of the main parameters are evaluated on a standard database. The discussion highlights the interest of using such models and proposes improvements.
Submitted 21 March, 2014;
originally announced March 2014.
-
Assessment of a percutaneous iliosacral screw insertion simulator
Authors:
J. Tonetti,
L. Vadcard,
P. Girard,
M. Dubois,
P. Merloz,
Jocelyne Troccaz
Abstract:
BACKGROUND: Navigational simulator use for specialized training purposes is rather uncommon in orthopaedic and trauma surgery. However, it proves to be a valuable tool for training orthopaedic surgeons and helping them plan complex surgical procedures. PURPOSE: This work's objective was to assess the educational efficiency of a path simulator under fluoroscopic guidance applied to sacroiliac joint percutaneous screw fixation. MATERIALS AND METHODS: We evaluated 23 surgeons' accuracy inserting a guide-wire in a human cadaver experiment, following a pre-established procedure. These medical trainees were defined in three prospective respects: novice or skilled; with or without theoretical knowledge; with or without surgical procedure familiarity. Analysed criteria for each tested surgeon included the number of intraoperative X-rays taken to complete the surgical procedure as well as an iatrogenic index reflecting the surgeon's ability to detect any hazardous trajectory while performing the procedure. RESULTS: An average of 13 X-rays was required for wire implantation by the G1 group. The G2 group, assisted by the simulator, required an average of 10 X-rays. A substantial difference was especially observed within the novice sub-group (N), with an average of 12.75 X-rays for the G1 category and an average of 8.5 X-rays for the G2 category. As far as the iatrogenic index is concerned, we were unable to observe any significant difference between the groups.
Submitted 12 October, 2009;
originally announced October 2009.
-
Study of conditions of use of E-services accessible to visually disabled persons
Authors:
Marc-Eric Bobiller-Chaumon,
Michel Dubois,
Françoise Sandoz-Guermond
Abstract:
The aim of this paper is to determine the expectations that French-speaking disabled persons have for electronic administrative sites (utility). At the same time, we seek to identify the concrete difficulties that using these E-services poses for blind people (usability) and to evaluate the psychosocial impacts on the way of life of these people with specific needs. We show that the lack of digital accessibility is likely to accentuate the social exclusion these people suffer by establishing a digital glass ceiling.
Submitted 13 December, 2007;
originally announced December 2007.