-
Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations
Authors:
Kevin L. Wei,
Patricia Paskov,
Sunishchal Dev,
Michael J. Byun,
Anka Reuel,
Xavier Roberts-Gaal,
Rachel Calcott,
Evie Coxon,
Chinmay Deshpande
Abstract:
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
Submitted 9 June, 2025;
originally announced June 2025.
-
The cellular automaton pulsing model, experiments with DDLab
Authors:
Andrew Wuensche,
Edward Coxon
Abstract:
The cellular automaton (CA) pulsing model (arXiv:1806.06416) described the surprising phenomenon of spontaneous, sustained and robust rhythmic oscillations, pulsing dynamics, when random wiring is applied to a 2D 'glider' rule running in a 3-value totalistic CA. Case studies, pulsing measures, possible mechanisms, and implications for oscillatory networks in biology were presented. In this paper we summarise the results, extend the entropy-density and density-return map plots to include a linked history, look at totalistic glider rules with neighborhoods of 3, 4 and 5, as well as 6 and 7 studied previously, introduce methods to automatically recognise the wavelength, and extend results for randomly asynchronous updating. We show how the model is implemented in DDLab to validate results, output data, and allow experiments and research by others.
Submitted 29 November, 2018;
originally announced November 2018.
-
Pulsing dynamics in randomly wired glider cellular automata
Authors:
Andrew Wuensche,
Edward Coxon
Abstract:
Sustained rhythmic oscillations, pulsing dynamics, emerge spontaneously when the local connection scheme is randomised in 3-value cellular automata that feature "glider" dynamics. Time-plots of pulsing measures maintain a distinct waveform for each glider rule, and scatter plots of entropy/density and the density return-map show unique signatures, which have the characteristics of chaotic strange attractors. We present case studies, possible mechanisms, and implications for oscillatory networks in biology.
Submitted 5 August, 2018; v1 submitted 17 June, 2018;
originally announced June 2018.