Search | arXiv e-print repository

A comprehensive review of classifier probability calibration metrics

Abstract: Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under or over confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing… ▽ More Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under or over confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, for assurance in safety or business-critical contexts, and for building user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier and object detection models, organising them according to a number of different categorisations to highlight their relationships. We identify 82 major metrics, which can be grouped into four classifier families (point-based, bin-based, kernel or curve-based, and cumulative) and an object detection family. For each metric, we provide equations where available, facilitating implementation and comparison by future researchers. △ Less

Submitted 25 April, 2025; originally announced April 2025.

Comments: 60 pages, 7 figures

arXiv:2502.14545 [pdf, other]

An Entropic Metric for Measuring Calibration of Machine Learning Models

Authors: Daniel James Sumler, Lee Devlin, Simon Maskell, Richard O. Lane

Abstract: Understanding the confidence with which a machine learning model classifies an input datum is an important, and perhaps under-investigated, concept. In this paper, we propose a new calibration metric, the Entropic Calibration Difference (ECD). Based on existing research in the field of state estimation, specifically target tracking (TT), we show how ECD may be applied to binary classification mach… ▽ More Understanding the confidence with which a machine learning model classifies an input datum is an important, and perhaps under-investigated, concept. In this paper, we propose a new calibration metric, the Entropic Calibration Difference (ECD). Based on existing research in the field of state estimation, specifically target tracking (TT), we show how ECD may be applied to binary classification machine learning models. We describe the relative importance of under- and over-confidence and how they are not conflated in the TT literature. Indeed, our metric distinguishes under- from over-confidence. We consider this important given that algorithms that are under-confident are likely to be 'safer' than algorithms that are over-confident, albeit at the expense of also being over-cautious and so statistically inefficient. We demonstrate how this new metric performs on real and simulated data and compare with other metrics for machine learning model probability calibration, including the Expected Calibration Error (ECE) and its signed counterpart, the Expected Signed Calibration Error (ESCE). △ Less

Submitted 20 February, 2025; originally announced February 2025.

arXiv:2502.00013 [pdf]

Behavioural Analytics: Mathematics of the Mind

Authors: Richard Lane, Hannah State-Davey, Claire Taylor, Wendy Holmes, Rachel Boon, Mark Round

Abstract: Behavioural analytics provides insights into individual and crowd behaviour, enabling analysis of what previously happened and predictions for how people may be likely to act in the future. In defence and security, this analysis allows organisations to achieve tactical and strategic advantage through influence campaigns, a key counterpart to physical activities. Before action can be taken, online… ▽ More Behavioural analytics provides insights into individual and crowd behaviour, enabling analysis of what previously happened and predictions for how people may be likely to act in the future. In defence and security, this analysis allows organisations to achieve tactical and strategic advantage through influence campaigns, a key counterpart to physical activities. Before action can be taken, online and real-world behaviour must be analysed to determine the level of threat. Huge data volumes mean that automated processes are required to attain an accurate understanding of risk. We describe the mathematical basis of technologies to analyse quotes in multiple languages. These include a Bayesian network to understand behavioural factors, state estimation algorithms for time series analysis, and machine learning algorithms for classification. We present results from studies of quotes in English, French, and Arabic, from anti-violence campaigners, politicians, extremists, and terrorists. The algorithms correctly identify extreme statements; and analysis at individual, group, and population levels detects both trends over time and sharp changes attributed to major geopolitical events. Group analysis shows that additional population characteristics can be determined, such as polarisation over particular issues and large-scale shifts in attitude. Finally, MP voting behaviour and statements from publicly-available records are analysed to determine the level of correlation between what people say and what they do. △ Less

Submitted 7 January, 2025; originally announced February 2025.

Comments: 19 pages, 14 figures, presented at 7th IMA Conference on Mathematics in Defence and Security, London, UK, 7 September 2023 (conference page at https://ima.org.uk/20850/7th-ima-defence/)

arXiv:2501.06707 [pdf, ps, other]

ELIZA Reanimated: The world's first chatbot restored on the world's first time sharing system

Authors: Rupert Lane, Anthony Hay, Arthur Schwarz, David M. Berry, Jeff Shrager

Abstract: ELIZA, created by Joseph Weizenbaum at MIT in the early 1960s, is usually considered the world's first chatbot. It was developed in MAD-SLIP on MIT's CTSS, the world's first time-sharing system, on an IBM 7094. We discovered an original ELIZA printout in Prof. Weizenbaum's archives at MIT, including an early version of the famous DOCTOR script, a nearly complete version of the MAD-SLIP code, and v… ▽ More ELIZA, created by Joseph Weizenbaum at MIT in the early 1960s, is usually considered the world's first chatbot. It was developed in MAD-SLIP on MIT's CTSS, the world's first time-sharing system, on an IBM 7094. We discovered an original ELIZA printout in Prof. Weizenbaum's archives at MIT, including an early version of the famous DOCTOR script, a nearly complete version of the MAD-SLIP code, and various support functions in MAD and FAP. Here we describe the reanimation of this original ELIZA on a restored CTSS, itself running on an emulated IBM 7094. The entire stack is open source, so that any user of a unix-like OS can run the world's first chatbot on the world's first time-sharing system. △ Less

Submitted 11 January, 2025; originally announced January 2025.

Comments: In review

Showing 1–4 of 4 results for author: Lane, R