-
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Authors:
Olivier Binette,
Jerome P. Reiter
Abstract:
Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal on examples of commonly used evaluation methodologies - involving cross-validation, clustering evaluation, and LLM benchmarking - that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how the estimands framework can help uncover underlying issues, their causes, and potential solutions. Ultimately, we believe this framework can improve the validity of evaluations through better-aligned inference, and help decision-makers and model users interpret reported results more effectively.
Submitted 14 June, 2024;
originally announced June 2024.
-
How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation
Authors:
Olivier Binette,
Youngsoo Baek,
Siddharth Engineer,
Christina Jones,
Abel Dasylva,
Jerome P. Reiter
Abstract:
Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets without necessitating complex sampling schemes. These benchmark data sets can then be used for model training and a variety of evaluation tasks. Specifically, we propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics, estimating key performance metrics such as cluster and pairwise precision and recall, and analyzing root causes for errors. We validate the framework in an application to inventor name disambiguation and through simulation studies. Software: https://github.com/OlivierBinette/er-evaluation/
Submitted 8 April, 2024;
originally announced April 2024.
-
Optimal $F$-score Clustering for Bipartite Record Linkage
Authors:
Eric A. Bai,
Olivier Binette,
Jerome P. Reiter
Abstract:
Probabilistic record linkage is often used to match records from two files, in particular when the variables common to both files comprise imperfectly measured identifiers like names and demographic variables. We consider bipartite record linkage settings in which each entity appears at most once within a file, i.e., there are no duplicates within the files, but some entities appear in both files. In this setting, the analyst desires a point estimate of the linkage structure that matches each record to at most one record from the other file. We propose an approach for obtaining this point estimate by maximizing the expected $F$-score for the linkage structure. We target the approach for record linkage methods that produce either (an approximate) posterior distribution of the unknown linkage structure or probabilities of matches for record pairs. Using simulations and applications with genuine data, we illustrate that the $F$-score estimators can lead to sensible estimates of the linkage structure.
Submitted 4 December, 2023; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org
Authors:
Olivier Binette,
Sokhna A York,
Emma Hickerson,
Youngsoo Baek,
Sarvo Madhavan,
Christina Jones
Abstract:
This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated by PatentsView.org, a U.S. Patent and Trademark Office patent data exploration tool that disambiguates patent inventors using an entity resolution algorithm. We provide a data collection methodology and tailored performance estimators that account for sampling biases. Our approach is simple, practical and principled -- key characteristics that allow us to paint the first representative picture of PatentsView's disambiguation performance. This approach is used to inform PatentsView's users of the reliability of the data and to allow the comparison of competing disambiguation algorithms.
Submitted 17 April, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery
Authors:
Olivier Binette,
Rebecca C. Steorts
Abstract:
The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long-standing controversy, disagreements have re-surfaced regarding the underlying MSE assumptions, the robustness of MSE methodology, and the accuracy of MSE estimates in this application. Our goal is to help address and move past these controversies. To do so, we review MSE, its assumptions, and commonly used models for modern slavery applications. We introduce all of the publicly available modern slavery datasets in the literature, providing a reproducible analysis and highlighting current issues. Specifically, we utilize an internal consistency approach that constructs subsets of data for which ground truth is available, allowing us to evaluate the accuracy of MSE estimators. Next, we propose a characterization of the large sample bias of estimators as a function of misspecified assumptions. Then, we propose an alternative to traditional (e.g., bootstrap-based) assessments of reliability, which allows us to visualize trajectories of MSE estimates to illustrate the robustness of estimates. Finally, our complementary analyses are used to provide guidance regarding the application and reliability of MSE methodology.
Submitted 2 December, 2021;
originally announced December 2021.
-
(Almost) All of Entity Resolution
Authors:
Olivier Binette,
Rebecca C. Steorts
Abstract:
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and that has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.
Submitted 17 January, 2022; v1 submitted 10 August, 2020;
originally announced August 2020.
-
Bayesian Nonparametrics for Directional Statistics
Authors:
Olivier Binette,
Simon Guillotte
Abstract:
We introduce a density basis of the trigonometric polynomials that is suitable for mixture modelling. Statistical and geometric properties are derived, suggesting it as a circular analogue to the Bernstein polynomial densities. Nonparametric priors are constructed using this basis and a simulation study shows that the use of the resulting Bayes estimator may provide gains over comparable circular density estimators previously suggested in the literature. From a theoretical point of view, we propose a general prior specification framework for density estimation on compact metric spaces using sieve priors. This is tailored to density bases such as the one considered herein and may also be used to exploit their particular shape-preserving properties. Furthermore, strong posterior consistency is shown to hold under notably weak regularity assumptions and adaptive convergence rates are obtained in terms of the approximation properties of positive linear operators generating our models.
Submitted 25 February, 2019; v1 submitted 1 July, 2018;
originally announced July 2018.