Showing 1–7 of 7 results for author: Binette, O

Searching in archive stat.
  1. arXiv:2406.10366  [pdf, other]

    cs.LG stat.AP stat.ME

    Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework

    Authors: Olivier Binette, Jerome P. Reiter

    Abstract: Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international c…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 25 pages, 2 figures, 3 tables

  2. arXiv:2404.05622  [pdf, other]

    cs.CL cs.LG stat.ME

    How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation

    Authors: Olivier Binette, Youngsoo Baek, Siddharth Engineer, Christina Jones, Abel Dasylva, Jerome P. Reiter

    Abstract: Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: 33 pages, 11 figures

  3. arXiv:2311.13923  [pdf, ps, other]

    stat.ME

    Optimal $F$-score Clustering for Bipartite Record Linkage

    Authors: Eric A. Bai, Olivier Binette, Jerome P. Reiter

    Abstract: Probabilistic record linkage is often used to match records from two files, in particular when the variables common to both files comprise imperfectly measured identifiers like names and demographic variables. We consider bipartite record linkage settings in which each entity appears at most once within a file, i.e., there are no duplicates within the files, but some entities appear in both files.…

    Submitted 4 December, 2023; v1 submitted 23 November, 2023; originally announced November 2023.

  4. arXiv:2210.01230  [pdf, other]

    cs.DL cs.DB cs.LG stat.ME

    Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org

    Authors: Olivier Binette, Sokhna A York, Emma Hickerson, Youngsoo Baek, Sarvo Madhavan, Christina Jones

    Abstract: This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated by PatentsView.org, a U.S. Patents and Trademarks Office patent data exploration tool that disambiguates patent inventors using an entity resolution algorithm. We provide a data collection methodology and tailored performance estimators that account for sampling biases. Our approach is simple, pr…

    Submitted 17 April, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: 20 pages, 4 figures

    Journal ref: The American Statistician (2023)

  5. arXiv:2112.01594  [pdf, other]

    stat.ME stat.AP

    On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery

    Authors: Olivier Binette, Rebecca C. Steorts

    Abstract: The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long-standing controversy, disagreements have re-surfaced regarding the underlying MSE assumptions, the robustness of MSE methodology, and the accuracy of MSE estimates in this ap…

    Submitted 2 December, 2021; originally announced December 2021.

    Journal ref: Journal of the Royal Statistical Society: Series A (Statistics in Society), 1 - 37 (2022)

  6. arXiv:2008.04443  [pdf, other]

    stat.ME cs.DB stat.ML

    (Almost) All of Entity Resolution

    Authors: Olivier Binette, Rebecca C. Steorts

    Abstract: Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrat…

    Submitted 17 January, 2022; v1 submitted 10 August, 2020; originally announced August 2020.

  7. arXiv:1807.00305  [pdf, other]

    stat.ME math.ST

    Bayesian Nonparametrics for Directional Statistics

    Authors: Olivier Binette, Simon Guillotte

    Abstract: We introduce a density basis of the trigonometric polynomials that is suitable to mixture modelling. Statistical and geometric properties are derived, suggesting it as a circular analogue to the Bernstein polynomial densities. Nonparametric priors are constructed using this basis and a simulation study shows that the use of the resulting Bayes estimator may provide gains over comparable circular d…

    Submitted 25 February, 2019; v1 submitted 1 July, 2018; originally announced July 2018.

    Comments: 29 pages, 5 figures