-
A Primer on the Data Cleaning Pipeline
Authors:
Rebecca C. Steorts
Abstract:
The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather merging multiple data sources, has also grown. Specifically, the sci…
▽ More
The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather merging multiple data sources, has also grown. Specifically, the science of the ``data cleaning pipeline'' contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on ``cleaned data.'' This article provides a review of this emerging field, introducing technical terminology and commonly used methods.
△ Less
Submitted 24 July, 2023;
originally announced July 2023.
-
Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors
Authors:
Neil G. Marchant,
Benjamin I. P. Rubinstein,
Rebecca C. Steorts
Abstract:
Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, wh…
▽ More
Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, which corresponds to a special class of random partition models. Second, we propose a more realistic distortion model for categorical/discrete record attributes, which corrects a logical inconsistency with the standard hit-miss model. Third, we incorporate hyperpriors to improve flexibility. Fourth, we employ a partially collapsed Gibbs sampler for inferential speedups. Using a selection of private and nonprivate data sets, we investigate the impact of our modeling contributions and compare our model with two alternative Bayesian models. In addition, we conduct a simulation study for household survey data, where we vary distortion, duplication rates and data set size. We find that our model performs more consistently than the alternatives across a variety of scenarios and typically achieves the highest entity resolution accuracy (F1 score). Open source software is available for our proposed methodology, and we provide a discussion regarding our work and future directions.
△ Less
Submitted 7 January, 2023;
originally announced January 2023.
-
(Almost) All of Entity Resolution
Authors:
Olivier Binette,
Rebecca C. Steorts
Abstract:
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrat…
▽ More
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940's and 50's that have led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, citation networks, among others. Finally, we discuss current research topics of practical importance.
△ Less
Submitted 17 January, 2022; v1 submitted 10 August, 2020;
originally announced August 2020.
-
d-blink: Distributed End-to-End Bayesian Entity Resolution
Authors:
Neil G. Marchant,
Andee Kaplan,
Daniel N. Elazar,
Benjamin I. P. Rubinstein,
Rebecca C. Steorts
Abstract:
Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing mode…
▽ More
Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naïve approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets---including a case study on the 2010 Decennial Census---demonstrate the scalability and effectiveness of our approach.
△ Less
Submitted 22 September, 2020; v1 submitted 13 September, 2019;
originally announced September 2019.
-
Probabilistic Blocking with An Application to the Syrian Conflict
Authors:
Rebecca C. Steorts,
Anshumali Shrivastava
Abstract:
Entity resolution seeks to merge databases as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into…
▽ More
Entity resolution seeks to merge databases as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method.
△ Less
Submitted 10 October, 2018;
originally announced October 2018.
-
A Practical Approach to Proper Inference with Linked Data
Authors:
Andee Kaplan,
Brenda Betancourt,
Rebecca C. Steorts
Abstract:
Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incor…
▽ More
Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated data sets and one application -- determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.
△ Less
Submitted 8 February, 2022; v1 submitted 2 October, 2018;
originally announced October 2018.
-
Unique Entity Estimation with Application to the Syrian Conflict
Authors:
Beidi Chen,
Anshumali Shrivastava,
Rebecca C. Steorts
Abstract:
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus o…
▽ More
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of challenges and efforts involved in solving a real, noisy challenging problem where modeling assumptions may not hold.
△ Less
Submitted 7 October, 2017;
originally announced October 2017.
-
Performance Bounds for Graphical Record Linkage
Authors:
Rebecca C. Steorts,
Matt Barnes,
Willie Neiswanger
Abstract:
Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a res…
▽ More
Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat record linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.
△ Less
Submitted 7 March, 2017;
originally announced March 2017.
-
Blocking Methods Applied to Casualty Records from the Syrian Conflict
Authors:
Peter Sadosky,
Anshumali Shrivastava,
Megan Price,
Rebecca C. Steorts
Abstract:
Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation rel…
▽ More
Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation relies on both record linkage and multiple systems estimation. A key first step in this process is identifying ways to partition the records such that they are computationally manageable. This step is referred to as blocking and is a major challenge for the Syrian database since it is sparse in the number of duplicate records and feature poor in its attributes. As a consequence, we propose locality sensitive hashing (LSH) methods to overcome these challenges. We demonstrate the computational superiority and error rates of these methods by comparing our proposed approach with others in the literature. We conclude with a discussion of many challenges of merging LSH with record linkage to achieve an estimate of the number of uniquely documented deaths in the Syrian conflict.
△ Less
Submitted 26 October, 2015;
originally announced October 2015.
-
A Comparison of Blocking Methods for Record Linkage
Authors:
Rebecca C. Steorts,
Samuel L. Ventura,
Mauricio Sadinle,
Stephen E. Fienberg
Abstract:
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sens…
▽ More
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
△ Less
Submitted 11 July, 2014;
originally announced July 2014.