Showing 1–10 of 10 results for author: Steorts, R C

  1. arXiv:2307.13219  [pdf, other]

    cs.DB cs.LG

    A Primer on the Data Cleaning Pipeline

    Authors: Rebecca C. Steorts

    Abstract: The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather merging multiple data sources, has also grown. Specifically, the sci…

    Submitted 24 July, 2023; originally announced July 2023.

  2. Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

    Authors: Neil G. Marchant, Benjamin I. P. Rubinstein, Rebecca C. Steorts

    Abstract: Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, wh…

    Submitted 7 January, 2023; originally announced January 2023.

    Comments: 27 pages, 4 figures, 3 tables. Includes 37 pages of appendices. This is an accepted manuscript to be published in the Journal of Survey Statistics and Methodology

  3. arXiv:2008.04443  [pdf, other]

    stat.ME cs.DB stat.ML

    (Almost) All of Entity Resolution

    Authors: Olivier Binette, Rebecca C. Steorts

    Abstract: Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrat…

    Submitted 17 January, 2022; v1 submitted 10 August, 2020; originally announced August 2020.

  4. arXiv:1909.06039  [pdf, other]

    stat.CO cs.DB cs.LG stat.ML

    d-blink: Distributed End-to-End Bayesian Entity Resolution

    Authors: Neil G. Marchant, Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, Rebecca C. Steorts

    Abstract: Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing mode…

    Submitted 22 September, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: 32 pages, 6 figures, 5 tables. Includes 22 pages of supplementary material. This revision incorporates a case study on the 2010 U.S. Decennial Census

    MSC Class: 62F15; 65C40; 68W15

  5. arXiv:1810.05497  [pdf, other]

    cs.DB cs.LG stat.AP stat.ML

    Probabilistic Blocking with An Application to the Syrian Conflict

    Authors: Rebecca C. Steorts, Anshumali Shrivastava

    Abstract: Entity resolution seeks to merge databases as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into…

    Submitted 10 October, 2018; originally announced October 2018.

    Comments: 16 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:1510.07714, arXiv:1710.02690

    Journal ref: Steorts R.C., Shrivastava A. (2018) Probabilistic Blocking with an Application to the Syrian Conflict. PSD (2018)

  6. arXiv:1810.01538  [pdf, other]

    stat.ME cs.DB cs.LG

    A Practical Approach to Proper Inference with Linked Data

    Authors: Andee Kaplan, Brenda Betancourt, Rebecca C. Steorts

    Abstract: Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incor…

    Submitted 8 February, 2022; v1 submitted 2 October, 2018; originally announced October 2018.

    Comments: 31 pages, 2 figures

  7. arXiv:1710.02690  [pdf, other]

    stat.AP cs.DB cs.DS

    Unique Entity Estimation with Application to the Syrian Conflict

    Authors: Beidi Chen, Anshumali Shrivastava, Rebecca C. Steorts

    Abstract: Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus o…

    Submitted 7 October, 2017; originally announced October 2017.

    Comments: 35 pages, 6 figures, 2 tables

  8. arXiv:1703.02679  [pdf, other]

    math.ST cs.IT stat.ME stat.ML

    Performance Bounds for Graphical Record Linkage

    Authors: Rebecca C. Steorts, Matt Barnes, Willie Neiswanger

    Abstract: Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a res…

    Submitted 7 March, 2017; originally announced March 2017.

    Comments: 11 pages with supplement; 4 figures and 2 tables; to appear in AISTATS 2017

  9. arXiv:1510.07714  [pdf, other]

    stat.AP cs.DB

    Blocking Methods Applied to Casualty Records from the Syrian Conflict

    Authors: Peter Sadosky, Anshumali Shrivastava, Megan Price, Rebecca C. Steorts

    Abstract: Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation rel…

    Submitted 26 October, 2015; originally announced October 2015.

    Comments: 25 pages, 6 figures

  10. arXiv:1407.3191  [pdf, other]

    cs.DB stat.AP

    A Comparison of Blocking Methods for Record Linkage

    Authors: Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, Stephen E. Fienberg

    Abstract: Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sens…

    Submitted 11 July, 2014; originally announced July 2014.

    Comments: 22 pages, 2 tables, 7 figures