-
SafeTab-H: Disclosure Avoidance for the 2020 Census Detailed Demographic and Housing Characteristics File B (Detailed DHC-B)
Authors:
William Sexton,
Skye Berghel,
Bayard Carlson,
Sam Haney,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau,
Amritha Pai,
Simran Rajpal,
David Pujol,
Ruchit Shrestha,
Daniel Simmons-Marengo
Abstract:
This article describes SafeTab-H, a disclosure avoidance algorithm applied to the release of the U.S. Census Bureau's Detailed Demographic and Housing Characteristics File B (Detailed DHC-B) as part of the 2020 Census. The tabulations contain household statistics about household type and tenure iterated by the householder's detailed race, ethnicity, or American Indian and Alaska Native tribe and v…
▽ More
This article describes SafeTab-H, a disclosure avoidance algorithm applied to the release of the U.S. Census Bureau's Detailed Demographic and Housing Characteristics File B (Detailed DHC-B) as part of the 2020 Census. The tabulations contain household statistics about household type and tenure iterated by the householder's detailed race, ethnicity, or American Indian and Alaska Native tribe and village at varying levels of geography. We describe the algorithmic strategy which is based on adding noise from a discrete Gaussian distribution and show that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy. We discuss how the implementation of the SafeTab-H codebase relies on the Tumult Analytics privacy library. We also describe the theoretical expected error properties of the algorithm and explore various aspects of its parameter tuning.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
SafeTab-P: Disclosure Avoidance for the 2020 Census Detailed Demographic and Housing Characteristics File A (Detailed DHC-A)
Authors:
Sam Haney,
Skye Berghel,
Bayard Carlson,
Ryan Cumings-Menon,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau,
Amritha Pai,
Simran Rajpal,
David Pujol,
William Sexton,
Ruchit Shrestha,
Daniel Simmons-Marengo
Abstract:
This article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the Detailed Demographic and Housing Characteristics File A (Detailed DHC-A) of the 2020 Census. The tabulations contain statistics (counts) of demographic characteristics of the entire population of the United States, crossed with detailed races and ethnicities at varying levels of geography. The…
▽ More
This article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the Detailed Demographic and Housing Characteristics File A (Detailed DHC-A) of the 2020 Census. The tabulations contain statistics (counts) of demographic characteristics of the entire population of the United States, crossed with detailed races and ethnicities at varying levels of geography. The article describes the SafeTab-P algorithm, which is based on adding noise drawn to statistics of interest from a discrete Gaussian distribution. A key innovation in SafeTab-P is the ability to adaptively choose how many statistics and at what granularity to release them, depending on the size of a population group. We prove that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy (zCDP). We then describe how the algorithm was implemented on Tumult Analytics and briefly outline the parameterization and tuning of the algorithm.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
PHSafe: Disclosure Avoidance for the 2020 Census Supplemental Demographic and Housing Characteristics File (S-DHC)
Authors:
William Sexton,
Skye Berghel,
Bayard Carlson,
Sam Haney,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau,
Amritha Pai,
Simran Rajpal,
David Pujol,
Ruchit Shrestha,
Daniel Simmons-Marengo
Abstract:
This article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the 2020 Census Supplemental Demographic and Housing Characteristics File (S-DHC). The tabulations contain statistics of counts of U.S. persons living in certain types of households, including averages. The article describes the PHSafe algorithm, which is based on adding noise drawn from a discret…
▽ More
This article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the 2020 Census Supplemental Demographic and Housing Characteristics File (S-DHC). The tabulations contain statistics of counts of U.S. persons living in certain types of households, including averages. The article describes the PHSafe algorithm, which is based on adding noise drawn from a discrete Gaussian distribution to the statistics of interest. We prove that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy. We then describe how the algorithm was implemented on Tumult Analytics and briefly outline the parameterization and tuning of the algorithm.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Publishing Wikipedia usage data with strong privacy guarantees
Authors:
Temilola Adeleye,
Skye Berghel,
Damien Desfontaines,
Michael Hay,
Isaac Johnson,
Cléo Lemoisson,
Ashwin Machanavajjhala,
Tom Magerlein,
Gabriele Modena,
David Pujol,
Daniel Simmons-Marengo,
Hal Triedman
Abstract:
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors…
▽ More
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors and academic researchers: it started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views. This new data publication uses differential privacy to provide robust guarantees to people browsing or editing Wikipedia. This paper describes this data publication: its goals, the process followed from its inception to its deployment, the algorithms used to produce the data, and the outcomes of the data release.
△ Less
Submitted 1 September, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy
Authors:
Skye Berghel,
Philip Bohannon,
Damien Desfontaines,
Charles Estes,
Sam Haney,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Tom Magerlein,
Gerome Miklau,
Amritha Pai,
William Sexton,
Ruchit Shrestha
Abstract:
In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service.
In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.