-
Ten Hard Problems in Artificial Intelligence We Must Get Right
Authors:
Gavin Leech,
Simson Garfinkel,
Misha Yagudin,
Alexander Briand,
Aleksandr Zhuravlev
Abstract:
We explore the AI2050 "hard problems" that block the promise of AI and cause AI risks: (1) developing general capabilities of the systems; (2) assuring the performance of AI systems and their training processes; (3) aligning system goals with human goals; (4) enabling great applications of AI in real life; (5) addressing economic disruptions; (6) ensuring the participation of all; (7) at the same…
▽ More
We explore the AI2050 "hard problems" that block the promise of AI and cause AI risks: (1) developing general capabilities of the systems; (2) assuring the performance of AI systems and their training processes; (3) aligning system goals with human goals; (4) enabling great applications of AI in real life; (5) addressing economic disruptions; (6) ensuring the participation of all; (7) at the same time ensuring socially responsible deployment; (8) addressing any geopolitical disruptions that AI causes; (9) promoting sound governance of the technology; and (10) managing the philosophical disruptions for humans living in the age of AI. For each problem, we outline the area, identify significant recent work, and suggest ways forward. [Note: this paper reviews literature through January 2023.]
△ Less
Submitted 19 April, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
The 2010 Census Confidentiality Protections Failed, Here's How and Why
Authors:
John M. Abowd,
Tamara Adams,
Robert Ashmead,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Nathan Goldschlag,
Daniel Kifer,
Philip Leclerc,
Ethan Lew,
Scott Moore,
Rolando A. Rodríguez,
Ramy N. Tadros,
Lars Vilhuber
Abstract:
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can veri…
▽ More
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Sharpening Your Tools: Updating bulk_extractor for the 2020s
Authors:
Simson Garfinkel,
Jonathan Stewart
Abstract:
Bulk_extractor is a high-performance digital forensics tool written in C++. Between 2018 and 2022 we updated the program from C++98 to C++17, performed a complete code refactoring, and adopted a unit test framework. The new version typically runs with 75\% more throughput than the previous version, which we attribute to improved multithreading. We provide lessons and recommendations for other digi…
▽ More
Bulk_extractor is a high-performance digital forensics tool written in C++. Between 2018 and 2022 we updated the program from C++98 to C++17, performed a complete code refactoring, and adopted a unit test framework. The new version typically runs with 75\% more throughput than the previous version, which we attribute to improved multithreading. We provide lessons and recommendations for other digital forensics tool maintainers.
△ Less
Submitted 31 March, 2022;
originally announced August 2022.
-
The 2020 Census Disclosure Avoidance System TopDown Algorithm
Authors:
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Simson Garfinkel,
Micah Heineck,
Christine Heiss,
Robert Johns,
Daniel Kifer,
Philip Leclerc,
Ashwin Machanavajjhala,
Brett Moran,
William Sexton,
Matthew Spence,
Pavel Zhuravlev
Abstract:
The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another ke…
▽ More
The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA are invariants, statistics that the Census Bureau has determined, as matter of policy, to exclude from the privacy-loss accounting. The TDA post-processes the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This paper describes the mathematics and testing of the TDA for this purpose.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
An Uncertainty Principle is a Price of Privacy-Preserving Microdata
Authors:
John Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Simson Garfinkel,
Daniel Kifer,
Philip Leclerc,
William Sexton,
Ashley Simpson,
Christine Task,
Pavel Zhuravlev
Abstract:
Privacy-protected microdata are often the desired output of a differentially private algorithm since microdata is familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience. We show that an uncertainty principle governs the trade-off between accuracy for a population of interest ("sum query") vs. accuracy for its component sub-populations ("poi…
▽ More
Privacy-protected microdata are often the desired output of a differentially private algorithm since microdata is familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience. We show that an uncertainty principle governs the trade-off between accuracy for a population of interest ("sum query") vs. accuracy for its component sub-populations ("point queries"). Compared to differentially private query answering systems that are not required to produce microdata, accuracy can degrade by a logarithmic factor. For example, in the case of pure differential privacy, without the microdata requirement, one can provide noisy answers to the sum query and all point queries while guaranteeing that each answer has squared error $O(1/ε^2)$. With the microdata requirement, one must choose between allowing an additional $\log^2(d)$ factor ($d$ is the number of point queries) for some point queries or allowing an extra $O(d^2)$ factor for the sum query. We present lower bounds for pure, approximate, and concentrated differential privacy. We propose mitigation strategies and create a collection of benchmark datasets that can be used for public study of this problem.
△ Less
Submitted 25 October, 2021;
originally announced October 2021.
-
Randomness Concerns When Deploying Differential Privacy
Authors:
Simson L. Garfinkel,
Philip Leclerc
Abstract:
The U.S. Census Bureau is using differential privacy (DP) to protect confidential respondent data collected for the 2020 Decennial Census of Population & Housing. The Census Bureau's DP system is implemented in the Disclosure Avoidance System (DAS) and requires a source of random numbers. We estimate that the 2020 Census will require roughly 90TB of random bytes to protect the person and household…
▽ More
The U.S. Census Bureau is using differential privacy (DP) to protect confidential respondent data collected for the 2020 Decennial Census of Population & Housing. The Census Bureau's DP system is implemented in the Disclosure Avoidance System (DAS) and requires a source of random numbers. We estimate that the 2020 Census will require roughly 90TB of random bytes to protect the person and household tables. Although there are critical differences between cryptography and DP, they have similar requirements for randomness. We review the history of random number generation on deterministic computers, including von Neumann's "middle-square" method, Mersenne Twister (MT19937) (previously the default NumPy random number generator, which we conclude is unacceptable for use in production privacy-preserving systems), and the Linux /dev/urandom device. We also review hardware random number generator schemes, including the use of so-called "Lava Lamps" and the Intel Secure Key RDRAND instruction. We finally present our plan for generating random bits in the Amazon Web Services (AWS) environment using AES-CTR-DRBG seeded by mixing bits from /dev/urandom and the Intel Secure Key RDSEED instruction, a compromise of our desire to rely on a trusted hardware implementation, the unease of our external reviewers in trusting a hardware-only implementation, and the need to generate so many random bits.
△ Less
Submitted 6 September, 2020;
originally announced September 2020.
-
A File System For Write-Once Media
Authors:
Simson L. Garfinkel,
J. Spencer Love
Abstract:
A file system standard for use with write-once media such as digital compact disks is proposed. The file system is designed to work with any operating system and a variety of physical media. Although the implementation is simple, it provides a a full-featured and high-performance alternative to conventional file systems on traditional, multiple-write media such as magnetic disks.
A file system standard for use with write-once media such as digital compact disks is proposed. The file system is designed to work with any operating system and a variety of physical media. Although the implementation is simple, it provides a a full-featured and high-performance alternative to conventional file systems on traditional, multiple-write media such as magnetic disks.
△ Less
Submitted 30 March, 2020;
originally announced April 2020.
-
Issues Encountered Deploying Differential Privacy
Authors:
Simson L. Garfinkel,
John M. Abowd,
Sarah Powazek
Abstract:
When differential privacy was created more than a decade ago, the motivating example was statistics published by an official statistics agency. In attempting to transition differential privacy from the academy to practice, the U.S. Census Bureau has encountered many challenges unanticipated by differential privacy's creators. These challenges include obtaining qualified personnel and a suitable co…
▽ More
When differential privacy was created more than a decade ago, the motivating example was statistics published by an official statistics agency. In attempting to transition differential privacy from the academy to practice, the U.S. Census Bureau has encountered many challenges unanticipated by differential privacy's creators. These challenges include obtaining qualified personnel and a suitable computing environment, the difficulty accounting for all uses of the confidential data, the lack of release mechanisms that align with the needs of data users, the expectation on the part of data users that they will have access to micro-data, and the difficulty in setting the value of the privacy-loss parameter, $ε$ (epsilon), and the lack of tools and trained individuals to verify the correctness of differential privacy implementations.
△ Less
Submitted 6 September, 2018;
originally announced September 2018.