Search | arXiv e-print repository

Property Elicitation on Imprecise Probabilities

Abstract: Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with… ▽ More Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with $Γ$-maximin risk minimization over an IP. We provide necessary conditions for elicitability of a IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs -- the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution. △ Less

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2505.07367 [pdf, ps, other]

Generalization Bounds and Stopping Rules for Learning with Self-Selected Data

Authors: Julian Rodemann, James Bailie

Abstract: Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, or boosting. Rodemann et al. (2024) unify them under the framework of "reciprocal learning". In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove unive… ▽ More Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, or boosting. Rodemann et al. (2024) unify them under the framework of "reciprocal learning". In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove universal generalization bounds for reciprocal learning using covering numbers and Wasserstein ambiguity sets. Our results require no assumptions on the distribution of self-selected data, only verifiable conditions on the algorithms. We prove results for both convergent and finite iteration solutions. The latter are anytime valid, thereby giving rise to stopping rules for a practitioner seeking to guarantee the out-of-sample performance of their reciprocal learning algorithm. Finally, we illustrate our bounds and stopping rules for reciprocal learning's special case of semi-supervised learning. △ Less

Submitted 12 May, 2025; originally announced May 2025.

Comments: 38 pages, 4 figures

arXiv:2504.15246 [pdf, other]

A Refreshment Stirred, Not Shaken (III): Can Swapping Be Differentially Private?

Authors: James Bailie, Ruobin Gong, Xiao-Li Meng

Abstract: The quest for a precise and contextually grounded answer to the question in the present paper's title resulted in this stirred-not-shaken triptych, a phrase that reflects our desire to deepen the theoretical basis, broaden the practical applicability, and reduce the misperception of differential privacy (DP)$\unicode{x2014}$all without shaking its core foundations. Indeed, given the existence of m… ▽ More The quest for a precise and contextually grounded answer to the question in the present paper's title resulted in this stirred-not-shaken triptych, a phrase that reflects our desire to deepen the theoretical basis, broaden the practical applicability, and reduce the misperception of differential privacy (DP)$\unicode{x2014}$all without shaking its core foundations. Indeed, given the existence of more than 200 formulations of DP (and counting), before even attempting to answer the titular question one must first precisely specify what it actually means to be DP. Motivated by this observation, a theoretical investigation into DP's fundamental essence resulted in Part I of this trio, which introduces a five-building-block system explicating the who, where, what, how and how much aspects of DP. Instantiating this system in the context of the United States Decennial Census, Part II then demonstrates the broader applicability and relevance of DP by comparing a swapping strategy like that used in 2010 with the TopDown Algorithm$\unicode{x2014}$a DP method adopted in the 2020 Census. This paper provides nontechnical summaries of the preceding two parts as well as new discussion$\unicode{x2014}$for example, on how greater awareness of the five building blocks can thwart privacy theatrics; how our results bridging traditional SDC and DP allow a data custodian to reap the benefits of both these fields; how invariants impact disclosure risk; and how removing the implicit reliance on aleatoric uncertainty could lead to new generalizations of DP. △ Less

Submitted 21 April, 2025; originally announced April 2025.

Comments: 27 pages, 1 figure

arXiv:2501.08449 [pdf, other]

A Refreshment Stirred, Not Shaken (II): Invariant-Preserving Deployments of Differential Privacy for the US Decennial Census

Authors: James Bailie, Ruobin Gong, Xiao-Li Meng

Abstract: Through the lens of the system of differential privacy specifications developed in Part I of a trio of articles, this second paper examines two statistical disclosure control (SDC) methods for the United States Decennial Census: the Permutation Swapping Algorithm (PSA), which is similar to the 2010 Census's disclosure avoidance system (DAS), and the TopDown Algorithm (TDA), which was used in the 2… ▽ More Through the lens of the system of differential privacy specifications developed in Part I of a trio of articles, this second paper examines two statistical disclosure control (SDC) methods for the United States Decennial Census: the Permutation Swapping Algorithm (PSA), which is similar to the 2010 Census's disclosure avoidance system (DAS), and the TopDown Algorithm (TDA), which was used in the 2020 DAS. To varying degrees, both methods leave unaltered some statistics of the confidential data $\unicode{x2013}$ which are called the method's invariants $\unicode{x2013}$ and hence neither can be readily reconciled with differential privacy (DP), at least as it was originally conceived. Nevertheless, we establish that the PSA satisfies $\varepsilon$-DP subject to the invariants it necessarily induces, thereby showing that this traditional SDC method can in fact still be understood within our more-general system of DP specifications. By a similar modification to $ρ$-zero concentrated DP, we also provide a DP specification for the TDA. Finally, as a point of comparison, we consider the counterfactual scenario in which the PSA was adopted for the 2020 Census, resulting in a reduction in the nominal privacy loss, but at the cost of releasing many more invariants. Therefore, while our results explicate the mathematical guarantees of SDC provided by the PSA, the TDA and the 2020 DAS in general, care must be taken in their translation to actual privacy protection $\unicode{x2013}$ just as is the case for any DP deployment. △ Less

Submitted 14 January, 2025; originally announced January 2025.

Comments: 48 pages, 2 figures

arXiv:2411.02935 [pdf, other]

Mapping Africa Settlements: High Resolution Urban and Rural Map by Deep Learning and Satellite Imagery

Authors: Mohammad Kakooei, James Bailie, Albin Söderberg, Albin Becevic, Adel Daoud

Abstract: Accurate Land Use and Land Cover (LULC) maps are essential for understanding the drivers of sustainable development, in terms of its complex interrelationships between human activities and natural resources. However, existing LULC maps often lack precise urban and rural classifications, particularly in diverse regions like Africa. This study presents a novel construction of a high-resolution rural… ▽ More Accurate Land Use and Land Cover (LULC) maps are essential for understanding the drivers of sustainable development, in terms of its complex interrelationships between human activities and natural resources. However, existing LULC maps often lack precise urban and rural classifications, particularly in diverse regions like Africa. This study presents a novel construction of a high-resolution rural-urban map using deep learning techniques and satellite imagery. We developed a deep learning model based on the DeepLabV3 architecture, which was trained on satellite imagery from Landsat-8 and the ESRI LULC dataset, augmented with human settlement data from the GHS-SMOD. The model utilizes semantic segmentation to classify land into detailed categories, including urban and rural areas, at a 10-meter resolution. Our findings demonstrate that incorporating LULC along with urban and rural classifications significantly enhances the model's ability to accurately distinguish between urban, rural, and non-human settlement areas. Therefore, our maps can support more informed decision-making for policymakers, researchers, and stakeholders. We release a continent wide urban-rural map, covering the period 2016 and 2022. △ Less

Submitted 5 November, 2024; originally announced November 2024.

arXiv:2408.07006 [pdf, other]

The Complexities of Differential Privacy for Survey Data

Authors: Jörg Drechsler, James Bailie

Abstract: The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoin… ▽ More The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies, and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: 18 pages, 2 figures

Showing 1–6 of 6 results for author: Bailie, J