-
Bayesian Pseudo Posterior Mechanism for Differentially Private Machine Learning
Authors:
Robert Chew,
Matthew R. Williams,
Elan A. Segarra,
Alexander J. Preiss,
Amanda Konet,
Terrance D. Savitsky
Abstract:
Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real world distributions, including highly imbalanced or small labeled training sets. In this work, we propose a new scalable DP mechanism for deep learning models, SWAG-PPM, by using a pseudo posterior distribution that downweights by-record likelihood contributions proportionally to their disclosure risks as the randomized mechanism. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.
Submitted 27 March, 2025;
originally announced March 2025.
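As a rough illustration of the "pseudo posterior" idea named in the abstract above, a record-weighted posterior of this kind is commonly written as below. The notation is an assumption, not taken from the listing: \theta denotes model parameters, x_i the i-th record, \pi(\theta) the prior, and \alpha_i a per-record weight that shrinks toward 0 as the record's disclosure risk grows. How SWAG-PPM computes the weights and approximates this distribution is not stated here.

```latex
\[
\pi^{\alpha}(\theta \mid x_1, \dots, x_n)
  \;\propto\;
  \Bigl[\, \prod_{i=1}^{n} p(x_i \mid \theta)^{\alpha_i} \Bigr] \, \pi(\theta),
  \qquad \alpha_i \in [0, 1].
\]
```

Setting every \alpha_i = 1 recovers the ordinary (non-private) posterior, while \alpha_i = 0 removes record i's contribution entirely.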
-
BILBO: BILevel Bayesian Optimization
Authors:
Ruth Wan Theng Chew,
Quoc Phong Nguyen,
Bryan Kian Hsiang Low
Abstract:
Bilevel optimization is characterized by a two-level optimization structure, where the upper-level problem is constrained by optimal lower-level solutions, and such structures are prevalent in real-world problems. The constraint by optimal lower-level solutions poses significant challenges, especially in noisy, constrained, and derivative-free settings, as repeating lower-level optimizations is sample inefficient and predicted lower-level solutions may be suboptimal. We present BILevel Bayesian Optimization (BILBO), a novel Bayesian optimization algorithm for general bilevel problems with blackbox functions, which optimizes both upper- and lower-level problems simultaneously, without the repeated lower-level optimization required by existing methods. BILBO samples from confidence-bounds based trusted sets, which bounds the suboptimality on the lower level. Moreover, BILBO selects only one function query per iteration, where the function query selection strategy incorporates the uncertainty of estimated lower-level solutions and includes a conditional reassignment of the query to encourage exploration of the lower-level objective. The performance of BILBO is theoretically guaranteed with a sublinear regret bound for commonly used kernels and is empirically evaluated on several synthetic and real-world problems.
Submitted 28 May, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR)
Authors:
Stephanie Eckman,
Bolei Ma,
Christoph Kern,
Rob Chew,
Barbara Plank,
Frauke Kreuter
Abstract:
Models trained on crowdsourced labels may not reflect broader population views, because those who work as annotators do not represent the population. We propose Population-Aligned Instance Replication (PAIR), a method to address bias caused by non-representative annotator pools. Using a simulation study of offensive language and hate speech, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. We observe that models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. By duplicating labels from underrepresented annotator groups to match population proportions, PAIR reduces bias without collecting additional annotations. These results suggest that statistical techniques from survey research can improve model performance. We conclude with practical recommendations for improving the representativity of training data and model performance.
Submitted 7 March, 2025; v1 submitted 12 January, 2025;
originally announced January 2025.
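The replication step described in the PAIR abstract above (duplicating labels from underrepresented annotator groups to match population proportions) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pair_replicate`, the record schema (`item`, `label`, `group`), and the choice to round replication factors to whole copies are all hypothetical.

```python
from collections import Counter


def pair_replicate(annotations, population_shares):
    """Duplicate annotations so group shares approximate population shares.

    annotations: list of dicts with keys "item", "label", "group"
        (hypothetical schema chosen for this sketch).
    population_shares: dict mapping each group to its target share.
    """
    counts = Counter(a["group"] for a in annotations)
    total = sum(counts.values())

    # Target share divided by observed share; > 1 means underrepresented.
    ratios = {g: population_shares[g] * total / counts[g] for g in counts}

    # Rescale so the most overrepresented group keeps exactly one copy,
    # then round to a whole number of copies (an assumption of this sketch).
    base = min(ratios.values())
    copies = {g: max(1, round(r / base)) for g, r in ratios.items()}

    # Emit each annotation as many times as its group's copy count.
    return [a for a in annotations for _ in range(copies[a["group"]])]


if __name__ == "__main__":
    # Toy pool: 8 annotations from group "A", 2 from group "B".
    pool = [{"item": i, "label": 1, "group": "A"} for i in range(8)]
    pool += [{"item": i, "label": 0, "group": "B"} for i in range(8, 10)]
    balanced = pair_replicate(pool, {"A": 0.5, "B": 0.5})
    print(Counter(a["group"] for a in balanced))
```

No new annotations are collected; the existing labels from the smaller group are simply repeated. Rounding means the match to the target shares is approximate whenever the ratios are not whole numbers.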
-
Annotation Sensitivity: Training Data Collection Methods Affect Model Performance
Authors:
Christoph Kern,
Stephanie Eckman,
Jacob Beck,
Rob Chew,
Bolei Ma,
Frauke Kreuter
Abstract:
When training data are collected from human annotators, the design of the annotation instrument, the instructions given to annotators, the characteristics of the annotators, and their interactions can impact training data. This study demonstrates that design choices made when creating an annotation instrument also impact the models trained on the resulting annotations. We introduce the term annotation sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions. We collect annotations of hate speech and offensive language in five experimental conditions of an annotation instrument, randomly assigning annotators to conditions. We then fine-tune BERT models on each of the five resulting datasets and evaluate model performance on a holdout portion of each condition. We find considerable differences between the conditions for 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves. Our results emphasize the crucial role played by the annotation instrument which has received little attention in the machine learning literature. We call for additional research into how and why the instrument impacts the annotations to inform the development of best practices in instrument design.
Submitted 22 January, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding
Authors:
Robert Chew,
John Bollenbacher,
Michael Wenger,
Jessica Speer,
Annice Kim
Abstract:
Deductive coding is a widely used qualitative research method for determining the prevalence of themes across documents. While useful, deductive coding is often burdensome and time consuming since it requires researchers to read, interpret, and reliably categorize a large body of unstructured text documents. Large language models (LLMs), like ChatGPT, are a class of quickly evolving AI tools that can perform a range of natural language processing and reasoning tasks. In this study, we explore the use of LLMs to reduce the time it takes for deductive coding while retaining the flexibility of a traditional content analysis. We outline the proposed approach, called LLM-assisted content analysis (LACA), along with an in-depth case study using GPT-3.5 for LACA on a publicly available deductive coding data set. Additionally, we conduct an empirical benchmark using LACA on 4 publicly available data sets to assess the broader question of how well GPT-3.5 performs across a range of deductive coding tasks. Overall, we find that GPT-3.5 can often perform deductive coding at levels of agreement comparable to human coders. Additionally, we demonstrate that LACA can help refine prompts for deductive coding, identify codes for which an LLM is randomly guessing, and help assess when to use LLMs vs. human coders for deductive coding. We conclude with several implications for future practice of deductive coding and related research methods.
Submitted 23 June, 2023;
originally announced June 2023.
-
SMART: An Open Source Data Labeling Platform for Supervised Learning
Authors:
Rob Chew,
Michael Wenger,
Caroline Kery,
Jason Nance,
Keith Richards,
Emily Hadley,
Peter Baumgartner
Abstract:
SMART is an open source web application designed to help data scientists and research teams efficiently build labeled training data sets for supervised machine learning tasks. SMART provides users with an intuitive interface for creating labeled data sets, supports active learning to help reduce the required amount of labeled data, and incorporates inter-rater reliability statistics to provide insight into label quality. SMART is designed to be platform agnostic and easily deployable to meet the needs of as many different research teams as possible. The project website contains links to the code repository and extensive user documentation.
Submitted 11 December, 2018;
originally announced December 2018.