-
Lower Bounds for Public-Private Learning under Distribution Shift
Authors:
Amrith Setlur,
Pratiksha Thaker,
Jonathan Ullman
Abstract:
The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, ther…
▽ More
The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to setting where the two data sources exhibit significant distribution shift. Our results apply to both Gaussian mean estimation where the two distributions have different means, and to Gaussian linear regression where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.
△ Less
Submitted 23 July, 2025;
originally announced July 2025.
-
PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
Authors:
Steven Kolawole,
Keshav Santhanam,
Virginia Smith,
Pratiksha Thaker
Abstract:
LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query p…
▽ More
LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
△ Less
Submitted 20 October, 2025; v1 submitted 23 June, 2025;
originally announced June 2025.
-
BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap
Authors:
Shengyuan Hu,
Neil Kale,
Pratiksha Thaker,
Yiwei Fu,
Steven Wu,
Virginia Smith
Abstract:
Machine unlearning has the potential to improve the safety of large language models (LLMs) by removing sensitive or harmful information post hoc. A key challenge in unlearning involves balancing between forget quality (effectively unlearning undesirable information) and retain quality (maintaining good performance on other, general tasks). Unfortunately, as we show, current LLM unlearning benchmar…
▽ More
Machine unlearning has the potential to improve the safety of large language models (LLMs) by removing sensitive or harmful information post hoc. A key challenge in unlearning involves balancing between forget quality (effectively unlearning undesirable information) and retain quality (maintaining good performance on other, general tasks). Unfortunately, as we show, current LLM unlearning benchmarks contain highly disparate forget and retain sets -- painting a false picture of the effectiveness of LLM unlearning methods. This can be particularly problematic because it opens the door for benign perturbations, such as relearning attacks, to easily reveal supposedly unlearned knowledge once models are deployed. To address this, we present $\texttt{BLUR}$: a benchmark for LLM unlearning that provides more realistic scenarios of forget-retain overlap. $\texttt{BLUR}$ significantly expands on existing unlearning benchmarks by providing extended evaluation tasks, combined forget/retain queries, and relearning datasets of varying degrees of difficulty. Despite the benign nature of the queries considered, we find that the performance of existing methods drops significantly when evaluated on $\texttt{BLUR}$, with simple approaches performing better on average than more recent methods. These results highlight the importance of robust evaluation and suggest several important directions of future study. Our benchmark is publicly available at: https://huggingface.co/datasets/forgelab/BLUR
△ Less
Submitted 28 May, 2025;
originally announced June 2025.
-
Enhancing One-run Privacy Auditing with Quantile Regression-Based Membership Inference
Authors:
Terrance Liu,
Matteo Boglioni,
Yiwei Fu,
Shengyuan Hu,
Pratiksha Thaker,
Zhiwei Steven Wu
Abstract:
Differential privacy (DP) auditing aims to provide empirical lower bounds on the privacy guarantees of DP mechanisms like DP-SGD. While some existing techniques require many training runs that are prohibitively costly, recent work introduces one-run auditing approaches that effectively audit DP-SGD in white-box settings while still being computationally efficient. However, in the more practical bl…
▽ More
Differential privacy (DP) auditing aims to provide empirical lower bounds on the privacy guarantees of DP mechanisms like DP-SGD. While some existing techniques require many training runs that are prohibitively costly, recent work introduces one-run auditing approaches that effectively audit DP-SGD in white-box settings while still being computationally efficient. However, in the more practical black-box setting where gradients cannot be manipulated during training and only the last model iterate is observed, prior work shows that there is still a large gap between the empirical lower bounds and theoretical upper bounds. Consequently, in this work, we study how incorporating approaches for stronger membership inference attacks (MIA) can improve one-run auditing in the black-box setting. Evaluating on image classification models trained on CIFAR-10 with DP-SGD, we demonstrate that our proposed approach, which utilizes quantile regression for MIA, achieves tighter bounds while crucially maintaining the computational efficiency of one-run methods.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Membership Inference Attacks for Unseen Classes
Authors:
Pratiksha Thaker,
Neil Kale,
Zhiwei Steven Wu,
Virginia Smith
Abstract:
The state-of-the-art for membership inference attacks on machine learning models is a class of attacks based on shadow models that mimic the behavior of the target model on subsets of held-out nonmember data. However, we find that this class of attacks is fundamentally limited because of a key assumption -- that the shadow models can replicate the target model's behavior on the distribution of int…
▽ More
The state-of-the-art for membership inference attacks on machine learning models is a class of attacks based on shadow models that mimic the behavior of the target model on subsets of held-out nonmember data. However, we find that this class of attacks is fundamentally limited because of a key assumption -- that the shadow models can replicate the target model's behavior on the distribution of interest. As a result, we show that attacks relying on shadow models can fail catastrophically on critical AI safety applications where data access is restricted due to legal, ethical, or logistical constraints, so that the shadow models have no reasonable signal on the query examples. Although this problem seems intractable within the shadow model paradigm, we find that quantile regression attacks are a promising approach in this setting, as these models learn features of member examples that can generalize to unseen classes. We demonstrate this both empirically and theoretically, showing that quantile regression attacks achieve up to 11x the TPR of shadow model-based approaches in practice, and providing a theoretical model that outlines the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and provides a cautionary tale for practitioners that aim to directly use existing tools for real-world applications of AI safety.
△ Less
Submitted 25 October, 2025; v1 submitted 6 June, 2025;
originally announced June 2025.
-
Generate-then-Verify: Reconstructing Data from Limited Published Statistics
Authors:
Terrance Liu,
Eileen Xiao,
Adam Smith,
Pratiksha Thaker,
Zhiwei Steven Wu
Abstract:
We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certai…
▽ More
We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certainty. In our work, we instead focus on the regime where many possible datasets match the published statistics, making it impossible to reconstruct the entire private dataset perfectly (i.e., when approaches in prior work fail). We propose the problem of partial data reconstruction, in which the goal of the adversary is to instead output a $\textit{subset}$ of rows and/or columns that are $\textit{guaranteed to be correct}$. We introduce a novel integer programming approach that first $\textbf{generates}$ a set of claims and then $\textbf{verifies}$ whether each claim holds for all possible datasets consistent with the published aggregates. We evaluate our approach on the housing-level microdata from the U.S. Decennial Census release, demonstrating that privacy violations can still persist even when information published about such data is relatively sparse.
△ Less
Submitted 11 June, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Authors:
Pratiksha Thaker,
Shengyuan Hu,
Neil Kale,
Yash Maurya,
Zhiwei Steven Wu,
Virginia Smith
Abstract:
Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading vi…
▽ More
Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.
△ Less
Submitted 8 April, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Authors:
Deepti Raghavan,
Keshav Santhanam,
Muhammad Shahir Rahman,
Nayani Modugula,
Luis Gaspar Schroeder,
Maximilien Cura,
Houjun Liu,
Pratiksha Thaker,
Philip Levis,
Matei Zaharia
Abstract:
Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during interme…
▽ More
Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Bento introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by between 10-30%.
△ Less
Submitted 21 July, 2025; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Guardrail Baselines for Unlearning in LLMs
Authors:
Pratiksha Thaker,
Yash Maurya,
Shengyuan Hu,
Zhiwei Steven Wu,
Virginia Smith
Abstract:
Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable t…
▽ More
Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
△ Less
Submitted 11 June, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
Authors:
Pratiksha Thaker,
Amrith Setlur,
Zhiwei Steven Wu,
Virginia Smith
Abstract:
Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the se…
▽ More
Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67\% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
△ Less
Submitted 8 September, 2025; v1 submitted 24 December, 2023;
originally announced December 2023.
-
On Noisy Evaluation in Federated Hyperparameter Tuning
Authors:
Kevin Kuo,
Pratiksha Thaker,
Mikhail Khodak,
John Nguyen,
Daniel Jiang,
Ameet Talwalkar,
Virginia Smith
Abstract:
Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks. Issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to evaluate the performance of various hyperparameters. In this work, we perform the first systematic study on the eff…
▽ More
Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks. Issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to evaluate the performance of various hyperparameters. In this work, we perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We first identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can significantly impact tuning methods-reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost the evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.
△ Less
Submitted 15 May, 2023; v1 submitted 17 December, 2022;
originally announced December 2022.
-
Maximizing and Satisficing in Multi-armed Bandits with Graph Information
Authors:
Parth K. Thaker,
Mohit Malu,
Nikhil Rao,
Gautam Dasarathy
Abstract:
Pure exploration in multi-armed bandits has emerged as an important framework for modeling decision-making and search under uncertainty. In modern applications, however, one is often faced with a tremendously large number of options. Even obtaining one observation per option may be too costly rendering traditional pure exploration algorithms ineffective. Fortunately, one often has access to simila…
▽ More
Pure exploration in multi-armed bandits has emerged as an important framework for modeling decision-making and search under uncertainty. In modern applications, however, one is often faced with a tremendously large number of options. Even obtaining one observation per option may be too costly rendering traditional pure exploration algorithms ineffective. Fortunately, one often has access to similar relationships amongst the options that can be leveraged. In this paper, we consider the pure exploration problem in stochastic multi-armed bandits where the similarities between the arms are captured by a graph and the rewards may be represented as a smooth signal on this graph. In particular, we consider the problem of finding the arm with the maximum reward (i.e., the maximizing problem) or one with a sufficiently high reward (i.e., the satisficing problem) under this model. We propose novel algorithms \textbf{\algoname{}} (GRaph-based UcB) and $ζ$-\textbf{\algoname{}} for these problems and provide a theoretical characterization of their performance which specifically elicits the benefit of the graph side information. We also prove a lower bound on the data requirement, showing a large class of problems where these algorithms are near-optimal. We complement our theory with experimental results that show the benefit of capitalizing on such side information.
△ Less
Submitted 20 November, 2022; v1 submitted 2 August, 2021;
originally announced August 2021.
-
Differentiable Programming for Hyperspectral Unmixing using a Physics-based Dispersion Model
Authors:
John Janiczek,
Parth Thaker,
Gautam Dasarathy,
Christopher S. Edwards,
Philip Christensen,
Suren Jayasuriya
Abstract:
Hyperspectral unmixing is an important remote sensing task with applications including material identification and analysis. Characteristic spectral features make many pure materials identifiable from their visible-to-infrared spectra, but quantifying their presence within a mixture is a challenging task due to nonlinearities and factors of variation. In this paper, spectral variation is considere…
▽ More
Hyperspectral unmixing is an important remote sensing task with applications including material identification and analysis. Characteristic spectral features make many pure materials identifiable from their visible-to-infrared spectra, but quantifying their presence within a mixture is a challenging task due to nonlinearities and factors of variation. In this paper, spectral variation is considered from a physics-based approach and incorporated into an end-to-end spectral unmixing algorithm via differentiable programming. The dispersion model is introduced to simulate realistic spectral variation, and an efficient method to fit the parameters is presented. Then, this dispersion model is utilized as a generative model within an analysis-by-synthesis spectral unmixing algorithm. Further, a technique for inverse rendering using a convolutional neural network to predict parameters of the generative model is introduced to enhance performance and speed when training data is available. Results achieve state-of-the-art on both infrared and visible-to-near-infrared (VNIR) datasets, and show promise for the synergy between physics-based models and deep learning in hyperspectral unmixing in the future.
△ Less
Submitted 12 July, 2020;
originally announced July 2020.
-
Overlook: Differentially Private Exploratory Visualization for Big Data
Authors:
Pratiksha Thaker,
Mihai Budiu,
Parikshit Gopalan,
Udi Wieder,
Matei Zaharia
Abstract:
Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using synopses are built for offline use cases, where a s…
▽ More
Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using synopses are built for offline use cases, where a set of queries is known ahead of time and the system carefully optimizes a synopsis for it. The synopses that these systems build are costly to compute and may also be costly to store.
We introduce Overlook, a system that enables private data exploration at interactive latencies for both data analysts and data curators. The key idea in Overlook is a virtual synopsis that can be evaluated incrementally, without extra space storage or expensive precomputation. Overlook simply executes queries using an existing engine, such as a SQL DBMS, and adds noise to their results. Because Overlook's synopses do not require costly precomputation or storage, data curators can also use Overlook to explore the impact of privacy parameters interactively. Overlook offers a rich visual query interface based on the open source Hillview system. Overlook achieves accuracy comparable to existing synopsis-based systems, while offering better performance and removing the need for extra storage.
△ Less
Submitted 22 June, 2020;
originally announced June 2020.
-
On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems
Authors:
Parth Thaker,
Gautam Dasarathy,
Angelia Nedić
Abstract:
We consider the problem of recovering a complex vector $\mathbf{x}\in \mathbb{C}^n$ from $m$ quadratic measurements $\{\langle A_i\mathbf{x}, \mathbf{x}\rangle\}_{i=1}^m$. This problem, known as quadratic feasibility, encompasses the well known phase retrieval problem and has applications in a wide range of important areas including power system state estimation and x-ray crystallography. In gener…
▽ More
We consider the problem of recovering a complex vector $\mathbf{x}\in \mathbb{C}^n$ from $m$ quadratic measurements $\{\langle A_i\mathbf{x}, \mathbf{x}\rangle\}_{i=1}^m$. This problem, known as quadratic feasibility, encompasses the well known phase retrieval problem and has applications in a wide range of important areas including power system state estimation and x-ray crystallography. In general, not only is the the quadratic feasibility problem NP-hard to solve, but it may in fact be unidentifiable. In this paper, we establish conditions under which this problem becomes {identifiable}, and further prove isometry properties in the case when the matrices $\{A_i\}_{i=1}^m$ are Hermitian matrices sampled from a complex Gaussian distribution. Moreover, we explore a nonconvex {optimization} formulation of this problem, and establish salient features of the associated optimization landscape that enables gradient algorithms with an arbitrary initialization to converge to a \emph{globally optimal} point with a high probability. Our results also reveal sample complexity requirements for successfully identifying a feasible solution in these contexts.
△ Less
Submitted 14 December, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.