-
Assumption-free stability for ranking problems
Authors:
Ruiting Liang,
Jake A. Soloff,
Rina Foygel Barber,
Rebecca Willett
Abstract:
In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-$k$ items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often unstable, in the sense that estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to provide a score for…
▽ More
In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-$k$ items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often unstable, in the sense that estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to provide a score for each item (say, by aggregating preference data over a sample of users), then for two items with similar scores, small fluctuations in the data can alter the relative ranking of those items. Many existing theoretical results for ranking problems assume a separation condition to avoid this challenge, but real-world data often contains items whose scores are approximately tied, limiting the applicability of existing theory. To address this gap, we develop a new algorithmic stability framework for ranking problems, and propose two novel ranking operators for achieving stable ranking: the \emph{inflated top-$k$} for the top-$k$ selection problem and the \emph{inflated full ranking} for ranking the full list. To enable stability, each method allows for expressing some uncertainty in the output. For both of these two problems, our proposed methods provide guaranteed stability, with no assumptions on data distributions and no dependence on the total number of candidates to be ranked. Experiments on real-world data confirm that the proposed methods offer stability without compromising the informativeness of the output.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Can a calibration metric be both testable and actionable?
Authors:
Raphael Rossellini,
Jake A. Soloff,
Rina Foygel Barber,
Zhimei Ren,
Rebecca Willett
Abstract:
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration$\unicode{x2014}$ensuring forecasted probabilities match empirical frequencies$\unicode{x2014}$is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many prac…
▽ More
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration$\unicode{x2014}$ensuring forecasted probabilities match empirical frequencies$\unicode{x2014}$is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable but is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. We introduce Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable and examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Testing conditional independence under isotonicity
Authors:
Rohan Hore,
Jake A. Soloff,
Rina Foygel Barber,
Richard J. Samworth
Abstract:
We propose a test of the conditional independence of random variables $X$ and $Y$ given $Z$ under the additional assumption that $X$ is stochastically increasing in $Z$. The well-documented hardness of testing conditional independence means that some further restriction on the null hypothesis parameter space is required, but in contrast to existing approaches based on parametric models, smoothness…
▽ More
We propose a test of the conditional independence of random variables $X$ and $Y$ given $Z$ under the additional assumption that $X$ is stochastically increasing in $Z$. The well-documented hardness of testing conditional independence means that some further restriction on the null hypothesis parameter space is required, but in contrast to existing approaches based on parametric models, smoothness assumptions, or approximations to the conditional distribution of $X$ given $Z$ and/or $Y$ given $Z$, our test requires only the stochastic monotonicity assumption. Our procedure, called PairSwap-ICI, determines the significance of a statistic by randomly swapping the $X$ values within ordered pairs of $Z$ values. The matched pairs and the test statistic may depend on both $Y$ and $Z$, providing the analyst with significant flexibility in constructing a powerful test. Our test offers finite-sample Type I error control, and provably achieves high power against a large class of alternatives that are not too close to the null. We validate our theoretical findings through a series of simulations and real data experiments.
△ Less
Submitted 10 January, 2025;
originally announced January 2025.
-
Cross-Validation with Antithetic Gaussian Randomization
Authors:
Sifan Liu,
Snigdha Panigrahi,
Jake A. Soloff
Abstract:
We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. The method is well-suited for problems where sample splitting is infeasible, such as when data violate the assumption of independent and identical distribution. Even when sample splitting is possible, our method offers a computationally efficient alternative for estimating the prediction error, ach…
▽ More
We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. The method is well-suited for problems where sample splitting is infeasible, such as when data violate the assumption of independent and identical distribution. Even when sample splitting is possible, our method offers a computationally efficient alternative for estimating the prediction error, achieving comparable or even lower error than standard cross-validation in a few train-test repetitions. Drawing inspiration from recent techniques like data-fission and data-thinning, our method constructs train-test data pairs using externally generated Gaussian randomization variables. The key innovation lies in a carefully designed correlation structure among the randomization variables, which we refer to as antithetic Gaussian randomization. In theory, we show that this correlation is crucial in ensuring that the variance of our estimator remains bounded while allowing the bias to vanish. Through simulations on various data types and loss functions, we highlight the advantages of our antithetic Gaussian randomization scheme over both independent randomization and standard cross-validation, where the bias-variance tradeoff depends heavily on the number of folds.
△ Less
Submitted 30 January, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Stabilizing black-box model selection with the inflated argmax
Authors:
Melissa Adrian,
Jake A. Soloff,
Rebecca Willett
Abstract:
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is remo…
▽ More
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an ''inflated'' argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, and (c) a graph subset selection problem using cell-signaling data from proteomics. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
△ Less
Submitted 31 January, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Building a stable classifier with the inflated argmax
Authors:
Jake A. Soloff,
Rina Foygel Barber,
Rebecca Willett
Abstract:
We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently un…
▽ More
We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
△ Less
Submitted 25 April, 2025; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Incentive-Theoretic Bayesian Inference for Collaborative Science
Authors:
Stephen Bates,
Michael I. Jordan,
Michael Sklar,
Jake A. Soloff
Abstract:
Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing whe…
▽ More
Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior -- their choice to run a trial or not. In particular, we show how the principal can design a policy to elucidate partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful.
△ Less
Submitted 8 February, 2024; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Bagging Provides Assumption-free Stability
Authors:
Jake A. Soloff,
Rina Foygel Barber,
Rebecca Willett
Abstract:
Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constan…
▽ More
Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant. Empirical results validate our findings, showing that bagging successfully stabilizes even highly unstable base algorithms.
△ Less
Submitted 25 April, 2024; v1 submitted 29 January, 2023;
originally announced January 2023.
-
The edge of discovery: Controlling the local false discovery rate at the margin
Authors:
Jake A. Soloff,
Daniel Xiang,
William Fithian
Abstract:
Despite the popularity of the false discovery rate (FDR) as an error control metric for large-scale multiple testing, its close Bayesian counterpart the local false discovery rate (lfdr), defined as the posterior probability that a particular null hypothesis is false, is a more directly relevant standard for justifying and interpreting individual rejections. However, the lfdr is difficult to work…
▽ More
Despite the popularity of the false discovery rate (FDR) as an error control metric for large-scale multiple testing, its close Bayesian counterpart the local false discovery rate (lfdr), defined as the posterior probability that a particular null hypothesis is false, is a more directly relevant standard for justifying and interpreting individual rejections. However, the lfdr is difficult to work with in small samples, as the prior distribution is typically unknown. We propose a simple multiple testing procedure and prove that it controls the expectation of the maximum lfdr across all rejections; equivalently, it controls the probability that the rejection with the largest p-value is a false discovery. Our method operates without knowledge of the prior, assuming only that the p-value density is uniform under the null and decreasing under the alternative. We also show that our method asymptotically implements the oracle Bayes procedure for a weighted classification risk, optimally trading off between false positives and false negatives. We derive the limiting distribution of the attained maximum lfdr over the rejections, and the limiting empirical Bayes regret relative to the oracle procedure.
△ Less
Submitted 21 September, 2023; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Principal-Agent Hypothesis Testing
Authors:
Stephen Bates,
Michael I. Jordan,
Michael Sklar,
Jake A. Soloff
Abstract:
Consider the relationship between a regulator (the principal) and an experimenter (the agent) such as a pharmaceutical company. The pharmaceutical company wishes to sell a drug for profit, whereas the regulator wishes to allow only efficacious drugs to be marketed. The efficacy of the drug is not known to the regulator, so the pharmaceutical company must run a costly trial to prove efficacy to the…
▽ More
Consider the relationship between a regulator (the principal) and an experimenter (the agent) such as a pharmaceutical company. The pharmaceutical company wishes to sell a drug for profit, whereas the regulator wishes to allow only efficacious drugs to be marketed. The efficacy of the drug is not known to the regulator, so the pharmaceutical company must run a costly trial to prove efficacy to the regulator. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested agent; a lower standard of statistical evidence incentivizes the agent to run more trials that are less likely to be effective. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial for understanding this system and designing protocols with high social utility. In this work, we discuss how the regulator can set up a protocol with payoffs based on statistical evidence. We show how to design protocols that are robust to an agent's strategic actions, and derive the optimal protocol in the presence of strategic entrants.
△ Less
Submitted 15 April, 2024; v1 submitted 13 May, 2022;
originally announced May 2022.