Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs

Authors: Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolò Dalmasso, Natraj Raman, Sumitra Ganesh

Abstract: Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a score function into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP's utility beyond uncertainty quantification to improve accuracy. We propose conformal revision of questions (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM's chances of answering it correctly. Experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with Gemma-2, Llama-3 and Phi-3 models show that CP-OPT significantly reduces set sizes while maintaining coverage, and CROQ improves accuracy over standard inference, especially when paired with CP-OPT scores. Together, CP-OPT and CROQ offer a robust framework for improving both the safety and accuracy of LLM-driven decision-making.

Submitted 31 December, 2024; originally announced January 2025.
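
Illustrative note: the abstract above describes how conformal prediction turns a score function into prediction sets with a coverage guarantee. The sketch below shows generic split conformal prediction for a multiple-choice question, assuming the nonconformity score is one minus the model's probability for an option. It is not CP-OPT or CROQ; the function names `conformal_threshold` and `prediction_set` and the probability values are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Threshold from calibration scores (assumed: score of the TRUE answer per question).

    alpha is the target miscoverage rate, so coverage is at least 1 - alpha
    under exchangeability of calibration and test questions.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(option_probs, threshold):
    """Keep each option whose score (1 - probability) does not exceed the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]


# Hypothetical calibration scores and model probabilities for one question.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.5)
kept = prediction_set({"A": 0.6, "B": 0.3, "C": 0.1}, threshold)
```

In the spirit of the revision step described in the abstract, the options in `kept` would be the ones shown to the model again in a narrowed question; how the paper builds that revised prompt is not shown here.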

arXiv:2410.14029 [pdf, other]

Auditing and Enforcing Conditional Fairness via Optimal Transport

Authors: Mohsen Ghassemi, Alan Mishler, Niccolo Dalmasso, Luhao Zhang, Vamsi K. Potluru, Tucker Balch, Manuela Veloso

Abstract: Conditional demographic parity (CDP) is a measure of the demographic parity of a predictive model or decision process when conditioning on an additional feature or set of features. Many algorithmic fairness techniques exist to target demographic parity, but CDP is much harder to achieve, particularly when the conditioning variable has many levels and/or when the model outputs are continuous. The problem of auditing and enforcing CDP is understudied in the literature. In light of this, we propose novel measures of conditional demographic disparity (CDD) which rely on statistical distances borrowed from the optimal transport literature. We further design and evaluate regularization-based approaches based on these CDD measures. Our methods, \fairbit{} and \fairlp{}, allow us to target CDP even when the conditioning variable has many levels. When model outputs are continuous, our methods target full equality of the conditional distributions, unlike other methods that only consider first moments or related proxy quantities. We validate the efficacy of our approaches on real-world datasets.

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2311.18274 [pdf, other]

Semiparametric Efficient Inference in Adaptive Experiments

Authors: Thomas Cook, Alan Mishler, Aaditya Ramdas

Abstract: We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.

Submitted 4 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: 24 pages, 6 figures. To appear at CLeaR 2024
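
Illustrative note: the abstract above centers on an augmented inverse-probability weighted (AIPW) estimator of the Average Treatment Effect. The sketch below computes the textbook AIPW point estimate from per-subject records, assuming the treatment probability and outcome-model predictions for each subject are already available (in an adaptive experiment these would be computed from earlier data only). It does not reproduce the paper's adaptive estimator, truncation scheme, or confidence sequences; the name `aipw_ate` and the record layout are hypothetical.

```python
def aipw_ate(records):
    """Average of AIPW pseudo-outcomes.

    Each record is (y, a, e, mu1, mu0), an assumed layout:
      y   observed outcome
      a   treatment indicator, 1 for treatment and 0 for control
      e   probability of treatment under the assignment policy, strictly in (0, 1)
      mu1 predicted outcome under treatment
      mu0 predicted outcome under control
    """
    total = 0.0
    n = 0
    for y, a, e, mu1, mu0 in records:
        # Outcome-model contrast plus inverse-probability weighted residual corrections.
        pseudo = (mu1 - mu0) + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)
        total += pseudo
        n += 1
    return total / n


# Hypothetical two-subject example with uninformative outcome models.
estimate = aipw_ate([
    (1.0, 1, 0.5, 0.0, 0.0),
    (0.0, 0, 0.5, 0.0, 0.0),
])
```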

arXiv:2311.00109 [pdf, other]

doi 10.1609/aaai.v38i14.29545

FairWASP: Fast and Optimal Fair Wasserstein Pre-processing

Authors: Zikai Xiong, Niccolò Dalmasso, Alan Mishler, Vamsi K. Potluru, Tucker Balch, Manuela Veloso

Abstract: Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.

Submitted 23 October, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

Comments: AAAI 2024, 15 pages, 4 figures, 1 table

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 16120-16128, 2024

arXiv:2209.09538 [pdf, other]

Counterfactual Mean-variance Optimization

Authors: Kwangho Kim, Alan Mishler, José R. Zubizarreta

Abstract: We study a counterfactual mean-variance optimization, where the mean and variance are defined as functionals of counterfactual distributions. The optimization problem defines the optimal resource allocation under various constraints in a hypothetical scenario induced by a specified intervention, which may differ substantially from the observed world. We propose a doubly robust-style estimator for the optimal solution to the counterfactual mean-variance optimization problem and derive a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator attains fast parametric convergence rates while enabling tractable inference, even when incorporating nonparametric methods. We further address the calibration of the counterfactual covariance estimator to enhance the finite-sample performance of the proposed optimal solution estimators. Finally, we evaluate the proposed methods through simulation studies and demonstrate their applicability in real-world problems involving healthcare policy and financial portfolio construction.

Submitted 12 April, 2025; v1 submitted 20 September, 2022; originally announced September 2022.

arXiv:2206.03256 [pdf, other]

Flexible Group Fairness Metrics for Survival Analysis

Authors: Raphael Sonabend, Florian Pfisterer, Alan Mishler, Moritz Schauer, Lukas Burk, Sumantrak Mukherjee, Sebastian Vollmer

Abstract: Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There has been a wealth of literature on algorithmic fairness in regression and classification; however, there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an event occurring over time. Survival predictions are particularly important in sensitive settings such as when utilising machine learning for diagnosis and prognosis of patients. In this paper we explore how to utilise existing survival metrics to measure bias with group fairness metrics. We explore this in an empirical experiment with 29 survival datasets and 8 measures. We find that measures of discrimination are able to capture bias well whereas there is less clarity with measures of calibration and scoring rules. We suggest further areas for research including prediction-based fairness metrics for distribution predictions.

Submitted 22 July, 2022; v1 submitted 26 May, 2022; originally announced June 2022.

Comments: Accepted in DSHealth 2022 (Workshop on Applied Data Science for Healthcare)

arXiv:2202.05049 [pdf, other]

Fair When Trained, Unfair When Deployed: Observable Fairness Measures are Unstable in Performative Prediction Settings

Authors: Alan Mishler, Niccolò Dalmasso

Abstract: Many popular algorithmic fairness measures depend on the joint distribution of predictions, outcomes, and a sensitive feature like race or gender. These measures are sensitive to distribution shift: a predictor which is trained to satisfy one of these fairness definitions may become unfair if the distribution changes. In performative prediction settings, however, predictors are precisely intended to induce distribution shift. For example, in many applications in criminal justice, healthcare, and consumer finance, the purpose of building a predictor is to reduce the rate of adverse outcomes such as recidivism, hospitalization, or default on a loan. We formalize the effect of such predictors as a type of concept shift (a particular variety of distribution shift) and show both theoretically and via simulated examples how this causes predictors which are fair when they are trained to become unfair when they are deployed. We further show how many of these issues can be avoided by using fairness definitions that depend on counterfactual rather than observable outcomes.

Submitted 10 February, 2022; originally announced February 2022.

Comments: 11 pages, 3 figures. Presented at the workshop on Algorithmic Fairness through the Lens of Causality and Robustness, NeurIPS 2021

arXiv:2109.00173 [pdf, other]

FADE: FAir Double Ensemble Learning for Observable and Counterfactual Outcomes

Authors: Alan Mishler, Edward Kennedy

Abstract: Methods for building fair predictors often involve tradeoffs between fairness and accuracy and between different fairness criteria, but the nature of these tradeoffs varies. Recent work seeks to characterize these tradeoffs in specific problem settings, but these methods often do not accommodate users who wish to improve the fairness of an existing benchmark model without sacrificing accuracy, or vice versa. These results are also typically restricted to observable accuracy and fairness criteria. We develop a flexible framework for fair ensemble learning that allows users to efficiently explore the fairness-accuracy space or to improve the fairness or accuracy of a benchmark model. Our framework can simultaneously target multiple observable or counterfactual fairness criteria, and it enables users to combine a large number of previously trained and newly trained predictors. We provide theoretical guarantees that our estimators converge at fast rates. We apply our method on both simulated and real data, with respect to both observable and counterfactual accuracy and fairness criteria. We show that, surprisingly, multiple unfairness measures can sometimes be minimized simultaneously with little impact on accuracy, relative to unconstrained predictors or existing benchmark models.

Submitted 31 August, 2021; originally announced September 2021.

Comments: 56 pages, 20 figures

arXiv:2104.02237 [pdf, other]

Clustering Students and Inferring Skill Set Profiles with Skill Hierarchies

Authors: Alan Mishler, Rebecca Nugent

Abstract: Cognitive diagnosis models (CDMs) are a popular tool for assessing students' mastery of sets of skills. Given a set of $K$ skills tested on an assessment, students are classified into one of $2^K$ latent skill set profiles that represent whether they have mastered each skill or not. Traditional approaches to estimating these profiles are computationally intensive and become infeasible on large datasets. Instead, proxy skill estimates can be generated from the observed responses and then clustered, and these clusters can be assigned to different profiles. Building on previous work, we consider how to optimally perform this clustering when not all $2^K$ profiles are possible, e.g. because of hierarchical relationships among the skills, and when not all possible profiles are present in the population. We compare hierarchical clustering and several k-means variants, including semisupervised clustering using simulated student responses. The empty k-means algorithm paired with a novel method for generating starting centers yields the best overall performance.

Submitted 5 April, 2021; originally announced April 2021.

Comments: 4 pages, 3 figures. Originally presented at the Doctoral Consortium of the 11th International Conference on Educational Data Mining, July, 2018, Buffalo, NY

arXiv:2104.01921 [pdf, other]

When the Oracle Misleads: Modeling the Consequences of Using Observable Rather than Potential Outcomes in Risk Assessment Instruments

Authors: Alan Mishler, Niccolò Dalmasso

Abstract: Risk Assessment Instruments (RAIs) are widely used to forecast adverse outcomes in domains such as healthcare and criminal justice. RAIs are commonly trained on observational data and are optimized to predict observable outcomes rather than potential outcomes, which are the outcomes that would occur absent a particular intervention. Examples of relevant potential outcomes include whether a patient's condition would worsen without treatment or whether a defendant would recidivate if released pretrial. We illustrate how RAIs which are trained to predict observable outcomes can lead to worse decision making, causing precisely the types of harm they are intended to prevent. This can occur even when the predictors are Bayes-optimal and there is no unmeasured confounding.

Submitted 5 April, 2021; originally announced April 2021.

Comments: 6 pages, 3 figures. Presented at the workshop "'Do the right thing': machine learning and causal inference for improved decision making," NeurIPS 2019

arXiv:2103.15281 [pdf, ps, other]

Comment on "Statistical Modeling: The Two Cultures" by Leo Breiman

Authors: Matteo Bonvini, Alan Mishler, Edward H. Kennedy

Abstract: Motivated by Breiman's rousing 2001 paper on the "two cultures" in statistics, we consider the role that different modeling approaches play in causal inference. We discuss the relationship between model complexity and causal (mis)interpretation, the relative merits of plug-in versus targeted estimation, issues that arise in tuning flexible estimators of causal effects, and some outstanding cultural divisions in causal inference.

Submitted 28 March, 2021; originally announced March 2021.

arXiv:2009.02841 [pdf, other]

doi 10.1145/3442188.3445902

Fairness in Risk Assessment Instruments: Post-Processing to Achieve Counterfactual Equalized Odds

Authors: Alan Mishler, Edward H. Kennedy, Alexandra Chouldechova

Abstract: In domains such as criminal justice, medicine, and social welfare, decision makers increasingly have access to algorithmic Risk Assessment Instruments (RAIs). RAIs estimate the risk of an adverse outcome such as recidivism or child neglect, potentially informing high-stakes decisions such as whether to release a defendant on bail or initiate a child welfare investigation. It is important to ensure that RAIs are fair, so that the benefits and harms of such decisions are equitably distributed. The most widely used algorithmic fairness criteria are formulated with respect to observable outcomes, such as whether a person actually recidivates, but these criteria are misleading when applied to RAIs. Since RAIs are intended to inform interventions that can reduce risk, the prediction itself affects the downstream outcome. Recent work has argued that fairness criteria for RAIs should instead utilize potential outcomes, i.e. the outcomes that would occur in the absence of an appropriate intervention. However, no methods currently exist to satisfy such fairness criteria. In this paper, we target one such criterion, counterfactual equalized odds. We develop a post-processed predictor that is estimated via doubly robust estimators, extending and adapting previous post-processing approaches to the counterfactual setting. We also provide doubly robust estimators of the risk and fairness properties of arbitrary fixed post-processed predictors. Our predictor converges to an optimal fair predictor at fast rates. We illustrate properties of our method and show that it performs well on both simulated and real data.

Submitted 6 August, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

Comments: 19 pages, 7 figures

Journal ref: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Pages 386-400

arXiv:1909.00066 [pdf, other]

Counterfactual Risk Assessments, Evaluation, and Fairness

Authors: Amanda Coston, Alan Mishler, Edward H. Kennedy, Alexandra Chouldechova

Abstract: Algorithmic risk assessments are increasingly used to help humans make decisions in high-stakes settings, such as medicine, criminal justice and education. In each of these cases, the purpose of the risk assessment tool is to inform actions, such as medical treatments or release conditions, often with the aim of reducing the likelihood of an adverse event such as hospital readmission or recidivism. Problematically, most tools are trained and evaluated on historical data in which the outcomes observed depend on the historical decision-making policy. These tools thus reflect risk under the historical policy, rather than under the different decision options that the tool is intended to inform. Even when tools are constructed to predict risk under a specific decision, they are often improperly evaluated as predictors of the target outcome. Focusing on the evaluation task, in this paper we define counterfactual analogues of common predictive performance and algorithmic fairness metrics that we argue are better suited for the decision-making context. We introduce a new method for estimating the proposed metrics using doubly robust estimation. We provide theoretical results that show that only under strong conditions can fairness according to the standard metric and the counterfactual metric simultaneously hold. Consequently, fairness-promoting methods that target parity in a standard fairness metric may (and, as we show empirically, do) induce greater imbalance in the counterfactual analogue. We provide empirical comparisons on both synthetic data and a real world child welfare dataset to demonstrate how the proposed method improves upon standard practice.

Submitted 10 January, 2020; v1 submitted 30 August, 2019; originally announced September 2019.

Comments: To appear in ACM FAT* 2020

arXiv:1711.07137 [pdf, other]

Challenges in Obtaining Valid Causal Effect Estimates with Machine Learning Algorithms

Authors: Ashley I Naimi, Alan E Mishler, Edward H Kennedy

Abstract: Unlike parametric regression, machine learning (ML) methods do not generally require precise knowledge of the true data generating mechanisms. As such, numerous authors have advocated for ML methods to estimate causal effects. Unfortunately, ML algorithms can perform worse than parametric regression. We demonstrate the performance of ML-based single- and double-robust estimators. We use 100 Monte Carlo samples with sample sizes of 200, 1200, and 5000 to investigate bias and confidence interval coverage under several scenarios. In a simple confounding scenario, confounders were related to the treatment and the outcome via parametric models. In a complex confounding scenario, the simple confounders were transformed to induce complicated nonlinear relationships. In the simple scenario, when ML algorithms were used, double-robust estimators were superior to single-robust estimators. In the complex scenario, single-robust estimators with ML algorithms were at least as biased as estimators using misspecified parametric models. Double-robust estimators were less biased, but coverage was well below nominal. The use of sample splitting, inclusion of confounder interactions, reliance on a richly specified ML algorithm, and use of doubly robust estimators was the only explored approach that yielded negligible bias and nominal coverage. Our results suggest that ML based singly robust methods should be avoided.

Submitted 14 May, 2020; v1 submitted 19 November, 2017; originally announced November 2017.

Comments: 21 pages, 2 figures, 1 table

arXiv:1702.06216 [pdf, other]

doi 10.1109/ICSC.2017.75

Filtering Tweets for Social Unrest

Authors: Alan Mishler, Kevin Wonus, Wendy Chambers, Michael Bloodgood

Abstract: Since the events of the Arab Spring, there has been increased interest in using social media to anticipate social unrest. While efforts have been made toward automated unrest prediction, we focus on filtering the vast volume of tweets to identify tweets relevant to unrest, which can be provided to downstream users for further analysis. We train a supervised classifier that is able to label Arabic language tweets as relevant to unrest with high reliability. We examine the relationship between training data size and performance and investigate ways to optimize the model building process while minimizing cost. We also explore how confidence thresholds can be set to achieve desired levels of performance.

Submitted 1 April, 2017; v1 submitted 20 February, 2017; originally announced February 2017.

Comments: 7 pages, 8 figures, 3 tables; published in Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, pages 17-23, January 2017

ACM Class: H.3.3; I.2.6; I.2.7; I.5.4

Journal ref: In Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), pages 17-23, San Diego, CA, USA, January 2017. IEEE

Showing 1–15 of 15 results for author: Mishler, A