-
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation
Authors:
Michal Lukasik,
Lin Chen,
Harikrishna Narasimhan,
Aditya Krishna Menon,
Wittawat Jitkrittum,
Felix X. Yu,
Sashank J. Reddi,
Gang Fu,
Mohammadhossein Bateni,
Sanjiv Kumar
Abstract:
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal Area Under the ROC Curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem -- loss aggregation and label aggregation -- by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
Submitted 9 June, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
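
For readers less familiar with the metric named in this abstract, AUC can be read as the fraction of (positive, negative) instance pairs that a scorer orders correctly, with ties counted as half. The sketch below is only an illustration of that definition evaluated against two hypothetical annotators' labels; the function name `pairwise_auc` and the toy data are assumptions, and it does not implement the paper's loss or label aggregation schemes.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # pair ranked correctly
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(pos) * len(neg))


# One ranking, scored against two (hypothetical) annotators' binary labels.
scores = [0.9, 0.7, 0.4, 0.2]
labels_a = [1, 1, 0, 0]
labels_b = [1, 0, 1, 0]
auc_a = pairwise_auc(scores, labels_a)
auc_b = pairwise_auc(scores, labels_b)
```

The same ranking can score differently against each label, which is the tension the abstract's two aggregation approaches try to resolve.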
-
Universal Model Routing for Efficient LLM Inference
Authors:
Wittawat Jitkrittum,
Harikrishna Narasimhan,
Ankit Singh Rawat,
Jeevesh Juneja,
Zifeng Wang,
Chen-Yu Lee,
Pradeep Shenoy,
Rina Panigrahy,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Large language models' significant advances in capabilities are accompanied by significant increases in inference costs. Model routing is a simple technique for reducing inference cost, wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective strategies, relying on cluster-based routing and a learned cluster map respectively. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors. Experiments on a range of public benchmarks show the effectiveness of the proposed strategies in routing amongst more than 30 unseen LLMs.
Submitted 12 February, 2025;
originally announced February 2025.
-
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
Authors:
Ankit Singh Rawat,
Veeranjaneyulu Sadhanala,
Afshin Rostamizadeh,
Ayan Chakrabarti,
Wittawat Jitkrittum,
Vladimir Feinberg,
Seungyeon Kim,
Hrayr Harutyunyan,
Nikunj Saunshi,
Zachary Nado,
Rakesh Shivanna,
Sashank J. Reddi,
Aditya Krishna Menon,
Rohan Anil,
Sanjiv Kumar
Abstract:
A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
Submitted 24 October, 2024;
originally announced October 2024.
-
Personalized and uncertainty-aware coronary hemodynamics simulations: From Bayesian estimation to improved multi-fidelity uncertainty quantification
Authors:
Karthik Menon,
Andrea Zanoni,
Owais Khan,
Gianluca Geraci,
Koen Nieman,
Daniele E. Schiavazzi,
Alison L. Marsden
Abstract:
Simulations of coronary hemodynamics have improved non-invasive clinical risk stratification and treatment outcomes for coronary artery disease, compared to relying on anatomical imaging alone. However, simulations typically use empirical approaches to distribute total coronary flow amongst the arteries in the coronary tree. This ignores patient variability, the presence of disease, and other clinical factors. Further, uncertainty in the clinical data often remains unaccounted for in the modeling pipeline. We present an end-to-end uncertainty-aware pipeline to (1) personalize coronary flow simulations by incorporating branch-specific coronary flows as well as cardiac function; and (2) predict clinical and biomechanical quantities of interest with improved precision, while accounting for uncertainty in the clinical data. We assimilate patient-specific measurements of myocardial blood flow from CT myocardial perfusion imaging to estimate branch-specific coronary flows. We use adaptive Markov Chain Monte Carlo sampling to estimate the joint posterior distributions of model parameters with simulated noise in the clinical data. Additionally, we determine the posterior predictive distribution for relevant quantities of interest using a new approach combining multi-fidelity Monte Carlo estimation with non-linear, data-driven dimensionality reduction. Our framework recapitulates clinically measured cardiac function as well as branch-specific coronary flows under measurement uncertainty. We substantially shrink the confidence intervals for estimated quantities of interest compared to single-fidelity and state-of-the-art multi-fidelity Monte Carlo methods. This is especially true for quantities that showed limited correlation between the low- and high-fidelity model predictions. Moreover, the proposed estimators are significantly cheaper to compute for a specified confidence level or variance.
Submitted 3 September, 2024;
originally announced September 2024.
-
Efficient Document Ranking with Learnable Late Interactions
Authors:
Ziwei Ji,
Himanshu Jain,
Andreas Veit,
Sashank J. Reddi,
Sadeep Jayasumana,
Ankit Singh Rawat,
Aditya Krishna Menon,
Felix Yu,
Sanjiv Kumar
Abstract:
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
Submitted 25 June, 2024;
originally announced June 2024.
-
Cascade-Aware Training of Language Models
Authors:
Congchao Wang,
Sean Augenstein,
Keith Rush,
Wittawat Jitkrittum,
Harikrishna Narasimhan,
Ankit Singh Rawat,
Aditya Krishna Menon,
Alec Go
Abstract:
Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employs smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the cascaded LMs during training. In this paper, we present cascade-aware training (CAT), an approach to optimizing the overall quality-cost performance tradeoff of a cascade of LMs. We achieve inference-time benefits by training the small LM with awareness of its place in a cascade and downstream capabilities. We demonstrate the value of the proposed method with over 60 LM tasks of the SuperGLUE, WMT22, and FLAN2021 datasets.
Submitted 29 May, 2024;
originally announced June 2024.
-
Faster Cascades via Speculative Decoding
Authors:
Harikrishna Narasimhan,
Wittawat Jitkrittum,
Ankit Singh Rawat,
Seungyeon Kim,
Neha Gupta,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades offer better cost-quality trade-offs, often even outperforming the large model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
Submitted 21 October, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Bayesian Windkessel calibration using optimized 0D surrogate models
Authors:
Jakob Richter,
Jonas Nitzler,
Luca Pegolotti,
Karthik Menon,
Jonas Biehler,
Wolfgang A. Wall,
Daniele E. Schiavazzi,
Alison L. Marsden,
Martin R. Pfaller
Abstract:
Boundary condition (BC) calibration to assimilate clinical measurements is an essential step in any subject-specific simulation of cardiovascular fluid dynamics. Bayesian calibration approaches have successfully quantified the uncertainties inherent in identified parameters. Yet, routinely estimating the posterior distribution for all BC parameters in 3D simulations has been unattainable due to the infeasible computational demand. We propose an efficient method to identify Windkessel parameter posteriors using results from a single high-fidelity three-dimensional (3D) model evaluation. We only evaluate the 3D model once for an initial choice of BCs and use the result to create a highly accurate zero-dimensional (0D) surrogate. We then perform Sequential Monte Carlo (SMC) using the optimized 0D model to derive the high-dimensional Windkessel BC posterior distribution. We validate this approach in a publicly available dataset of N=72 subject-specific vascular models. We found that optimizing 0D models to match 3D data a priori lowered their median approximation error by nearly one order of magnitude. In a subset of models, we confirm that the optimized 0D models still generalize to a wide range of BCs. Finally, we present the high-dimensional Windkessel parameter posterior for different measured signal-to-noise ratios in a vascular model using SMC. We further validate that the 0D-derived posterior is a good approximation of the 3D posterior. The minimal computational demand of our method using a single 3D simulation, combined with the open-source nature of all software and data used in this work, will increase access and efficiency of Bayesian Windkessel calibration in cardiovascular fluid dynamics simulations.
Submitted 29 July, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
Language Model Cascades: Token-level uncertainty and beyond
Authors:
Neha Gupta,
Harikrishna Narasimhan,
Wittawat Jitkrittum,
Ankit Singh Rawat,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across examples. To mitigate this issue, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
Submitted 15 April, 2024;
originally announced April 2024.
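
The length bias described in this abstract can be seen with a few numbers: summing per-token log-probabilities penalizes long outputs, while averaging them normalizes by length. The sketch below only illustrates these simple aggregations over made-up token log-probabilities; the function name `sequence_confidence` and the values are assumptions, and the learned post-hoc deferral rules the abstract proposes are not implemented here.

```python
def sequence_confidence(token_logprobs, how="sum"):
    """Aggregate per-token log-probabilities into one confidence score."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    if how == "sum":    # log-probability of the whole sequence
        return sum(token_logprobs)
    if how == "mean":   # length-normalized log-probability
        return sum(token_logprobs) / len(token_logprobs)
    if how == "min":    # confidence of the least certain token
        return min(token_logprobs)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical outputs: a short answer with uncertain tokens and a
# long answer whose individual tokens are each more certain.
short_answer = [-0.5, -0.5]
long_answer = [-0.2] * 10

sum_prefers_short = (
    sequence_confidence(short_answer, "sum") > sequence_confidence(long_answer, "sum")
)
mean_prefers_long = (
    sequence_confidence(long_answer, "mean") > sequence_confidence(short_answer, "mean")
)
```

Under the sum, the short answer looks more confident purely because it has fewer tokens; under the mean, the ordering flips. A deferral threshold applied to either score would therefore defer different examples.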
-
Regression-aware Inference with LLMs
Authors:
Michal Lukasik,
Harikrishna Narasimhan,
Aditya Krishna Menon,
Felix Yu,
Sanjiv Kumar
Abstract:
Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
Submitted 1 November, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
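
As a rough illustration of estimating a Bayes-optimal prediction in closed form from sampled responses: under squared error the minimizer of expected loss is the mean of the predictive distribution, and under absolute error it is the median, so both can be estimated directly from samples. The sketch below assumes the sampled responses have already been parsed into numbers; the function name `bayes_optimal_estimate` and the sample values are hypothetical, and the abstract does not say which metrics the paper covers.

```python
import statistics


def bayes_optimal_estimate(samples, metric):
    """Plug-in estimate of the loss-minimizing prediction from samples."""
    if not samples:
        raise ValueError("need at least one sampled response")
    if metric == "squared_error":
        return statistics.mean(samples)    # mean minimizes expected squared error
    if metric == "absolute_error":
        return statistics.median(samples)  # median minimizes expected absolute error
    raise ValueError(f"unsupported metric: {metric}")


# Hypothetical numeric scores parsed from five sampled model responses.
samples = [3.0, 4.0, 4.0, 5.0, 9.0]
mse_prediction = bayes_optimal_estimate(samples, "squared_error")
mae_prediction = bayes_optimal_estimate(samples, "absolute_error")
```

Returning any single sampled response instead would generally incur a higher expected loss than these aggregates, which is the sub-optimality the abstract points to.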
-
Classification of attention performance post-longitudinal tDCS via functional connectivity and machine learning methods
Authors:
Akash K Rao,
Vishnu K Menon,
Arnav Bhavsar,
Shubhajit Roy Chowdhury,
Ramsingh Negi,
Varun Dutt
Abstract:
Attention is the brain's mechanism for selectively processing specific stimuli while filtering out irrelevant information. Characterizing changes in attention following long-term interventions (such as transcranial direct current stimulation (tDCS)) has seldom been emphasized in the literature. To classify attention performance post-tDCS, this study uses functional connectivity and machine learning algorithms. Fifty individuals were split into experimental and control conditions. On Day 1, EEG data was obtained as subjects executed an attention task. From Day 2 through Day 8, the experimental group was administered 1mA tDCS, while the control group received sham tDCS. On Day 10, subjects repeated the task mentioned on Day 1. Functional connectivity metrics were used to classify attention performance using various machine learning methods. Results revealed that combining the Adaboost model and recursive feature elimination yielded a classification accuracy of 91.84%. We discuss the implications of our results in developing neurofeedback frameworks to assess attention.
Submitted 31 January, 2024;
originally announced February 2024.
-
Gesture Controlled Robot For Human Detection
Authors:
Athira T. S,
Honey Manoj,
R S Vishnu Priya,
Vishnu K Menon,
Srilekshmi M
Abstract:
It is very important to locate survivors in collapsed buildings so that rescue operations can be arranged. Many lives are lost due to the lack of competent systems to detect people in these collapsed buildings at the right time. We have therefore designed a hand-gesture-controlled robot that is capable of detecting humans under the debris of collapsed buildings. The proposed work can be used to reach locations that are not accessible to humans, and to detect people trapped under the rubble of collapsed buildings. This information is then used to notify the rescue team so that it can take adequate measures and initiate rescue operations accordingly.
Submitted 31 January, 2024;
originally announced January 2024.
-
Prediction of multitasking performance post-longitudinal tDCS via EEG-based functional connectivity and machine learning methods
Authors:
Akash K Rao,
Shashank Uttrani,
Vishnu K Menon,
Darshil Shah,
Arnav Bhavsar,
Shubhajit Roy Chowdhury,
Varun Dutt
Abstract:
Predicting and understanding the changes in cognitive performance, especially after a longitudinal intervention, is a fundamental goal in neuroscience. Longitudinal brain stimulation-based interventions like transcranial direct current stimulation (tDCS) induce short-term changes in the resting membrane potential and influence cognitive processes. However, very little research has been conducted on predicting these changes in cognitive performance post-intervention. In this research, we intend to address this gap in the literature by employing different EEG-based functional connectivity analyses and machine learning algorithms to predict changes in cognitive performance in a complex multitasking task. Forty subjects were divided into experimental and active-control conditions. On Day 1, all subjects executed a multitasking task with simultaneous 32-channel EEG being acquired. From Day 2 to Day 7, subjects in the experimental condition undertook 15 minutes of 2mA anodal tDCS stimulation during task training. Subjects in the active-control condition undertook 15 minutes of sham stimulation during task training. On Day 10, all subjects again executed the multitasking task with EEG acquisition. Source-level functional connectivity metrics, namely phase lag index and directed transfer function, were extracted from the EEG data on Day 1 and Day 10. Various machine learning models were employed to predict changes in cognitive performance. Results revealed that the multi-layer perceptron and directed transfer function recorded a cross-validation training RMSE of 5.11% and a test RMSE of 4.97%. We discuss the implications of our results in developing real-time cognitive state assessors for accurately predicting cognitive performance in dynamic and complex tasks post-tDCS intervention.
Submitted 31 January, 2024;
originally announced January 2024.
-
Predicting suicidal behavior among Indian adults using childhood trauma, mental health questionnaires and machine learning cascade ensembles
Authors:
Akash K Rao,
Gunjan Y Trivedi,
Riri G Trivedi,
Anshika Bajpai,
Gajraj Singh Chauhan,
Vishnu K Menon,
Kathirvel Soundappan,
Hemalatha Ramani,
Neha Pandya,
Varun Dutt
Abstract:
Among young adults, suicide is India's leading cause of death, accounting for an alarming national suicide rate of around 16%. In recent years, machine learning algorithms have emerged to predict suicidal behavior using various behavioral traits. But to date, the efficacy of machine learning algorithms in predicting suicidal behavior in the Indian context has not been explored in the literature. In this study, different machine learning algorithms and ensembles were developed to predict suicidal behavior based on childhood trauma, different mental health parameters, and other behavioral factors. The dataset was acquired from 391 individuals from a wellness center in India. Information regarding their childhood trauma, psychological wellness, and other mental health issues was acquired through standardized questionnaires. Results revealed that cascade ensemble learning methods using a support vector machine, decision trees, and random forest were able to classify suicidal behavior with an accuracy of 95.04% using data from childhood trauma and mental health questionnaires. The study highlights the potential of using these machine learning ensembles to identify individuals with suicidal tendencies so that targeted interventions could be provided efficiently.
Submitted 31 January, 2024;
originally announced January 2024.
-
Classification of executive functioning performance post-longitudinal tDCS using functional connectivity and machine learning methods
Authors:
Akash K Rao,
Vishnu K Menon,
Shashank Uttrani,
Ayushman Dixit,
Dipanshu Verma,
Varun Dutt
Abstract:
Executive functioning is a cognitive process that enables humans to plan, organize, and regulate their behavior in a goal-directed manner. Understanding and classifying the changes in executive functioning after longitudinal interventions (like transcranial direct current stimulation (tDCS)) has not been explored in the literature. This study employs functional connectivity and machine learning algorithms to classify executive functioning performance post-tDCS. Fifty subjects were divided into experimental and placebo control groups. EEG data was collected while subjects performed an executive functioning task on Day 1. The experimental group received tDCS during task training from Day 2 to Day 8, while the control group received sham tDCS. On Day 10, subjects repeated the tasks specified on Day 1. Different functional connectivity metrics were extracted from EEG data and eventually used for classifying executive functioning performance using different machine learning algorithms. Results revealed that a novel combination of partial directed coherence and multi-layer perceptron (along with recursive feature elimination) resulted in a high classification accuracy of 95.44%. We discuss the implications of our results in developing real-time neurofeedback systems for assessing and enhancing executive functioning performance post-tDCS administration.
Submitted 31 January, 2024;
originally announced January 2024.
-
A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis
Authors:
John D. Lee,
Jakob Richter,
Martin R. Pfaller,
Jason M. Szafron,
Karthik Menon,
Andrea Zanoni,
Michael R. Ma,
Jeffrey A. Feinstein,
Jacqueline Kreutzer,
Alison L. Marsden,
Daniele E. Schiavazzi
Abstract:
The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an application to the repair of multiple stenoses in peripheral pulmonary artery disease through either transcatheter pulmonary artery rehabilitation or surgery, where it is of interest to achieve desired pressures and flows at specific locations in the pulmonary artery tree, while minimizing the risk for the patient. Since different degrees of success can be achieved in practice during treatment, we formulate the problem in probability, and solve it through a sample-based approach. We propose a new offline-online pipeline for probabilistic real-time treatment planning which combines offline assimilation of boundary conditions, model reduction, and training dataset generation with online estimation of marginal probabilities, possibly conditioned on the degree of augmentation observed in already repaired lesions. Moreover, we propose a new approach for the parametrization of arbitrarily shaped vascular repairs through iterative corrections of a zero-dimensional approximant. We demonstrate this pipeline for a diseased model of the pulmonary artery tree available through the Vascular Model Repository.
Submitted 1 December, 2023;
originally announced December 2023.
-
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
Authors:
Yongchao Zhou,
Kaifeng Lyu,
Ankit Singh Rawat,
Aditya Krishna Menon,
Afshin Rostamizadeh,
Sanjiv Kumar,
Jean-François Kagy,
Rishabh Agarwal
Abstract:
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
Submitted 30 March, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
What do larger image classifiers memorise?
Authors:
Michal Lukasik,
Vaishnavh Nagarajan,
Ankit Singh Rawat,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification benchmarks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman memorisation score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.
Submitted 8 October, 2023;
originally announced October 2023.
-
Think before you speak: Training Language Models With Pause Tokens
Authors:
Sachin Goyal,
Ziwei Ji,
Ankit Singh Rawat,
Aditya Krishna Menon,
Sanjiv Kumar,
Vaishnavh Nagarajan
Abstract:
Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate, say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
Submitted 20 April, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
The importance of feature preprocessing for differentially private linear optimization
Authors:
Ziteng Sun,
Ananda Theertha Suresh,
Aditya Krishna Menon
Abstract:
Training machine learning models with differential privacy (DP) has received increasing interest in recent years. One of the most popular algorithms for training differentially private models is differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? Towards answering this question, we show that even for the simple case of linear classification, unlike non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where without feature preprocessing, DPSGD incurs an optimality gap proportional to the maximum Euclidean norm of features over all samples. We then propose an algorithm called DPSGD-F, which combines DPSGD with feature preprocessing and prove that for classification tasks, it incurs an optimality gap proportional to the diameter of the features $\max_{x, x' \in D} \|x - x'\|_2$. We finally demonstrate the practicality of our algorithm on image classification benchmarks.
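As a rough illustration of the "clip, then add noise" step described above, the following is a minimal NumPy sketch and not the authors' DPSGD-F algorithm. The function names `dp_sgd_step` and `feature_diameter`, the Gaussian noise scaled by `noise_multiplier * clip_norm`, and all parameter names are assumptions made for this example; no privacy accounting is included.

```python
import numpy as np


def dp_sgd_step(w, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One illustrative DP-SGD style update (hypothetical helper, no privacy accounting).

    Each per-example gradient is rescaled so its L2 norm is at most clip_norm,
    the clipped gradients are summed, Gaussian noise is added, and the noisy
    sum is averaged over the batch before the gradient step.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return w - lr * noisy_mean


def feature_diameter(X):
    """Largest pairwise Euclidean distance max_{x, x'} ||x - x'||_2 over rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).max())
```

The `feature_diameter` helper only computes the quantity named in the abstract's bound; it is not itself private, whereas the abstract refers to private feature preprocessing.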
Submitted 19 February, 2024; v1 submitted 19 July, 2023;
originally announced July 2023.
-
When Does Confidence-Based Cascade Deferral Suffice?
Authors:
Wittawat Jitkrittum,
Neha Gupta,
Aditya Krishna Menon,
Harikrishna Narasimhan,
Ankit Singh Rawat,
Sanjiv Kumar
Abstract:
Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
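The confidence-based deferral rule mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's post-hoc deferral mechanism: the names `softmax` and `cascade_predict`, the per-stage `thresholds` list, and the convention that each classifier is a callable returning a logit vector are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def cascade_predict(x, classifiers, thresholds):
    """Run classifiers in order; stop at the first one that is confident enough.

    classifiers: list of callables mapping x to a 1-D array of logits.
    thresholds:  one confidence threshold per non-final classifier.
    Returns (predicted_class, index_of_classifier_that_answered).
    """
    last = len(classifiers) - 1
    for k, clf in enumerate(classifiers):
        probs = softmax(clf(x))
        # Confidence = maximum predicted softmax probability.
        # The final classifier always answers, so no threshold is needed for it.
        if k == last or probs.max() >= thresholds[k]:
            return int(np.argmax(probs)), k
```

Note that the rule only looks at the current classifier's confidence; it has no information about how accurate the downstream classifier would be on the same input, which is the limitation the abstract discusses.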
Submitted 23 January, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Mining the contribution of intensive care clinical course to outcome after traumatic brain injury
Authors:
Shubhayu Bhattacharyay,
Pier Francesco Caruso,
Cecilia Åkerlund,
Lindsay Wilson,
Robert D Stevens,
David K Menon,
Ewout W Steyerberg,
David W Nelson,
Ari Ercole,
the CENTER-TBI investigators/participants
Abstract:
Existing methods to characterise the evolving condition of traumatic brain injury (TBI) patients in the intensive care unit (ICU) do not capture the context necessary for individualising treatment. Here, we integrate all heterogeneous data stored in medical records (1,166 pre-ICU and ICU variables) to model the individualised contribution of clinical course to six-month functional outcome on the Glasgow Outcome Scale - Extended (GOSE). On a prospective cohort (n=1,550, 65 centres) of TBI patients, we train recurrent neural network models to map a token-embedded time series representation of all variables (including missing values) to an ordinal GOSE prognosis every two hours. The full range of variables explains up to 52% (95% CI: 50%-54%) of the ordinal variance in functional outcome. Up to 91% (95% CI: 90%-91%) of this explanation is derived from pre-ICU and admission information (i.e., static variables). Information collected in the ICU (i.e., dynamic variables) increases explanation (by up to 5% [95% CI: 4%-6%]), though not enough to counter poorer overall performance in longer-stay (>5.75 days) patients. Highest-contributing variables include physician-based prognoses, CT features, and markers of neurological function. Whilst static information currently accounts for the majority of functional outcome explanation after TBI, data-driven analysis highlights investigative avenues to improve dynamic characterisation of longer-stay patients. Moreover, our modelling strategy proves useful for converting large patient records into interpretable time series with missing data integration and minimal processing.
Submitted 1 August, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
ResMem: Learn what you can and memorize the rest
Authors:
Zitong Yang,
Michal Lukasik,
Vaishnavh Nagarajan,
Zonglin Li,
Ankit Singh Rawat,
Manzil Zaheer,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.
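The residual-fitting idea described above can be sketched in a few lines. This is a simplified illustration, assuming an unweighted $k$-nearest-neighbor average over Euclidean distances; the function name `fit_resmem`, the `base_predict` callable, and the default `k` are hypothetical choices for this example and may differ from the paper's exact regressor.

```python
import numpy as np


def fit_resmem(base_predict, X_train, y_train, k=3):
    """Return a predictor equal to the base model plus a k-NN fit of its residuals.

    base_predict: callable mapping an (n, d) array to an (n,) array of predictions.
    The residuals y - base_predict(X) on the training set are stored, and at
    prediction time the mean residual of the k nearest training points is added.
    """
    residuals = y_train - base_predict(X_train)

    def predict(X):
        # Pairwise Euclidean distances between query rows and training rows.
        dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, :k]
        return base_predict(X) + residuals[nearest].mean(axis=1)

    return predict
```

With `k=1` and distinct training points, the combined predictor reproduces the training labels exactly, which matches the abstract's remark that the method can explicitly memorize the training labels by construction.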
Submitted 20 October, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
On student-teacher deviations in distillation: does it pay to disobey?
Authors:
Vaishnavh Nagarajan,
Aditya Krishna Menon,
Srinadh Bhojanapalli,
Hossein Mobahi,
Sanjiv Kumar
Abstract:
Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.
Submitted 18 March, 2024; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Plugin estimators for selective classification with out-of-distribution detection
Authors:
Harikrishna Narasimhan,
Aditya Krishna Menon,
Wittawat Jitkrittum,
Sanjiv Kumar
Abstract:
Real-world classifiers can benefit from the option of abstaining from predicting on samples where they have low confidence. Such abstention is particularly useful on samples which are close to the learned decision boundary, or which are outliers with respect to the training sample. These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature. Recent work on selective classification with OOD detection (SCOD) has argued for the unified study of these problems; however, the formal underpinnings of this problem are still nascent, and existing techniques are heuristic in nature. In this paper, we propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches from the SC and OOD detection literature. In the course of our analysis, we formally explicate how naïve use of existing SC and OOD detection baselines may be inadequate for SCOD. We empirically demonstrate that our approaches yield competitive SC and OOD detection performance compared to baselines from both literatures.
Submitted 24 July, 2023; v1 submitted 29 January, 2023;
originally announced January 2023.
-
Supervision Complexity and its Role in Knowledge Distillation
Authors:
Hrayr Harutyunyan,
Ankit Singh Rawat,
Aditya Krishna Menon,
Seungyeon Kim,
Sanjiv Kumar
Abstract:
Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
Submitted 28 January, 2023;
originally announced January 2023.
-
EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval
Authors:
Seungyeon Kim,
Ankit Singh Rawat,
Manzil Zaheer,
Sadeep Jayasumana,
Veeranjaneyulu Sadhanala,
Wittawat Jitkrittum,
Aditya Krishna Menon,
Rob Fergus,
Sanjiv Kumar
Abstract:
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold to reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models which realize better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
Submitted 3 July, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
When does mixup promote local linearity in learned representations?
Authors:
Arslan Chaudhry,
Aditya Krishna Menon,
Andreas Veit,
Sadeep Jayasumana,
Srikumar Ramalingam,
Sanjiv Kumar
Abstract:
Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance, and has been heavily used as part of semi-supervised learning techniques such as MixMatch (Berthelot et al., 2019) and interpolation consistent training (ICT) (Verma et al., 2019). In this paper, we look at Mixup through a representation learning lens in a semi-supervised learning setup. In particular, we study the role of Mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the Mixup loss that enforces linearity in the last network layer propagate the linearity to the earlier layers? and (2) how does the enforcement of stronger Mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of Mixup on vision datasets such as CIFAR-10, CIFAR-100 and SVHN. Our results show that supervised Mixup training does not make all the network layers linear; in fact the intermediate layers become more non-linear during Mixup training compared to a network that is trained without Mixup. However, when Mixup is used as an unsupervised loss, we observe that all the network layers become more linear resulting in faster training convergence.
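For readers unfamiliar with the convex-combination step that Mixup relies on, a minimal sketch follows. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the commonly used Mixup recipe and is an assumption here, since the abstract does not specify it; the function name `mixup_batch` and its parameters are likewise hypothetical.

```python
import numpy as np


def mixup_batch(x, y, alpha, rng):
    """Mix each example in a batch with a randomly chosen partner from the same batch.

    x: (n, d) array of inputs; y: (n, c) array of one-hot or soft labels.
    A single weight lam in [0, 1] is drawn (assumed Beta(alpha, alpha)), and
    inputs and labels are replaced by lam * original + (1 - lam) * partner.
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed, lam
```

Because the same weight is applied to inputs and labels, a model trained on the mixed pairs is pushed toward behaving linearly between training points at its output layer, which is the property whose propagation to earlier layers the abstract examines.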
Submitted 28 October, 2022;
originally announced October 2022.
-
Robust Distillation for Worst-class Performance
Authors:
Serena Wang,
Harikrishna Narasimhan,
Yichen Zhou,
Sara Hooker,
Michal Lukasik,
Aditya Krishna Menon
Abstract:
Knowledge distillation has proven to be an effective technique in improving the performance of a student model using predictions from a teacher model. However, recent work has shown that gains in average efficiency are not uniform across subgroups in the data, and in particular can often come at the cost of accuracy on rare subgroups and classes. To preserve strong performance across classes that may follow a long-tailed distribution, we develop distillation techniques that are tailored to improve the student's worst-class performance. Specifically, we introduce robust optimization objectives in different combinations for the teacher and student, and further allow for training with any tradeoff between the overall accuracy and the robust worst-class objective. We show empirically that our robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods. Theoretically, we provide insights into what makes a good teacher when the goal is to train a robust student.
Submitted 13 June, 2022;
originally announced June 2022.
-
ELM: Embedding and Logit Margins for Long-Tail Learning
Authors:
Wittawat Jitkrittum,
Aditya Krishna Menon,
Ankit Singh Rawat,
Sanjiv Kumar
Abstract:
Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.
Submitted 27 April, 2022;
originally announced April 2022.
-
The leap to ordinal: detailed functional prognosis after traumatic brain injury with a flexible modelling approach
Authors:
Shubhayu Bhattacharyay,
Ioan Milosevic,
Lindsay Wilson,
David K. Menon,
Robert D. Stevens,
Ewout W. Steyerberg,
David W. Nelson,
Ari Ercole,
the CENTER-TBI investigators/participants
Abstract:
When a patient is admitted to the intensive care unit (ICU) after a traumatic brain injury (TBI), an early prognosis is essential for baseline risk adjustment and shared decision making. TBI outcomes are commonly categorised by the Glasgow Outcome Scale-Extended (GOSE) into 8 ordered levels of functional recovery at 6 months after injury. Existing ICU prognostic models predict binary outcomes at a certain threshold of GOSE (e.g., prediction of survival [GOSE>1] or functional independence [GOSE>4]). We aimed to develop ordinal prediction models that concurrently predict probabilities of each GOSE score. From a prospective cohort (n=1,550, 65 centres) in the ICU stratum of the Collaborative European NeuroTrauma Effectiveness Research in TBI (CENTER-TBI) patient dataset, we extracted all clinical information within 24 hours of ICU admission (1,151 predictors) and 6-month GOSE scores. We analysed the effect of 2 design elements on ordinal model performance: (1) the baseline predictor set, ranging from a concise set of 10 validated predictors to a token-embedded representation of all possible predictors, and (2) the modelling strategy, from ordinal logistic regression to multinomial deep learning. With repeated k-fold cross-validation, we found that expanding the baseline predictor set significantly improved ordinal prediction performance while increasing analytical complexity did not. Half of these gains could be achieved with the addition of 8 high-impact predictors (2 demographic variables, 4 protein biomarkers, and 2 severity assessments) to the concise set. At best, ordinal models achieved 0.76 (95% CI: 0.74-0.77) ordinal discrimination ability (ordinal c-index) and 57% (95% CI: 54%-60%) explanation of ordinal variation in 6-month GOSE (Somers' D). Our results motivate the search for informative predictors for higher GOSE and the development of ordinal dynamic prediction models.
Submitted 4 May, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
When in Doubt, Summon the Titans: Efficient Inference with Large Models
Authors:
Ankit Singh Rawat,
Manzil Zaheer,
Aditya Krishna Menon,
Amr Ahmed,
Sanjiv Kumar
Abstract:
Scaling neural networks to "large" sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach on both image classification and natural language processing benchmarks.
Submitted 19 October, 2021;
originally announced October 2021.
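Illustrative sketch (not from the paper): the student-first, teacher-fallback inference described in the abstract above might look roughly like the following. The abstract does not say how "hard" examples are detected, so the confidence threshold, the function names, and the toy models here are all assumptions made for illustration.

```python
def two_stage_predict(x, student, teacher, threshold=0.9):
    """Route x to the student when it is confident, else fall back to the teacher.

    `student` returns (label, confidence); `teacher` returns a label.
    The confidence threshold is an assumed routing rule, not the paper's.
    """
    label, confidence = student(x)
    if confidence >= threshold:
        return label, "student"
    return teacher(x), "teacher"


def toy_student(x):
    # Hypothetical lightweight model: confident only far from zero.
    confidence = min(1.0, abs(x))
    return (1 if x > 0 else 0), confidence


def toy_teacher(x):
    # Hypothetical expensive model with a different decision boundary.
    return 1 if x > 0.2 else 0


results = [two_stage_predict(x, toy_student, toy_teacher) for x in (2.0, 0.1, -1.5)]
print(results)
```

When easy inputs dominate, most calls stop at the student, so the amortized cost stays close to that of the lightweight model.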
-
Transductive image segmentation: Self-training and effect of uncertainty estimation
Authors:
Konstantinos Kamnitsas,
Stefan Winzeck,
Evgenios N. Kornaropoulos,
Daniel Whitehouse,
Cameron Englman,
Poe Phyu,
Norman Pao,
David K. Menon,
Daniel Rueckert,
Tilak Das,
Virginia F. J. Newcombe,
Ben Glocker
Abstract:
Semi-supervised learning (SSL) uses unlabeled data during training to learn better models. Previous studies on SSL for medical image segmentation focused mostly on improving model generalization to unseen data. In some applications, however, our primary interest is not generalization but to obtain optimal predictions on a specific unlabeled database that is fully available during model development. Examples include population studies for extracting imaging phenotypes. This work investigates an often overlooked aspect of SSL, transduction. It focuses on the quality of predictions made on the unlabeled data of interest when they are included for optimization during training, rather than improving generalization. We focus on the self-training framework and explore its potential for transduction. We analyze it through the lens of Information Gain and reveal that learning benefits from the use of calibrated or under-confident models. Our extensive experiments on a large MRI database for multi-class segmentation of traumatic brain lesions show promising results when comparing transductive with inductive predictions. We believe this study will inspire further research on transductive learning, a well-suited paradigm for medical image analysis.
Submitted 2 August, 2021; v1 submitted 19 July, 2021;
originally announced July 2021.
-
Training Over-parameterized Models with Non-decomposable Objectives
Authors:
Harikrishna Narasimhan,
Aditya Krishna Menon
Abstract:
Many modern machine learning applications come with complex and nuanced design goals such as minimizing the worst-case error, satisfying a given precision or recall target, or enforcing group-fairness constraints. Popular techniques for optimizing such non-decomposable objectives reduce the problem into a sequence of cost-sensitive learning tasks, each of which is then solved by re-weighting the training loss with example-specific costs. We point out that the standard approach of re-weighting the loss to incorporate label costs can produce unsatisfactory results when used to train over-parameterized models. As a remedy, we propose new cost-sensitive losses that extend the classical idea of logit adjustment to handle more general cost matrices. Our losses are calibrated, and can be further improved with distilled labels from a teacher model. Through experiments on benchmark image datasets, we showcase the effectiveness of our approach in training ResNet models with common robust and constrained optimization objectives.
Submitted 9 July, 2021;
originally announced July 2021.
-
Teacher's pet: understanding and mitigating biases in distillation
Authors:
Michal Lukasik,
Srinadh Bhojanapalli,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques which soften the teacher influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy, while additionally ensuring improvement in subgroup performance.
Submitted 8 July, 2021; v1 submitted 19 June, 2021;
originally announced June 2021.
-
Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces
Authors:
Ankit Singh Rawat,
Aditya Krishna Menon,
Wittawat Jitkrittum,
Sadeep Jayasumana,
Felix X. Yu,
Sashank Reddi,
Sanjiv Kumar
Abstract:
Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.
Submitted 12 May, 2021;
originally announced May 2021.
-
Interval-censored Hawkes processes
Authors:
Marian-Andrei Rizoiu,
Alexander Soen,
Shidi Li,
Pio Calderon,
Leanne Dong,
Aditya Krishna Menon,
Lexing Xie
Abstract:
Interval-censored data solely records the aggregated counts of events during specific time intervals - such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors - and not the exact occurrence time of the events. It is currently not understood how to fit the Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, it does not have the independent increments property to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for fitting Hawkes processes within interval-censored data scenarios. First, we define the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We fit MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function - for when the exogenous events are observed as event time - and the latent homogeneous Poisson process exogenous function - for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data.
Submitted 25 November, 2022; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Distilling Double Descent
Authors:
Andrew Cotter,
Aditya Krishna Menon,
Harikrishna Narasimhan,
Ankit Singh Rawat,
Sashank J. Reddi,
Yichen Zhou
Abstract:
Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g., probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches.
Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can, counterintuitively, result in better generalization. Researchers have identified several settings in which it takes place, while others have made various attempts to explain it (thus far, with only partial success). In contrast, we avoid these questions, and instead seek to exploit this phenomenon by demonstrating that a highly-overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger independent dataset labeled by this teacher will avoid overfitting due to the size of its training set.
Submitted 12 February, 2021;
originally announced February 2021.
-
Semantic Label Smoothing for Sequence to Sequence Problems
Authors:
Michal Lukasik,
Himanshu Jain,
Aditya Krishna Menon,
Seungyeon Kim,
Srinadh Bhojanapalli,
Felix Yu,
Sanjiv Kumar
Abstract:
Label smoothing has been shown to be an effective regularization strategy in classification, that prevents overfitting and helps in label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, is challenging: the large target output space of such problems makes it intractable to apply label smoothing over all possible outputs. Most existing approaches for seq2seq settings either do token level smoothing, or smooth over sequences generated by randomly substituting tokens in the target sequence. Unlike these works, in this paper, we propose a technique that smooths over well-formed relevant sequences that not only have sufficient n-gram overlap with the target sequence, but are also semantically similar. Our method shows a consistent and significant improvement over the state-of-the-art techniques on different datasets.
Submitted 14 October, 2020;
originally announced October 2020.
-
SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy
Authors:
Umanga Bista,
Alexander Patrick Mathews,
Aditya Krishna Menon,
Lexing Xie
Abstract:
Most work on multi-document summarization has focused on generic summarization of information present in each individual document set. However, the under-explored setting of update summarization, where the goal is to identify the new information present in each set, is of equal practical interest (e.g., presenting readers with updates on an evolving news topic). In this work, we present SupMMD, a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing. SupMMD combines both supervised learning for salience and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to make use of similarity across multiple information sources (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
Submitted 6 October, 2020;
originally announced October 2020.
-
Long-tail learning via logit adjustment
Authors:
Aditya Krishna Menon,
Sadeep Jayasumana,
Ankit Singh Rawat,
Himanshu Jain,
Andreas Veit,
Sanjiv Kumar
Abstract:
Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.
Submitted 9 July, 2021; v1 submitted 14 July, 2020;
originally announced July 2020.
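Illustrative sketch: the post-hoc variant of logit adjustment mentioned in the abstract above is commonly written as subtracting a scaled log class prior from each logit before taking the argmax. The function names, the scaling parameter `tau`, and the toy numbers below are assumptions for illustration; the abstract itself only states that the adjustment is based on label frequencies.

```python
import math


def posthoc_logit_adjust(logits, class_priors, tau=1.0):
    """Return logits with tau * log(prior) subtracted per class.

    Rare classes (small prior) receive a larger boost, which counteracts
    a model's bias towards dominant labels.
    """
    return [z - tau * math.log(p) for z, p in zip(logits, class_priors)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy two-class example: class 0 is dominant (90% of training labels).
logits = [2.0, 1.5]
priors = [0.9, 0.1]
adjusted = posthoc_logit_adjust(logits, priors)

print(argmax(logits), argmax(adjusted))
```

In this toy case the raw logits favour the dominant class, while the adjusted logits favour the rare one; with `tau=0` the adjustment is switched off.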
-
Why distillation helps: a statistical perspective
Authors:
Aditya Krishna Menon,
Ankit Singh Rawat,
Sashank J. Reddi,
Seungyeon Kim,
Sanjiv Kumar
Abstract:
Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques. Our core observation is that the teacher seeks to estimate the underlying (Bayes) class-probability function. Building on this, we establish a fundamental bias-variance tradeoff in the student's objective: this quantifies how approximate knowledge of these class-probabilities can significantly aid learning. Finally, we show how distillation complements existing negative mining techniques for extreme multiclass retrieval, and propose a unified objective which combines these ideas.
Submitted 20 May, 2020;
originally announced May 2020.
-
Development of a Machine Learning Model and Mobile Application to Aid in Predicting Dosage of Vitamin K Antagonists Among Indian Patients
Authors:
Amruthlal M,
Devika S,
Ameer Suhail P A,
Aravind K Menon,
Vignesh Krishnan,
Alan Thomas,
Manu Thomas,
Sanjay G,
Lakshmi Kanth L R,
Jimmy Jose,
Harikrishnan S
Abstract:
Patients who undergo mechanical heart valve replacements or have conditions like Atrial Fibrillation have to take Vitamin K Antagonist (VKA) drugs to prevent coagulation of blood. These drugs have a narrow therapeutic range and need to be very closely monitored due to life-threatening side effects. The dosage of a VKA drug is determined and revised by a physician based on the Prothrombin Time - International Normalised Ratio (PT-INR) value obtained through a blood test. Our work aimed at predicting the maintenance dosage of warfarin, currently the most widely recommended anticoagulant drug, using de-identified medical data collected from 109 patients from Kerala. A Support Vector Machine (SVM) Regression model was built to predict the maintenance dosage of warfarin for patients who have been undergoing treatment from a physician and have reached stable INR values between 2.0 and 4.0.
Submitted 19 April, 2020;
originally announced April 2020.
-
Doubly-stochastic mining for heterogeneous retrieval
Authors:
Ankit Singh Rawat,
Aditya Krishna Menon,
Andreas Veit,
Felix Yu,
Sashank J. Reddi,
Sanjiv Kumar
Abstract:
Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses are difficult to optimise even on a single example. The second challenge concerns uniformity: one ideally wants good performance on each subpopulation. While several solutions have been proposed to address the first challenge, the second challenge has received relatively less attention. In this paper, we propose doubly-stochastic mining (S2M), a stochastic optimization technique that addresses both challenges. In each iteration of S2M, we compute a per-example loss based on a subset of hardest labels, and then compute the minibatch loss based on the hardest examples. We show theoretically and empirically that by focusing on the hardest examples, S2M ensures that all data subpopulations are modelled well.
Submitted 22 April, 2020;
originally announced April 2020.
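Illustrative sketch: the two-level "hardest labels, then hardest examples" selection described in the abstract above could be approximated as below. This is not the paper's objective: the choice of averaging the top-k per-label losses, the function names, and the toy loss values are all assumptions, and the stochastic sampling of labels and examples is omitted.

```python
def top_k_mean(values, k):
    """Mean of the k largest values (the 'hardest' entries)."""
    hardest = sorted(values, reverse=True)[:k]
    return sum(hardest) / len(hardest)


def doubly_mined_loss(label_losses, k_labels, k_examples):
    """Two-level hard mining over a minibatch.

    label_losses[i][j] is an assumed per-label loss for example i, label j.
    Step 1: per-example loss from that example's k_labels hardest labels.
    Step 2: minibatch loss from the k_examples hardest examples.
    """
    per_example = [top_k_mean(row, k_labels) for row in label_losses]
    return top_k_mean(per_example, k_examples)


# Hypothetical 3-example, 3-label minibatch of per-label losses.
batch = [
    [0.1, 0.9, 0.3],
    [0.2, 0.1, 0.1],
    [1.5, 0.5, 0.7],
]
loss = doubly_mined_loss(batch, k_labels=2, k_examples=2)
print(loss)
```

Because the easy second example is dropped from the minibatch loss, gradient signal concentrates on the examples the model currently handles worst.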
-
Federated Learning with Only Positive Labels
Authors:
Felix X. Yu,
Ankit Singh Rawat,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters for the negative classes. Thus, naively employing conventional decentralized learning such as the distributed SGD or Federated Averaging may lead to trivial or extremely poor classifiers. In particular, for the embedding based classifiers, all the class embeddings might collapse to a single point.
To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes to be spreadout in the embedding space. We show, both theoretically and empirically, that FedAwS can almost match the performance of conventional learning where users have access to negative labels. We further extend the proposed method to the settings with large output spaces.
Submitted 21 April, 2020;
originally announced April 2020.
-
Prediction of number of cases expected and estimation of the final size of coronavirus epidemic in India using the logistic model and genetic algorithm
Authors:
Ganesh Kumar M,
Soman K. P,
Gopalakrishnan E. A,
Vijay Krishna Menon,
Sowmya V
Abstract:
In this paper, we have applied the logistic growth regression model and genetic algorithm to predict the number of coronavirus infected cases that can be expected in upcoming days in India and also estimated the final size and its peak time of the coronavirus epidemic in India.
Submitted 26 March, 2020;
originally announced March 2020.
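Illustrative sketch: a logistic growth curve of the kind referred to in the abstract above is often parameterised as C(t) = K / (1 + a * exp(-r * t)), where K is the final epidemic size and the peak of new cases falls at the inflection point t = ln(a) / r. This three-parameter form, the function names, and the numbers below are assumptions for illustration only; they are not the paper's fitted values, and the genetic-algorithm fitting step is not shown.

```python
import math


def logistic_cases(t, final_size, growth_rate, a):
    """Cumulative cases at time t under an assumed logistic growth curve."""
    return final_size / (1.0 + a * math.exp(-growth_rate * t))


def peak_time(growth_rate, a):
    """Time of the inflection point, where new cases per day are highest."""
    return math.log(a) / growth_rate


# Hypothetical parameters, chosen only to exercise the functions.
K, r, a = 1000.0, 0.2, 50.0
t_peak = peak_time(r, a)

print(t_peak, logistic_cases(t_peak, K, r, a))
```

At the inflection point the cumulative count equals half of the final size, which gives a quick sanity check on any fitted parameters.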
-
Does label smoothing mitigate label noise?
Authors:
Michal Lukasik,
Srinadh Bhojanapalli,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem, being equivalent to injecting symmetric noise to the labels, we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing is competitive with loss-correction under label noise. Further, we show that when distilling models from noisy data, label smoothing of the teacher is beneficial; this is in contrast to recent findings for noise-free problems, and sheds further light on settings where label smoothing is beneficial.
Submitted 5 March, 2020;
originally announced March 2020.
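Illustrative sketch: the mixing of one-hot labels with a uniform label vector described in the abstract above can be written in a few lines. The function name and the mixing weight `alpha` are assumed names for illustration.

```python
def smooth_labels(label, num_classes, alpha=0.1):
    """Mix a one-hot target with the uniform distribution.

    Each entry becomes (1 - alpha) * one_hot + alpha / num_classes,
    so the result is still a valid probability vector.
    """
    uniform = 1.0 / num_classes
    return [
        (1.0 - alpha) * (1.0 if k == label else 0.0) + alpha * uniform
        for k in range(num_classes)
    ]


# Toy example: true class 2 out of 4 classes, with 20% smoothing.
smoothed = smooth_labels(label=2, num_classes=4, alpha=0.2)
print(smoothed)
```

Setting `alpha=0` recovers the original one-hot target, while `alpha=1` yields the uniform vector.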
-
Supervised Learning: No Loss No Cry
Authors:
Richard Nock,
Aditya Krishna Menon
Abstract:
Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an ad hoc manner. In hopes of making this procedure more principled, the problem of learning the loss function for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature.
In this paper, we revisit the SLIsotron algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast SLIsotron as learning a loss from a family of composite square losses. By interpreting this through the lens of proper losses, we derive a generalisation of SLIsotron based on Bregman divergences. The resulting BregmanTron algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the BregmanTron substantially outperforms the SLIsotron, and that the loss it learns can be minimized by other algorithms for different tasks, thereby opening the interesting problem of loss transfer between domains.
Submitted 10 February, 2020;
originally announced February 2020.
-
Online Hierarchical Clustering Approximations
Authors:
Aditya Krishna Menon,
Anand Rajagopalan,
Baris Sumengen,
Gui Citovsky,
Qin Cao,
Sanjiv Kumar
Abstract:
Hierarchical clustering is a widely used approach for clustering datasets at multiple levels of granularity. Despite its popularity, existing algorithms such as hierarchical agglomerative clustering (HAC) are limited to the offline setting, and thus require the entire dataset to be available. This prohibits their use on large datasets commonly encountered in modern learning applications. In this paper, we consider hierarchical clustering in the online setting, where points arrive one at a time. We propose two algorithms that seek to optimize the Moseley and Wang (MW) revenue function, a variant of the Dasgupta cost. These algorithms offer different tradeoffs between efficiency and MW revenue performance. The first algorithm, OTD, is a highly efficient Online Top Down algorithm which provably achieves a 1/3-approximation to the MW revenue under a data separation assumption. The second algorithm, OHAC, is an online counterpart to offline HAC, which is known to yield a 1/3-approximation to the MW revenue, and produce good quality clusters in practice. We show that OHAC approximates offline HAC by leveraging a novel split-merge procedure. We empirically show that OTD and OHAC offer significant efficiency and cluster quality gains respectively over baselines.
Submitted 20 September, 2019;
originally announced September 2019.
-
Noise-tolerant fair classification
Authors:
Alexandre Louis Lamy,
Ziyuan Zhong,
Aditya Krishna Menon,
Nakul Verma
Abstract:
Fairness-aware learning involves designing algorithms that do not discriminate with respect to some sensitive feature (e.g., race or gender). Existing work on the problem operates under the assumption that the sensitive feature available in one's training sample is perfectly reliable. This assumption may be violated in many real-world cases: for example, respondents to a survey may choose to conceal or obfuscate their group identity out of fear of potential discrimination. This poses the question of whether one can still learn fair classifiers given noisy sensitive features. In this paper, we answer the question in the affirmative: we show that if one measures fairness using the mean-difference score, and sensitive features are subject to noise from the mutually contaminated learning model, then owing to a simple identity we only need to change the desired fairness-tolerance. The requisite tolerance can be estimated by leveraging existing noise-rate estimators from the label noise literature. We finally show that our procedure is empirically effective on two case-studies involving sensitive feature censoring.
Submitted 9 January, 2020; v1 submitted 30 January, 2019;
originally announced January 2019.