-
Estimating Treatment Effects with Observed Confounders and Mediators
Authors:
Shantanu Gupta,
Zachary C. Lipton,
David Childers
Abstract:
Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor formula applies when an observed mediator transmits the causal effect. In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid. Addressing the linear Gaussian causal model, we demonstrate that either estimator can dominate the other by an unbounded constant factor. Next, we derive an optimal estimator, which leverages all observed variables, and bound its finite-sample variance. We show that it strictly outperforms the backdoor and frontdoor estimators and that this improvement can be unbounded. We also present a procedure for combining two datasets, one with observed confounders and another with observed mediators. Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.
Submitted 14 June, 2021; v1 submitted 26 March, 2020;
originally announced March 2020.
-
A Unified View of Label Shift Estimation
Authors:
Saurabh Garg,
Yifan Wu,
Sivaraman Balakrishnan,
Zachary C. Lipton
Abstract:
Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirically. In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. Our contributions include (i) consistency conditions for MLLS, which include calibration of the classifier and a confusion matrix invertibility condition that BBSE also requires; (ii) a unified framework, casting BBSE as roughly equivalent to MLLS for a particular choice of calibration method; and (iii) a decomposition of MLLS's finite-sample error into terms reflecting miscalibration and estimation error. Our analysis attributes BBSE's statistical inefficiency to a loss of information due to coarse calibration. Experiments on synthetic data, MNIST, and CIFAR10 support our findings.
Submitted 16 October, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Causal Inference With Selectively Deconfounded Data
Authors:
Kyra Gan,
Andrew A. Li,
Zachary C. Lipton,
Sridhar Tayur
Abstract:
Given only data generated by a standard confounding graph with an unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large confounded observational dataset (confounder unobserved) alongside a small deconfounded observational dataset (confounder revealed) when estimating the ATE. Our theoretical results suggest that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases -- say, genetics -- we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer.
Submitted 6 March, 2021; v1 submitted 25 February, 2020;
originally announced February 2020.
-
How Transferable are the Representations Learned by Deep Q Agents?
Authors:
Jacob Tyo,
Zachary Lipton
Abstract:
In this paper, we consider the source of Deep Reinforcement Learning (DRL)'s sample complexity, asking how much derives from the requirement of learning useful representations of environment states and how much is due to the sample complexity of learning a policy. While for DRL agents, the distinction between representation and policy may not be clear, we seek new insight through a set of transfer learning experiments. In each experiment, we retain some fraction of layers trained on either the same game or a related game, comparing the benefits of transfer learning to learning a policy from scratch. Interestingly, we find that benefits due to transfer are highly variable in general and non-symmetric across pairs of tasks. Our experiments suggest that perhaps transfer from simpler environments can boost performance on more complex downstream tasks and that the requirements of learning a useful representation can range from negligible to the majority of the sample complexity, based on the environment. Furthermore, we find that fine-tuning generally outperforms training with the transferred layers frozen, confirming an insight first noted in the classification setting.
Submitted 23 February, 2020;
originally announced February 2020.
-
Algorithmic Fairness from a Non-ideal Perspective
Authors:
Sina Fazelpour,
Zachary C. Lipton
Abstract:
Inspired by recent breakthroughs in predictive modeling, practitioners in both industry and government have turned to machine learning with hopes of operationalizing predictions to drive automated decisions. Unfortunately, many social desiderata concerning consequential decisions, such as justice or fairness, have no natural formulation within a purely predictive framework. In efforts to mitigate these problems, researchers have proposed a variety of metrics for quantifying deviations from various statistical parities that we might expect to observe in a fair world and offered a variety of algorithms in attempts to satisfy subsets of these parities or to trade off the degree to which they are satisfied against utility. In this paper, we connect this approach to fair machine learning to the literature on ideal and non-ideal methodological approaches in political philosophy. The ideal approach requires positing the principles according to which a just world would operate. In the most straightforward application of ideal theory, one supports a proposed policy by arguing that it closes a discrepancy between the real and the perfectly just world. However, by failing to account for the mechanisms by which our non-ideal world arose, the responsibilities of various decision-makers, and the impacts of proposed policies, naive applications of ideal thinking can lead to misguided interventions. In this paper, we demonstrate a connection between the fair machine learning literature and the ideal approach in political philosophy, and argue that the increasingly apparent shortcomings of proposed fair machine learning algorithms reflect broader troubles faced by the ideal approach. We conclude with a critical discussion of the harms of misguided solutions, a reinterpretation of impossibility results, and directions for future research.
Submitted 8 January, 2020;
originally announced January 2020.
-
Game Design for Eliciting Distinguishable Behavior
Authors:
Fan Yang,
Liu Leqi,
Yifan Wu,
Zachary C. Lipton,
Pradeep Ravikumar,
William W. Cohen,
Tom Mitchell
Abstract:
The ability to infer latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.
Submitted 12 December, 2019;
originally announced December 2019.
-
Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?
Authors:
Simran Kaur,
Jeremy Cohen,
Zachary C. Lipton
Abstract:
For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding supports the hypothesis that perceptually-aligned gradients may be a general property of robust classifiers. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.
Submitted 23 October, 2019; v1 submitted 18 October, 2019;
originally announced October 2019.
-
Accelerating Deep Learning by Focusing on the Biggest Losers
Authors:
Angela H. Jiang,
Daniel L.-K. Wong,
Giulio Zhou,
David G. Andersen,
Jeffrey Dean,
Gregory R. Ganger,
Gauri Joshi,
Michael Kaminsky,
Michael Kozuch,
Zachary C. Lipton,
Padmanabhan Pillai
Abstract:
This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally-expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than with standard SGD and between 1.02x and 1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low priority examples.
Submitted 1 October, 2019;
originally announced October 2019.
-
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
Authors:
Divyansh Kaushik,
Eduard Hovy,
Zachary C. Lipton
Abstract:
Despite alarm over the reliance of machine learning systems on so-called spurious patterns, the term lacks coherent meaning in standard statistical frameworks. However, the language of causality offers clarity: spurious associations are due to confounding (e.g., a common cause), but not direct or indirect causal effects. In this paper, we focus on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns. Given documents and their initial labels, we task humans with revising each document so that it (i) accords with a counterfactual target label; (ii) retains internal coherence; and (iii) avoids unnecessary changes. Interestingly, on sentiment analysis and natural language inference tasks, classifiers trained on original data fail on their counterfactually-revised counterparts and vice versa. Classifiers trained on combined datasets perform remarkably well, just shy of those specialized to either domain. While classifiers trained on either original or manipulated data alone are sensitive to spurious features (e.g., mentions of genre), models trained on the combined data are less sensitive to this signal. Both datasets are publicly available.
Submitted 14 February, 2020; v1 submitted 26 September, 2019;
originally announced September 2019.
-
Learning to Deceive with Attention-Based Explanations
Authors:
Danish Pruthi,
Mansi Gupta,
Bhuwan Dhingra,
Graham Neubig,
Zachary C. Lipton
Abstract:
Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question by demonstrating a simple method for training models to produce deceptive attention masks. Our method diminishes the total weight assigned to designated impermissible tokens, even when the models can be shown to nevertheless rely on these features to drive predictions. Across multiple models and tasks, our approach manipulates attention weights while paying surprisingly little cost in accuracy. Through a human study, we show that our manipulated attention-based explanations deceive people into thinking that predictions from a model biased against gender minorities do not rely on gender. Consequently, our results cast doubt on attention's reliability as a tool for auditing algorithms in the context of fairness and accountability.
Submitted 6 April, 2020; v1 submitted 17 September, 2019;
originally announced September 2019.
-
Entity Projection via Machine Translation for Cross-Lingual NER
Authors:
Alankar Jain,
Bhargavi Paranjape,
Zachary C. Lipton
Abstract:
Although over 100 languages are supported by strong off-the-shelf machine translation systems, only a subset of them possess large annotated corpora for named entity recognition. Motivated by this fact, we leverage machine translation to improve annotation-projection approaches to cross-lingual named entity recognition. We propose a system that improves over prior entity-projection methods by: (a) leveraging machine translation systems twice: first for translating sentences and subsequently for translating entities; (b) matching entities based on orthographic and phonetic similarity; and (c) identifying matches based on distributional statistics derived from the dataset. Our approach improves upon current state-of-the-art methods for cross-lingual named entity recognition on 5 diverse languages by an average of 4.1 points. Further, our method achieves state-of-the-art F_1 scores for Armenian, outperforming even a monolingual model trained on Armenian source data.
Submitted 13 September, 2019; v1 submitted 31 August, 2019;
originally announced September 2019.
-
AmazonQA: A Review-Based Question Answering Task
Authors:
Mansi Gupta,
Nitish Kulkarni,
Raghuveer Chanda,
Anirudha Rayasam,
Zachary C. Lipton
Abstract:
Every day, thousands of customers post questions on Amazon product pages. After some time, if they are fortunate, a knowledgeable customer might answer their question. Observing that many questions can be answered based upon the available product reviews, we propose the task of review-based QA. Given a corpus of reviews and a question, the QA system synthesizes an answer. To this end, we introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) and "reading comprehension" models for synthesizing an answer (given a question and review). Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, we collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews. A deployed system could first classify a question as answerable and then attempt to generate an answer. Notably, unlike many popular QA datasets, here, the questions, passages, and answers are all extracted from real human interactions. We evaluate numerous models for answer generation and propose strong baselines, demonstrating the challenging nature of this new task.
Submitted 20 August, 2019; v1 submitted 12 August, 2019;
originally announced August 2019.
-
Estimating brain age based on a healthy population with deep learning and structural MRI
Authors:
Xinyang Feng,
Zachary C. Lipton,
Jie Yang,
Scott A. Small,
Frank A. Provenzano
Abstract:
Numerous studies have established that estimated brain age, as derived from statistical models trained on healthy populations, constitutes a valuable biomarker that is predictive of cognitive decline and various neurological diseases. In this work, we curate a large-scale heterogeneous dataset (N = 10,158, age range 18-97) of structural brain MRIs in a healthy population from multiple publicly-available sources, upon which we train a deep learning model for brain age estimation. The availability of the large-scale dataset enables a more uniform age distribution across adult life-span for effective age estimation with no bias toward certain age groups. We demonstrate that the age estimation accuracy, evaluated with mean absolute error (MAE) and correlation coefficient (r), outperforms previously reported methods in both a hold-out test set reflective of the custom population (MAE = 4.06 years, r = 0.970) and an independent life-span evaluation dataset (MAE = 4.21 years, r = 0.960) on which a previous study was evaluated. We further demonstrate the utility of the estimated age in life-span aging analysis of cognitive functions. Furthermore, we conduct extensive ablation tests and employ feature-attribution techniques to analyze which regions contribute the most predictive value, demonstrating the prominence of the frontal lobe as well as pattern shift across life-span. In summary, we achieve superior age estimation performance, confirming the efficacy of deep learning and the added utility of training with data that are both more numerous and more uniformly distributed than in previous studies. We demonstrate the regional contribution to our brain age predictions through multiple routes and confirm the association of divergence between estimated and chronological brain age with neuropsychological measures.
Submitted 1 July, 2019;
originally announced July 2019.
-
Learning Causal State Representations of Partially Observable Environments
Authors:
Amy Zhang,
Zachary C. Lipton,
Luis Pineda,
Kamyar Azizzadenesheli,
Anima Anandkumar,
Laurent Itti,
Joelle Pineau,
Tommaso Furlanello
Abstract:
Intelligent agents can cope with sensory-rich environments by learning task-agnostic state abstractions. In this paper, we propose an algorithm to approximate causal states, which are the coarsest partition of the joint history of actions and observations in partially-observable Markov decision processes (POMDPs). Our method learns approximate causal state representations from RNNs trained to predict subsequent observations given the history. We demonstrate that these learned state representations are useful for learning policies efficiently in reinforcement learning problems with rich observation spaces. We connect causal states with causal feature sets from the causal inference literature, and also provide theoretical guarantees on the optimality of the continuous version of this causal state representation under Lipschitz assumptions by proving equivalence to bisimulation, a relation between behaviorally equivalent systems. This allows for lower bounds on the optimal value function of the learned representation, which are tight given certain assumptions. Finally, we empirically evaluate causal state representations using multiple partially observable tasks and compare with prior methods.
Submitted 8 February, 2021; v1 submitted 25 June, 2019;
originally announced June 2019.
-
Learning Robust Global Representations by Penalizing Local Predictive Power
Authors:
Haohan Wang,
Songwei Ge,
Eric P. Xing,
Zachary C. Lipton
Abstract:
Despite their renowned predictive power on i.i.d. data, convolutional neural networks are known to rely more on high-frequency patterns that humans deem superficial than on low-frequency patterns that agree better with intuitions about what constitutes category membership. This paper proposes a method for training robust convolutional networks by penalizing the predictive power of the local representations learned by earlier layers. Intuitively, our networks are forced to discard predictive signals such as color and texture that can be gleaned from local receptive fields and to rely instead on the global structures of the image. Across a battery of synthetic and benchmark domain adaptation tasks, our method confers improved generalization out of the domain. Also, to evaluate cross-domain transfer, we introduce ImageNet-Sketch, a new dataset consisting of sketch-like images, that matches the ImageNet classification validation set in categories and scale.
Submitted 4 November, 2019; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Efficient candidate screening under multiple tests and implications for fairness
Authors:
Lee Cohen,
Zachary C. Lipton,
Yishay Mansour
Abstract:
When recruiting job candidates, employers rarely observe their underlying skill level directly. Instead, they must administer a series of interviews and/or collate other noisy signals in order to estimate the worker's skill. Traditional economics papers address screening models where employers access worker skill via a single noisy signal. In this paper, we extend this theoretical analysis to a multi-test setting, considering both Bernoulli and Gaussian models. We analyze the optimal employer policy both when the employer sets a fixed number of tests per candidate and when the employer can set a dynamic policy, assigning further tests adaptively based on results from the previous tests. To start, we characterize the optimal policy when employees constitute a single group, demonstrating some interesting trade-offs. Subsequently, we address the multi-group setting, demonstrating that when the noise levels vary across groups, a fundamental impossibility emerges whereby we cannot administer the same number of tests, subject candidates to the same decision rule, and yet realize the same outcomes in both groups.
Submitted 27 May, 2019;
originally announced May 2019.
-
Combating Adversarial Misspellings with Robust Word Recognition
Authors:
Danish Pruthi,
Bhuwan Dhingra,
Zachary C. Lipton
Abstract:
To combat adversarial spelling mistakes, we propose placing a word recognition model in front of the downstream classifier. Our word recognition models build upon the RNN semi-character architecture, introducing several new backoff strategies for handling rare and unseen words. Trained to recognize words corrupted by random adds, drops, swaps, and keyboard mistakes, our method achieves 32% relative (and 3.3% absolute) error reduction over the vanilla semi-character model. Notably, our pipeline confers robustness on the downstream classifier, outperforming both adversarial training and off-the-shelf spell checkers. Against a BERT model fine-tuned for sentiment analysis, a single adversarially-chosen character attack lowers accuracy from 90.3% to 45.8%. Our defense restores accuracy to 75%. Surprisingly, better word recognition does not always entail greater robustness. Our analysis reveals that robustness also depends upon a quantity that we denote the sensitivity.
Submitted 29 August, 2019; v1 submitted 27 May, 2019;
originally announced May 2019.
-
Temporal-Clustering Invariance in Irregular Healthcare Time Series
Authors:
Mohammad Taha Bahadori,
Zachary Chase Lipton
Abstract:
Electronic records contain sequences of events, some of which take place all at once in a single visit, and others that are dispersed over multiple visits, each with a different timestamp. We postulate that fine temporal detail, e.g., whether a series of blood tests is completed at once or in rapid succession, should not alter predictions based on this data. Motivated by this intuition, we propose models for analyzing sequences of multivariate clinical time series data that are invariant to this temporal clustering. We propose an efficient data augmentation technique that exploits the postulated temporal-clustering invariance to regularize deep neural networks optimized for several clinical prediction tasks. We introduce two techniques to temporally coarsen (downsample) irregular time series: (i) grouping the data points based on regularly-spaced timestamps; and (ii) clustering them, yielding irregularly-paced timestamps. Moreover, we propose a MultiResolution Ensemble (MRE) model, improving predictive accuracy by ensembling predictions based on input sequences transformed by different coarsening operators. Our experiments show that MRE improves the mAP on the benchmark mortality prediction task from 51.53% to 53.92%.
Submitted 27 April, 2019;
originally announced April 2019.
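The first coarsening technique named in the abstract above (grouping data points on regularly-spaced timestamps) can be illustrated with a minimal sketch. The function name `coarsen_regular`, the `(timestamp, value)` event format, and the choice to average values within a bucket are all assumptions made for illustration; the abstract does not specify how grouped points are aggregated.

```python
from collections import defaultdict


def coarsen_regular(events, width):
    """Group (timestamp, value) events into buckets of `width` time units.

    Returns (bucket_start, mean_value) pairs sorted by time. Averaging
    within a bucket is an illustrative assumption, not the paper's rule.
    """
    buckets = defaultdict(list)
    for t, v in events:
        # Events whose timestamps fall in the same interval share a bucket.
        buckets[int(t // width)].append(v)
    return [(b * width, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]


# Two measurements close together in time collapse into a single point.
events = [(0.0, 1.0), (0.5, 3.0), (2.2, 4.0), (2.9, 6.0), (7.0, 5.0)]
coarse = coarsen_regular(events, width=2.0)
```

Applying the same function with several values of `width` would give the differently coarsened views that an ensemble could consume.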
-
Embryo staging with weakly-supervised region selection and dynamically-decoded predictions
Authors:
Tingfung Lau,
Nathan Ng,
Julian Gingold,
Nina Desai,
Julian McAuley,
Zachary C. Lipton
Abstract:
To optimize clinical outcomes, fertility clinics must strategically select which embryos to transfer. Common selection heuristics are formulas expressed in terms of the durations required to reach various developmental milestones, quantities historically annotated manually by experienced embryologists based on time-lapse EmbryoScope videos. We propose a new method for automatic embryo staging that exploits several sources of structure in this time-lapse data. First, noting that in each image the embryo occupies a small subregion, we jointly train a region proposal network with the downstream classifier to isolate the embryo. Notably, because we lack ground-truth bounding boxes, we weakly supervise the region proposal network, optimizing its parameters via reinforcement learning to improve the downstream classifier's loss. Moreover, noting that embryos reaching the blastocyst stage progress monotonically through earlier stages, we develop a dynamic-programming-based decoder that post-processes our predictions to select the most likely monotonic sequence of developmental stages. Our methods outperform vanilla residual networks and rival the best numbers in contemporary papers, as measured by both per-frame accuracy and transition prediction error, despite operating on smaller data than many.
Submitted 8 April, 2019;
originally announced April 2019.
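The dynamic-programming decoder mentioned in the abstract above selects the most likely monotonic sequence of stages from per-frame predictions. A minimal sketch of one such decoder follows. The function name `decode_monotonic` and the input layout (`log_probs[t][k]` as the per-frame log-probability of stage `k`) are assumptions for illustration, and the paper's decoder may differ in its details.

```python
import math


def decode_monotonic(log_probs):
    """Return the highest-scoring non-decreasing stage sequence.

    log_probs[t][k] is a hypothetical per-frame log-probability of stage k.
    """
    n_stages = len(log_probs[0])
    score = list(log_probs[0])  # best score of any valid path ending in stage k
    back = []                   # back[t-1][k]: best predecessor of stage k at frame t
    for frame in log_probs[1:]:
        new_score, ptr = [], []
        best_j = 0
        for k in range(n_stages):
            # Running argmax over predecessor stages 0..k enforces monotonicity.
            if score[k] > score[best_j]:
                best_j = k
            new_score.append(frame[k] + score[best_j])
            ptr.append(best_j)
        score = new_score
        back.append(ptr)
    # Backtrack from the best final stage.
    k = max(range(n_stages), key=lambda s: score[s])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]


# Per-frame argmax would give [0, 2, 1, 2], which is not monotonic.
probs = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]]
path = decode_monotonic([[math.log(p) for p in row] for row in probs])
```

Keeping a running argmax over predecessors means each frame costs time linear in the number of stages, rather than quadratic.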
-
Learning Robust Representations by Projecting Superficial Statistics Out
Authors:
Haohan Wang,
Zexue He,
Zachary C. Lipton,
Eric P. Xing
Abstract:
Despite impressive performance as evaluated on i.i.d. holdout data, deep neural networks depend heavily on superficial statistics of the training data and are liable to break under distribution shift. For example, subtle changes to the background or texture of an image can break a seemingly powerful classifier. Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training. This setting is challenging because the model may extract many distribution-specific (superficial) signals together with distribution-agnostic (semantic) signals. To overcome this challenge, we incorporate the gray-level co-occurrence matrix (GLCM) to extract patterns that our prior knowledge suggests are superficial: they are sensitive to the texture but unable to capture the gestalt of an image. Then we introduce two techniques for improving our networks' out-of-sample performance. The first method is built on the reverse gradient method that pushes our model to learn representations from which the GLCM representation is not predictable. The second method is built on the independence introduced by projecting the model's representation onto the subspace orthogonal to the GLCM representation's. We test our method on the battery of standard domain generalization data sets and, interestingly, achieve performance comparable to or better than that of other domain generalization methods that explicitly require samples from the target distribution for training.
Submitted 1 March, 2019;
originally announced March 2019.
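The second technique in the abstract above projects the model's representation onto the subspace orthogonal to the GLCM representation. The sketch below shows a generic orthogonal projection of that kind using NumPy. The names `project_out`, `F` (model representation, one row per example), and `G` (GLCM-based representation) are assumptions for illustration, and the random matrices stand in for real features.

```python
import numpy as np


def project_out(F, G):
    """Remove from F (n x d) its component in the column space of G (n x k)."""
    # Least-squares coefficients B minimize ||G @ B - F||, so G @ B is the
    # projection of F onto span(G); subtracting it leaves the orthogonal part.
    B, *_ = np.linalg.lstsq(G, F, rcond=None)
    return F - G @ B


rng = np.random.default_rng(0)
G = rng.normal(size=(8, 2))  # stand-in for the superficial (GLCM) representation
F = rng.normal(size=(8, 3))  # stand-in for the model's representation
F_perp = project_out(F, G)
```

After the projection, every column of `F_perp` is orthogonal to every column of `G`, so no linear function of `G` can explain any part of `F_perp`.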
-
Domain Adaptation with Asymmetrically-Relaxed Distribution Alignment
Authors:
Yifan Wu,
Ezra Winston,
Divyansh Kaushik,
Zachary Lipton
Abstract:
Domain adaptation addresses the common problem when the target distribution generating our test data drifts from the source (training) distribution. While domain adaptation is impossible absent assumptions, strict conditions, e.g., covariate or label shift, enable principled algorithms. Recently-proposed domain-adversarial approaches consist of aligning source and target encodings, often motivating this approach as minimizing two (of three) terms in a theoretical bound on target error. Unfortunately, this minimization can cause arbitrary increases in the third term; for example, these approaches can break down under shifting label distributions. We propose asymmetrically-relaxed distribution alignment, a new approach that overcomes some limitations of standard domain-adversarial algorithms. Moreover, we characterize precise assumptions under which our algorithm is theoretically principled and demonstrate empirical benefits on both synthetic and real datasets.
Submitted 11 March, 2019; v1 submitted 5 March, 2019;
originally announced March 2019.
-
What is the Effect of Importance Weighting in Deep Learning?
Authors:
Jonathon Byrd,
Zachary C. Lipton
Abstract:
Importance-weighted risk minimization is a key ingredient in many machine learning algorithms for causal inference, domain adaptation, class imbalance, and off-policy reinforcement learning. While the effect of importance weighting is well-characterized for low-capacity misspecified models, little is known about how it impacts over-parameterized, deep neural networks. This work is inspired by recent theoretical results showing that on (linearly) separable data, deep linear networks optimized by SGD learn weight-agnostic solutions, prompting us to ask, for realistic deep networks, for which many practical datasets are separable, what is the effect of importance weighting? We present the surprising finding that while importance weighting impacts models early in training, its effect diminishes over successive epochs. Moreover, while L2 regularization and batch normalization (but not dropout) restore some of the impact of importance weighting, they express the effect via (seemingly) the wrong abstraction: why should practitioners tweak the L2 regularization, and by how much, to produce the correct weighting effect? Our experiments confirm these findings across a range of architectures and datasets.
Submitted 13 June, 2019; v1 submitted 8 December, 2018;
originally announced December 2018.
-
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Authors:
Stephan Rabanser,
Stephan Günnemann,
Zachary C. Lipton
Abstract:
We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. We focus on several datasets and various perturbations to both covariates and label distributions with varying magnitudes and fractions of data affected. Interestingly, we show that across the dataset shifts that we explore, a two-sample-testing-based approach, using pre-trained classifiers for dimensionality reduction, performs best. Moreover, we demonstrate that domain-discriminating approaches tend to be helpful for characterizing shifts qualitatively and determining if they are harmful.
Submitted 28 October, 2019; v1 submitted 29 October, 2018;
originally announced October 2018.
-
Deep Bayesian Active Learning for Natural Language Processing: Results of a Large-Scale Empirical Study
Authors:
Aditya Siddhant,
Zachary C. Lipton
Abstract:
Several recent papers investigate Active Learning (AL) for mitigating the data dependence of deep learning for natural language processing. However, the applicability of AL to real-world problems remains an open question. While in supervised learning, practitioners can try many different methods, evaluating each against a validation set before selecting a model, AL affords no such luxury. Over the course of one AL run, an agent annotates its dataset, exhausting its labeling budget. Thus, given a new task, an active learner has no opportunity to compare models and acquisition functions. This paper provides a large-scale empirical study of deep active learning, addressing multiple tasks and, for each, multiple datasets, multiple models, and a full suite of acquisition functions. We find that across all settings, Bayesian active learning by disagreement, using uncertainty estimates provided either by Dropout or Bayes-by-Backprop, significantly improves over i.i.d. baselines and usually outperforms classic uncertainty sampling.
Submitted 24 September, 2018; v1 submitted 16 August, 2018;
originally announced August 2018.
-
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Authors:
Divyansh Kaushik,
Zachary C. Lipton
Abstract:
Many recent papers address reading comprehension, where examples consist of (question, passage, answer) tuples. Presumably, a model must combine information from both questions and passages to predict corresponding answers. However, despite intense interest in the topic, with hundreds of published papers vying for leaderboard dominance, basic questions about the difficulty of many popular benchmarks remain unanswered. In this paper, we establish sensible baselines for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding that question- and passage-only models often perform surprisingly well. On $14$ out of $20$ bAbI tasks, passage-only models achieve greater than $50\%$ accuracy, sometimes matching the full model. Interestingly, while CBT provides $20$-sentence stories, only the last sentence is needed for comparably accurate prediction. By comparison, SQuAD and CNN appear better-constructed.
Submitted 21 August, 2018; v1 submitted 14 August, 2018;
originally announced August 2018.
-
Learning Noise-Invariant Representations for Robust Speech Recognition
Authors:
Davis Liang,
Zhiheng Huang,
Zachary C. Lipton
Abstract:
Despite rapid advances in speech recognition, current models remain brittle to superficial perturbations to their inputs. Small amounts of noise can destroy the performance of an otherwise state-of-the-art model. To harden models against background noise, practitioners often perform data augmentation, adding artificially-noised examples to the training set, carrying over the original label. In this paper, we hypothesize that a clean example and its superficially perturbed counterparts shouldn't merely map to the same class; they should map to the same representation. We propose invariant-representation-learning (IRL): At each training iteration, for each training example, we sample a noisy counterpart. We then apply a penalty term to coerce matched representations at each layer (above some chosen layer). Our key results, demonstrated on the Librispeech dataset, are the following: (i) IRL significantly reduces character error rates (CER) on both 'clean' (3.3% vs 6.5%) and 'other' (11.0% vs 18.1%) test sets; (ii) on several out-of-domain noise settings (different from those seen during training), IRL's benefits are even more pronounced. Careful ablations confirm that our results are not simply due to shrinking activations at the chosen layers.
Submitted 17 July, 2018;
originally announced July 2018.
-
Practical Obstacles to Deploying Active Learning
Authors:
David Lowell,
Zachary C. Lipton,
Byron C. Wallace
Abstract:
Active learning (AL) is a widely-used training strategy for maximizing predictive performance subject to a fixed annotation budget. In AL one iteratively selects training examples for annotation, often those for which the current model is most uncertain (by some measure). The hope is that active sampling leads to better performance than would be achieved under independent and identically distributed (i.i.d.) random samples. While AL has shown promise in retrospective evaluations, these studies often ignore practical obstacles to its use. In this paper we show that while AL may provide benefits when used with specific models and for particular domains, the benefits of current approaches do not generalize reliably across models and tasks. This is problematic because in practice one does not have the opportunity to explore and compare alternative AL strategies. Moreover, AL couples the training dataset with the model used to guide its acquisition. We find that subsequently training a successor model with an actively-acquired dataset does not consistently outperform training on i.i.d. sampled data. Our findings raise the question of whether the downsides inherent to AL are worth the modest and inconsistent performance gains it tends to afford.
Submitted 1 November, 2019; v1 submitted 12 July, 2018;
originally announced July 2018.
-
Troubling Trends in Machine Learning Scholarship
Authors:
Zachary C. Lipton,
Jacob Steinhardt
Abstract:
Collectively, machine learning (ML) researchers are engaged in the creation and dissemination of knowledge about data-driven algorithms. In a given paper, researchers might aspire to any subset of the following goals, among others: to theoretically characterize what is learnable, to obtain understanding through empirically rigorous experiments, or to build a working system that has high predictive accuracy. While determining which knowledge warrants inquiry may be subjective, once the topic is fixed, papers are most valuable to the community when they act in service of the reader, creating foundational knowledge and communicating as clearly as possible.
Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship: (i) failure to distinguish between explanation and speculation; (ii) failure to identify the sources of empirical gains, e.g., emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning; (iii) mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g., by confusing technical and non-technical concepts; and (iv) misuse of language, e.g., by choosing terms of art with colloquial connotations or by overloading established technical terms.
While the causes behind these patterns are uncertain, possibilities include the rapid expansion of the community, the consequent thinness of the reviewer pool, and the often-misaligned incentives between scholarship and short-term measures of success (e.g., bibliometrics, attention, and entrepreneurial opportunity). While each pattern offers a corresponding remedy (don't do it), we also discuss some speculative suggestions for how the community might combat these trends.
Submitted 26 July, 2018; v1 submitted 9 July, 2018;
originally announced July 2018.
-
Surprising Negative Results for Generative Adversarial Tree Search
Authors:
Kamyar Azizzadenesheli,
Brandon Yang,
Weitang Liu,
Zachary C Lipton,
Animashree Anandkumar
Abstract:
While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn an environment model. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample-efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS), a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. Because MCTS on the learned model is computationally expensive, GATS, similar to AlphaGo, follows depth-limited MCTS. GATS employs a deep Q network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretically analyze GATS vis-a-vis the bias-variance trade-off and show that GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy better sample complexity and faster convergence to better policies, surprisingly, GATS fails to outperform DQN. We provide a study in which we show why depth-limited MCTS fails to perform desirably.
Submitted 4 September, 2019; v1 submitted 14 June, 2018;
originally announced June 2018.
-
Born Again Neural Networks
Authors:
Tommaso Furlanello,
Zachary C. Lipton,
Michael Tschannen,
Laurent Itti,
Anima Anandkumar
Abstract:
Knowledge Distillation (KD) consists of transferring "knowledge" from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness, without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
Submitted 29 June, 2018; v1 submitted 12 May, 2018;
originally announced May 2018.
-
Correction by Projection: Denoising Images with Generative Adversarial Networks
Authors:
Subarna Tripathi,
Zachary C. Lipton,
Truong Q. Nguyen
Abstract:
Generative adversarial networks (GANs) transform low-dimensional latent vectors into visually plausible images. If the real dataset contains only clean images, then ostensibly, the manifold learned by the GAN should contain only clean images. In this paper, we propose to denoise corrupted images by finding the nearest point on the GAN manifold, recovering latent vectors by minimizing distances in image space. We first demonstrate that given a corrupted version of an image that truly lies on the GAN manifold, we can approximately recover the latent vector and denoise the image, obtaining significantly higher quality compared with BM3D. Next, we demonstrate that latent vectors recovered from noisy images exhibit a consistent bias. By subtracting this bias before projecting back to image space, we improve denoising results even further. Finally, even for unseen images, our method performs better at denoising than BM3D. Notably, the basic version of our method (without bias correction) requires no prior knowledge of the noise variance. To achieve the highest possible denoising quality, the best performing signal processing based methods, such as BM3D, require an estimate of the blur kernel.
Submitted 12 March, 2018;
originally announced March 2018.
-
Stochastic Activation Pruning for Robust Adversarial Defense
Authors:
Guneet S. Dhillon,
Kamyar Azizzadenesheli,
Zachary C. Lipton,
Jeremy Bernstein,
Jean Kossaifi,
Aran Khanna,
Anima Anandkumar
Abstract:
Neural networks are known to be vulnerable to adversarial examples. Carefully chosen perturbations to real images, while imperceptible to humans, induce misclassification and threaten the reliability of deep learning systems in the wild. To guard against adversarial examples, we take inspiration from game theory and cast the problem as a minimax zero-sum game between the adversary and the model. In general, for such games, the optimal strategy for both players requires a stochastic policy, also known as a mixed strategy. In this light, we propose Stochastic Activation Pruning (SAP), a mixed strategy for adversarial defense. SAP prunes a random subset of activations (preferentially pruning those with smaller magnitude) and scales up the survivors to compensate. We can apply SAP to pretrained networks, including adversarially trained models, without fine-tuning, providing robustness against adversarial examples. Experiments demonstrate that SAP confers robustness against attacks, increasing accuracy and preserving calibration.
Submitted 4 March, 2018;
originally announced March 2018.
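The abstract above describes SAP as pruning a random subset of activations, preferentially those with smaller magnitude, and scaling up the survivors. One way to realize that description is sketched below. The specific scheme here (drawing indices with replacement with probability proportional to magnitude, then dividing each survivor by its probability of being kept) is an assumption chosen for illustration, as are the names `stochastic_activation_pruning` and `n_draws`; the abstract itself does not fix these details.

```python
import numpy as np


def stochastic_activation_pruning(h, n_draws, rng):
    """Randomly prune a 1-D activation vector, favoring small magnitudes.

    Assumes at least one nonzero activation. The sampling and rescaling
    rules are illustrative assumptions, not a confirmed specification.
    """
    h = np.asarray(h, dtype=float)
    p = np.abs(h) / np.abs(h).sum()  # larger magnitude -> more likely to be drawn
    drawn = rng.choice(h.size, size=n_draws, replace=True, p=p)
    keep = np.zeros(h.size, dtype=bool)
    keep[drawn] = True               # anything never drawn is pruned
    keep_prob = 1.0 - (1.0 - p) ** n_draws
    out = np.zeros_like(h)
    # Inverse-probability scaling keeps each activation's expectation unchanged.
    out[keep] = h[keep] / keep_prob[keep]
    return out


rng = np.random.default_rng(0)
h = np.array([0.1, -2.0, 0.0, 3.0, 0.5])
pruned = stochastic_activation_pruning(h, n_draws=3, rng=rng)
```

Because the mask is resampled on every forward pass, the network's output is stochastic, which is the sense in which the defense is a mixed strategy.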
-
Active Learning with Partial Feedback
Authors:
Peiyun Hu,
Zachary C. Lipton,
Anima Anandkumar,
Deva Ramanan
Abstract:
While many active learning papers assume that the learner can simply ask for a label and receive it, real annotation often presents a mismatch between the form of a label (say, one among many classes), and the form of an annotation (typically yes/no binary feedback). To annotate example corpora for multiclass classification, we might need to ask multiple yes/no questions, exploiting a label hierarchy if one is available. To address this more realistic setting, we propose active learning with partial feedback (ALPF), where the learner must actively choose both which example to label and which binary question to ask. At each step, the learner selects an example, asking if it belongs to a chosen (possibly composite) class. Each answer eliminates some classes, leaving the learner with a partial label. The learner may then either ask more questions about the same example (until an exact label is uncovered) or move on immediately, leaving the first example partially labeled. Active learning with partial labels requires (i) a sampling strategy to choose (example, class) pairs, and (ii) learning from partial labels between rounds. Experiments on Tiny ImageNet demonstrate that our most effective method improves top-1 classification accuracy by 26% (relative) compared to i.i.d. baselines and standard active learners given 30% of the annotation budget that would be required (naively) to annotate the dataset. Moreover, ALPF-learners fully annotate Tiny ImageNet at 42% lower cost. Surprisingly, we observe that accounting for per-example annotation costs can alter the conventional wisdom that active learners should solicit labels for hard examples.
Submitted 8 July, 2019; v1 submitted 21 February, 2018;
originally announced February 2018.
-
Detecting and Correcting for Label Shift with Black Box Predictors
Authors:
Zachary C. Lipton,
Yu-Xiang Wang,
Alex Smola
Abstract:
Faced with distribution shift between training and test set, we wish to detect and quantify the shift, and to correct our classifiers without test set labels. Motivated by medical diagnosis, where diseases (targets) cause symptoms (observations), we focus on label shift, where the label marginal $p(y)$ changes but the conditional $p(x| y)$ does not. We propose Black Box Shift Estimation (BBSE) to estimate the test distribution $p(y)$. BBSE exploits arbitrary black box predictors to reduce dimensionality prior to shift correction. While better predictors give tighter estimates, BBSE works even when predictors are biased, inaccurate, or uncalibrated, so long as their confusion matrices are invertible. We prove BBSE's consistency, bound its error, and introduce a statistical test that uses BBSE to detect shift. We also leverage BBSE to correct classifiers. Experiments demonstrate accurate estimates and improved prediction, even on high-dimensional datasets of natural images.
Submitted 26 July, 2018; v1 submitted 12 February, 2018;
originally announced February 2018.
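The abstract above says BBSE works so long as the predictor's confusion matrix is invertible. A minimal sketch of the moment-matching idea behind that condition: under label shift, the target distribution of predictions equals the source joint confusion matrix times the per-class importance weights, so the weights follow from solving a linear system. The function name `bbse_weights`, the array layout, and the toy counts below are assumptions for illustration.

```python
import numpy as np


def bbse_weights(y_src, yhat_src, yhat_tgt, n_classes):
    """Estimate per-class weights w[y] = p_target(y) / p_source(y)."""
    # C[i, j] estimates the source joint probability p(yhat = i, y = j).
    C = np.zeros((n_classes, n_classes))
    for yh, y in zip(yhat_src, y_src):
        C[yh, y] += 1.0
    C /= len(y_src)
    # mu[i] estimates the target probability p(yhat = i); no target labels needed.
    mu = np.bincount(yhat_tgt, minlength=n_classes) / len(yhat_tgt)
    # Label shift implies mu = C @ w, so w requires C to be invertible.
    return np.linalg.solve(C, mu)


# Toy data: an imperfect classifier, balanced source, target skewed to class 1.
y_src = np.array([0] * 50 + [1] * 50)
yhat_src = np.array([0] * 40 + [1] * 10 + [1] * 45 + [0] * 5)
yhat_tgt = np.array([0] * 24 + [1] * 76)

w = bbse_weights(y_src, yhat_src, yhat_tgt, n_classes=2)
p_src = np.bincount(y_src, minlength=2) / len(y_src)
p_tgt_est = w * p_src  # estimated target label marginal
```

In this toy example the classifier is wrong 15% of the time on the source data, yet the estimated target marginal comes out as 20% class 0 and 80% class 1, which is the shift the counts were constructed from.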
-
Learning From Noisy Singly-labeled Data
Authors:
Ashish Khetan,
Zachary C. Lipton,
Anima Anandkumar
Abstract:
Supervised learning depends on annotated examples, which are taken to be the ground truth. But these labels often come from noisy crowdsourcing platforms, like Amazon Mechanical Turk. Practitioners typically collect multiple labels per example and aggregate the results to mitigate noise (the classic crowdsourcing problem). Given a fixed annotation budget and unlimited unlabeled data, redundant annotation comes at the expense of fewer labeled examples. This raises two fundamental questions: (1) How can we best learn from noisy workers? (2) How should we allocate our labeling budget to maximize the performance of a classifier? We propose a new algorithm for jointly modeling labels and worker quality from noisy crowd-sourced data. The alternating minimization proceeds in rounds, estimating worker quality from disagreement with the current model and then updating the model by optimizing a loss function that accounts for the current estimate of worker quality. Unlike previous approaches, even with only one annotation per example, our algorithm can estimate worker quality. We establish a generalization error bound for models learned with our algorithm and establish theoretically that it's better to label many examples once (vs. fewer examples multiple times) when worker quality is above a threshold. Experiments conducted on both ImageNet (with simulated noisy workers) and MS-COCO (using the real crowdsourced labels) confirm our algorithm's benefits.
Submitted 20 May, 2018; v1 submitted 12 December, 2017;
originally announced December 2017.
-
The Doctor Just Won't Accept That!
Authors:
Zachary C. Lipton
Abstract:
Calls to arms to build interpretable models express a well-founded discomfort with machine learning. Should a software agent that does not even know what a loan is decide who qualifies for one? Indeed, we ought to be cautious about injecting machine learning (or anything else, for that matter) into applications where there may be a significant risk of causing social harm. However, claims that stakeholders "just won't accept that!" do not provide a sufficient foundation for a proposed field of study. For the field of interpretable machine learning to advance, we must ask the following questions: What precisely won't various stakeholders accept? What do they want? Are these desiderata reasonable? Are they feasible? In order to answer these questions, we'll have to give real-world problems and their respective stakeholders greater consideration.
Submitted 24 November, 2017; v1 submitted 19 November, 2017;
originally announced November 2017.
-
Does mitigating ML's impact disparity require treatment disparity?
Authors:
Zachary C. Lipton,
Alexandra Chouldechova,
Julian McAuley
Abstract:
Following related work in law and policy, two notions of disparity have come to shape the study of fairness in algorithmic decision-making. Algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently; algorithms exhibit impact disparity when outcomes differ across subgroups, even if the correlation arises unintentionally. Naturally, we can achieve impact parity through purposeful treatment disparity. In one thread of technical work, papers aim to reconcile the two forms of parity by proposing disparate learning processes (DLPs). Here, the learning algorithm can see group membership during training but produces a classifier that is group-blind at test time. In this paper, we show theoretically that: (i) When other features correlate with group membership, DLPs will (indirectly) implement treatment disparity, undermining the policy desiderata they are designed to address; (ii) When group membership is partly revealed by other features, DLPs induce within-class discrimination; and (iii) In general, DLPs provide a suboptimal trade-off between accuracy and impact parity. Based on our technical analysis, we argue that transparent treatment disparity is preferable to occluded methods for achieving impact parity. Experimental results on several real-world datasets highlight the practical consequences of applying DLPs vs. per-group thresholds.
Submitted 11 January, 2019; v1 submitted 19 November, 2017;
originally announced November 2017.
-
BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems
Authors:
Zachary Lipton,
Xiujun Li,
Jianfeng Gao,
Lihong Li,
Faisal Ahmed,
Li Deng
Abstract:
We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems. Our agents explore via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network. Our algorithm learns much faster than common exploration strategies such as ε-greedy, Boltzmann, bootstrapping, and intrinsic-reward-based ones. Additionally, we show that spiking the replay buffer with experiences from just a few successful episodes can make Q-learning feasible when it might otherwise fail.
Submitted 19 November, 2017; v1 submitted 15 November, 2017;
originally announced November 2017.
-
Improving Factor-Based Quantitative Investing by Forecasting Company Fundamentals
Authors:
John Alberg,
Zachary C. Lipton
Abstract:
On a periodic basis, publicly traded companies are required to report fundamentals: financial data such as revenue, operating income, and debt, among others. These data points provide some insight into the financial health of a company. Academic research has identified some factors, i.e. computed features of the reported data, that are known through retrospective analysis to outperform the market average. Two popular factors are the book value normalized by market capitalization (book-to-market) and the operating income normalized by the enterprise value (EBIT/EV). In this paper, we first show through simulation that if we could (clairvoyantly) select stocks using factors calculated on future fundamentals (via oracle), then our portfolios would far outperform a standard factor approach. Motivated by this analysis, we train deep neural networks to forecast future fundamentals based on a trailing 5-year window. Quantitative analysis demonstrates a significant improvement in MSE over a naive strategy. Moreover, in retrospective analysis using an industry-grade stock portfolio simulator (backtester), we show an improvement in compounded annual return to 17.1% (MLP) vs 14.4% for a standard factor model.
Submitted 25 April, 2018; v1 submitted 13 November, 2017;
originally announced November 2017.
-
Tensor Regression Networks
Authors:
Jean Kossaifi,
Zachary C. Lipton,
Arinbjorn Kolbeinsson,
Aran Khanna,
Tommaso Furlanello,
Anima Anandkumar
Abstract:
Convolutional neural networks typically consist of many convolutional layers followed by one or more fully connected layers. While convolutional layers map between high-order activation tensors, the fully connected layers operate on flattened activation vectors. Despite empirical success, this approach has notable drawbacks. Flattening followed by fully connected layers discards multilinear structure in the activations and requires many parameters. We address these problems by incorporating tensor algebraic operations that preserve multilinear structure at every layer. First, we introduce Tensor Contraction Layers (TCLs) that reduce the dimensionality of their input while preserving their multilinear structure using tensor contraction. Next, we introduce Tensor Regression Layers (TRLs), which express outputs through a low-rank multilinear mapping from a high-order activation tensor to an output tensor of arbitrary order. We learn the contraction and regression factors end-to-end, and produce accurate nets with fewer parameters. Additionally, our layers regularize networks by imposing low-rank constraints on the activations (TCL) and regression weights (TRL). Experiments on ImageNet show that, applied to VGG and ResNet architectures, TCLs and TRLs reduce the number of parameters compared to fully connected layers by more than 65% while maintaining or increasing accuracy. In addition to the space savings, our approach's ability to leverage topological structure can be crucial for structured data such as MRI. In particular, we demonstrate significant performance improvements over comparable architectures on three tasks associated with the UK Biobank dataset.
Submitted 20 July, 2020; v1 submitted 26 July, 2017;
originally announced July 2017.
-
Deep Active Learning for Named Entity Recognition
Authors:
Yanyao Shen,
Hyokun Yun,
Zachary C. Lipton,
Yakov Kronrod,
Animashree Anandkumar
Abstract:
Deep learning has yielded state-of-the-art performance on many natural language processing tasks including named entity recognition (NER). However, this typically requires large amounts of labeled data. In this work, we demonstrate that the amount of labeled training data can be drastically reduced when deep learning is combined with active learning. While active learning is sample-efficient, it can be computationally expensive since it requires iterative retraining. To speed this up, we introduce a lightweight architecture for NER, viz., the CNN-CNN-LSTM model consisting of convolutional character and word encoders and a long short-term memory (LSTM) tag decoder. The model achieves nearly state-of-the-art performance on standard datasets for the task while being computationally much more efficient than the best-performing models. We carry out incremental active learning during the training process and are able to nearly match state-of-the-art performance with just 25% of the original training data.
Submitted 3 February, 2018; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Tensor Contraction Layers for Parsimonious Deep Nets
Authors:
Jean Kossaifi,
Aran Khanna,
Zachary C. Lipton,
Tommaso Furlanello,
Anima Anandkumar
Abstract:
Tensors offer a natural representation for many kinds of data frequently encountered in machine learning. Images, for example, are naturally represented as third-order tensors, where the modes correspond to height, width, and channels. Tensor methods are noted for their ability to discover multi-dimensional dependencies, and tensor decompositions, in particular, have been used to produce compact low-rank approximations of data. In this paper, we explore the use of tensor contractions as neural network layers and investigate several ways to apply them to activation tensors. Specifically, we propose the Tensor Contraction Layer (TCL), the first attempt to incorporate tensor contractions as end-to-end trainable neural network layers. Applied to existing networks, TCLs reduce the dimensionality of the activation tensors and thus the number of model parameters. We evaluate the TCL on the task of image recognition, augmenting two popular networks (AlexNet, VGG). The resulting models are trainable end-to-end. Applying the TCL to the task of image recognition, using the CIFAR100 and ImageNet datasets, we evaluate the effect of parameter reduction via tensor contraction on performance. We demonstrate significant model compression without significant impact on the accuracy and, in some cases, improved performance.
Submitted 1 June, 2017;
originally announced June 2017.
-
Semantically Decomposing the Latent Spaces of Generative Adversarial Networks
Authors:
Chris Donahue,
Zachary C. Lipton,
Akshay Balsubramani,
Julian McAuley
Abstract:
We propose a new algorithm for training generative adversarial networks that jointly learns latent codes for both identities (e.g. individual humans) and observations (e.g. specific photographs). By fixing the identity portion of the latent codes, we can generate diverse images of the same subject, and by fixing the observation portion, we can traverse the manifold of subjects while maintaining contingent aspects such as lighting and pose. Our algorithm features a pairwise training scheme in which each sample from the generator consists of two images with a common identity code. Corresponding samples from the real dataset consist of two distinct photographs of the same subject. In order to fool the discriminator, the generator must produce pairs that are photorealistic, distinct, and appear to depict the same individual. We augment both the DCGAN and BEGAN approaches with Siamese discriminators to facilitate pairwise training. Experiments with human judges and an off-the-shelf face verification system demonstrate our algorithm's ability to generate convincing, identity-matched photographs.
Submitted 22 February, 2018; v1 submitted 22 May, 2017;
originally announced May 2017.
-
Dance Dance Convolution
Authors:
Chris Donahue,
Zachary C. Lipton,
Julian McAuley
Abstract:
Dance Dance Revolution (DDR) is a popular rhythm-based video game. Players perform steps on a dance platform in synchronization with music as directed by on-screen step charts. While many step charts are available in standardized packs, players may grow tired of existing charts, or wish to dance to a song for which no chart exists. We introduce the task of learning to choreograph. Given a raw audio track, the goal is to produce a new step chart. This task decomposes naturally into two subtasks: deciding when to place steps and deciding which steps to select. For the step placement task, we combine recurrent and convolutional neural networks to ingest spectrograms of low-level audio features to predict steps, conditioned on chart difficulty. For step selection, we present a conditional LSTM generative model that substantially outperforms n-gram and fixed-window approaches.
Submitted 20 June, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Predicting Surgery Duration with Neural Heteroscedastic Regression
Authors:
Nathan Ng,
Rodney A Gabriel,
Julian McAuley,
Charles Elkan,
Zachary C Lipton
Abstract:
Scheduling surgeries is a challenging task due to the fundamental uncertainty of the clinical environment, as well as the risks and costs associated with under- and over-booking. We investigate neural regression algorithms to estimate the parameters of surgery case durations, focusing on the issue of heteroscedasticity. We seek to simultaneously estimate the duration of each surgery, as well as a surgery-specific notion of our uncertainty about its duration. Estimating this uncertainty can lead to more nuanced and effective scheduling strategies, as we are able to schedule surgeries more efficiently while allowing an informed and case-specific margin of error. Using surgery records from a large United States health system, we demonstrate potential improvements on the order of 20% (in terms of minutes overbooked) compared to current scheduling techniques. Moreover, we demonstrate that surgery durations are indeed heteroscedastic. We show that models that estimate case-specific uncertainty better fit the data (log likelihood). Additionally, we show that the heteroscedastic predictions can more optimally trade off between over- and under-booking minutes, especially when idle minutes and scheduling collisions confer disparate costs.
Submitted 12 July, 2017; v1 submitted 17 February, 2017;
originally announced February 2017.
-
Precise Recovery of Latent Vectors from Generative Adversarial Networks
Authors:
Zachary C. Lipton,
Subarna Tripathi
Abstract:
Generative adversarial networks (GANs) transform latent vectors into visually plausible images. It is generally thought that the original GAN formulation gives no out-of-the-box method to reverse the mapping, projecting images back into latent space. We introduce a simple, gradient-based technique called stochastic clipping. In experiments, for images generated by the GAN, we precisely recover their latent vector pre-images 100% of the time. Additional experiments demonstrate that this method is robust to noise. Finally, we show that even for unseen images, our method appears to recover unique encodings.
Submitted 16 February, 2017; v1 submitted 15 February, 2017;
originally announced February 2017.
-
A User Simulator for Task-Completion Dialogues
Authors:
Xiujun Li,
Zachary C. Lipton,
Bhuwan Dhingra,
Lihong Li,
Jianfeng Gao,
Yun-Nung Chen
Abstract:
Despite widespread interest in reinforcement learning for task-oriented dialogue systems, several obstacles can frustrate research and development progress. First, reinforcement learners typically require interaction with the environment, so conventional dialogue corpora cannot be used directly. Second, each task presents specific challenges, requiring a separate corpus of task-specific annotated data. Third, collecting and annotating human-machine or human-human conversations for task-oriented dialogues requires extensive domain knowledge. Because building an appropriate dataset can be both financially costly and time-consuming, one popular approach is to build a user simulator based upon a corpus of example dialogues. Then, one can train reinforcement learning agents in an online fashion as they interact with the simulator. Dialogue agents trained on these simulators can serve as an effective starting point. Once agents master the simulator, they may be deployed in a real environment to interact with humans, and continue to be trained online. To ease empirical algorithmic comparisons in dialogues, this paper introduces a new, publicly available simulation framework, where our simulator, designed for the movie-booking domain, leverages both rules and collected data. The simulator supports two tasks: movie ticket booking and movie seeking. Finally, we demonstrate several agents and detail the procedure to add and test your own agent in the proposed framework.
Submitted 13 November, 2017; v1 submitted 16 December, 2016;
originally announced December 2016.
-
Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear
Authors:
Zachary C. Lipton,
Kamyar Azizzadenesheli,
Abhishek Kumar,
Lihong Li,
Jianfeng Gao,
Li Deng
Abstract:
Many practical environments contain catastrophic states that an optimal agent would visit infrequently or never. Even on toy problems, Deep Reinforcement Learning (DRL) agents tend to periodically revisit these states upon forgetting their existence under a new policy. We introduce intrinsic fear (IF), a learned reward shaping that guards DRL agents against periodic catastrophes. IF agents possess a fear model trained to predict the probability of imminent catastrophe. This score is then used to penalize the Q-learning objective. Our theoretical analysis bounds the reduction in average return due to learning on the perturbed objective. We also prove robustness to classification errors. As a bonus, IF models tend to learn faster, owing to reward shaping. Experiments demonstrate that intrinsic-fear DQNs solve otherwise pathological environments and improve on several Atari games.
Submitted 13 March, 2018; v1 submitted 3 November, 2016;
originally announced November 2016.
-
BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems
Authors:
Zachary C. Lipton,
Xiujun Li,
Jianfeng Gao,
Lihong Li,
Faisal Ahmed,
Li Deng
Abstract:
We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems. Our agents explore via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network. Our algorithm learns much faster than common exploration strategies such as ε-greedy, Boltzmann, bootstrapping, and intrinsic-reward-based ones. Additionally, we show that spiking the replay buffer with experiences from just a few successful episodes can make Q-learning feasible when it might otherwise fail.
Submitted 23 November, 2017; v1 submitted 17 August, 2016;
originally announced August 2016.
-
Context Matters: Refining Object Detection in Video with Recurrent Neural Networks
Authors:
Subarna Tripathi,
Zachary C. Lipton,
Serge Belongie,
Truong Nguyen
Abstract:
Given the vast amounts of video available online, and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, video datasets tend to have sparsely annotated frames. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, that is, a domain-adapted convolutional neural network for object detection. The pseudo-labeler is first trained individually on the subset of labeled frames and then applied to all frames. Then we train a recurrent neural network that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak supervision on context frames, and regularization via a smoothness penalty. Our approach achieves mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baselines for the Youtube-Video Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.
Submitted 18 July, 2016; v1 submitted 15 July, 2016;
originally announced July 2016.