-
Gradual Domain Adaptation via Normalizing Flows
Authors:
Shogo Sagawa,
Hideitsu Hino
Abstract:
Standard domain adaptation methods do not work well when a large gap exists between the source and target domains. Gradual domain adaptation is one of the approaches used to address the problem. It involves leveraging the intermediate domain, which gradually shifts from the source domain to the target domain. In previous work, it is assumed that the number of intermediate domains is large and the…
▽ More
Standard domain adaptation methods do not work well when a large gap exists between the source and target domains. Gradual domain adaptation is one of the approaches used to address the problem. It involves leveraging the intermediate domain, which gradually shifts from the source domain to the target domain. In previous work, it is assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, the gradual domain adaptation algorithm, involving self-training with unlabeled datasets, is applicable. In practice, however, gradual self-training will fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose the use of normalizing flows to deal with this problem while maintaining the framework of unsupervised domain adaptation. The proposed method learns a transformation from the distribution of the target domain to the Gaussian mixture distribution via the source domain. We evaluate our proposed method by experiments using real-world datasets and confirm that it mitigates the above-explained problem and improves the classification performance.
△ Less
Submitted 23 January, 2024; v1 submitted 23 June, 2022;
originally announced June 2022.
-
Cost-effective Framework for Gradual Domain Adaptation with Multifidelity
Authors:
Shogo Sagawa,
Hideitsu Hino
Abstract:
In domain adaptation, when there is a large distance between the source and target domains, the prediction performance will degrade. Gradual domain adaptation is one of the solutions to such an issue, assuming that we have access to intermediate domains, which shift gradually from the source to the target domain. In previous works, it was assumed that the number of samples in the intermediate doma…
▽ More
In domain adaptation, when there is a large distance between the source and target domains, the prediction performance will degrade. Gradual domain adaptation is one of the solutions to such an issue, assuming that we have access to intermediate domains, which shift gradually from the source to the target domain. In previous works, it was assumed that the number of samples in the intermediate domains was sufficiently large; hence, self-training was possible without the need for labeled data. If the number of accessible intermediate domains is restricted, the distances between domains become large, and self-training will fail. Practically, the cost of samples in intermediate domains will vary, and it is natural to consider that the closer an intermediate domain is to the target domain, the higher the cost of obtaining samples from the intermediate domain is. To solve the trade-off between cost and accuracy, we propose a framework that combines multifidelity and active domain adaptation. The effectiveness of the proposed method is evaluated by experiments with real-world datasets.
△ Less
Submitted 10 November, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Extending the WILDS Benchmark for Unsupervised Adaptation
Authors:
Shiori Sagawa,
Pang Wei Koh,
Tony Lee,
Irena Gao,
Sang Michael Xie,
Kendrick Shen,
Ananya Kumar,
Weihua Hu,
Michihiro Yasunaga,
Henrik Marklund,
Sara Beery,
Etienne David,
Ian Stavness,
Wei Guo,
Jure Leskovec,
Kate Saenko,
Tatsunori Hashimoto,
Sergey Levine,
Chelsea Finn,
Percy Liang
Abstract:
Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribu…
▽ More
Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as the evaluation metrics. On these datasets, we systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.
△ Less
Submitted 23 April, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Just Train Twice: Improving Group Robustness without Training Group Information
Authors:
Evan Zheran Liu,
Behzad Haghgoo,
Annie S. Chen,
Aditi Raghunathan,
Pang Wei Koh,
Shiori Sagawa,
Percy Liang,
Chelsea Finn
Abstract:
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training…
▽ More
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
△ Less
Submitted 27 September, 2021; v1 submitted 19 July, 2021;
originally announced July 2021.
-
Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization
Authors:
John Miller,
Rohan Taori,
Aditi Raghunathan,
Shiori Sagawa,
Pang Wei Koh,
Vaishaal Shankar,
Percy Liang,
Yair Carmon,
Ludwig Schmidt
Abstract:
For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribut…
▽ More
For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet, a synthetic pose estimation task derived from YCB objects, satellite imagery classification in FMoW-WILDS, and wildlife classification in iWildCam-WILDS. The strong correlations hold across model architectures, hyperparameters, training set size, and training duration, and are more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.
△ Less
Submitted 7 October, 2021; v1 submitted 9 July, 2021;
originally announced July 2021.
-
Selective Classification Can Magnify Disparities Across Groups
Authors:
Erik Jones,
Shiori Sagawa,
Pang Wei Koh,
Ananya Kumar,
Percy Liang
Abstract:
Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in…
▽ More
Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model's confidences over all predictions. For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.
△ Less
Submitted 14 April, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
An Investigation of Why Overparameterization Exacerbates Spurious Correlations
Authors:
Shiori Sagawa,
Aditi Raghunathan,
Pang Wei Koh,
Percy Liang
Abstract:
We study why overparameterization -- increasing model size well beyond the point of zero training error -- can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority ve…
▽ More
We study why overparameterization -- increasing model size well beyond the point of zero training error -- can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority versus minority groups, and the signal-to-noise ratio of the spurious correlations. We then analyze a linear setting and theoretically show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt. Our analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails. Overall, our results suggest a tension between using overparameterized models versus using all the training data for achieving low worst-group error.
△ Less
Submitted 26 August, 2020; v1 submitted 8 May, 2020;
originally announced May 2020.
-
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
Authors:
Shiori Sagawa,
Pang Wei Koh,
Tatsunori B. Hashimoto,
Percy Liang
Abstract:
Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find…
▽ More
Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
△ Less
Submitted 2 April, 2020; v1 submitted 20 November, 2019;
originally announced November 2019.
-
Multi-Resolution Weak Supervision for Sequential Data
Authors:
Frederic Sala,
Paroma Varma,
Jason Fries,
Daniel Y. Fu,
Shiori Sagawa,
Saelig Khattar,
Ashwini Ramamoorthy,
Ke Xiao,
Kayvon Fatahalian,
James Priest,
Christopher RĂ©
Abstract:
Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collections of elements in a sequence. A key challenge in w…
▽ More
Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collections of elements in a sequence. A key challenge in weak supervision is estimating the unknown accuracies and correlations of these sources without using labeled data. Multi-resolution sources exacerbate this challenge due to complex correlations and sample complexity that scales in the length of the sequence. We propose Dugong, the first framework to model multi-resolution weak supervision sources with complex correlations to assign probabilistic labels to training data. Theoretically, we prove that Dugong, under mild conditions, can uniquely recover the unobserved accuracy and correlation parameters and use parameter sharing to improve sample complexity. Our method assigns clinician-validated labels to population-scale biomedical video repositories, helping outperform traditional supervision by 36.8 F1 points and addressing a key use case where machine learning has been severely limited by the lack of expert labeled data. On average, Dugong improves over traditional supervision by 16.0 F1 points and existing weak supervision approaches by 24.2 F1 points across several video and sensor classification tasks.
△ Less
Submitted 21 October, 2019;
originally announced October 2019.
-
Distributionally Robust Language Modeling
Authors:
Yonatan Oren,
Shiori Sagawa,
Tatsunori B. Hashimoto,
Percy Liang
Abstract:
Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the k…
▽ More
Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the knowledge of the test distribution, we propose an approach which trains a model that performs well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the loss of the model over the worst-case mixture of topics with sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language models are trained on a mixture of Yelp reviews and news and tested only on reviews.
△ Less
Submitted 4 September, 2019;
originally announced September 2019.