-
Towards Domain-Agnostic Contrastive Learning
Authors:
Vikas Verma,
Minh-Thang Luong,
Kenji Kawaguchi,
Hieu Pham,
Quoc V. Le
Abstract:
Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome such limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus, d…
▽ More
Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome such limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus, data augmentation techniques, are not readily available. Key to our approach is the use of Mixup noise to create similar and dissimilar examples by mixing data samples differently either at the input or hidden-state levels. To demonstrate the effectiveness of DACL, we conduct experiments across various domains such as tabular data, images, and graphs. Our results show that DACL not only outperforms other domain-agnostic noising methods, such as Gaussian-noise, but also combines well with domain-specific methods, such as SimCLR, to improve self-supervised visual representation learning. Finally, we theoretically analyze our method and show advantages over the Gaussian-noise based contrastive learning approach.
△ Less
Submitted 19 July, 2021; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Meta Pseudo Labels
Authors:
Hieu Pham,
Zihang Dai,
Qizhe Xie,
Minh-Thang Luong,
Quoc V. Le
Abstract:
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher i…
▽ More
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student's performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student. Our code will be available at https://github.com/google-research/google-research/tree/master/meta_pseudo_labels.
△ Less
Submitted 1 March, 2021; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Towards a Human-like Open-Domain Chatbot
Authors:
Daniel Adiwardana,
Minh-Thang Luong,
David R. So,
Jamie Hall,
Noah Fiedel,
Romal Thoppilan,
Zi Yang,
Apoorv Kulshreshtha,
Gaurav Nemade,
Yifeng Lu,
Quoc V. Le
Abstract:
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation.…
▽ More
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
△ Less
Submitted 27 February, 2020; v1 submitted 27 January, 2020;
originally announced January 2020.
-
Self-training with Noisy Student improves ImageNet classification
Authors:
Qizhe Xie,
Minh-Thang Luong,
Eduard Hovy,
Quoc V. Le
Abstract:
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mea…
▽ More
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.
Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. Code is available at https://github.com/google-research/noisystudent.
△ Less
Submitted 19 June, 2020; v1 submitted 11 November, 2019;
originally announced November 2019.
-
Selfie: Self-supervised Pretraining for Image Embedding
Authors:
Trieu H. Trinh,
Minh-Thang Luong,
Quoc V. Le
Abstract:
We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, amo…
▽ More
We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture of Selfie includes a network of convolutional blocks to process patches followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet 32 x 32, and ImageNet 224 x 224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to the standard supervised training of the same network. Notably, on ImageNet 224 x 224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially on low data regime, by significantly lowering the standard deviation of test accuracies across different runs.
△ Less
Submitted 27 July, 2019; v1 submitted 7 June, 2019;
originally announced June 2019.
-
Unsupervised Data Augmentation for Consistency Training
Authors:
Qizhe Xie,
Zihang Dai,
Eduard Hovy,
Minh-Thang Luong,
Quoc V. Le
Abstract:
Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality…
▽ More
Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regime, such as ImageNet, whether when there is only 10% labeled data or when a full labeled set with 1.3M extra unlabeled examples is used. Code is available at https://github.com/google-research/uda.
△ Less
Submitted 5 November, 2020; v1 submitted 29 April, 2019;
originally announced April 2019.
-
Learning Longer-term Dependencies in RNNs with Auxiliary Losses
Authors:
Trieu H. Trinh,
Andrew M. Dai,
Minh-Thang Luong,
Quoc V. Le
Abstract:
Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long term dependencies in RNNs by adding an unsupervised auxiliary lo…
▽ More
Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16\,000, and a real document classification benchmark. Our results highlight good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparable sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.
△ Less
Submitted 13 June, 2018; v1 submitted 28 February, 2018;
originally announced March 2018.
-
Multi-task Sequence to Sequence Learning
Authors:
Minh-Thang Luong,
Quoc V. Le,
Ilya Sutskever,
Oriol Vinyals,
Lukasz Kaiser
Abstract:
Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the oneto-many setting - where the encoder is shared between several tasks such as machi…
▽ More
Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the oneto-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.
△ Less
Submitted 1 March, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Hidden Markov Model Applications in Change-Point Analysis
Authors:
The Minh Luong,
Vittorio Perduca,
Gregory Nuel
Abstract:
The detection of change-points in heterogeneous sequences is a statistical challenge with many applications in fields such as finance, signal analysis and biology. A wide variety of literature exists for finding an ideal set of change-points for characterizing the data. In this tutorial we elaborate on the Hidden Markov Model (HMM) and present two different frameworks for applying HMM to change-po…
▽ More
The detection of change-points in heterogeneous sequences is a statistical challenge with many applications in fields such as finance, signal analysis and biology. A wide variety of literature exists for finding an ideal set of change-points for characterizing the data. In this tutorial we elaborate on the Hidden Markov Model (HMM) and present two different frameworks for applying HMM to change-point models. Then we provide a summary of two procedures for inference in change-point analysis, which are particular cases of the forward-backward algorithm for HMMs, and discuss common implementation problems. Lastly, we provide two examples of the HMM methods on available data sets and we shortly discuss about the applications to current genomics studies. The R code used in the examples is provided in the appendix.
△ Less
Submitted 8 December, 2012;
originally announced December 2012.
-
Fast estimation of the ICL criterion for change-point detection problems with applications to Next-Generation Sequencing data
Authors:
Alice Cleynen,
The Minh Luong,
Guillem Rigaill,
Gregory Nuel
Abstract:
In this paper, we consider the Integrated Completed Likelihood (ICL) as a useful criterion for estimating the number of changes in the underlying distribution of data in problems where detecting the precise location of these changes is the main goal. The exact computation of the ICL requires O(Kn2) operations (with K the number of segments and n the number of data-points) which is prohibitive in m…
▽ More
In this paper, we consider the Integrated Completed Likelihood (ICL) as a useful criterion for estimating the number of changes in the underlying distribution of data in problems where detecting the precise location of these changes is the main goal. The exact computation of the ICL requires O(Kn2) operations (with K the number of segments and n the number of data-points) which is prohibitive in many practical situations with large sequences of data. We describe a framework to estimate the ICL with O(Kn) complexity. Our approach is general in the sense that it can accommodate any given model distribution. We checked the run-time and validity of our approach on simulated data and demonstrate its good performance when analyzing real Next-Generation Sequencing (NGS) data using a negative binomial model.
△ Less
Submitted 1 July, 2013; v1 submitted 14 November, 2012;
originally announced November 2012.