Search | arXiv e-print repository

doi 10.1063/5.0126667

Learning effective dynamics from data-driven stochastic systems

Authors: Lingyu Feng, Ting Gao, Min Dai, Jinqiao Duan

Abstract: Multiscale stochastic dynamical systems have been widely adopted to a variety of scientific and engineering problems due to their capability of depicting complex phenomena in many real world applications. This work is devoted to investigating the effective dynamics for slow-fast stochastic dynamical systems. Given observation data on a short-term period satisfying some unknown slow-fast stochastic systems, we propose a novel algorithm including a neural network called Auto-SDE to learn invariant slow manifold. Our approach captures the evolutionary nature of a series of time-dependent autoencoder neural networks with the loss constructed from a discretized stochastic differential equation. Our algorithm is also validated to be accurate, stable and effective through numerical experiments under various evaluation metrics.

Submitted 29 December, 2023; v1 submitted 9 May, 2022; originally announced May 2022.

arXiv:2111.07109

Nyström Regularization for Time Series Forecasting

Authors: Zirui Sun, Mingwei Dai, Yao Wang, Shao-Bo Lin

Abstract: This paper focuses on learning rate analysis of Nyström regularization with sequential sub-sampling for $τ$-mixing time series. Using a recently developed Banach-valued Bernstein inequality for $τ$-mixing sequences and an integral operator approach based on second-order decomposition, we succeed in deriving almost optimal learning rates of Nyström regularization with sequential sub-sampling for $τ$-mixing time series. A series of numerical experiments are carried out to verify our theoretical results, showing the excellent learning performance of Nyström regularization with sequential sub-sampling in learning massive time series data. All these results extend the applicable range of Nyström regularization from i.i.d. samples to non-i.i.d. sequences.

Submitted 13 November, 2021; originally announced November 2021.

Comments: 35 pages

arXiv:2109.03378

Rethinking Multidimensional Discriminator Output for Generative Adversarial Networks

Authors: Mengyu Dai, Haibin Hang, Anuj Srivastava

Abstract: The study of multidimensional discriminator (critic) output for Generative Adversarial Networks has been underexplored in the literature. In this paper, we generalize the Wasserstein GAN framework to take advantage of multidimensional critic output and explore its properties. We also introduce a square-root velocity transformation (SRVT) block which favors training in the multidimensional setting. Proofs of properties are based on our proposed maximal p-centrality discrepancy, which is bounded above by p-Wasserstein distance and fits the Wasserstein GAN framework with multidimensional critic output n. Especially when n = 1 and p = 1, the proposed discrepancy equals 1-Wasserstein distance. Theoretical analysis and empirical evidence show that high-dimensional critic output has its advantage on distinguishing real and fake distributions, and benefits faster convergence and diversity of results.

Submitted 14 July, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

Comments: Frontiers in Adversarial Machine Learning ICML 2022

arXiv:2103.15023

Nonparametric tests for treatment effect heterogeneity in observational studies

Authors: Maozhu Dai, Weining Shen, Hal S. Stern

Abstract: We consider the problem of testing for treatment effect heterogeneity in observational studies, and propose a nonparametric test based on multisample U-statistics. To account for potential confounders, we use reweighted data where the weights are determined by estimated propensity scores. The proposed method does not require any parametric assumptions on the outcomes and bypasses the need for modeling the treatment effect for each study subgroup. We establish the asymptotic normality for the test statistic, and demonstrate its superior numerical performance over several competing approaches via simulation studies. Two real data applications including an employment program evaluation study and a mental health study of China's one-child policy are also discussed.

Submitted 27 March, 2021; originally announced March 2021.

arXiv:2012.03432

A U-statistic-based test of treatment effect heterogeneity

Authors: Maozhu Dai, Hal S. Stern

Abstract: Many studies include a goal of determining whether there is treatment effect heterogeneity across different subpopulations. In this paper, we propose a U-statistic-based non-parametric test of the null hypothesis that the treatment effects are identical in different subgroups. The proposed test provides more power than the standard parametric test when the underlying distribution assumptions of the latter are violated. We apply the method to data from an economic study of program effectiveness and find that there is treatment effect heterogeneity in different subpopulations.

Submitted 6 December, 2020; originally announced December 2020.

arXiv:2010.06610

Training independent subnetworks for robust prediction

Authors: Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran

Abstract: Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.

Submitted 4 August, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: Updated to the ICLR camera ready version, added reference to Soflaei et al. 2020

arXiv:2007.05189

Learning Unstable Dynamical Systems with Time-Weighted Logarithmic Loss

Authors: Kamil Nar, Yuan Xue, Andrew M. Dai

Abstract: When training the parameters of a linear dynamical model, the gradient descent algorithm is likely to fail to converge if the squared-error loss is used as the training loss function. Restricting the parameter space to a smaller subset and running the gradient descent algorithm within this subset can allow learning stable dynamical systems, but this strategy does not work for unstable systems. In this work, we look into the dynamics of the gradient descent algorithm and pinpoint what causes the difficulty of learning unstable systems. We show that observations taken at different times from the system to be learned influence the dynamics of the gradient descent algorithm in substantially different degrees. We introduce a time-weighted logarithmic loss function to fix this imbalance and demonstrate its effectiveness in learning unstable systems.

Submitted 10 July, 2020; originally announced July 2020.

arXiv:1912.00589

Flow Contrastive Estimation of Energy-Based Models

Authors: Ruiqi Gao, Erik Nijkamp, Diederik P. Kingma, Zhen Xu, Andrew M. Dai, Ying Nian Wu

Abstract: This paper studies a training method to jointly estimate an energy-based model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. This joint training method has the following traits. (1) The update of the energy-based model is based on noise contrastive estimation, with the flow model serving as a strong noise distribution. (2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution. (3) Unlike generative adversarial networks (GAN) which estimates an implicit probability distribution defined by a generator model, our method estimates two explicit probabilistic distributions on the data. Using the proposed method we demonstrate a significant improvement on the synthesis quality of the flow model, and show the effectiveness of unsupervised feature learning by the learned energy-based model. Furthermore, the proposed training method can be easily adapted to semi-supervised learning. We achieve competitive results to the state-of-the-art semi-supervised learning methods.

Submitted 1 April, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

arXiv:1911.06410

Modelling EHR timeseries by restricting feature interaction

Authors: Kun Zhang, Yuan Xue, Gerardo Flores, Alvin Rajkomar, Claire Cui, Andrew M. Dai

Abstract: Time series data are prevalent in electronic health records, mostly in the form of physiological parameters such as vital signs and lab tests. The patterns of these values may be significant indicators of patients' clinical states and there might be patterns that are unknown to clinicians but are highly predictive of some outcomes. Many of these values are also missing which makes it difficult to apply existing methods like decision trees. We propose a recurrent neural network model that reduces overfitting to noisy observations by limiting interactions between features. We analyze its performance on mortality, ICD-9 and AKI prediction from observational values on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Our models result in an improvement of 1.1% [p<0.01] in AU-ROC for mortality prediction under the MetaVision subset and 1.0% and 2.2% [p<0.01] respectively for mortality and AKI under the full MIMIC-III dataset compared to existing state-of-the-art interpolation, embedding and decay-based recurrent models.

Submitted 14 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

arXiv:1911.05861

Federated and Differentially Private Learning for Electronic Health Records

Authors: Stephen R. Pfohl, Andrew M. Dai, Katherine Heller

Abstract: The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring sensitive data be shared or stored in a central repository. This process necessitates communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.

Submitted 13 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

arXiv:1910.11424

Capacity, Bandwidth, and Compositionality in Emergent Language Learning

Authors: Cinjon Resnick, Abhinav Gupta, Jakob Foerster, Andrew M. Dai, Kyunghyun Cho

Abstract: Many recent works have discussed the propensity, or lack thereof, for emergent languages to exhibit properties of natural languages. A favorite in the literature is learning compositionality. We note that most of those works have focused on communicative bandwidth as being of primary importance. While important, it is not the only contributing factor. In this paper, we investigate the learning biases that affect the efficacy and compositionality of emergent languages. Our foremost contribution is to explore how capacity of a neural network impacts its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there should be a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically see evidence for the bottom of this range, we curiously do not find evidence for the top part of the range and believe that this is an open question for the community.

Submitted 15 April, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

Comments: The first two authors contributed equally. Accepted at AAMAS 2020

arXiv:1909.09712

Learning an Adaptive Learning Rate Schedule

Authors: Zhen Xu, Andrew M. Dai, Jonas Kemp, Luke Metz

Abstract: The learning rate is one of the most important hyper-parameters for model training and generalization. However, current hand-designed parametric learning rate schedules offer limited flexibility and the predefined schedule may not match the training dynamics of high dimensional and non-convex optimization problems. In this paper, we propose a reinforcement learning based framework that can automatically learn an adaptive learning rate schedule by leveraging the information from past training histories. The learning rate dynamically changes based on the current training dynamics. To validate this framework, we conduct experiments with different neural network architectures on the Fashion MNIST and CIFAR10 datasets. Experimental results show that the auto-learned learning rate controller can achieve better test results. In addition, the trained controller network is generalizable: it can be trained on one data set and transferred to new problems.

Submitted 20 September, 2019; originally announced September 2019.

arXiv:1909.03039

Improved Hierarchical Patient Classification with Language Model Pretraining over Clinical Notes

Authors: Jonas Kemp, Alvin Rajkomar, Andrew M. Dai

Abstract: Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations. Using these notes in predictive modeling has traditionally required preprocessing (e.g. taking frequent terms or topic modeling) that removes much of the richness of the source data. We propose a pretrained hierarchical recurrent neural network model that parses minimally processed clinical notes in an intuitive fashion, and show that it improves performance for discharge diagnosis classification tasks on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, compared to models that treat the notes as an unordered collection of terms or that conduct no pretraining. We also apply an attribution technique to examples to identify the words that the model uses to make its prediction, and show the importance of the words' nearby context.

Submitted 14 November, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - extended abstract

arXiv:1906.04716

Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer

Authors: Edward Choi, Zhen Xu, Yujia Li, Michael W. Dusenberry, Gerardo Flores, Yuan Xue, Andrew M. Dai

Abstract: Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that using the graphical structure underlying EHR data (e.g. relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction. However, EHR data do not always contain complete structure information. Moreover, when it comes to claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than just treating EHR data as a flat-structured bag-of-features? In this paper, we study the possibility of jointly learning the hidden structure of EHR while performing supervised prediction tasks on EHR data. Specifically, we discuss that Transformer is a suitable basis model to learn the hidden EHR structure, and propose Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. The proposed model consistently outperformed previous approaches empirically, on both synthetic data and publicly available EHR data, for various prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data.

Submitted 19 January, 2020; v1 submitted 11 June, 2019; originally announced June 2019.

Comments: To be presented at AAAI 2020

arXiv:1906.03842

doi 10.1145/3368555.3384457

Analyzing the Role of Model Uncertainty for Electronic Health Records

Authors: Michael W. Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, Andrew M. Dai

Abstract: In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.

Submitted 25 March, 2020; v1 submitted 10 June, 2019; originally announced June 2019.

Comments: Published in the ACM Conference on Health, Inference, and Learning (CHIL) 2020. Code available at https://github.com/Google-Health/records-research

arXiv:1902.03525

BOLT-SSI: A Statistical Approach to Screening Interaction Effects for Ultra-High Dimensional Data

Authors: Min Zhou, Mingwei Dai, Yuan Yao, Jin Liu, Can Yang, Heng Peng

Abstract: Detecting interaction effects among predictors on the response variable is a crucial step in various applications. In this paper, we first propose a simple method for sure screening interactions (SSI). Although its computation complexity is $O(p^2n)$, SSI works well for problems of moderate dimensionality (e.g., $p=10^3\sim10^4$), without the heredity assumption. To ultra-high dimensional problems (e.g., $p = 10^6$), motivated by discretization associated Boolean representation and operations and the contingency table for discrete variables, we propose a fast algorithm, named "BOLT-SSI". The statistical theory has been established for SSI and BOLT-SSI, guaranteeing their sure screening property. The performance of SSI and BOLT-SSI are evaluated by comprehensive simulation and real case studies. Numerical results demonstrate that SSI and BOLT-SSI can often outperform their competitors in terms of computational efficiency and statistical accuracy. The proposed method can be applied for fully detecting interactions with more than 300,000 predictors. Based on this study, we believe that there is a great need to rethink the relationship between statistical accuracy and computational efficiency. We have shown that the computational performance of a statistical method can often be greatly improved by exploring the advantages of computational architecture with a tolerable loss of statistical accuracy.

Submitted 15 December, 2020; v1 submitted 9 February, 2019; originally announced February 2019.

Comments: 56 pages, 7 figures

arXiv:1809.04281

Music Transformer

Authors: Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck

Abstract: Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.

Submitted 12 December, 2018; v1 submitted 12 September, 2018; originally announced September 2018.

Comments: Improved skewing section and accompanying figures. Previous titles are "An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation" and "Music Transformer"

arXiv:1808.06576

Peptide-Spectra Matching from Weak Supervision

Authors: Samuel S. Schoenholz, Sean Hackett, Laura Deming, Eugene Melamud, Navdeep Jaitly, Fiona McAllister, Jonathon O'Brien, George Dahl, Bryson Bennett, Andrew M. Dai, Daphne Koller

Abstract: As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.

Submitted 22 August, 2018; v1 submitted 20 August, 2018; originally announced August 2018.

arXiv:1804.11011 [pdf, other]

Joint Analysis of Individual-level and Summary-level GWAS Data by Leveraging Pleiotropy

Authors: Mingwei Dai, Xiang Wan, Hao Peng, Yao Wang, Yue Liu, Jin Liu, Zongben Xu, Can Yang

Abstract: A large number of recent genome-wide association studies (GWASs) for complex phenotypes confirm the early conjecture for polygenicity, suggesting the presence of a large number of variants with only tiny or moderate effects. However, due to the limited sample size of a single GWAS, many associated genetic variants are too weak to achieve genome-wide significance. These undiscovered variants further limit the prediction capability of GWAS. Restricted access to the individual-level data and the increasing availability of the published GWAS results motivate the development of methods integrating both the individual-level and summary-level data. How to build the connection between the individual-level and summary-level data determines the efficiency of using the existing abundant summary-level resources with limited individual-level data, and this issue inspires more efforts in the existing area. In this study, we propose a novel statistical approach, LEP, which provides a novel way of modeling the connection between the individual-level data and summary-level data. LEP integrates both types of data by \underline{LE}veraging \underline{P}leiotropy to increase the statistical power of risk variants identification and the accuracy of risk prediction. The algorithm for parameter estimation is developed to handle genome-wide-scale data. Through comprehensive simulation studies, we demonstrated the advantages of LEP over the existing methods. We further applied LEP to perform integrative analysis of Crohn's disease from WTCCC and summary statistics from GWAS of some other diseases, such as Type 1 diabetes, Ulcerative colitis and Primary biliary cirrhosis. LEP was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.39\% ($\pm$ 0.58\%) to 68.33\% ($\pm$ 0.32\%) using about 195,000 variants.

Submitted 29 April, 2018; originally announced April 2018.

Comments: 32 pages, 11 figures, 2 tables

arXiv:1803.10439 [pdf, other]

BIVAS: A scalable Bayesian method for bi-level variable selection with applications

Authors: Mingxuan Cai, Mingwei Dai, Jingsi Ming, Heng Peng, Jin Liu, Can Yang

Abstract: In this paper, we consider a Bayesian bi-level variable selection problem in high-dimensional regressions. In many practical situations, it is natural to assign group membership to each predictor. Examples include that genetic variants can be grouped at the gene level and a covariate from different tasks naturally forms a group. Thus, it is of interest to select important groups as well as important members from those groups. The existing Markov Chain Monte Carlo (MCMC) methods are often computationally intensive and not scalable to large data sets. To address this problem, we consider variational inference for bi-level variable selection (BIVAS). In contrast to the commonly used mean-field approximation, we propose a hierarchical factorization to approximate the posterior distribution, by utilizing the structure of bi-level variable selection. Moreover, we develop a computationally efficient and fully parallelizable algorithm based on this variational approximation. We further extend the developed method to model data sets from multi-task learning. The comprehensive numerical results from both simulation studies and real data analysis demonstrate the advantages of BIVAS for variable selection, parameter estimation and computational efficiency over existing methods. The method is implemented in R package `bivas' available at https://github.com/mxcai/bivas.

Submitted 28 March, 2018; originally announced March 2018.

arXiv:1803.00144 [pdf, ps, other]

Learning Longer-term Dependencies in RNNs with Auxiliary Losses

Authors: Trieu H. Trinh, Andrew M. Dai, Minh-Thang Luong, Quoc V. Le

Abstract: Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long-term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16\,000, and a real document classification benchmark. Our results highlight good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparably sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.

Submitted 13 June, 2018; v1 submitted 28 February, 2018; originally announced March 2018.

Comments: ICML 2018

arXiv:1801.07736 [pdf, other]

MaskGAN: Better Text Generation via Filling in the______

Authors: William Fedus, Ian Goodfellow, Andrew M. Dai

Abstract: Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maximum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.

Submitted 1 March, 2018; v1 submitted 23 January, 2018; originally announced January 2018.

Comments: 16 pages, ICLR 2018

arXiv:1710.09551 [pdf, other]

LPG: a four-groups probabilistic approach to leveraging pleiotropy in genome-wide association studies

Authors: Yi Yang, Mingwei Dai, Jian Huang, Xinyi Lin, Can Yang, Jin Liu, Min Chen

Abstract: To date, genome-wide association studies (GWAS) have successfully identified tens of thousands of genetic variants among a variety of traits/diseases, shedding light on the genetic architecture of complex diseases. Polygenicity of complex diseases, which refers to the phenomenon that a vast number of risk variants collectively contribute to the heritability of complex diseases with modest individual effects, has been widely accepted. This imposes a major challenge towards fully characterizing the genetic bases of complex diseases. An immediate implication of polygenicity is that a much larger sample size is required to detect risk variants with weak/moderate effects. Meanwhile, accumulating evidence suggests that different complex diseases can share genetic risk variants, a phenomenon known as pleiotropy. In this study, we propose a statistical framework for Leveraging Pleiotropic effects in large-scale GWAS data (LPG). LPG utilizes a variational Bayesian expectation-maximization (VBEM) algorithm, making it computationally efficient and scalable for genome-wide scale analysis. To demonstrate the advantage of LPG over existing methods that do not leverage pleiotropy, we conducted extensive simulation studies and also applied LPG to analyze three autoimmune disorders (Crohn's disease, rheumatoid arthritis and Type 1 diabetes). The results indicate that LPG can improve the power of prioritization of risk variants and accuracy of risk prediction by leveraging pleiotropy. The software is available at https://github.com/Shufeyangyi2015310117/LPG.

Submitted 26 October, 2017; originally announced October 2017.

Comments: 81 pages (including supplementary)

arXiv:1710.08446 [pdf, other]

Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step

Authors: William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, Ian Goodfellow

Abstract: Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players' parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.

Submitted 20 February, 2018; v1 submitted 23 October, 2017; originally announced October 2017.

Comments: 18 pages

arXiv:1710.07201 [pdf, other]

LSMM: A statistical approach to integrating functional annotations with genome-wide association studies

Authors: Jingsi Ming, Mingwei Dai, Mingxuan Cai, Xiang Wan, Jin Liu, Can Yang

Abstract: Thousands of risk variants underlying complex phenotypes (quantitative traits and diseases) have been identified in genome-wide association studies (GWAS). However, there are still two major challenges towards deepening our understanding of the genetic architectures of complex phenotypes. First, the majority of GWAS hits are in the non-coding region and their biological interpretation is still unclear. Second, accumulating evidence from GWAS suggests the polygenicity of complex traits, i.e., a complex trait is often affected by many variants with small or moderate effects, whereas a large proportion of risk variants with small effects remains unknown. The availability of functional annotation data enables us to address the above challenges. In this study, we propose a latent sparse mixed model (LSMM) to integrate functional annotations with GWAS data. Not only does it increase the statistical power of the identification of risk variants, but it also offers more biological insights by detecting relevant functional annotations. To make LSMM scalable to millions of variants and hundreds of functional annotations, we developed an efficient variational expectation-maximization (EM) algorithm for model parameter estimation and statistical inference. We first conducted comprehensive simulation studies to evaluate the performance of LSMM. Then we applied it to analyze 30 GWAS of complex phenotypes integrated with 9 genic category annotations and 127 tissue-specific functional annotations from the Roadmap project. The results demonstrate that our method possesses more statistical power than conventional methods, and can help researchers achieve a deeper understanding of the genetic architecture of these complex phenotypes.

Submitted 19 October, 2017; originally announced October 2017.

arXiv:1605.07725 [pdf, ps, other]

Adversarial Training Methods for Semi-Supervised Text Classification

Authors: Takeru Miyato, Andrew M. Dai, Ian Goodfellow

Abstract: Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state-of-the-art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.

Submitted 16 November, 2021; v1 submitted 25 May, 2016; originally announced May 2016.

Comments: Published as a conference paper at ICLR 2017

arXiv:1412.5236 [pdf, other]

doi 10.1109/TPAMI.2014.2315802

The supervised hierarchical Dirichlet process

Authors: Andrew M. Dai, Amos J. Storkey

Abstract: We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM) have previously been explored; these models allow flexibility in modelling nonlinear relationships. However, until now, Hierarchical Dirichlet Process (HDP) mixtures have not seen significant use in supervised problems with grouped data since a straightforward application of the HDP on the grouped data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing for clusters to be learnt jointly from the group structure and from the label assigned to each group.

Submitted 16 December, 2014; originally announced December 2014.

Comments: 14 pages

Showing 1–27 of 27 results for author: Dai, M