-
Spark Transformer: Reactivating Sparsity in FFN and Attention
Authors:
Chong You,
Kan Wu,
Zhipeng Jia,
Lin Chen,
Srinadh Bhojanapalli,
Jiaxian Guo,
Utku Evci,
Jan Wassenberg,
Praneeth Netrapalli,
Jeremiah J. Willcock,
Suvinay Subramanian,
Felix Chern,
Alek Andreev,
Shreya Pathak,
Felix Yu,
Prateek Jain,
David E. Culler,
Henry M. Levy,
Sanjiv Kumar
Abstract:
The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges.
This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-$k$ masking for explicit control over sparsity level. Crucially, we introduce statistical top-$k$, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
Submitted 6 June, 2025;
originally announced June 2025.
-
Spatial Confounding in Multivariate Areal Data Analysis
Authors:
Kyle Lin Wu,
Sudipto Banerjee
Abstract:
We investigate spatial confounding in the presence of multivariate disease dependence. In the "analysis model perspective" of spatial confounding, adding a spatially dependent random effect can lead to significant variance inflation of the posterior distribution of the fixed effects. The "data generation perspective" views covariates as stochastic and correlated with an unobserved spatial confounder, leading to inferior statistical inference over multiple realizations. While multiple methods have been proposed for adjusting statistical models to mitigate spatial confounding in estimating regression coefficients, results on interactions between spatial confounding and multivariate dependence are very limited. We contribute to this domain by investigating spatial confounding from the analysis and data generation perspectives in a Bayesian coregionalized areal regression model. We derive novel results that distinguish variance inflation due to spatial confounding from inflation based on multicollinearity between predictors and provide insights into the estimation efficiency of a spatial estimator under a spatially confounded data generation model. We demonstrate favorable performance of spatial analysis compared to a non-spatial model in our simulation experiments even in the presence of spatial confounding and a misspecified spatial structure. In this regard, we align with several other authors in the defense of traditional hierarchical spatial models (Gilbert et al., 2025; Khan and Berrett, 2023; Zimmerman and Ver Hoef, 2022) and extend this defense to multivariate areal models. We analyze county-level data from the US on obesity / diabetes prevalence and diabetes-related cancer mortality, comparing the results with and without spatial random effects.
Submitted 12 May, 2025;
originally announced May 2025.
-
DUE: A Deep Learning Framework and Library for Modeling Unknown Equations
Authors:
Junfeng Chen,
Kailiang Wu,
Dongbin Xiu
Abstract:
Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown equations from measurement or simulation data. This paradigm shift, known as data-driven discovery or modeling, stands at the forefront of AI for science, with significant progress made in recent years. In this paper, we introduce a systematic framework for data-driven modeling of unknown equations using deep learning. This versatile framework is capable of learning unknown ODEs, PDEs, DAEs, IDEs, SDEs, reduced or partially observed systems, and non-autonomous differential equations. Based on this framework, we have developed Deep Unknown Equations (DUE), an open-source software package designed to facilitate the data-driven modeling of unknown equations using modern deep learning techniques. DUE serves as an educational tool for classroom instruction, enabling students and newcomers to gain hands-on experience with differential equations, data-driven modeling, and contemporary deep learning approaches such as FNN, ResNet, generalized ResNet, operator semigroup networks (OSG-Net), and Transformers. Additionally, DUE is a versatile and accessible toolkit for researchers across various scientific and engineering fields. It is applicable not only for learning unknown equations from data but also for surrogate modeling of known, yet complex, equations that are costly to solve using traditional numerical methods. We provide detailed descriptions of DUE and demonstrate its capabilities through diverse examples, which serve as templates that can be easily adapted for other applications.
Submitted 14 April, 2025;
originally announced April 2025.
-
A Unified Approach for Estimating Various Treatment Effects in Causal Inference
Authors:
Kuan-Hsun Wu,
Li-Pang Chen
Abstract:
In this paper, we introduce a unified estimator to analyze various treatment effects in causal inference, including but not limited to the average treatment effect (ATE) and the quantile treatment effect (QTE). The proposed estimator is developed under the statistical functional and cumulative distribution function structure, which leads to a flexible and robust estimator that covers several commonly used treatment effects. In addition, our approach takes variable selection into account, so that informative confounders and the network structure among them can be identified and incorporated into our estimation procedure. The theoretical properties, including variable selection consistency and asymptotic normality of the statistical functional estimator, are established. Estimation of various treatment effects is also carried out in numerical studies, and the results reveal that the proposed estimator generally outperforms existing methods and is more efficient than its competitors.
Submitted 28 March, 2025;
originally announced March 2025.
-
Mixed Likelihood Variational Gaussian Processes
Authors:
Kaiwen Wu,
Craig Sanders,
Benjamin Letham,
Phillip Guan
Abstract:
Gaussian processes (GPs) are powerful models for human-in-the-loop experiments due to their flexibility and well-calibrated uncertainty. However, GPs modeling human responses typically ignore auxiliary information, including a priori domain expertise and non-task performance information like user confidence ratings. We propose mixed likelihood variational GPs to leverage auxiliary information, which combine multiple likelihoods in a single evidence lower bound to model multiple types of data. We demonstrate the benefits of mixing likelihoods in three real-world experiments with human participants. First, we use mixed likelihood training to impose prior knowledge constraints in GP classifiers, which accelerates active learning in a visual perception task where users are asked to identify geometric errors resulting from camera position errors in virtual reality. Second, we show that leveraging Likert scale confidence ratings by mixed likelihood training improves model fitting for haptic perception of surface roughness. Lastly, we show that Likert scale confidence ratings improve human preference learning in robot gait optimization. The modeling performance improvements found using our framework across this diverse set of applications illustrate the benefits of incorporating auxiliary information into active learning and preference learning by using mixed likelihoods to jointly model multiple inputs.
Submitted 6 March, 2025;
originally announced March 2025.
-
An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses
Authors:
Hao Liang,
Wanrong Zhang,
Xinlei He,
Kaishun Wu,
Hong Xing
Abstract:
Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantees often come at the cost of model performance, largely due to the inherent challenge of accurately quantifying privacy loss. While recent efforts have strengthened privacy guarantees by focusing solely on the final output and bounded domain cases, they still impose restrictive assumptions, such as convexity and other parameter limitations, and often lack a thorough analysis of utility. In this paper, we provide rigorous privacy and utility characterization for DPSGD for smooth loss functions in both bounded and unbounded domains. We track the privacy loss over multiple iterations by exploiting the noisy smooth-reduction property and establish the utility analysis by leveraging the projection's non-expansiveness and clipped SGD properties. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, and (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions. Numerical experiments validate our results.
Submitted 28 February, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
Revisiting Interactions of Multiple Driver States in Heterogenous Population and Cognitive Tasks
Authors:
Jiyao Wang,
Ange Wang,
Song Yan,
Dengbo He,
Kaishun Wu
Abstract:
In real-world driving scenarios, multiple states occur simultaneously due to individual differences and environmental factors, complicating the analysis and estimation of driver states. Previous studies, limited by experimental design and analytical methods, may not be able to disentangle the relationships among multiple driver states and environmental factors. This paper introduces the Double Machine Learning (DML) analysis method to the field of driver state analysis to tackle this challenge. To train and test the DML model, a driving simulator experiment with 42 participants was conducted. All participants drove SAE level-3 vehicles and conducted three types of cognitive tasks in a 3-hour driving experiment. Drivers' subjective cognitive load and drowsiness levels were collected throughout the experiment. Then, we isolated individual and environmental factors affecting driver state variations and the factors affecting drivers' physiological and eye-tracking metrics when they are under specific states. The results show that our approach successfully decoupled and inferred the complex causal relationships between multiple types of drowsiness and cognitive load. Additionally, we identified key physiological and eye-tracking indicators in the presence of multiple driver states and under the influence of a single state, excluding the influence of other driver states, environmental factors, and individual characteristics. Our causal inference analytical framework can offer new insights for subsequent analysis of drivers' states. Further, the updated causal relation graph based on the DML analysis can provide theoretical bases for driver state monitoring based on physiological and eye-tracking measures.
Submitted 19 December, 2024; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Emergenet: A Digital Twin of Sequence Evolution for Scalable Emergence Risk Assessment of Animal Influenza A Strains
Authors:
Kevin Yuanbo Wu,
Jin Li,
Aaron Esser-Kahn,
Ishanu Chattopadhyay
Abstract:
Despite having triggered devastating pandemics in the past, our ability to quantitatively assess the emergence potential of individual strains of animal influenza viruses remains limited. This study introduces Emergenet, a tool to infer a digital twin of sequence evolution to chart how new variants might emerge in the wild. Our predictions based on Emergenets built only using 220,151 Hemagglutinin (HA) sequences consistently outperform WHO seasonal vaccine recommendations for H1N1/H3N2 subtypes over two decades (average match-improvement: 3.73 AAs, 28.40%), and are on par with state-of-the-art approaches that use more detailed phenotypic annotations. Finally, our generative models are used to scalably calculate the current odds of emergence of animal strains not yet in human circulation, which strongly correlates with CDC's expert-assessed Influenza Risk Assessment Tool (IRAT) scores (Pearson's $r = 0.721, p = 10^{-4}$). A speedup of at least five orders of magnitude over CDC's assessment (seconds vs months) then enabled us to analyze 6,354 animal strains collected post-2020 to identify 35 strains with high emergence scores ($> 7.7$). The Emergenet framework opens the door to preemptive pandemic mitigation through targeted inoculation of animal hosts before the first human infection.
Submitted 26 November, 2024;
originally announced November 2024.
-
Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference
Authors:
Jonathan Wenger,
Kaiwen Wu,
Philipp Hennig,
Jacob R. Gardner,
Geoff Pleiss,
John P. Cunningham
Abstract:
Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables -- at the cost of quadratic complexity -- an explicit tradeoff between computation and precision. Here we extend this development to model selection, which requires significant enhancements to the existing approach, including linear-time scaling in the size of the dataset. We propose a novel training loss for hyperparameter optimization and demonstrate empirically that the resulting method can outperform SGPR, CGGP and SVGP, state-of-the-art methods for GP model selection, on medium to large-scale datasets. Our experiments show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty -- a fundamental prerequisite for optimal decision-making.
Submitted 7 July, 2025; v1 submitted 1 November, 2024;
originally announced November 2024.
-
Deep Limit Model-free Prediction in Regression
Authors:
Kejin Wu,
Dimitris N. Politis
Abstract:
In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction interval under a general regression setting. Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, this classical method relies heavily on the correct model specification. Even for the non-parametric approach, some additive form is often assumed. A newly proposed Model-free prediction principle sheds light on a prediction procedure without any model assumption. Previous work regarding this principle has shown better performance than other standard alternatives. Recently, DNN, one of the machine learning methods, has received increasing attention due to its great performance in practice. Guided by the Model-free prediction idea, we attempt to apply a fully connected forward DNN to map X and some appropriate reference random variable Z to Y. The targeted DNN is trained by minimizing a specially designed loss function so that the randomness of Y conditional on X is outsourced to Z through the trained DNN. Our method is more stable and accurate compared to other DNN-based counterparts, especially for optimal point predictions. With a specific prediction procedure, our prediction interval can capture the estimation variability so that it can render a better coverage rate for finite sample cases. The superior performance of our method is verified by simulation and empirical studies.
Submitted 11 September, 2024; v1 submitted 18 August, 2024;
originally announced August 2024.
-
Assessing Spatial Disparities: A Bayesian Linear Regression Approach
Authors:
Kyle Lin Wu,
Sudipto Banerjee
Abstract:
Epidemiological investigations of regionally aggregated spatial data often involve detecting spatial health disparities among neighboring regions on a map of disease mortality or incidence rates. Analyzing such data introduces spatial dependence among the health outcomes and seeks to report statistically significant spatial disparities by delineating boundaries that separate neighboring regions with disparate health outcomes. However, there are statistical challenges to appropriately defining what constitutes a spatial disparity and to constructing robust probabilistic inference for spatial disparities. We enrich the familiar Bayesian linear regression framework to introduce spatial autoregression and offer model-based detection of spatial disparities. We derive exploitable analytical tractability that considerably accelerates computation. Simulation experiments conducted over a county map of the entire United States demonstrate the effectiveness of our method, and we apply our method to a data set from the Institute for Health Metrics and Evaluation (IHME) on age-standardized US county-level estimates of lung cancer mortality rates.
Submitted 14 March, 2025; v1 submitted 27 July, 2024;
originally announced July 2024.
-
A Fast, Robust Elliptical Slice Sampling Implementation for Linearly Truncated Multivariate Normal Distributions
Authors:
Kaiwen Wu,
Jacob R. Gardner
Abstract:
Elliptical slice sampling, when adapted to linearly truncated multivariate normal distributions, is a rejection-free Markov chain Monte Carlo method. At its core, it requires analytically constructing an ellipse-polytope intersection. The main novelty of this paper is an algorithm that computes this intersection in $\mathcal{O}(m \log m)$ time, where $m$ is the number of linear inequality constraints representing the polytope. We show that an implementation based on this algorithm enhances numerical stability, speeds up running time, and is easy to parallelize for launching multiple Markov chains.
Submitted 15 July, 2024;
originally announced July 2024.
-
Understanding Stochastic Natural Gradient Variational Inference
Authors:
Kaiwen Wu,
Jacob R. Gardner
Abstract:
Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the stochastic setting. We aim to lessen this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ non-asymptotic convergence rate of stochastic NGVI. The complexity is no worse than stochastic gradient descent (a.k.a. black-box variational inference) and the rate likely has better constant dependency that leads to faster convergence in practice. For non-conjugate likelihoods, we show that stochastic NGVI with the canonical parameterization implicitly optimizes a non-convex objective. Thus, a global convergence rate of $\mathcal{O}(\frac{1}{T})$ is unlikely without some significant new understanding of optimizing the ELBO using natural gradients.
Submitted 3 June, 2024;
originally announced June 2024.
-
Scalable Subsampling Inference for Deep Neural Networks
Authors:
Kejin Wu,
Dimitris N. Politis
Abstract:
Deep neural networks (DNN) have received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound has been developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest results on the approximation ability of DNN. More importantly, however, a non-random subsampling technique--scalable subsampling--is applied to construct a `subagged' DNN estimator. Under regularity conditions, it is shown that the subagged DNN estimator is computationally efficient without sacrificing accuracy for either estimation or prediction tasks. Beyond point estimation/prediction, we propose different approaches to build confidence and prediction intervals based on the subagged DNN estimator. In addition to being asymptotically valid, the proposed confidence/prediction intervals appear to work well in finite samples. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.
Submitted 13 May, 2024;
originally announced May 2024.
-
Order picking efficiency: A scattered storage and clustered allocation strategy in automated drug dispensing systems
Authors:
Mengge Yuan,
Ning Zhao,
Kan Wu,
Lulu Cheng
Abstract:
In the smart hospital, optimizing prescription order fulfilment processes in outpatient pharmacies is crucial. A promising class of devices, automated drug dispensing systems (ADDSs), has emerged to streamline these processes. These systems involve human order pickers who are assisted by ADDSs. The ADDS's robotic arm transports bins from storage locations to the input/output (I/O) points, while the pharmacist sorts the requested drugs from the bins at the I/O points. This paper focuses on coordinating the ADDS and the pharmacists to optimize the order-picking strategy. Another critical aspect of order-picking systems is the storage location assignment problem (SLAP), which determines the allocation of drugs to storage locations. In this study, we consider the ADDS as a smart warehouse and propose a two-stage scattered storage and clustered allocation (SSCA) strategy to optimize the SLAP for ADDSs. The first stage primarily adopts a scattered storage approach, and we develop a mathematical programming model to group drugs accordingly. In the second stage, we introduce a sequential alternating (SA) heuristic algorithm that takes into account the drug demand frequency and the correlation between drugs to cluster and locate them effectively. To evaluate the proposed SSCA strategy, we develop a bi-objective integer programming model for the order-picking problem in ADDSs to minimize the number of machines visited in prescription orders while maintaining the shortest average picking time of orders. The numerical results demonstrate that the proposed strategy can optimize the SLAP in ADDSs and significantly improve the order-picking efficiency of ADDSs in a human-robot cooperation environment.
Submitted 18 December, 2023;
originally announced February 2024.
-
Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence
Authors:
Dimitris N. Politis,
Kejin Wu
Abstract:
To address the difficult problem of multi-step ahead prediction of non-parametric autoregressions, we consider a forward bootstrap approach. Employing a local constant estimator, we can analyze a general type of non-parametric time series model, and show that the proposed point predictions are consistent with the true optimal predictor. We construct a quantile prediction interval that is asymptotically valid. Moreover, using a debiasing technique, we can asymptotically approximate the distribution of multi-step ahead non-parametric estimation by bootstrap. As a result, we can build bootstrap prediction intervals that are pertinent, i.e., can capture the model estimation variability, thus improving upon the standard quantile prediction intervals. Simulation studies are given to illustrate the performance of our point predictions and pertinent prediction intervals for finite samples.
Submitted 1 November, 2023;
originally announced November 2023.
-
Large-Scale Gaussian Processes via Alternating Projection
Authors:
Kaiwen Wu,
Jonathan Wenger,
Haydn Jones,
Geoff Pleiss,
Jacob R. Gardner
Abstract:
Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.
Submitted 8 March, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models
Authors:
Yongchan Kwon,
Eric Wu,
Kevin Wu,
James Zou
Abstract:
Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.
Submitted 13 March, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms
Authors:
Keru Wu,
Yuansi Chen,
Wooseok Ha,
Bin Yu
Abstract:
Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets.
Submitted 8 July, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Dynamic Reconfiguration of Brain Functional Network in Stroke
Authors:
Kaichao Wu,
Beth Jelfs,
Katrina Neville,
Wenzhen He,
Qiang Fang
Abstract:
The brain continually reorganizes its functional network to adapt to post-stroke functional impairments. Previous studies using static modularity analysis have presented global-level behavior patterns of this network reorganization. However, it is far from understood how the brain reconfigures its functional network dynamically following a stroke. This study collected resting-state functional MRI data from 15 stroke patients, with mild (n = 6) and severe (n = 9) two subgroups based on their clinical symptoms. Additionally, 15 age-matched healthy subjects were considered as controls. By applying a multilayer network method, a dynamic modular structure was recognized based on a time-resolved function network. Then dynamic network measurements (recruitment, integration, and flexibility) were calculated to characterize the dynamic reconfiguration of post-stroke brain functional networks, hence, to reveal the neural functional rebuilding process. It was found from this investigation that severe patients tended to have reduced recruitment and increased between-network integration, while mild patients exhibited low network flexibility and less network integration. It is also noted that this severity-dependent alteration in network interaction was not able to be revealed by previous studies using static methods. Clinically, the obtained knowledge of the diverse patterns of dynamic adjustment in brain functional networks observed from the brain signal could help understand the underlying mechanism of the motor, speech, and cognitive functional impairments caused by stroke attacks. The proposed method not only could be used to evaluate patients' current brain status but also has the potential to provide insights into prognosis analysis and prediction.
Submitted 22 March, 2024; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Bayesian model calibration for diblock copolymer thin film self-assembly using power spectrum of microscopy data and machine learning surrogate
Authors:
Lianghao Cao,
Keyi Wu,
J. Tinsley Oden,
Peng Chen,
Omar Ghattas
Abstract:
Identifying parameters of computational models from experimental data, or model calibration, is fundamental for assessing and improving the predictability and reliability of computer simulations. In this work, we propose a method for Bayesian calibration of models that predict morphological patterns of diblock copolymer (Di-BCP) thin film self-assembly while accounting for various sources of uncertainties in pattern formation and data acquisition. This method extracts the azimuthally-averaged power spectrum (AAPS) of the top-down microscopy characterization of Di-BCP thin film patterns as summary statistics for Bayesian inference of model parameters via the pseudo-marginal method. We derive the analytical and approximate form of a conditional likelihood for the AAPS of image data. We demonstrate that AAPS-based image data reduction retains the mutual information, particularly on important length scales, between image data and model parameters while being relatively agnostic to the aleatoric uncertainties associated with the random long-range disorder of Di-BCP patterns. Additionally, we propose a phase-informed prior distribution for Bayesian model calibration. Furthermore, reducing image data to AAPS enables us to efficiently build surrogate models to accelerate the proposed Bayesian model calibration procedure. We present the formulation and training of two multi-layer perceptrons for approximating the parameter-to-spectrum map, which enables fast integrated likelihood evaluations. We validate the proposed Bayesian model calibration method through numerical examples, for which the neural network surrogate delivers a fivefold reduction of the number of model simulations performed for a single calibration task.
Submitted 3 August, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Bootstrap Prediction Inference of Non-linear Autoregressive Models
Authors:
Kejin Wu,
Dimitris N. Politis
Abstract:
The non-linear autoregressive (NLAR) model plays an important role in modeling and predicting time series. One-step ahead prediction is straightforward using the NLAR model, but the multi-step ahead prediction is cumbersome. For instance, iterating the one-step ahead predictor is a convenient strategy for linear autoregressive (LAR) models, but it is suboptimal under NLAR. In this paper, we first propose a simulation and/or bootstrap algorithm to construct optimal point predictors under an $L_1$ or $L_2$ loss criterion. In addition, we construct bootstrap prediction intervals in the multi-step ahead prediction problem; in particular, we develop an asymptotically valid quantile prediction interval as well as a pertinent prediction interval for future values. In order to correct the undercoverage of prediction intervals with finite samples, we further employ predictive -- as opposed to fitted -- residuals in the bootstrap process. Simulation studies are also given to substantiate the finite sample performance of our methods.
Submitted 6 June, 2023;
originally announced June 2023.
-
The Behavior and Convergence of Local Bayesian Optimization
Authors:
Kaiwen Wu,
Kyurae Kim,
Roman Garnett,
Jacob R. Gardner
Abstract:
A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.
Submitted 8 March, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
On the Convergence of Black-Box Variational Inference
Authors:
Kyurae Kim,
Jisu Oh,
Kaiwen Wu,
Yi-An Ma,
Jacob R. Gardner
Abstract:
We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.
Submitted 10 January, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian Inference
Authors:
Kyurae Kim,
Kaiwen Wu,
Jisu Oh,
Jacob R. Gardner
Abstract:
Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the $ABC$ condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence.
Submitted 3 June, 2023; v1 submitted 18 March, 2023;
originally announced March 2023.
-
Deep-OSG: Deep Learning of Operators in Semigroup
Authors:
Junfeng Chen,
Kailiang Wu
Abstract:
This paper proposes a novel deep learning approach for learning operators in semigroup, with applications to modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous flow map learning (FML) works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction.
Submitted 12 September, 2023; v1 submitted 7 February, 2023;
originally announced February 2023.
-
Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow
Authors:
Jeeyung Kim,
Mengtian Jin,
Youkow Homma,
Alex Sim,
Wilko Kroeger,
Kesheng Wu
Abstract:
In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
Submitted 19 May, 2022;
originally announced May 2022.
-
What Makes You Hold on to That Old Car? Joint Insights from Machine Learning and Multinomial Logit on Vehicle-level Transaction Decisions
Authors:
Ling Jin,
Alina Lazar,
Caitlin Brown,
Bingrong Sun,
Venu Garikapati,
Srinath Ravulaparthy,
Qianmiao Chen,
Alexander Sim,
Kesheng Wu,
Tin Ho,
Thomas Wenzel,
C. Anna Spurlock
Abstract:
What makes you hold on that old car? While the vast majority of the household vehicles are still powered by conventional internal combustion engines, the progress of adopting emerging vehicle technologies will critically depend on how soon the existing vehicles are transacted out of the household fleet. Leveraging a nationally representative longitudinal data set, the Panel Study of Income Dynamics, this study examines how household decisions to dispose of or replace a given vehicle are: (1) influenced by the vehicle's attributes, (2) mediated by households' concurrent socio-demographic and economic attributes, and (3) triggered by key life cycle events. Coupled with a newly developed machine learning interpretation tool, TreeExplainer, we demonstrate an innovative use of machine learning models to augment traditional logit modeling to both generate behavioral insights and improve model performance. We find the two gradient-boosting-based methods, CatBoost and LightGBM, are the best performing machine learning models for this problem. The multinomial logistic model can achieve similar performance levels after its model specification is informed by TreeExplainer. Both machine learning and multinomial logit models suggest that while older vehicles are more likely to be disposed of or replaced than newer ones, such probability decreases as the vehicles serve the family longer. We find that married families, families with higher education levels, homeowners, and older families tend to keep their vehicles longer. Life events such as childbirth, residential relocation, and change of household composition and income are found to increase vehicle disposal and/or replacement. We provide additional insights on the timing of vehicle replacement or disposal, in particular, the presence of children and childbirth events are more strongly associated with vehicle replacement among younger parents.
Submitted 13 May, 2022;
originally announced May 2022.
-
STICC: A multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity
Authors:
Yuhao Kang,
Kunlin Wu,
Song Gao,
Ignavier Ng,
Jinmeng Rao,
Shan Ye,
Fan Zhang,
Teng Fei
Abstract:
Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-Based Clustering (STICC) method that considers both attributes and spatial relationships of geographic objects for multivariate spatial clustering. A subregion is created for each geographic object serving as the basic unit when performing clustering. A Markov random field is then constructed to characterize the attribute dependencies of subregions. Using a spatial consistency strategy, nearby objects are encouraged to belong to the same cluster. To test the performance of the proposed STICC algorithm, we apply it in two use cases. The comparison results with several baseline methods show that the STICC outperforms others significantly in terms of adjusted rand index and macro-F1 score. Join count statistics is also calculated and shows that the spatial contiguity is well preserved by STICC. Such a spatial clustering method may benefit various applications in the fields of geography, remote sensing, transportation, and urban planning, etc.
Submitted 30 March, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains
Authors:
Haitao Liu,
Kai Wu,
Yew-Soon Ong,
Chao Bian,
Xiaomo Jiang,
Xiaofang Wang
Abstract:
Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) model for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem.
Submitted 18 June, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
A New Model-free Prediction Method: GA-NoVaS
Authors:
Kejin Wu,
Sayar Karmakar
Abstract:
Volatility forecasting plays an important role in financial econometrics. Previous work in this area is mainly based on applying various GARCH-type models. However, it is hard to choose a specific GARCH model that works for general cases, and such traditional methods are unstable when dealing with highly volatile periods or small sample sizes. The newly proposed normalizing and variance stabilizing (NoVaS) method is a more robust and accurate prediction technique. This model-free method is built by taking advantage of an inverse transformation which is based on the ARCH model. Inspired by the historic development of the ARCH to GARCH model, we propose a novel NoVaS-type method which exploits the GARCH model structure. By performing extensive data analysis, we find our model has better time-aggregated prediction performance than the current state-of-the-art NoVaS method on forecasting short and volatile data. The success of our new method corroborates this and also opens up avenues where one can explore other NoVaS structures to improve on the existing ones or solve specific prediction problems.
Submitted 15 December, 2021;
originally announced December 2021.
-
Minimax Mixing Time of the Metropolis-Adjusted Langevin Algorithm for Log-Concave Sampling
Authors:
Keru Wu,
Scott Schmidler,
Yuansi Chen
Abstract:
We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $κ$, we show that MALA with a warm start mixes in $\tilde O(κ\sqrt{d})$ iterations up to logarithmic factors. This improves upon the previous work on the dependency of either the condition number $κ$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, where we establish a new concentration bound for the acceptance rate. Second, we prove a spectral gap based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde Ω(κ\sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of condition number and dimension. Finally, numerical experiments are included to validate our theoretical results.
Submitted 2 October, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
Investigating Underlying Drivers of Variability in Residential Energy Usage Patterns with Daily Load Shape Clustering of Smart Meter Data
Authors:
Ling Jin,
C. Anna Spurlock,
Sam Borgeson,
Alina Lazar,
Daniel Fredman,
Annika Todd,
Alexander Sim,
Kesheng Wu
Abstract:
Residential customers have traditionally not been treated as individual entities due to the high volatility in residential consumption patterns as well as a historic focus on aggregated loads from the utility and system feeder perspective. Large-scale deployment of smart meters has motivated increasing studies to explore disaggregated daily load patterns, which can reveal important heterogeneity across different time scales, weather conditions, as well as within and across individual households. This paper aims to shed light on the mechanisms by which electricity consumption patterns exhibit variability and the different constraints that may affect demand-response (DR) flexibility. We systematically evaluate the relationship between daily time-of-use patterns and their variability to external and internal influencing factors, including time scales of interest, meteorological conditions, and household characteristics by application of an improved version of the adaptive K-means clustering method to profile "household-days" of a summer peaking utility. We find that for this summer-peaking utility, outdoor temperature is the most important external driver of the load shape variability relative to seasonality and day-of-week. The top three consumption patterns represent approximately 50% of usage on the highest temperature days. The variability in summer load shapes across customers can be explained by the responsiveness of the households to outside temperature. Our results suggest that depending on the influencing factors, not all the consumption variability can be readily translated to consumption flexibility. Such information needs to be further explored in segmenting customers for better program targeting and tailoring to meet the needs of the rapidly evolving electricity grid.
Submitted 16 February, 2021;
originally announced February 2021.
-
Model-free time-aggregated predictions for econometric datasets
Authors:
Kejin Wu,
Sayar Karmakar
Abstract:
This article explores the existing normalizing and variance-stabilizing (NoVaS) method on predicting squared log-returns of financial data. First, we explore the robustness of the existing NoVaS method for long-term time-aggregated predictions. Then we develop a more parsimonious variant of the existing method. With systematic justification and extensive data analysis, our new method shows better performance than current NoVaS and standard GARCH(1,1) methods on both short- and long-term time-aggregated predictions.
Submitted 4 November, 2021; v1 submitted 6 January, 2021;
originally announced January 2021.
-
CAN: Feature Co-Action for Click-Through Rate Prediction
Authors:
Weijie Bian,
Kailun Wu,
Lejian Ren,
Qi Pi,
Yujing Zhang,
Can Xiao,
Xiang-Rong Sheng,
Yong-Nan Zhu,
Zhangming Chan,
Na Mou,
Xinchen Luo,
Shiming Xiang,
Guorui Zhou,
Xiaoqiang Zhu,
Hongbo Deng
Abstract:
Feature interaction has been recognized as an important problem in machine learning, and it is essential for click-through rate (CTR) prediction tasks. In recent years, Deep Neural Networks (DNNs), which can automatically learn implicit nonlinear interactions from original sparse features, have been widely used in industrial CTR prediction tasks. However, the implicit feature interactions learned in DNNs cannot fully retain the complete representation capacity of the original and empirical feature interactions (e.g., cartesian product) without loss. For example, a simple attempt to learn the combination of feature A and feature B <A, B> as the explicit cartesian product representation of new features can outperform previous implicit feature interaction models including factorization machine (FM)-based models and their variations. In this paper, we propose a Co-Action Network (CAN) to approximate the explicit pairwise feature interactions without introducing too many additional parameters. More specifically, given feature A and its associated feature B, their feature interaction is modeled by learning two sets of parameters: 1) the embedding of feature A, and 2) a Multi-Layer Perceptron (MLP) to represent feature B. The approximated feature interaction can be obtained by passing the embedding of feature A through the MLP network of feature B. We refer to such pairwise feature interaction as feature co-action, and such a Co-Action Network unit provides a very powerful capacity for fitting complex feature interactions. Experimental results on public and industrial datasets show that CAN outperforms state-of-the-art CTR models and the cartesian product method. Moreover, CAN has been deployed in the display advertisement system in Alibaba, obtaining a 12% improvement in CTR and 8% in Revenue Per Mille (RPM), which is a great improvement to the business.
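The co-action idea described in this abstract (the embedding of feature A passed through an MLP that represents feature B) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `co_action`, the layer sizes, the flat parameter layout, and the tanh activation are all assumptions chosen for illustration.

```python
import numpy as np

def co_action(emb_a, mlp_params_b, layer_sizes):
    """Pass the embedding of feature A through a small MLP whose weights
    and biases are read from the parameter vector learned for feature B.

    Hypothetical layout: for each layer, a flattened (in_dim x out_dim)
    weight matrix followed by an out_dim bias vector.
    """
    x = emb_a
    in_dim = emb_a.shape[0]
    offset = 0
    for out_dim in layer_sizes:
        w = mlp_params_b[offset:offset + in_dim * out_dim].reshape(in_dim, out_dim)
        offset += in_dim * out_dim
        b = mlp_params_b[offset:offset + out_dim]
        offset += out_dim
        x = np.tanh(x @ w + b)  # activation choice is an assumption
        in_dim = out_dim
    return x

rng = np.random.default_rng(0)
emb_a = rng.normal(size=4)             # embedding of feature A
layer_sizes = [3, 2]                   # hypothetical MLP widths
n_params = 4 * 3 + 3 + 3 * 2 + 2       # weights and biases for both layers
params_b = rng.normal(size=n_params)   # parameter vector tied to feature B
out = co_action(emb_a, params_b, layer_sizes)
```

In this sketch each value of feature B owns one parameter vector, so the number of added parameters grows with the number of B values rather than with the number of (A, B) pairs, which is the saving over an explicit cartesian product that the abstract alludes to.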
Submitted 7 December, 2021; v1 submitted 11 November, 2020;
originally announced November 2020.
-
Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities
Authors:
Peng Chen,
Keyi Wu,
Omar Ghattas
Abstract:
We propose a high-dimensional Bayesian inference framework for learning heterogeneous dynamics of a COVID-19 model, with a specific application to the dynamics and severity of COVID-19 inside and outside long-term care (LTC) facilities. We develop a heterogeneous compartmental model that accounts for the heterogeneity of the time-varying spread and severity of COVID-19 inside and outside LTC facilities, which is characterized by time-dependent stochastic processes and time-independent parameters in $\sim$1500 dimensions after discretization. To infer these parameters, we use reported data on the number of confirmed, hospitalized, and deceased cases with suitable post-processing, first in a deterministic inversion approach with appropriate regularization, followed by Bayesian inversion with proper prior distributions. To address the curse of dimensionality and the ill-posedness of the high-dimensional inference problem, we propose the use of a dimension-independent projected Stein variational gradient descent method, and demonstrate the intrinsic low-dimensionality of the inverse problem. We present inference results with quantified uncertainties for both New Jersey and Texas, which experienced different epidemic phases and patterns. We also present forecasting and validation results based on the empirical posterior samples of our inference for the future trajectory of COVID-19.
Submitted 2 November, 2020;
originally announced November 2020.
-
Stronger and Faster Wasserstein Adversarial Attacks
Authors:
Kaiwen Wu,
Allen Houze Wang,
Yaoliang Yu
Abstract:
Deep models, while being extremely flexible and accurate, are surprisingly vulnerable to "small, imperceptible" perturbations known as adversarial attacks. While the majority of existing attacks focus on measuring perturbations under the $\ell_p$ metric, Wasserstein distance, which takes geometry in pixel space into account, has long been known to be a suitable metric for measuring image quality and has recently risen as a compelling alternative to the $\ell_p$ metric in adversarial attacks. However, constructing an effective attack under the Wasserstein metric is computationally much more challenging and calls for better optimization algorithms. We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show that the Frank-Wolfe method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. Our algorithms not only converge faster but also generate much stronger attacks. For instance, we decrease the accuracy of a residual network on CIFAR-10 to $3.4\%$ within a Wasserstein perturbation ball of radius $0.005$, in contrast to $65.6\%$ using the previous Wasserstein attack based on an \emph{approximate} projection operator. Furthermore, employing our stronger attacks in adversarial training significantly improves the robustness of adversarially trained models.
Submitted 6 August, 2020;
originally announced August 2020.
-
A Theory of Multiple-Source Adaptation with Limited Target Labeled Data
Authors:
Yishay Mansour,
Mehryar Mohri,
Jae Ro,
Ananda Theertha Suresh,
Ke Wu
Abstract:
We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but has at its disposal a large amount of labeled data from multiple source domains. We show that a new family of algorithms based on model selection ideas benefits from very favorable guarantees in this scenario and discuss some theoretical obstacles affecting some alternative techniques. We also report the results of several experiments with our algorithms that demonstrate their practical effectiveness.
Submitted 29 October, 2020; v1 submitted 19 July, 2020;
originally announced July 2020.
-
Newton-type Methods for Minimax Optimization
Authors:
Guojun Zhang,
Kaiwen Wu,
Pascal Poupart,
Yaoliang Yu
Abstract:
Differential games, in particular two-player sequential zero-sum games (a.k.a. minimax optimization), have been an important modeling tool in applied science and received renewed interest in machine learning due to many recent applications, such as adversarial training, generative models and reinforcement learning. However, existing theory mostly focuses on convex-concave functions with few exceptions. In this work, we propose two novel Newton-type algorithms for nonconvex-nonconcave minimax optimization. We prove their local convergence at strict local minimax points, which are surrogates of global solutions. We argue that our Newton-type algorithms nicely complement existing ones in that (a) they converge faster to strict local minimax points; (b) they are much more effective when the problem is ill-conditioned; (c) their computational complexity remains similar. We verify the effectiveness of our Newton-type algorithms through experiments on training GANs which are intrinsically nonconvex and ill-conditioned. Our code is available at https://github.com/watml/min-max-2nd-order.
Submitted 18 February, 2023; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Generalized logistic growth modeling of the COVID-19 outbreak: comparing the dynamics in the 29 provinces in China and in the rest of the world
Authors:
Ke Wu,
Didier Darcet,
Qian Wang,
Didier Sornette
Abstract:
Having started in Wuhan, China, COVID-19 has been spreading all over the world. We calibrate the logistic growth model, the generalized logistic growth model, the generalized Richards model and the generalized growth model to the reported number of infected cases for the whole of China, 29 provinces in China, and 33 countries and regions that have been or are undergoing major outbreaks. We dissect the development of the epidemics in China and the impact of the drastic control measures both at the aggregate level and within each province. We quantitatively document four phases of the outbreak in China with a detailed analysis of the heterogeneous situations across provinces. The extreme containment measures implemented by China were very effective, with some instructive variations across provinces. Borrowing from the experience of China, we made scenario projections on the development of the outbreak in other countries. We identified that outbreaks in 14 countries (mostly in western Europe) have ended, while resurgences of cases have been identified in several of them. The modeling results clearly show longer after-peak trajectories in western countries, in contrast to most provinces in China where the after-peak trajectory is characterized by a much faster decay. We identified three groups of countries at different levels of outbreak progress, and provide informative implications for the current global pandemic.
Submitted 22 September, 2020; v1 submitted 12 March, 2020;
originally announced March 2020.
-
Diffusion State Distances: Multitemporal Analysis, Fast Algorithms, and Applications to Biological Networks
Authors:
Lenore Cowen,
Kapil Devkota,
Xiaozhe Hu,
James M. Murphy,
Kaiyi Wu
Abstract:
Data-dependent metrics are powerful tools for learning the underlying structure of high-dimensional data. This article develops and analyzes a data-dependent metric known as diffusion state distance (DSD), which compares points using a data-driven diffusion process. Unlike related diffusion methods, DSDs incorporate information across time scales, which allows for the intrinsic data structure to be inferred in a parameter-free manner. This article develops a theory for DSD based on the multitemporal emergence of mesoscopic equilibria in the underlying diffusion process. New algorithms for denoising and dimension reduction with DSD are also proposed and analyzed. These approaches are based on a weighted spectral decomposition of the underlying diffusion process, and experiments on synthetic datasets and real biological networks illustrate the efficacy of the proposed algorithms in terms of both speed and accuracy. Throughout, comparisons with related methods are made, in order to illustrate the distinct advantages of DSD for datasets exhibiting multiscale structure.
Submitted 7 March, 2020;
originally announced March 2020.
-
Methods to Recover Unknown Processes in Partial Differential Equations Using Data
Authors:
Zhen Chen,
Kailiang Wu,
Dongbin Xiu
Abstract:
We study the problem of identifying unknown processes embedded in a time-dependent partial differential equation (PDE) using observational data, with an application to advection-diffusion type PDEs. We first conduct theoretical analysis and derive conditions to ensure the solvability of the problem. We then present a set of numerical approaches, including a Galerkin type algorithm and a collocation type algorithm. Analysis of the algorithms is presented, along with their implementation details. The Galerkin algorithm is more suitable for practical situations, particularly those with noisy data, as it avoids using derivative/gradient data. Various numerical examples are then presented to demonstrate the performance and properties of the numerical methods.
Submitted 4 March, 2020;
originally announced March 2020.
-
Learning to Generate Time Series Conditioned Graphs with Generative Adversarial Nets
Authors:
Shanchao Yang,
Jing Liu,
Kai Wu,
Mingming Li
Abstract:
Deep learning based approaches have recently been utilized to model and generate graphs subject to different distributions. However, they are typically unsupervised and unconditioned generative models, or are simply conditioned on graph-level contexts, which are not associated with rich semantic node-level contexts. In contrast, in this paper we are interested in a novel problem named Time Series Conditioned Graph Generation: given an input multivariate time series, we aim to infer a target relation graph modeling the underlying interrelationships between time series, with each node corresponding to one time series. For example, we can study the interrelationships between genes in a gene regulatory network of a certain disease conditioned on their gene expression data recorded as time series. To achieve this, we propose a novel Time Series conditioned Graph Generation-Generative Adversarial Networks (TSGG-GAN) to handle the challenges of conditioning on rich node-level context structures and measuring similarities directly between graphs and time series. Extensive experiments on synthetic and real-world gene regulatory network datasets demonstrate the effectiveness and generalizability of the proposed TSGG-GAN.
Submitted 26 August, 2023; v1 submitted 3 March, 2020;
originally announced March 2020.
-
A Non-Intrusive Correction Algorithm for Classification Problems with Corrupted Data
Authors:
Jun Hou,
Tong Qin,
Kailiang Wu,
Dongbin Xiu
Abstract:
A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximator, such as logistic regression, neural networks of various architectures, etc. When the training dataset is sufficiently large, we prove that the corrected models deliver correct classification results as if there were no corruption in the training data. For datasets of finite size, the corrected models produce significantly better recovery results, compared to the models without the correction algorithm. All of the theoretical findings in the paper are verified by our numerical examples.
Submitted 11 February, 2020;
originally announced February 2020.
-
Data-Driven Deep Learning of Partial Differential Equations in Modal Space
Authors:
Kailiang Wu,
Dongbin Xiu
Abstract:
We present a framework for recovering/approximating an unknown time-dependent partial differential equation (PDE) using its solution data. Instead of identifying the terms in the underlying PDE, we seek to approximate the evolution operator of the underlying PDE numerically. The evolution operator of the PDE, defined in infinite-dimensional space, maps the solution from a current time to a future time and completely characterizes the solution evolution of the underlying unknown PDE. Our recovery strategy relies on approximation of the evolution operator in a properly defined modal space, i.e., generalized Fourier space, in order to reduce the problem to finite dimensions. The finite-dimensional approximation is then accomplished by training a deep neural network structure, which is based on the residual network (ResNet), using the given data. Error analysis is provided to illustrate the predictive accuracy of the proposed method. A set of examples of different types of PDEs, including the inviscid Burgers' equation that develops a discontinuity in its solution, are presented to demonstrate the effectiveness of the proposed method.
Submitted 18 October, 2019; v1 submitted 15 October, 2019;
originally announced October 2019.
-
Understanding Adversarial Robustness: The Trade-off between Minimum and Average Margin
Authors:
Kaiwen Wu,
Yaoliang Yu
Abstract:
Deep models, while being extremely versatile and accurate, are vulnerable to adversarial attacks: slight perturbations that are imperceptible to humans can completely flip the prediction of deep models. Many attack and defense mechanisms have been proposed, although a satisfying solution still largely remains elusive. In this work, we give strong evidence that during training, deep models maximize the minimum margin in order to achieve high accuracy, but at the same time decrease the \emph{average} margin hence hurting robustness. Our empirical results highlight an intrinsic trade-off between accuracy and robustness for current deep model training. To further address this issue, we propose a new regularizer to explicitly promote average margin, and we verify through extensive experiments that it does lead to better robustness. Our regularized objective remains Fisher-consistent, hence asymptotically can still recover the Bayes optimal classifier.
Submitted 26 July, 2019;
originally announced July 2019.
-
Res-embedding for Deep Learning Based Click-Through Rate Prediction Modeling
Authors:
Guorui Zhou,
Kailun Wu,
Weijie Bian,
Zhao Yang,
Xiaoqiang Zhu,
Kun Gai
Abstract:
Recently, click-through rate (CTR) prediction models have evolved from shallow methods to deep neural networks. Most deep CTR models follow an Embedding&MLP paradigm, that is, first mapping discrete id features, e.g. user visited items, into low-dimensional vectors with an embedding module, then learning a multi-layer perceptron (MLP) to fit the target. In this way, the embedding module performs representation learning and plays a key role in the model performance. However, in many real-world applications, deep CTR models often suffer from poor generalization performance, which is mostly due to the learning of embedding parameters. In this paper, we model user behavior using an interest delay model, study carefully the embedding mechanism, and obtain two important results: (i) We theoretically prove that a small aggregation radius of the embedding vectors of items that belong to the same user interest domain will result in good generalization performance of a deep CTR model. (ii) Following our theoretical analysis, we design a new embedding structure named res-embedding. In the res-embedding module, the embedding vector of each item is the sum of two components: (a) a central embedding vector calculated from an item-based interest graph, and (b) a residual embedding vector whose scale is relatively small. Empirical evaluation on several public datasets demonstrates the effectiveness of the proposed res-embedding structure, which brings significant improvement in model performance.
Submitted 24 June, 2019;
originally announced June 2019.
-
Structure-preserving Method for Reconstructing Unknown Hamiltonian Systems from Trajectory Data
Authors:
Kailiang Wu,
Tong Qin,
Dongbin Xiu
Abstract:
We present a numerical approach for approximating unknown Hamiltonian systems using observation data. A distinct feature of the proposed method is that it is structure-preserving, in the sense that it enforces conservation of the reconstructed Hamiltonian. This is achieved by directly approximating the underlying unknown Hamiltonian, rather than the right-hand-side of the governing equations. We present the technical details of the proposed algorithm and its error estimate in a special case, along with a practical de-noising procedure to cope with noisy data. A set of numerical examples are then presented to demonstrate the structure-preserving property and effectiveness of the algorithm.
Submitted 19 August, 2020; v1 submitted 24 May, 2019;
originally announced May 2019.
-
Distributional Reinforcement Learning for Efficient Exploration
Authors:
Borislav Mavrin,
Shangtong Zhang,
Hengshuai Yao,
Linglong Kong,
Kaiwen Wu,
Yaoliang Yu
Abstract:
In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving 483% average gain across 49 games in cumulative rewards over QR-DQN with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.
Submitted 13 May, 2019;
originally announced May 2019.
-
Fast Transient Simulation of High-Speed Channels Using Recurrent Neural Network
Authors:
Thong Nguyen,
Tianjian Lu,
Ken Wu,
Jose Schutt-Aine
Abstract:
Generating eye diagrams with a circuit simulator can be very computationally intensive, especially in the presence of nonlinearities. It often involves multiple Newton-like iterations at every time step when a SPICE-like circuit simulator handles a nonlinear system in the transient regime. In this paper, we leverage machine learning methods, specifically the recurrent neural network (RNN), to generate black-box macromodels and achieve a significant reduction in computation time. In the proposed approach, an RNN model is first trained and then validated on a relatively short sequence generated from a circuit simulator. Once the training completes, the RNN can be used to make predictions on the remaining sequence in order to generate an eye diagram. The training cost can also be amortized when the trained RNN starts making predictions. In addition, the proposed approach requires neither complex circuit simulations nor substantial domain knowledge. We use two high-speed link examples to demonstrate that the proposed approach provides adequate accuracy while the computation time can be dramatically reduced. In the high-speed link example with a PAM4 driver, the eye diagram generated by RNN models shows good agreement with that obtained from a commercial circuit simulator. This paper also investigates the impacts of various RNN topologies, training schemes, and tunable parameters on both the accuracy and the generalization capability of an RNN model. We find that the long short-term memory (LSTM) network outperforms the vanilla RNN in terms of accuracy in predicting transient waveforms.
Submitted 7 February, 2019; v1 submitted 25 January, 2019;
originally announced February 2019.