Search | arXiv e-print repository

Spark Transformer: Reactivating Sparsity in FFN and Attention

Authors: Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah J. Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E. Culler, Henry M. Levy, Sanjiv Kumar

Abstract: The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the Re… ▽ More The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges. This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU. △ Less

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2505.07232 [pdf, ps, other]

Spatial Confounding in Multivariate Areal Data Analysis

Authors: Kyle Lin Wu, Sudipto Banerjee

Abstract: We investigate spatial confounding in the presence of multivariate disease dependence. In the "analysis model perspective" of spatial confounding, adding a spatially dependent random effect can lead to significant variance inflation of the posterior distribution of the fixed effects. The "data generation perspective" views covariates as stochastic and correlated with an unobserved spatial confound… ▽ More We investigate spatial confounding in the presence of multivariate disease dependence. In the "analysis model perspective" of spatial confounding, adding a spatially dependent random effect can lead to significant variance inflation of the posterior distribution of the fixed effects. The "data generation perspective" views covariates as stochastic and correlated with an unobserved spatial confounder, leading to inferior statistical inference over multiple realizations. While multiple methods have been proposed for adjusting statistical models to mitigate spatial confounding in estimating regression coefficients, results on interactions between spatial confounding and multivariate dependence are very limited. We contribute to this domain by investigating spatial confounding from the analysis and data generation perspectives in a Bayesian coregionalized areal regression model. We derive novel results that distinguish variance inflation due to spatial confounding from inflation based on multicollinearity between predictors and provide insights into the estimation efficiency of a spatial estimator under a spatially confounded data generation model. We demonstrate favorable performance of spatial analysis compared to a non-spatial model in our simulation experiments even in the presence of spatial confounding and a misspecified spatial structure. In this regard, we align with several other authors in the defense of traditional hierarchical spatial models (Gilbert et al., 2025; Khan and Berrett, 2023; Zimmerman and Ver Hoef, 2022) and extend this defense to multivariate areal models. We analyze county-level data from the US on obesity / diabetes prevalence and diabetes-related cancer mortality, comparing the results with and without spatial random effects. △ Less

Submitted 12 May, 2025; originally announced May 2025.

Comments: 26 pages, 2 figures

arXiv:2504.10373 [pdf, other]

DUE: A Deep Learning Framework and Library for Modeling Unknown Equations

Authors: Junfeng Chen, Kailiang Wu, Dongbin Xiu

Abstract: Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown e… ▽ More Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown equations from measurement or simulation data. This paradigm shift, known as data-driven discovery or modeling, stands at the forefront of AI for science, with significant progress made in recent years. In this paper, we introduce a systematic framework for data-driven modeling of unknown equations using deep learning. This versatile framework is capable of learning unknown ODEs, PDEs, DAEs, IDEs, SDEs, reduced or partially observed systems, and non-autonomous differential equations. Based on this framework, we have developed Deep Unknown Equations (DUE), an open-source software package designed to facilitate the data-driven modeling of unknown equations using modern deep learning techniques. DUE serves as an educational tool for classroom instruction, enabling students and newcomers to gain hands-on experience with differential equations, data-driven modeling, and contemporary deep learning approaches such as FNN, ResNet, generalized ResNet, operator semigroup networks (OSG-Net), and Transformers. Additionally, DUE is a versatile and accessible toolkit for researchers across various scientific and engineering fields. It is applicable not only for learning unknown equations from data but also for surrogate modeling of known, yet complex, equations that are costly to solve using traditional numerical methods. We provide detailed descriptions of DUE and demonstrate its capabilities through diverse examples, which serve as templates that can be easily adapted for other applications. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Comments: 28 pages

arXiv:2503.22616 [pdf, other]

A Unified Approach for Estimating Various Treatment Effects in Causal Inference

Authors: Kuan-Hsun Wu, Li-Pang Chen

Abstract: In this paper, we introduce a unified estimator to analyze various treatment effects in causal inference, including but not limited to the average treatment effect (ATE) and the quantile treatment effect (QTE). The proposed estimator is developed under the statistical functional and cumulative distribution function structure, which leads to a flexible and robust estimator and covers some frequent… ▽ More In this paper, we introduce a unified estimator to analyze various treatment effects in causal inference, including but not limited to the average treatment effect (ATE) and the quantile treatment effect (QTE). The proposed estimator is developed under the statistical functional and cumulative distribution function structure, which leads to a flexible and robust estimator and covers some frequent treatment effects. In addition, our approach also takes variable selection into account, so that informative and network structure in confounders can be identified and be implemented in our estimation procedure. The theoretical properties, including variable selection consistency and asymptotic normality of the statistical functional estimator, are established. Various treatment effects estimations are also conducted in numerical studies, and the results reveal that the proposed estimator generally outperforms the existing methods and is more efficient than its competitors. △ Less

Submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.04138 [pdf, other]

Mixed Likelihood Variational Gaussian Processes

Authors: Kaiwen Wu, Craig Sanders, Benjamin Letham, Phillip Guan

Abstract: Gaussian processes (GPs) are powerful models for human-in-the-loop experiments due to their flexibility and well-calibrated uncertainty. However, GPs modeling human responses typically ignore auxiliary information, including a priori domain expertise and non-task performance information like user confidence ratings. We propose mixed likelihood variational GPs to leverage auxiliary information, whi… ▽ More Gaussian processes (GPs) are powerful models for human-in-the-loop experiments due to their flexibility and well-calibrated uncertainty. However, GPs modeling human responses typically ignore auxiliary information, including a priori domain expertise and non-task performance information like user confidence ratings. We propose mixed likelihood variational GPs to leverage auxiliary information, which combine multiple likelihoods in a single evidence lower bound to model multiple types of data. We demonstrate the benefits of mixing likelihoods in three real-world experiments with human participants. First, we use mixed likelihood training to impose prior knowledge constraints in GP classifiers, which accelerates active learning in a visual perception task where users are asked to identify geometric errors resulting from camera position errors in virtual reality. Second, we show that leveraging Likert scale confidence ratings by mixed likelihood training improves model fitting for haptic perception of surface roughness. Lastly, we show that Likert scale confidence ratings improve human preference learning in robot gait optimization. The modeling performance improvements found using our framework across this diverse set of applications illustrates the benefits of incorporating auxiliary information into active learning and preference learning by using mixed likelihoods to jointly model multiple inputs. △ Less

Submitted 6 March, 2025; originally announced March 2025.

Comments: 16 pages

arXiv:2502.17772 [pdf, other]

An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Authors: Hao Liang, Wanrong Zhang, Xinlei He, Kaishun Wu, Hong Xing

Abstract: Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantees often come at the cost of model performance, largely due to the inherent challenge of accurately quantifying privacy loss. While recent efforts have strengthened privacy guarantees by focusing solely on the final output and b… ▽ More Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantees often come at the cost of model performance, largely due to the inherent challenge of accurately quantifying privacy loss. While recent efforts have strengthened privacy guarantees by focusing solely on the final output and bounded domain cases, they still impose restrictive assumptions, such as convexity and other parameter limitations, and often lack a thorough analysis of utility. In this paper, we provide rigorous privacy and utility characterization for DPSGD for smooth loss functions in both bounded and unbounded domains. We track the privacy loss over multiple iterations by exploiting the noisy smooth-reduction property and establish the utility analysis by leveraging the projection's non-expansiveness and clipped SGD properties. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, and (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions. Numerical results validate our results. △ Less

Submitted 28 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

Comments: 18 pages, 2 figures, submitted for possible publication

arXiv:2412.13574 [pdf]

Revisiting Interactions of Multiple Driver States in Heterogenous Population and Cognitive Tasks

Authors: Jiyao Wang, Ange Wang, Song Yan, Dengbo He, Kaishun Wu

Abstract: In real-world driving scenarios, multiple states occur simultaneously due to individual differences and environmental factors, complicating the analysis and estimation of driver states. Previous studies, limited by experimental design and analytical methods, may not be able to disentangle the relationships among multiple driver states and environmental factors. This paper introduces the Double Mac… ▽ More In real-world driving scenarios, multiple states occur simultaneously due to individual differences and environmental factors, complicating the analysis and estimation of driver states. Previous studies, limited by experimental design and analytical methods, may not be able to disentangle the relationships among multiple driver states and environmental factors. This paper introduces the Double Machine Learning (DML) analysis method to the field of driver state analysis to tackle this challenge. To train and test the DML model, a driving simulator experiment with 42 participants was conducted. All participants drove SAE level-3 vehicles and conducted three types of cognitive tasks in a 3-hour driving experiment. Drivers' subjective cognitive load and drowsiness levels were collected throughout the experiment. Then, we isolated individual and environmental factors affecting driver state variations and the factors affecting drivers' physiological and eye-tracking metrics when they are under specific states. The results show that our approach successfully decoupled and inferred the complex causal relationships between multiple types of drowsiness and cognitive load. Additionally, we identified key physiological and eye-tracking indicators in the presence of multiple driver states and under the influence of a single state, excluding the influence of other driver states, environmental factors, and individual characteristics. Our causal inference analytical framework can offer new insights for subsequent analysis of drivers' states. Further, the updated causal relation graph based on the DML analysis can provide theoretical bases for driver state monitoring based on physiological and eye-tracking measures. △ Less

Submitted 19 December, 2024; v1 submitted 18 December, 2024; originally announced December 2024.

arXiv:2411.17154 [pdf, other]

Emergenet: A Digital Twin of Sequence Evolution for Scalable Emergence Risk Assessment of Animal Influenza A Strains

Authors: Kevin Yuanbo Wu, Jin Li, Aaron Esser-Kahn, Ishanu Chattopadhyay

Abstract: Despite having triggered devastating pandemics in the past, our ability to quantitatively assess the emergence potential of individual strains of animal influenza viruses remains limited. This study introduces Emergenet, a tool to infer a digital twin of sequence evolution to chart how new variants might emerge in the wild. Our predictions based on Emergenets built only using 220,151 Hemagglutinni… ▽ More Despite having triggered devastating pandemics in the past, our ability to quantitatively assess the emergence potential of individual strains of animal influenza viruses remains limited. This study introduces Emergenet, a tool to infer a digital twin of sequence evolution to chart how new variants might emerge in the wild. Our predictions based on Emergenets built only using 220,151 Hemagglutinnin (HA) sequences consistently outperform WHO seasonal vaccine recommendations for H1N1/H3N2 subtypes over two decades (average match-improvement: 3.73 AAs, 28.40\%), and are at par with state-of-the-art approaches that use more detailed phenotypic annotations. Finally, our generative models are used to scalably calculate the current odds of emergence of animal strains not yet in human circulation, which strongly correlates with CDC's expert-assessed Influenza Risk Assessment Tool (IRAT) scores (Pearson's $r = 0.721, p = 10^{-4}$). A minimum five orders of magnitude speedup over CDC's assessment (seconds vs months) then enabled us to analyze 6,354 animal strains collected post-2020 to identify 35 strains with high emergence scores ($> 7.7$). The Emergenet framework opens the door to preemptive pandemic mitigation through targeted inoculation of animal hosts before the first human infection. △ Less

Submitted 26 November, 2024; originally announced November 2024.

Comments: 35 pages, 15 figures

arXiv:2411.01036 [pdf, ps, other]

Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference

Authors: Jonathan Wenger, Kaiwen Wu, Philipp Hennig, Jacob R. Gardner, Geoff Pleiss, John P. Cunningham

Abstract: Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables -- at the cost of quadratic complexity -- an explicit tradeoff between computation and precision. Here we exte… ▽ More Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables -- at the cost of quadratic complexity -- an explicit tradeoff between computation and precision. Here we extend this development to model selection, which requires significant enhancements to the existing approach, including linear-time scaling in the size of the dataset. We propose a novel training loss for hyperparameter optimization and demonstrate empirically that the resulting method can outperform SGPR, CGGP and SVGP, state-of-the-art methods for GP model selection, on medium to large-scale datasets. Our experiments show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty -- a fundamental prerequisite for optimal decision-making. △ Less

Submitted 7 July, 2025; v1 submitted 1 November, 2024; originally announced November 2024.

Comments: Advances in Neural Information Processing Systems (NeurIPS 2024)

arXiv:2408.09532 [pdf, other]

Deep Limit Model-free Prediction in Regression

Authors: Kejin Wu, Dimitris N. Politis

Abstract: In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction interval under a general regression setting. Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, this classical method relies heavily on the correct model specification. Even for the non-para… ▽ More In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction interval under a general regression setting. Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, this classical method relies heavily on the correct model specification. Even for the non-parametric approach, some additive form is often assumed. A newly proposed Model-free prediction principle sheds light on a prediction procedure without any model assumption. Previous work regarding this principle has shown better performance than other standard alternatives. Recently, DNN, one of the machine learning methods, has received increasing attention due to its great performance in practice. Guided by the Model-free prediction idea, we attempt to apply a fully connected forward DNN to map X and some appropriate reference random variable Z to Y. The targeted DNN is trained by minimizing a specially designed loss function so that the randomness of Y conditional on X is outsourced to Z through the trained DNN. Our method is more stable and accurate compared to other DNN-based counterparts, especially for optimal point predictions. With a specific prediction procedure, our prediction interval can capture the estimation variability so that it can render a better coverage rate for finite sample cases. The superior performance of our method is verified by simulation and empirical studies. △ Less

Submitted 11 September, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

arXiv:2407.19171 [pdf, other]

Assessing Spatial Disparities: A Bayesian Linear Regression Approach

Authors: Kyle Lin Wu, Sudipto Banerjee

Abstract: Epidemiological investigations of regionally aggregated spatial data often involve detecting spatial health disparities among neighboring regions on a map of disease mortality or incidence rates. Analyzing such data introduces spatial dependence among the health outcomes and seeks to report statistically significant spatial disparities by delineating boundaries that separate neighboring regions wi… ▽ More Epidemiological investigations of regionally aggregated spatial data often involve detecting spatial health disparities among neighboring regions on a map of disease mortality or incidence rates. Analyzing such data introduces spatial dependence among the health outcomes and seeks to report statistically significant spatial disparities by delineating boundaries that separate neighboring regions with disparate health outcomes. However, there are statistical challenges to appropriately defining what constitutes a spatial disparity and to construct robust probabilistic inference for spatial disparities. We enrich the familiar Bayesian linear regression framework to introduce spatial autoregression and offer model-based detection of spatial disparities. We derive exploitable analytical tractability that considerably accelerates computation. Simulation experiments conducted over a county map of the entire United States demonstrate the effectiveness of our method and we apply our method to a data set from the Institute of Health Metrics and Evaluation (IHME) on age-standardized US county-level estimates of lung cancer mortality rates. △ Less

Submitted 14 March, 2025; v1 submitted 27 July, 2024; originally announced July 2024.

Comments: 55 pages, 9 figures

arXiv:2407.10449 [pdf, other]

A Fast, Robust Elliptical Slice Sampling Implementation for Linearly Truncated Multivariate Normal Distributions

Authors: Kaiwen Wu, Jacob R. Gardner

Abstract: Elliptical slice sampling, when adapted to linearly truncated multivariate normal distributions, is a rejection-free Markov chain Monte Carlo method. At its core, it requires analytically constructing an ellipse-polytope intersection. The main novelty of this paper is an algorithm that computes this intersection in $\mathcal{O}(m \log m)$ time, where $m$ is the number of linear inequality constrai… ▽ More Elliptical slice sampling, when adapted to linearly truncated multivariate normal distributions, is a rejection-free Markov chain Monte Carlo method. At its core, it requires analytically constructing an ellipse-polytope intersection. The main novelty of this paper is an algorithm that computes this intersection in $\mathcal{O}(m \log m)$ time, where $m$ is the number of linear inequality constraints representing the polytope. We show that an implementation based on this algorithm enhances numerical stability, speeds up running time, and is easy to parallelize for launching multiple Markov chains. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 13 pages

arXiv:2406.01870 [pdf, other]

Understanding Stochastic Natural Gradient Variational Inference

Authors: Kaiwen Wu, Jacob R. Gardner

Abstract: Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the \emph{stochastic} setting. We aim to lessen this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ n… ▽ More Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the \emph{stochastic} setting. We aim to lessen this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ non-asymptotic convergence rate of stochastic NGVI. The complexity is no worse than stochastic gradient descent (\aka black-box variational inference) and the rate likely has better constant dependency that leads to faster convergence in practice. For non-conjugate likelihoods, we show that stochastic NGVI with the canonical parameterization implicitly optimizes a non-convex objective. Thus, a global convergence rate of $\mathcal{O}(\frac{1}{T})$ is unlikely without some significant new understanding of optimizing the ELBO using natural gradients. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2405.08276 [pdf, other]

Scalable Subsampling Inference for Deep Neural Networks

Authors: Kejin Wu, Dimitris N. Politis

Abstract: Deep neural networks (DNN) has received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound has been developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest r… ▽ More Deep neural networks (DNN) has received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound has been developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest results on the approximation ability of DNN. More importantly, however, a non-random subsampling technique--scalable subsampling--is applied to construct a `subagged' DNN estimator. Under regularity conditions, it is shown that the subagged DNN estimator is computationally efficient without sacrificing accuracy for either estimation or prediction tasks. Beyond point estimation/prediction, we propose different approaches to build confidence and prediction intervals based on the subagged DNN estimator. In addition to being asymptotically valid, the proposed confidence/prediction intervals appear to work well in finite samples. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2402.08683 [pdf]

Order picking efficiency: A scattered storage and clustered allocation strategy in automated drug dispensing systems

Authors: Mengge Yuan, Ning Zhao, Kan Wu, Lulu Cheng

Abstract: In the smart hospital, optimizing prescription order fulfilment processes in outpatient pharmacies is crucial. A promising device, automated drug dispensing systems (ADDSs), has emerged to streamline these processes. These systems involve human order pickers who are assisted by ADDSs. The ADDS's robotic arm transports bins from storage locations to the input/output (I/O) points, while the pharmaci… ▽ More In the smart hospital, optimizing prescription order fulfilment processes in outpatient pharmacies is crucial. A promising device, automated drug dispensing systems (ADDSs), has emerged to streamline these processes. These systems involve human order pickers who are assisted by ADDSs. The ADDS's robotic arm transports bins from storage locations to the input/output (I/O) points, while the pharmacist sorts the requested drugs from the bins at the I/O points. This paper focuses on coordinating the ADDS and the pharmacists to optimize the order-picking strategy. Another critical aspect of order-picking systems is the storage location assignment problem (SLAP), which determines the allocation of drugs to storage locations. In this study, we consider the ADDS as a smart warehouse and propose a two-stage scattered storage and clustered allocation (SSCA) strategy to optimize the SLAP for ADDSs. The first stage primarily adopts a scattered storage approach, and we develop a mathematical programming model to group drugs accordingly. In the second stage, we introduce a sequential alternating (SA) heuristic algorithm that takes into account the drug demand frequency and the correlation between drugs to cluster and locate them effectively. To evaluate the proposed SSCA strategy, we develop a double objective integer programming model for the order-picking problem in ADDSs to minimize the number of machines visited in prescription orders while maintaining the shortest average picking time of orders. The numerical results demonstrate that the proposed strategy can optimize the SLAP in ADDSs and improve significantly the order-picking efficiency of ADDSs in a human-robot cooperation environment. △ Less

Submitted 18 December, 2023; originally announced February 2024.

arXiv:2311.00294 [pdf, ps, other]

Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence

Authors: Dimitris N. Politis, Kejin Wu

Abstract: To address the difficult problem of multi-step ahead prediction of non-parametric autoregressions, we consider a forward bootstrap approach. Employing a local constant estimator, we can analyze a general type of non-parametric time series model, and show that the proposed point predictions are consistent with the true optimal predictor. We construct a quantile prediction interval that is asymptoti… ▽ More To address the difficult problem of multi-step ahead prediction of non-parametric autoregressions, we consider a forward bootstrap approach. Employing a local constant estimator, we can analyze a general type of non-parametric time series model, and show that the proposed point predictions are consistent with the true optimal predictor. We construct a quantile prediction interval that is asymptotically valid. Moreover, using a debiasing technique, we can asymptotically approximate the distribution of multi-step ahead non-parametric estimation by bootstrap. As a result, we can build bootstrap prediction intervals that are pertinent, i.e., can capture the model estimation variability, thus improving upon the standard quantile prediction intervals. Simulation studies are given to illustrate the performance of our point predictions and pertinent prediction intervals for finite samples. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.17137 [pdf, other]

Large-Scale Gaussian Processes via Alternating Projection

Authors: Kaiwen Wu, Jonathan Wenger, Haydn Jones, Geoff Pleiss, Jacob R. Gardner

Abstract: Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ spac… ▽ More Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG. △ Less

Submitted 8 March, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: AISTATS 2024

arXiv:2310.00902 [pdf, other]

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Authors: Yongchan Kwon, Eric Wu, Kevin Wu, James Zou

Abstract: Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image… ▽ More Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled. △ Less

Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2309.10301 [pdf, other]

Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

Authors: Keru Wu, Yuansi Chen, Wooseok Ha, Bin Yu

Abstract: Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify… ▽ More Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets. △ Less

Submitted 8 July, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

arXiv:2306.15209 [pdf, other]

doi 10.1109/JBHI.2024.3371097

Dynamic Reconfiguration of Brain Functional Network in Stroke

Authors: Kaichao Wu, Beth Jelfs, Katrina Neville, Wenzhen He, Qiang Fang

Abstract: The brain continually reorganizes its functional network to adapt to post-stroke functional impairments. Previous studies using static modularity analysis have presented global-level behavior patterns of this network reorganization. However, it is far from understood how the brain reconfigures its functional network dynamically following a stroke. This study collected resting-state functional MRI… ▽ More The brain continually reorganizes its functional network to adapt to post-stroke functional impairments. Previous studies using static modularity analysis have presented global-level behavior patterns of this network reorganization. However, it is far from understood how the brain reconfigures its functional network dynamically following a stroke. This study collected resting-state functional MRI data from 15 stroke patients, with mild (n = 6) and severe (n = 9) two subgroups based on their clinical symptoms. Additionally, 15 age-matched healthy subjects were considered as controls. By applying a multilayer network method, a dynamic modular structure was recognized based on a time-resolved function network. Then dynamic network measurements (recruitment, integration, and flexibility) were calculated to characterize the dynamic reconfiguration of post-stroke brain functional networks, hence, to reveal the neural functional rebuilding process. It was found from this investigation that severe patients tended to have reduced recruitment and increased between-network integration, while mild patients exhibited low network flexibility and less network integration. It is also noted that this severity-dependent alteration in network interaction was not able to be revealed by previous studies using static methods. Clinically, the obtained knowledge of the diverse patterns of dynamic adjustment in brain functional networks observed from the brain signal could help understand the underlying mechanism of the motor, speech, and cognitive functional impairments caused by stroke attacks. The proposed method not only could be used to evaluate patients' current brain status but also has the potential to provide insights into prognosis analysis and prediction. △ Less

Submitted 22 March, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: Has been accepted by IEEE Journal of Biomedical and Health Informatics, doi:10.1109/JBHI.2024.3371097

Journal ref: IEEE Journal of Biomedical and Health Informatics 2024

arXiv:2306.05398 [pdf, other]

Bayesian model calibration for diblock copolymer thin film self-assembly using power spectrum of microscopy data and machine learning surrogate

Authors: Lianghao Cao, Keyi Wu, J. Tinsley Oden, Peng Chen, Omar Ghattas

Abstract: Identifying parameters of computational models from experimental data, or model calibration, is fundamental for assessing and improving the predictability and reliability of computer simulations. In this work, we propose a method for Bayesian calibration of models that predict morphological patterns of diblock copolymer (Di-BCP) thin film self-assembly while accounting for various sources of uncer… ▽ More Identifying parameters of computational models from experimental data, or model calibration, is fundamental for assessing and improving the predictability and reliability of computer simulations. In this work, we propose a method for Bayesian calibration of models that predict morphological patterns of diblock copolymer (Di-BCP) thin film self-assembly while accounting for various sources of uncertainties in pattern formation and data acquisition. This method extracts the azimuthally-averaged power spectrum (AAPS) of the top-down microscopy characterization of Di-BCP thin film patterns as summary statistics for Bayesian inference of model parameters via the pseudo-marginal method. We derive the analytical and approximate form of a conditional likelihood for the AAPS of image data. We demonstrate that AAPS-based image data reduction retains the mutual information, particularly on important length scales, between image data and model parameters while being relatively agnostic to the aleatoric uncertainties associated with the random long-range disorder of Di-BCP patterns. Additionally, we propose a phase-informed prior distribution for Bayesian model calibration. Furthermore, reducing image data to AAPS enables us to efficiently build surrogate models to accelerate the proposed Bayesian model calibration procedure. We present the formulation and training of two multi-layer perceptrons for approximating the parameter-to-spectrum map, which enables fast integrated likelihood evaluations. We validate the proposed Bayesian model calibration method through numerical examples, for which the neural network surrogate delivers a fivefold reduction of the number of model simulations performed for a single calibration task. △ Less

Submitted 3 August, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Minor changes from the original submission, including a change in the title

arXiv:2306.04126 [pdf, ps, other]

Bootstrap Prediction Inference of Non-linear Autoregressive Models

Authors: Kejin Wu, Dimitris N. Politis

Abstract: The non-linear autoregressive (NLAR) model plays an important role in modeling and predicting time series. One-step ahead prediction is straightforward using the NLAR model, but the multi-step ahead prediction is cumbersome. For instance, iterating the one-step ahead predictor is a convenient strategy for linear autoregressive (LAR) models, but it is suboptimal under NLAR. In this paper, we first… ▽ More The non-linear autoregressive (NLAR) model plays an important role in modeling and predicting time series. One-step ahead prediction is straightforward using the NLAR model, but the multi-step ahead prediction is cumbersome. For instance, iterating the one-step ahead predictor is a convenient strategy for linear autoregressive (LAR) models, but it is suboptimal under NLAR. In this paper, we first propose a simulation and/or bootstrap algorithm to construct optimal point predictors under an $L_1$ or $L_2$ loss criterion. In addition, we construct bootstrap prediction intervals in the multi-step ahead prediction problem; in particular, we develop an asymptotically valid quantile prediction interval as well as a pertinent prediction interval for future values. In order to correct the undercoverage of prediction intervals with finite samples, we further employ predictive -- as opposed to fitted -- residuals in the bootstrap process. Simulation studies are also given to substantiate the finite sample performance of our methods. △ Less

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.15572 [pdf, other]

The Behavior and Convergence of Local Bayesian Optimization

Authors: Kaiwen Wu, Kyurae Kim, Roman Garnett, Jacob R. Gardner

Abstract: A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or converge… ▽ More A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings. △ Less

Submitted 8 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 27 pages; NeurIPS 2023

arXiv:2305.15349 [pdf, other]

On the Convergence of Black-Box Variational Inference

Authors: Kyurae Kim, Jisu Oh, Kaiwen Wu, Yi-An Ma, Jacob R. Gardner

Abstract: We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior dens… ▽ More We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems. △ Less

Submitted 10 January, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted to NeurIPS'23; previous title: "Black-Box Variational Inference Converges"

arXiv:2303.10472 [pdf, ps, other]

Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian Inference

Authors: Kyurae Kim, Kaiwen Wu, Jisu Oh, Jacob R. Gardner

Abstract: Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI sati… ▽ More Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the $ABC$ condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence. △ Less

Submitted 3 June, 2023; v1 submitted 18 March, 2023; originally announced March 2023.

Comments: Accepted to ICML'23 for live oral presentation

arXiv:2302.03358 [pdf, other]

Deep-OSG: Deep Learning of Operators in Semigroup

Authors: Junfeng Chen, Kailiang Wu

Abstract: This paper proposes a novel deep learning approach for learning operators in semigroup, with applications to modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous flow map learning (FML) works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z.… ▽ More This paper proposes a novel deep learning approach for learning operators in semigroup, with applications to modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous flow map learning (FML) works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction. △ Less

Submitted 12 September, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

arXiv:2205.09703 [pdf, other]

Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow

Authors: Jeeyung Kim, Mengtian Jin, Youkow Homma, Alex Sim, Wilko Kroeger, Kesheng Wu

Abstract: In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experi… ▽ More In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to the similar models with only static features. We additionally explored an application specific technique to extract information about the data production process, and was able to reduce the average prediction error by 44%. △ Less

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2205.06622 [pdf]

What Makes You Hold on to That Old Car? Joint Insights from Machine Learning and Multinomial Logit on Vehicle-level Transaction Decisions

Authors: Ling Jin, Alina Lazar, Caitlin Brown, Bingrong Sun, Venu Garikapati, Srinath Ravulaparthy, Qianmiao Chen, Alexander Sim, Kesheng Wu, Tin Ho, Thomas Wenzel, C. Anna Spurlock

Abstract: What makes you hold on that old car? While the vast majority of the household vehicles are still powered by conventional internal combustion engines, the progress of adopting emerging vehicle technologies will critically depend on how soon the existing vehicles are transacted out of the household fleet. Leveraging a nationally representative longitudinal data set, the Panel Study of Income Dynamic… ▽ More What makes you hold on that old car? While the vast majority of the household vehicles are still powered by conventional internal combustion engines, the progress of adopting emerging vehicle technologies will critically depend on how soon the existing vehicles are transacted out of the household fleet. Leveraging a nationally representative longitudinal data set, the Panel Study of Income Dynamics, this study examines how household decisions to dispose of or replace a given vehicle are: (1) influenced by the vehicle's attributes, (2) mediated by households' concurrent socio-demographic and economic attributes, and (3) triggered by key life cycle events. Coupled with a newly developed machine learning interpretation tool, TreeExplainer, we demonstrate an innovative use of machine learning models to augment traditional logit modeling to both generate behavioral insights and improve model performance. We find the two gradient-boosting-based methods, CatBoost and LightGBM, are the best performing machine learning models for this problem. The multinomial logistic model can achieve similar performance levels after its model specification is informed by TreeExplainer. Both machine learning and multinomial logit models suggest that while older vehicles are more likely to be disposed of or replaced than newer ones, such probability decreases as the vehicles serve the family longer. We find that married families, families with higher education levels, homeowners, and older families tend to keep their vehicles longer. Life events such as childbirth, residential relocation, and change of household composition and income are found to increase vehicle disposal and/or replacement. We provide additional insights on the timing of vehicle replacement or disposal, in particular, the presence of children and childbirth events are more strongly associated with vehicle replacement among younger parents. △ Less

Submitted 13 May, 2022; originally announced May 2022.

arXiv:2203.09611 [pdf, other]

doi 10.1080/13658816.2022.2053980

STICC: A multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity

Authors: Yuhao Kang, Kunlin Wu, Song Gao, Ignavier Ng, Jinmeng Rao, Shan Ye, Fan Zhang, Teng Fei

Abstract: Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-B… ▽ More Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-Based Clustering (STICC) method that considers both attributes and spatial relationships of geographic objects for multivariate spatial clustering. A subregion is created for each geographic object serving as the basic unit when performing clustering. A Markov random field is then constructed to characterize the attribute dependencies of subregions. Using a spatial consistency strategy, nearby objects are encouraged to belong to the same cluster. To test the performance of the proposed STICC algorithm, we apply it in two use cases. The comparison results with several baseline methods show that the STICC outperforms others significantly in terms of adjusted rand index and macro-F1 score. Join count statistics is also calculated and shows that the spatial contiguity is well preserved by STICC. Such a spatial clustering method may benefit various applications in the fields of geography, remote sensing, transportation, and urban planning, etc. △ Less

Submitted 30 March, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Journal ref: International Journal of Geographical Information Science, Year 2022

arXiv:2202.12636 [pdf, other]

Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains

Authors: Haitao Liu, Kai Wu, Yew-Soon Ong, Chao Bian, Xiaomo Jiang, Xiaofang Wang

Abstract: Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper present… ▽ More Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) model for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem. △ Less

Submitted 18 June, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2112.08601 [pdf, ps, other]

A New Model-free Prediction Method: GA-NoVaS

Authors: Kejin Wu, Sayar Karmakar

Abstract: Volatility forecasting plays an important role in the financial econometrics. Previous works in this regime are mainly based on applying various GARCH-type models. However, it is hard for people to choose a specific GARCH model which works for general cases and such traditional methods are unstable for dealing with high-volatile period or using small sample size. The newly proposed normalizing and… ▽ More Volatility forecasting plays an important role in the financial econometrics. Previous works in this regime are mainly based on applying various GARCH-type models. However, it is hard for people to choose a specific GARCH model which works for general cases and such traditional methods are unstable for dealing with high-volatile period or using small sample size. The newly proposed normalizing and variance stabilizing (NoVaS) method is a more robust and accurate prediction technique. This Model-free method is built by taking advantage of an inverse transformation which is based on the ARCH model. Inspired by the historic development of the ARCH to GARCH model, we propose a novel NoVaS-type method which exploits the GARCH model structure. By performing extensive data analysis, we find our model has better time-aggregated prediction performance than the current state-of-the-art NoVaS method on forecasting short and volatile data. The victory of our new method corroborates that and also opens up avenues where one can explore other NoVaS structures to improve on the existing ones or solve specific prediction problems. △ Less

Submitted 15 December, 2021; originally announced December 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2101.02273

arXiv:2109.13055 [pdf, other]

Minimax Mixing Time of the Metropolis-Adjusted Langevin Algorithm for Log-Concave Sampling

Authors: Keru Wu, Scott Schmidler, Yuansi Chen

Abstract: We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $κ$, we show that MALA with a warm start mixes in $\tilde O(κ\sqrt{d})$ iteration… ▽ More We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $κ$, we show that MALA with a warm start mixes in $\tilde O(κ\sqrt{d})$ iterations up to logarithmic factors. This improves upon the previous work on the dependency of either the condition number $κ$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, where we establish a new concentration bound for the acceptance rate. Second, we prove a spectral gap based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde Ω(κ\sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of condition number and dimension. Finally, numerical experiments are included to validate our theoretical results. △ Less

Submitted 2 October, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 63 pages, 2 figures

Journal ref: Journal of Machine Learning Research, Vol. 23, No. 270, pp. 1-63 (2022)

arXiv:2102.11027 [pdf, other]

Investigating Underlying Drivers of Variability in Residential Energy Usage Patterns with Daily Load Shape Clustering of Smart Meter Data

Authors: Ling Jin, C. Anna Spurlock, Sam Borgeson, Alina Lazar, Daniel Fredman, Annika Todd, Alexander Sim, Kesheng Wu

Abstract: Residential customers have traditionally not been treated as individual entities due to the high volatility in residential consumption patterns as well as a historic focus on aggregated loads from the utility and system feeder perspective. Large-scale deployment of smart meters has motivated increasing studies to explore disaggregated daily load patterns, which can reveal important heterogeneity a… ▽ More Residential customers have traditionally not been treated as individual entities due to the high volatility in residential consumption patterns as well as a historic focus on aggregated loads from the utility and system feeder perspective. Large-scale deployment of smart meters has motivated increasing studies to explore disaggregated daily load patterns, which can reveal important heterogeneity across different time scales, weather conditions, as well as within and across individual households. This paper aims to shed light on the mechanisms by which electricity consumption patterns exhibit variability and the different constraints that may affect demand-response (DR) flexibility. We systematically evaluate the relationship between daily time-of-use patterns and their variability to external and internal influencing factors, including time scales of interest, meteorological conditions, and household characteristics by application of an improved version of the adaptive K-means clustering method to profile "household-days" of a summer peaking utility. We find that for this summer-peaking utility, outdoor temperature is the most important external driver of the load shape variability relative to seasonality and day-of-week. The top three consumption patterns represent approximately 50% of usage on the highest temperature days. The variability in summer load shapes across customers can be explained by the responsiveness of the households to outside temperature. Our results suggest that depending on the influencing factors, not all the consumption variability can be readily translated to consumption flexibility. Such information needs to be further explored in segmenting customers for better program targeting and tailoring to meet the needs of the rapidly evolving electricity grid. △ Less

Submitted 16 February, 2021; originally announced February 2021.

Comments: 11 pages, 11 figures

arXiv:2101.02273 [pdf, ps, other]

Model-free time-aggregated predictions for econometric datasets

Authors: Kejin Wu, Sayar Karmakar

Abstract: This article explores the existing normalizing and variance-stabilizing (NoVaS) method on predicting squared log-returns of financial data. First, we explore the robustness of the existing NoVaS method for long-term time-aggregated predictions. Then we develop a more parsimonious variant of the existing method. With systematic justification and extensive data analysis, our new method shows better… ▽ More This article explores the existing normalizing and variance-stabilizing (NoVaS) method on predicting squared log-returns of financial data. First, we explore the robustness of the existing NoVaS method for long-term time-aggregated predictions. Then we develop a more parsimonious variant of the existing method. With systematic justification and extensive data analysis, our new method shows better performance than current NoVaS and standard GARCH(1,1) methods on both short- and long-term time-aggregated predictions. △ Less

Submitted 4 November, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

arXiv:2011.05625 [pdf, other]

CAN: Feature Co-Action for Click-Through Rate Prediction

Authors: Weijie Bian, Kailun Wu, Lejian Ren, Qi Pi, Yujing Zhang, Can Xiao, Xiang-Rong Sheng, Yong-Nan Zhu, Zhangming Chan, Na Mou, Xinchen Luo, Shiming Xiang, Guorui Zhou, Xiaoqiang Zhu, Hongbo Deng

Abstract: Feature interaction has been recognized as an important problem in machine learning, which is also very essential for click-through rate (CTR) prediction tasks. In recent years, Deep Neural Networks (DNNs) can automatically learn implicit nonlinear interactions from original sparse features, and therefore have been widely used in industrial CTR prediction tasks. However, the implicit feature inter… ▽ More Feature interaction has been recognized as an important problem in machine learning, which is also very essential for click-through rate (CTR) prediction tasks. In recent years, Deep Neural Networks (DNNs) can automatically learn implicit nonlinear interactions from original sparse features, and therefore have been widely used in industrial CTR prediction tasks. However, the implicit feature interactions learned in DNNs cannot fully retain the complete representation capacity of the original and empirical feature interactions (e.g., cartesian product) without loss. For example, a simple attempt to learn the combination of feature A and feature B <A, B> as the explicit cartesian product representation of new features can outperform previous implicit feature interaction models including factorization machine (FM)-based models and their variations. In this paper, we propose a Co-Action Network (CAN) to approximate the explicit pairwise feature interactions without introducing too many additional parameters. More specifically, giving feature A and its associated feature B, their feature interaction is modeled by learning two sets of parameters: 1) the embedding of feature A, and 2) a Multi-Layer Perceptron (MLP) to represent feature B. The approximated feature interaction can be obtained by passing the embedding of feature A through the MLP network of feature B. We refer to such pairwise feature interaction as feature co-action, and such a Co-Action Network unit can provide a very powerful capacity to fitting complex feature interactions. Experimental results on public and industrial datasets show that CAN outperforms state-of-the-art CTR models and the cartesian product method. Moreover, CAN has been deployed in the display advertisement system in Alibaba, obtaining 12\% improvement on CTR and 8\% on Revenue Per Mille (RPM), which is a great improvement to the business. △ Less

Submitted 7 December, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: WSDM 2022

MSC Class: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG) ACM Class: I.2.6

arXiv:2011.01058 [pdf, other]

doi 10.1016/j.cma.2021.114020

Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities

Authors: Peng Chen, Keyi Wu, Omar Ghattas

Abstract: We propose a high dimensional Bayesian inference framework for learning heterogeneous dynamics of a COVID-19 model, with a specific application to the dynamics and severity of COVID-19 inside and outside long-term care (LTC) facilities. We develop a heterogeneous compartmental model that accounts for the heterogeneity of the time-varying spread and severity of COVID-19 inside and outside LTC facil… ▽ More We propose a high dimensional Bayesian inference framework for learning heterogeneous dynamics of a COVID-19 model, with a specific application to the dynamics and severity of COVID-19 inside and outside long-term care (LTC) facilities. We develop a heterogeneous compartmental model that accounts for the heterogeneity of the time-varying spread and severity of COVID-19 inside and outside LTC facilities, which is characterized by time-dependent stochastic processes and time-independent parameters in $\sim$1500 dimensions after discretization. To infer these parameters, we use reported data on the number of confirmed, hospitalized, and deceased cases with suitable post-processing in both a deterministic inversion approach with appropriate regularization as a first step, followed by Bayesian inversion with proper prior distributions. To address the curse of dimensionality and the ill-posedness of the high-dimensional inference problem, we propose use of a dimension-independent projected Stein variational gradient descent method, and demonstrate the intrinsic low-dimensionality of the inverse problem. We present inference results with quantified uncertainties for both New Jersey and Texas, which experienced different epidemic phases and patterns. Moreover, we also present forecasting and validation results based on the empirical posterior samples of our inference for the future trajectory of COVID-19. △ Less

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2008.02883 [pdf, other]

Stronger and Faster Wasserstein Adversarial Attacks

Authors: Kaiwen Wu, Allen Houze Wang, Yaoliang Yu

Abstract: Deep models, while being extremely flexible and accurate, are surprisingly vulnerable to "small, imperceptible" perturbations known as adversarial attacks. While the majority of existing attacks focus on measuring perturbations under the $\ell_p$ metric, Wasserstein distance, which takes geometry in pixel space into account, has long been known to be a suitable metric for measuring image quality a… ▽ More Deep models, while being extremely flexible and accurate, are surprisingly vulnerable to "small, imperceptible" perturbations known as adversarial attacks. While the majority of existing attacks focus on measuring perturbations under the $\ell_p$ metric, Wasserstein distance, which takes geometry in pixel space into account, has long been known to be a suitable metric for measuring image quality and has recently risen as a compelling alternative to the $\ell_p$ metric in adversarial attacks. However, constructing an effective attack under the Wasserstein metric is computationally much more challenging and calls for better optimization algorithms. We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show that the Frank-Wolfe method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. Our algorithms not only converge faster but also generate much stronger attacks. For instance, we decrease the accuracy of a residual network on CIFAR-10 to $3.4\%$ within a Wasserstein perturbation ball of radius $0.005$, in contrast to $65.6\%$ using the previous Wasserstein attack based on an \emph{approximate} projection operator. Furthermore, employing our stronger attacks in adversarial training significantly improves the robustness of adversarially trained models. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Comments: 30 pages, accepted to ICML 2020

arXiv:2007.09762 [pdf, other]

A Theory of Multiple-Source Adaptation with Limited Target Labeled Data

Authors: Yishay Mansour, Mehryar Mohri, Jae Ro, Ananda Theertha Suresh, Ke Wu

Abstract: We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but where the learner has at disposal a large amount of labeled data from multiple source domains. We show that a new family of algorithms based on model selection ideas benefits from very favorable guar… ▽ More We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but where the learner has at disposal a large amount of labeled data from multiple source domains. We show that a new family of algorithms based on model selection ideas benefits from very favorable guarantees in this scenario and discuss some theoretical obstacles affecting some alternative techniques. We also report the results of several experiments with our algorithms that demonstrate their practical effectiveness. △ Less

Submitted 29 October, 2020; v1 submitted 19 July, 2020; originally announced July 2020.

Comments: 20 pages

arXiv:2006.14592 [pdf, other]

Newton-type Methods for Minimax Optimization

Authors: Guojun Zhang, Kaiwen Wu, Pascal Poupart, Yaoliang Yu

Abstract: Differential games, in particular two-player sequential zero-sum games (a.k.a. minimax optimization), have been an important modeling tool in applied science and received renewed interest in machine learning due to many recent applications, such as adversarial training, generative models and reinforcement learning. However, existing theory mostly focuses on convex-concave functions with few except… ▽ More Differential games, in particular two-player sequential zero-sum games (a.k.a. minimax optimization), have been an important modeling tool in applied science and received renewed interest in machine learning due to many recent applications, such as adversarial training, generative models and reinforcement learning. However, existing theory mostly focuses on convex-concave functions with few exceptions. In this work, we propose two novel Newton-type algorithms for nonconvex-nonconcave minimax optimization. We prove their local convergence at strict local minimax points, which are surrogates of global solutions. We argue that our Newton-type algorithms nicely complement existing ones in that (a) they converge faster to strict local minimax points; (b) they are much more effective when the problem is ill-conditioned; (c) their computational complexity remains similar. We verify the effectiveness of our Newton-type algorithms through experiments on training GANs which are intrinsically nonconvex and ill-conditioned. Our code is available at https://github.com/watml/min-max-2nd-order. △ Less

Submitted 18 February, 2023; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: code update

arXiv:2003.05681 [pdf]

doi 10.1007/s11071-020-05862-6

Generalized logistic growth modeling of the COVID-19 outbreak: comparing the dynamics in the 29 provinces in China and in the rest of the world

Authors: Ke Wu, Didier Darcet, Qian Wang, Didier Sornette

Abstract: Started in Wuhan, China, the COVID-19 has been spreading all over the world. We calibrate the logistic growth model, the generalized logistic growth model, the generalized Richards model and the generalized growth model to the reported number of infected cases for the whole of China, 29 provinces in China, and 33 countries and regions that have been or are undergoing major outbreaks. We dissect th… ▽ More Started in Wuhan, China, the COVID-19 has been spreading all over the world. We calibrate the logistic growth model, the generalized logistic growth model, the generalized Richards model and the generalized growth model to the reported number of infected cases for the whole of China, 29 provinces in China, and 33 countries and regions that have been or are undergoing major outbreaks. We dissect the development of the epidemics in China and the impact of the drastic control measures both at the aggregate level and within each province. We quantitatively document four phases of the outbreak in China with a detailed analysis on the heterogeneous situations across provinces. The extreme containment measures implemented by China were very effective with some instructive variations across provinces. Borrowing from the experience of China, we made scenario projections on the development of the outbreak in other countries. We identified that outbreaks in 14 countries (mostly in western Europe) have ended, while resurgences of cases have been identified in several among them. The modeling results clearly show longer after-peak trajectories in western countries, in contrast to most provinces in China where the after-peak trajectory is characterized by a much faster decay. We identified three groups of countries in different level of outbreak progress, and provide informative implications for the current global pandemic. △ Less

Submitted 22 September, 2020; v1 submitted 12 March, 2020; originally announced March 2020.

Journal ref: Nonlinear Dynamics, 2020

arXiv:2003.03616 [pdf, other]

Diffusion State Distances: Multitemporal Analysis, Fast Algorithms, and Applications to Biological Networks

Authors: Lenore Cowen, Kapil Devkota, Xiaozhe Hu, James M. Murphy, Kaiyi Wu

Abstract: Data-dependent metrics are powerful tools for learning the underlying structure of high-dimensional data. This article develops and analyzes a data-dependent metric known as diffusion state distance (DSD), which compares points using a data-driven diffusion process. Unlike related diffusion methods, DSDs incorporate information across time scales, which allows for the intrinsic data structure to b… ▽ More Data-dependent metrics are powerful tools for learning the underlying structure of high-dimensional data. This article develops and analyzes a data-dependent metric known as diffusion state distance (DSD), which compares points using a data-driven diffusion process. Unlike related diffusion methods, DSDs incorporate information across time scales, which allows for the intrinsic data structure to be inferred in a parameter-free manner. This article develops a theory for DSD based on the multitemporal emergence of mesoscopic equilibria in the underlying diffusion process. New algorithms for denoising and dimension reduction with DSD are also proposed and analyzed. These approaches are based on a weighted spectral decomposition of the underlying diffusion process, and experiments on synthetic datasets and real biological networks illustrate the efficacy of the proposed algorithms in terms of both speed and accuracy. Throughout, comparisons with related methods are made, in order to illustrate the distinct advantages of DSD for datasets exhibiting multiscale structure. △ Less

Submitted 7 March, 2020; originally announced March 2020.

Comments: 28 pages

arXiv:2003.02387 [pdf, other]

doi 10.1007/s10915-020-01324-8

Methods to Recover Unknown Processes in Partial Differential Equations Using Data

Authors: Zhen Chen, Kailiang Wu, Dongbin Xiu

Abstract: We study the problem of identifying unknown processes embedded in time-dependent partial differential equation (PDE) using observational data, with an application to advection-diffusion type PDE. We first conduct theoretical analysis and derive conditions to ensure the solvability of the problem. We then present a set of numerical approaches, including Galerkin type algorithm and collocation type… ▽ More We study the problem of identifying unknown processes embedded in time-dependent partial differential equation (PDE) using observational data, with an application to advection-diffusion type PDE. We first conduct theoretical analysis and derive conditions to ensure the solvability of the problem. We then present a set of numerical approaches, including Galerkin type algorithm and collocation type algorithm. Analysis of the algorithms are presented, along with their implementation detail. The Galerkin algorithm is more suitable for practical situations, particularly those with noisy data, as it avoids using derivative/gradient data. Various numerical examples are then presented to demonstrate the performance and properties of the numerical methods. △ Less

Submitted 4 March, 2020; originally announced March 2020.

Comments: 21 pages, 11 figures

Journal ref: Journal of Scientific Computing 85, 23 (2020)

arXiv:2003.01436 [pdf, other]

Learning to Generate Time Series Conditioned Graphs with Generative Adversarial Nets

Authors: Shanchao Yang, Jing Liu, Kai Wu, Mingming Li

Abstract: Deep learning based approaches have been utilized to model and generate graphs subjected to different distributions recently. However, they are typically unsupervised learning based and unconditioned generative models or simply conditioned on the graph-level contexts, which are not associated with rich semantic node-level contexts. Differently, in this paper, we are interested in a novel problem n… ▽ More Deep learning based approaches have been utilized to model and generate graphs subjected to different distributions recently. However, they are typically unsupervised learning based and unconditioned generative models or simply conditioned on the graph-level contexts, which are not associated with rich semantic node-level contexts. Differently, in this paper, we are interested in a novel problem named Time Series Conditioned Graph Generation: given an input multivariate time series, we aim to infer a target relation graph modeling the underlying interrelationships between time series with each node corresponding to each time series. For example, we can study the interrelationships between genes in a gene regulatory network of a certain disease conditioned on their gene expression data recorded as time series. To achieve this, we propose a novel Time Series conditioned Graph Generation-Generative Adversarial Networks (TSGG-GAN) to handle challenges of rich node-level context structures conditioning and measuring similarities directly between graphs and time series. Extensive experiments on synthetic and real-word gene regulatory networks datasets demonstrate the effectiveness and generalizability of the proposed TSGG-GAN. △ Less

Submitted 26 August, 2023; v1 submitted 3 March, 2020; originally announced March 2020.

arXiv:2002.04658 [pdf, other]

A Non-Intrusive Correction Algorithm for Classification Problems with Corrupted Data

Authors: Jun Hou, Tong Qin, Kailiang Wu, Dongbin Xiu

Abstract: A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximators, such as logistic regression, neural networks of various architecture… ▽ More A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximators, such as logistic regression, neural networks of various architectures, etc. When training dataset is sufficiently large, we prove that the corrected models deliver correct classification results as if there is no corruption in the training data. For datasets of finite size, the corrected models produce significantly better recovery results, compared to the models without the correction algorithm. All of the theoretical findings in the paper are verified by our numerical examples. △ Less

Submitted 11 February, 2020; originally announced February 2020.

arXiv:1910.06948 [pdf, ps, other]

doi 10.1016/j.jcp.2020.109307

Data-Driven Deep Learning of Partial Differential Equations in Modal Space

Authors: Kailiang Wu, Dongbin Xiu

Abstract: We present a framework for recovering/approximating unknown time-dependent partial differential equation (PDE) using its solution data. Instead of identifying the terms in the underlying PDE, we seek to approximate the evolution operator of the underlying PDE numerically. The evolution operator of the PDE, defined in infinite-dimensional space, maps the solution from a current time to a future tim… ▽ More We present a framework for recovering/approximating unknown time-dependent partial differential equation (PDE) using its solution data. Instead of identifying the terms in the underlying PDE, we seek to approximate the evolution operator of the underlying PDE numerically. The evolution operator of the PDE, defined in infinite-dimensional space, maps the solution from a current time to a future time and completely characterizes the solution evolution of the underlying unknown PDE. Our recovery strategy relies on approximation of the evolution operator in a properly defined modal space, i.e., generalized Fourier space, in order to reduce the problem to finite dimensions. The finite dimensional approximation is then accomplished by training a deep neural network structure, which is based on residual network (ResNet), using the given data. Error analysis is provided to illustrate the predictive accuracy of the proposed method. A set of examples of different types of PDEs, including inviscid Burgers' equation that develops discontinuity in its solution, are presented to demonstrate the effectiveness of the proposed method. △ Less

Submitted 18 October, 2019; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: Minor notational changes

Journal ref: Journal of Computational Physics, 408: 109307, 2020

arXiv:1907.11780 [pdf, other]

Understanding Adversarial Robustness: The Trade-off between Minimum and Average Margin

Authors: Kaiwen Wu, Yaoliang Yu

Abstract: Deep models, while being extremely versatile and accurate, are vulnerable to adversarial attacks: slight perturbations that are imperceptible to humans can completely flip the prediction of deep models. Many attack and defense mechanisms have been proposed, although a satisfying solution still largely remains elusive. In this work, we give strong evidence that during training, deep models maximize… ▽ More Deep models, while being extremely versatile and accurate, are vulnerable to adversarial attacks: slight perturbations that are imperceptible to humans can completely flip the prediction of deep models. Many attack and defense mechanisms have been proposed, although a satisfying solution still largely remains elusive. In this work, we give strong evidence that during training, deep models maximize the minimum margin in order to achieve high accuracy, but at the same time decrease the \emph{average} margin hence hurting robustness. Our empirical results highlight an intrinsic trade-off between accuracy and robustness for current deep model training. To further address this issue, we propose a new regularizer to explicitly promote average margin, and we verify through extensive experiments that it does lead to better robustness. Our regularized objective remains Fisher-consistent, hence asymptotically can still recover the Bayes optimal classifier. △ Less

Submitted 26 July, 2019; originally announced July 2019.

arXiv:1906.10304 [pdf, other]

Res-embedding for Deep Learning Based Click-Through Rate Prediction Modeling

Authors: Guorui Zhou, Kailun Wu, Weijie Bian, Zhao Yang, Xiaoqiang Zhu, Kun Gai

Abstract: Recently, click-through rate (CTR) prediction models have evolved from shallow methods to deep neural networks. Most deep CTR models follow an Embedding\&MLP paradigm, that is, first mapping discrete id features, e.g. user visited items, into low dimensional vectors with an embedding module, then learn a multi-layer perception (MLP) to fit the target. In this way, embedding module performs as the… ▽ More Recently, click-through rate (CTR) prediction models have evolved from shallow methods to deep neural networks. Most deep CTR models follow an Embedding\&MLP paradigm, that is, first mapping discrete id features, e.g. user visited items, into low dimensional vectors with an embedding module, then learn a multi-layer perception (MLP) to fit the target. In this way, embedding module performs as the representative learning and plays a key role in the model performance. However, in many real-world applications, deep CTR model often suffers from poor generalization performance, which is mostly due to the learning of embedding parameters. In this paper, we model user behavior using an interest delay model, study carefully the embedding mechanism, and obtain two important results: (i) We theoretically prove that small aggregation radius of embedding vectors of items which belongs to a same user interest domain will result in good generalization performance of deep CTR model. (ii) Following our theoretical analysis, we design a new embedding structure named res-embedding. In res-embedding module, embedding vector of each item is the sum of two components: (i) a central embedding vector calculated from an item-based interest graph (ii) a residual embedding vector with its scale to be relatively small. Empirical evaluation on several public datasets demonstrates the effectiveness of the proposed res-embedding structure, which brings significant improvement on the model performance. △ Less

Submitted 24 June, 2019; originally announced June 2019.

arXiv:1905.10396 [pdf, other]

doi 10.1137/19M1264011

Structure-preserving Method for Reconstructing Unknown Hamiltonian Systems from Trajectory Data

Authors: Kailiang Wu, Tong Qin, Dongbin Xiu

Abstract: We present a numerical approach for approximating unknown Hamiltonian systems using observation data. A distinct feature of the proposed method is that it is structure-preserving, in the sense that it enforces conservation of the reconstructed Hamiltonian. This is achieved by directly approximating the underlying unknown Hamiltonian, rather than the right-hand-side of the governing equations. We p… ▽ More We present a numerical approach for approximating unknown Hamiltonian systems using observation data. A distinct feature of the proposed method is that it is structure-preserving, in the sense that it enforces conservation of the reconstructed Hamiltonian. This is achieved by directly approximating the underlying unknown Hamiltonian, rather than the right-hand-side of the governing equations. We present the technical details of the proposed algorithm and its error estimate in a special case, along with a practical de-noising procedure to cope with noisy data. A set of numerical examples are then presented to demonstrate the structure-preserving property and effectiveness of the algorithm. △ Less

Submitted 19 August, 2020; v1 submitted 24 May, 2019; originally announced May 2019.

Comments: 27 pages, 19 figures

Journal ref: SIAM Journal on Scientific Computing 42 (6), A3704--A3729, 2020

arXiv:1905.06125 [pdf, other]

Distributional Reinforcement Learning for Efficient Exploration

Authors: Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yaoliang Yu

Abstract: In distributional reinforcement learning (RL), the estimated distribution of value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distr… ▽ More In distributional reinforcement learning (RL), the estimated distribution of value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving 483 \% average gain across 49 games in cumulative rewards over QR-DQN with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice faster than QRDQN. △ Less

Submitted 13 May, 2019; originally announced May 2019.

Journal ref: ICML, 2019

arXiv:1902.02627 [pdf, other]

Fast Transient Simulation of High-Speed Channels Using Recurrent Neural Network

Authors: Thong Nguyen, Tianjian Lu, Ken Wu, Jose Schutt-Aine

Abstract: Generating eye diagrams by using a circuit simulator can be very computationally intensive, especially in the presence of nonlinearities. It often involves multiple Newton-like iterations at every time step when a SPICE-like circuit simulator handles a nonlinear system in the transient regime. In this paper, we leverage machine learning methods, to be specific, the recurrent neural network (RNN),… ▽ More Generating eye diagrams by using a circuit simulator can be very computationally intensive, especially in the presence of nonlinearities. It often involves multiple Newton-like iterations at every time step when a SPICE-like circuit simulator handles a nonlinear system in the transient regime. In this paper, we leverage machine learning methods, to be specific, the recurrent neural network (RNN), to generate black-box macromodels and achieve significant reduction of computation time. Through the proposed approach, an RNN model is first trained and then validated on a relatively short sequence generated from a circuit simulator. Once the training completes, the RNN can be used to make predictions on the remaining sequence in order to generate an eye diagram. The training cost can also be amortized when the trained RNN starts making predictions. Besides, the proposed approach requires no complex circuit simulations nor substantial domain knowledge. We use two high-speed link examples to demonstrate that the proposed approach provides adequate accuracy while the computation time can be dramatically reduced. In the high-speed link example with a PAM4 driver, the eye diagram generated by RNN models shows good agreement with that obtained from a commercial circuit simulator. This paper also investigates the impacts of various RNN topologies, training schemes, and tunable parameters on both the accuracy and the generalization capability of an RNN model. It is found out that the long short-term memory (LSTM) network outperforms the vanilla RNN in terms of the accuracy in predicting transient waveforms. △ Less

Submitted 7 February, 2019; v1 submitted 25 January, 2019; originally announced February 2019.

Showing 1–50 of 58 results for author: Wu, K