Search | arXiv e-print repository

Spark Transformer: Reactivating Sparsity in FFN and Attention

Authors: Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah J. Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E. Culler, Henry M. Levy, Sanjiv Kumar

Abstract: The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the Re… ▽ More The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges. This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU. △ Less

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.02413 [pdf, ps, other]

Tensor State Space-based Dynamic Multilayer Network Modeling

Authors: Tian Lan, Jie Guo, Chen Zhang

Abstract: Understanding the complex interactions within dynamic multilayer networks is critical for advancements in various scientific domains. Existing models often fail to capture such networks' temporal and cross-layer dynamics. This paper introduces a novel Tensor State Space Model for Dynamic Multilayer Networks (TSSDMN), utilizing a latent space model framework. TSSDMN employs a symmetric Tucker decom… ▽ More Understanding the complex interactions within dynamic multilayer networks is critical for advancements in various scientific domains. Existing models often fail to capture such networks' temporal and cross-layer dynamics. This paper introduces a novel Tensor State Space Model for Dynamic Multilayer Networks (TSSDMN), utilizing a latent space model framework. TSSDMN employs a symmetric Tucker decomposition to represent latent node features, their interaction patterns, and layer transitions. Then by fixing the latent features and allowing the interaction patterns to evolve over time, TSSDMN uniquely captures both the temporal dynamics within layers and across different layers. The model identifiability conditions are discussed. By treating latent features as variables whose posterior distributions are approximated using a mean-field variational inference approach, a variational Expectation Maximization algorithm is developed for efficient model inference. Numerical simulations and case studies demonstrate the efficacy of TSSDMN for understanding dynamic multilayer networks. △ Less

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2505.12225 [pdf, other]

Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM's Best-of-N sampling

Authors: Jizhou Guo, Zhaomin Wu, Philip S. Yu

Abstract: High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden Stat… ▽ More High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model - a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperform baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvement with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can also be combined with traditional reward models to achieve additional performance gains. △ Less

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2505.01859 [pdf, other]

Bayesian learning of the optimal action-value function in a Markov decision process

Authors: Jiaqi Guo, Chon Wai Ho, Sumeetpal S. Singh

Abstract: The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of it to learn optimal decision-making strategies. In particular, a Bayesian framework is used to maintain beliefs about the optimal decisions and the unknown ingredients of the model, which are also to be learned from the data, such as the rewa… ▽ More The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of it to learn optimal decision-making strategies. In particular, a Bayesian framework is used to maintain beliefs about the optimal decisions and the unknown ingredients of the model, which are also to be learned from the data, such as the rewards and state dynamics. However, many existing Bayesian approaches for learning the optimal decision-making strategy are based on unrealistic modelling assumptions and utilise approximate inference techniques. This raises doubts whether the benefits of Bayesian uncertainty quantification are fully realised or can be relied upon. We focus on infinite-horizon and undiscounted MDPs, with finite state and action spaces, and a terminal state. We provide a full Bayesian framework, from modelling to inference to decision-making. For modelling, we introduce a likelihood function with minimal assumptions for learning the optimal action-value function based on Bellman's optimality equations, analyse its properties, and clarify connections to existing works. For deterministic rewards, the likelihood is degenerate and we introduce artificial observation noise to relax it, in a controlled manner, to facilitate more efficient Monte Carlo-based inference. For inference, we propose an adaptive sequential Monte Carlo algorithm to both sample from and adjust the sequence of relaxed posterior distributions. For decision-making, we choose actions using samples from the posterior distribution over the optimal strategies. While commonly done, we provide new insight that clearly shows that it is a generalisation of Thompson sampling from multi-arm bandit problems. Finally, we evaluate our framework on the Deep Sea benchmark problem and demonstrate the exploration benefits of posterior sampling in MDPs. △ Less

Submitted 3 May, 2025; originally announced May 2025.

Comments: 66 pages

arXiv:2504.20471 [pdf, other]

The Estimation of Continual Causal Effect for Dataset Shifting Streams

Authors: Baining Chen, Yiming Zhang, Yuqiao Han, Ruyue Zhang, Ruihuan Du, Zhishuo Zhou, Zhengdan Zhu, Xun Liu, Jiecheng Guo

Abstract: Causal effect estimation has been widely used in marketing optimization. The framework of an uplift model followed by a constrained optimization algorithm is popular in practice. To enhance performance in the online environment, the framework needs to be improved to address the complexities caused by temporal dataset shift. This paper focuses on capturing the dataset shift from user behavior and d… ▽ More Causal effect estimation has been widely used in marketing optimization. The framework of an uplift model followed by a constrained optimization algorithm is popular in practice. To enhance performance in the online environment, the framework needs to be improved to address the complexities caused by temporal dataset shift. This paper focuses on capturing the dataset shift from user behavior and domain distribution changing over time. We propose an Incremental Causal Effect with Proxy Knowledge Distillation (ICE-PKD) framework to tackle this challenge. The ICE-PKD framework includes two components: (i) a multi-treatment uplift network that eliminates confounding bias using counterfactual regression; (ii) an incremental training strategy that adapts to the temporal dataset shift by updating with the latest data and protects generalization via replay-based knowledge distillation. We also revisit the uplift modeling metrics and introduce a novel metric for more precise online evaluation in multiple treatment scenarios. Extensive experiments on both simulated and online datasets show that the proposed framework achieves better performance. The ICE-PKD framework has been deployed in the marketing system of Huaxiaozhu, a ride-hailing platform in China. △ Less

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2503.01924 [pdf, other]

TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions

Authors: Wang YuHang, Junkang Guo, Aolei Liu, Kaihao Wang, Zaitong Wu, Zhenyu Liu, Wenfei Yin, Jian Liu

Abstract: Adversarial robustness is a critical challenge in deploying deep neural networks for real-world applications. While adversarial training is a widely recognized defense strategy, most existing studies focus on balanced datasets, overlooking the prevalence of long-tailed distributions in real-world data, which significantly complicates robustness. This paper provides a comprehensive analysis of adve… ▽ More Adversarial robustness is a critical challenge in deploying deep neural networks for real-world applications. While adversarial training is a widely recognized defense strategy, most existing studies focus on balanced datasets, overlooking the prevalence of long-tailed distributions in real-world data, which significantly complicates robustness. This paper provides a comprehensive analysis of adversarial training under long-tailed distributions and identifies limitations in the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which integrates an initial stabilization phase followed by a stratified equalization adversarial training phase. Additionally, prior work on long-tailed robustness has largely ignored the crucial evaluation metric of balanced accuracy. To bridge this gap, we introduce the concept of balanced robustness, a comprehensive metric tailored for assessing robustness under long-tailed distributions. Extensive experiments demonstrate that our method surpasses existing advanced defenses, achieving significant improvements in both memory and computational efficiency. This work represents a substantial advancement in addressing robustness challenges in real-world applications. Our code is available at: https://github.com/BuhuiOK/TAET-Two-Stage-Adversarial-Equalization-Training-on-Long-Tailed-Distributions. △ Less

Submitted 21 March, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

Comments: Text: 8 pages of main content, 5 pages of appendices have been accepted by CVPR2025

Journal ref: Computer Vision and Pattern Recognition 2025

arXiv:2407.16134 [pdf, other]

Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data

Authors: Hengyu Fu, Zehao Dou, Jiawei Guo, Mengdi Wang, Minshuo Chen

Abstract: Diffusion Transformer, the backbone of Sora for video generation, successfully scales the capacity of diffusion models, pioneering new avenues for high-fidelity sequential data generation. Unlike static data such as images, sequential data consists of consecutive data frames indexed by time, exhibiting rich spatial and temporal dependencies. These dependencies represent the underlying dynamic mode… ▽ More Diffusion Transformer, the backbone of Sora for video generation, successfully scales the capacity of diffusion models, pioneering new avenues for high-fidelity sequential data generation. Unlike static data such as images, sequential data consists of consecutive data frames indexed by time, exhibiting rich spatial and temporal dependencies. These dependencies represent the underlying dynamic model and are critical to validate the generated data. In this paper, we make the first theoretical step towards bridging diffusion transformers for capturing spatial-temporal dependencies. Specifically, we establish score approximation and distribution estimation guarantees of diffusion transformers for learning Gaussian process data with covariance functions of various decay patterns. We highlight how the spatial-temporal dependencies are captured and affect learning efficiency. Our study proposes a novel transformer approximation theory, where the transformer acts to unroll an algorithm. We support our theoretical results by numerical experiments, providing strong evidence that spatial-temporal dependencies are captured within attention layers, aligning with our approximation theory. △ Less

Submitted 4 February, 2025; v1 submitted 22 July, 2024; originally announced July 2024.

Comments: 56 pages, 13 figures

arXiv:2406.19535 [pdf, other]

Modeling trajectories using functional linear differential equations

Authors: Julia Wrobel, Britton Sauerbrei, Erik A. Kirk, Jian-Zhong Guo, Adam Hantman, Jeff Goldsmith

Abstract: We are motivated by a study that seeks to better understand the dynamic relationship between muscle activation and paw position during locomotion. For each gait cycle in this experiment, activation in the biceps and triceps is measured continuously and in parallel with paw position as a mouse trotted on a treadmill. We propose an innovative general regression method that draws from both ordinary d… ▽ More We are motivated by a study that seeks to better understand the dynamic relationship between muscle activation and paw position during locomotion. For each gait cycle in this experiment, activation in the biceps and triceps is measured continuously and in parallel with paw position as a mouse trotted on a treadmill. We propose an innovative general regression method that draws from both ordinary differential equations and functional data analysis to model the relationship between these functional inputs and responses as a dynamical system that evolves over time. Specifically, our model addresses gaps in both literatures and borrows strength across curves estimating ODE parameters across all curves simultaneously rather than separately modeling each functional observation. Our approach compares favorably to related functional data methods in simulations and in cross-validated predictive accuracy of paw position in the gait data. In the analysis of the gait cycles, we find that paw speed and position are dynamically influenced by inputs from the biceps and triceps muscles, and that the effect of muscle activation persists beyond the activation itself. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2401.17585 [pdf, other]

Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks

Authors: Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, Zhiguo Wang

Abstract: Current approaches of knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers s… ▽ More Current approaches of knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers six common reasoning schemes in real world. We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit. We found that all model editing methods show notably low performance on this dataset, especially in certain reasoning schemes. Our analysis over the chain-of-thought generation of edited models further uncover key reasons behind the inadequacy of existing knowledge editing methods from a reasoning standpoint, involving aspects on fact-wise editing, fact recall ability, and coherence in generation. We will make our benchmark publicly available. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: 22 pages, 14 figures, 5 tables

arXiv:2309.01935 [pdf]

The impact of electronic health records (EHR) data continuity on prediction model fairness and racial-ethnic disparities

Authors: Yu Huang, Jingchuan Guo, Zhaoyi Chen, Jie Xu, William T Donahoo, Olveen Carasquillo, Hrushyang Adloori, Jiang Bian, Elizabeth A Shenkman

Abstract: Electronic health records (EHR) data have considerable variability in data completeness across sites and patients. Lack of "EHR data-continuity" or "EHR data-discontinuity", defined as "having medical information recorded outside the reach of an EHR system" can lead to a substantial amount of information bias. The objective of this study was to comprehensively evaluate (1) how EHR data-discontinui… ▽ More Electronic health records (EHR) data have considerable variability in data completeness across sites and patients. Lack of "EHR data-continuity" or "EHR data-discontinuity", defined as "having medical information recorded outside the reach of an EHR system" can lead to a substantial amount of information bias. The objective of this study was to comprehensively evaluate (1) how EHR data-discontinuity introduces data bias, (2) case finding algorithms affect downstream prediction models, and (3) how algorithmic fairness is associated with racial-ethnic disparities. We leveraged our EHRs linked with Medicaid and Medicare claims data in the OneFlorida+ network and used a validated measure (i.e., Mean Proportions of Encounters Captured [MPEC]) to estimate patients' EHR data continuity. We developed a machine learning model for predicting type 2 diabetes (T2D) diagnosis as the use case for this work. We found that using cohorts selected by different levels of EHR data-continuity affects utilities in disease prediction tasks. The prediction models trained on high continuity data will have a worse fit on low continuity data. We also found variations in racial and ethnic disparities in model performances and model fairness in models developed using different degrees of data continuity. Our results suggest that careful evaluation of data continuity is critical to improving the validity of real-world evidence generated by EHR data and health equity. △ Less

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2307.02884 [pdf, ps, other]

Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight

Authors: Jiacheng Guo, Minshuo Chen, Huan Wang, Caiming Xiong, Mengdi Wang, Yu Bai

Abstract: This paper studies the sample-efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst-case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called ``multiple observations in hindsight'', where after each episode of in… ▽ More This paper studies the sample-efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst-case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called ``multiple observations in hindsight'', where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sample-efficient learning under this feedback model is possible for two new subclasses of POMDPs: \emph{multi-observation revealing POMDPs} and \emph{distinguishable POMDPs}. Both subclasses generalize and substantially relax \emph{revealing POMDPs} -- a widely studied subclass for which sample-efficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be \emph{different} instead of \emph{linearly independent} as required in revealing POMDPs. △ Less

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2305.18263 [pdf, other]

doi 10.1007/s11634-023-00546-6

MLE for the parameters of bivariate interval-valued models

Authors: S. Yaser Samadi, L. Billard, Jiin-Huarng Guo, Wei Xu

Abstract: With contemporary data sets becoming too large to analyze the data directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation, the observations are interval-valued (e.g.). While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that approach ignores much of the in… ▽ More With contemporary data sets becoming too large to analyze the data directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation, the observations are interval-valued (e.g.). While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that approach ignores much of the information in the original data set. The initial theoretical work for interval-valued data was that of Le-Rademacher and Billard (2011), but those results were limited to estimation of the mean and variance of a single variable only. This article seeks to redress the limitation of their work by deriving the maximum likelihood estimator for the all important covariance statistic, a basic requirement for numerous methodologies, such as regression, principal components, and canonical analyses. Asymptotic properties of the proposed estimators are established. The Le-Rademacher and Billard results emerge as special cases of our wider derivations. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: Will appear in ADAC

Journal ref: Advances in Data Analysis and Classification, 2023

arXiv:2212.09458 [pdf, other]

Exploring Optimal Substructure for Out-of-distribution Generalization via Feature-targeted Model Pruning

Authors: Yingchun Wang, Jingcai Guo, Song Guo, Weizhan Zhang, Jie Zhang

Abstract: Recent studies show that even highly biased dense networks contain an unbiased substructure that can achieve better out-of-distribution (OOD) generalization than the original model. Existing works usually search the invariant subnetwork using modular risk minimization (MRM) with out-domain data. Such a paradigm may bring about two potential weaknesses: 1) Unfairness, due to the insufficient observ… ▽ More Recent studies show that even highly biased dense networks contain an unbiased substructure that can achieve better out-of-distribution (OOD) generalization than the original model. Existing works usually search the invariant subnetwork using modular risk minimization (MRM) with out-domain data. Such a paradigm may bring about two potential weaknesses: 1) Unfairness, due to the insufficient observation of out-domain data during training; and 2) Sub-optimal OOD generalization, due to the feature-untargeted model pruning on the whole data distribution. In this paper, we propose a novel Spurious Feature-targeted model Pruning framework, dubbed SFP, to automatically explore invariant substructures without referring to the above weaknesses. Specifically, SFP identifies in-distribution (ID) features during training using our theoretically verified task loss, upon which, SFP can perform ID targeted-model pruning that removes branches with strong dependencies on ID features. Notably, by attenuating the projections of spurious features into model space, SFP can push the model learning toward invariant features and pull that out of environmental features, devising optimal OOD generalization. Moreover, we also conduct detailed theoretical analysis to provide the rationality guarantee and a proof framework for OOD structures via model sparsity, and for the first time, reveal how a highly biased data distribution affects the model's OOD generalization. Extensive experiments on various OOD datasets show that SFP can significantly outperform both structure-based and non-structure OOD generalization SOTAs, with accuracy improvement up to 4.72% and 23.35%, respectively. △ Less

Submitted 19 December, 2022; originally announced December 2022.

Comments: 9 pages;2 figures

ACM Class: I.2.6

arXiv:2210.05218 [pdf, other]

A Latent Logistic Regression Model with Graph Data

Authors: Haixiang Zhang, Yingjun Deng, Alan J. X. Guo, Qing-Hu Hou, Ou Wu

Abstract: Recently, graph (network) data is an emerging research area in artificial intelligence, machine learning and statistics. In this work, we are interested in whether node's labels (people's responses) are affected by their neighbor's features (friends' characteristics). We propose a novel latent logistic regression model to describe the network dependence with binary responses. The key advantage of… ▽ More Recently, graph (network) data is an emerging research area in artificial intelligence, machine learning and statistics. In this work, we are interested in whether node's labels (people's responses) are affected by their neighbor's features (friends' characteristics). We propose a novel latent logistic regression model to describe the network dependence with binary responses. The key advantage of our proposed model is that a latent binary indicator is introduced to indicate whether a node is susceptible to the influence of its neighbour. A score-type test is proposed to diagnose the existence of network dependence. In addition, an EM-type algorithm is used to estimate the model parameters under network dependence. Extensive simulations are conducted to evaluate the performance of our method. Two public datasets are used to illustrate the effectiveness of the proposed latent logistic regression model. △ Less

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2209.14846 [pdf, other]

Modeling and Learning on High-Dimensional Matrix-Variate Sequences

Authors: Xu Zhang, Catherine C. Liu, Jianhua Guo, K. C. Yuen, A. H. Welsh

Abstract: We propose a new matrix factor model, named RaDFaM, which is strictly derived based on the general rank decomposition and assumes a structure of a high-dimensional vector factor model for each basis vector. RaDFaM contributes a novel class of low-rank latent structure that makes tradeoff between signal intensity and dimension reduction from the perspective of tensor subspace. Based on the intrinsi… ▽ More We propose a new matrix factor model, named RaDFaM, which is strictly derived based on the general rank decomposition and assumes a structure of a high-dimensional vector factor model for each basis vector. RaDFaM contributes a novel class of low-rank latent structure that makes tradeoff between signal intensity and dimension reduction from the perspective of tensor subspace. Based on the intrinsic separable covariance structure of RaDFaM, for a collection of matrix-valued observations, we derive a new class of PCA variants for estimating loading matrices, and sequentially the latent factor matrices. The peak signal-to-noise ratio of RaDFaM is proved to be superior in the category of PCA-type estimations. We also establish the asymptotic theory including the consistency, convergence rates, and asymptotic distributions for components in the signal part. Numerically, we demonstrate the performance of RaDFaM in applications such as matrix reconstruction, supervised learning, and clustering, on uncorrelated and correlated data, respectively. △ Less

Submitted 12 February, 2024; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: 33 pages, 12 figures

arXiv:2206.02508 [pdf, other]

Tucker tensor factor models: matricization and mode-wise PCA estimation

Authors: Xu Zhang, Guodong Li, Catherine C. Liu, Jianhua Guo

Abstract: High-dimensional, higher-order tensor data are gaining prominence in a variety of fields, including but not limited to computer vision and network analysis. Tensor factor models, induced from noisy versions of tensor decompositions or factorizations, are natural potent instruments to study a collection of tensor-variate objects that may be dependent or independent. However, it is still in the earl… ▽ More High-dimensional, higher-order tensor data are gaining prominence in a variety of fields, including but not limited to computer vision and network analysis. Tensor factor models, induced from noisy versions of tensor decompositions or factorizations, are natural potent instruments to study a collection of tensor-variate objects that may be dependent or independent. However, it is still in the early stage of developing statistical inferential theories for the estimation of various low-rank structures, which are customary to play the role of signals of tensor factor models. In this paper, we attempt to ``decode" the estimation of a higher-order tensor factor model by leveraging tensor matricization. Specifically, we recast it into mode-wise traditional high-dimensional vector/fiber factor models, enabling the deployment of conventional principal components analysis (PCA) for estimation. Demonstrated by the Tucker tensor factor model (TuTFaM), which is induced from the noisy version of the widely-used Tucker decomposition, we summarize that estimations on signal components are essentially mode-wise PCA techniques, and the involvement of projection and iteration will enhance the signal-to-noise ratio to various extent. We establish the inferential theory of the proposed estimators, conduct rich simulation experiments, and illustrate how the proposed estimations can work in tensor reconstruction, and clustering for independent video and dependent economic datasets, respectively. △ Less

Submitted 12 December, 2024; v1 submitted 6 June, 2022; originally announced June 2022.

arXiv:2203.10975 [pdf, other]

GCF: Generalized Causal Forest for Heterogeneous Treatment Effect Estimation in Online Marketplace

Authors: Shu Wan, Chen Zheng, Zhonggen Sun, Mengfan Xu, Xiaoqing Yang, Hongtu Zhu, Jiecheng Guo

Abstract: Uplift modeling is a rapidly growing approach that utilizes causal inference and machine learning methods to directly estimate the heterogeneous treatment effects, which has been widely applied to various online marketplaces to assist large-scale decision-making in recent years. The existing popular models, like causal forest (CF), are limited to either discrete treatments or posing parametric ass… ▽ More Uplift modeling is a rapidly growing approach that utilizes causal inference and machine learning methods to directly estimate the heterogeneous treatment effects, which has been widely applied to various online marketplaces to assist large-scale decision-making in recent years. The existing popular models, like causal forest (CF), are limited to either discrete treatments or posing parametric assumptions on the outcome-treatment relationship that may suffer model misspecification. However, continuous treatments (e.g., price, duration) often arise in marketplaces. To alleviate these restrictions, we use a kernel-based doubly robust estimator to recover the non-parametric dose-response functions that can flexibly model continuous treatment effects. Moreover, we propose a generic distance-based splitting criterion to capture the heterogeneity for the continuous treatments. We call the proposed algorithm generalized causal forest (GCF) as it generalizes the use case of CF to a much broader setting. We show the effectiveness of GCF by deriving the asymptotic property of the estimator and comparing it to popular uplift modeling methods on both synthetic and real-world datasets. We implement GCF on Spark and successfully deploy it into a large-scale online pricing system at a leading ride-sharing company. Online A/B testing results further validate the superiority of GCF. △ Less

Submitted 23 September, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

arXiv:2203.05646 [pdf, ps, other]

Koopman Methods for Estimation of Animal Motions over Unknown Submanifolds

Authors: Nathan Powell, Bowei Liu, Jia Guo, Sai Tej Parachuri, Andrew J. Kurdila

Abstract: This paper introduces a data-dependent approximation of the forward kinematics map for certain types of animal motion models. It is assumed that motions are supported on a low-dimensional, unknown configuration manifold $Q$ that is regularly embedded in high dimensional Euclidean space $X:=\mathbb{R}^d$. This paper introduces a method to estimate forward kinematics from the unknown configuration s… ▽ More This paper introduces a data-dependent approximation of the forward kinematics map for certain types of animal motion models. It is assumed that motions are supported on a low-dimensional, unknown configuration manifold $Q$ that is regularly embedded in high dimensional Euclidean space $X:=\mathbb{R}^d$. This paper introduces a method to estimate forward kinematics from the unknown configuration submanifold $Q$ to an $n$-dimensional Euclidean space $Y:=\mathbb{R}^n$ of observations. A known reproducing kernel Hilbert space (RKHS) is defined over the ambient space $X$ in terms of a known kernel function, and computations are performed using the known kernel defined on the ambient space $X$. Estimates are constructed using a certain data-dependent approximation of the Koopman operator defined in terms of the known kernel on $X$. However, the rate of convergence of approximations is studied in the space of restrictions to the unknown manifold $Q$. Strong rates of convergence are derived in terms of the fill distance of samples in the unknown configuration manifold, provided that a novel regularity result holds for the Koopman operator. Additionally, we show that the derived rates of convergence can be applied in some cases to estimates generated by the extended dynamic mode decomposition (EDMD) method. We illustrate characteristics of the estimates for simulated data as well as samples collected during motion capture experiments. △ Less

Submitted 6 June, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

arXiv:2201.12609 [pdf, other]

ApolloRL: a Reinforcement Learning Platform for Autonomous Driving

Authors: Fei Gao, Peng Geng, Jiaqi Guo, Yuan Liu, Dingfeng Guo, Yabo Su, Jie Zhou, Xiao Wei, Jin Li, Xu Liu

Abstract: We introduce ApolloRL, an open platform for research in reinforcement learning for autonomous driving. The platform provides a complete closed-loop pipeline with training, simulation, and evaluation components. It comes with 300 hours of real-world data in driving scenarios and popular baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) agents. We elaborate in this pap… ▽ More We introduce ApolloRL, an open platform for research in reinforcement learning for autonomous driving. The platform provides a complete closed-loop pipeline with training, simulation, and evaluation components. It comes with 300 hours of real-world data in driving scenarios and popular baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) agents. We elaborate in this paper on the architecture and the environment defined in the platform. In addition, we discuss the performance of the baseline agents in the ApolloRL environment. △ Less

Submitted 29 January, 2022; originally announced January 2022.

arXiv:2111.01301 [pdf, other]

Asymptotic in a class of network models with an increasing sub-Gamma degree sequence

Authors: Jing Luo, Haoyu Wei, Xiaoyu Lei, Jiaxin Guo

Abstract: For the differential privacy under the sub-Gamma noise, we derive the asymptotic properties of a class of network models with binary values with a general link function. In this paper, we release the degree sequences of the binary networks under a general noisy mechanism with the discrete Laplace mechanism as a special case. We establish the asymptotic result including both consistency and asympto… ▽ More For the differential privacy under the sub-Gamma noise, we derive the asymptotic properties of a class of network models with binary values with a general link function. In this paper, we release the degree sequences of the binary networks under a general noisy mechanism with the discrete Laplace mechanism as a special case. We establish the asymptotic result including both consistency and asymptotically normality of the parameter estimator when the number of parameters goes to infinity in a class of network models. Simulations and a real data example are provided to illustrate asymptotic results. △ Less

Submitted 10 November, 2023; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: arXiv admin note: text overlap with arXiv:2002.12733 by other authors

MSC Class: 62E20; 62F12

arXiv:2012.12196 [pdf, ps, other]

Identifiability of Bifactor Models

Authors: Guanhua Fang, Xin Xu, Jinxin Guo, Zhiliang Ying, Susu Zhang

Abstract: The bifactor model and its extensions are multidimensional latent variable models, under which each item measures up to one subdimension on top of the primary dimension(s). Despite their wide applications to educational and psychological assessments, this type of multidimensional latent variable models may suffer from non-identifiability, which can further lead to inconsistent parameter estimation… ▽ More The bifactor model and its extensions are multidimensional latent variable models, under which each item measures up to one subdimension on top of the primary dimension(s). Despite their wide applications to educational and psychological assessments, this type of multidimensional latent variable models may suffer from non-identifiability, which can further lead to inconsistent parameter estimation and invalid inference. The current work provides a relatively complete characterization of identifiability for the linear and dichotomous bifactor models and the linear extended bifactor model with correlated subdimensions. In addition, similar results for the two-tier models are also developed. Illustrative examples are provided on checking model identifiability through inspecting the factor loading structure. Simulation studies are reported that examine estimation consistency when the identifiability conditions are/are not satisfied. △ Less

Submitted 22 December, 2020; originally announced December 2020.

Comments: 89 pages

arXiv:2012.08371 [pdf, other]

Limiting laws and consistent estimation criteria for fixed and diverging number of spiked eigenvalues

Authors: Jianwei Hu, Jingfei Zhang, Jianhua Guo, Ji Zhu

Abstract: In this paper, we study limiting laws and consistent estimation criteria for the extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly, for fixed $p$, we propose a generalized estimation criterion that can consistently estimate, $k$, the number of spiked eigenvalues. Compared with the existing literature, we show that consistency can be achieved under weaker conditions on the… ▽ More In this paper, we study limiting laws and consistent estimation criteria for the extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly, for fixed $p$, we propose a generalized estimation criterion that can consistently estimate, $k$, the number of spiked eigenvalues. Compared with the existing literature, we show that consistency can be achieved under weaker conditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we derive limiting distributions of the spiked sample eigenvalues using random matrix theory techniques. Notably, our results do not require the spiked eigenvalues to be uniformly bounded from above or tending to infinity, as have been assumed in the existing literature. Based on the above derived results, we formulate a generalized estimation criterion and show that it can consistently estimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We further show that the results in our work continue to hold under a general population distribution without assuming normality. The efficacy of the proposed estimation criteria is illustrated through comparative simulation studies. △ Less

Submitted 17 November, 2023; v1 submitted 15 December, 2020; originally announced December 2020.

arXiv:2011.08595 [pdf, other]

doi 10.1109/TIP.2021.3123555

DS-UI: Dual-Supervised Mixture of Gaussian Mixture Models for Uncertainty Inference

Authors: Jiyang Xie, Zhanyu Ma, Jing-Hao Xue, Guoqiang Zhang, Jun Guo

Abstract: This paper proposes a dual-supervised uncertainty inference (DS-UI) framework for improving Bayesian estimation-based uncertainty inference (UI) in deep neural network (DNN)-based image recognition. In the DS-UI, we combine the classifier of a DNN, i.e., the last fully-connected (FC) layer, with a mixture of Gaussian mixture models (MoGMM) to obtain an MoGMM-FC layer. Unlike existing UI methods fo… ▽ More This paper proposes a dual-supervised uncertainty inference (DS-UI) framework for improving Bayesian estimation-based uncertainty inference (UI) in deep neural network (DNN)-based image recognition. In the DS-UI, we combine the classifier of a DNN, i.e., the last fully-connected (FC) layer, with a mixture of Gaussian mixture models (MoGMM) to obtain an MoGMM-FC layer. Unlike existing UI methods for DNNs, which only calculate the means or modes of the DNN outputs' distributions, the proposed MoGMM-FC layer acts as a probabilistic interpreter for the features that are inputs of the classifier to directly calculate the probability density of them for the DS-UI. In addition, we propose a dual-supervised stochastic gradient-based variational Bayes (DS-SGVB) algorithm for the MoGMM-FC layer optimization. Unlike conventional SGVB and optimization algorithms in other UI methods, the DS-SGVB not only models the samples in the specific class for each Gaussian mixture model (GMM) in the MoGMM, but also considers the negative samples from other classes for the GMM to reduce the intra-class distances and enlarge the inter-class margins simultaneously for enhancing the learning ability of the MoGMM-FC layer in the DS-UI. Experimental results show the DS-UI outperforms the state-of-the-art UI methods in misclassification detection. We further evaluate the DS-UI in open-set out-of-domain/-distribution detection and find statistically significant improvements. Visualizations of the feature spaces demonstrate the superiority of the DS-UI. △ Less

Submitted 17 November, 2020; originally announced November 2020.

arXiv:2011.00647 [pdf, other]

Fast Network Community Detection with Profile-Pseudo Likelihood Methods

Authors: Jiangzhou Wang, Jingfei Zhang, Binghui Liu, Ji Zhu, Jianhua Guo

Abstract: The stochastic block model is one of the most studied network models for community detection. It is well-known that most algorithms proposed for fitting the stochastic block model likelihood function cannot scale to large-scale networks. One prominent work that overcomes this computational challenge is Amini et al.(2013), which proposed a fast pseudo-likelihood approach for fitting stochastic bloc… ▽ More The stochastic block model is one of the most studied network models for community detection. It is well-known that most algorithms proposed for fitting the stochastic block model likelihood function cannot scale to large-scale networks. One prominent work that overcomes this computational challenge is Amini et al.(2013), which proposed a fast pseudo-likelihood approach for fitting stochastic block models to large sparse networks. However, this approach does not have convergence guarantee, and is not well suited for small- or medium- scale networks. In this article, we propose a novel likelihood based approach that decouples row and column labels in the likelihood function, which enables a fast alternating maximization; the new method is computationally efficient, performs well for both small and large scale networks, and has provable convergence guarantee. We show that our method provides strongly consistent estimates of the communities in a stochastic block model. As demonstrated in simulation studies, the proposed method outperforms the pseudo-likelihood approach in terms of both estimation accuracy and computation efficiency, especially for large sparse networks. We further consider extensions of our proposed method to handle networks with degree heterogeneity and bipartite properties. △ Less

Submitted 29 August, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

arXiv:2010.05244 [pdf, other]

doi 10.1109/TPAMI.2021.3083089

Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization

Authors: Jiyang Xie, Zhanyu Ma, and Jianjun Lei, Guoqiang Zhang, Jing-Hao Xue, Zheng-Hua Tan, Jun Guo

Abstract: Due to lack of data, overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs). We propose advanced dropout, a model-free methodology, to mitigate overfitting and improve the performance of DNNs. The advanced dropout technique applies a model-free and easily implemented distribution with parametric prior, and adaptively adjusts dropout rate. Specifically, the distri… ▽ More Due to lack of data, overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs). We propose advanced dropout, a model-free methodology, to mitigate overfitting and improve the performance of DNNs. The advanced dropout technique applies a model-free and easily implemented distribution with parametric prior, and adaptively adjusts dropout rate. Specifically, the distribution parameters are optimized by stochastic gradient variational Bayes in order to carry out an end-to-end training. We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets (five small-scale datasets and two large-scale datasets) with various base models. The advanced dropout outperforms all the referred techniques on all the datasets.We further compare the effectiveness ratios and find that advanced dropout achieves the highest one on most cases. Next, we conduct a set of analysis of dropout rate characteristics, including convergence of the adaptive dropout rate, the learned distributions of dropout masks, and a comparison with dropout rate generation without an explicit distribution. In addition, the ability of overfitting prevention is evaluated and confirmed. Finally, we extend the application of the advanced dropout to uncertainty inference, network pruning, text classification, and regression. The proposed advanced dropout is also superior to the corresponding referred methods. Codes are available at https://github.com/PRIS-CV/AdvancedDropout. △ Less

Submitted 10 August, 2021; v1 submitted 11 October, 2020; originally announced October 2020.

Comments: Accepted by IEEE TPAMI, 2021

arXiv:2009.10645 [pdf, other]

Partially Observable Online Change Detection via Smooth-Sparse Decomposition

Authors: Jie Guo, Hao Yan, Chen Zhang, Steven Hoi

Abstract: We consider online change detection of high dimensional data streams with sparse changes, where only a subset of data streams can be observed at each sensing time point due to limited sensing capacities. On the one hand, the detection scheme should be able to deal with partially observable data and meanwhile have efficient detection power for sparse changes. On the other, the scheme should be able… ▽ More We consider online change detection of high dimensional data streams with sparse changes, where only a subset of data streams can be observed at each sensing time point due to limited sensing capacities. On the one hand, the detection scheme should be able to deal with partially observable data and meanwhile have efficient detection power for sparse changes. On the other, the scheme should be able to adaptively and actively select the most important variables to observe to maximize the detection power. To address these two points, in this paper, we propose a novel detection scheme called CDSSD. In particular, it describes the structure of high dimensional data with sparse changes by smooth-sparse decomposition, whose parameters can be learned via spike-slab variational Bayesian inference. Then the posterior Bayes factor, which incorporates the learned parameters and sparse change information, is formulated as a detection statistic. Finally, by formulating the statistic as the reward of a combinatorial multi-armed bandit problem, an adaptive sampling strategy based on Thompson sampling is proposed. The efficacy and applicability of our method in practice are demonstrated with numerical studies and a real case study. △ Less

Submitted 22 September, 2020; originally announced September 2020.

Comments: 48 pages

arXiv:2007.02394 [pdf, other]

Meta-Semi: A Meta-learning Approach for Semi-supervised Learning

Authors: Yulin Wang, Jiayi Guo, Shiji Song, Gao Huang

Abstract: Deep learning based semi-supervised learning (SSL) algorithms have led to promising results in recent years. However, they tend to introduce multiple tunable hyper-parameters, making them less practical in real SSL scenarios where the labeled data is scarce for extensive hyper-parameter search. In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning onl… ▽ More Deep learning based semi-supervised learning (SSL) algorithms have led to promising results in recent years. However, they tend to introduce multiple tunable hyper-parameters, making them less practical in real SSL scenarios where the labeled data is scarce for extensive hyper-parameter search. In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning only one additional hyper-parameter, compared with a standard supervised deep learning algorithm, to achieve competitive performance under various conditions of SSL. We start by defining a meta optimization problem that minimizes the loss on labeled data through dynamically reweighting the loss on unlabeled samples, which are associated with soft pseudo labels during training. As the meta problem is computationally intensive to solve directly, we propose an efficient algorithm to dynamically obtain the approximate solutions. We show theoretically that Meta-Semi converges to the stationary point of the loss function on labeled data under mild conditions. Empirically, Meta-Semi outperforms state-of-the-art SSL algorithms significantly on the challenging semi-supervised CIFAR-100 and STL-10 tasks, and achieves competitive performance on CIFAR-10 and SVHN. △ Less

Submitted 7 September, 2021; v1 submitted 5 July, 2020; originally announced July 2020.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2007.01488 [pdf, other]

On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation

Authors: Jianing Li, Yanyan Lan, Jiafeng Guo, Xueqi Cheng

Abstract: The goal of text generation models is to fit the underlying real probability distribution of text. For performance evaluation, quality and diversity metrics are usually applied. However, it is still not clear to what extend can the quality-diversity evaluation reflect the distribution-fitting goal. In this paper, we try to reveal such relation in a theoretical approach. We prove that under certain… ▽ More The goal of text generation models is to fit the underlying real probability distribution of text. For performance evaluation, quality and diversity metrics are usually applied. However, it is still not clear to what extend can the quality-diversity evaluation reflect the distribution-fitting goal. In this paper, we try to reveal such relation in a theoretical approach. We prove that under certain conditions, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. We also show that the commonly used BLEU/Self-BLEU metric pair fails to match any divergence metric, thus propose CR/NRR as a substitute for quality/diversity metric pair. △ Less

Submitted 18 August, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 16 pages, 7 figures. ICML2020 Final Submission

arXiv:2007.01231 [pdf, other]

Software Engineering Event Modeling using Relative Time in Temporal Knowledge Graphs

Authors: Kian Ahrabian, Daniel Tarlow, Hehuimin Cheng, Jin L. C. Guo

Abstract: We present a multi-relational temporal Knowledge Graph based on the daily interactions between artifacts in GitHub, one of the largest social coding platforms. Such representation enables posing many user-activity and project management questions as link prediction and time queries over the knowledge graph. In particular, we introduce two new datasets for i) interpolated time-conditioned link pred… ▽ More We present a multi-relational temporal Knowledge Graph based on the daily interactions between artifacts in GitHub, one of the largest social coding platforms. Such representation enables posing many user-activity and project management questions as link prediction and time queries over the knowledge graph. In particular, we introduce two new datasets for i) interpolated time-conditioned link prediction and ii) extrapolated time-conditioned link/time prediction queries, each with distinguished properties. Our experiments on these datasets highlight the potential of adapting knowledge graphs to answer broad software engineering questions. Meanwhile, it also reveals the unsatisfactory performance of existing temporal models on extrapolated queries and time prediction queries in general. To overcome these shortcomings, we introduce an extension to current temporal models using relative temporal information with regards to past events. △ Less

Submitted 12 July, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

Comments: 11 pages, 1 figure. 37th International Conference on Machine Learning (ICML 2020) - Workshop on Graph Representation Learning and Beyond

arXiv:2005.10696 [pdf, other]

Novel Policy Seeking with Constrained Optimization

Authors: Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, Bolei Zhou

Abstract: In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework… ▽ More In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods following the new perspective. The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show our methods can achieve substantial improvement over previous novelty-seeking methods in terms of both the novelty of policies and their performances in the primal task. △ Less

Submitted 29 October, 2022; v1 submitted 21 May, 2020; originally announced May 2020.

arXiv:2003.04575 [pdf, other]

GPCA: A Probabilistic Framework for Gaussian Process Embedded Channel Attention

Authors: Jiyang Xie, Dongliang Chang, Zhanyu Ma, Guoqiang Zhang, Jun Guo

Abstract: Channel attention mechanisms have been commonly applied in many visual tasks for effective performance improvement. It is able to reinforce the informative channels as well as to suppress the useless channels. Recently, different channel attention modules have been proposed and implemented in various ways. Generally speaking, they are mainly based on convolution and pooling operations. In this pap… ▽ More Channel attention mechanisms have been commonly applied in many visual tasks for effective performance improvement. It is able to reinforce the informative channels as well as to suppress the useless channels. Recently, different channel attention modules have been proposed and implemented in various ways. Generally speaking, they are mainly based on convolution and pooling operations. In this paper, we propose Gaussian process embedded channel attention (GPCA) module and further interpret the channel attention schemes in a probabilistic way. The GPCA module intends to model the correlations among the channels, which are assumed to be captured by beta distributed variables. As the beta distribution cannot be integrated into the end-to-end training of convolutional neural networks (CNNs) with a mathematically tractable solution, we utilize an approximation of the beta distribution to solve this problem. To specify, we adapt a Sigmoid-Gaussian approximation, in which the Gaussian distributed variables are transferred into the interval [0,1]. The Gaussian process is then utilized to model the correlations among different channels. In this case, a mathematically tractable solution is derived. The GPCA module can be efficiently implemented and integrated into the end-to-end training of the CNNs. Experimental results demonstrate the promising performance of the proposed GPCA module. Codes are available at https://github.com/PRIS-CV/GPCA. △ Less

Submitted 10 August, 2021; v1 submitted 10 March, 2020; originally announced March 2020.

Comments: Accepted by IEEE TPAMI, 2021

arXiv:2002.08021 [pdf]

Seasonal and Trend Forecasting of Tourist Arrivals: An Adaptive Multiscale Ensemble Learning Approach

Authors: Shaolong Suna, Dan Bi, Ju-e Guo, Shouyang Wang

Abstract: The accurate seasonal and trend forecasting of tourist arrivals is a very challenging task. In the view of the importance of seasonal and trend forecasting of tourist arrivals, and limited research work paid attention to these previously. In this study, a new adaptive multiscale ensemble (AME) learning approach incorporating variational mode decomposition (VMD) and least square support vector regr… ▽ More The accurate seasonal and trend forecasting of tourist arrivals is a very challenging task. In the view of the importance of seasonal and trend forecasting of tourist arrivals, and limited research work paid attention to these previously. In this study, a new adaptive multiscale ensemble (AME) learning approach incorporating variational mode decomposition (VMD) and least square support vector regression (LSSVR) is developed for short-, medium-, and long-term seasonal and trend forecasting of tourist arrivals. In the formulation of our developed AME learning approach, the original tourist arrivals series are first decomposed into the trend, seasonal and remainders volatility components. Then, the ARIMA is used to forecast the trend component, the SARIMA is used to forecast seasonal component with a 12-month cycle, while the LSSVR is used to forecast remainder volatility components. Finally, the forecasting results of the three components are aggregated to generate an ensemble forecasting of tourist arrivals by the LSSVR based nonlinear ensemble approach. Furthermore, a direct strategy is used to implement multi-step-ahead forecasting. Taking two accuracy measures and the Diebold-Mariano test, the empirical results demonstrate that our proposed AME learning approach can achieve higher level and directional forecasting accuracy compared with other benchmarks used in this study, indicating that our proposed approach is a promising model for forecasting tourist arrivals with high seasonality and volatility. △ Less

Submitted 10 March, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:2002.07964 [pdf]

Tourism Demand Forecasting: An Ensemble Deep Learning Approach

Authors: Shaolong Sun, Yanzhao Li, Ju-e Guo, Shouyang Wang

Abstract: The availability of tourism-related big data increases the potential to improve the accuracy of tourism demand forecasting, but presents significant challenges for forecasting, including curse of dimensionality and high model complexity. A novel bagging-based multivariate ensemble deep learning approach integrating stacked autoencoders and kernel-based extreme learning machines (B-SAKE) is propose… ▽ More The availability of tourism-related big data increases the potential to improve the accuracy of tourism demand forecasting, but presents significant challenges for forecasting, including curse of dimensionality and high model complexity. A novel bagging-based multivariate ensemble deep learning approach integrating stacked autoencoders and kernel-based extreme learning machines (B-SAKE) is proposed to address these challenges in this study. By using historical tourist arrival data, economic variable data and search intensity index (SII) data, we forecast tourist arrivals in Beijing from four countries. The consistent results of multiple schemes suggest that our proposed B-SAKE approach outperforms benchmark models in terms of level accuracy, directional accuracy and even statistical significance. Both bagging and stacked autoencoder can effectively alleviate the challenges brought by tourism big data and improve the forecasting performance of the models. The ensemble deep learning model we propose contributes to tourism forecasting literature and benefits relevant government officials and tourism practitioners. △ Less

Submitted 16 January, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:2001.11355 [pdf, other]

Constructing Deep Neural Networks with a Priori Knowledge of Wireless Tasks

Authors: Jia Guo, Chenyang Yang

Abstract: Deep neural networks (DNNs) have been employed for designing wireless systems in many aspects, say transceiver design, resource optimization, and information prediction. Existing works either use the fully-connected DNN or the DNNs with particular architectures developed in other domains. While generating labels for supervised learning and gathering training samples are time-consuming or cost-proh… ▽ More Deep neural networks (DNNs) have been employed for designing wireless systems in many aspects, say transceiver design, resource optimization, and information prediction. Existing works either use the fully-connected DNN or the DNNs with particular architectures developed in other domains. While generating labels for supervised learning and gathering training samples are time-consuming or cost-prohibitive, how to develop DNNs with wireless priors for reducing training complexity remains open. In this paper, we show that two kinds of permutation invariant properties widely existed in wireless tasks can be harnessed to reduce the number of model parameters and hence the sample and computational complexity for training. We find special architecture of DNNs whose input-output relationships satisfy the properties, called permutation invariant DNN (PINN), and augment the data with the properties. By learning the impact of the scale of a wireless system, the size of the constructed PINNs can flexibly adapt to the input data dimension. We take predictive resource allocation and interference coordination as examples to show how the PINNs can be employed for learning the optimal policy with unsupervised and supervised learning. Simulations results demonstrate a dramatic gain of the proposed PINNs in terms of reducing training complexity. △ Less

Submitted 29 January, 2020; originally announced January 2020.

Comments: 30 pages, 9 figures. arXiv admin note: text overlap with arXiv:1910.13728

arXiv:1912.13256 [pdf, other]

Scalable NAS with Factorizable Architectural Parameters

Authors: Lanfei Wang, Lingxi Xie, Tianyi Zhang, Jun Guo, Qi Tian

Abstract: Neural Architecture Search (NAS) is an emerging topic in machine learning and computer vision. The fundamental ideology of NAS is using an automatic mechanism to replace manual designs for exploring powerful network architectures. One of the key factors of NAS is to scale-up the search space, e.g., increasing the number of operators, so that more possibilities are covered, but existing search algo… ▽ More Neural Architecture Search (NAS) is an emerging topic in machine learning and computer vision. The fundamental ideology of NAS is using an automatic mechanism to replace manual designs for exploring powerful network architectures. One of the key factors of NAS is to scale-up the search space, e.g., increasing the number of operators, so that more possibilities are covered, but existing search algorithms often get lost in a large number of operators. For avoiding huge computing and competition among similar operators in the same pool, this paper presents a scalable algorithm by factorizing a large set of candidate operators into smaller subspaces. As a practical example, this allows us to search for effective activation functions along with the regular operators including convolution, pooling, skip-connect, etc. With a small increase in search costs and no extra costs in re-training, we find interesting architectures that were not explored before, and achieve state-of-the-art performance on CIFAR10 and ImageNet, two standard image classification benchmarks. △ Less

Submitted 22 September, 2020; v1 submitted 31 December, 2019; originally announced December 2019.

Comments: 10 pages, 4 figures

arXiv:1912.04838 [pdf, other]

Scalability in Perception for Autonomous Driving: Waymo Open Dataset

Authors: Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Sheng Zhao, Shuyang Cheng, Yu Zhang, Jonathon Shlens, Zhifeng Chen, Dragomir Anguelov

Abstract: The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help a… ▽ More The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open. △ Less

Submitted 12 May, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: CVPR 2020

arXiv:1911.08717 [pdf, other]

Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation

Authors: Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, Tie-Yan Liu

Abstract: Non-autoregressive translation (NAT) models remove the dependence on previous target tokens and generate all target tokens in parallel, resulting in significant inference speedup but at the cost of inferior translation accuracy compared to autoregressive translation (AT) models. Considering that AT models have higher accuracy and are easier to train than NAT models, and both of them share the same… ▽ More Non-autoregressive translation (NAT) models remove the dependence on previous target tokens and generate all target tokens in parallel, resulting in significant inference speedup but at the cost of inferior translation accuracy compared to autoregressive translation (AT) models. Considering that AT models have higher accuracy and are easier to train than NAT models, and both of them share the same model configurations, a natural idea to improve the accuracy of NAT models is to transfer a well-trained AT model to an NAT model through fine-tuning. However, since AT and NAT models differ greatly in training strategy, straightforward fine-tuning does not work well. In this work, we introduce curriculum learning into fine-tuning for NAT. Specifically, we design a curriculum in the fine-tuning process to progressively switch the training from autoregressive generation to non-autoregressive generation. Experiments on four benchmark translation datasets show that the proposed method achieves good improvement (more than $1$ BLEU score) over previous NAT baselines in terms of translation accuracy, and greatly speed up (more than $10$ times) the inference process over AT baselines. △ Less

Submitted 21 November, 2019; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: AAAI 2020

arXiv:1911.00658 [pdf, ps, other]

Global Adaptive Generative Adjustment

Authors: Bin Wang, Xiaofei Wang, Jianhua Guo

Abstract: Many traditional signal recovery approaches can behave well basing on the penalized likelihood. However, they have to meet with the difficulty in the selection of hyperparameters or tuning parameters in the penalties. In this article, we propose a global adaptive generative adjustment (GAGA) algorithm for signal recovery, in which multiple hyperpameters are automatically learned and alternatively… ▽ More Many traditional signal recovery approaches can behave well basing on the penalized likelihood. However, they have to meet with the difficulty in the selection of hyperparameters or tuning parameters in the penalties. In this article, we propose a global adaptive generative adjustment (GAGA) algorithm for signal recovery, in which multiple hyperpameters are automatically learned and alternatively updated with the signal. We further prove that the output of our algorithm directly guarantees the consistency of model selection and signal estimate. Moreover, we also propose a variant GAGA algorithm for improving the computational efficiency in the high-dimensional data analysis. Finally, in the simulated experiment, we consider the consistency of the outputs of our algorithms, and compare our algorithms to other penalized likelihood methods: the Adaptive LASSO, the SCAD and the MCP. The simulation results support the efficiency of our algorithms for signal recovery, and demonstrate that our algorithms outperform the other algorithms. △ Less

Submitted 16 November, 2022; v1 submitted 2 November, 2019; originally announced November 2019.

arXiv:1911.00623 [pdf, other]

On-Device Machine Learning: An Algorithms and Learning Theory Perspective

Authors: Sauptik Dhar, Junyao Guo, Jiayi Liu, Samarth Tripathi, Unmesh Kurup, Mohak Shah

Abstract: The predominant paradigm for using machine learning models on a device is to train a model in the cloud and perform inference using the trained model on the device. However, with increasing number of smart devices and improved hardware, there is interest in performing model training on the device. Given this surge in interest, a comprehensive survey of the field from a device-agnostic perspective… ▽ More The predominant paradigm for using machine learning models on a device is to train a model in the cloud and perform inference using the trained model on the device. However, with increasing number of smart devices and improved hardware, there is interest in performing model training on the device. Given this surge in interest, a comprehensive survey of the field from a device-agnostic perspective sets the stage for both understanding the state-of-the-art and for identifying open challenges and future avenues of research. However, on-device learning is an expansive field with connections to a large number of related topics in AI and machine learning (including online learning, model adaptation, one/few-shot learning, etc.). Hence, covering such a large number of topics in a single survey is impractical. This survey finds a middle ground by reformulating the problem of on-device learning as resource constrained learning where the resources are compute and memory. This reformulation allows tools, techniques, and algorithms from a wide variety of research areas to be compared equitably. In addition to summarizing the state-of-the-art, the survey also identifies a number of challenges and next steps for both the algorithmic and theoretical aspects of on-device learning. △ Less

Submitted 24 July, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

Comments: Edge Learning, TinyML, Resource Constrained Machine Learning, Deep learning on device, Statistical Learning Theory, 45 pages survey

arXiv:1910.14056 [pdf, other]

Unsupervised Star Galaxy Classification with Cascade Variational Auto-Encoder

Authors: Hao Sun, Jiadong Guo, Edward J. Kim, Robert J. Brunner

Abstract: The increasing amount of data in astronomy provides great challenges for machine learning research. Previously, supervised learning methods achieved satisfactory recognition accuracy for the star-galaxy classification task, based on manually labeled data set. In this work, we propose a novel unsupervised approach for the star-galaxy recognition task, namely Cascade Variational Auto-Encoder (CasVAE… ▽ More The increasing amount of data in astronomy provides great challenges for machine learning research. Previously, supervised learning methods achieved satisfactory recognition accuracy for the star-galaxy classification task, based on manually labeled data set. In this work, we propose a novel unsupervised approach for the star-galaxy recognition task, namely Cascade Variational Auto-Encoder (CasVAE). Our empirical results show our method outperforms the baseline model in both accuracy and stability. △ Less

Submitted 30 October, 2019; originally announced October 2019.

arXiv:1910.13728 [pdf, other]

Structure of Deep Neural Networks with a Priori Information in Wireless Tasks

Authors: Jia Guo, Chenyang Yang

Abstract: Deep neural networks (DNNs) have been employed for designing wireless networks in many aspects, such as transceiver optimization, resource allocation, and information prediction. Existing works either use fully-connected DNN or the DNNs with specific structures that are designed in other domains. In this paper, we show that a priori information widely existed in wireless tasks is permutation invar… ▽ More Deep neural networks (DNNs) have been employed for designing wireless networks in many aspects, such as transceiver optimization, resource allocation, and information prediction. Existing works either use fully-connected DNN or the DNNs with specific structures that are designed in other domains. In this paper, we show that a priori information widely existed in wireless tasks is permutation invariant. For these tasks, we propose a DNN with special structure, where the weight matrices between layers of the DNN only consist of two smaller sub-matrices. By such way of parameter sharing, the number of model parameters reduces, giving rise to low sample and computational complexity for training a DNN. We take predictive resource allocation as an example to show how the designed DNN can be applied for learning the optimal policy with unsupervised learning. Simulations results validate our analysis and show dramatic gain of the proposed structure in terms of reducing training complexity. △ Less

Submitted 6 November, 2019; v1 submitted 30 October, 2019; originally announced October 2019.

Comments: 6 pages, 2 figures, Submitted to ICC 2020

arXiv:1910.08892 [pdf, other]

Bayesian Symbolic Regression

Authors: Ying Jin, Weilin Fu, Jian Kang, Jiadong Guo, Jian Guo

Abstract: Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression (SR) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making… ▽ More Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression (SR) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making it infeasible for enumeration. Genetic programming (GP) has been traditionally and commonly used in SR to search for the optimal solution, but it suffers from several limitations, e.g. the difficulty in incorporating prior knowledge; overly-complicated output expression and reduced interpretability etc. To address these issues, we propose a new method to fit SR under a Bayesian framework. Firstly, Bayesian model can naturally incorporate prior knowledge (e.g., preference of basis functions, operators and raw features) to improve the efficiency of fitting SR. Secondly, to improve interpretability of expressions in SR, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed SR model could be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo (MCMC) algorithm. Finally, compared with GP, the proposed BSR(Bayesian Symbolic Regression) method saves computer memory with no need to keep an updated 'genome pool'. Numerical experiments show that, compared with GP, the solutions of BSR are closer to the ground truth and the expressions are more concise. Meanwhile we find the solution of BSR is robust to hyper-parameter specifications such as the number of trees. △ Less

Submitted 15 January, 2020; v1 submitted 20 October, 2019; originally announced October 2019.

arXiv:1910.02672 [pdf, other]

Multi-label Detection and Classification of Red Blood Cells in Microscopic Images

Authors: Wei Qiu, Jiaming Guo, Xiang Li, Mengjia Xu, Mo Zhang, Ning Guo, Quanzheng Li

Abstract: Cell detection and cell type classification from biomedical images play an important role for high-throughput imaging and various clinical application. While classification of single cell sample can be performed with standard computer vision and machine learning methods, analysis of multi-label samples (region containing congregating cells) is more challenging, as separation of individual cells ca… ▽ More Cell detection and cell type classification from biomedical images play an important role for high-throughput imaging and various clinical application. While classification of single cell sample can be performed with standard computer vision and machine learning methods, analysis of multi-label samples (region containing congregating cells) is more challenging, as separation of individual cells can be difficult (e.g. touching cells) or even impossible (e.g. overlapping cells). As multi-instance images are common in analyzing Red Blood Cell (RBC) for Sickle Cell Disease (SCD) diagnosis, we develop and implement a multi-instance cell detection and classification framework to address this challenge. The framework firstly trains a region proposal model based on Region-based Convolutional Network (RCNN) to obtain bounding-boxes of regions potentially containing single or multiple cells from input microscopic images, which are extracted as image patches. High-level image features are then calculated from image patches through a pre-trained Convolutional Neural Network (CNN) with ResNet-50 structure. Using these image features inputs, six networks are then trained to make multi-label prediction of whether a given patch contains cells belonging to a specific cell type. As the six networks are trained with image patches consisting of both individual cells and touching/overlapping cells, they can effectively recognize cell types that are presented in multi-instance image samples. Finally, for the purpose of SCD testing, we train another machine learning classifier to predict whether the given image patch contains abnormal cell type based on outputs from the six networks. Testing result of the proposed framework shows that it can achieve good performance in automatic cell detection and classification. △ Less

Submitted 14 December, 2019; v1 submitted 7 October, 2019; originally announced October 2019.

Comments: Wei Qiu, Jiaming Guo and Xiang Li contributed equally

arXiv:1910.00185 [pdf, other]

doi 10.1109/BigData47090.2019.9005971

Predicting Alzheimer's Disease by Hierarchical Graph Convolution from Positron Emission Tomography Imaging

Authors: Jiaming Guo, Wei Qiu, Xiang Li, Xuandong Zhao, Ning Guo, Quanzheng Li

Abstract: Imaging-based early diagnosis of Alzheimer Disease (AD) has become an effective approach, especially by using nuclear medicine imaging techniques such as Positron Emission Topography (PET). In various literature it has been found that PET images can be better modeled as signals (e.g. uptake of florbetapir) defined on a network (non-Euclidean) structure which is governed by its underlying graph pat… ▽ More Imaging-based early diagnosis of Alzheimer Disease (AD) has become an effective approach, especially by using nuclear medicine imaging techniques such as Positron Emission Topography (PET). In various literature it has been found that PET images can be better modeled as signals (e.g. uptake of florbetapir) defined on a network (non-Euclidean) structure which is governed by its underlying graph patterns of pathological progression and metabolic connectivity. In order to effectively apply deep learning framework for PET image analysis to overcome its limitation on Euclidean grid, we develop a solution for 3D PET image representation and analysis under a generalized, graph-based CNN architecture (PETNet), which analyzes PET signals defined on a group-wise inferred graph structure. Computations in PETNet are defined in non-Euclidean, graph (network) domain, as it performs feature extraction by convolution operations on spectral-filtered signals on the graph and pooling operations based on hierarchical graph clustering. Effectiveness of the PETNet is evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, which shows improved performance over both deep learning and other machine learning-based methods. △ Less

Submitted 30 September, 2019; originally announced October 2019.

Comments: Jiaming Guo, Wei Qiu and Xiang Li contribute equally to this work

arXiv:1909.10710 [pdf, other]

Estimating Number of Factors by Adjusted Eigenvalues Thresholding

Authors: Jianqing Fan, Jianhua Guo, Shurong Zheng

Abstract: Determining the number of common factors is an important and practical topic in high dimensional factor models. The existing literatures are mainly based on the eigenvalues of the covariance matrix. Due to the incomparability of the eigenvalues of the covariance matrix caused by heterogeneous scales of observed variables, it is very difficult to give an accurate relationship between these eigenval… ▽ More Determining the number of common factors is an important and practical topic in high dimensional factor models. The existing literatures are mainly based on the eigenvalues of the covariance matrix. Due to the incomparability of the eigenvalues of the covariance matrix caused by heterogeneous scales of observed variables, it is very difficult to give an accurate relationship between these eigenvalues and the number of common factors. To overcome this limitation, we appeal to the correlation matrix and show surprisingly that the number of eigenvalues greater than $1$ of population correlation matrix is the same as the number of common factors under some mild conditions. To utilize such a relationship, we study the random matrix theory based on the sample correlation matrix in order to correct the biases in estimating the top eigenvalues and to take into account of estimation errors in eigenvalue estimation. This leads us to propose adjusted correlation thresholding (ACT) for determining the number of common factors in high dimensional factor models, taking into account the sampling variabilities and biases of top sample eigenvalues. We also establish the optimality of the proposed methods in terms of minimal signal strength and optimal threshold. Simulation studies lend further support to our proposed method and show that our estimator outperforms other competing methods in most of our testing cases. △ Less

Submitted 24 September, 2019; originally announced September 2019.

Comments: 35 pages; 4 figures

arXiv:1909.04266 [pdf, other]

Wasserstein Collaborative Filtering for Item Cold-start Recommendation

Authors: Yitong Meng, Guangyong Chen, Benben Liao, Jun Guo, Weiwen Liu

Abstract: The item cold-start problem seriously limits the recommendation performance of Collaborative Filtering (CF) methods when new items have either none or very little interactions. To solve this issue, many modern Internet applications propose to predict a new item's interaction from the possessing contents. However, it is difficult to design and learn a map between the item's interaction history and… ▽ More The item cold-start problem seriously limits the recommendation performance of Collaborative Filtering (CF) methods when new items have either none or very little interactions. To solve this issue, many modern Internet applications propose to predict a new item's interaction from the possessing contents. However, it is difficult to design and learn a map between the item's interaction history and the corresponding contents. In this paper, we apply the Wasserstein distance to address the item cold-start problem. Given item content information, we can calculate the similarity between the interacted items and cold-start ones, so that a user's preference on cold-start items can be inferred by minimizing the Wasserstein distance between the distributions over these two types of items. We further adopt the idea of CF and propose Wasserstein CF (WCF) to improve the recommendation performance on cold-start items. Experimental results demonstrate the superiority of WCF over state-of-the-art approaches. △ Less

Submitted 9 September, 2019; originally announced September 2019.

arXiv:1909.04239 [pdf, other]

PMD: An Optimal Transportation-based User Distance for Recommender Systems

Authors: Yitong Meng, Xinyan Dai, Xiao Yan, James Cheng, Weiwen Liu, Benben Liao, Jun Guo, Guangyong Chen

Abstract: Collaborative filtering, a widely-used recommendation technique, predicts a user's preference by aggregating the ratings from similar users. As a result, these measures cannot fully utilize the rating information and are not suitable for real world sparse data. To solve these issues, we propose a novel user distance measure named Preference Mover's Distance (PMD) which makes full use of all rating… ▽ More Collaborative filtering, a widely-used recommendation technique, predicts a user's preference by aggregating the ratings from similar users. As a result, these measures cannot fully utilize the rating information and are not suitable for real world sparse data. To solve these issues, we propose a novel user distance measure named Preference Mover's Distance (PMD) which makes full use of all ratings made by each user. Our proposed PMD can properly measure the distance between a pair of users even if they have no co-rated items. We show that this measure can be cast as an instance of the Earth Mover's Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. Experimental results show that PMD can help achieve superior recommendation accuracy than state-of-the-art methods, especially when training data is very sparse. △ Less

Submitted 10 December, 2019; v1 submitted 9 September, 2019; originally announced September 2019.

Comments: This paper is accepted by European Conference on Information Retrieval (ECIR 2020)

arXiv:1908.05474 [pdf, other]

Adaptive Regularization of Labels

Authors: Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo, Shu-Tao Xia

Abstract: Recently, a variety of regularization techniques have been widely applied in deep neural networks, such as dropout, batch normalization, data augmentation, and so on. These methods mainly focus on the regularization of weight parameters to prevent overfitting effectively. In addition, label regularization techniques such as label smoothing and label disturbance have also been proposed with the mot… ▽ More Recently, a variety of regularization techniques have been widely applied in deep neural networks, such as dropout, batch normalization, data augmentation, and so on. These methods mainly focus on the regularization of weight parameters to prevent overfitting effectively. In addition, label regularization techniques such as label smoothing and label disturbance have also been proposed with the motivation of adding a stochastic perturbation to labels. In this paper, we propose a novel adaptive label regularization method, which enables the neural network to learn from the erroneous experience and update the optimal label representation online. On the other hand, compared with knowledge distillation, which learns the correlation of categories using teacher network, our proposed method requires only a minuscule increase in parameters without cumbersome teacher network. Furthermore, we evaluate our method on CIFAR-10/CIFAR-100/ImageNet datasets for image recognition tasks and AGNews/Yahoo/Yelp-Full datasets for text classification tasks. The empirical results show significant improvement under all experimental settings. △ Less

Submitted 15 August, 2019; originally announced August 2019.

arXiv:1907.04433 [pdf, other]

GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing

Authors: Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu

Abstract: We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customiza… ▽ More We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. The Apache 2.0 license has been adopted by GluonCV and GluonNLP to allow for software distribution, modification, and usage. △ Less

Submitted 12 February, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

Journal ref: Journal of Machine Learning Research 21 (2020) 1-7

arXiv:1906.04411 [pdf, other]

Evolutionary Trigger Set Generation for DNN Black-Box Watermarking

Authors: Jia Guo, Miodrag Potkonjak

Abstract: The commercialization of deep learning creates a compelling need for intellectual property (IP) protection. Deep neural network (DNN) watermarking has been proposed as a promising tool to help model owners prove ownership and fight piracy. A popular approach of watermarking is to train a DNN to recognize images with certain \textit{trigger} patterns. In this paper, we propose a novel evolutionary… ▽ More The commercialization of deep learning creates a compelling need for intellectual property (IP) protection. Deep neural network (DNN) watermarking has been proposed as a promising tool to help model owners prove ownership and fight piracy. A popular approach of watermarking is to train a DNN to recognize images with certain \textit{trigger} patterns. In this paper, we propose a novel evolutionary algorithm-based method to generate and optimize trigger patterns. Our method brings a siginificant reduction in false positive rates, leading to compelling proof of ownership. At the same time, it maintains the robustness of the watermark against attacks. We compare our method with the prior art and demonstrate its effectiveness on popular models and datasets. △ Less

Submitted 13 February, 2021; v1 submitted 11 June, 2019; originally announced June 2019.

Showing 1–50 of 83 results for author: Guo, J