Search | arXiv e-print repository

doi 10.1109/TCYB.2022.3168551

Co-Learning Bayesian Optimization

Authors: Zhendong Guo, Yew-Soon Ong, Tiantian He, Haitao Liu

Abstract: Bayesian optimization (BO) is well known to be sample-efficient for solving black-box problems. However, the BO algorithms can sometimes get stuck in suboptimal solutions even with plenty of samples. Intrinsically, such suboptimal problem of BO can attribute to the poor surrogate accuracy of the trained Gaussian process (GP), particularly that in the regions where the optimal solutions locate. Hen… ▽ More Bayesian optimization (BO) is well known to be sample-efficient for solving black-box problems. However, the BO algorithms can sometimes get stuck in suboptimal solutions even with plenty of samples. Intrinsically, such suboptimal problem of BO can attribute to the poor surrogate accuracy of the trained Gaussian process (GP), particularly that in the regions where the optimal solutions locate. Hence, we propose to build multiple GP models instead of a single GP surrogate to complement each other and thus resolving the suboptimal problem of BO. Nevertheless, according to the bias-variance tradeoff equation, the individual prediction errors can increase when increasing the diversity of models, which may lead to even worse overall surrogate accuracy. On the other hand, based on the theory of Rademacher complexity, it has been proved that exploiting the agreement of models on unlabeled information can help to reduce the complexity of the hypothesis space, and therefore achieving the required surrogate accuracy with fewer samples. Such value of model agreement has been extensively demonstrated for co-training style algorithms to boost model accuracy with a small portion of samples. Inspired by the above, we propose a novel BO algorithm labeled as co-learning BO (CLBO), which exploits both model diversity and agreement on unlabeled information to improve the overall surrogate accuracy with limited samples, and therefore achieving more efficient global optimization. Through tests on five numerical toy problems and three engineering benchmarks, the effectiveness of proposed CLBO has been well demonstrated. △ Less

Submitted 22 January, 2025; originally announced January 2025.

Journal ref: in IEEE Transactions on Cybernetics, vol. 52, no. 9, pp. 9820-9833, Sept. 2022

arXiv:2411.02467 [pdf, other]

Towards Harmless Rawlsian Fairness Regardless of Demographic Prior

Authors: Xuanqian Wang, Jing Li, Ivor W. Tsang, Yew-Soon Ong

Abstract: Due to privacy and security concerns, recent advancements in group fairness advocate for model training regardless of demographic information. However, most methods still require prior knowledge of demographics. In this study, we explore the potential for achieving fairness without compromising its utility when no prior demographics are provided to the training set, namely \emph{harmless Rawlsian… ▽ More Due to privacy and security concerns, recent advancements in group fairness advocate for model training regardless of demographic information. However, most methods still require prior knowledge of demographics. In this study, we explore the potential for achieving fairness without compromising its utility when no prior demographics are provided to the training set, namely \emph{harmless Rawlsian fairness}. We ascertain that such a fairness requirement with no prior demographic information essential promotes training losses to exhibit a Dirac delta distribution. To this end, we propose a simple but effective method named VFair to minimize the variance of training losses inside the optimal set of empirical losses. This problem is then optimized by a tailored dynamic update approach that operates in both loss and gradient dimensions, directing the model towards relatively fairer solutions while preserving its intact utility. Our experimental findings indicate that regression tasks, which are relatively unexplored from literature, can achieve significant fairness improvement through VFair regardless of any prior, whereas classification tasks usually do not because of their quantized utility measurements. The implementation of our method is publicly available at \url{https://github.com/wxqpxw/VFair}. △ Less

Submitted 8 November, 2024; v1 submitted 4 November, 2024; originally announced November 2024.

Comments: Accepted as a Poster in Neurips 2024

arXiv:2406.00812 [pdf, other]

Covariance-Adaptive Sequential Black-box Optimization for Diffusion Targeted Generation

Authors: Yueming Lyu, Kim Yong Tan, Yew Soon Ong, Ivor W. Tsang

Abstract: Diffusion models have demonstrated great potential in generating high-quality content for images, natural language, protein domains, etc. However, how to perform user-preferred targeted generation via diffusion models with only black-box target scores of users remains challenging. To address this issue, we first formulate the fine-tuning of the targeted reserve-time stochastic differential equatio… ▽ More Diffusion models have demonstrated great potential in generating high-quality content for images, natural language, protein domains, etc. However, how to perform user-preferred targeted generation via diffusion models with only black-box target scores of users remains challenging. To address this issue, we first formulate the fine-tuning of the targeted reserve-time stochastic differential equation (SDE) associated with a pre-trained diffusion model as a sequential black-box optimization problem. Furthermore, we propose a novel covariance-adaptive sequential optimization algorithm to optimize cumulative black-box scores under unknown transition dynamics. Theoretically, we prove a $O(\frac{d^2}{\sqrt{T}})$ convergence rate for cumulative convex functions without smooth and strongly convex assumptions. Empirically, experiments on both numerical test problems and target-guided 3D-molecule generation tasks show the superior performance of our method in achieving better target scores. △ Less

Submitted 8 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2202.12636 [pdf, other]

Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains

Authors: Haitao Liu, Kai Wu, Yew-Soon Ong, Chao Bian, Xiaomo Jiang, Xiaofang Wang

Abstract: Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper present… ▽ More Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) model for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem. △ Less

Submitted 18 June, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2008.12922 [pdf, other]

Modulating Scalable Gaussian Processes for Expressive Statistical Learning

Authors: Haitao Liu, Yew-Soon Ong, Xiaomo Jiang, Xiaofang Wang

Abstract: For a learning task, Gaussian process (GP) is interested in learning the statistical relationship between inputs and outputs, since it offers not only the prediction mean but also the associated variability. The vanilla GP however struggles to learn complicated distribution with the property of, e.g., heteroscedastic noise, multi-modality and non-stationarity, from massive data due to the Gaussian… ▽ More For a learning task, Gaussian process (GP) is interested in learning the statistical relationship between inputs and outputs, since it offers not only the prediction mean but also the associated variability. The vanilla GP however struggles to learn complicated distribution with the property of, e.g., heteroscedastic noise, multi-modality and non-stationarity, from massive data due to the Gaussian marginal and the cubic complexity. To this end, this article studies new scalable GP paradigms including the non-stationary heteroscedastic GP, the mixture of GPs and the latent GP, which introduce additional latent variables to modulate the outputs or inputs in order to learn richer, non-Gaussian statistical representation. We further resort to different variational inference strategies to arrive at analytical or tighter evidence lower bounds (ELBOs) of the marginal likelihood for efficient and effective model training. Extensive numerical experiments against state-of-the-art GP and neural network (NN) counterparts on various tasks verify the superiority of these scalable modulated GPs, especially the scalable latent GP, for learning diverse data distributions. △ Less

Submitted 29 August, 2020; originally announced August 2020.

Comments: 31 pages, 9 figures, 4 tables, preprint under review

arXiv:2008.06199 [pdf, other]

Adversary Agnostic Robust Deep Reinforcement Learning

Authors: Xinghua Qu, Yew-Soon Ong, Abhishek Gupta, Zhu Sun

Abstract: Deep reinforcement learning (DRL) policies have been shown to be deceived by perturbations (e.g., random noise or intensional adversarial attacks) on state observations that appear at test time but are unknown during training. To increase the robustness of DRL policies, previous approaches assume that the knowledge of adversaries can be added into the training process to achieve the corresponding… ▽ More Deep reinforcement learning (DRL) policies have been shown to be deceived by perturbations (e.g., random noise or intensional adversarial attacks) on state observations that appear at test time but are unknown during training. To increase the robustness of DRL policies, previous approaches assume that the knowledge of adversaries can be added into the training process to achieve the corresponding generalization ability on these perturbed observations. However, such an assumption not only makes the robustness improvement more expensive but may also leave a model less effective to other kinds of attacks in the wild. In contrast, we propose an adversary agnostic robust DRL paradigm that does not require learning from adversaries. To this end, we first theoretically derive that robustness could indeed be achieved independently of the adversaries based on a policy distillation setting. Motivated by this finding, we propose a new policy distillation loss with two terms: 1) a prescription gap maximization loss aiming at simultaneously maximizing the likelihood of the action selected by the teacher policy and the entropy over the remaining actions; 2) a corresponding Jacobian regularization loss that minimizes the magnitude of the gradient with respect to the input state. The theoretical analysis shows that our distillation loss guarantees to increase the prescription gap and the adversarial robustness. Furthermore, experiments on five Atari games firmly verify the superiority of our approach in terms of boosting adversarial robustness compared to other state-of-the-art methods. △ Less

Submitted 24 December, 2020; v1 submitted 14 August, 2020; originally announced August 2020.

arXiv:2005.08467 [pdf, other]

Deep Latent-Variable Kernel Learning

Authors: Haitao Liu, Yew-Soon Ong, Xiaomo Jiang, Xiaofang Wang

Abstract: Deep kernel learning (DKL) leverages the connection between Gaussian process (GP) and neural networks (NN) to build an end-to-end, hybrid model. It combines the capability of NN to learn rich representations under massive data and the non-parametric property of GP to achieve automatic regularization that incorporates a trade-off between model fit and model complexity. However, the deterministic en… ▽ More Deep kernel learning (DKL) leverages the connection between Gaussian process (GP) and neural networks (NN) to build an end-to-end, hybrid model. It combines the capability of NN to learn rich representations under massive data and the non-parametric property of GP to achieve automatic regularization that incorporates a trade-off between model fit and model complexity. However, the deterministic encoder may weaken the model regularization of the following GP part, especially on small datasets, due to the free latent representation. We therefore present a complete deep latent-variable kernel learning (DLVKL) model wherein the latent variables perform stochastic encoding for regularized representation. We further enhance the DLVKL from two aspects: (i) the expressive variational posterior through neural stochastic differential equation (NSDE) to improve the approximation quality, and (ii) the hybrid prior taking knowledge from both the SDE prior and the posterior to arrive at a flexible trade-off. Intensive experiments imply that the DLVKL-NSDE performs similarly to the well calibrated GP on small datasets, and outperforms existing deep GPs on large datasets. △ Less

Submitted 19 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 13 pages, 8 figures, preprint under review

arXiv:2004.13303 [pdf, other]

Heterogeneous Representation Learning: A Review

Authors: Joey Tianyi Zhou, Xi Peng, Yew-Soon Ong

Abstract: The real-world data usually exhibits heterogeneous properties such as modalities, views, or resources, which brings some unique challenges wherein the key is Heterogeneous Representation Learning (HRL) termed in this paper. This brief survey covers the topic of HRL, centered around several major learning settings and real-world applications. First of all, from the mathematical perspective, we pres… ▽ More The real-world data usually exhibits heterogeneous properties such as modalities, views, or resources, which brings some unique challenges wherein the key is Heterogeneous Representation Learning (HRL) termed in this paper. This brief survey covers the topic of HRL, centered around several major learning settings and real-world applications. First of all, from the mathematical perspective, we present a unified learning framework which is able to model most existing learning settings with the heterogeneous inputs. After that, we conduct a comprehensive discussion on the HRL framework by reviewing some selected learning problems along with the mathematics perspectives, including multi-view learning, heterogeneous transfer learning, Learning using privileged information and heterogeneous multi-task learning. For each learning task, we also discuss some applications under these learning problems and instantiates the terms in the mathematical framework. Finally, we highlight the challenges that are less-touched in HRL and present future research directions. To the best of our knowledge, there is no such framework to unify these heterogeneous problems, and this survey would benefit the community. △ Less

Submitted 30 April, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

arXiv:2001.01051 [pdf, other]

Temporal Tensor Transformation Network for Multivariate Time Series Prediction

Authors: Yuya Jeremy Ong, Mu Qiao, Divyesh Jadav

Abstract: Multivariate time series prediction has applications in a wide variety of domains and is considered to be a very challenging task, especially when the variables have correlations and exhibit complex temporal patterns, such as seasonality and trend. Many existing methods suffer from strong statistical assumptions, numerical issues with high dimensionality, manual feature engineering efforts, and sc… ▽ More Multivariate time series prediction has applications in a wide variety of domains and is considered to be a very challenging task, especially when the variables have correlations and exhibit complex temporal patterns, such as seasonality and trend. Many existing methods suffer from strong statistical assumptions, numerical issues with high dimensionality, manual feature engineering efforts, and scalability. In this work, we present a novel deep learning architecture, known as Temporal Tensor Transformation Network, which transforms the original multivariate time series into a higher order of tensor through the proposed Temporal-Slicing Stack Transformation. This yields a new representation of the original multivariate time series, which enables the convolution kernel to extract complex and non-linear features as well as variable interactional signals from a relatively large temporal region. Experimental results show that Temporal Tensor Transformation Network outperforms several state-of-the-art methods on window-based predictions across various tasks. The proposed architecture also demonstrates robust prediction performance through an extensive sensitivity analysis. △ Less

Submitted 4 January, 2020; originally announced January 2020.

arXiv:1911.07693 [pdf, ps, other]

A Multi-Task Gradient Descent Method for Multi-Label Learning

Authors: Lu Bai, Yew-Soon Ong, Tiantian He, Abhishek Gupta

Abstract: Multi-label learning studies the problem where an instance is associated with a set of labels. By treating single-label learning problem as one task, the multi-label learning problem can be casted as solving multiple related tasks simultaneously. In this paper, we propose a novel Multi-task Gradient Descent (MGD) algorithm to solve a group of related tasks simultaneously. In the proposed algorithm… ▽ More Multi-label learning studies the problem where an instance is associated with a set of labels. By treating single-label learning problem as one task, the multi-label learning problem can be casted as solving multiple related tasks simultaneously. In this paper, we propose a novel Multi-task Gradient Descent (MGD) algorithm to solve a group of related tasks simultaneously. In the proposed algorithm, each task minimizes its individual cost function using reformative gradient descent, where the relations among the tasks are facilitated through effectively transferring model parameter values across multiple tasks. Theoretical analysis shows that the proposed algorithm is convergent with a proper transfer mechanism. Compared with the existing approaches, MGD is easy to implement, has less requirement on the training model, can achieve seamless asymmetric transformation such that negative transfer is mitigated, and can benefit from parallel computing when the number of tasks is large. The competitive experimental results on multi-label learning datasets validate the effectiveness of the proposed algorithm. △ Less

Submitted 19 November, 2019; v1 submitted 18 November, 2019; originally announced November 2019.

arXiv:1911.03849 [pdf, other]

Minimalistic Attacks: How Little it Takes to Fool a Deep Reinforcement Learning Policy

Authors: Xinghua Qu, Zhu Sun, Yew-Soon Ong, Abhishek Gupta, Pengfei Wei

Abstract: Recent studies have revealed that neural network-based policies can be easily fooled by adversarial examples. However, while most prior works analyze the effects of perturbing every pixel of every frame assuming white-box policy access, in this paper we take a more restrictive view towards adversary generation - with the goal of unveiling the limits of a model's vulnerability. In particular, we ex… ▽ More Recent studies have revealed that neural network-based policies can be easily fooled by adversarial examples. However, while most prior works analyze the effects of perturbing every pixel of every frame assuming white-box policy access, in this paper we take a more restrictive view towards adversary generation - with the goal of unveiling the limits of a model's vulnerability. In particular, we explore minimalistic attacks by defining three key settings: (1) black-box policy access: where the attacker only has access to the input (state) and output (action probability) of an RL policy; (2) fractional-state adversary: where only several pixels are perturbed, with the extreme case being a single-pixel adversary; and (3) tactically-chanced attack: where only significant frames are tactically chosen to be attacked. We formulate the adversarial attack by accommodating the three key settings and explore their potency on six Atari games by examining four fully trained state-of-the-art policies. In Breakout, for example, we surprisingly find that: (i) all policies showcase significant performance degradation by merely modifying 0.01% of the input state, and (ii) the policy trained by DQN is totally deceived by perturbation to only 1% frames. △ Less

Submitted 29 October, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: Accepted by IEEE Transactions on Cognitive and Developmental System

arXiv:1910.04062 [pdf, other]

DEVDAN: Deep Evolving Denoising Autoencoder

Authors: Andri Ashfahani, Mahardhika Pratama, Edwin Lughofer, Yew Soon Ong

Abstract: The Denoising Autoencoder (DAE) enhances the flexibility of the data stream method in exploiting unlabeled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves an in-depth study because it characterizes a fixed network capacity that cannot adapt to rapidly changing environments. Deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. It features an open s… ▽ More The Denoising Autoencoder (DAE) enhances the flexibility of the data stream method in exploiting unlabeled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves an in-depth study because it characterizes a fixed network capacity that cannot adapt to rapidly changing environments. Deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. It features an open structure in the generative phase and the discriminative phase where the hidden units can be automatically added and discarded on the fly. The generative phase refines the predictive performance of the discriminative model exploiting unlabeled data. Furthermore, DEVDAN is free of the problem-specific threshold and works fully in the single-pass learning fashion. We show that DEVDAN can find competitive network architecture compared with state-of-the-art methods on the classification task using ten prominent datasets simulated under the prequential test-then-train protocol. △ Less

Submitted 9 January, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: This paper has been accepted for publication in Neurocomputing 2019. arXiv admin note: substantial text overlap with arXiv:1809.09081

arXiv:1910.03437 [pdf, other]

Automatic Construction of Multi-layer Perceptron Network from Streaming Examples

Authors: Mahardhika Pratama, Choiru Za'in, Andri Ashfahani, Yew Soon Ong, Weiping Ding

Abstract: Autonomous construction of deep neural network (DNNs) is desired for data streams because it potentially offers two advantages: proper model's capacity and quick reaction to drift and shift. While the self-organizing mechanism of DNNs remains an open issue, this task is even more challenging to be developed for standard multi-layer DNNs than that using the different-depth structures, because the a… ▽ More Autonomous construction of deep neural network (DNNs) is desired for data streams because it potentially offers two advantages: proper model's capacity and quick reaction to drift and shift. While the self-organizing mechanism of DNNs remains an open issue, this task is even more challenging to be developed for standard multi-layer DNNs than that using the different-depth structures, because the addition of a new layer results in information loss of previously trained knowledge. A Neural Network with Dynamically Evolved Capacity (NADINE) is proposed in this paper. NADINE features a fully open structure where its network structure, depth and width, can be automatically evolved from scratch in an online manner and without the use of problem-specific thresholds. NADINE is structured under a standard MLP architecture and the catastrophic forgetting issue during the hidden layer addition phase is resolved using the proposal of soft-forgetting and adaptive memory methods. The advantage of NADINE, namely elastic structure and online learning trait, is numerically validated using nine data stream classification and regression problems where it demonstrates performance improvement over prominent algorithms in all problems. In addition, it is capable of dealing with data stream regression and classification problems equally well. △ Less

Submitted 9 January, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: This paper has been accepted for publication in CIKM 2019

arXiv:1909.06541 [pdf, other]

Scalable Gaussian Process Classification with Additive Noise for Various Likelihoods

Authors: Haitao Liu, Yew-Soon Ong, Ziwei Yu, Jianfei Cai, Xiaobo Shen

Abstract: Gaussian process classification (GPC) provides a flexible and powerful statistical framework describing joint distributions over function space. Conventional GPCs however suffer from (i) poor scalability for big data due to the full kernel matrix, and (ii) intractable inference due to the non-Gaussian likelihoods. Hence, various scalable GPCs have been proposed through (i) the sparse approximation… ▽ More Gaussian process classification (GPC) provides a flexible and powerful statistical framework describing joint distributions over function space. Conventional GPCs however suffer from (i) poor scalability for big data due to the full kernel matrix, and (ii) intractable inference due to the non-Gaussian likelihoods. Hence, various scalable GPCs have been proposed through (i) the sparse approximation built upon a small inducing set to reduce the time complexity; and (ii) the approximate inference to derive analytical evidence lower bound (ELBO). However, these scalable GPCs equipped with analytical ELBO are limited to specific likelihoods or additional assumptions. In this work, we present a unifying framework which accommodates scalable GPCs using various likelihoods. Analogous to GP regression (GPR), we introduce additive noises to augment the probability space for (i) the GPCs with step, (multinomial) probit and logit likelihoods via the internal variables; and particularly, (ii) the GPC using softmax likelihood via the noise variables themselves. This leads to unified scalable GPCs with analytical ELBO by using variational inference. Empirically, our GPCs showcase better results than state-of-the-art scalable GPCs for extensive binary/multi-class classification tasks with up to two million data points. △ Less

Submitted 14 September, 2019; originally announced September 2019.

Comments: 11 pages, 5 figures, preprint under review

arXiv:1905.09877 [pdf, other]

CASS: Cross Adversarial Source Separation via Autoencoder

Authors: Yong Zheng Ong, Charles K. Chui, Haizhao Yang

Abstract: This paper introduces a cross adversarial source separation (CASS) framework via autoencoder, a new model that aims at separating an input signal consisting of a mixture of multiple components into individual components defined via adversarial learning and autoencoder fitting. CASS unifies popular generative networks like auto-encoders (AEs) and generative adversarial networks (GANs) in a single f… ▽ More This paper introduces a cross adversarial source separation (CASS) framework via autoencoder, a new model that aims at separating an input signal consisting of a mixture of multiple components into individual components defined via adversarial learning and autoencoder fitting. CASS unifies popular generative networks like auto-encoders (AEs) and generative adversarial networks (GANs) in a single framework. The basic building block that filters the input signal and reconstructs the $i$-th target component is a pair of deep neural networks $\mathcal{EN}_i$ and $\mathcal{DE}_i$ as an encoder for dimension reduction and a decoder for component reconstruction, respectively. The decoder $\mathcal{DE}_i$ as a generator is enhanced by a discriminator network $\mathcal{D}_i$ that favors signal structures of the $i$-th component in the $i$-th given dataset as guidance through adversarial learning. In contrast with existing practices in AEs which trains each Auto-Encoder independently, or in GANs that share the same generator, we introduce cross adversarial training that emphasizes adversarial relation between any arbitrary network pairs $(\mathcal{DE}_i,\mathcal{D}_j)$, achieving state-of-the-art performance especially when target components share similar data structures. △ Less

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1901.00248 [pdf, other]

A Survey on Multi-output Learning

Authors: Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, Xiaobo Shen

Abstract: Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. Inspired by big data, the 4Vs characteristics of multi-output imposes a set of challenges to multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. Incr… ▽ More Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. Inspired by big data, the 4Vs characteristics of multi-output imposes a set of challenges to multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. Increasing number of works in the literature have been devoted to the study of multi-output learning and the development of novel approaches for addressing the challenges encountered. However, it lacks a comprehensive overview on different types of challenges of multi-output learning brought by the characteristics of the multiple outputs and the techniques proposed to overcome the challenges. This paper thus attempts to fill in this gap to provide a comprehensive review on this area. We first introduce different stages of the life cycle of the output labels. Then we present the paradigm on multi-output learning, including its myriads of output structures, definitions of its different sub-problems, model evaluation metrics and popular data repositories used in the study. Subsequently, we review a number of state-of-the-art multi-output learning methods, which are categorized based on the challenges. △ Less

Submitted 13 October, 2019; v1 submitted 1 January, 2019; originally announced January 2019.

Comments: Paper accepted by IEEE Transactions on Neural Networks and Learning Systems

arXiv:1811.01179 [pdf, other]

Large-scale Heteroscedastic Regression via Gaussian Process

Authors: Haitao Liu, Yew-Soon Ong, Jianfei Cai

Abstract: Heteroscedastic regression considering the varying noises among observations has many applications in the fields like machine learning and statistics. Here we focus on the heteroscedastic Gaussian process (HGP) regression which integrates the latent function and the noise function together in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from the c… ▽ More Heteroscedastic regression considering the varying noises among observations has many applications in the fields like machine learning and statistics. Here we focus on the heteroscedastic Gaussian process (HGP) regression which integrates the latent function and the noise function together in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from the cubic time complexity, which strictly limits its application to big data. To improve the scalability, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, two variants are developed to improve the scalability and capability of VSHGP. The first is stochastic VSHGP (SVSHGP) which derives a factorized evidence lower bound, thus enhancing efficient stochastic variational inference. The second is distributed VSHGP (DVSHGP) which (i) follows the Bayesian committee machine formalism to distribute computations over multiple local VSHGP experts with many inducing points; and (ii) adopts hybrid parameters for experts to guard against over-fitting and capture local variety. The superiority of DVSHGP and SVSHGP as compared to existing scalable heteroscedastic/homoscedastic GPs is then extensively verified on various datasets. △ Less

Submitted 21 January, 2020; v1 submitted 3 November, 2018; originally announced November 2018.

Comments: 14 pages, 15 figures

arXiv:1811.01159 [pdf, ps, other]

Understanding and Comparing Scalable Gaussian Process Regression for Big Data

Authors: Haitao Liu, Jianfei Cai, Yew-Soon Ong, Yi Wang

Abstract: As a non-parametric Bayesian model which produces informative predictive distribution, Gaussian process (GP) has been widely used in various fields, like regression, classification and optimization. The cubic complexity of standard GP however leads to poor scalability, which poses challenges in the era of big data. Hence, various scalable GPs have been developed in the literature in order to impro… ▽ More As a non-parametric Bayesian model which produces informative predictive distribution, Gaussian process (GP) has been widely used in various fields, like regression, classification and optimization. The cubic complexity of standard GP however leads to poor scalability, which poses challenges in the era of big data. Hence, various scalable GPs have been developed in the literature in order to improve the scalability while retaining desirable prediction accuracy. This paper devotes to investigating the methodological characteristics and performance of representative global and local scalable GPs including sparse approximations and local aggregations from four main perspectives: scalability, capability, controllability and robustness. The numerical experiments on two toy examples and five real-world datasets with up to 250K points offer the following findings. In terms of scalability, most of the scalable GPs own a time complexity that is linear to the training size. In terms of capability, the sparse approximations capture the long-term spatial correlations, the local aggregations capture the local patterns but suffer from over-fitting in some scenarios. In terms of controllability, we could improve the performance of sparse approximations by simply increasing the inducing size. But this is not the case for local aggregations. In terms of robustness, local aggregations are robust to various initializations of hyperparameters due to the local attention mechanism. Finally, we highlight that the proper hybrid of global and local scalable GPs may be a promising way to improve both the model capability and scalability for big data. △ Less

Submitted 3 November, 2018; originally announced November 2018.

Comments: 25 pages, 15 figures, preprint submitted to KBS

arXiv:1809.09081 [pdf, other]

Autonomous Deep Learning: Incremental Learning of Denoising Autoencoder for Evolving Data Streams

Authors: Mahardhika Pratama, Andri Ashfahani, Yew Soon Ong, Savitha Ramasamy, Edwin Lughofer

Abstract: The generative learning phase of Autoencoder (AE) and its successor Denosing Autoencoder (DAE) enhances the flexibility of data stream method in exploiting unlabelled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves in-depth study because it characterizes a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoi… ▽ More The generative learning phase of Autoencoder (AE) and its successor Denosing Autoencoder (DAE) enhances the flexibility of data stream method in exploiting unlabelled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves in-depth study because it characterizes a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoising autoeconder, namely deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. DEVDAN features an open structure both in the generative phase and in the discriminative phase where input features can be automatically added and discarded on the fly. A network significance (NS) method is formulated in this paper and is derived from the bias-variance concept. This method is capable of estimating the statistical contribution of the network structure and its hidden units which precursors an ideal state to add or prune input features. Furthermore, DEVDAN is free of the problem- specific threshold and works fully in the single-pass learning fashion. The efficacy of DEVDAN is numerically validated using nine non-stationary data stream problems simulated under the prequential test-then-train protocol where DEVDAN is capable of delivering an improvement of classification accuracy to recently published online learning works while having flexibility in the automatic extraction of robust input features and in adapting to rapidly changing environments. △ Less

Submitted 24 September, 2018; originally announced September 2018.

Comments: have been submitted to AAAI 2019 conference

arXiv:1807.01065 [pdf, ps, other]

When Gaussian Process Meets Big Data: A Review of Scalable GPs

Authors: Haitao Liu, Yew-Soon Ong, Xiaobo Shen, Jianfei Cai

Abstract: The vast quantity of information brought by big data as well as the evolving computer hardware encourages success stories in the machine learning community. In the meanwhile, it poses challenges for the Gaussian process (GP) regression, a well-known non-parametric and interpretable Bayesian model, which suffers from cubic complexity to data size. To improve the scalability while retaining desirabl… ▽ More The vast quantity of information brought by big data as well as the evolving computer hardware encourages success stories in the machine learning community. In the meanwhile, it poses challenges for the Gaussian process (GP) regression, a well-known non-parametric and interpretable Bayesian model, which suffers from cubic complexity to data size. To improve the scalability while retaining desirable prediction quality, a variety of scalable GPs have been presented. But they have not yet been comprehensively reviewed and analyzed in order to be well understood by both academia and industry. The review of scalable GPs in the GP community is timely and important due to the explosion of data size. To this end, this paper is devoted to the review on state-of-the-art scalable GPs involving two main categories: global approximations which distillate the entire data and local approximations which divide the data for subspace learning. Particularly, for global approximations, we mainly focus on sparse approximations comprising prior approximations which modify the prior but perform exact inference, posterior approximations which retain exact prior but perform approximate inference, and structured sparse approximations which exploit specific structures in kernel matrix; for local approximations, we highlight the mixture/product of experts that conducts model averaging from multiple local experts to boost predictions. To present a complete review, recent advances for improving the scalability and capability of scalable GPs are reviewed. Finally, the extensions and open issues regarding the implementation of scalable GPs in various scenarios are reviewed and discussed to inspire novel ideas for future research avenues. △ Less

Submitted 9 April, 2019; v1 submitted 3 July, 2018; originally announced July 2018.

Comments: 20 pages, 6 figures

arXiv:1806.00720 [pdf, ps, other]

Generalized Robust Bayesian Committee Machine for Large-scale Gaussian Process Regression

Authors: Haitao Liu, Jianfei Cai, Yi Wang, Yew-Soon Ong

Abstract: In order to scale standard Gaussian process (GP) regression to large-scale datasets, aggregation models employ factorized training process and then combine predictions from distributed experts. The state-of-the-art aggregation models, however, either provide inconsistent predictions or require time-consuming aggregation process. We first prove the inconsistency of typical aggregations using disjoi… ▽ More In order to scale standard Gaussian process (GP) regression to large-scale datasets, aggregation models employ factorized training process and then combine predictions from distributed experts. The state-of-the-art aggregation models, however, either provide inconsistent predictions or require time-consuming aggregation process. We first prove the inconsistency of typical aggregations using disjoint or random data partition, and then present a consistent yet efficient aggregation model for large-scale GP. The proposed model inherits the advantages of aggregations, e.g., closed-form inference and aggregation, parallelization and distributed computing. Furthermore, theoretical and empirical analyses reveal that the new aggregation model performs better due to the consistent predictions that converge to the true underlying function when the training size approaches infinity. △ Less

Submitted 2 June, 2018; originally announced June 2018.

Comments: paper + supplementary material, appears in Proceedings of ICML 2018

arXiv:1206.6477 [pdf]

Discovering Support and Affiliated Features from Very High Dimensions

Authors: Yiteng Zhai, Mingkui Tan, Ivor Tsang, Yew Soon Ong

Abstract: In this paper, a novel learning paradigm is presented to automatically identify groups of informative and correlated features from very high dimensions. Specifically, we explicitly incorporate correlation measures as constraints and then propose an efficient embedded feature selection method using recently developed cutting plane strategy. The benefits of the proposed algorithm are two-folds. Firs… ▽ More In this paper, a novel learning paradigm is presented to automatically identify groups of informative and correlated features from very high dimensions. Specifically, we explicitly incorporate correlation measures as constraints and then propose an efficient embedded feature selection method using recently developed cutting plane strategy. The benefits of the proposed algorithm are two-folds. First, it can identify the optimal discriminative and uncorrelated feature subset to the output labels, denoted here as Support Features, which brings about significant improvements in prediction performance over other state of the art feature selection methods considered in the paper. Second, during the learning process, the underlying group structures of correlated features associated with each support feature, denoted as Affiliated Features, can also be discovered without any additional cost. These affiliated features serve to improve the interpretations on the learning tasks. Extensive empirical studies on both synthetic and very high dimensional real-world datasets verify the validity and efficiency of the proposed method. △ Less

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

Showing 1–22 of 22 results for author: Ong, Y