Search | arXiv e-print repository

Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Authors: Likun Cao, Rui Pan, James Evans

Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator… ▽ More Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy. △ Less

Submitted 5 June, 2025; originally announced June 2025.

Comments: 107 pages, 20 figures

arXiv:2505.06699 [pdf, other]

Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

Authors: Xiyuan Wei, Ming Lin, Fanjiang Ye, Fengguang Song, Liangliang Cao, My T. Thai, Tianbao Yang

Abstract: This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading… ▽ More This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called $\textbf{DRRho risk minimization}$, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches. △ Less

Submitted 16 May, 2025; v1 submitted 10 May, 2025; originally announced May 2025.

Comments: 18 pages, 6 figures

arXiv:2411.12726 [pdf, other]

LazyDINO: Fast, scalable, and efficiently amortized Bayesian inversion via structure-exploiting and surrogate-driven measure transport

Authors: Lianghao Cao, Joshua Chen, Michael Brennan, Thomas O'Leary-Roseberry, Youssef Marzouk, Omar Ghattas

Abstract: We present LazyDINO, a transport map variational inference method for fast, scalable, and efficiently amortized solutions of high-dimensional nonlinear Bayesian inverse problems with expensive parameter-to-observable (PtO) maps. Our method consists of an offline phase in which we construct a derivative-informed neural surrogate of the PtO map using joint samples of the PtO map and its Jacobian. Du… ▽ More We present LazyDINO, a transport map variational inference method for fast, scalable, and efficiently amortized solutions of high-dimensional nonlinear Bayesian inverse problems with expensive parameter-to-observable (PtO) maps. Our method consists of an offline phase in which we construct a derivative-informed neural surrogate of the PtO map using joint samples of the PtO map and its Jacobian. During the online phase, when given observational data, we seek rapid posterior approximation using surrogate-driven training of a lazy map [Brennan et al., NeurIPS, (2020)], i.e., a structure-exploiting transport map with low-dimensional nonlinearity. The trained lazy map then produces approximate posterior samples or density evaluations. Our surrogate construction is optimized for amortized Bayesian inversion using lazy map variational inference. We show that (i) the derivative-based reduced basis architecture [O'Leary-Roseberry et al., Comput. Methods Appl. Mech. Eng., 388 (2022)] minimizes the upper bound on the expected error in surrogate posterior approximation, and (ii) the derivative-informed training formulation [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] minimizes the expected error due to surrogate-driven transport map optimization. Our numerical results demonstrate that LazyDINO is highly efficient in cost amortization for Bayesian inversion. We observe one to two orders of magnitude reduction of offline cost for accurate posterior approximation, compared to simulation-based amortized inference via conditional transport and conventional surrogate-driven transport. In particular, LazyDINO outperforms Laplace approximation consistently using fewer than 1000 offline samples, while other amortized inference methods struggle and sometimes fail at 16,000 offline samples. △ Less

Submitted 19 November, 2024; originally announced November 2024.

arXiv:2410.05419 [pdf, ps, other]

Refining Counterfactual Explanations With Joint-Distribution-Informed Shapley Towards Actionable Minimality

Authors: Lei You, Yijun Bian, Lele Cao

Abstract: Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanation… ▽ More Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanations that are presented to users and stakeholders. We address this problem by proposing a method that minimizes the required feature changes while maintaining the validity of CE, without imposing restrictions on models or CE algorithms, whether instance- or group-based. The key innovation lies in computing a joint distribution between observed and counterfactual data and leveraging it to inform Shapley values for feature attributions (FA). We demonstrate that optimal transport (OT) effectively derives this distribution, especially when the alignment between observed and counterfactual data is unclear in used CE methods. Additionally, a counterintuitive finding is uncovered: it may be misleading to rely on an exact alignment defined by the CE generation mechanism in conducting FA. Our proposed method is validated on extensive experiments across multiple datasets, showcasing its effectiveness in refining CE towards greater actionable efficiency. △ Less

Submitted 7 October, 2024; originally announced October 2024.

arXiv:2403.08220 [pdf, other]

Derivative-informed neural operator acceleration of geometric MCMC for infinite-dimensional Bayesian inverse problems

Authors: Lianghao Cao, Thomas O'Leary-Roseberry, Omar Ghattas

Abstract: We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional Bayesian inverse problems (BIPs). While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires repeated computations of gradients and Hessians of the log-likelihood, which becomes prohibitive when the parameter-to-observable (P… ▽ More We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional Bayesian inverse problems (BIPs). While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires repeated computations of gradients and Hessians of the log-likelihood, which becomes prohibitive when the parameter-to-observable (PtO) map is defined through expensive-to-solve parametric partial differential equations (PDEs). We consider a delayed-acceptance geometric MCMC method driven by a neural operator surrogate of the PtO map, where the proposal exploits fast surrogate predictions of the log-likelihood and, simultaneously, its gradient and Hessian. To achieve a substantial speedup, the surrogate must accurately approximate the PtO map and its Jacobian, which often demands a prohibitively large number of PtO map samples via conventional operator learning methods. In this work, we present an extension of derivative-informed operator learning [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] that uses joint samples of the PtO map and its Jacobian. This leads to derivative-informed neural operator (DINO) surrogates that accurately predict the observables and posterior local geometry at a significantly lower training cost than conventional methods. Cost and error analysis for reduced basis DINO surrogates are provided. Numerical studies demonstrate that DINO-driven MCMC generates effective posterior samples 3--9 times faster than geometric MCMC and 60--97 times faster than prior geometry-based MCMC. Furthermore, the training cost of DINO surrogates breaks even compared to geometric MCMC after just 10--25 effective posterior samples. △ Less

Submitted 20 May, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

Comments: Updated manuscript: changed title, changed format, typo correction, and minor terminology changes

arXiv:2401.13112 [pdf, other]

Distributional Counterfactual Explanations With Optimal Transport

Authors: Lei You, Lele Cao, Mattias Nilsson, Bo Zhao, Lei Lei

Abstract: Counterfactual explanations (CE) are the de facto method for providing insights into black-box decision-making models by identifying alternative inputs that lead to different outcomes. However, existing CE approaches, including group and global methods, focus predominantly on specific input modifications, lacking the ability to capture nuanced distributional characteristics that influence model ou… ▽ More Counterfactual explanations (CE) are the de facto method for providing insights into black-box decision-making models by identifying alternative inputs that lead to different outcomes. However, existing CE approaches, including group and global methods, focus predominantly on specific input modifications, lacking the ability to capture nuanced distributional characteristics that influence model outcomes across the entire input-output spectrum. This paper proposes distributional counterfactual explanation (DCE), shifting focus to the distributional properties of observed and counterfactual data, thus providing broader insights. DCE is particularly beneficial for stakeholders making strategic decisions based on statistical data analysis, as it makes the statistical distribution of the counterfactual resembles the one of the factual when aligning model outputs with a target distribution\textemdash something that the existing CE methods cannot fully achieve. We leverage optimal transport (OT) to formulate a chance-constrained optimization problem, deriving a counterfactual distribution aligned with its factual counterpart, supported by statistical confidence. The efficacy of this approach is demonstrated through experiments, highlighting its potential to provide deeper insights into decision-making models. △ Less

Submitted 12 March, 2025; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.03341 [pdf, other]

Weakly Augmented Variational Autoencoder in Time Series Anomaly Detection

Authors: Zhangkai Wu, Longbing Cao, Qi Zhang, Junxian Zhou, Hui Chen

Abstract: Due to their unsupervised training and uncertainty estimation, deep Variational Autoencoders (VAEs) have become powerful tools for reconstruction-based Time Series Anomaly Detection (TSAD). Existing VAE-based TSAD methods, either statistical or deep, tune meta-priors to estimate the likelihood probability for effectively capturing spatiotemporal dependencies in the data. However, these methods con… ▽ More Due to their unsupervised training and uncertainty estimation, deep Variational Autoencoders (VAEs) have become powerful tools for reconstruction-based Time Series Anomaly Detection (TSAD). Existing VAE-based TSAD methods, either statistical or deep, tune meta-priors to estimate the likelihood probability for effectively capturing spatiotemporal dependencies in the data. However, these methods confront the challenge of inherent data scarcity, which is often the case in anomaly detection tasks. Such scarcity easily leads to latent holes, discontinuous regions in latent space, resulting in non-robust reconstructions on these discontinuous spaces. We propose a novel generative framework that combines VAEs with self-supervised learning (SSL) to address this issue. △ Less

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2309.13303 [pdf, other]

C$^2$VAE: Gaussian Copula-based VAE Differing Disentangled from Coupled Representations with Contrastive Posterior

Authors: Zhangkai Wu, Longbing Cao

Abstract: We present a self-supervised variational autoencoder (VAE) to jointly learn disentangled and dependent hidden factors and then enhance disentangled representation learning by a self-supervised classifier to eliminate coupled representations in a contrastive manner. To this end, a Contrastive Copula VAE (C$^2$VAE) is introduced without relying on prior knowledge about data in the probabilistic prin… ▽ More We present a self-supervised variational autoencoder (VAE) to jointly learn disentangled and dependent hidden factors and then enhance disentangled representation learning by a self-supervised classifier to eliminate coupled representations in a contrastive manner. To this end, a Contrastive Copula VAE (C$^2$VAE) is introduced without relying on prior knowledge about data in the probabilistic principle and involving strong modeling assumptions on the posterior in the neural architecture. C$^2$VAE simultaneously factorizes the posterior (evidence lower bound, ELBO) with total correlation (TC)-driven decomposition for learning factorized disentangled representations and extracts the dependencies between hidden features by a neural Gaussian copula for copula coupled representations. Then, a self-supervised contrastive classifier differentiates the disentangled representations from the coupled representations, where a contrastive loss regularizes this contrastive classification together with the TC loss for eliminating entangled factors and strengthening disentangled representations. C$^2$VAE demonstrates a strong effect in enhancing disentangled representation learning. C$^2$VAE further contributes to improved optimization addressing the TC-based VAE instability and the trade-off between reconstruction and representation. △ Less

Submitted 23 September, 2023; originally announced September 2023.

arXiv:2306.05398 [pdf, other]

Bayesian model calibration for diblock copolymer thin film self-assembly using power spectrum of microscopy data and machine learning surrogate

Authors: Lianghao Cao, Keyi Wu, J. Tinsley Oden, Peng Chen, Omar Ghattas

Abstract: Identifying parameters of computational models from experimental data, or model calibration, is fundamental for assessing and improving the predictability and reliability of computer simulations. In this work, we propose a method for Bayesian calibration of models that predict morphological patterns of diblock copolymer (Di-BCP) thin film self-assembly while accounting for various sources of uncer… ▽ More Identifying parameters of computational models from experimental data, or model calibration, is fundamental for assessing and improving the predictability and reliability of computer simulations. In this work, we propose a method for Bayesian calibration of models that predict morphological patterns of diblock copolymer (Di-BCP) thin film self-assembly while accounting for various sources of uncertainties in pattern formation and data acquisition. This method extracts the azimuthally-averaged power spectrum (AAPS) of the top-down microscopy characterization of Di-BCP thin film patterns as summary statistics for Bayesian inference of model parameters via the pseudo-marginal method. We derive the analytical and approximate form of a conditional likelihood for the AAPS of image data. We demonstrate that AAPS-based image data reduction retains the mutual information, particularly on important length scales, between image data and model parameters while being relatively agnostic to the aleatoric uncertainties associated with the random long-range disorder of Di-BCP patterns. Additionally, we propose a phase-informed prior distribution for Bayesian model calibration. Furthermore, reducing image data to AAPS enables us to efficiently build surrogate models to accelerate the proposed Bayesian model calibration procedure. We present the formulation and training of two multi-layer perceptrons for approximating the parameter-to-spectrum map, which enables fast integrated likelihood evaluations. We validate the proposed Bayesian model calibration method through numerical examples, for which the neural network surrogate delivers a fivefold reduction of the number of model simulations performed for a single calibration task. △ Less

Submitted 3 August, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Minor changes from the original submission, including a change in the title

arXiv:2210.03008 [pdf, other]

doi 10.1016/j.jcp.2023.112104

Residual-based error correction for neural operator accelerated infinite-dimensional Bayesian inverse problems

Authors: Lianghao Cao, Thomas O'Leary-Roseberry, Prashant K. Jha, J. Tinsley Oden, Omar Ghattas

Abstract: We explore using neural operators, or neural network representations of nonlinear maps between function spaces, to accelerate infinite-dimensional Bayesian inverse problems (BIPs) with models governed by nonlinear parametric partial differential equations (PDEs). Neural operators have gained significant attention in recent years for their ability to approximate the parameter-to-solution maps defin… ▽ More We explore using neural operators, or neural network representations of nonlinear maps between function spaces, to accelerate infinite-dimensional Bayesian inverse problems (BIPs) with models governed by nonlinear parametric partial differential equations (PDEs). Neural operators have gained significant attention in recent years for their ability to approximate the parameter-to-solution maps defined by PDEs using as training data solutions of PDEs at a limited number of parameter samples. The computational cost of BIPs can be drastically reduced if the large number of PDE solves required for posterior characterization are replaced with evaluations of trained neural operators. However, reducing error in the resulting BIP solutions via reducing the approximation error of the neural operators in training can be challenging and unreliable. We provide an a priori error bound result that implies certain BIPs can be ill-conditioned to the approximation error of neural operators, thus leading to inaccessible accuracy requirements in training. To reliably deploy neural operators in BIPs, we consider a strategy for enhancing the performance of neural operators, which is to correct the prediction of a trained neural operator by solving a linear variational problem based on the PDE residual. We show that a trained neural operator with error correction can achieve a quadratic reduction of its approximation error, all while retaining substantial computational speedups of posterior sampling when models are governed by highly nonlinear PDEs. The strategy is applied to two numerical examples of BIPs based on a nonlinear reaction--diffusion problem and deformation of hyperelastic materials. We demonstrate that posterior representations of the two BIPs produced using trained neural operators are greatly and consistently enhanced by error correction. △ Less

Submitted 18 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

arXiv:2206.11343 [pdf, other]

Bayesian model calibration for block copolymer self-assembly: Likelihood-free inference and expected information gain computation via measure transport

Authors: Ricardo Baptista, Lianghao Cao, Joshua Chen, Omar Ghattas, Fengyi Li, Youssef M. Marzouk, J. Tinsley Oden

Abstract: We consider the Bayesian calibration of models describing the phenomenon of block copolymer (BCP) self-assembly using image data produced by microscopy or X-ray scattering techniques. To account for the random long-range disorder in BCP equilibrium structures, we introduce auxiliary variables to represent this aleatory uncertainty. These variables, however, result in an integrated likelihood for h… ▽ More We consider the Bayesian calibration of models describing the phenomenon of block copolymer (BCP) self-assembly using image data produced by microscopy or X-ray scattering techniques. To account for the random long-range disorder in BCP equilibrium structures, we introduce auxiliary variables to represent this aleatory uncertainty. These variables, however, result in an integrated likelihood for high-dimensional image data that is generally intractable to evaluate. We tackle this challenging Bayesian inference problem using a likelihood-free approach based on measure transport together with the construction of summary statistics for the image data. We also show that expected information gains (EIGs) from the observed data about the model parameters can be computed with no significant additional cost. Lastly, we present a numerical case study based on the Ohta--Kawasaki model for diblock copolymer thin film self-assembly and top-down microscopy characterization. For calibration, we introduce several domain-specific energy- and Fourier-based summary statistics, and quantify their informativeness using EIG. We demonstrate the power of the proposed approach to study the effect of data corruptions and experimental designs on the calibration results. △ Less

Submitted 22 June, 2022; originally announced June 2022.

arXiv:2108.10029 [pdf, other]

Modeling time evolving COVID-19 uncertainties with density dependent asymptomatic infections and social reinforcement

Authors: Qing Liu, Longbing Cao

Abstract: The COVID-19 pandemic has posed significant challenges in modeling its complex epidemic transmissions, infection and contagion, which are very different from known epidemics. The challenges in quantifying COVID-19 complexities include effectively modeling its process and data uncertainties. The uncertainties are embedded in implicit and high-proportional undocumented infections, asymptomatic conta… ▽ More The COVID-19 pandemic has posed significant challenges in modeling its complex epidemic transmissions, infection and contagion, which are very different from known epidemics. The challenges in quantifying COVID-19 complexities include effectively modeling its process and data uncertainties. The uncertainties are embedded in implicit and high-proportional undocumented infections, asymptomatic contagion, social reinforcement of infections, and various quality issues in the reported data. These uncertainties become even more apparent in the first two months of the COVID-19 pandemic, when the relevant knowledge, case reporting and testing were all limited. Here we introduce a novel hybrid approach Susceptible-Undocumented infected-Documented infected-Recovered (SUDR) model. First, SUDR (1) characterizes and distinguishes Undocumented (U) and Documented (D) infections commonly seen during COVID-19 incubation periods and asymptomatic infections. Second, SUDR characterizes the probabilistic density of infections by capturing exogenous processes. Lastly, SUDR approximates the density likelihood of COVID-19 prevalence over time by incorporating Bayesian inference into SUDR. Different from existing COVID-19 models, SUDR characterizes the undocumented infections during unknown transmission processes. To capture the uncertainties of temporal transmission and social reinforcement during COVID-19 contagion, the transmission rate is modeled by a time-varying density function of undocumented infectious cases. By sampling from the mean-field posterior distribution with reasonable priors, SUDR handles the randomness, noise and sparsity of COVID-19 observations widely seen in the public COVID-19 case data. The results demonstrate a deeper quantitative understanding of the above uncertainties, in comparison with classic SIR, time-dependent SIR, and probabilistic SIR models. △ Less

Submitted 5 April, 2022; v1 submitted 23 August, 2021; originally announced August 2021.

arXiv:2103.11516 [pdf, other]

Homophily Outlier Detection in Non-IID Categorical Data

Authors: Guansong Pang, Longbing Cao, Ling Chen

Abstract: Most of existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distribution… ▽ More Most of existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to the failure of detecting important outliers that are too subtle to be identified without considering the non-IID nature. The issue is even intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to well capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexities show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection of two different existing detectors. △ Less

Submitted 21 March, 2021; originally announced March 2021.

Comments: To appear in Data Ming and Knowledge Discovery Journal

arXiv:2009.06847 [pdf, other]

doi 10.1145/3447548.3467417

Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data

Authors: Guansong Pang, Anton van den Hengel, Chunhua Shen, Longbing Cao

Abstract: We consider the problem of anomaly detection with a small set of partially labeled anomaly examples and a large-scale unlabeled dataset. This is a common scenario in many important applications. Existing related methods either exclusively fit the limited anomaly examples that typically do not span the entire set of anomalies, or proceed with unsupervised learning from the unlabeled data. We propos… ▽ More We consider the problem of anomaly detection with a small set of partially labeled anomaly examples and a large-scale unlabeled dataset. This is a common scenario in many important applications. Existing related methods either exclusively fit the limited anomaly examples that typically do not span the entire set of anomalies, or proceed with unsupervised learning from the unlabeled data. We propose here instead a deep reinforcement learning-based approach that enables an end-to-end optimization of the detection of both labeled and unlabeled anomalies. This approach learns the known abnormality by automatically interacting with an anomaly-biased simulation environment, while continuously extending the learned abnormality to novel classes of anomaly (i.e., unknown anomalies) by actively exploring possible anomalies in the unlabeled data. This is achieved by jointly optimizing the exploitation of the small labeled anomaly data and the exploration of the rare unlabeled anomalies. Extensive experiments on 48 real-world datasets show that our model significantly outperforms five state-of-the-art competing methods. △ Less

Submitted 10 June, 2021; v1 submitted 14 September, 2020; originally announced September 2020.

Comments: Accepted to KDD 2021

arXiv:2007.10720 [pdf, ps, other]

doi 10.1109/TPAMI.2020.3010953

Unsupervised Heterogeneous Coupling Learning for Categorical Representation

Authors: Chengzhang Zhu, Longbing Cao, Jianping Yin

Abstract: Complex categorical data is often hierarchically coupled with heterogeneous relationships between attributes and attribute values and the couplings between objects. Such value-to-object couplings are heterogeneous with complementary and inconsistent interactions and distributions. Limited research exists on unlabeled categorical data representations, ignores the heterogeneous and hierarchical coup… ▽ More Complex categorical data is often hierarchically coupled with heterogeneous relationships between attributes and attribute values and the couplings between objects. Such value-to-object couplings are heterogeneous with complementary and inconsistent interactions and distributions. Limited research exists on unlabeled categorical data representations, ignores the heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information, etc. The deep representation learning of unlabeled categorical data is challenging, overseeing such value-to-object couplings, complementarity and inconsistency, and requiring large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing heterogeneous distributions embedded in each type of couplings. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively represent heterogeneous couplings and disclose their roles in categorical data. The UNTIE-learned representations make significant performance improvement against the state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics. △ Less

Submitted 21 July, 2020; originally announced July 2020.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:2007.02500 [pdf, other]

doi 10.1145/3439950

Deep Learning for Anomaly Detection: A Review

Authors: Guansong Pang, Chunhua Shen, Longbing Cao, Anton van den Hengel

Abstract: Anomaly detection, a.k.a. outlier detection or novelty detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This paper sur… ▽ More Anomaly detection, a.k.a. outlier detection or novelty detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This paper surveys the research of deep anomaly detection with a comprehensive taxonomy, covering advancements in three high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages and disadvantages, and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges. △ Less

Submitted 4 December, 2020; v1 submitted 5 July, 2020; originally announced July 2020.

Comments: Survey paper, 36 pages, 180 references, 2 figures, 4 tables

Journal ref: ACM Computing Surveys, 2020

arXiv:2005.08047 [pdf, other]

Simple, Scalable, and Stable Variational Deep Clustering

Authors: Lele Cao, Sahar Asadi, Wenfei Zhu, Christian Schmidli, Michael Sjöberg

Abstract: Deep clustering (DC) has become the state-of-the-art for unsupervised clustering. In principle, DC represents a variety of unsupervised methods that jointly learn the underlying clusters and the latent representation directly from unstructured datasets. However, DC methods are generally poorly applied due to high operational costs, low scalability, and unstable results. In this paper, we first eva… ▽ More Deep clustering (DC) has become the state-of-the-art for unsupervised clustering. In principle, DC represents a variety of unsupervised methods that jointly learn the underlying clusters and the latent representation directly from unstructured datasets. However, DC methods are generally poorly applied due to high operational costs, low scalability, and unstable results. In this paper, we first evaluate several popular DC variants in the context of industrial applicability using eight empirical criteria. We then choose to focus on variational deep clustering (VDC) methods, since they mostly meet those criteria except for simplicity, scalability, and stability. To address these three unmet criteria, we introduce four generic algorithmic improvements: initial $γ$-training, periodic $β$-annealing, mini-batch GMM (Gaussian mixture model) initialization, and inverse min-max transform. We also propose a novel clustering algorithm S3VDC (simple, scalable, and stable VDC) that incorporates all those improvements. Our experiments show that S3VDC outperforms the state-of-the-art on both benchmark tasks and a large unstructured industrial dataset without any ground truth label. In addition, we analytically evaluate the usability and interpretability of S3VDC. △ Less

Submitted 21 May, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: 17 pages, 5 figures, source code: https://github.com/king/s3vdc

arXiv:1907.00526

FiDi-RL: Incorporating Deep Reinforcement Learning with Finite-Difference Policy Search for Efficient Learning of Continuous Control

Authors: Longxiang Shi, Shijian Li, Longbing Cao, Long Yang, Gang Zheng, Gang Pan

Abstract: In recent years significant progress has been made in dealing with challenging problems using reinforcement learning.Despite its great success, reinforcement learning still faces challenge in continuous control tasks. Conventional methods always compute the derivatives of the optimal goal with a costly computation resources, and are inefficient, unstable and lack of robust-ness when dealing with s… ▽ More In recent years significant progress has been made in dealing with challenging problems using reinforcement learning.Despite its great success, reinforcement learning still faces challenge in continuous control tasks. Conventional methods always compute the derivatives of the optimal goal with a costly computation resources, and are inefficient, unstable and lack of robust-ness when dealing with such tasks. Alternatively, derivative-based methods treat the optimization process as a blackbox and show robustness and stability in learning continuous control tasks, but not data efficient in learning. The combination of both methods so as to get the best of the both has raised attention. However, most of the existing combination works adopt complex neural networks (NNs) as the policy for control. The double-edged sword of deep NNs can yield better performance, but also makes it difficult for parameter tuning and computation. To this end, in this paper we presents a novel method called FiDi-RL, which incorporates deep RL with Finite-Difference (FiDi) policy search.FiDi-RL combines Deep Deterministic Policy Gradients (DDPG)with Augment Random Search (ARS) and aims at improving the data efficiency of ARS. The empirical results show that FiDi-RL can improves the performance and stability of ARS, and provide competitive results against some existing deep reinforcement learning methods △ Less

Submitted 31 January, 2020; v1 submitted 30 June, 2019; originally announced July 2019.

Comments: I found some theoretical errors

arXiv:1905.13043 [pdf, other]

Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization

Authors: Albert S Berahas, Liyuan Cao, Krzysztof Choromanski, Katya Scheinberg

Abstract: In this paper, we consider derivative free optimization problems, where the objective function is smooth but is computed with some amount of noise, the function evaluations are expensive and no derivative information is available. We are motivated by policy optimization problems in reinforcement learning that have recently become popular [Choromaski et al. 2018; Fazel et al. 2018; Salimans et al.… ▽ More In this paper, we consider derivative free optimization problems, where the objective function is smooth but is computed with some amount of noise, the function evaluations are expensive and no derivative information is available. We are motivated by policy optimization problems in reinforcement learning that have recently become popular [Choromaski et al. 2018; Fazel et al. 2018; Salimans et al. 2016], and that can be formulated as derivative free optimization problems with the aforementioned characteristics. In each of these works some approximation of the gradient is constructed and a (stochastic) gradient method is applied. In [Salimans et al. 2016] the gradient information is aggregated along Gaussian directions, while in [Choromaski et al. 2018] it is computed along orthogonal direction. We provide a convergence rate analysis for a first-order line search method, similar to the ones used in the literature, and derive the conditions on the gradient approximations that ensure this convergence. We then demonstrate via rigorous analysis of the variance and by numerical comparisons on reinforcement learning tasks that the Gaussian sampling method used in [Salimans et al. 2016] is significantly inferior to the orthogonal sampling used in [Choromaski et al. 2018] as well as more general interpolation methods. △ Less

Submitted 2 June, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: 14 pages, 2 figures. arXiv admin note: text overlap with arXiv:1905.01332

arXiv:1905.10594 [pdf]

Multi-view Information-theoretic Co-clustering for Co-occurrence Data

Authors: Peng Xu, Zhaohong Deng, Kup-Sze Choi, Longbing Cao, Shitong Wang

Abstract: Multi-view clustering has received much attention recently. Most of the existing multi-view clustering methods only focus on one-sided clustering. As the co-occurring data elements involve the counts of sample-feature co-occurrences, it is more efficient to conduct two-sided clustering along the samples and features simultaneously. To take advantage of two-sided clustering for the co-occurrences i… ▽ More Multi-view clustering has received much attention recently. Most of the existing multi-view clustering methods only focus on one-sided clustering. As the co-occurring data elements involve the counts of sample-feature co-occurrences, it is more efficient to conduct two-sided clustering along the samples and features simultaneously. To take advantage of two-sided clustering for the co-occurrences in the scene of multi-view clustering, a two-sided multi-view clustering method is proposed, i.e., multi-view information-theoretic co-clustering (MV-ITCC). The proposed method realizes two-sided clustering for co-occurring multi-view data under the formulation of information theory. More specifically, it exploits the agreement and disagreement among views by sharing a common clustering results along the sample dimension and keeping the clustering results of each view specific along the feature dimension. In addition, the mechanism of maximum entropy is also adopted to control the importance of different views, which can give a right balance in leveraging the agreement and disagreement. Extensive experiments are conducted on text and image multi-view datasets. The results clearly demonstrate the superiority of the proposed method. △ Less

Submitted 25 May, 2019; originally announced May 2019.

Journal ref: AAAI 2019

arXiv:1905.07237 [pdf, other]

TBQ($σ$): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Authors: Longxiang Shi, Shijian Li, Longbing Cao, Long Yang, Gang Pan

Abstract: Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between target policy and behavior policy. One common approach is to measure the difference between two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing… ▽ More Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between target policy and behavior policy. One common approach is to measure the difference between two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q($λ$) and Naive Q($λ$) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ($σ$), which effectively unifies the tree-backup algorithm and Naive Q($λ$). By introducing a new parameter $σ$ to illustrate the \emph{degree} of utilizing traces, TBQ($σ$) creates an effective integration of TB($λ$) and Naive Q($λ$) and continuous role shift between them. The contraction property of TB($σ$) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ($σ$) and give the convergence proof. We empirically show that, for $ε\in(0,1]$ in $ε$-greedy policies, there exists some degree of utilizing traces for $λ\in[0,1]$, which can improve the efficiency in trace utilization for off-policy reinforcement learning, to both accelerate the learning process and improve the performance. △ Less

Submitted 17 May, 2019; originally announced May 2019.

Comments: 8 pages

MSC Class: 68Wxx

arXiv:1902.07903 [pdf, other]

Learning Deterministic Policy with Target for Power Control in Wireless Networks

Authors: Yujiao Lu, Hancheng Lu, Liangliang Cao, Feng Wu, Daren Zhu

Abstract: Inter-Cell Interference Coordination (ICIC) is a promising way to improve energy efficiency in wireless networks, especially where small base stations are densely deployed. However, traditional optimization based ICIC schemes suffer from severe performance degradation with complex interference pattern. To address this issue, we propose a Deep Reinforcement Learning with Deterministic Policy and Ta… ▽ More Inter-Cell Interference Coordination (ICIC) is a promising way to improve energy efficiency in wireless networks, especially where small base stations are densely deployed. However, traditional optimization based ICIC schemes suffer from severe performance degradation with complex interference pattern. To address this issue, we propose a Deep Reinforcement Learning with Deterministic Policy and Target (DRL-DPT) framework for ICIC in wireless networks. DRL-DPT overcomes the main obstacles in applying reinforcement learning and deep learning in wireless networks, i.e. continuous state space, continuous action space and convergence. Firstly, a Deep Neural Network (DNN) is involved as the actor to obtain deterministic power control actions in continuous space. Then, to guarantee the convergence, an online training process is presented, which makes use of a dedicated reward function as the target rule and a policy gradient descent algorithm to adjust DNN weights. Experimental results show that the proposed DRL-DPT framework consistently outperforms existing schemes in terms of energy efficiency and throughput under different wireless interference scenarios. More specifically, it improves up to 15% of energy efficiency with faster convergence rate. △ Less

Submitted 21 February, 2019; originally announced February 2019.

Comments: 7 pages, 7 figures, GlobeCom2018

arXiv:1806.04808 [pdf, other]

doi 10.1145/3219819.3220042

Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection

Authors: Guansong Pang, Longbing Cao, Ling Chen, Huan Liu

Abstract: Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequen… ▽ More Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement. △ Less

Submitted 12 June, 2018; originally announced June 2018.

Comments: 10 pages, 4 figures, 3 tables. To appear in the proceedings of KDD18, Long presentation (oral)

arXiv:1802.00324 [pdf]

One-class Collective Anomaly Detection based on Long Short-Term Memory Recurrent Neural Networks

Authors: Nga Nguyen Thi, Van Loi Cao, Nhien-An Le-Khac

Abstract: Intrusion detection for computer network systems has been becoming one of the most critical tasks for network administrators today. It has an important role for organizations, governments and our society due to the valuable resources hosted on computer networks. Traditional misuse detection strategies are unable to detect new and unknown intrusion types. In contrast, anomaly detection in network s… ▽ More Intrusion detection for computer network systems has been becoming one of the most critical tasks for network administrators today. It has an important role for organizations, governments and our society due to the valuable resources hosted on computer networks. Traditional misuse detection strategies are unable to detect new and unknown intrusion types. In contrast, anomaly detection in network security aims to distinguish between illegal or malicious events and normal behavior of network systems. Anomaly detection can be considered as a classification problem where it builds models of normal network behavior, of which it uses to detect new patterns that significantly deviate from the model. Most of the current approaches on anomaly detection is based on the learning of normal behavior and anomalous actions. They do not include memory that is they do not take into account previous events classify new ones. In this paper, we propose a one class collective anomaly detection model based on neural network learning. Normally a Long Short Term Memory Recurrent Neural Network (LSTM RNN) is trained only on normal data, and it is capable of predicting several time steps ahead of an input. In our approach, a LSTM RNN is trained on normal time series data before performing a prediction for each time step. Instead of considering each time-step separately, the observation of prediction errors from a certain number of time-steps is now proposed as a new idea for detecting collective anomalies. The prediction errors of a certain number of the latest time-steps above a threshold will indicate a collective anomaly. The model is evaluated on a time series version of the KDD 1999 dataset. The experiments demonstrate that the proposed model is capable to detect collective anomaly efficiently △ Less

Submitted 31 January, 2018; originally announced February 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1703.09752

arXiv:1703.02391 [pdf, other]

Learning from Noisy Labels with Distillation

Authors: Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Li-Jia Li

Abstract: The ability of learning from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels are relatively easy to obtain. Traditionally, the label noises have been treated as statistical outliers, and approaches such as importance re-weighting and bootstrap have been proposed to alleviate the problem. According to our observation, the real-world noisy lab… ▽ More The ability of learning from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels are relatively easy to obtain. Traditionally, the label noises have been treated as statistical outliers, and approaches such as importance re-weighting and bootstrap have been proposed to alleviate the problem. According to our observation, the real-world noisy labels exhibit multi-mode characteristics as the true labels, rather than behaving like independent random outliers. In this work, we propose a unified distillation framework to use side information, including a small clean dataset and label relations in knowledge graph, to "hedge the risk" of learning from noisy labels. Furthermore, unlike the traditional approaches evaluated based on simulated label noises, we propose a suite of new benchmark datasets, in Sports, Species and Artifacts domains, to evaluate the task of learning from noisy labels in the practical setting. The empirical study demonstrates the effectiveness of our proposed method in all the domains. △ Less

Submitted 7 April, 2017; v1 submitted 7 March, 2017; originally announced March 2017.

arXiv:1604.05198 [pdf, ps, other]

Locally Imposing Function for Generalized Constraint Neural Networks - A Study on Equality Constraints

Authors: Linlin Cao, Ran He, Bao-Gang Hu

Abstract: This work is a further study on the Generalized Constraint Neural Network (GCNN) model [1], [2]. Two challenges are encountered in the study, that is, to embed any type of prior information and to select its imposing schemes. The work focuses on the second challenge and studies a new constraint imposing scheme for equality constraints. A new method called locally imposing function (LIF) is propose… ▽ More This work is a further study on the Generalized Constraint Neural Network (GCNN) model [1], [2]. Two challenges are encountered in the study, that is, to embed any type of prior information and to select its imposing schemes. The work focuses on the second challenge and studies a new constraint imposing scheme for equality constraints. A new method called locally imposing function (LIF) is proposed to provide a local correction to the GCNN prediction function, which therefore falls within Locally Imposing Scheme (LIS). In comparison, the conventional Lagrange multiplier method is considered as Globally Imposing Scheme (GIS) because its added constraint term exhibits a global impact to its objective function. Two advantages are gained from LIS over GIS. First, LIS enables constraints to fire locally and explicitly in the domain only where they need on the prediction function. Second, constraints can be implemented within a network setting directly. We attempt to interpret several constraint methods graphically from a viewpoint of the locality principle. Numerical examples confirm the advantages of the proposed method. In solving boundary value problems with Dirichlet and Neumann constraints, the GCNN model with LIF is possible to achieve an exact satisfaction of the constraints. △ Less

Submitted 18 April, 2016; originally announced April 2016.

Comments: 8 pages, 7 figures

arXiv:1310.1545 [pdf, ps, other]

Learning Hidden Structures with Relational Models by Adequately Involving Rich Information in A Network

Authors: Xuhui Fan, Richard Yi Da Xu, Longbing Cao, Yin Song

Abstract: Effectively modelling hidden structures in a network is very practical but theoretically challenging. Existing relational models only involve very limited information, namely the binary directional link data, embedded in a network to learn hidden networking structures. There is other rich and meaningful information (e.g., various attributes of entities and more granular information than binary ele… ▽ More Effectively modelling hidden structures in a network is very practical but theoretically challenging. Existing relational models only involve very limited information, namely the binary directional link data, embedded in a network to learn hidden networking structures. There is other rich and meaningful information (e.g., various attributes of entities and more granular information than binary elements such as "like" or "dislike") missed, which play a critical role in forming and understanding relations in a network. In this work, we propose an informative relational model (InfRM) framework to adequately involve rich information and its granularity in a network, including metadata information about each entity and various forms of link data. Firstly, an effective metadata information incorporation method is employed on the prior information from relational models MMSB and LFRM. This is to encourage the entities with similar metadata information to have similar hidden structures. Secondly, we propose various solutions to cater for alternative forms of link data. Substantial efforts have been made towards modelling appropriateness and efficiency, for example, using conjugate priors. We evaluate our framework and its inference algorithms in different datasets, which shows the generality and effectiveness of our models in capturing implicit structures in networks. △ Less

Submitted 6 October, 2013; originally announced October 2013.

arXiv:1306.3003 [pdf, ps, other]

Non-parametric Power-law Data Clustering

Authors: Xuhui Fan, Yiling Zeng, Longbing Cao

Abstract: It has always been a great challenge for clustering algorithms to automatically determine the cluster numbers according to the distribution of datasets. Several approaches have been proposed to address this issue, including the recent promising work which incorporate Bayesian Nonparametrics into the $k$-means clustering procedure. This approach shows simplicity in implementation and solidity in th… ▽ More It has always been a great challenge for clustering algorithms to automatically determine the cluster numbers according to the distribution of datasets. Several approaches have been proposed to address this issue, including the recent promising work which incorporate Bayesian Nonparametrics into the $k$-means clustering procedure. This approach shows simplicity in implementation and solidity in theory, while it also provides a feasible way to inference in large scale datasets. However, several problems remains unsolved in this pioneering work, including the power-law data applicability, mechanism to merge centers to avoid the over-fitting problem, clustering order problem, e.t.c.. To address these issues, the Pitman-Yor Process based k-means (namely \emph{pyp-means}) is proposed in this paper. Taking advantage of the Pitman-Yor Process, \emph{pyp-means} treats clusters differently by dynamically and adaptively changing the threshold to guarantee the generation of power-law clustering results. Also, one center agglomeration procedure is integrated into the implementation to be able to merge small but close clusters and then adaptively determine the cluster number. With more discussion on the clustering order, the convergence proof, complexity analysis and extension to spectral clustering, our approach is compared with traditional clustering algorithm and variational inference methods. The advantages and properties of pyp-means are validated by experiments on both synthetic datasets and real world datasets. △ Less

Submitted 12 June, 2013; originally announced June 2013.

arXiv:1306.3002 [pdf, ps, other]

A Convergence Theorem for the Graph Shift-type Algorithms

Authors: Xuhui Fan, Longbing Cao

Abstract: Graph Shift (GS) algorithms are recently focused as a promising approach for discovering dense subgraphs in noisy data. However, there are no theoretical foundations for proving the convergence of the GS Algorithm. In this paper, we propose a generic theoretical framework consisting of three key GS components: simplex of generated sequence set, monotonic and continuous objective function and close… ▽ More Graph Shift (GS) algorithms are recently focused as a promising approach for discovering dense subgraphs in noisy data. However, there are no theoretical foundations for proving the convergence of the GS Algorithm. In this paper, we propose a generic theoretical framework consisting of three key GS components: simplex of generated sequence set, monotonic and continuous objective function and closed mapping. We prove that GS algorithms with such components can be transformed to fit the Zangwill's convergence theorem, and the sequence set generated by the GS procedures always terminates at a local maximum, or at worst, contains a subsequence which converges to a local maximum of the similarity measure function. The framework is verified by expanding it to other GS-type algorithms and experimental results. △ Less

Submitted 12 June, 2013; originally announced June 2013.

arXiv:1306.2999 [pdf, ps, other]

Dynamic Infinite Mixed-Membership Stochastic Blockmodel

Authors: Xuhui Fan, Longbing Cao, Richard Yi Da Xu

Abstract: Directional and pairwise measurements are often used to model inter-relationships in a social network setting. The Mixed-Membership Stochastic Blockmodel (MMSB) was a seminal work in this area, and many of its capabilities were extended since then. In this paper, we propose the \emph{Dynamic Infinite Mixed-Membership stochastic blockModel (DIM3)}, a generalised framework that extends the existing… ▽ More Directional and pairwise measurements are often used to model inter-relationships in a social network setting. The Mixed-Membership Stochastic Blockmodel (MMSB) was a seminal work in this area, and many of its capabilities were extended since then. In this paper, we propose the \emph{Dynamic Infinite Mixed-Membership stochastic blockModel (DIM3)}, a generalised framework that extends the existing work to a potentially infinite number of communities and mixture memberships for each of the network's nodes. This model is in a dynamic setting, where additional model parameters are introduced to reflect the degree of persistence between one's memberships at consecutive times. Accordingly, two effective posterior sampling strategies and their results are presented using both synthetic and real data. △ Less

Submitted 12 June, 2013; originally announced June 2013.

arXiv:1306.2733 [pdf, ps, other]

Copula Mixed-Membership Stochastic Blockmodel for Intra-Subgroup Correlations

Authors: Xuhui Fan, Longbing Cao, Richard Yi Da Xu

Abstract: The \emph{Mixed-Membership Stochastic Blockmodel (MMSB)} is a popular framework for modeling social network relationships. It can fully exploit each individual node's participation (or membership) in a social structure. Despite its powerful representations, this model makes an assumption that the distributions of relational membership indicators between two nodes are independent. Under many social… ▽ More The \emph{Mixed-Membership Stochastic Blockmodel (MMSB)} is a popular framework for modeling social network relationships. It can fully exploit each individual node's participation (or membership) in a social structure. Despite its powerful representations, this model makes an assumption that the distributions of relational membership indicators between two nodes are independent. Under many social network settings, however, it is possible that certain known subgroups of people may have high or low correlations in terms of their membership categories towards each other, and such prior information should be incorporated into the model. To this end, we introduce a \emph{Copula Mixed-Membership Stochastic Blockmodel (cMMSB)} where an individual Copula function is employed to jointly model the membership pairs of those nodes within the subgroup of interest. The model enables the use of various Copula functions to suit the scenario, while maintaining the membership's marginal distribution, as needed, for modeling membership indicators with other nodes outside of the subgroup of interest. We describe the proposed model and its inference algorithm in detail for both the finite and infinite cases. In the experiment section, we compare our algorithms with other popular models in terms of link prediction, using both synthetic and real world data. △ Less

Submitted 6 October, 2013; v1 submitted 12 June, 2013; originally announced June 2013.

arXiv:1305.5734 [pdf, ps, other]

Characterizing A Database of Sequential Behaviors with Latent Dirichlet Hidden Markov Models

Authors: Yin Song, Longbing Cao, Xuhui Fan, Wei Cao, Jian Zhang

Abstract: This paper proposes a generative model, the latent Dirichlet hidden Markov models (LDHMM), for characterizing a database of sequential behaviors (sequences). LDHMMs posit that each sequence is generated by an underlying Markov chain process, which are controlled by the corresponding parameters (i.e., the initial state vector, transition matrix and the emission matrix). These sequence-level latent… ▽ More This paper proposes a generative model, the latent Dirichlet hidden Markov models (LDHMM), for characterizing a database of sequential behaviors (sequences). LDHMMs posit that each sequence is generated by an underlying Markov chain process, which are controlled by the corresponding parameters (i.e., the initial state vector, transition matrix and the emission matrix). These sequence-level latent parameters for each sequence are modeled as latent Dirichlet random variables and parameterized by a set of deterministic database-level hyper-parameters. Through this way, we expect to model the sequence in two levels: the database level by deterministic hyper-parameters and the sequence-level by latent parameters. To learn the deterministic hyper-parameters and approximate posteriors of parameters in LDHMMs, we propose an iterative algorithm under the variational EM framework, which consists of E and M steps. We examine two different schemes, the fully-factorized and partially-factorized forms, for the framework, based on different assumptions. We present empirical results of behavior modeling and sequence classification on three real-world data sets, and compare them to other related models. The experimental results prove that the proposed LDHMMs produce better generalization performance in terms of log-likelihood and deliver competitive results on the sequence classification problem. △ Less

Submitted 24 May, 2013; originally announced May 2013.

ACM Class: H.2.8; F.1.2

arXiv:1211.2400

A shrinkage estimation for large dimensional precision matrices using random matrix theory

Authors: Cheng Wang, Guangming Pan, Longbing Cao

Abstract: In this paper, a new ridge-type shrinkage estimator for the precision matrix has been proposed. The asymptotic optimal shrinkage coefficients and the theoretical loss were derived. Data-driven estimators for the shrinkage coefficients were also conducted based on the asymptotic results deriving from random matrix theories. The new estimator which has a simple explicit formula is distribution-fre… ▽ More In this paper, a new ridge-type shrinkage estimator for the precision matrix has been proposed. The asymptotic optimal shrinkage coefficients and the theoretical loss were derived. Data-driven estimators for the shrinkage coefficients were also conducted based on the asymptotic results deriving from random matrix theories. The new estimator which has a simple explicit formula is distribution-free and applicable to situation where the dimension of observation is greater than the sample size. Further, no assumptions are required on the structure of the population covariance matrix or the precision matrix. Finally, numerical studies are conducted to examine the performances of the new estimator and existing methods for a wide range of settings. △ Less

Submitted 2 September, 2019; v1 submitted 11 November, 2012; originally announced November 2012.

Comments: This paper has been withdrawn by the author due to substantial contents will be updated

arXiv:1211.1456 [pdf, ps, other]

doi 10.1016/j.jmva.2013.12.012

Non-parametric shrinkage mean estimation for quadratic loss functions with unknown covariance matrices

Authors: Cheng Wang, Tiejun Tong, Longbing Cao, Baiqi Miao

Abstract: In this paper, a shrinkage estimator for the population mean is proposed under known quadratic loss functions with unknown covariance matrices. The new estimator is non-parametric in the sense that it does not assume a specific parametric distribution for the data and it does not require the prior information on the population covariance matrix. Analytical results on the improvement of the propose… ▽ More In this paper, a shrinkage estimator for the population mean is proposed under known quadratic loss functions with unknown covariance matrices. The new estimator is non-parametric in the sense that it does not assume a specific parametric distribution for the data and it does not require the prior information on the population covariance matrix. Analytical results on the improvement of the proposed shrinkage estimator are provided and some corresponding asymptotic properties are also derived. Finally, we demonstrate the practical improvement of the proposed method over existing methods through extensive simulation studies and real data analysis. Keywords: High-dimensional data; Shrinkage estimator; Large $p$ small $n$; $U$-statistic. △ Less

Submitted 6 November, 2014; v1 submitted 7 November, 2012; originally announced November 2012.

Comments: Some technical parts of Theorem 3.1 and 3.2 were corrected in this version

Journal ref: Journal of Multivariate Analysis, 125, 222-232, 2014

arXiv:1206.1660 [pdf, ps, other]

doi 10.1016/j.csda.2013.04.003

Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data

Authors: Cheng Wang, Longbing Cao, Baiqi Miao

Abstract: This work studies the theoretical rules of feature selection in linear discriminant analysis (LDA), and a new feature selection method is proposed for sparse linear discriminant analysis. An $l_1$ minimization method is used to select the important features from which the LDA will be constructed. The asymptotic results of this proposed two-stage LDA (TLDA) are studied, demonstrating that TLDA is a… ▽ More This work studies the theoretical rules of feature selection in linear discriminant analysis (LDA), and a new feature selection method is proposed for sparse linear discriminant analysis. An $l_1$ minimization method is used to select the important features from which the LDA will be constructed. The asymptotic results of this proposed two-stage LDA (TLDA) are studied, demonstrating that TLDA is an optimal classification rule whose convergence rate is the best compared to existing methods. The experiments on simulated and real datasets are consistent with the theoretical results and show that TLDA performs favorably in comparison with current methods. Overall, TLDA uses a lower minimum number of features or genes than other approaches to achieve a better result with a reduced misclassification rate. △ Less

Submitted 22 April, 2013; v1 submitted 8 June, 2012; originally announced June 2012.

Comments: 20 pages, 3 figures, 5 tables, accepted by Computational Statistics and Data Analysis

arXiv:1203.3278 [pdf, ps, other]

doi 10.1016/j.jmva.2013.03.015

On Identity Tests for High Dimensional Data Using RMT

Authors: Cheng Wang, Jing Yang, Baiqi Miao, Longbing Cao

Abstract: In this work, we redefined two important statistics, the CLRT test (Bai et.al., Ann. Stat. 37 (2009) 3822-3840) and the LW test (Ledoit and Wolf, Ann. Stat. 30 (2002) 1081-1102) on identity tests for high dimensional data using random matrix theories. Compared with existing CLRT and LW tests, the new tests can accommodate data which has unknown means and non-Gaussian distributions. Simulations dem… ▽ More In this work, we redefined two important statistics, the CLRT test (Bai et.al., Ann. Stat. 37 (2009) 3822-3840) and the LW test (Ledoit and Wolf, Ann. Stat. 30 (2002) 1081-1102) on identity tests for high dimensional data using random matrix theories. Compared with existing CLRT and LW tests, the new tests can accommodate data which has unknown means and non-Gaussian distributions. Simulations demonstrate that the new tests have good properties in terms of size and power. What is more, even for Gaussian data, our new tests perform favorably in comparison to existing tests. Finally, we find the CLRT is more sensitive to eigenvalues less than 1 while the LW test has more advantages in relation to detecting eigenvalues larger than 1. △ Less

Submitted 11 April, 2013; v1 submitted 15 March, 2012; originally announced March 2012.

Comments: 16 pages, 2 figures, 3 tables, To be published in the Journal of Multivariate Analysis

MSC Class: 62H15; 62H10

Showing 1–36 of 36 results for author: Cao, L