Search | arXiv e-print repository

Bi-directional Curriculum Learning for Graph Anomaly Detection: Dual Focus on Homogeneity and Heterogeneity

Authors: Yitong Hao, Enbo He, Yue Zhang, Guisheng Yin

Abstract: Graph anomaly detection (GAD) aims to identify nodes from a graph that are significantly different from normal patterns. Most previous studies are model-driven, focusing on enhancing the detection effect by improving the model structure. However, these approaches often treat all nodes equally, neglecting the different contributions of various nodes to the training. Therefore, we introduce graph curriculum learning as a simple and effective plug-and-play module to optimize GAD methods. The existing graph curriculum learning mainly focuses on the homogeneity of graphs and treats nodes with high homogeneity as easy nodes. In fact, GAD models can handle not only graph homogeneity but also heterogeneity, which leads to the unsuitability of these existing methods. To address this problem, we propose an innovative Bi-directional Curriculum Learning strategy (BCL), which considers nodes with higher and lower similarity to neighbor nodes as simple nodes in the direction of focusing on homogeneity and focusing on heterogeneity, respectively, and prioritizes their training. Extensive experiments show that BCL can be quickly integrated into existing detection processes and significantly improves the performance of ten GAD models on seven commonly used datasets.

Submitted 23 January, 2025; originally announced January 2025.

Comments: 8 pages, 5 figures

arXiv:2410.20650 [pdf, other]

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

Abstract: The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.

Submitted 27 October, 2024; originally announced October 2024.

arXiv:2409.12173 [pdf, other]

Poisson approximate likelihood compared to the particle filter

Authors: Yize Hao, Aaron A. Abkemeier, Edward L. Ionides

Abstract: Filtering algorithms are fundamental for inference on partially observed stochastic dynamic systems, since they provide access to the likelihood function and hence enable likelihood-based or Bayesian inference. A novel Poisson approximate likelihood (PAL) filter was introduced by Whitehouse et al. (2023). PAL employs a Poisson approximation to conditional densities, offering a fast approximation to the likelihood function for a certain subset of partially observed Markov process models. A central piece of evidence for PAL is the comparison in Table 1 of Whitehouse et al. (2023), which claims a large improvement for PAL over a standard particle filter algorithm. This evidence, based on a model and data from a previous scientific study by Stocks et al. (2020), might suggest that researchers confronted with similar models should use PAL rather than particle filter methods. Taken at face value, this evidence also reduces the credibility of Stocks et al. (2020) by indicating a shortcoming with the numerical methods that they used. However, we show that the comparison of log-likelihood values made by Whitehouse et al. (2023) is flawed because their PAL calculations were carried out using a dataset scaled differently from the previous study. If PAL and the particle filter are applied to the same data, the advantage claimed for PAL disappears. On simulations where the model is correctly specified, the particle filter outperforms PAL.

Submitted 18 September, 2024; originally announced September 2024.

arXiv:2409.11134 [pdf, other]

E-Values for Exponential Families: the General Case

Authors: Yunda Hao, Peter Grünwald

Abstract: We analyze common types of e-variables and e-processes for composite exponential family nulls: the optimal e-variable based on the reverse information projection (RIPr), the conditional (COND) e-variable, and the universal inference (UI) and sequentialized RIPr e-processes. We characterize the RIPr prior for simple and Bayes-mixture based alternatives, either precisely (for Gaussian nulls and alternatives) or in an approximate sense (general exponential families). We provide conditions under which the RIPr e-variable is (again exactly vs. approximately) equal to the COND e-variable. Based on these and other interrelations which we establish, we determine the e-power of the four e-statistics as a function of sample size, exactly for Gaussian and up to $o(1)$ in general. For $d$-dimensional null and alternative, the e-power of UI tends to be smaller by a term of $(d/2) \log n + O(1)$ than that of the COND e-variable, which is the clear winner.

Submitted 17 September, 2024; originally announced September 2024.

arXiv:2404.19465 [pdf, other]

Optimal E-Values for Exponential Families: the Simple Case

Authors: Peter Grünwald, Tyron Lardy, Yunda Hao, Shaul K. Bar-Lev, Martijn de Jong

Abstract: We provide a general condition under which e-variables in the form of a simple-vs.-simple likelihood ratio exist when the null hypothesis is a composite, multivariate exponential family. Such `simple' e-variables are easy to compute and expected-log-optimal with respect to any stopping time. Simple e-variables were previously only known to exist in quite specific settings, but we offer a unifying theorem on their existence for testing exponential families. We start with a simple alternative $Q$ and a regular exponential family null. Together these induce a second exponential family ${\cal Q}$ containing $Q$, with the same sufficient statistic as the null. Our theorem shows that simple e-variables exist whenever the covariance matrices of ${\cal Q}$ and the null are in a certain relation. A prime example in which this relation holds is testing whether a parameter in a linear regression is 0. Other examples include some $k$-sample tests, Gaussian location- and scale tests, and tests for more general classes of natural exponential families. While in all these examples, the implicit composite alternative is also an exponential family, in general this is not required.

Submitted 1 April, 2025; v1 submitted 30 April, 2024; originally announced April 2024.

arXiv:2403.17592 [pdf, other]

On the Benefits of Over-parameterization for Out-of-Distribution Generalization

Authors: Yifan Hao, Yong Lin, Difan Zou, Tong Zhang

Abstract: In recent years, machine learning models have achieved success based on the independently and identically distributed assumption. However, this assumption can be easily violated in real-world applications, leading to the Out-of-Distribution (OOD) problem. Understanding how modern over-parameterized DNNs behave under non-trivial natural distributional shifts is essential, as current theoretical understanding is insufficient. Existing theoretical works often provide meaningless results for over-parameterized models in OOD scenarios or even contradict empirical findings. To this end, we are investigating the performance of the over-parameterized model in terms of OOD generalization under the general benign overfitting conditions. Our analysis focuses on a random feature model and examines non-trivial natural distributional shifts, where the benign overfitting estimators demonstrate a constant excess OOD loss, despite achieving zero excess in-distribution (ID) loss. We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss. Intuitively, the variance term of ID loss remains low due to orthogonality of long-tail features, meaning overfitting noise during training generally doesn't raise testing loss. However, in OOD cases, distributional shift increases the variance term. Thankfully, the inherent shift is unrelated to individual x, maintaining the orthogonality of long-tail features. Expanding the hidden dimension can additionally improve this orthogonality by mapping the features into higher-dimensional spaces, thereby reducing the variance term. We further show that model ensembles also improve OOD loss, akin to increasing model capacity. These insights explain the empirical phenomenon of enhanced OOD generalization through model ensembles, supported by consistent simulations with theoretical results.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2402.03295 [pdf, other]

Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

Abstract: Second-order optimization approaches like the generalized Gauss-Newton method are considered more powerful as they utilize the curvature information of the objective function with preconditioning matrices. Albeit offering tempting theoretical benefits, they are not easily applicable to modern deep learning. The major reason is due to the quadratic memory and cubic time complexity to compute the inverse of the matrix. These requirements are infeasible even with state-of-the-art hardware. In this work, we propose Ginger, an eigendecomposition for the inverse of the generalized Gauss-Newton matrix. Our method enjoys efficient linear memory and time complexity for each iteration. Instead of approximating the conditioning matrix, we directly maintain its inverse to make the approximation more accurate. We provide the convergence result of Ginger for non-convex objectives. Our experiments on different tasks with different model architectures verify the effectiveness of our method. Our code is publicly available.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.03293 [pdf, other]

Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

Abstract: Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, the low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.

Submitted 12 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted @ ICML 2024

arXiv:2401.12236 [pdf, ps, other]

The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness

Authors: Yifan Hao, Tong Zhang

Abstract: Recent empirical and theoretical studies have established the generalization capabilities of large machine learning models that are trained to (approximately or exactly) fit noisy data. In this work, we prove a surprising result that even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the ``standard'' out-of-sample risk objective, this benign overfitting process can be harmful when out-of-sample data are subject to adversarial manipulation. More specifically, our main results contain two parts: (i) the min-norm estimator in overparameterized linear model always leads to adversarial vulnerability in the ``benign overfitting'' setting; (ii) we verify an asymptotic trade-off result between the standard risk and the ``adversarial'' risk of every ridge regression estimator, implying that under suitable conditions these two items cannot both be small at the same time by any single choice of the ridge regularization parameter. Furthermore, under the lazy training regime, we demonstrate parallel results on two-layer neural tangent kernel (NTK) model, which align with empirical observations in deep neural networks. Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attack, while benignly overfitted neural networks lead to models that are not robust.

Submitted 25 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

arXiv:2307.02037 [pdf, other]

Reverse Diffusion Monte Carlo

Authors: Xunpeng Huang, Hanze Dong, Yifan Hao, Yi-An Ma, Tong Zhang

Abstract: We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates -- the score functions -- are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (rdMC), which is distinct from the Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution with any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with rdMC can be significantly faster than that with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, rdMC greatly improves over the Langevin-style MCMC sampling methods both theoretically and in practice. The proposed rdMC method offers a new perspective and solution beyond classical MCMC algorithms for the challenging complex distributions.

Submitted 13 March, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: 44 pages, 16 figures, ICLR 2024

arXiv:2303.00471 [pdf, other]

E-values for k-Sample Tests With Exponential Families

Authors: Yunda Hao, Peter Grünwald, Tyron Lardy, Long Long, Reuben Adams

Abstract: We develop and compare e-variables for testing whether $k$ samples of data are drawn from the same distribution, the alternative being that they come from different elements of an exponential family. We consider the GRO (growth-rate optimal) e-variables for (1) a `small' null inside the same exponential family, and (2) a `large' nonparametric null, as well as (3) an e-variable arrived at by conditioning on the sum of the sufficient statistics. (2) and (3) are efficiently computable, and extend ideas from Turner et al. [2021] and Wald [1947] respectively from Bernoulli to general exponential families. We provide theoretical and simulation-based comparisons of these e-variables in terms of their logarithmic growth rate, and find that for small effects all four e-variables behave surprisingly similarly; for the Gaussian location and Poisson families, e-variables (1) and (3) coincide; for Bernoulli, (1) and (2) coincide; but in general, whether (2) or (3) grows faster under the alternative is family-dependent. We furthermore discuss algorithms for numerically approximating (1).

Submitted 8 January, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2202.11968 [pdf]

Combining the target trial and estimand frameworks to define the causal estimand: an application using real-world data to contextualize a single-arm trial

Authors: Lisa V Hampson, Jufen Chu, Aiesha Zia, Jie Zhang, Wei-Chun Hsu, Craig Parzynski, Yanni Hao, Evgeny Degtyarev

Abstract: Single-arm trials (SATs) may be used to support regulatory submissions in settings where there is a high unmet medical need and highly promising early efficacy data undermine the equipoise needed for randomization. In this context, patient-level real-world data (RWD) may be used to create an external control arm (ECA) to contextualize the SAT results. However, naive comparisons of the SAT with its ECA will yield biased estimates of causal effects if groups are imbalanced with regards to (un)measured prognostic factors. Several methods are available to adjust for measured confounding, but the interpretation of such analyses is challenging unless the causal question of interest is clearly defined, and the estimator is aligned with the estimand. Additional complications arise when patients in the ECA are eligible for the SAT at multiple timepoints. In this paper, we use a case-study of a pivotal SAT of a novel CAR-T therapy for heavily pre-treated patients with follicular lymphoma to illustrate how a combination of the target trial and the ICH E9(R1) estimand frameworks can be used to define the target estimand and avoid common methodological pitfalls related to the design of the ECA and comparisons with the SAT. We also propose an approach to address the challenge of how to define an appropriate time zero for external controls who meet the SAT inclusion/exclusion criteria at several timepoints. Use of the target trial and estimand frameworks facilitates discussions amongst internal and external stakeholders, as well as an early assessment of the adequacy of the available RWD.

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2202.07172 [pdf, other]

TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm

Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

Abstract: Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.

Submitted 17 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 19 pages, 12 figures

arXiv:2110.14374 [pdf, other]

A2I Transformer: Permutation-equivariant attention network for pairwise and many-body interactions with minimal featurization

Authors: Ji Woong Yu, Min Young Ha, Bumjoon Seo, Won Bo Lee

Abstract: The combination of neural network potential (NNP) with molecular simulations plays an important role in an efficient and thorough understanding of a molecular system's potential energy surface (PES). However, grasping the interplay between input features and their local contribution to NNP is growingly evasive due to heavy featurization. In this work, we suggest an end-to-end model which directly predicts per-atom energy from the coordinates of particles, avoiding expert-guided featurization of the network input. Employing self-attention as the main workhorse, our model is intrinsically equivariant under the permutation operation, resulting in the invariance of the total potential energy. We tested our model against several challenges in molecular simulation problems, including periodic boundary condition (PBC), $n$-body interaction, and binary composition. Our model yielded stable predictions in all tested systems with errors significantly smaller than the potential energy fluctuation acquired from molecular dynamics simulations. Thus, our work provides a minimal baseline model that encodes complex interactions in a condensed phase system to facilitate the data-driven analysis of physicochemical systems.

Submitted 27 October, 2021; originally announced October 2021.

arXiv:2010.16055 [pdf, other]

Unsupervised Embedding of Hierarchical Structure in Euclidean Space

Authors: Jinyu Zhao, Yi Hao, Cyrus Rashtchian

Abstract: Deep embedding methods have influenced many areas of unsupervised learning. However, the best methods for learning hierarchical structure use non-Euclidean representations, whereas Euclidean geometry underlies the theory behind many hierarchical clustering algorithms. To bridge the gap between these two areas, we consider learning a non-linear embedding of data into Euclidean space as a way to improve the hierarchical clustering produced by agglomerative algorithms. To learn the embedding, we revisit using a variational autoencoder with a Gaussian mixture prior, and we show that rescaling the latent space embedding and then applying Ward's linkage-based algorithm leads to improved results for both dendrogram purity and the Moseley-Wang cost function. Finally, we complement our empirical results with a theoretical explanation of the success of this approach. We study a synthetic model of the embedded vectors and prove that Ward's method exactly recovers the planted hierarchical clustering with high probability.

Submitted 29 October, 2020; originally announced October 2020.

arXiv:2007.08053 [pdf, other]

doi 10.24963/ijcai.2020/168

Inductive Link Prediction for Nodes Having Only Attribute Information

Authors: Yu Hao, Xin Cao, Yixiang Fang, Xike Xie, Sibo Wang

Abstract: Predicting the link between two nodes is a fundamental problem for graph data analytics. In attributed graphs, both the structure and attribute information can be utilized for link prediction. Most existing studies focus on transductive link prediction where both nodes are already in the graph. However, many real-world applications require inductive prediction for new nodes having only attribute information. It is more challenging since the new nodes do not have structure information and cannot be seen during the model training. To solve this problem, we propose a model called DEAL, which consists of three components: two node embedding encoders and one alignment mechanism. The two encoders aim to output the attribute-oriented node embedding and the structure-oriented node embedding, and the alignment mechanism aligns the two types of embeddings to build the connections between the attributes and links. Our model DEAL is versatile in the sense that it works for both inductive and transductive link prediction. Extensive experiments on several benchmark datasets show that our proposed model significantly outperforms existing inductive link prediction methods, and also outperforms the state-of-the-art methods on transductive link prediction.

Submitted 15 July, 2020; originally announced July 2020.

Comments: IJCAI2020

arXiv:2004.11934 [pdf, other]

Correlation-aware Unsupervised Change-point Detection via Graph Neural Networks

Authors: Ruohong Zhang, Yu Hao, Donghan Yu, Wei-Cheng Chang, Guokun Lai, Yiming Yang

Abstract: Change-point detection (CPD) aims to detect abrupt changes over time series data. Intuitively, effective CPD over multivariate time series should require explicit modeling of the dependencies across input variables. However, existing CPD methods either ignore the dependency structures entirely or rely on the (unrealistic) assumption that the correlation structures are static over time. In this paper, we propose a Correlation-aware Dynamics Model for CPD, which explicitly models the correlation structure and dynamics of variables by incorporating graph neural networks into an encoder-decoder framework. Extensive experiments on synthetic and real-world datasets demonstrate the advantageous performance of the proposed model on CPD tasks over strong baselines, as well as its ability to classify the change-points as correlation changes or independent changes. Keywords: Multivariate Time Series, Change-point Detection, Graph Neural Networks

Submitted 13 September, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: Accepted for publication in the International Conference on Neural Information Processing (ICONIP) 2020. Original paper is 12 pages; an additional appendix is available on arXiv

MSC Class: I.2.6

Journal ref: ICONIP 2020: Neural Information Processing

arXiv:2003.09660 [pdf, other]

NeuCrowd: Neural Sampling Network for Representation Learning with Crowdsourced Labels

Authors: Yang Hao, Wenbiao Ding, Zitao Liu

Abstract: Representation learning approaches require a massive amount of discriminative training data, which is unavailable in many scenarios, such as healthcare, smart city, education, etc. In practice, people refer to crowdsourcing to get annotated labels. However, due to issues like data privacy, budget limitation, shortage of domain-specific annotators, the number of crowdsourced labels is still very limited. Moreover, because of annotators' diverse expertise, crowdsourced labels are often inconsistent. Thus, directly applying existing supervised representation learning (SRL) algorithms may easily get the overfitting problem and yield suboptimal solutions. In this paper, we propose NeuCrowd, a unified framework for SRL from crowdsourced labels. The proposed framework (1) creates a sufficient number of high-quality n-tuplet training samples by utilizing safety-aware sampling and robust anchor generation; and (2) automatically learns a neural sampling network that adaptively learns to select effective samples for SRL networks. The proposed framework is evaluated on both one synthetic and three real-world data sets. The results show that our approach outperforms a wide range of state-of-the-art baselines in terms of prediction accuracy and AUC. To encourage reproducible results, we make our code publicly available at https://github.com/tal-ai/NeuCrowd_KAIS2021.

Submitted 15 December, 2021; v1 submitted 21 March, 2020; originally announced March 2020.

Comments: Accepted in Knowledge and Information Systems

arXiv:2002.11665 [pdf, ps, other]

Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions

Authors: Yi Hao, Alon Orlitsky

Abstract: The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression, for which we derive optimal near-linear-time block and sequential algorithms. To further our understanding of profile entropy, we investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.

Submitted 26 February, 2020; originally announced February 2020.

Comments: 56 pages

arXiv:2002.09589 [pdf, other]

SURF: A Simple, Universal, Robust, Fast Distribution Learning Algorithm

Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

Abstract: Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straightforward empirical probability approximation of each potential polynomial piece through simple empirical-probability interpolation, and using plain divide-and-conquer to merge the pieces; universal, as well-known polynomial-approximation results imply that it accurately approximates a large class of common distributions; robust to distribution mis-specification as for any degree $d \le 8$, it estimates any distribution to an $\ell_1$ distance $< 3$ times that of the nearest degree-$d$ piecewise polynomial, improving known factor upper bounds of 3 for single polynomials and 15 for polynomials with arbitrarily many pieces; fast, using optimal sample complexity, running in near sample-linear time, and if given sorted samples it may be parallelized to run in sub-linear time. In experiments, SURF outperforms state-of-the-art algorithms.

Submitted 11 February, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: 27 pages, 9 figures, 3 tables

arXiv:1911.03105 [pdf, ps, other]

Unified Sample-Optimal Property Estimation in Near-Linear Time

Authors: Yi Hao, Alon Orlitsky

Abstract: We consider the fundamental learning problem of estimating properties of distributions over large domains. Using a novel piecewise-polynomial approximation technique, we derive the first unified methodology for constructing sample- and time-efficient estimators for all sufficiently smooth, symmetric and non-symmetric, additive properties. This technique yields near-linear-time computable estimators whose approximation values are asymptotically optimal and highly-concentrated, resulting in the first: 1) estimators achieving the $\mathcal{O}(k/(\varepsilon^2\log k))$ min-max $\varepsilon$-error sample complexity for all $k$-symbol Lipschitz properties; 2) unified near-optimal differentially private estimators for a variety of properties; 3) unified estimator achieving optimal bias and near-optimal variance for five important properties; 4) near-optimal sample-complexity estimators for several important symmetric properties over both domain sizes and confidence levels. In addition, we establish a McDiarmid's inequality under Poisson sampling, which is of independent interest.

Submitted 17 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Appeared at NeurIPS 2019. Fixed a few typos and minor issues in corner cases

arXiv:1911.00776 [pdf, other]

Ten-year Survival Prediction for Breast Cancer Patients

Authors: Changmao Li, Han He, Yunze Hao, Caleb Ziems

Abstract: This report assesses different machine learning approaches to 10-year survival prediction of breast cancer patients.

Submitted 2 November, 2019; originally announced November 2019.

arXiv:1906.03794 [pdf, other]

The Broad Optimality of Profile Maximum Likelihood

Authors: Yi Hao, Alon Orlitsky

Abstract: We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$: $\textbf{Distribution estimation}$ Under $\ell_1$ distance, PML yields optimal $\Theta(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution; $\textbf{Additive property estimation}$ For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence; $\boldsymbol{\alpha}\textbf{-Rényi entropy estimation}$ For integer $\alpha>1$, the PML plug-in estimator has optimal $k^{1-1/\alpha}$ sample complexity; for non-integer $\alpha>3/4$, the PML plug-in estimator has sample complexity lower than the state of the art; $\textbf{Identity testing}$ In testing whether an unknown distribution is equal to or at least $\varepsilon$ far from a given distribution in $\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$. Most of these results also hold for a near-linear-time computable variant of PML. Stronger results hold for a different and novel variant called truncated PML (TPML).

Submitted 11 July, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

Comments: Added a new section (Section 8) about truncated PML (TPML) and derived several new results

arXiv:1905.13550 [pdf]

A novel hybrid model based on multi-objective Harris hawks optimization algorithm for daily PM2.5 and PM10 forecasting

Authors: Pei Du, Jianzhou Wang, Yan Hao, Tong Niu, Wendong Yang

Abstract: High levels of air pollution may seriously affect people's living environment and even endanger their lives. In order to reduce air pollution concentrations, and warn the public before the occurrence of hazardous air pollutants, it is urgent to design an accurate and reliable air pollutant forecasting model. However, most previous research has many deficiencies, such as ignoring the importance of predictive stability and using poor initial parameters, which significantly affect the performance of air pollution prediction. Therefore, to address these issues, a novel hybrid model is proposed in this study. Specifically, a powerful data preprocessing technique is applied to decompose the original time series into different modes from low-frequency to high-frequency. Next, a new multi-objective algorithm called MOHHO is first developed in this study, which is introduced to tune the parameters of the ELM model so as to achieve high forecasting accuracy and stability simultaneously for air pollution series prediction. The optimized ELM model is then used to perform the time series prediction. Finally, a scientific and robust evaluation system including several error criteria, benchmark models, and several experiments using six air pollutant concentration time series from three cities in China is designed to perform a comprehensive assessment of the presented hybrid forecasting model. Experimental results indicate that the proposed hybrid model can guarantee a more stable and higher predictive performance compared to others, whose superior prediction ability may help to develop effective plans for air pollutant emissions and prevent health problems caused by air pollution.

Submitted 30 May, 2019; originally announced May 2019.

Comments: 24 pages, 4 figures

MSC Class: 68U20

arXiv:1904.00070 [pdf, other]

Data Amplification: A Unified and Competitive Approach to Property Estimation

Authors: Yi Hao, Alon Orlitsky, Ananda T. Suresh, Yihong Wu

Abstract: Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent, "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.

Submitted 29 March, 2019; originally announced April 2019.

Comments: In NeurIPS 2018

arXiv:1903.01432 [pdf, other]

Data Amplification: Instance-Optimal Property Estimation

Authors: Yi Hao, Alon Orlitsky

Abstract: The best-known and most commonly used distribution-property estimation technique uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly "amplify" the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a very broad class of properties including $\ell_1$-distance, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.

Submitted 5 March, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

Comments: In this new version, we strengthened the previous results by eliminating unnecessary assumptions

arXiv:1810.11754 [pdf, other]

On Learning Markov Chains

Authors: Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati

Abstract: The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown $k$-state Markov chain from its $n$ sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general $f$-divergence measure. For the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is $\Omega(k\log\log n / n)$ and $\mathcal{O}(k^2\log\log n / n)$. For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth $f$-divergences, including KL-, $L_2$-, Chi-squared, Hellinger, and Alpha-divergences.

Submitted 27 October, 2018; originally announced October 2018.

Comments: To appear at NIPS 2018

arXiv:1806.00740 [pdf]

How does climate change influence regional stability

Authors: Tianyu Shi, Jiayan Guo, Xuxin Cheng, Yu hao

Abstract: Different places have different levels of regional stability, which is influenced by many factors. This paper aims to analyze the influence of climate change on regional stability. Several factors that may influence regional stability are proposed. Then Principal Component Analysis (PCA) is used to select the most relevant factors. After that, a BP neural network is established considering all the principal components to evaluate the Region Stability (RS). Subsequently, the specific influence of climate change is analyzed, and the results show that long-term average precipitation is a main climate factor influencing the RS.

Submitted 5 August, 2018; v1 submitted 3 June, 2018; originally announced June 2018.

arXiv:1704.05041 [pdf, other]

Fast multi-output relevance vector regression

Authors: Youngmin Ha

Abstract: This paper aims to decrease the time complexity of multi-output relevance vector regression from O(VM^3) to O(V^3+M^3), where V is the number of output dimensions, M is the number of basis functions, and V<M. The experimental results demonstrate that the proposed method is more competitive than the existing method, with regard to computation time. MATLAB codes are available at http://www.mathworks.com/matlabcentral/fileexchange/49131.

Submitted 17 April, 2017; originally announced April 2017.

arXiv:1704.04137 [pdf, other]

Fashion Conversation Data on Instagram

Authors: Yu-I Ha, Sejeong Kwon, Meeyoung Cha, Jungseock Joo

Abstract: The fashion industry is establishing its presence on a number of visual-centric social media like Instagram. This creates an interesting clash as fashion brands that have traditionally practiced highly creative and editorialized image marketing now have to engage with people on the platform that epitomizes impromptu, realtime conversation. What kinds of fashion images do brands and individuals share and what are the types of visual features that attract likes and comments? In this research, we take both quantitative and qualitative approaches to answer these questions. We analyze visual features of fashion posts first via manual tagging and then via training on convolutional neural networks. The classified images were examined across four types of fashion brands: mega couture, small couture, designers, and high street. We find that while product-only images make up the majority of fashion conversation in terms of volume, body snaps and face images that portray fashion items more naturally tend to receive a larger number of likes and comments by the audience. Our findings bring insights into building an automated tool for classifying or generating influential fashion information. We make our novel dataset of 24,752 labeled images on fashion conversations, containing visual and textual cues, available for the research community.

Submitted 13 April, 2017; originally announced April 2017.

Comments: 10 pages, 6 figures, This paper will be presented at ICWSM'17

arXiv:1311.1040 [pdf]

Combined Independent Component Analysis and Canonical Polyadic Decomposition via Joint Diagonalization

Authors: Xiao-Feng Gong, Cheng-Yuan Wang, Ya-Na Hao, Qiu-Hua Lin

Abstract: Recently, there has been a trend to combine independent component analysis and canonical polyadic decomposition (ICA-CPD) for an enhanced robustness for the computation of CPD, and ICA-CPD could be further converted into CPD of a 5th-order partially symmetric tensor, by calculating the eigenmatrices of the 4th-order cumulant slices of a trilinear mixture. In this study, we propose a new 5th-order CPD algorithm constrained with partial symmetry based on joint diagonalization. As the main steps involved in the proposed algorithm undergo no updating iterations for the loading matrices, it is much faster than the existing algorithm based on alternating least squares and enhanced line search, with competent performances. Simulation results are provided to demonstrate the performance of the proposed algorithm.

Submitted 27 December, 2016; v1 submitted 5 November, 2013; originally announced November 2013.

Comments: IEEE China Summit & International Conference on Signal and Information Processing. IEEE, 2014:804 - 808

Showing 1–32 of 32 results for author: Ha, Y