-
Quasi-Monte Carlo with one categorical variable
Authors:
Valerie N. P. Ho,
Art B. Owen,
Zexin Pan
Abstract:
We study randomized quasi-Monte Carlo (RQMC) estimation of a multivariate integral where one of the variables takes only a finite number of values. This problem arises when the variable of integration is drawn from a mixture distribution as is common in importance sampling and also arises in some recent work on transport maps. We find that when integration error decreases at an RQMC rate that it i…
▽ More
We study randomized quasi-Monte Carlo (RQMC) estimation of a multivariate integral where one of the variables takes only a finite number of values. This problem arises when the variable of integration is drawn from a mixture distribution as is common in importance sampling and also arises in some recent work on transport maps. We find that when integration error decreases at an RQMC rate that it is then beneficial to oversample the smallest mixture components instead of using a proportional allocation. We also find that for the most accurate RQMC sampling methods, it is advantageous to arrange that our $n=2^m$ randomized Sobol' points split into subsample sizes that are also powers of~$2$.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts
Authors:
Fanqi Yan,
Huy Nguyen,
Dung Le,
Pedram Akbarian,
Nhat Ho,
Alessandro Rinaldo
Abstract:
The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplore…
▽ More
The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In the paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in the sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the prompt convergence rate to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures
Authors:
Tuan Thai,
TrungTin Nguyen,
Dat Do,
Nhat Ho,
Christopher Drovandi
Abstract:
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the…
▽ More
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.
△ Less
Submitted 23 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating
Authors:
Huy Nguyen,
Thong T. Doan,
Quang Pham,
Nghi D. Q. Bui,
Nhat Ho,
Alessandro Rinaldo
Abstract:
Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSe…
▽ More
Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Convergence Rates for Softmax Gating Mixture of Experts
Authors:
Huy Nguyen,
Nhat Ho,
Alessandro Rinaldo
Abstract:
Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dyna…
▽ More
Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed \emph{strong identifiability} condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
Authors:
Khai Nguyen,
Hai Nguyen,
Tuan Pham,
Nhat Ho
Abstract:
We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, w…
▽ More
We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
△ Less
Submitted 15 May, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts
Authors:
Fanqi Yan,
Huy Nguyen,
Dung Le,
Pedram Akbarian,
Nhat Ho
Abstract:
We conduct the convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated from the prompt learning problem where ones utilize prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of t…
▽ More
We conduct the convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated from the prompt learning problem where ones utilize prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of the pre-trained model and the prompt may converge to zero during the training, leading to the prompt vanishing issue; (ii) the algebraic interaction among parameters of the pre-trained model and the prompt can occur via some partial differential equations and decelerate the prompt learning. In response, we introduce a distinguishability condition to control the previous parameter interaction. Additionally, we also investigate various types of expert structure to understand their effects on the convergence behavior of parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds. Finally, we run several numerical experiments to empirically justify our theoretical findings.
△ Less
Submitted 5 March, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention
Authors:
Pedram Akbarian,
Huy Nguyen,
Xing Han,
Nhat Ho
Abstract:
Mixture of Experts (MoE) models are well known for effectively scaling model capacity while preserving computational overheads. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive converge…
▽ More
Mixture of Experts (MoE) models are well known for effectively scaling model capacity while preserving computational overheads. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive convergence analysis of MoE models with two different quadratic gating functions, namely the quadratic polynomial gate and the quadratic monomial gate, offering useful insights into the design of gating and experts for the MoE framework. First, our analysis indicates that the use of the quadratic monomial gate yields an improved sample efficiency for estimating parameters and experts compared to the quadratic polynomial gate. Second, parameter and expert estimation rates become significantly faster when employing non-linear experts in place of linear experts. Combining these theoretical insights with the above link between MoE and self-attention, we propose a novel \emph{active-attention} mechanism where we apply a non-linear activation function to the value matrix in the formula of self-attention. Finally, we demonstrate that the proposed active-attention outperforms the standard self-attention through several extensive experiments in various tasks, including image classification, language modeling, and multivariate time series forecasting.
△ Less
Submitted 8 July, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Improving Generalization with Flat Hilbert Bayesian Inference
Authors:
Tuan Truong,
Quyen Tran,
Quan Pham-Ngoc,
Nhat Ho,
Dinh Phung,
Trung Le
Abstract:
We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within a reproducing kernel Hilbert space. This methodology is supported by a theoretical analysis that extends previous findings on generaliza…
▽ More
We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within a reproducing kernel Hilbert space. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against nine baseline methods on the \texttt{VTAB-1K} benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy.
△ Less
Submitted 8 June, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions
Authors:
Huy Nguyen,
Xing Han,
Carl Harris,
Suchi Saria,
Nhat Ho
Abstract:
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gatin…
▽ More
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE frameworks. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by the Softmax gating and, therefore, accelerates the expert convergence as well as enhances the expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements compared to the conventional HMoE models.
△ Less
Submitted 6 March, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
A Primal-Dual Framework for Transformers and Neural Networks
Authors:
Tan M. Nguyen,
Tam Nguyen,
Nhat Ho,
Andrea L. Bertozzi,
Richard G. Baraniuk,
Stanley J. Osher
Abstract:
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresp…
▽ More
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Statistical Advantages of Perturbing Cosine Router in Mixture of Experts
Authors:
Huy Nguyen,
Pedram Akbarian,
Trang Pham,
Trang Nguyen,
Shujian Zhang,
Nhat Ho
Abstract:
The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical su…
▽ More
The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least square estimation of the cosine routing MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^τ(n))$ where $τ> 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router -- simply adding noises to the $\ell^2$-norms in the cosine router, which we refer to as \textit{perturbed cosine router}. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
△ Less
Submitted 5 March, 2025; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Authors:
Huy Nguyen,
Nhat Ho,
Alessandro Rinaldo
Abstract:
The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and…
▽ More
The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator under the over-specified case in which the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating than those under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
△ Less
Submitted 2 November, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Data-Driven DRO and Economic Decision Theory: An Analytical Synthesis With Bayesian Nonparametric Advancements
Authors:
Nicola Bariletto,
Khai Nguyen,
Nhat Ho
Abstract:
We develop an analytical synthesis that bridges data-driven Distributionally Robust Optimization (DRO) and Economic Decision Theory under Ambiguity (DTA). By reinterpreting standard regularization and DRO techniques as data-driven counterparts of ambiguity-averse decision models, we provide a unified framework that clarifies their intrinsic connections. Building on this synthesis, we propose a nov…
▽ More
We develop an analytical synthesis that bridges data-driven Distributionally Robust Optimization (DRO) and Economic Decision Theory under Ambiguity (DTA). By reinterpreting standard regularization and DRO techniques as data-driven counterparts of ambiguity-averse decision models, we provide a unified framework that clarifies their intrinsic connections. Building on this synthesis, we propose a novel DRO approach that leverages a popular DTA model of smooth ambiguity-averse preferences together with tools from Bayesian nonparametric statistics. Our baseline framework employs Dirichlet Process (DP) posteriors, which naturally extend to heterogeneous data sources via Hierarchical Dirichlet Processes (HDPs), and can be further refined to induce outlier robustness through a procedure that selectively filters poorly-fitting observations during training. Theoretical performance guarantees and convergence results, together with extensive simulations and real-data experiments, illustrate the method's favorable performance in terms of prediction accuracy and stability.
△ Less
Submitted 25 February, 2025; v1 submitted 21 May, 2024;
originally announced May 2024.
-
Towards Marginal Fairness Sliced Wasserstein Barycenter
Authors:
Khai Nguyen,
Hai Nguyen,
Nhat Ho
Abstract:
The sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness ba…
▽ More
The sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness barycenter due to the heterogeneous structure of marginals and the non-optimality of the optimization. As the first attempt to tackle the problem, we define the marginal fairness sliced Wasserstein barycenter (MFSWB) as a constrained SWB problem. Due to the computational disadvantages of the formal definition, we propose two hyperparameter-free and computationally tractable surrogate MFSWB problems that implicitly minimize the distances to marginals and encourage marginal fairness at the same time. To further improve the efficiency, we perform slicing distribution selection and obtain the third surrogate definition by introducing a new slicing distribution that focuses more on marginally unfair projecting directions. We discuss the relationship of the three proposed problems and their relationship to sliced multi-marginal Wasserstein distance. Finally, we conduct experiments on finding 3D point-clouds averaging, color harmonization, and training of sliced Wasserstein autoencoder with class-fairness representation to show the favorable performance of the proposed surrogate MFSWB problems.
△ Less
Submitted 3 February, 2025; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint Distributions
Authors:
Khai Nguyen,
Nhat Ho
Abstract:
Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multip…
▽ More
Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multiple different domains. Using SW and GSW directly on the joint domains cannot make a meaningful comparison since their homogeneous slicing operator i.e., Radon Transform (RT) and Generalized Radon Transform (GRT) are not expressive enough to capture the structure of the joint supports set. To address the issue, we propose two new slicing operators i.e., Partial Generalized Radon Transform (PGRT) and Hierarchical Hybrid Radon Transform (HHRT). In greater detail, PGRT is the generalization of Partial Radon Transform (PRT), which transforms a subset of function arguments non-linearly while HHRT is the composition of PRT and multiple domain-specific PGRT on marginal domain arguments. By using HHRT, we extend the SW into Hierarchical Hybrid Sliced Wasserstein (H2SW) distance which is designed specifically for comparing heterogeneous joint distributions. We then discuss the topological, statistical, and computational properties of H2SW. Finally, we demonstrate the favorable performance of H2SW in 3D mesh deformation, deep 3D mesh autoencoders, and datasets comparison.
△ Less
Submitted 7 October, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
On Parameter Estimation in Deviated Gaussian Mixture of Experts
Authors:
Huy Nguyen,
Khai Nguyen,
Nhat Ho
Abstract:
We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - λ^{\ast}) g_0(Y| X)+ λ^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},σ_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $λ^{\ast} \in [0, 1]$ is true but unkno…
▽ More
We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - λ^{\ast}) g_0(Y| X)+ λ^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},σ_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $λ^{\ast} \in [0, 1]$ is true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, σ_{i}^{\ast})$ for $1 \leq i \leq k^{\ast}$ are unknown parameters of the Gaussian mixture of experts. This problem arises from the goodness-of-fit test when we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or they are generated from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein, a loss function being commonly used for estimating parameters in the Gaussian mixture of experts.
△ Less
Submitted 24 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
On Least Square Estimation in Softmax Gating Mixture of Experts
Authors:
Huy Nguyen,
Nhat Ho,
Alessandro Rinaldo
Abstract:
Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous t…
▽ More
Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate. Our findings have important practical implications for expert selection.
△ Less
Submitted 24 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Sliced Wasserstein with Random-Path Projecting Directions
Authors:
Khai Nguyen,
Shujian Zhang,
Tam Le,
Nhat Ho
Abstract:
Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing d…
▽ More
Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing distribution that provides a fast sampling for the Monte Carlo estimation of expectation. In particular, we introduce the random-path projecting direction (RPD) which is constructed by leveraging the normalized difference between two random vectors following the two input measures. From the RPD, we derive the random-path slicing distribution (RPSD) and two variants of sliced Wasserstein, i.e., the Random-Path Projection Sliced Wasserstein (RPSW) and the Importance Weighted Random-Path Projection Sliced Wasserstein (IWRPSW). We then discuss the topological, statistical, and computational properties of RPSW and IWRPSW. Finally, we showcase the favorable performance of RPSW and IWRPSW in gradient flow and the training of denoising diffusion generative models on images.
△ Less
Submitted 8 May, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization
Authors:
Nicola Bariletto,
Nhat Ho
Abstract:
Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights fr…
▽ More
Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet process) theory and a recent decision-theoretic model of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, among which Ridge and LASSO regressions. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to a variety of tasks based on simulated and real datasets.
△ Less
Submitted 7 November, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?
Authors:
Huy Nguyen,
Pedram Akbarian,
Nhat Ho
Abstract:
Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabi…
▽ More
Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimations are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering them to the softmax function. By imposing linearly independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates. Finally, we conduct a simulation study to empirically validate our theoretical results.
△ Less
Submitted 24 June, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model
Authors:
Hien Dang,
Tho Tran,
Tan Nguyen,
Nhat Ho
Abstract:
The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk that pushes the training loss value towards zero, even after the training error has been vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiang…
▽ More
The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk that pushes the training loss value towards zero, even after the training error has been vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is termed as Neural Collapse (NC). To theoretically understand this phenomenon, recent works employ a simplified unconstrained feature model to prove that NC emerges at the global solutions of the training problem. However, when the training dataset is class-imbalanced, some NC properties will no longer be true. For example, the class-means geometry will skew away from the simplex ETF when the loss converges. In this paper, we generalize NC to imbalanced regime for cross-entropy loss under the unconstrained ReLU feature model. We prove that, while the within-class features collapse property still holds in this setting, the class-means will converge to a structure consisting of orthogonal vectors with different lengths. Furthermore, we find that the classifier weights are aligned to the scaled and centered class-means with scaling factors depend on the number of training samples of each class, which generalizes NC in the class-balanced setting. We empirically prove our results through experiments on practical architectures and dataset.
△ Less
Submitted 6 June, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts
Authors:
Huy Nguyen,
Pedram Akbarian,
TrungTin Nguyen,
Nhat Ho
Abstract:
Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the…
▽ More
Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
△ Less
Submitted 24 June, 2024; v1 submitted 22 October, 2023;
originally announced October 2023.
-
Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts
Authors:
Huy Nguyen,
Pedram Akbarian,
Fanqi Yan,
Nhat Ho
Abstract:
Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitio…
▽ More
Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when $k_{\ast}$ becomes unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.
△ Less
Submitted 23 February, 2024; v1 submitted 24 September, 2023;
originally announced September 2023.
-
Quasi-Monte Carlo for 3D Sliced Wasserstein
Authors:
Khai Nguyen,
Nicola Bariletto,
Nhat Ho
Abstract:
Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Car…
▽ More
Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Carlo (QMC) methods. For a comprehensive investigation of QMC for SW, we focus on the 3D setting, specifically computing the SW between probability measures in three dimensions. In greater detail, we empirically evaluate various methods to construct QMC point sets on the 3D unit-hypersphere, including the Gaussian-based and equal area mappings, generalized spiral points, and optimizing discrepancy energies. Furthermore, to obtain an unbiased estimator for stochastic optimization, we extend QSW to Randomized Quasi-Sliced Wasserstein (RQSW) by introducing randomness in the discussed point sets. Theoretically, we prove the asymptotic convergence of QSW and the unbiasedness of RQSW. Finally, we conduct experiments on various 3D tasks, such as point-cloud comparison, point-cloud interpolation, image style transfer, and training deep point-cloud autoencoders, to demonstrate the favorable performance of the proposed QSW and RQSW variants.
△ Less
Submitted 16 February, 2024; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders
Authors:
Hien Dang,
Tho Tran,
Tan Nguyen,
Nhat Ho
Abstract:
The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations…
▽ More
The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments.
△ Less
Submitted 13 May, 2024; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts
Authors:
Huy Nguyen,
TrungTin Nguyen,
Khai Nguyen,
Nhat Ho
Abstract:
Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from compl…
▽ More
Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from complete. To shed new light on this problem, we provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model. The main challenge of that analysis comes from the inclusion of covariates in the Gaussian gating functions and expert networks, which leads to their intrinsic interaction via some partial differential equations with respect to their parameters. We tackle these issues by designing novel Voronoi loss functions among parameters to accurately capture the heterogeneity of parameter estimation rates. Our findings reveal that the MLE has distinct behaviors under two complement settings of location parameters of the Gaussian gating functions, namely when all these parameters are non-zero versus when at least one among them vanishes. Notably, these behaviors can be characterized by the solvability of two different systems of polynomial equations. Finally, we conduct a simulation study to empirically verify our theoretical results.
△ Less
Submitted 9 February, 2024; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Demystifying Softmax Gating Function in Gaussian Mixture of Experts
Authors:
Huy Nguyen,
TrungTin Nguyen,
Nhat Ho
Abstract:
Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the soft…
▽ More
Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the softmax gating and the expert functions in the Gaussian density; (iii) the complex dependence between the numerator and denominator of the conditional density of softmax gating Gaussian mixture of experts. We resolve these challenges by proposing novel Voronoi loss functions among parameters and establishing the convergence rates of maximum likelihood estimator (MLE) for solving parameter estimation in these models. When the true number of experts is unknown and over-specified, our findings show a connection between the convergence rate of the MLE and a solvability problem of a system of polynomial equations.
△ Less
Submitted 29 October, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Sliced Wasserstein Estimation with Control Variates
Authors:
Khai Nguyen,
Nhat Ho
Abstract:
The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate…
▽ More
The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then we utilize the closed-form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point-clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point-clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.
△ Less
Submitted 18 February, 2024; v1 submitted 30 April, 2023;
originally announced May 2023.
-
Energy-Based Sliced Wasserstein Distance
Authors:
Khai Nguyen,
Nhat Ho
Abstract:
The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best dis…
▽ More
The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best distribution which belongs to a parametric family of distributions and can maximize the expected distance. However, both approaches have their limitations. A fixed prior distribution is non-informative in terms of highlighting projecting directions that can discriminate two general probability measures. Doing optimization for the best distribution is often expensive and unstable. Moreover, designing the parametric family of the candidate distribution could be easily misspecified. To address the issues, we propose to design the slicing distribution as an energy-based distribution that is parameter-free and has the density proportional to an energy function of the projected one-dimensional Wasserstein distance. We then derive a novel sliced Wasserstein metric, energy-based sliced Waserstein (EBSW) distance, and investigate its topological, statistical, and computational properties via importance sampling, sampling importance resampling, and Markov Chain methods. Finally, we conduct experiments on point-cloud gradient flow, color transfer, and point-cloud reconstruction to show the favorable performance of the EBSW.
△ Less
Submitted 29 December, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction
Authors:
Khai Nguyen,
Dang Nguyen,
Nhat Ho
Abstract:
Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times.…
▽ More
Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when performing amortization. Furthermore, current amortized models are not permutation invariant and symmetric. To address the issue, we design amortized models based on self-attention architecture. In particular, we adopt efficient self-attention architectures to make the computation linear in the number of supports. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.
△ Less
Submitted 8 May, 2023; v1 submitted 11 January, 2023;
originally announced January 2023.
-
Markovian Sliced Wasserstein Distances: Beyond Independent Projections
Authors:
Khai Nguyen,
Tongzheng Ren,
Nhat Ho
Abstract:
Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due t…
▽ More
Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity, and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.
△ Less
Submitted 31 December, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data
Authors:
Hien Dang,
Tho Tran,
Stanley Osher,
Hung Tran-The,
Nhat Ho,
Tan Nguyen
Abstract:
Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-laye…
▽ More
Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-layer features collapse to their class-means, and those class-means are the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is known as Neural Collapse (NC). Recent papers have theoretically shown that NC emerges in the global minimizers of training problems with the simplified "unconstrained feature model". In this context, we take a step further and prove the NC occurrences in deep linear networks for the popular mean squared error (MSE) and cross entropy (CE) losses, showing that global solutions exhibit NC properties across the linear layers. Furthermore, we extend our study to imbalanced data for MSE loss and present the first geometric analysis of NC under bias-free setting. Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors, whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures with both balanced and imbalanced scenarios.
△ Less
Submitted 18 June, 2023; v1 submitted 1 January, 2023;
originally announced January 2023.
-
Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature
Authors:
Khang Nguyen,
Hieu Nong,
Vinh Nguyen,
Nhat Ho,
Stanley Osher,
Tan Nguyen
Abstract:
Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues…
▽ More
Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.
△ Less
Submitted 31 May, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
Fast Approximation of the Generalized Sliced-Wasserstein Distance
Authors:
Dung Le,
Huy Nguyen,
Khai Nguyen,
Trang Nguyen,
Nhat Ho
Abstract:
Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of the probability distributions. Similar to sliced Wasserstein distance, generalized sliced Wasserstein is defined as an expectation over random projections which can be approximated by the M…
▽ More
Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of the probability distributions. Similar to sliced Wasserstein distance, generalized sliced Wasserstein is defined as an expectation over random projections which can be approximated by the Monte Carlo method. However, the complexity of that approximation can be expensive in high-dimensional settings. To that end, we propose to form deterministic and fast approximations of the generalized sliced Wasserstein distance by using the concentration of random projections when the defining functions are polynomial function, circular function, and neural network type function. Our approximations hinge upon an important result that one-dimensional projections of a high-dimensional random vector are approximately Gaussian.
△ Less
Submitted 18 October, 2022;
originally announced October 2022.
-
Improving Generative Flow Networks with Path Regularization
Authors:
Anh Do,
Duy Dinh,
Tan Nguyen,
Khuong Nguyen,
Stanley Osher,
Nhat Ho
Abstract:
Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory…
▽ More
Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNets. The prior is designed to help the GFlowNets better discover the latent structure of the target distribution or enhance its ability to explore the environment in the context of active learning. The path regularization controls the flow in GFlowNets to generate more diverse and novel candidates via maximizing the optimal transport distances between two forward policies or to improve the generalization via minimizing the optimal transport distances. In addition, we derive an efficient implementation of the regularization by finding its closed form solutions in specific cases and a meaningful upper bound that can be used as an approximation to minimize the regularization term. We empirically demonstrate the advantage of our path regularization on a wide range of tasks, including synthetic hypergrid environment modeling, discrete probabilistic modeling, and biological sequence design.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Hierarchical Sliced Wasserstein Distance
Authors:
Khai Nguyen,
Tongzheng Ren,
Huy Nguyen,
Litu Rout,
Tan Nguyen,
Nhat Ho
Abstract:
Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite i…
▽ More
Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.
△ Less
Submitted 6 February, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
Stochastic Multiple Target Sampling Gradient Descent
Authors:
Hoang Phan,
Ngoc Tran,
Trung Le,
Toan Tran,
Nhat Ho,
Dinh Phung
Abstract:
Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimiz…
▽ More
Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: "Can we derive a probabilistic version of the multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.
△ Less
Submitted 10 February, 2023; v1 submitted 4 June, 2022;
originally announced June 2022.
-
Transformer with Fourier Integral Attentions
Authors:
Tan Nguyen,
Minh Pham,
Tam Nguyen,
Khai Nguyen,
Stanley J. Osher,
Nhat Ho
Abstract:
Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. Ther…
▽ More
Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
△ Less
Submitted 31 May, 2022;
originally announced June 2022.
-
Beyond EM Algorithm on Over-specified Two-Component Location-Scale Gaussian Mixtures
Authors:
Tongzheng Ren,
Fuheng Cui,
Sujay Sanghavi,
Nhat Ho
Abstract:
The Expectation-Maximization (EM) algorithm has been predominantly used to approximate the maximum likelihood estimation of the location-scale Gaussian mixtures. However, when the models are over-specified, namely, the chosen number of components to fit the data is larger than the unknown true number of components, EM needs a polynomial number of iterations in terms of the sample size to reach the…
▽ More
The Expectation-Maximization (EM) algorithm has been predominantly used to approximate the maximum likelihood estimation of the location-scale Gaussian mixtures. However, when the models are over-specified, namely, the chosen number of components to fit the data is larger than the unknown true number of components, EM needs a polynomial number of iterations in terms of the sample size to reach the final statistical radius; this is computationally expensive in practice. The slow convergence of EM is due to the missing of the locally strong convexity with respect to the location parameter on the negative population log-likelihood function, i.e., the limit of the negative sample log-likelihood function when the sample size goes to infinity. To efficiently explore the curvature of the negative log-likelihood functions, by specifically considering two-component location-scale Gaussian mixtures, we develop the Exponential Location Update (ELU) algorithm. The idea of the ELU algorithm is that we first obtain the exact optimal solution for the scale parameter and then perform an exponential step-size gradient descent for the location parameter. We demonstrate theoretically and empirically that the ELU iterates converge to the final statistical radius of the models after a logarithmic number of iterations. To the best of our knowledge, it resolves the long-standing open question in the literature about developing an optimization algorithm that has optimal statistical and computational complexities for solving parameter estimation even under some specific settings of the over-specified Gaussian mixture models.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models
Authors:
Nhat Ho,
Tongzheng Ren,
Sujay Sanghavi,
Purnamrita Sarkar,
Rachel Ward
Abstract:
Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under hom…
▽ More
Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
△ Less
Submitted 1 February, 2023; v1 submitted 16 May, 2022;
originally announced May 2022.
-
Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution
Authors:
Khai Nguyen,
Nhat Ho
Abstract:
The conventional sliced Wasserstein is defined between two probability measures that have realizations as vectors. When comparing two probability measures over images, practitioners first need to vectorize images and then project them to one-dimensional space by using matrix multiplication between the sample matrix and the projection matrix. After that, the sliced Wasserstein is evaluated by avera…
▽ More
The conventional sliced Wasserstein is defined between two probability measures that have realizations as vectors. When comparing two probability measures over images, practitioners first need to vectorize images and then project them to one-dimensional space by using matrix multiplication between the sample matrix and the projection matrix. After that, the sliced Wasserstein is evaluated by averaging the two corresponding one-dimensional projected probability measures. However, this approach has two limitations. The first limitation is that the spatial structure of images is not captured efficiently by the vectorization step; therefore, the later slicing process becomes harder to gather the discrepancy information. The second limitation is memory inefficiency since each slicing direction is a vector that has the same dimension as the images. To address these limitations, we propose novel slicing methods for sliced Wasserstein between probability measures over images that are based on the convolution operators. We derive convolution sliced Wasserstein (CSW) and its variants via incorporating stride, dilation, and non-linear activation function into the convolution operators. We investigate the metricity of CSW as well as its sample complexity, its computational complexity, and its connection to conventional sliced Wasserstein distances. Finally, we demonstrate the favorable performance of CSW over the conventional sliced Wasserstein in comparing probability measures over images and in training deep generative modeling on images.
△ Less
Submitted 23 September, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
Amortized Projection Optimization for Sliced Wasserstein Generative Models
Authors:
Khai Nguyen,
Nhat Ho
Abstract:
Seeking informative projecting directions has been an important task in utilizing sliced Wasserstein distance in applications. However, finding these directions usually requires an iterative optimization procedure over the space of projecting directions, which is computationally expensive. Moreover, the computational issue is even more severe in deep learning applications, where computing the dist…
▽ More
Seeking informative projecting directions has been an important task in utilizing sliced Wasserstein distance in applications. However, finding these directions usually requires an iterative optimization procedure over the space of projecting directions, which is computationally expensive. Moreover, the computational issue is even more severe in deep learning applications, where computing the distance between two mini-batch probability measures is repeated several times. This nested loop has been one of the main challenges that prevent the usage of sliced Wasserstein distances based on good projections in practice. To address this challenge, we propose to utilize the learning-to-optimize technique or amortized optimization to predict the informative direction of any given two mini-batch probability measures. To the best of our knowledge, this is the first work that bridges amortized optimization and sliced Wasserstein generative models. In particular, we derive linear amortized models, generalized linear amortized models, and non-linear amortized models which are corresponding to three types of novel mini-batch losses, named amortized sliced Wasserstein. We demonstrate the favorable performance of the proposed sliced losses in deep generative modeling on standard benchmark datasets.
△ Less
Submitted 23 September, 2022; v1 submitted 24 March, 2022;
originally announced March 2022.
-
Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models
Authors:
Tudor Manole,
Nhat Ho
Abstract:
We revisit the classical problem of deriving convergence rates for the maximum likelihood estimator (MLE) in finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing wei…
▽ More
We revisit the classical problem of deriving convergence rates for the maximum likelihood estimator (MLE) in finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing weights. However, the Wasserstein distance is only able to capture the worst-case convergence rate among the remaining fitted mixture components. We demonstrate that when the log-likelihood function is penalized to discourage vanishing mixing weights, stronger loss functions can be derived to resolve this shortcoming of the Wasserstein distance. These new loss functions accurately capture the heterogeneity in convergence rates of fitted mixture components, and we use them to sharpen existing pointwise and uniform convergence rates in various classes of mixture models. In particular, these results imply that a subset of the components of the penalized MLE typically converge significantly faster than could have been anticipated from past work. We further show that some of these conclusions extend to the traditional MLE. Our theoretical findings are supported by a simulation study to illustrate these improved convergence rates.
△ Less
Submitted 20 June, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Improving Computational Complexity in Statistical Models with Second-Order Information
Authors:
Tongzheng Ren,
Jiacheng Zhuo,
Sujay Sanghavi,
Nhat Ho
Abstract:
It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational…
▽ More
It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational complexity, we consider the utilization of the second-order information in the design of optimization algorithms. Specifically, we study the normalized gradient descent (NormGD) algorithm for solving parameter estimation in parametric statistical models, which is a variant of gradient descent algorithm whose step size is scaled by the maximum eigenvalue of the Hessian matrix of the empirical loss function of statistical models. When the population loss function, i.e., the limit of the empirical loss function when $n$ goes to infinity, is homogeneous in all directions, we demonstrate that the NormGD iterates reach a final statistical radius around the true parameter after a logarithmic number of iterations in terms of $n$. Therefore, for fixed dimension $d$, the NormGD algorithm achieves the optimal overall computational complexity $\mathcal{O}(n)$ to reach the final statistical radius. This computational complexity is cheaper than that of the fixed step-size gradient descent algorithm, which is of the order $\mathcal{O}(n^τ)$ for some $τ> 1$, to reach the same statistical radius. We illustrate our general theory under two statistical models: generalized linear models and mixture models, and experimental results support our prediction with general theory.
△ Less
Submitted 13 April, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
-
Beyond Black Box Densities: Parameter Learning for the Deviated Components
Authors:
Dat Do,
Nhat Ho,
XuanLong Nguyen
Abstract:
As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model}…
▽ More
As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model} $(1-λ^{*})h_0 + λ^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|θ_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $λ^{*}$ and latent mixing measure $G_{*} = \sum_{i = 1}^{k} p_{i}^{*} δ_{θ_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $λ^{*}$ and $G^{*}$ under Wasserstein metric. Simulation studies are carried out to illustrate the theory.
△ Less
Submitted 26 October, 2022; v1 submitted 5 February, 2022;
originally announced February 2022.
-
Bayesian Consistency with the Supremum Metric
Authors:
Nhat Ho,
Stephen G. Walker
Abstract:
We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities…
▽ More
We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities which could track the data too closely. A key result of the paper is that we demonstrate supremum consistency using weaker conditions compared to those currently used to secure $\mathbb{L}_1$ consistency.
△ Less
Submitted 10 January, 2022;
originally announced January 2022.
-
On Label Shift in Domain Adaptation via Wasserstein Distance
Authors:
Trung Le,
Dat Do,
Tuan Nguyen,
Huy Nguyen,
Hung Bui,
Nhat Ho,
Dinh Phung
Abstract:
We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA und…
▽ More
We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA under various DA settings (e.g., closed-set, partial-set, open-set, and universal settings). Inspired from the developed theory, we propose Label and Data Shift Reduction via Optimal Transport (LDROT) which can mitigate the data and label shifts simultaneously. Finally, we conduct comprehensive experiments to verify our theoretical findings and compare LDROT with state-of-the-art baselines.
△ Less
Submitted 1 March, 2022; v1 submitted 28 October, 2021;
originally announced October 2021.
-
Improving Transformers with Probabilistic Attention Keys
Authors:
Tam Nguyen,
Tan M. Nguyen,
Dung D. Le,
Duy Khuong Nguyen,
Viet-Anh Tran,
Richard G. Baraniuk,
Nhat Ho,
Stanley J. Osher
Abstract:
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observati…
▽ More
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.
△ Less
Submitted 12 June, 2022; v1 submitted 16 October, 2021;
originally announced October 2021.
-
Towards Statistical and Computational Complexities of Polyak Step Size Gradient Descent
Authors:
Tongzheng Ren,
Fuheng Cui,
Alexia Atsidakou,
Sujay Sanghavi,
Nhat Ho
Abstract:
We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growt…
▽ More
We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growth on the concentration bound between the gradients of sample and population loss functions. We demonstrate that the Polyak step size gradient descent iterates reach a final statistical radius of convergence around the true parameter after logarithmic number of iterations in terms of the sample size. It is computationally cheaper than the polynomial number of iterations on the sample size of the fixed-step size gradient descent algorithm to reach the same final statistical radius when the population loss function is not locally strongly convex. Finally, we illustrate our general theory under three statistical examples: generalized linear model, mixture model, and mixed linear regression model.
△ Less
Submitted 14 October, 2021;
originally announced October 2021.