Showing 1–50 of 90 results for author: Ho, N

  1. arXiv:2506.16582 [pdf, ps, other]

    stat.CO math.NA

    Quasi-Monte Carlo with one categorical variable

    Authors: Valerie N. P. Ho, Art B. Owen, Zexin Pan

    Abstract: We study randomized quasi-Monte Carlo (RQMC) estimation of a multivariate integral where one of the variables takes only a finite number of values. This problem arises when the variable of integration is drawn from a mixture distribution as is common in importance sampling and also arises in some recent work on transport maps. We find that when integration error decreases at an RQMC rate that it i…

    Submitted 19 June, 2025; originally announced June 2025.

  2. arXiv:2505.18455 [pdf, ps, other]

    stat.ML cs.LG

    On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

    Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

    Abstract: The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplore…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Fanqi Yan, Huy Nguyen, and Dung Le contributed equally to this work

  3. arXiv:2505.13052 [pdf, ps, other]

    stat.ML cs.LG math.ST stat.CO stat.ME

    Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures

    Authors: Tuan Thai, TrungTin Nguyen, Dat Do, Nhat Ho, Christopher Drovandi

    Abstract: Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the…

    Submitted 23 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Correct typos and update the numerical experiments. Tuan Thai and TrungTin Nguyen are co-first authors

  4. arXiv:2505.10860 [pdf, ps, other]

    cs.LG stat.ML

    On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

    Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

    Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSe…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: 100 pages

  5. arXiv:2503.03213 [pdf, other]

    stat.ML cs.LG

    Convergence Rates for Softmax Gating Mixture of Experts

    Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

    Abstract: Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dyna…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: Section 2 of this work comes from our previous paper titled "On Least Square Estimation in Softmax Gating Mixture of Experts", published at ICML 2024
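
    The softmax gating idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's estimator or model specification: the linear gate, the scalar-valued experts, and the names `softmax` and `moe_predict` are assumptions chosen only to show how gate weights mix expert outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def moe_predict(x, gate_weights, gate_biases, experts):
    """Softmax-gated mixture of experts (illustrative, hypothetical names).

    x            : list of floats (the input)
    gate_weights : one weight vector per expert (assumed linear gate)
    gate_biases  : one bias per expert
    experts      : one callable per expert, mapping x to a float
    Returns the gated prediction and the gate weights.
    """
    # Gate score of expert i: w_i . x + b_i
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(gate_weights, gate_biases)
    ]
    gates = softmax(scores)
    # Prediction is the gate-weighted sum of expert outputs.
    y = sum(g * f(x) for g, f in zip(gates, experts))
    return y, gates
```

    With identical gate parameters every expert receives the same weight, so the prediction reduces to the plain average of the expert outputs; a large gap between gate scores makes a single expert dominate.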

  6. arXiv:2501.18901 [pdf, other]

    cs.LG cs.AI stat.CO stat.ME stat.ML

    Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

    Authors: Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho

    Abstract: We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, w…

    Submitted 15 May, 2025; v1 submitted 31 January, 2025; originally announced January 2025.

    Comments: Accepted to ICML 2025, 16 pages, 13 figures

  7. arXiv:2410.12258 [pdf, other]

    cs.LG cs.AI stat.ML

    Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts

    Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho

    Abstract: We conduct the convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of t…

    Submitted 5 March, 2025; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: Fanqi Yan, Huy Nguyen, and Dung Le contributed equally to this work. Accepted to AISTATS 2025

  8. arXiv:2410.11222 [pdf, ps, other]

    stat.ML cs.LG

    Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention

    Authors: Pedram Akbarian, Huy Nguyen, Xing Han, Nhat Ho

    Abstract: Mixture of Experts (MoE) models are well known for effectively scaling model capacity while preserving computational overheads. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive converge…

    Submitted 8 July, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: Pedram Akbarian, Huy Nguyen, and Xing Han made equal contributions to this work

  9. arXiv:2410.04196 [pdf, ps, other]

    cs.LG stat.ML

    Improving Generalization with Flat Hilbert Bayesian Inference

    Authors: Tuan Truong, Quyen Tran, Quan Pham-Ngoc, Nhat Ho, Dinh Phung, Trung Le

    Abstract: We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within a reproducing kernel Hilbert space. This methodology is supported by a theoretical analysis that extends previous findings on generaliza…

    Submitted 8 June, 2025; v1 submitted 5 October, 2024; originally announced October 2024.

    Comments: Accepted (ICML 2025)

  10. arXiv:2410.02935 [pdf, other]

    stat.ML cs.LG

    On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

    Authors: Huy Nguyen, Xing Han, Carl Harris, Suchi Saria, Nhat Ho

    Abstract: With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gatin…

    Submitted 6 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: Huy Nguyen and Xing Han contributed equally to this work

  11. arXiv:2406.13781 [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ML

    A Primal-Dual Framework for Transformers and Neural Networks

    Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

    Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresp…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables

  12. arXiv:2405.14131 [pdf, other]

    stat.ML cs.LG

    Statistical Advantages of Perturbing Cosine Router in Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, Nhat Ho

    Abstract: The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical su…

    Submitted 5 March, 2025; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to ICLR 2025

  13. arXiv:2405.13997 [pdf, other]

    stat.ML cs.LG

    Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

    Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

    Abstract: The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and…

    Submitted 2 November, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024, 26 pages

  14. arXiv:2405.13160 [pdf, other]

    stat.ML cs.LG

    Data-Driven DRO and Economic Decision Theory: An Analytical Synthesis With Bayesian Nonparametric Advancements

    Authors: Nicola Bariletto, Khai Nguyen, Nhat Ho

    Abstract: We develop an analytical synthesis that bridges data-driven Distributionally Robust Optimization (DRO) and Economic Decision Theory under Ambiguity (DTA). By reinterpreting standard regularization and DRO techniques as data-driven counterparts of ambiguity-averse decision models, we provide a unified framework that clarifies their intrinsic connections. Building on this synthesis, we propose a nov…

    Submitted 25 February, 2025; v1 submitted 21 May, 2024; originally announced May 2024.

    Comments: This work significantly extends the analytical framework connecting DRO with economic decision theory, while also broadening the methodological scope of Bariletto and Ho's NeurIPS 2024 article, "Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization."

  15. arXiv:2405.07482 [pdf, other]

    stat.ML cs.GR cs.LG

    Towards Marginal Fairness Sliced Wasserstein Barycenter

    Authors: Khai Nguyen, Hai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness ba…

    Submitted 3 February, 2025; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Accepted to ICLR 2025, 29 pages, 15 figures, 6 tables

  16. arXiv:2404.15378 [pdf, other]

    cs.CV cs.AI cs.GR cs.LG stat.ML

    Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint Distributions

    Authors: Khai Nguyen, Nhat Ho

    Abstract: Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multip…

    Submitted 7 October, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted to NeurIPS 2024, 27 pages, 11 figures, 4 tables

  17. arXiv:2402.05220 [pdf, other]

    stat.ML cs.LG

    On Parameter Estimation in Deviated Gaussian Mixture of Experts

    Authors: Huy Nguyen, Khai Nguyen, Nhat Ho

    Abstract: We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y|X) + \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is true but unkno…

    Submitted 24 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: Accepted to AISTATS 2024, 32 pages, 2 figures, 1 table

  18. arXiv:2402.02952 [pdf, other]

    stat.ML cs.LG

    On Least Square Estimation in Softmax Gating Mixture of Experts

    Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

    Abstract: The mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous t…

    Submitted 24 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024, 29 pages, 2 figures, 2 tables

  19. arXiv:2401.15889 [pdf, other]

    stat.ML cs.AI cs.CV cs.LG

    Sliced Wasserstein with Random-Path Projecting Directions

    Authors: Khai Nguyen, Shujian Zhang, Tam Le, Nhat Ho

    Abstract: Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing d…

    Submitted 8 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted to ICML 2024, 21 pages, 5 figures, 2 tables

  20. arXiv:2401.15771 [pdf, other]

    stat.ML cs.LG

    Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization

    Authors: Nicola Bariletto, Nhat Ho

    Abstract: Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights fr…

    Submitted 7 November, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Journal ref: Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

  21. arXiv:2401.13875 [pdf, other]

    stat.ML cs.LG

    Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?

    Authors: Huy Nguyen, Pedram Akbarian, Nhat Ho

    Abstract: Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabi…

    Submitted 24 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ICML 2024, 47 pages, 2 figures, 2 tables

  22. arXiv:2401.02058 [pdf, other]

    cs.LG stat.ML

    Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model

    Authors: Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

    Abstract: The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk that pushes the training loss value towards zero, even after the training error has vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiang…

    Submitted 6 June, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: 2024 International Conference on Machine Learning

  23. arXiv:2310.14188 [pdf, other]

    stat.ML cs.LG

    A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, Nhat Ho

    Abstract: The mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the…

    Submitted 24 June, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

    Comments: Accepted to ICML 2024, 32 pages, 3 figures, 3 tables

  24. arXiv:2309.13850 [pdf, other]

    stat.ML cs.LG

    Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, Fanqi Yan, Nhat Ho

    Abstract: Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitio…

    Submitted 23 February, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted to ICLR 2024, 38 pages, 3 figures, 1 table

  25. arXiv:2309.11713 [pdf, other]

    stat.ML cs.GR cs.LG

    Quasi-Monte Carlo for 3D Sliced Wasserstein

    Authors: Khai Nguyen, Nicola Bariletto, Nhat Ho

    Abstract: Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Car…

    Submitted 16 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted to ICLR 2024 (Spotlight), 25 pages, 13 figures, 6 tables

  26. arXiv:2306.05023 [pdf, other]

    stat.ML cs.LG

    Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders

    Authors: Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

    Abstract: The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations…

    Submitted 13 May, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted (Poster) at the Twelfth International Conference on Learning Representations

  27. arXiv:2305.07572 [pdf, other]

    stat.ML cs.LG

    Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts

    Authors: Huy Nguyen, TrungTin Nguyen, Khai Nguyen, Nhat Ho

    Abstract: Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from compl…

    Submitted 9 February, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: 32 pages, 9 figures; Huy Nguyen and TrungTin Nguyen contributed equally to this work

  28. arXiv:2305.03288 [pdf, other]

    stat.ML cs.LG math.ST

    Demystifying Softmax Gating Function in Gaussian Mixture of Experts

    Authors: Huy Nguyen, TrungTin Nguyen, Nhat Ho

    Abstract: Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the soft…

    Submitted 29 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 29 pages, 3 figures

  29. arXiv:2305.00402 [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Sliced Wasserstein Estimation with Control Variates

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate…

    Submitted 18 February, 2024; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: Accepted to ICLR 2024, 20 pages, 7 figures, 4 tables
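
    The definition in the abstract above (an expectation, over random projecting directions, of a one-dimensional Wasserstein distance) can be sketched as a plain Monte Carlo estimate. This is only the baseline estimator, not the control-variate method of the paper, and it assumes two equal-size point clouds with uniform weights; the function name `sliced_wasserstein` and its parameters are hypothetical.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=100, p=2, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-p distance.

    xs, ys : lists of points (lists of floats) of equal length and dimension.
    This sketch assumes equal sample sizes and uniform weights.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Uniform direction on the unit sphere: normalize a Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        theta = [v / norm for v in g]
        # Project both clouds onto the direction and sort the projections.
        px = sorted(sum(t * v for t, v in zip(theta, x)) for x in xs)
        py = sorted(sum(t * v for t, v in zip(theta, y)) for y in ys)
        # In one dimension the optimal coupling matches sorted samples.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    # Average the p-th powers over directions, then take the p-th root.
    return (total / num_projections) ** (1.0 / p)
```

    Increasing `num_projections` reduces the Monte Carlo error of the estimate at a proportional cost in projections and sorts.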

  30. arXiv:2304.13586 [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Energy-Based Sliced Wasserstein Distance

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best dis…

    Submitted 29 December, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

    Comments: Accepted to NeurIPS 2023, 30 pages, 8 figures, 6 tables

  31. arXiv:2301.04791 [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction

    Authors: Khai Nguyen, Dang Nguyen, Nhat Ho

    Abstract: Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the "max" projecting directions given two input measures instead of using projected gradient ascent multiple times.…

    Submitted 8 May, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: Accepted to ICML 2023, 23 pages, 6 figures, 9 tables

  32. arXiv:2301.03749 [pdf, other]

    stat.ML cs.LG

    Markovian Sliced Wasserstein Distances: Beyond Independent Projections

    Authors: Khai Nguyen, Tongzheng Ren, Nhat Ho

    Abstract: Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$) seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due t…

    Submitted 31 December, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: Accepted to NeurIPS 2023, 29 pages, 8 figures, 5 tables

  33. arXiv:2301.00437 [pdf, other]

    cs.LG stat.ML

    Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

    Authors: Hien Dang, Tho Tran, Stanley Osher, Hung Tran-The, Nhat Ho, Tan Nguyen

    Abstract: Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-laye…

    Submitted 18 June, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

    Comments: 75 pages, 20 figures, 4 tables. Hien Dang and Tho Tran contributed equally to this work

  34. arXiv:2211.15779 [pdf, other]

    cs.LG stat.ML

    Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature

    Authors: Khang Nguyen, Hieu Nong, Vinh Nguyen, Nhat Ho, Stanley Osher, Tan Nguyen

    Abstract: Graph Neural Networks (GNNs) have been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues…

    Submitted 31 May, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: Accepted at ICML 2023; 24 pages, 4 figures

  35. arXiv:2210.10268 [pdf, other]

    stat.ML cs.LG

    Fast Approximation of the Generalized Sliced-Wasserstein Distance

    Authors: Dung Le, Huy Nguyen, Khai Nguyen, Trang Nguyen, Nhat Ho

    Abstract: Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of the probability distributions. Similar to sliced Wasserstein distance, generalized sliced Wasserstein is defined as an expectation over random projections which can be approximated by the M…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: 22 pages, 2 figures. Dung Le, Huy Nguyen and Khai Nguyen contributed equally to this work

  36. arXiv:2209.15092 [pdf, other]

    cs.LG stat.ML

    Improving Generative Flow Networks with Path Regularization

    Authors: Anh Do, Duy Dinh, Tan Nguyen, Khuong Nguyen, Stanley Osher, Nhat Ho

    Abstract: Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: 28 pages, 2 figures, 5 tables. Anh Do, Duy Dinh, and Tan Nguyen contributed equally to this work

  37. arXiv:2209.13570 [pdf, other]

    stat.ML cs.LG

    Hierarchical Sliced Wasserstein Distance

    Authors: Khai Nguyen, Tongzheng Ren, Huy Nguyen, Litu Rout, Tan Nguyen, Nhat Ho

    Abstract: Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite i…

    Submitted 6 February, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023, 29 pages, 8 figures, 3 tables

  38. arXiv:2206.01934 [pdf, other]

    cs.LG cs.AI stat.ML

    Stochastic Multiple Target Sampling Gradient Descent

    Authors: Hoang Phan, Ngoc Tran, Trung Le, Toan Tran, Nhat Ho, Dinh Phung

    Abstract: Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimiz…

    Submitted 10 February, 2023; v1 submitted 4 June, 2022; originally announced June 2022.

    Comments: Accepted to Advances in Neural Information Processing Systems (NeurIPS) 2022. 27 pages, 10 figures, 5 tables

  39. arXiv:2206.00206 [pdf, ps, other]

    cs.LG stat.ML

    Transformer with Fourier Integral Attentions

    Authors: Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley J. Osher, Nhat Ho

    Abstract: Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. Ther…

    Submitted 31 May, 2022; originally announced June 2022.

    Comments: 35 pages, 5 tables. Tan Nguyen and Minh Pham contributed equally to this work

  40. arXiv:2205.11078 [pdf, other]

    stat.ML cs.LG math.ST

    Beyond EM Algorithm on Over-specified Two-Component Location-Scale Gaussian Mixtures

    Authors: Tongzheng Ren, Fuheng Cui, Sujay Sanghavi, Nhat Ho

    Abstract: The Expectation-Maximization (EM) algorithm has been predominantly used to approximate the maximum likelihood estimation of the location-scale Gaussian mixtures. However, when the models are over-specified, namely, the chosen number of components to fit the data is larger than the unknown true number of components, EM needs a polynomial number of iterations in terms of the sample size to reach the…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: 38 pages, 4 figures. Tongzheng Ren and Fuheng Cui contributed equally to this work

  41. arXiv:2205.07999  [pdf, other]

    stat.ML cs.LG math.OC math.ST

    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models

    Authors: Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel Ward

    Abstract: Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under hom…

    Submitted 1 February, 2023; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 37 pages. The authors are listed in alphabetical order

  42. arXiv:2204.01188  [pdf, other]

    cs.CV cs.LG stat.ML

    Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The conventional sliced Wasserstein is defined between two probability measures that have realizations as vectors. When comparing two probability measures over images, practitioners first need to vectorize images and then project them to one-dimensional space by using matrix multiplication between the sample matrix and the projection matrix. After that, the sliced Wasserstein is evaluated by avera…

    Submitted 23 September, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted to NeurIPS 2022, 29 pages, 9 figures, 11 tables

  43. arXiv:2203.13417  [pdf, other]

    stat.ML cs.LG

    Amortized Projection Optimization for Sliced Wasserstein Generative Models

    Authors: Khai Nguyen, Nhat Ho

    Abstract: Seeking informative projecting directions has been an important task in utilizing sliced Wasserstein distance in applications. However, finding these directions usually requires an iterative optimization procedure over the space of projecting directions, which is computationally expensive. Moreover, the computational issue is even more severe in deep learning applications, where computing the dist…

    Submitted 23 September, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to NeurIPS 2022, 22 pages, 6 figures, 8 tables

  44. arXiv:2202.08786  [pdf, other]

    math.ST stat.ML

    Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models

    Authors: Tudor Manole, Nhat Ho

    Abstract: We revisit the classical problem of deriving convergence rates for the maximum likelihood estimator (MLE) in finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing wei…

    Submitted 20 June, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: To appear in the Proceedings of the 39th International Conference on Machine Learning (ICML), 2022

  45. arXiv:2202.04219  [pdf, other]

    stat.ML cs.LG math.ST

    Improving Computational Complexity in Statistical Models with Second-Order Information

    Authors: Tongzheng Ren, Jiacheng Zhuo, Sujay Sanghavi, Nhat Ho

    Abstract: It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational…

    Submitted 13 April, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: 27 pages, 2 figures. Fixing a bug in the proof of Lemma 7

  46. arXiv:2202.02651  [pdf, other]

    stat.ML cs.LG math.ST

    Beyond Black Box Densities: Parameter Learning for the Deviated Components

    Authors: Dat Do, Nhat Ho, XuanLong Nguyen

    Abstract: As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model}…

    Submitted 26 October, 2022; v1 submitted 5 February, 2022; originally announced February 2022.

    Comments: Accepted at NeurIPS 2022. Dat Do and Nhat Ho contributed equally to this work

  47. arXiv:2201.03447  [pdf, ps, other]

    math.ST stat.ML

    Bayesian Consistency with the Supremum Metric

    Authors: Nhat Ho, Stephen G. Walker

    Abstract: We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: 11 pages

  48. arXiv:2110.15520  [pdf, other]

    cs.LG stat.ME stat.ML

    On Label Shift in Domain Adaptation via Wasserstein Distance

    Authors: Trung Le, Dat Do, Tuan Nguyen, Huy Nguyen, Hung Bui, Nhat Ho, Dinh Phung

    Abstract: We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA und…

    Submitted 1 March, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

    Comments: 35 pages, 7 figures, 6 tables

  49. arXiv:2110.08678  [pdf, other]

    cs.LG cs.CL stat.ML

    Improving Transformers with Probabilistic Attention Keys

    Authors: Tam Nguyen, Tan M. Nguyen, Dung D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher

    Abstract: Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observati…

    Submitted 12 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: 27 pages, 16 figures, 10 tables

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

  50. arXiv:2110.07810  [pdf, other]

    cs.LG math.ST stat.ML

    Towards Statistical and Computational Complexities of Polyak Step Size Gradient Descent

    Authors: Tongzheng Ren, Fuheng Cui, Alexia Atsidakou, Sujay Sanghavi, Nhat Ho

    Abstract: We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growt…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: First three authors contributed equally. 40 pages, 4 figures