-
A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random
Authors:
Binh H. Ho,
Long Nguyen Chi,
TrungTin Nguyen,
Binh T. Nguyen,
Van Ha Hoang,
Christopher Drovandi
Abstract:
Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have…
▽ More
Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures
Authors:
Tuan Thai,
TrungTin Nguyen,
Dat Do,
Nhat Ho,
Christopher Drovandi
Abstract:
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the…
▽ More
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.
△ Less
Submitted 23 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
The Statistical Accuracy of Neural Posterior and Likelihood Estimation
Authors:
David T. Frazier,
Ryan Kelly,
Christopher Drovandi,
David J. Warne
Abstract:
Neural posterior estimation (NPE) and neural likelihood estimation (NLE) are machine learning approaches that provide accurate posterior, and likelihood, approximations in complex modeling scenarios, and in situations where conducting amortized inference is a necessity. While such methods have shown significant promise across a range of diverse scientific applications, the statistical accuracy of…
▽ More
Neural posterior estimation (NPE) and neural likelihood estimation (NLE) are machine learning approaches that provide accurate posterior, and likelihood, approximations in complex modeling scenarios, and in situations where conducting amortized inference is a necessity. While such methods have shown significant promise across a range of diverse scientific applications, the statistical accuracy of these methods is so far unexplored. In this manuscript, we give, for the first time, an in-depth exploration on the statistical behavior of NPE and NLE. We prove that these methods have similar theoretical guarantees to common statistical methods like approximate Bayesian computation (ABC) and Bayesian synthetic likelihood (BSL). While NPE and NLE methods are just as accurate as ABC and BSL, we prove that this accuracy can often be achieved at a vastly reduced computational cost, and will therefore deliver more attractive approximations than ABC and BSL in certain problems. We verify our results theoretically and in several examples from the literature.
△ Less
Submitted 18 November, 2024;
originally announced November 2024.
-
Exact Sampling of Gibbs Measures with Estimated Losses
Authors:
David T. Frazier,
Jeremias Knoblauch,
Jack Jewson,
Christopher Drovandi
Abstract:
In recent years, the shortcomings of Bayesian posteriors as inferential devices have received increased attention. A popular strategy for fixing them has been to instead target a Gibbs measure based on losses that connect a parameter of interest to observed data. However, existing theory for such inference procedures assumes these losses are analytically available, while in many situations these l…
▽ More
In recent years, the shortcomings of Bayesian posteriors as inferential devices have received increased attention. A popular strategy for fixing them has been to instead target a Gibbs measure based on losses that connect a parameter of interest to observed data. However, existing theory for such inference procedures assumes these losses are analytically available, while in many situations these losses must be stochastically estimated using pseudo-observations. In such cases, we show that when standard Markov Chain Monte Carlo algorithms are used to produce posterior samples, the resulting posterior exhibits strong dependence on the number of pseudo-observations: unless the number of pseudo-observations diverge sufficiently fast the resulting posterior will concentrate very slowly. However, we show that in many situations it is feasible to alleviate this dependence entirely using a modified piecewise deterministic Markov process (PDMP) sampler, and we formally and empirically show that these samplers produce posterior draws that have no dependence on the number of pseudo-observations used to estimate the loss within the Gibbs Measure. We apply our results to three examples that feature intractable likelihoods and model misspecification.
△ Less
Submitted 22 April, 2025; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Reliable Bayesian Inference in Misspecified Models
Authors:
David T. Frazier,
Robert Kohn,
Christopher Drovandi,
David Gunawan
Abstract:
We provide a general solution to a fundamental open problem in Bayesian inference, namely poor uncertainty quantification, from a frequency standpoint, of Bayesian methods in misspecified models. While existing solutions are based on explicit Gaussian approximations of the posterior, or computationally onerous post-processing procedures, we demonstrate that correct uncertainty quantification can b…
▽ More
We provide a general solution to a fundamental open problem in Bayesian inference, namely poor uncertainty quantification, from a frequency standpoint, of Bayesian methods in misspecified models. While existing solutions are based on explicit Gaussian approximations of the posterior, or computationally onerous post-processing procedures, we demonstrate that correct uncertainty quantification can be achieved by replacing the usual posterior with an intuitive approximate posterior. Critically, our solution is applicable to likelihood-based, and generalized, posteriors as well as cases where the likelihood is intractable and must be estimated. We formally demonstrate the reliable uncertainty quantification of our proposed approach, and show that valid uncertainty quantification is not an asymptotic result but occurs even in small samples. We illustrate this approach through a range of examples, including linear, and generalized, mixed effects models.
△ Less
Submitted 12 February, 2023;
originally announced February 2023.
-
Pooling information in likelihood-free inference
Authors:
David T. Frazier,
Christopher Drovandi,
Lucas Kock,
David J. Nott
Abstract:
Likelihood-free inference (LFI) methods, such as approximate Bayesian computation, have become commonplace for conducting inference in complex models. Many approaches are based on summary statistics or discrepancies derived from synthetic data. However, determining which summary statistics or discrepancies to use for constructing the posterior remains a challenging question, both practically and t…
▽ More
Likelihood-free inference (LFI) methods, such as approximate Bayesian computation, have become commonplace for conducting inference in complex models. Many approaches are based on summary statistics or discrepancies derived from synthetic data. However, determining which summary statistics or discrepancies to use for constructing the posterior remains a challenging question, both practically and theoretically. Instead of relying on a single vector of summaries for inference, we propose a new pooled posterior that optimally combines inferences from multiple LFI posteriors. This pooled approach eliminates the need to select a single vector of summaries or even a specific LFI algorithm. Our approach is straightforward to implement and avoids performing a high-dimensional LFI analysis involving all summary statistics. We give theoretical guarantees for the improved performance of the pooled posterior mean in terms of asymptotic frequentist risk and demonstrate the effectiveness of the approach in a number of benchmark examples.
△ Less
Submitted 5 June, 2025; v1 submitted 5 December, 2022;
originally announced December 2022.
-
Synthetic Likelihood in Misspecified Models: Consequences and Corrections
Authors:
David T. Frazier,
Christopher Drovandi,
David J. Nott
Abstract:
We analyse the behaviour of the synthetic likelihood (SL) method when the model generating the simulated data differs from the actual data generating process. One of the most common methods to obtain SL-based inferences is via the Bayesian posterior distribution, with this method often referred to as Bayesian synthetic likelihood (BSL). We demonstrate that when the model is misspecified, the BSL p…
▽ More
We analyse the behaviour of the synthetic likelihood (SL) method when the model generating the simulated data differs from the actual data generating process. One of the most common methods to obtain SL-based inferences is via the Bayesian posterior distribution, with this method often referred to as Bayesian synthetic likelihood (BSL). We demonstrate that when the model is misspecified, the BSL posterior can be poorly behaved, placing significant posterior mass on values of the model parameters that do not represent the true features observed in the data. Theoretical results demonstrate that in misspecified models the BSL posterior can display a wide range of behaviours depending on the level of model misspecification, including being asymptotically non-Gaussian. Our results suggest that a recently proposed robust BSL approach can ameliorate this behavior and deliver reliable posterior inference under model misspecification. We document all theoretical results using a simple running example.
△ Less
Submitted 7 April, 2021;
originally announced April 2021.