-
Parametric MMD Estimation with Missing Values: Robustness to Missingness and Data Model Misspecification
Authors:
Badr-Eddine Chérief-Abdellatif,
Jeffrey Näf
Abstract:
In the missing data literature, the Maximum Likelihood Estimator (MLE) is celebrated for its ignorability property under missing at random (MAR) data. However, its sensitivity to misspecification of the (complete) data model, even under MAR, remains a significant limitation. This issue is further exacerbated by the fact that the MAR assumption may not always be realistic, introducing an additional source of potential misspecification through the missingness mechanism. To address this, we propose a novel M-estimation procedure based on the Maximum Mean Discrepancy (MMD), which is provably robust to both model misspecification and deviations from the assumed missingness mechanism. Our approach offers strong theoretical guarantees and improved reliability in complex settings. We establish the consistency and asymptotic normality of the estimator under missingness completely at random (MCAR), provide an efficient stochastic gradient descent algorithm, and derive error bounds that explicitly separate the contributions of model misspecification and missingness bias. Furthermore, we analyze missing not at random (MNAR) scenarios where our estimator maintains controlled error, including a Huber setting where both the missingness mechanism and the data model are contaminated. Our contributions refine the understanding of the limitations of the MLE and provide a robust and principled alternative for handling missing data.
Submitted 1 March, 2025;
originally announced March 2025.
-
Recursive PAC-Bayes: A Frequentist Approach to Sequential Prior Updates with No Information Loss
Authors:
Yi-Shan Wu,
Yijie Zhang,
Badr-Eddine Chérief-Abdellatif,
Yevgeny Seldin
Abstract:
PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch.
We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluation, the new procedure significantly outperforms the state of the art.
Submitted 8 April, 2025; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching
Authors:
Bastien Dussap,
Gilles Blanchard,
Badr-Eddine Chérief-Abdellatif
Abstract:
Quantification learning deals with the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon previous bounds derived in particular cases. We then extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using the Random Fourier Feature principle.
Submitted 2 July, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
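The feature-matching idea in the abstract above can be sketched roughly as follows: map samples to random Fourier features, average them per source class and over the target, and solve for the class weights that best reproduce the target mean embedding. This is a minimal sketch, not the authors' DFM implementation. The function names, the Gaussian-kernel bandwidth, the number of features, and the use of unconstrained least squares followed by clipping and renormalization (instead of a simplex-constrained solve) are all assumptions made for illustration.

```python
import numpy as np

def random_fourier_features(x, frequencies, phases):
    """Map samples x of shape (n, d) to D random Fourier features.

    With frequencies ~ N(0, I / bandwidth^2) and phases ~ U[0, 2*pi],
    inner products of these features approximate a Gaussian kernel.
    """
    n_features = frequencies.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(x @ frequencies + phases)

def estimate_proportions(source_x, source_y, target_x,
                         n_features=500, bandwidth=1.0, seed=0):
    """Estimate target class proportions by matching mean feature embeddings."""
    rng = np.random.default_rng(seed)
    dim = source_x.shape[1]
    frequencies = rng.standard_normal((dim, n_features)) / bandwidth
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    classes = np.unique(source_y)
    # One column per class: the mean embedding of that class in the source sample.
    class_means = np.stack(
        [random_fourier_features(source_x[source_y == c], frequencies, phases).mean(axis=0)
         for c in classes],
        axis=1,
    )
    target_mean = random_fourier_features(target_x, frequencies, phases).mean(axis=0)

    # Simplification: unconstrained least squares, then projection by
    # clipping negatives and renormalizing so the weights sum to one.
    weights, *_ = np.linalg.lstsq(class_means, target_mean, rcond=None)
    weights = np.clip(weights, 0.0, None)
    return weights / weights.sum()

# Hypothetical toy data: two well-separated Gaussian classes in 2-D.
rng = np.random.default_rng(1)
source_x = np.vstack([rng.normal(-2.0, 1.0, size=(500, 2)),
                      rng.normal(2.0, 1.0, size=(500, 2))])
source_y = np.repeat([0, 1], 500)
# Target sample with shifted label proportions (80% class 0, 20% class 1).
target_x = np.vstack([rng.normal(-2.0, 1.0, size=(800, 2)),
                      rng.normal(2.0, 1.0, size=(200, 2))])

proportions = estimate_proportions(source_x, source_y, target_x)
print(proportions)
```

Working with explicit finite-dimensional features keeps the cost linear in the sample sizes, which is the usual motivation for random Fourier features over an exact kernel matrix.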
-
Bayes meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes
Authors:
Charles Riou,
Pierre Alquier,
Badr-Eddine Chérief-Abdellatif
Abstract:
Bernstein's condition is a key assumption that guarantees fast rates in machine learning. For example, the Gibbs algorithm with prior $π$ has an excess risk in $O(d_π/n)$, as opposed to the standard $O(\sqrt{d_π/n})$, where $n$ denotes the number of observations and $d_π$ is a complexity parameter which depends on the prior $π$. In this paper, we examine the Gibbs algorithm in the context of meta-learning, i.e., when learning the prior $π$ from $T$ tasks (with $n$ observations each) generated by a meta distribution. Our main result is that Bernstein's condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $π$, which will reduce the term $d_π$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on standard rates in three different settings: discrete priors, Gaussian priors and mixture of Gaussians priors.
Submitted 22 February, 2023;
originally announced February 2023.
-
On PAC-Bayesian reconstruction guarantees for VAEs
Authors:
Badr-Eddine Chérief-Abdellatif,
Yuyang Shi,
Arnaud Doucet,
Benjamin Guedj
Abstract:
Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.
Submitted 23 February, 2022;
originally announced February 2022.
-
Estimation of copulas via Maximum Mean Discrepancy
Authors:
Pierre Alquier,
Badr-Eddine Chérief-Abdellatif,
Alexis Derumigny,
Jean-David Fermanian
Abstract:
This paper deals with robust inference for parametric copula models. Estimation using Canonical Maximum Likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the Maximum Mean Discrepancy (MMD) principle. We derive non-asymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on $[0,1]^d$, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available.
Submitted 14 January, 2022; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
Many works in statistics aim at designing a universal estimation procedure, that is, an estimator that would converge to the best approximation of the (unknown) data generating distribution in a model, without any assumption on this distribution. This question is of major interest, in particular because the universality property leads to the robustness of the estimator. In this paper, we tackle the problem of universal estimation using a minimum distance estimator presented in Briol et al. (2019) based on the Maximum Mean Discrepancy. We show that the estimator is robust to both dependence and to the presence of outliers in the dataset. Finally, we provide a theoretical study of the stochastic gradient descent algorithm used to compute the estimator, and we support our findings with numerical simulations.
** The proof of Proposition 4.4 in the published version contains a mistake. The mistake is fixed here (and the bound is actually improved by a factor of 2). **
Submitted 13 February, 2025; v1 submitted 11 December, 2019;
originally announced December 2019.
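As a rough illustration of the minimum-MMD estimation described in the abstract above, the sketch below fits the mean of a Gaussian location model by minimizing an empirical squared MMD between the data and samples drawn from the model. It is a sketch under stated assumptions, not the authors' algorithm: the Gaussian kernel, its bandwidth, the grid search over candidate parameters (in place of stochastic gradient descent), and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix k(x_i, y_j) for 1-D samples x and y."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

def fit_location_by_mmd(data, candidates, n_model=500, bandwidth=1.0, seed=0):
    """Pick the candidate mean whose N(theta, 1) samples are closest to data in MMD."""
    rng = np.random.default_rng(seed)
    # Reuse the same standard normal draws for every candidate (common random numbers).
    noise = rng.standard_normal(n_model)
    losses = [mmd_squared(data, theta + noise, bandwidth) for theta in candidates]
    return candidates[int(np.argmin(losses))]

# Hypothetical contaminated sample: 90% from N(2, 1), 10% outliers at 50.
rng = np.random.default_rng(1)
clean = rng.normal(loc=2.0, scale=1.0, size=450)
outliers = np.full(50, 50.0)
data = np.concatenate([clean, outliers])

candidates = np.linspace(-5.0, 10.0, 151)
theta_mmd = fit_location_by_mmd(data, candidates)
theta_mean = data.mean()  # the Gaussian MLE of the mean, pulled toward the outliers
print(theta_mmd, theta_mean)
```

Because the Gaussian kernel is bounded, distant outliers contribute almost nothing to the cross term, so the minimizer stays near the bulk of the data while the sample mean does not.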
-
MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
In some misspecified settings, the posterior distribution in Bayesian statistics may lead to inconsistent estimates. To fix this issue, it has been suggested to replace the likelihood by a pseudo-likelihood, that is the exponential of a loss function enjoying suitable robustness properties. In this paper, we build a pseudo-likelihood based on the Maximum Mean Discrepancy, defined via an embedding of probability distributions into a reproducing kernel Hilbert space. We show that this MMD-Bayes posterior is consistent and robust to model misspecification. As the posterior obtained in this way might be intractable, we also prove that reasonable variational approximations of this posterior enjoy the same properties. We provide details on a stochastic gradient algorithm to compute these variational approximations. Numerical simulations indeed suggest that our estimator is more robust to misspecification than the ones based on the likelihood.
Submitted 11 December, 2019; v1 submitted 29 September, 2019;
originally announced September 2019.
-
Convergence Rates of Variational Inference in Sparse Deep Learning
Authors:
Badr-Eddine Chérief-Abdellatif
Abstract:
Variational inference is becoming more and more popular for approximating intractable posterior distributions in Bayesian statistics and machine learning. Meanwhile, a few recent works have provided theoretical justification and new insights on deep neural networks for estimating smooth functions in usual settings such as nonparametric regression. In this paper, we show that variational inference for sparse deep learning retains the same generalization properties as exact Bayesian inference. In particular, we highlight the connection between estimation and approximation theories via the classical bias-variance trade-off and show that it leads to near-minimax rates of convergence for Hölder smooth functions. Additionally, we show that the model selection framework over the neural network architecture via ELBO maximization does not overfit and adaptively achieves the optimal rate of convergence.
Submitted 5 September, 2019; v1 submitted 9 August, 2019;
originally announced August 2019.
-
A Generalization Bound for Online Variational Inference
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier,
Mohammad Emtiyaz Khan
Abstract:
Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.
Submitted 10 December, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
Mixture models are widely used in Bayesian statistics and machine learning, in particular in computational biology, natural language processing and many other fields. Variational inference, a technique for approximating intractable posteriors thanks to optimization algorithms, is extremely popular in practice when dealing with complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. We also tackle the problem of model selection for the number of components: we study the approach already used in practice, which consists in maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results by applications to Gaussian and multinomial mixtures.
Submitted 12 August, 2018; v1 submitted 14 May, 2018;
originally announced May 2018.