-
Convergence of Statistical Estimators via Mutual Information Bounds
Authors:
El Mahdi Khribch,
Pierre Alquier
Abstract:
Recent advances in statistical learning theory have revealed profound connections between mutual information (MI) bounds, PAC-Bayesian theory, and Bayesian nonparametrics. This work introduces a novel mutual information bound for statistical models. The derived bound has wide-ranging applications in statistical inference. It yields improved contraction rates for fractional posteriors in Bayesian nonparametrics. It can also be used to study a wide range of estimation methods, such as variational inference or Maximum Likelihood Estimation (MLE). By bridging these diverse areas, this work advances our understanding of the fundamental limits of statistical inference and the role of information in learning from data. We hope that these results will not only clarify connections between statistical inference and information theory but also help to develop a new toolbox to study a wide range of estimators.
Submitted 24 December, 2024;
originally announced December 2024.
-
Optimistic Estimation of Convergence in Markov Chains with the Average-Mixing Time
Authors:
Geoffrey Wolfer,
Pierre Alquier
Abstract:
The convergence rate of a Markov chain to its stationary distribution is typically assessed using the concept of total variation mixing time. However, this worst-case measure often yields pessimistic estimates and is challenging to infer from observations. In this paper, we advocate for the use of the average-mixing time as a more optimistic and demonstrably easier-to-estimate alternative. We further illustrate its applicability across a range of settings, from two-point to countable spaces, and discuss some practical implications.
Submitted 23 July, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Dimension-free Bounds for Sum of Dependent Matrices and Operators with Heavy-Tailed Distribution
Authors:
Shogo Nakakita,
Pierre Alquier,
Masaaki Imaizumi
Abstract:
We study the deviation inequality for a sum of high-dimensional random matrices and operators with dependence and arbitrary heavy tails. Estimating high-dimensional matrices is a problem of increasing importance, and dependence and heavy tails in data are currently among the most critical topics. In this paper, we derive a dimension-free upper bound on the deviation, that is, the bound does not depend explicitly on the dimension of the matrices, but depends on their effective rank. Our result is a generalization of several existing studies on the deviation of the sum of matrices. Our proof is based on two techniques: (i) a variational approximation of the dual of moment generating functions, and (ii) robustification through truncation of eigenvalues of matrices. We show that our results are applicable to several problems such as covariance matrix estimation, hidden Markov models, and overparameterized linear regression models.
Submitted 21 October, 2022; v1 submitted 18 October, 2022;
originally announced October 2022.
-
Variance-Aware Estimation of Kernel Mean Embedding
Authors:
Geoffrey Wolfer,
Pierre Alquier
Abstract:
An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desideratum of a distribution-agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.
Submitted 16 April, 2025; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Optimal quasi-Bayesian reduced rank regression with incomplete response
Authors:
The Tien Mai,
Pierre Alquier
Abstract:
The aim of reduced rank regression is to connect multiple response variables to multiple predictors. This model is very popular, especially in biostatistics where multiple measurements on individuals can be re-used to predict multiple outputs. Unfortunately, there are often missing data in such datasets, making it difficult to use standard estimation tools. In this paper, we study the problem of reduced rank regression where the response matrix is incomplete. We propose a quasi-Bayesian approach to this problem, in the sense that the likelihood is replaced by a quasi-likelihood. We provide a tight oracle inequality, proving that our method is adaptive to the rank of the coefficient matrix. We describe a Langevin Monte Carlo algorithm for the computation of the posterior mean. Numerical comparisons on synthetic and real data show that our method is competitive with state-of-the-art methods where the rank is chosen by cross-validation, and sometimes leads to an improvement.
Submitted 17 June, 2022;
originally announced June 2022.
-
Concentration of discrepancy-based approximate Bayesian computation via Rademacher complexity
Authors:
Sirio Legramanti,
Daniele Durante,
Pierre Alquier
Abstract:
There has been increasing interest in summary-free solutions for approximate Bayesian computation (ABC) which replace distances among summaries with discrepancies between the empirical distributions of the observed data and the synthetic samples generated under the proposed parameter values. The success of these strategies has motivated theoretical studies on the limiting properties of the induced posteriors. However, there is still a lack of a theoretical framework for summary-free ABC that (i) is unified, instead of discrepancy-specific, (ii) does not require constraining the analysis to data generating processes and statistical models meeting specific regularity conditions, but rather facilitates the derivation of limiting properties that hold uniformly, and (iii) relies on verifiable assumptions that provide explicit concentration bounds clarifying which factors govern the limiting behavior of the ABC posterior. We address this gap via a novel theoretical framework that introduces the concept of Rademacher complexity in the analysis of the limiting properties for discrepancy-based ABC posteriors, including in non-i.i.d. and misspecified settings. This yields a unified theory that relies on constructive arguments and provides more informative asymptotic results and uniform concentration bounds, even in settings not covered by current studies. These advancements are obtained by relating the asymptotic properties of summary-free ABC posteriors to the behavior of the Rademacher complexity associated with the chosen discrepancy in the family of integral probability semimetrics (IPS). The IPS class extends summary-based distances, and includes the Wasserstein distance and maximum mean discrepancy, among others. As clarified in specialized theoretical analyses of popular IPS discrepancies and via illustrative simulations, this perspective improves the understanding of summary-free ABC.
Submitted 24 January, 2025; v1 submitted 14 June, 2022;
originally announced June 2022.
-
User-friendly introduction to PAC-Bayes bounds
Authors:
Pierre Alquier
Abstract:
Aggregated predictors are obtained by making a set of basic predictors vote according to some weights, that is, to some probability distribution.
Randomized predictors are obtained by sampling in a set of basic predictors, according to some prescribed probability distribution.
Thus, aggregated and randomized predictors have in common that they are not defined by a minimization problem, but by a probability distribution on the set of predictors. In statistical learning theory, there is a set of tools designed to understand the generalization ability of such procedures: PAC-Bayesian or PAC-Bayes bounds.
Since the original PAC-Bayes bounds of D. McAllester, these tools have been considerably improved in many directions (we will for example describe a simplified version of the localization technique of O. Catoni that was missed by the community, and later rediscovered as "mutual information bounds"). Very recently, PAC-Bayes bounds received considerable attention: for example there was a workshop on PAC-Bayes at NIPS 2017, "(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights", organized by B. Guedj, F. Bach and P. Germain. One of the reasons for this recent success is the successful application of these bounds to neural networks by G. Dziugaite and D. Roy.
An elementary introduction to PAC-Bayes theory is still missing. This is an attempt to provide such an introduction.
Submitted 28 February, 2025; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Deviation inequalities for stochastic approximation by averaging
Authors:
Xiequan Fan,
Pierre Alquier,
Paul Doukhan
Abstract:
We introduce a class of Markov chains that contains the model of stochastic approximation by averaging and non-averaging. Using the martingale approximation method, we establish various deviation inequalities for separately Lipschitz functions of such a chain, with different moment conditions on some dominating random variables of martingale differences. Finally, we apply these inequalities to stochastic approximation by averaging and empirical risk minimisation.
Submitted 18 February, 2022; v1 submitted 17 February, 2021;
originally announced February 2021.
-
Tight Risk Bound for High Dimensional Time Series Completion
Authors:
Pierre Alquier,
Nicolas Marie,
Amélie Rosier
Abstract:
Initially designed for independent data, low-rank matrix completion was successfully applied in many domains to the reconstruction of partially observed high-dimensional time series. However, there is a lack of theory to support the application of these methods to dependent data. In this paper, we propose a general model for multivariate, partially observed time series. We show that the least-squares method with a rank penalty leads to a reconstruction error of the same order as for independent data. Moreover, when the time series has some additional properties such as periodicity or smoothness, the rate can actually be faster than in the independent case.
Submitted 11 March, 2022; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Meta-strategy for Learning Tuning Parameters with Guarantees
Authors:
Dimitri Meunier,
Pierre Alquier
Abstract:
Online learning methods, like the online gradient algorithm (OGA) and exponentially weighted aggregation (EWA), often depend on tuning parameters that are difficult to set in practice. We consider an online meta-learning scenario, and we propose a meta-strategy to learn these parameters from past tasks. Our strategy is based on the minimization of a regret bound. It allows one to learn the initialization and the step size in OGA with guarantees. It also allows one to learn the prior or the learning rate in EWA. We provide a regret analysis of the strategy. It allows one to identify settings where meta-learning indeed improves on learning each task in isolation.
Submitted 6 August, 2021; v1 submitted 4 February, 2021;
originally announced February 2021.
-
Estimation of copulas via Maximum Mean Discrepancy
Authors:
Pierre Alquier,
Badr-Eddine Chérief-Abdellatif,
Alexis Derumigny,
Jean-David Fermanian
Abstract:
This paper deals with robust inference for parametric copula models. Estimation using Canonical Maximum Likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the Maximum Mean Discrepancy (MMD) principle. We derive non-asymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on $[0,1]^d$, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available.
Submitted 14 January, 2022; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Universal Robust Regression via Maximum Mean Discrepancy
Authors:
Pierre Alquier,
Mathieu Gerber
Abstract:
Many modern datasets are collected automatically and are thus easily contaminated by outliers. This led to renewed interest in robust estimation, including new notions of robustness such as robustness to adversarial contamination of the data. However, most robust estimation methods are designed for a specific model. Notably, many methods were proposed recently to obtain robust estimators in linear models (or generalized linear models), and a few were developed for very specific settings, for example beta regression or sample selection models. In this paper we develop a new approach for robust estimation in arbitrary regression models, based on Maximum Mean Discrepancy minimization. We build two estimators which are both proven to be robust to Huber-type contamination. We obtain a non-asymptotic error bound for one of them and show that it is also robust to adversarial contamination, but this estimator is computationally more expensive to use in practice than the other one. As a by-product of our theoretical analysis of the proposed estimators we derive new results on kernel conditional mean embedding of distributions which are of independent interest.
Submitted 4 May, 2023; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
Many works in statistics aim at designing a universal estimation procedure, that is, an estimator that would converge to the best approximation of the (unknown) data generating distribution in a model, without any assumption on this distribution. This question is of major interest, in particular because the universality property leads to the robustness of the estimator. In this paper, we tackle the problem of universal estimation using a minimum distance estimator presented in Briol et al. (2019) based on the Maximum Mean Discrepancy. We show that the estimator is robust to both dependence and to the presence of outliers in the dataset. Finally, we provide a theoretical study of the stochastic gradient descent algorithm used to compute the estimator, and we support our findings with numerical simulations.
** The proof of Proposition 4.4 in the published version contains a mistake. The mistake is fixed here (and the bound is actually improved by a factor 2). **
Submitted 13 February, 2025; v1 submitted 11 December, 2019;
originally announced December 2019.
-
MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
In some misspecified settings, the posterior distribution in Bayesian statistics may lead to inconsistent estimates. To fix this issue, it has been suggested to replace the likelihood by a pseudo-likelihood, that is the exponential of a loss function enjoying suitable robustness properties. In this paper, we build a pseudo-likelihood based on the Maximum Mean Discrepancy, defined via an embedding of probability distributions into a reproducing kernel Hilbert space. We show that this MMD-Bayes posterior is consistent and robust to model misspecification. As the posterior obtained in this way might be intractable, we also prove that reasonable variational approximations of this posterior enjoy the same properties. We provide details on a stochastic gradient algorithm to compute these variational approximations. Numerical simulations indeed suggest that our estimator is more robust to misspecification than the ones based on the likelihood.
Submitted 11 December, 2019; v1 submitted 29 September, 2019;
originally announced September 2019.
-
High dimensional VAR with low rank transition
Authors:
Pierre Alquier,
Karine Bertin,
Paul Doukhan,
Rémy Garnier
Abstract:
We propose a vector auto-regressive (VAR) model with a low-rank constraint on the transition matrix. This new model is well suited to predict high-dimensional series that are highly correlated, or that are driven by a small number of hidden factors. We study estimation, prediction, and rank selection for this model in a very general setting. Our method shows excellent performance on a wide variety of simulated datasets. On macro-economic data from Giannone et al. (2015), our method is competitive with state-of-the-art methods in small dimension, and even improves on them in high dimension.
Submitted 10 February, 2020; v1 submitted 2 May, 2019;
originally announced May 2019.
-
A Generalization Bound for Online Variational Inference
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier,
Mohammad Emtiyaz Khan
Abstract:
Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.
Submitted 10 December, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Matrix factorization for multivariate time series analysis
Authors:
Pierre Alquier,
Nicolas Marie
Abstract:
Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series into a small set of latent factors. However, little is known about the statistical performance of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d. setting to time series. Moreover, we prove that when the series exhibits some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence.
Submitted 12 October, 2019; v1 submitted 13 March, 2019;
originally announced March 2019.
-
Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures
Authors:
Badr-Eddine Chérief-Abdellatif,
Pierre Alquier
Abstract:
Mixture models are widely used in Bayesian statistics and machine learning, in particular in computational biology, natural language processing and many other fields. Variational inference, a technique for approximating intractable posteriors thanks to optimization algorithms, is extremely popular in practice when dealing with complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. We also tackle the problem of model selection for the number of components: we study the approach already used in practice, which consists in maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results by applications to Gaussian and multinomial mixtures.
Submitted 12 August, 2018; v1 submitted 14 May, 2018;
originally announced May 2018.
-
Concentration of tempered posteriors and of their variational approximations
Authors:
Pierre Alquier,
James Ridgway
Abstract:
While Bayesian methods are extremely popular in statistics and machine learning, their application to massive datasets is often challenging, when possible at all. Indeed, the classical MCMC algorithms are prohibitively slow when both the model dimension and the sample size are large. Variational Bayesian methods aim at approximating the posterior by a distribution in a tractable family. Thus, MCMC is replaced by an optimization algorithm which is orders of magnitude faster. VB methods have been applied in computationally demanding applications including collaborative filtering, image and video processing, NLP and text processing. However, despite very nice results in practice, the theoretical properties of these approximations are usually not known. In this paper, we propose a general approach to prove the concentration of variational approximations of fractional posteriors. We apply our theory to two examples: matrix completion, and Gaussian VB.
Submitted 22 April, 2019; v1 submitted 28 June, 2017;
originally announced June 2017.
-
Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions
Authors:
Pierre Alquier,
Vincent Cottet,
Guillaume Lecué
Abstract:
We obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form
\begin{equation*}
\hat f \in \operatorname{argmin}_{f\in F}\left(\frac{1}{N}\sum_{i=1}^N\ell(f(X_i), Y_i)+\lambda\|f\|\right)
\end{equation*}
when $\|\cdot\|$ is any norm, $F$ is a convex class of functions and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$. We explore both the bounded and subgaussian stochastic frameworks for the distribution of the $f(X_i)$'s, with no assumption on the distribution of the $Y_i$'s. The general results rely on two main objects: a complexity function, and a sparsity equation, that depend on the specific setting at hand (loss $\ell$ and norm $\|\cdot\|$).
As a proof of concept, we obtain minimax rates of convergence in the following problems: 1) matrix completion with any Lipschitz loss function, including the hinge and logistic loss for the so-called 1-bit matrix completion instance of the problem, and quantile losses for the general case, which enables estimating any quantile on the entries of the matrix; 2) logistic LASSO and variants such as the logistic SLOPE; 3) kernel methods, where the loss is the hinge loss, and the regularization function is the RKHS norm.
Submitted 7 February, 2017; v1 submitted 5 February, 2017;
originally announced February 2017.
-
Simpler PAC-Bayesian Bounds for Hostile Data
Authors:
Pierre Alquier,
Benjamin Guedj
Abstract:
PAC-Bayesian learning bounds are of the utmost interest to the learning community. Their role is to connect the generalization ability of an aggregation distribution $ρ$ to its empirical risk and to its Kullback-Leibler divergence with respect to some prior distribution $π$. Unfortunately, most of the available bounds typically rely on heavy assumptions such as boundedness and independence of the observations. This paper aims at relaxing these constraints and provides PAC-Bayesian learning bounds that hold for dependent, heavy-tailed observations (hereafter referred to as \emph{hostile data}). In these bounds the Kullback-Leibler divergence is replaced with a general version of Csiszár's $f$-divergence. We prove a general PAC-Bayesian bound, and show how to use it in various hostile settings.
Submitted 23 May, 2019; v1 submitted 23 October, 2016;
originally announced October 2016.
-
Pseudo-Bayesian Quantum Tomography with Rank-adaptation
Authors:
The Tien Mai,
Pierre Alquier
Abstract:
Quantum state tomography, an important task in quantum information processing, aims at reconstructing a state from prepared measurement data. Bayesian methods are recognized to be a good and reliable choice for estimating quantum states~\cite{blume2010optimal}. Several numerical works showed that Bayesian estimators are comparable to, and even better than, other methods in the problem of $1$-qubit state recovery. However, the problem of choosing a prior distribution in the general case of $n$ qubits is not straightforward. More importantly, the statistical performance of Bayesian-type estimators has not been studied from a theoretical perspective yet. In this paper, we propose a novel prior for quantum states (density matrices), and we define pseudo-Bayesian estimators of the density matrix. Then, using PAC-Bayesian theorems, we derive rates of convergence for the posterior mean. The numerical performance of these estimators is tested on simulated and real datasets.
Submitted 10 October, 2016; v1 submitted 19 May, 2016;
originally announced May 2016.
-
An Oracle Inequality for Quasi-Bayesian Non-Negative Matrix Factorization
Authors:
Pierre Alquier,
Benjamin Guedj
Abstract:
The aim of this paper is to provide some theoretical understanding of quasi-Bayesian aggregation methods for non-negative matrix factorization. We derive an oracle inequality for an aggregated estimator. This result holds for a very general class of prior distributions and shows how the prior affects the rate of convergence.
Submitted 26 June, 2018; v1 submitted 6 January, 2016;
originally announced January 2016.
-
On the properties of variational approximations of Gibbs posteriors
Authors:
Pierre Alquier,
James Ridgway,
Nicolas Chopin
Abstract:
The PAC-Bayesian approach is a powerful set of techniques to derive non-asymptotic risk bounds for random estimators. The corresponding optimal distribution of estimators, usually called the Gibbs posterior, is unfortunately intractable. One may sample from it using Markov chain Monte Carlo, but this is often too slow for big datasets. We consider instead variational approximations of the Gibbs posterior, which are fast to compute. We undertake a general study of the properties of such approximations. Our main finding is that such a variational approximation often has the same rate of convergence as the original PAC-Bayesian procedure it approximates. We specialise our results to several learning tasks (classification, ranking, matrix completion), discuss how to implement a variational approximation in each case, and illustrate the good properties of said approximation on real datasets.
Submitted 15 June, 2015; v1 submitted 12 June, 2015;
originally announced June 2015.
-
A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution
Authors:
The Tien Mai,
Pierre Alquier
Abstract:
Bayesian methods for low-rank matrix completion with noise have been shown to be very efficient computationally. While the behaviour of penalized minimization methods is well understood both from the theoretical and computational points of view in this problem, the theoretical optimality of Bayesian estimators has not been explored yet. In this paper, we propose a Bayesian estimator for matrix completion under a general sampling distribution. We also provide an oracle inequality for this estimator. This inequality proves that, whatever the rank of the matrix to be estimated, our estimator reaches the minimax-optimal rate of convergence (up to a logarithmic factor). We end the paper with a short simulation study.
Submitted 21 January, 2015; v1 submitted 25 August, 2014;
originally announced August 2014.
-
Bayesian matrix completion: prior specification
Authors:
Pierre Alquier,
Vincent Cottet,
Nicolas Chopin,
Judith Rousseau
Abstract:
Low-rank matrix estimation from incomplete measurements has recently received increased attention due to the emergence of several challenging applications, such as recommender systems; see in particular the famous Netflix challenge. While the behaviour of algorithms based on nuclear norm minimization is now well understood, an as yet unexplored avenue of research is the behaviour of Bayesian algorithms in this context. In this paper, we briefly review the priors used in the Bayesian literature for matrix completion. A standard approach is to assign an inverse gamma prior to the singular values of a certain singular value decomposition of the matrix of interest; this prior is conjugate. However, we show that two other types of priors (again for the singular values) may be conjugate for this model: a gamma prior, and a discrete prior. Conjugacy is very convenient, as it makes it possible to implement either Gibbs sampling or Variational Bayes. Interestingly enough, the maximum a posteriori for these different priors is related to the nuclear norm minimization problems. We also compare all these priors on simulated datasets, and on the classical MovieLens and Netflix datasets.
Submitted 22 October, 2014; v1 submitted 5 June, 2014;
originally announced June 2014.
-
Adaptive estimation of the density matrix in quantum homodyne tomography with noisy data
Authors:
P Alquier,
K Meziani,
G Peyré
Abstract:
In the framework of noisy quantum homodyne tomography with efficiency parameter $1/2 < η\leq 1$, we propose a novel estimator of a quantum state whose density matrix elements $ρ_{m,n}$ decrease like $Ce^{-B(m+n)^{r/2}}$, for fixed $C\geq 1$, $B>0$ and $0<r\leq 2$. In contrast to previous works, we focus on the case where $r$, $C$ and $B$ are unknown. The procedure estimates the matrix coefficients by a projection method on the pattern functions, and then by soft-thresholding the estimated coefficients.
We prove that under the $\mathbb{L}_2$-loss our procedure is adaptive rate-optimal, in the sense that it achieves the same rate of convergence as the best possible procedure relying on the knowledge of $(r,B,C)$. The finite-sample behaviour of our adaptive procedure is explored through numerical experiments.
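As an illustration of the soft-thresholding step mentioned in this abstract, here is a minimal sketch. It assumes real-valued estimated coefficients and a single user-supplied threshold `t`; the function names and the fixed threshold are hypothetical and are not the data-driven rule of the paper.

```python
def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward zero by t; any x with |x| <= t is set exactly to 0."""
    if t < 0:
        raise ValueError("threshold must be non-negative")
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def soft_threshold_matrix(coeffs, t):
    """Apply soft-thresholding entrywise to a matrix given as a list of rows."""
    return [[soft_threshold(c, t) for c in row] for row in coeffs]
```

Small coefficients, which are dominated by noise, are zeroed out, while larger ones are kept but shrunk by `t`.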
Submitted 21 March, 2013; v1 submitted 31 January, 2013;
originally announced January 2013.
-
Prediction of time series by statistical learning: general losses and fast rates
Authors:
Pierre Alquier,
Xiaoyin Li,
Olivier Wintenberger
Abstract:
We establish rates of convergence in time series forecasting using the statistical learning approach based on oracle inequalities. A series of papers extends the oracle inequalities obtained for iid observations to time series under weak dependence conditions. Given a family of predictors and $n$ observations, oracle inequalities state that a predictor forecasts the series as well as the best predictor in the family up to a remainder term $Δ_n$. Using the PAC-Bayesian approach, we establish under weak dependence conditions oracle inequalities with optimal rates of convergence. We extend previous results for the absolute loss function to any Lipschitz loss function with rates $Δ_n\sim\sqrt{c(Θ)/ n}$ where $c(Θ)$ measures the complexity of the model. We apply the method for quantile loss functions to forecast the French GDP. Under additional conditions on the loss functions (satisfied by the quadratic loss function) and on the time series, we refine the rates of convergence to $Δ_n \sim c(Θ)/n$. We achieve for the first time these fast rates for uniformly mixing processes. These rates are known to be optimal in the iid case and for individual sequences. In particular, we generalize the results of Dalalyan and Tsybakov on sparse regression estimation to the case of autoregression.
Submitted 8 November, 2012;
originally announced November 2012.
-
PAC-Bayesian Estimation and Prediction in Sparse Additive Models
Authors:
Benjamin Guedj,
Pierre Alquier
Abstract:
The present paper is about estimation and prediction in high-dimensional additive models under a sparsity assumption ($p\gg n$ paradigm). A PAC-Bayesian strategy is investigated, delivering oracle inequalities in probability. The implementation is performed through recent outcomes in high-dimensional MCMC algorithms, and the performance of our method is assessed on simulated data.
Submitted 1 February, 2013; v1 submitted 6 August, 2012;
originally announced August 2012.
-
Rank penalized estimation of a quantum system
Authors:
Pierre Alquier,
Cristina Butucea,
Mohamed Hebiri,
Katia Meziani,
Morimae Tomoyuki
Abstract:
We introduce a new method to reconstruct the density matrix $ρ$ of a system of $n$ qubits and estimate its rank $d$ from data obtained by quantum state tomography measurements repeated $m$ times. The procedure consists in minimizing the risk of a linear estimator $\hatρ$ of $ρ$ penalized by given rank (from 1 to $2^n$), where $\hatρ$ is previously obtained by the moment method. We obtain simultaneously an estimator of the rank and the resulting density matrix associated with this rank. We establish an upper bound for the error of the penalized estimator, evaluated with the Frobenius norm, which is of order $dn(4/3)^n /m$, and consistency for the estimator of the rank. The proposed methodology is computationally efficient and is illustrated with some example states and real experimental data sets.
Submitted 26 September, 2013; v1 submitted 8 June, 2012;
originally announced June 2012.
-
Prediction of quantiles by statistical learning and application to GDP forecasting
Authors:
Pierre Alquier,
Xiaoyin Li
Abstract:
In this paper, we tackle the problem of prediction and confidence intervals for time series using a statistical learning approach and quantile loss functions. First, we show that the Gibbs estimator (also known as the Exponentially Weighted Aggregate) is able to predict as well as the best predictor in a given family for a wide set of loss functions. In particular, using the quantile loss function of Koenker and Bassett (1978), this allows us to build confidence intervals. We apply these results to the problem of prediction and confidence regions for the French Gross Domestic Product (GDP) growth, with promising results.
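To make the link between the quantile loss and interval construction concrete, here is a minimal sketch of the Koenker and Bassett quantile (pinball) loss. The helper `empirical_quantile` is a hypothetical illustration that minimizes the average loss over constants drawn from the sample; it is not the Gibbs estimator studied in the paper.

```python
def quantile_loss(y: float, prediction: float, tau: float) -> float:
    """Quantile (pinball) loss at level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    residual = y - prediction
    # Under-prediction is weighted by tau, over-prediction by 1 - tau.
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def empirical_quantile(values, tau):
    """Sample point minimizing the total quantile loss (a tau-quantile of the sample)."""
    return min(values, key=lambda q: sum(quantile_loss(v, q, tau) for v in values))
```

Minimizing this loss at two levels, for instance `tau = 0.1` and `tau = 0.9`, gives the two endpoints of an interval, which is the mechanism the abstract refers to.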
Submitted 8 August, 2012; v1 submitted 20 February, 2012;
originally announced February 2012.
-
Fast rates in learning with dependent observations
Authors:
Pierre Alquier,
Olivier Wintenberger
Abstract:
In this paper we tackle the problem of fast rates in time series forecasting from a statistical learning perspective. In a series of papers (e.g. Meir 2000, Modha and Masry 1998, Alquier and Wintenberger 2012) it is shown that the main tools used in learning theory with iid observations can be extended to the prediction of time series. The main message of these papers is that, given a family of predictors, we are able to build a new predictor that predicts the series as well as the best predictor in the family, up to a remainder of order $1/\sqrt{n}$. It is known that this rate cannot be improved in general. In this paper, we show that in the particular case of the least square loss, and under a strong assumption on the time series (phi-mixing), the remainder is actually of order $1/n$. Thus, the optimal rate for iid variables, see e.g. Tsybakov 2003, and individual sequences, see \cite{lugosi}, is, for the first time, achieved for uniformly mixing processes. We also show that our method is optimal for aggregating sparse linear combinations of predictors.
Submitted 20 February, 2012;
originally announced February 2012.
-
Sparsity considerations for dependent observations
Authors:
Pierre Alquier,
Paul Doukhan
Abstract:
The aim of this paper is to provide a comprehensive introduction to the study of $\ell_{1}$-penalized estimators in the context of dependent observations. We define a general $\ell_{1}$-penalized estimator for solving problems of stochastic optimization. This estimator turns out to be the LASSO in the regression estimation setting. Powerful theoretical guarantees on the statistical performance of the LASSO were provided in recent papers; however, they usually only deal with the iid case. Here, we study our estimator under various dependence assumptions.
Submitted 7 August, 2011; v1 submitted 8 February, 2011;
originally announced February 2011.
-
Sparse single-index model
Authors:
Pierre Alquier,
Gérard Biau
Abstract:
Let $(\mathbf{X}, Y)$ be a random pair taking values in $\mathbb R^p \times \mathbb R$. In the so-called single-index model, one has $Y=f^{\star}(θ^{\star T}\mathbf{X})+W$, where $f^{\star}$ is an unknown univariate measurable function, $θ^{\star}$ is an unknown vector in $\mathbb R^p$, and $W$ denotes a random noise satisfying $\mathbb E[W|\mathbf{X}]=0$. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations ("$p$ larger than $n$" paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures.
Submitted 6 October, 2011; v1 submitted 17 January, 2011;
originally announced January 2011.
-
Pac-bayesian bounds for sparse regression estimation with exponential weights
Authors:
Pierre Alquier,
Karim Lounici
Abstract:
We consider the sparse regression model where the number of parameters $p$ is larger than the sample size $n$. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performance. The BIC estimator for instance performs well from the statistical point of view \cite{BTW07} but can only be computed for values of $p$ of at most a few tens. The Lasso estimator is the solution of a convex minimization problem, hence computable for large values of $p$. However, stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov \cite{arnak} propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large $p$ and satisfies nice statistical properties under weak assumptions on the design. However, \cite{arnak} proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of \cite{arnak} but with improved statistical performance. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of the exponential weights estimator. We also propose an MCMC method to compute our estimator for reasonably large values of $p$.
Submitted 14 March, 2011; v1 submitted 14 September, 2010;
originally announced September 2010.
-
Transductive versions of the LASSO and the Dantzig Selector
Authors:
Pierre Alquier,
Mohamed Hebiri
Abstract:
Transductive methods are useful in prediction problems when the training dataset is composed of a large number of unlabeled observations and a smaller number of labeled observations. In this paper, we propose an approach for developing transductive prediction procedures that are able to take advantage of the sparsity in high-dimensional linear regression. More precisely, we define transductive versions of the LASSO and the Dantzig Selector. These procedures combine labeled and unlabeled observations of the training dataset to produce a prediction for the unlabeled observations. We propose an experimental study of the transductive estimators, which shows that they improve on the LASSO and Dantzig Selector in many situations, and particularly in high-dimensional problems when the predictors are correlated. We then provide non-asymptotic theoretical guarantees for these estimation methods. Interestingly, our theoretical results show that the Transductive LASSO and Dantzig Selector satisfy sparsity inequalities under weaker assumptions than those required for the "original" LASSO.
Submitted 5 May, 2010;
originally announced May 2010.
-
Transductive versions of the LASSO and the Dantzig Selector
Authors:
Pierre Alquier,
Mohamed Hebiri
Abstract:
We consider the linear regression problem, where the number $p$ of covariates is possibly larger than the number $n$ of observations $(x_{i},y_{i})_{1\leq i \leq n}$, under sparsity assumptions. On the one hand, several methods have been successfully proposed to perform this task, for example the LASSO or the Dantzig Selector. On the other hand, consider new values $(x_{i})_{n+1\leq i \leq m}$. If one wants to estimate the corresponding $y_{i}$'s, one should think of a specific estimator devoted to this task, referred to by Vapnik as a "transductive" estimator. This estimator may differ from an estimator designed for the more general task "estimate on the whole domain". In this work, we propose a generalized version of both the LASSO and the Dantzig Selector, based on the geometrical remarks about the LASSO in previous works. The "usual" LASSO and Dantzig Selector, as well as new estimators interpreted as transductive versions of the LASSO, appear as special cases. These estimators are interesting at least from a theoretical point of view: we can give theoretical guarantees for these estimators under hypotheses that are relaxed versions of the hypotheses required in the papers about the "usual" LASSO. These estimators can also be efficiently computed, with results comparable to those of the LASSO.
Submitted 6 June, 2009; v1 submitted 3 June, 2009;
originally announced June 2009.
-
Model selection for weakly dependent time series forecasting
Authors:
Pierre Alquier,
Olivier Wintenberger
Abstract:
Observing a stationary time series, we propose a two-step procedure for the prediction of the next value of the time series. The first step follows the machine learning theory paradigm and consists in determining a set of possible predictors as randomized estimators in (possibly numerous) different predictive models. The second step follows the model selection paradigm and consists in choosing one predictor with good properties among all the predictors of the first step. We study our procedure for two different types of observations: causal Bernoulli shifts and bounded weakly dependent processes. In both cases, we give oracle inequalities: the risk of the chosen predictor is close to the best prediction risk in all predictive models that we consider. We apply our procedure to predictive models such as linear predictors, neural network predictors and non-parametric autoregressive predictors.
Submitted 3 July, 2012; v1 submitted 17 February, 2009;
originally announced February 2009.
-
Generalization of l1 constraints for high dimensional regression problems
Authors:
Pierre Alquier,
Mohamed Hebiri
Abstract:
We focus on the high dimensional linear regression $Y\sim\mathcal{N}(Xβ^{*},σ^{2}I_{n})$, where $β^{*}\in\mathbb{R}^{p}$ is the parameter of interest. In this setting, several estimators such as the LASSO and the Dantzig Selector are known to satisfy interesting properties whenever the vector $β^{*}$ is sparse. Interestingly, both the LASSO and the Dantzig Selector can be seen as orthogonal projections of 0 onto $\mathcal{DC}(s)=\{β\in\mathbb{R}^{p},\|X'(Y-Xβ)\|_{\infty}\leq s\}$, using an $\ell_{1}$ distance for the Dantzig Selector and $\ell_{2}$ for the LASSO. For a well chosen $s>0$, this set is actually a confidence region for $β^{*}$. In this paper, we investigate the properties of estimators defined as projections on $\mathcal{DC}(s)$ using general distances. We prove that the obtained estimators satisfy oracle properties close to those of the LASSO and Dantzig Selector. On top of that, it turns out that these estimators can be tuned to exploit a different sparsity and/or slightly different estimation objectives.
Submitted 4 July, 2011; v1 submitted 1 November, 2008;
originally announced November 2008.
-
PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers
Authors:
Pierre Alquier
Abstract:
The aim of this paper is to generalize the PAC-Bayesian theorems proved by Catoni in the classification setting to more general problems of statistical inference. We show how to control the deviations of the risk of randomized estimators. Particular attention is paid to randomized estimators drawn in a small neighborhood of classical estimators, whose study leads to a control of the risk of the latter. These results allow one to bound the risk of very general estimation procedures, as well as to perform model selection.
Submitted 9 January, 2009; v1 submitted 11 December, 2007;
originally announced December 2007.
-
LASSO, Iterative Feature Selection and the Correlation Selector: Oracle Inequalities and Numerical Performances
Authors:
Pierre Alquier
Abstract:
We propose a general family of algorithms for regression estimation with quadratic loss. Our algorithms are able to select relevant functions from a large dictionary. We prove that many algorithms that have already been studied for this task (LASSO and Group LASSO, Dantzig selector, Iterative Feature Selection, among others) belong to our family, and exhibit another particular member of this family that we call the Correlation Selector in this paper. Using general properties of our family of algorithms, we prove oracle inequalities for IFS, for the LASSO and for the Correlation Selector, and compare the numerical performance of these estimators on a toy example.
Submitted 25 November, 2008; v1 submitted 24 October, 2007;
originally announced October 2007.
-
Density estimation with quadratic loss: a confidence intervals method
Authors:
Pierre Alquier
Abstract:
In a previous article, a least square regression estimation procedure was proposed: first, we consider a family of functions and study the properties of an estimator in every unidimensional model defined by one of these functions; we then show how to aggregate these estimators. The purpose of this paper is to extend this method to the case of density estimation. We first give a general overview of the method, adapted to the density estimation problem. We then show that this leads to adaptive estimators, meaning that the estimator reaches the best possible rate of convergence (up to a $\log$ factor). Finally we show some ways to improve and generalize the method.
Submitted 14 March, 2006;
originally announced March 2006.
-
Iterative Feature Selection In Least Square Regression Estimation
Authors:
Pierre Alquier
Abstract:
In this paper, we focus on regression estimation in both the inductive and the transductive case. We assume that we are given a set of features (which can be a basis of functions, but not necessarily). We begin by giving a deviation inequality on the risk of an estimator in every model defined by using a single feature. These models are too simple to be useful by themselves, but we then show how this result motivates an iterative algorithm that performs feature selection in order to build a suitable estimator. We prove that every selected feature actually improves the performance of the estimator. We give all the estimators and results at first in the inductive case, which requires the knowledge of the distribution of the design, and then in the transductive case, in which we do not need to know this distribution.
Submitted 10 April, 2008; v1 submitted 11 November, 2005;
originally announced November 2005.