-
Robust Model-Based Clustering
Authors:
Juan D. Gonzalez,
Ricardo Maronna,
Victor J. Yohai,
Ruben H. Zamar
Abstract:
We propose a new class of robust and Fisher-consistent estimators for mixture models. These estimators can be used to construct robust model-based clustering procedures. We study in detail the case of multivariate normal mixtures and propose a procedure that uses S-estimators of multivariate location and scatter. We develop an algorithm, quite similar to the EM algorithm, to compute the estimators and build the clusters. An extensive Monte Carlo simulation study shows that our proposal compares favorably with other robust and non-robust model-based clustering procedures. We apply our procedure and the alternatives to a real data set and again find that the best results are obtained using our proposal.
Submitted 8 June, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Optimal robust estimators for families of distributions on the integers
Authors:
Ricardo A. Maronna,
Victor J. Yohai
Abstract:
Let F_{θ} be a family of distributions with support on the set of nonnegative integers Z_0. In this paper we derive the M-estimators with smallest gross error sensitivity (GES). We start by defining the uniform median of a distribution F with support on Z_0 (umed(F)) as the median of x+u, where x and u are independent variables with distributions F and uniform in [-0.5,0.5] respectively. Under some general conditions we prove that the estimator with smallest GES satisfies umed(F_{n})=umed(F_{θ}), where F_{n} is the empirical distribution. The asymptotic distribution of these estimators is found. This distribution is normal except when there is a positive integer k such that F_{θ}(k)=0.5. In this case, the asymptotic distribution behaves as a normal distribution on each side of 0, but with different variances. A Monte Carlo simulation study compares, for the Poisson distribution, the finite-sample efficiency and robustness of this estimator with those of other robust estimators.
Submitted 10 November, 2019;
originally announced November 2019.
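The uniform median defined in the abstract above can be illustrated with a short sketch. Since x+u has a piecewise-linear distribution function, its median has a closed form once the integer k with F(k-1) < 0.5 <= F(k) is located. The function names `umed_from_pmf` and `umed_sample` are hypothetical, chosen for this illustration, and the code is only a sketch of the definition, not the authors' implementation.

```python
from collections import Counter


def umed_from_pmf(pmf):
    """Median of x + u, where x has the given pmf on the nonnegative
    integers and u is uniform on [-0.5, 0.5], independent of x.

    pmf: dict mapping integers to probabilities that sum to 1.
    (Hypothetical helper, written for illustration only.)
    """
    cum = 0.0  # F(k - 1), mass strictly below the current k
    for k in sorted(pmf):
        p = pmf[k]
        if p <= 0:
            continue
        if cum + p >= 0.5:
            # On [k - 0.5, k + 0.5] the CDF of x + u is
            # F(k - 1) + p * (t - k + 0.5); solve it for 0.5.
            return k - 0.5 + (0.5 - cum) / p
        cum += p
    raise ValueError("probabilities must sum to 1")


def umed_sample(xs):
    """Uniform median of the empirical distribution of a sample."""
    n = len(xs)
    return umed_from_pmf({k: c / n for k, c in Counter(xs).items()})
```

For a sample concentrated on a single integer the uniform median equals that integer, while for a sample split evenly between 0 and 1 it is 0.5, so the functional varies continuously with the data even though the observations are discrete.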
-
Multivariate Location and Scatter Matrix Estimation Under Cellwise and Casewise Contamination
Authors:
Andy Leung,
Victor J. Yohai,
Ruben H. Zamar
Abstract:
We consider the problem of multivariate location and scatter matrix estimation when the data contain cellwise and casewise outliers. Agostinelli et al. (2015) propose a two-step approach to deal with this problem: first, apply a univariate filter to remove cellwise outliers and second, apply a generalized S-estimator to downweight casewise outliers. We improve this proposal in three main directions. First, we introduce a consistent bivariate filter to be used in combination with the univariate filter in the first step. Second, we propose a new fast subsampling procedure to generate starting points for the generalized S-estimator in the second step. Third, we consider a non-monotonic weight function for the generalized S-estimator to better deal with casewise outliers in high dimension. A simulation study and real data example show that, unlike the original two-step procedure, the modified two-step approach performs and scales well for high dimension. Moreover, the modified procedure outperforms the original one and other state-of-the-art robust procedures under cellwise and casewise data contamination.
Submitted 25 December, 2016; v1 submitted 1 September, 2016;
originally announced September 2016.
-
Robust and sparse estimators for linear regression models
Authors:
Ezequiel Smucler,
Víctor J. Yohai
Abstract:
Penalized regression estimators are a popular tool for the analysis of sparse and high-dimensional data sets. However, penalized regression estimators defined using an unbounded loss function can be very sensitive to the presence of outlying observations, especially high leverage outliers. Moreover, it can be particularly challenging to detect outliers in high-dimensional data sets. Thus, robust estimators for sparse and high-dimensional linear regression models are needed. In this paper, we study the robust and asymptotic properties of MM-Bridge and adaptive MM-Bridge estimators: $\ell_q$-penalized MM-estimators of regression and MM-estimators with an adaptive $\ell_t$ penalty. For the case of a fixed number of covariates, we derive the asymptotic distribution of MM-Bridge estimators for all $q>0$. We prove that for $q<1$ MM-Bridge estimators can have the oracle property defined in Fan and Li (2001). We prove that for all $t\leq 1$ adaptive MM-Bridge estimators can have the oracle property. The advantages of our proposed estimators are demonstrated through an extensive simulation study and the analysis of a real high-dimensional data set.
Submitted 16 October, 2015; v1 submitted 8 August, 2015;
originally announced August 2015.
-
Robust and efficient estimation of high dimensional scatter and location
Authors:
Ricardo A. Maronna,
Victor J. Yohai
Abstract:
We deal with the equivariant estimation of scatter and location for p-dimensional data, giving emphasis to scatter. It is important that the estimators possess both a high efficiency for normal data and a high resistance to outliers, that is, a low bias under contamination. The most frequently employed estimators are not quite satisfactory in this respect. The Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators are known to have a very low efficiency. S-estimators (Davies 1987) with a monotonic weight function like the bisquare behave satisfactorily for "small" p, say p not larger than 10. Rocke (1996) showed that their efficiency tends to one with increasing p. Unfortunately, this advantage is paid for with a serious loss of robustness for large p. We consider three families of estimators with controllable efficiencies: non-monotonic S-estimators (Rocke 1996), MM-estimators (Tatsuoka and Tyler 2000) and tau-estimators (Lopuhaa 1991), whose performance for large p has not been explored to date. Two types of starting estimators are employed: the MVE computed through subsampling, and a semi-deterministic procedure proposed by Peña and Prieto (2007) for outlier detection. A simulation study shows that the Rocke and MM estimators, starting from the Peña-Prieto estimator and with adequate tuning, can simultaneously attain high efficiency and high robustness.
Submitted 13 August, 2015; v1 submitted 13 April, 2015;
originally announced April 2015.
-
Composite Robust Estimators for Linear Mixed Models
Authors:
Claudio Agostinelli,
Victor J. Yohai
Abstract:
The Classical Tukey-Huber Contamination Model (CCM) is a usual framework to describe the mechanism of outlier generation in robust statistics. In a data set with $n$ observations and $p$ variables, under the CCM, an outlier is a unit, even if only one or a few of its values are corrupted. Classical robust procedures were designed to cope with this setting, and the impact of observations was limited whenever necessary. Recently, a different mechanism of outlier generation, namely the Independent Contamination Model (ICM), was introduced. In this new setting each cell of the data matrix might be corrupted or not with a probability independent of the status of the other cells. The ICM poses a new challenge to robust statistics since the percentage of contaminated rows increases dramatically with $p$, often reaching more than $50\%$. When this situation appears, classical affine equivariant robust procedures do not work since their breakdown point is $50\%$. For this contamination model we propose a new type of robust methods, namely composite robust procedures, which are inspired by the idea of composite likelihood, where low-dimensional likelihoods, very often the likelihoods of pairs, are aggregated in order to obtain a more tractable approximation of the full likelihood. Our composite robust procedures are built on pairs of observations in order to gain robustness in the independent contamination model. We propose composite S and $τ$-estimators for linear mixed models. Composite $τ$-estimators are proved to have a high breakdown point both in the CCM and ICM. A Monte Carlo study shows that our estimators compare favorably with respect to classical S-estimators under the CCM and outperform them under the ICM. One example based on a real data set illustrates the new robust procedure.
Submitted 14 July, 2014; v1 submitted 8 July, 2014;
originally announced July 2014.
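The growth of the contaminated-row percentage with $p$ mentioned in the abstract above follows from a one-line calculation: if each cell is corrupted independently with probability ε, a row of $p$ cells contains at least one corrupted cell with probability 1 - (1 - ε)^p. The sketch below evaluates this expression; the function names and the example value ε = 0.05 are assumptions made for illustration and do not come from the paper.

```python
def contaminated_row_fraction(eps, p):
    """Probability that a row with p cells has at least one corrupted
    cell when each cell is corrupted independently with probability eps."""
    return 1.0 - (1.0 - eps) ** p


def min_dimension_for_majority(eps):
    """Smallest p for which more than half of the rows are expected to
    contain at least one corrupted cell (hypothetical helper)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    p = 1
    while contaminated_row_fraction(eps, p) <= 0.5:
        p += 1
    return p


if __name__ == "__main__":
    for p in (1, 5, 10, 20, 50):
        print(p, round(contaminated_row_fraction(0.05, p), 3))
```

With a cell contamination probability of 0.05, the expected fraction of affected rows already exceeds one half at $p = 14$, which is the regime where row-wise downweighting can no longer rely on a clean majority of rows.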
-
Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination
Authors:
Claudio Agostinelli,
Andy Leung,
Victor J. Yohai,
Ruben H. Zamar
Abstract:
Multivariate location and scatter matrix estimation is a cornerstone in multivariate data analysis. We consider this problem when the data may contain independent cellwise and casewise outliers. Flat data sets with a large number of variables and a relatively small number of cases are commonplace in modern statistical applications. In these cases global down-weighting of an entire case, as performed by traditional robust procedures, may lead to poor results. We highlight the need for a new generation of robust estimators that can efficiently deal with cellwise outliers and at the same time show good performance under casewise outliers.
Submitted 23 June, 2014;
originally announced June 2014.
-
Dynamic Principal Components in the Time Domain
Authors:
Daniel Peña,
Víctor J. Yohai
Abstract:
We propose a time domain approach to define dynamic principal components (DPC) using a criterion based on the reconstruction of the original series. This approach to defining DPC was introduced by Brillinger, who gave a very elegant theoretical solution in the stationary case using the cross spectrum. Our procedure can be applied under more general conditions, including the case of non-stationary series and relatively short series. We also present a robust version of our procedure that allows the DPC to be estimated when the series have outlier contamination. Our non-robust and robust procedures are illustrated with real datasets.
Submitted 17 June, 2014;
originally announced June 2014.
-
M-estimators for Isotonic Regression
Authors:
Enrique E. Álvarez,
Víctor J. Yohai
Abstract:
In this paper we propose a family of robust estimates for isotonic regression: isotonic M-estimators. We show that their asymptotic distribution is, up to a scalar factor, the same as that of Brunk's classical isotonic estimator. We also derive the influence function and the breakdown point of these estimates. Finally we perform a Monte Carlo study that shows that the proposed family includes estimators that are simultaneously highly efficient under Gaussian errors and highly robust when the error distribution has heavy tails.
Submitted 25 May, 2011;
originally announced May 2011.
-
Robust location estimation with missing data
Authors:
Mariela Sued,
Victor J. Yohai
Abstract:
In a missing-data setting, we have a sample in which a vector of explanatory variables x_i is observed for every subject i, while scalar outcomes y_i are missing by happenstance for some individuals. In this work we propose robust estimates of the distribution of the responses assuming missing at random (MAR) data, under a semiparametric regression model. Our approach allows the consistent estimation of any weakly continuous functional of the response's distribution. In particular, strongly consistent estimates of any continuous location functional, such as the median or MM functionals, are proposed. A robust fit for the regression model combined with the robust properties of the location functional gives rise to a robust recipe for estimating the location parameter. Robustness is quantified through the breakdown point of the proposed procedure. The asymptotic distribution of the location estimates is also derived.
Submitted 17 September, 2010; v1 submitted 29 April, 2010;
originally announced April 2010.
-
Continuity and differentiability of regression M functionals
Authors:
María V. Fasano,
Ricardo A. Maronna,
Mariela Sued,
Víctor J. Yohai
Abstract:
This paper deals with the Fisher-consistency, weak continuity and differentiability of estimating functionals corresponding to a class of both linear and nonlinear regression high breakdown M estimates, which includes S and MM estimates. A restricted type of differentiability, called weak differentiability, is defined, which suffices to prove the asymptotic normality of estimates based on the functionals. This approach allows us to prove the consistency, asymptotic normality and qualitative robustness of M estimates under more general conditions than those required in standard approaches. In particular, we prove that regression MM-estimates are asymptotically normal when the observations are $φ$-mixing.
Submitted 23 November, 2012; v1 submitted 24 April, 2010;
originally announced April 2010.
-
Robust estimation for ARMA models
Authors:
Nora Muler,
Daniel Peña,
Víctor J. Yohai
Abstract:
This paper introduces a new class of robust estimates for ARMA models. They are M-estimates, but the residuals are computed so that the effect of one outlier is limited to the period where it occurs. These estimates are closely related to those based on a robust filter, but they have two important advantages: they are consistent and the asymptotic theory is tractable. We perform a Monte Carlo study in which we show that these estimates compare favorably with respect to standard M-estimates and to estimates based on a diagnostic procedure.
Submitted 1 April, 2009;
originally announced April 2009.
-
Propagation of outliers in multivariate data
Authors:
Fatemah Alqallaf,
Stefan Van Aelst,
Victor J. Yohai,
Ruben H. Zamar
Abstract:
We investigate the performance of robust estimates of multivariate location under nonstandard data contamination models such as componentwise outliers (i.e., contamination in each variable is independent of the other variables). This model brings up a possible new source of statistical error that we call "propagation of outliers." This source of error is unusual in the sense that it is generated by the data processing itself and takes place after the data have been collected. We define and derive the influence function of robust multivariate location estimates under flexible contamination models and use it to investigate the effect of propagation of outliers. Furthermore, we show that standard high-breakdown affine equivariant estimators propagate outliers and therefore show poor breakdown behavior under componentwise contamination when the dimension $d$ is high.
Submitted 3 March, 2009;
originally announced March 2009.
-
High breakdown point robust regression with censored data
Authors:
Matías Salibian-Barrera,
Víctor J. Yohai
Abstract:
In this paper, we propose a class of high breakdown point estimators for the linear regression model when the response variable contains censored observations. These estimators are robust against high-leverage outliers and they generalize the LMS (least median of squares), S, MM and $τ$-estimators for linear regression. An important contribution of this paper is that we can define consistent estimators using a bounded loss function (or equivalently, a redescending score function). Since the calculation of these estimators can be computationally costly, we propose an efficient algorithm to compute them. We illustrate their use on an example and present simulation studies that show that these estimators also have good finite sample properties.
Submitted 12 March, 2008;
originally announced March 2008.
-
Robust nonparametric inference for the median
Authors:
Victor J. Yohai,
Ruben H. Zamar
Abstract:
We consider the problem of constructing robust nonparametric confidence intervals and hypothesis tests for the median when the data distribution is unknown and the data may contain a small fraction of contamination. We propose a modification of the sign test (and its associated confidence interval) which attains the nominal significance level (coverage probability) for any distribution in the contamination neighborhood of a continuous distribution. We also define some measures of robustness and efficiency under contamination for confidence intervals and tests. These measures are computed for the proposed procedures.
Submitted 29 March, 2005;
originally announced March 2005.
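For context on the last abstract, the classical (unmodified) sign-test confidence interval for the median can be sketched as follows. It uses the fact that, for a continuous distribution, the number of observations below the median is Binomial(n, 1/2), so the interval between the k-th smallest and k-th largest order statistics has coverage 1 - 2 P(B <= k - 1). This is the standard textbook construction, not the robust modification proposed in the paper, and the function name `sign_test_interval` is hypothetical.

```python
from math import comb


def sign_test_interval(xs, alpha=0.05):
    """Classical sign-test confidence interval for the median.

    Returns (lower, upper, coverage), where the interval runs from the
    k-th smallest to the k-th largest observation and k is the largest
    rank whose exact coverage is at least 1 - alpha.
    """
    n = len(xs)
    s = sorted(xs)
    k = 0
    cum = 0.0  # running value of P(B <= j), B ~ Binomial(n, 1/2)
    for j in range(n // 2):
        cum += comb(n, j) / 2 ** n
        if 2 * cum > alpha:
            break
        k = j + 1
    if k == 0:
        raise ValueError("sample too small for the requested level")
    coverage = 1.0 - 2.0 * sum(comb(n, j) for j in range(k)) / 2 ** n
    return s[k - 1], s[n - k], coverage


if __name__ == "__main__":
    print(sign_test_interval([3.1, 0.4, 2.2, 5.9, 1.7, 4.4, 2.8, 3.6, 0.9, 7.3]))
```

Because the binomial distribution is discrete, the exact coverage is usually somewhat larger than the nominal 1 - alpha; under this classical construction it is guaranteed only for uncontaminated continuous distributions, which is the limitation the abstract addresses.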