-
Robust multivariate methods in Chemometrics
Authors:
Peter Filzmoser,
Sven Serneels,
Ricardo Maronna,
Christophe Croux
Abstract:
This chapter presents an introduction to robust statistics with applications of a chemometric nature. Following a description of the basic ideas and concepts behind robust statistics, including how robust estimators can be conceived, the chapter builds up to the construction (and use) of robust alternatives for some methods for multivariate analysis frequently used in chemometrics, such as principal component analysis and partial least squares. The chapter then provides an insight into how these robust methods can be used or extended to classification. To conclude, the validation of the results is addressed: it is shown how uncertainty statements associated with robust estimates can be obtained.
Submitted 15 June, 2020; v1 submitted 2 June, 2020;
originally announced June 2020.
-
Sliced Average Variance Estimation for Multivariate Time Series
Authors:
Markus Matilainen,
Christophe Croux,
Klaus Nordhausen,
Hannu Oja
Abstract:
Supervised dimension reduction for time series is challenging as there may be temporal dependence between the response $y$ and the predictors $\boldsymbol x$. Recently, a time series version of sliced inverse regression, TSIR, was suggested, which applies approximate joint diagonalization of several supervised lagged covariance matrices to account for the temporal nature of the data. In this paper we develop this concept further and propose a time series version of sliced average variance estimation, TSAVE. As both TSIR and TSAVE have their own advantages and disadvantages, we furthermore consider a hybrid version of TSIR and TSAVE. Based on examples and simulations we demonstrate and evaluate the differences between the three methods, and also show that they are superior to their iid counterparts applied with lagged values of the explanatory variables included as predictors.
Submitted 5 October, 2018;
originally announced October 2018.
-
Lasso-based forecast combinations for forecasting realized variances
Authors:
Ines Wilms,
Jeroen Rombouts,
Christophe Croux
Abstract:
Volatility forecasts are key inputs in financial analysis. While lasso-based forecasts have been shown to perform well in many applications, their use to obtain volatility forecasts has not yet received much attention in the literature. Lasso estimators produce parsimonious forecast models. Our forecast combination approach hedges against the risk of selecting a wrong degree of model parsimony. Apart from the standard lasso, we consider several lasso extensions that account for the dynamic nature of the forecast model. We apply forecast combined lasso estimators in a comprehensive forecasting exercise using realized variance time series of ten major international stock market indices. We find that the lasso extension known as the "ordered lasso" gives the most accurate realized variance forecasts. Multivariate forecast models, accounting for volatility spillovers between different stock markets, outperform univariate forecast models for longer forecast horizons.
Submitted 9 October, 2016;
originally announced October 2016.
-
Multi-class Vector AutoRegressive Models for Multi-store Sales Data
Authors:
Ines Wilms,
Luca Barbaglia,
Christophe Croux
Abstract:
Retailers use the Vector AutoRegressive (VAR) model as a standard tool to estimate the effects of prices, promotions and sales in one product category on the sales of another product category. Besides, these price, promotion and sales data are available for not just one store, but a whole chain of stores. We propose to study cross-category effects using a multi-class VAR model: we jointly estimate cross-category effects for several distinct but related VAR models, one for each store. Our methodology encourages effects to be similar across stores, while still allowing for small differences between stores to account for store heterogeneity. Moreover, our estimator is sparse: unimportant effects are estimated as exactly zero, which facilitates the interpretation of the results. A simulation study shows that the proposed multi-class estimator improves estimation accuracy by borrowing strength across classes. Finally, we provide three visual tools showing (i) the clustering of stores on identical cross-category effects, (ii) the networks of product categories and (iii) the similarity matrices of shared cross-category effects across stores.
Submitted 11 May, 2016;
originally announced May 2016.
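The phrase "estimated as exactly zero" in the abstract above reflects a general property of lasso-type penalties. As a minimal sketch of that property only (a generic one-dimensional lasso, not the multi-class VAR estimator of the paper; the function name `soft_threshold` and all numbers are illustrative assumptions), the problem of minimizing 0.5 * (z - b)^2 + lam * |b| over b has a closed-form soft-thresholding solution:

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + lam * abs(b) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small inputs are set to exactly zero, not merely shrunk


# Hypothetical unpenalized effect estimates (made up for illustration).
effects = [2.5, -0.3, 0.1, -1.75]
sparse_effects = [soft_threshold(z, 0.5) for z in effects]
print(sparse_effects)
```

Inputs smaller in magnitude than the penalty level are mapped to zero, while larger ones are shrunk towards zero by that amount, which is what makes the resulting effect matrix easier to interpret.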
-
An algorithm for the multivariate group lasso with covariance estimation
Authors:
Ines Wilms,
Christophe Croux
Abstract:
We study a group lasso estimator for the multivariate linear regression model that accounts for correlated error terms. A block coordinate descent algorithm is used to compute this estimator. We perform a simulation study with categorical data and multivariate time series data, typical settings with a natural grouping among the predictor variables. Our simulation studies show the good performance of the proposed group lasso estimator compared to alternative estimators. We illustrate the method on a time series data set of gene expressions.
Submitted 16 December, 2015;
originally announced December 2015.
-
The predictive power of the business and bank sentiment of firms: A high-dimensional Granger Causality approach
Authors:
Ines Wilms,
Sarah Gelper,
Christophe Croux
Abstract:
We study the predictive power of industry-specific economic sentiment indicators for future macro-economic developments. In addition to the sentiment of firms towards their own business situation, we study their sentiment with respect to the banking sector - their main credit providers. The use of industry-specific sentiment indicators results in a high-dimensional forecasting problem. To identify the most predictive industries, we present a bootstrap Granger Causality test based on the Adaptive Lasso. This test is more powerful than the standard Wald test in such high-dimensional settings. Forecast accuracy is improved by using only the most predictive industries rather than all industries.
Submitted 12 August, 2015;
originally announced August 2015.
-
Identifying Demand Effects in a Large Network of Product Categories
Authors:
Sarah Gelper,
Ines Wilms,
Christophe Croux
Abstract:
Planning marketing mix strategies requires retailers to understand within- as well as cross-category demand effects. Most retailers carry products in a large variety of categories, leading to a high number of such demand effects to be estimated. At the same time, we do not expect cross-category effects between all categories. This paper outlines a methodology to estimate a parsimonious product category network without prior constraints on its structure. To do so, sparse estimation of the Vector AutoRegressive Market Response Model is presented. We find that cross-category effects go beyond substitutes and complements, and that categories have asymmetric roles in the product category network. Destination categories are most influential for other product categories, while convenience and occasional categories are most responsive. Routine categories are moderately influential and moderately responsive.
Submitted 4 June, 2015;
originally announced June 2015.
-
The shooting S-estimator for robust regression
Authors:
Viktoria Öllerer,
Andreas Alfons,
Christophe Croux
Abstract:
To perform multiple regression, the least squares estimator is commonly used. However, this estimator is not robust to outliers. Therefore, robust methods such as S-estimation have been proposed. These estimators flag any observation with a large residual as an outlier and downweight it in the further procedure. However, a large residual may be caused by an outlier in only one single predictor variable, and downweighting the complete observation results in a loss of information.
Therefore, we propose the shooting S-estimator, a regression estimator that is especially designed for situations where a large number of observations suffer from contamination in a small number of predictor variables. The shooting S-estimator combines the ideas of the coordinate descent algorithm with simple S-regression, which makes it robust against componentwise contamination, at the cost of failing the regression equivariance property.
Submitted 3 June, 2015;
originally announced June 2015.
-
Sparse cointegration
Authors:
Ines Wilms,
Christophe Croux
Abstract:
Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. In this paper, we provide a sparse estimator of the cointegrating vectors. The estimation technique is sparse in the sense that some elements of the cointegrating vectors will be estimated as zero. For this purpose, we combine a penalized estimation procedure for vector autoregressive models with sparse reduced rank regression. The sparse cointegration procedure achieves a higher estimation accuracy than the traditional Johansen cointegration approach in settings where the true cointegrating vectors have a sparse structure, and/or when the sample size is low compared to the number of time series. We also discuss a criterion to determine the cointegration rank and we illustrate its good performance in several simulation settings. In a first empirical application we investigate whether the expectations hypothesis of the term structure of interest rates, implying sparse cointegrating vectors, holds in practice. In a second empirical application we show that forecast performance in high-dimensional systems can be improved by sparsely estimating the cointegration relations.
Submitted 6 January, 2015;
originally announced January 2015.
-
Robust Sparse Canonical Correlation Analysis
Authors:
Ines Wilms,
Christophe Croux
Abstract:
Canonical correlation analysis (CCA) is a multivariate statistical method which describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. This paper discusses a method for Robust Sparse CCA. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their interpretability is improved. We also robustify the method such that it can cope with outliers in the data. To estimate the canonical vectors, we convert the CCA problem into an alternating regression framework, and use the sparse Least Trimmed Squares estimator. We illustrate the good performance of the Robust Sparse CCA method in several simulation studies and two real data examples.
Submitted 6 January, 2015;
originally announced January 2015.
-
Sparse canonical correlation analysis from a predictive point of view
Authors:
Ines Wilms,
Christophe Croux
Abstract:
Canonical correlation analysis (CCA) describes the associations between two sets of variables by maximizing the correlation between linear combinations of the variables in each data set. However, in high-dimensional settings where the number of variables exceeds the sample size or when the variables are highly correlated, traditional CCA is no longer appropriate. This paper proposes a method for sparse CCA. Sparse estimation produces linear combinations of only a subset of variables from each data set, thereby increasing the interpretability of the canonical variates. We consider the CCA problem from a predictive point of view and recast it into a regression framework. By combining an alternating regression approach together with a lasso penalty, we induce sparsity in the canonical vectors. We compare the performance with other sparse CCA techniques in different simulation settings and illustrate its usefulness on a genomic data set.
Submitted 6 January, 2015;
originally announced January 2015.
-
Robust high-dimensional precision matrix estimation
Authors:
Viktoria Öllerer,
Christophe Croux
Abstract:
The dependency structure of multivariate data can be analyzed using the covariance matrix $Σ$. In many fields the precision matrix $Σ^{-1}$ is even more informative. As the sample covariance estimator is singular in high dimensions, it cannot be used to obtain a precision matrix estimator. A popular high-dimensional estimator is the graphical lasso, but it lacks robustness. We consider the high-dimensional independent contamination model. Here, even a small percentage of contaminated cells in the data matrix may lead to a high percentage of contaminated rows. Downweighting entire observations, which is done by traditional robust procedures, would then result in a loss of information. In this paper, we formally prove that replacing the sample covariance matrix in the graphical lasso with an elementwise robust covariance matrix leads to an elementwise robust, sparse precision matrix estimator computable in high dimensions. Examples of such elementwise robust covariance estimators are given. The final precision matrix estimator is positive definite, has a high breakdown point under elementwise contamination and is fast to compute.
Submitted 3 June, 2015; v1 submitted 6 January, 2015;
originally announced January 2015.
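To illustrate what an "elementwise robust covariance" entry might look like, the sketch below computes one pairwise estimate as the product of two robust scales and a rank correlation. This is an assumption chosen for illustration (MAD scales with Spearman correlation); it is not claimed to be the specific estimator studied in the paper, and the helper names and toy data are hypothetical. The graphical lasso step itself is omitted.

```python
from statistics import median


def mad_scale(values):
    """Median absolute deviation, scaled for consistency at the normal."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])


def ranks(values):
    """Ranks 1..n; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5
    return num / den


def robust_cov(a, b):
    """Pairwise estimate: robust scales times Spearman rank correlation."""
    return mad_scale(a) * mad_scale(b) * pearson(ranks(a), ranks(b))


def sample_cov(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n - 1)


# Toy data (made up): two correlated variables, then one contaminated cell.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
x_bad = list(x)
x_bad[4] = 1000

shift_sample = abs(sample_cov(x_bad, y) - sample_cov(x, y))
shift_robust = abs(robust_cov(x_bad, y) - robust_cov(x, y))
print(shift_sample, shift_robust)
```

On this toy data the single contaminated cell moves the sample covariance far more than the pairwise robust estimate. In a full pipeline, a matrix of such pairwise estimates would be passed to a graphical lasso solver in place of the sample covariance matrix.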
-
Sparse least trimmed squares regression for analyzing high-dimensional large data sets
Authors:
Andreas Alfons,
Christophe Croux,
Sarah Gelper
Abstract:
Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an $L_1$ penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.
Submitted 17 April, 2013;
originally announced April 2013.
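The idea of adding an $L_1$ penalty to the least trimmed squares criterion can be sketched as an objective function: the sum of the h smallest squared residuals plus a penalty on the absolute coefficients. The code below is only an evaluation of such an objective under stated assumptions (no intercept, a penalty scaled by h, hypothetical function name and toy data); it is not the fast algorithm proposed in the paper.

```python
def sparse_lts_objective(X, y, beta, h, lam):
    """Sum of the h smallest squared residuals plus h * lam * ||beta||_1.

    The scaling of the penalty by h is an assumption of this sketch.
    """
    squared_residuals = sorted(
        (yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
        for xi, yi in zip(X, y)
    )
    trimmed = sum(squared_residuals[:h])  # the largest residuals are ignored
    penalty = h * lam * sum(abs(b) for b in beta)
    return trimmed + penalty


# Toy data (made up): y = 2 * x except for one gross outlier in the last row.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 6.0, 8.0, 100.0]
beta = [2.0]

trimmed_fit = sparse_lts_objective(X, y, beta, h=4, lam=0.1)
full_fit = sparse_lts_objective(X, y, beta, h=5, lam=0.1)
print(trimmed_fit, full_fit)
```

With h = 4 the outlying observation is trimmed away and only the penalty term remains, whereas with h = 5 (no trimming) the outlier dominates the criterion.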