-
Generalized Ridge Regression: Applications to Nonorthogonal Linear Regression Models
Authors:
Román Salmerón Gómez,
Catalina García García,
Guillermo Hortal Reina
Abstract:
This paper analyzes the possibilities of using the generalized ridge regression to mitigate multicollinearity in a multiple linear regression model. For this purpose, we obtain the expressions for the estimated variance, the coefficient of variation, the coefficient of correlation, the variance inflation factor and the condition number. The results obtained are illustrated with two numerical examples.
Submitted 8 April, 2025;
originally announced April 2025.
-
Stepwise regression revisited
Authors:
Román Salmerón Gómez,
Catalina García García
Abstract:
This paper shows that the degree of approximate multicollinearity in a linear regression model increases simply by including independent variables, even if these are not highly linearly related. In the current situation where it is relatively easy to find linear models with a large number of independent variables, it is shown that this issue can lead to the erroneous conclusion that there is a worrying problem of approximate multicollinearity. To avoid this situation, an adjusted variance inflation factor is proposed to compensate for the presence of a large number of independent variables in the multiple linear regression model. It is shown that this proposal has a direct impact on variable selection models based on influence relationships, which translates into a new decision criterion in the individual significance contrast to be considered in stepwise regression models or even directly in a multiple linear regression model.
Submitted 6 March, 2025;
originally announced March 2025.
-
Unraveling Residualization: enhancing its application and exposing its relationship with the FWL theorem
Authors:
Catalina García García,
Román Salmerón Gómez,
Claudia García García
Abstract:
The residualization procedure has been applied in many different fields to estimate models with multicollinearity. However, there exists a lack of understanding of this methodology and some authors discourage its use. This paper aims to contribute to a better understanding of the residualization procedure to promote an adequate application and interpretation of it among statistics and data sciences. We highlight its interesting potential application, not only to mitigate multicollinearity but also when the study is oriented to the analysis of the isolated effect of independent variables. The relation between the residualization methodology and the Frisch-Waugh-Lovell (FWL) theorem is also analyzed, concluding that, although both provide the same estimations, the interpretation of the estimated coefficients is different. These different interpretations justify the application of the residualization methodology regardless of the FWL theorem. A real data example is presented for a better illustration of the contribution of this paper.
Submitted 23 October, 2024;
originally announced October 2024.
-
Generalized Ridge Regression: Biased Estimation for Multiple Linear Regression Models
Authors:
Román Salmerón Gómez,
Catalina García García,
Guillermo Hortal Reina
Abstract:
When the regressors of an econometric linear model are nonorthogonal, it is well known that their estimation by ordinary least squares can present various problems that discourage the use of this model. The ridge regression is the most commonly used alternative; however, its generalized version has hardly been analyzed. The present work addresses the estimation of this generalized version, as well as the calculation of its mean squared error, goodness of fit and bootstrap inference.
Submitted 2 July, 2024;
originally announced July 2024.
-
Enlarging of the sample to address multicollinearity
Authors:
Román Salmerón Gómez,
Catalina García García,
Ainara Rodríguez Sánchez
Abstract:
The paper analyzes how enlarging the sample affects the mitigation of collinearity, concluding that it may mitigate the consequences of collinearity related to statistical analysis but not necessarily the numerical instability. The problem addressed is of importance in the teaching of social sciences, since it discusses one of the solutions proposed almost unanimously to solve the problem of multicollinearity. For a better understanding and illustration of the contribution of this paper, two empirical examples are presented and highly technical developments are avoided.
Submitted 1 July, 2024;
originally announced July 2024.
-
Low-complexity Approximate Convolutional Neural Networks
Authors:
R. J. Cintra,
S. Duffner,
C. Garcia,
A. Leite
Abstract:
In this paper, we present an approach for minimizing the computational complexity of trained Convolutional Neural Networks (ConvNet). The idea is to approximate all elements of a given ConvNet and replace the original convolutional filters and parameters (pooling and bias coefficients; and activation function) with efficient approximations capable of extreme reductions in computational complexity. Low-complexity convolution filters are obtained through a binary (zero-one) linear programming scheme based on the Frobenius norm over sets of dyadic rationals. The resulting matrices allow for multiplication-free computations requiring only addition and bit-shifting operations. Such low-complexity structures pave the way for low-power, efficient hardware designs. We applied our approach on three use cases of different complexity: (i) a "light" but efficient ConvNet for face detection (with around 1000 parameters); (ii) another one for hand-written digit classification (with more than 180000 parameters); and (iii) a significantly larger ConvNet: AlexNet with $\approx$1.2 million matrices. We evaluated the overall performance on the respective tasks for different levels of approximations. In all considered applications, very low-complexity approximations have been derived maintaining an almost equal classification performance.
Submitted 29 July, 2022;
originally announced August 2022.
-
MultiColl package and other packages to detect multicollinearity in R
Authors:
R. Salmerón,
C. B. García,
J. García
Abstract:
This work presents a guide for the use of some of the functions of the multiColl package in R for the detection of near-multicollinearity. The main contribution, in comparison to other existing packages in R or other econometric software, is the treatment of qualitative independent variables and the intercept in the simple/multiple linear regression model. The main goal of this paper is to show the advantages of the multiColl package in R, comparing its results with the results obtained by other existing packages in R for the treatment of multicollinearity.
Submitted 7 July, 2021;
originally announced July 2021.
-
The Raise Regression: Justification, properties and application
Authors:
Román Salmerón Gómez,
Catalina García García,
José García Pérez
Abstract:
Multicollinearity produces an inflation in the variance of the Ordinary Least Squares estimators due to the correlation between two or more independent variables (including the constant term). A widely applied solution is to estimate with penalized estimators (such as the ridge estimator, the Liu estimator, etc.) which exchange the mean square error by the bias. Although the variance diminishes with these procedures, everything seems to indicate that the inference is lost, along with the goodness of fit. Alternatively, the raise regression (\cite{Garcia2011} and \cite{Salmeron2017}) allows the mitigation of the problems generated by multicollinearity without losing the inference and while keeping the coefficient of determination. This paper completely formalizes the raise estimator, summarizing all the previous contributions: its mean square error, the variance inflation factor, the condition number, the adequate selection of the variable to be raised, the successive raising and the relation between the raise and the ridge estimator. As a novelty, the estimation method and the relation between the raise and the residualization are also presented, and the norm of the estimator, the behaviour of the individual and joint significance tests, and the behaviour of the mean square error and the coefficient of variation are analyzed. The usefulness of the raise regression as an alternative to mitigate multicollinearity is illustrated with two empirical applications.
Submitted 29 April, 2021;
originally announced April 2021.
-
Overcoming the inconsistences of the variance inflation factor: a redefined VIF and a test to detect statistical troubling multicollinearity
Authors:
Román Salmerón,
Catalina García,
José García
Abstract:
Multicollinearity is relevant to many different fields where linear regression models are applied, and its existence may affect the analysis of ordinary least squares (OLS) estimators from both the numerical and statistical points of view. Thus, multicollinearity can lead to incoherence in the statistical significance of the independent variables and the global significance of the model. The variance inflation factor (VIF) is traditionally applied to diagnose the possible existence of multicollinearity, but it is not always the case that detection by VIF of a troubling degree of multicollinearity corresponds to negative effects on the statistical analysis. The reason for the lack of specificity of VIF is that there are other factors, such as the size of the sample and the variance of the random disturbance, that can lead to high values of the VIF but not to problematic variance in the OLS estimators (see O'Brien 2007). This paper presents a new variance inflation factor (TVIF) that considers all these additional factors. Thresholds for this new measure and from the index provided by Stewart (1987) are also provided. These thresholds are reinterpreted and presented as a new statistical test to diagnose the existence of statistically troubling multicollinearity. The contributions of this paper are illustrated with two real data examples previously applied in the scientific literature.
Submitted 5 May, 2020;
originally announced May 2020.
-
"multiColl": An R package to detect multicollinearity
Authors:
Román Salmerón,
Catalina García,
José García
Abstract:
This work presents a guide for the use of some of the functions of the R package "multiColl" for the detection of near multicollinearity. The main contribution, in comparison to other existing packages in R or other econometric software, is the treatment of qualitative independent variables and the intercept in the simple/multiple linear regression model.
Submitted 31 October, 2019;
originally announced October 2019.
-
Routine Modeling with Time Series Metric Learning
Authors:
Paul Compagnon,
Grégoire Lefebvre,
Stefan Duffner,
Christophe Garcia
Abstract:
Traditionally, the automatic recognition of human activities is performed with supervised learning algorithms on limited sets of specific activities. This work proposes to recognize recurrent activity patterns, called routines, instead of precisely defined activities. The modeling of routines is defined as a metric learning problem, and an architecture, called SS2S, based on sequence-to-sequence models is proposed to learn a distance between time series. This approach only relies on inertial data and is thus non intrusive and preserves privacy. Experimental results show that a clustering algorithm provided with the learned distance is able to recover daily routines.
Submitted 8 July, 2019;
originally announced July 2019.
-
Sparse estimation for case-control studies with multiple subtypes of cases
Authors:
Nadim Ballout,
Cedric Garcia,
Vivian Viallon
Abstract:
The analysis of case-control studies with several subtypes of cases is increasingly common, e.g. in cancer epidemiology. For matched designs, we show that a natural strategy is based on a stratified conditional logistic regression model. Then, to account for the potential homogeneity among the subtypes of cases, we adapt the ideas of data shared lasso, which has been recently proposed for the estimation of regression models in a stratified setting. For unmatched designs, we compare two standard methods based on L1-norm penalized multinomial logistic regression. We describe formal connections between these two approaches, from which practical guidance can be derived. We show that one of these approaches, which is based on a symmetric formulation of the multinomial logistic regression model, actually reduces to a data shared lasso version of the other. Consequently, the relative performance of the two approaches critically depends on the level of homogeneity that exists among the subtypes of cases: more precisely, when homogeneity is moderate to high, the non-symmetric formulation with controls as the reference is not recommended. Empirical results obtained from synthetic data are presented, which confirm the benefit of properly accounting for potential homogeneity under both matched and unmatched designs. We also present preliminary results from the analysis of a case-control study nested within the EPIC cohort, where the objective is to identify metabolites associated with the occurrence of subtypes of breast cancer.
Submitted 18 January, 2019; v1 submitted 6 January, 2019;
originally announced January 2019.
-
Deep Haar Scattering Networks in Pattern Recognition: A promising approach
Authors:
Fernando Fernandes Neto,
Alemayehu Admasu Solomon,
Rodrigo de Losso,
Claudio Garcia,
Pedro Delano Cavalcanti
Abstract:
The aim of this paper is to discuss the use of Haar scattering networks, a very simple architecture that naturally supports a large number of stacked layers, yet with very few parameters, in a relatively broad set of pattern recognition problems, including regression and classification tasks. This architecture basically consists of stacking convolutional filters, which can be thought of as a generalization of Haar wavelets, followed by non-linear operators which aim to extract symmetries and invariances that are later fed into a classification/regression algorithm. We show that good results can be obtained with the proposed method for both kinds of tasks. We have outperformed the best available algorithms in 4 out of 18 important data classification problems, and have obtained a more robust performance than ARIMA and ETS time series methods in regression problems for data with strong periodicities.
Submitted 29 November, 2018;
originally announced November 2018.
-
Process Control with Highly Left Censored Data
Authors:
Javier Neira Rueda,
Andres Carrion Garcia
Abstract:
The need to monitor industrial processes, detecting changes in process parameters in order to promptly correct problems that may arise, generates a particular area of interest. This is particularly critical and complex when the measured value falls below the sensitivity limits of the measuring system or below detection limits, causing many of the observations to be incomplete. Such observations are called incomplete observations or left-censored data. With a high level of censoring, for example greater than 70%, the application of traditional methods for monitoring processes is not appropriate. Appropriate statistical data analysis techniques are required to assess the actual state of the process at any time. This paper proposes a way to estimate process parameters in such cases and presents the corresponding control chart, derived from an algorithm that is also presented.
Submitted 5 May, 2019; v1 submitted 2 April, 2018;
originally announced April 2018.
-
Non-parametric Estimation of Stochastic Differential Equations with Sparse Gaussian Processes
Authors:
Constantino A. García,
Abraham Otero,
Paulo Félix,
Jesús Presedo,
David G. Márquez
Abstract:
The application of Stochastic Differential Equations (SDEs) to the analysis of temporal data has attracted increasing attention, due to their ability to describe complex dynamics with physically interpretable equations. In this paper, we introduce a non-parametric method for estimating the drift and diffusion terms of SDEs from a densely observed discrete time series. The use of Gaussian processes as priors permits working directly in a function-space view and thus the inference takes place directly in this space. To cope with the computational complexity that requires the use of Gaussian processes, a sparse Gaussian process approximation is provided. This approximation permits the efficient computation of predictions for the drift and diffusion terms by using a distribution over a small subset of pseudo-samples. The proposed method has been validated using both simulated data and real data from economy and paleoclimatology. The application of the method to real data demonstrates its ability to capture the behaviour of complex systems.
Submitted 10 July, 2017; v1 submitted 14 April, 2017;
originally announced April 2017.
-
A new algorithm for wavelet-based heart rate variability analysis
Authors:
Constantino A. García,
Abraham Otero,
Xosé Vila,
David G. Márquez
Abstract:
One of the most promising non-invasive markers of the activity of the autonomic nervous system is Heart Rate Variability (HRV). HRV analysis toolkits often provide spectral analysis techniques using the Fourier transform, which assumes that the heart rate series is stationary. To overcome this issue, the Short Time Fourier Transform is often used (STFT). However, the wavelet transform is thought to be a more suitable tool for analyzing non-stationary signals than the STFT. Given the lack of support for wavelet-based analysis in HRV toolkits, such analysis must be implemented by the researcher. This has made this technique underutilized. This paper presents a new algorithm to perform HRV power spectrum analysis based on the Maximal Overlap Discrete Wavelet Packet Transform (MODWPT). The algorithm calculates the power in any spectral band with a given tolerance for the band's boundaries. The MODWPT decomposition tree is pruned to avoid calculating unnecessary wavelet coefficients, thereby optimizing execution time. The center of energy shift correction is applied to achieve optimum alignment of the wavelet coefficients. This algorithm has been implemented in RHRV, an open-source package for HRV analysis. To the best of our knowledge, RHRV is the first HRV toolkit with support for wavelet-based spectral analysis.
Submitted 19 November, 2014;
originally announced November 2014.