-
Functional Partial Least-Squares: Adaptive Estimation and Inference
Authors:
Andrii Babii,
Marine Carrasco,
Idriss Tsafack
Abstract:
We study the functional linear regression model with a scalar response and a Hilbert space-valued predictor, a canonical example of an ill-posed inverse problem. We show that the functional partial least squares (PLS) estimator attains nearly minimax-optimal convergence rates over a class of ellipsoids and propose an adaptive early stopping procedure for selecting the number of PLS components. In addition, we develop a new test that can detect local alternatives converging at the parametric rate and that can be inverted to construct confidence sets. Simulation results demonstrate that the estimator performs favorably relative to several existing methods and that the proposed test exhibits good power properties. We apply our methodology to evaluate the nonlinear effects of temperature on corn and soybean yields.
Submitted 7 May, 2025; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Tensor PCA for Factor Models
Authors:
Andrii Babii,
Eric Ghysels,
Junsu Pan
Abstract:
Modern empirical analysis often relies on high-dimensional panel datasets with non-negligible cross-sectional and time-series correlations. Factor models are natural for capturing such dependencies. A tensor factor model describes the $d$-dimensional panel as a sum of a reduced rank component and an idiosyncratic noise, generalizing traditional factor models for two-dimensional panels. We consider a tensor factor model corresponding to the notion of a reduced multilinear rank of a tensor. We show that for a strong factor model, a simple tensor principal component analysis algorithm is optimal for estimating factors and loadings. When the factors are weak, the convergence rate of simple TPCA can be improved with alternating least-squares iterations. We also provide inferential results for factors and loadings and propose the first test to select the number of factors. The new tools are applied to the problem of imputing missing values in a multidimensional panel of firm characteristics.
Submitted 6 March, 2025; v1 submitted 25 December, 2022;
originally announced December 2022.
-
Binary Choice with Asymmetric Loss in a Data-Rich Environment: Theory and an Application to Racial Justice
Authors:
Andrii Babii,
Xi Chen,
Eric Ghysels,
Rohit Kumar
Abstract:
We study the binary choice problem in a data-rich environment with asymmetric loss functions. The econometrics literature covers nonparametric binary choice problems but does not offer computationally attractive solutions in data-rich environments. The machine learning literature has many algorithms but is focused mostly on loss functions that are independent of covariates. We show that theoretically valid decisions on binary outcomes with general loss functions can be achieved via a very simple loss-based reweighting of the logistic regression or state-of-the-art machine learning techniques. We apply our analysis to racial justice in pretrial detention.
Submitted 6 November, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Machine Learning Panel Data Regressions with Heavy-tailed Dependent Data: Theory and Application
Authors:
Andrii Babii,
Ryan T. Ball,
Eric Ghysels,
Jonas Striaukas
Abstract:
The paper introduces structured machine learning regressions for heavy-tailed dependent panel data potentially sampled at different frequencies. We focus on the sparse-group LASSO regularization. This type of regularization can take advantage of the mixed-frequency time series panel data structures and improve the quality of the estimates. We obtain oracle inequalities for the pooled and fixed effects sparse-group LASSO panel data estimators, recognizing that financial and economic data can have fat tails. To that end, we leverage a new Fuk-Nagaev concentration inequality for panel data consisting of heavy-tailed $τ$-mixing processes.
Submitted 22 November, 2021; v1 submitted 8 August, 2020;
originally announced August 2020.
-
High-dimensional mixed-frequency IV regression
Authors:
Andrii Babii
Abstract:
This paper introduces a high-dimensional linear IV regression for data sampled at mixed frequencies. We show that the high-dimensional slope parameter of a high-frequency covariate can be identified and accurately estimated by leveraging a low-frequency instrumental variable. The distinguishing feature of the model is that it allows handling high-dimensional datasets without imposing the approximate sparsity restrictions. We propose a Tikhonov-regularized estimator and derive the convergence rate of its mean-integrated squared error for time series data. The estimator has a closed-form expression that is easy to compute and demonstrates excellent performance in our Monte Carlo experiments. We estimate the real-time price elasticity of supply on the Australian electricity spot market. Our estimates suggest that the supply is relatively inelastic and that its elasticity is heterogeneous throughout the day.
Submitted 30 March, 2020;
originally announced March 2020.
-
High-Dimensional Granger Causality Tests with an Application to VIX and News
Authors:
Andrii Babii,
Eric Ghysels,
Jonas Striaukas
Abstract:
We study Granger causality testing for high-dimensional time series using regularized regressions. To perform proper inference, we rely on heteroskedasticity and autocorrelation consistent (HAC) estimation of the asymptotic variance and develop the inferential theory in the high-dimensional setting. To recognize the time series data structures, we focus on the sparse-group LASSO estimator, which includes the LASSO and the group LASSO as special cases. We establish the debiased central limit theorem for low-dimensional groups of regression coefficients and study the HAC estimator of the long-run variance based on the sparse-group LASSO residuals. This leads to valid time series inference for individual regression coefficients as well as groups, including Granger causality tests. The treatment relies on a new Fuk-Nagaev inequality for a class of $τ$-mixing processes with heavier-than-Gaussian tails, which is of independent interest. In an empirical application, we study the Granger causal relationship between the VIX and financial news.
Submitted 1 February, 2021; v1 submitted 12 December, 2019;
originally announced December 2019.
-
Isotonic Regression Discontinuity Designs
Authors:
Andrii Babii,
Rohit Kumar
Abstract:
This paper studies estimation and inference for the isotonic regression at the boundary point, an object that is particularly interesting and required in the analysis of monotone regression discontinuity designs. We show that the isotonic regression is inconsistent in this setting and derive the asymptotic distributions of boundary corrected estimators. Interestingly, the boundary corrected estimators can be bootstrapped without subsampling or additional nonparametric smoothing, which is not the case for the interior point. The Monte Carlo experiments indicate that shape restrictions can dramatically improve the finite-sample performance of unrestricted estimators. Lastly, we apply the isotonic regression discontinuity designs to estimate the causal effect of incumbency in the U.S. House elections.
Submitted 19 December, 2020; v1 submitted 15 August, 2019;
originally announced August 2019.
-
Is completeness necessary? Estimation in nonidentified linear models
Authors:
Andrii Babii,
Jean-Pierre Florens
Abstract:
Modern data analysis depends increasingly on estimating models via flexible high-dimensional or nonparametric machine learning methods, where the identification of structural parameters is often challenging and untestable. In linear settings, this identification hinges on the completeness condition, which requires the nonsingularity of a high-dimensional matrix or operator and may fail for finite samples or even at the population level. Regularized estimators provide a solution by enabling consistent estimation of structural or average structural functions, sometimes even under identification failure. We show that the asymptotic distribution in these cases can be nonstandard. We develop a comprehensive theory of regularized estimators, which include methods such as high-dimensional ridge regularization, gradient descent, and principal component analysis (PCA). The results are illustrated for high-dimensional and nonparametric instrumental variable regressions and are supported through simulation experiments.
Submitted 6 January, 2025; v1 submitted 11 September, 2017;
originally announced September 2017.
-
Are Unobservables Separable?
Authors:
Andrii Babii,
Jean-Pierre Florens
Abstract:
It is common to assume in empirical research that observables and unobservables are additively separable, especially when the former are endogenous. This is done because it is widely recognized that identification and estimation challenges arise when interactions between the two are allowed for. Starting from a nonseparable IV model, where the instrumental variable is independent of unobservables, we develop a novel nonparametric test of separability of unobservables. The large-sample distribution of the test statistic is nonstandard and relies on a novel Donsker-type central limit theorem for the empirical distribution of nonparametric IV residuals, which may be of independent interest. Using a dataset drawn from the 2015 US Consumer Expenditure Survey, we find that the test rejects separability in Engel curves for most of the commodities.
Submitted 31 March, 2021; v1 submitted 3 May, 2017;
originally announced May 2017.
-
Honest Confidence Sets in Nonparametric IV Regression and Other Ill-Posed Models
Authors:
Andrii Babii
Abstract:
This paper develops inferential methods for a very general class of ill-posed models in econometrics encompassing the nonparametric instrumental variable regression, various functional regressions, and density deconvolution. We focus on uniform confidence sets for the parameter of interest estimated with Tikhonov regularization, as in Darolles, Fan, Florens, and Renault (2011). Since it is impossible to have inferential methods based on the central limit theorem, we develop two alternative approaches relying on the concentration inequality and bootstrap approximations. We show that expected diameters and coverage properties of the resulting sets have uniform validity over a large class of models, i.e., the constructed confidence sets are honest. Monte Carlo experiments illustrate that the introduced confidence sets have reasonable width and coverage properties. Using U.S. data, we provide uniform confidence sets for Engel curves for various commodities.
Submitted 19 December, 2020; v1 submitted 9 November, 2016;
originally announced November 2016.