-
EarlyStopping: Implicit Regularization for Iterative Learning Procedures in Python
Authors:
Eric Ziebell,
Ratmir Miftachov,
Bernhard Stankewitz,
Laura Hucker
Abstract:
Iterative learning procedures are ubiquitous in machine learning and modern statistics.
Regularisation is typically required to prevent inflating the expected loss of a procedure in
later iterations via the propagation of noise inherent in the data.
Significant emphasis has been placed on achieving this regularisation implicitly by stopping
procedures early.
The EarlyStopping-package provides a toolbox of (in-sample) sequential early stopping rules for
several well-known iterative estimation procedures, such as truncated SVD, Landweber (gradient
descent), conjugate gradient descent, L2-boosting and regression trees.
One of the central features of the package is that the algorithms allow the specification of the
true data-generating process and keep track of relevant theoretical quantities.
In this paper, we detail the principles governing the implementation of the EarlyStopping-package and provide
a survey of recent foundational advances in the theoretical literature.
We demonstrate how to use the EarlyStopping-package to explore core features of implicit regularisation
and replicate results from the literature.
Submitted 20 March, 2025;
originally announced March 2025.
-
Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression
Authors:
Bernhard Stankewitz,
Botond Szabo
Abstract:
Due to their flexibility and theoretical tractability, Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix $ K + σ^2 I $, which requires up to $ O(n^3) $ operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limited theoretical underpinning.
In this context, we analyze a class of recently proposed approximation algorithms from the field of probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large-scale applications, as they are fundamentally only based on matrix-vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine results from the numerical analysis literature with state-of-the-art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.
Submitted 18 June, 2024;
originally announced June 2024.
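The conjugate gradient approximation of the posterior mean mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the RBF kernel, the toy data, and the noise variance below are all assumptions chosen for illustration. The point is only that the solver touches the augmented kernel matrix $ K + σ^2 I $ through matrix-vector products, and that stopping after few iterations gives an approximate posterior mean.

```python
import math


def conjugate_gradient(matvec, b, max_iters, tol=1e-20):
    """Approximately solve A x = b for symmetric positive definite A.

    A is accessed only through matvec(v) = A v. Iteration stops after
    max_iters steps or once the squared residual norm drops below tol.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        if rs_old <= tol:
            break
        Ap = matvec(p)
        step = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def rbf_kernel(x, z, length_scale=1.0):
    # Assumed kernel for the toy example; any positive definite kernel works.
    return math.exp(-0.5 * ((x - z) / length_scale) ** 2)


# Toy data (hypothetical): noiseless sine values at five design points.
inputs = [0.0, 0.5, 1.0, 1.5, 2.0]
targets = [math.sin(x) for x in inputs]
noise_var = 0.1  # plays the role of sigma^2
K = [[rbf_kernel(a, b) for b in inputs] for a in inputs]


def kernel_matvec(v):
    # v -> K v
    return [sum(K[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]


def augmented_matvec(v):
    # v -> (K + sigma^2 I) v
    return [kv + noise_var * vi for kv, vi in zip(kernel_matvec(v), v)]


# Few iterations: cheap approximation. Many iterations: essentially exact.
weights_approx = conjugate_gradient(augmented_matvec, targets, max_iters=2)
weights_full = conjugate_gradient(augmented_matvec, targets, max_iters=25)

# Posterior mean at the design points: K (K + sigma^2 I)^{-1} y.
mean_approx = kernel_matvec(weights_approx)
mean_full = kernel_matvec(weights_full)

# Residual of the linear system for the (nearly) converged solution.
residual_full = [
    av - t for av, t in zip(augmented_matvec(weights_full), targets)
]
```

Comparing `mean_approx` with `mean_full` shows how much accuracy the truncated run gives up on this toy problem; the abstract's contraction rates concern how such approximations behave as the sample size grows.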
-
Early stopping for $ L^2 $-boosting in high-dimensional linear models
Authors:
Bernhard Stankewitz
Abstract:
Increasingly high-dimensional data sets require that estimation methods not only satisfy statistical guarantees but also remain computationally feasible. In this context, we consider $ L^{2} $-boosting via orthogonal matching pursuit in a high-dimensional linear model and analyze a data-driven early stopping time $ τ $ of the algorithm, which is sequential in the sense that its computation is based on the first $ τ $ iterations only. This approach is much less costly than established model selection criteria that require the computation of the full boosting path. We prove that sequential early stopping preserves statistical optimality in this setting in terms of a fully general oracle inequality for the empirical risk and recently established optimal convergence rates for the population risk. Finally, an extensive simulation study shows that at an immensely reduced computational cost, the performance of these types of methods is on par with other state-of-the-art algorithms such as the cross-validated Lasso or model selection via a high-dimensional Akaike criterion based on the full boosting path.
Submitted 14 October, 2022;
originally announced October 2022.
-
From inexact optimization to learning via gradient concentration
Authors:
Bernhard Stankewitz,
Nicole Mücke,
Lorenzo Rosasco
Abstract:
Optimization in machine learning typically deals with the minimization of empirical objectives defined by training data. However, the ultimate goal of learning is to minimize the error on future data (test error), for which the training data provides only partial information. In this view, the optimization problems that are practically feasible are based on inexact quantities that are stochastic in nature. In this paper, we show how probabilistic results, specifically gradient concentration, can be combined with results from inexact optimization to derive sharp test error guarantees. By considering unconstrained objectives we highlight the implicit regularization properties of optimization for learning.
Submitted 5 November, 2021; v1 submitted 9 June, 2021;
originally announced June 2021.
-
Smoothed residual stopping for statistical inverse problems via truncated SVD estimation
Authors:
Bernhard Stankewitz
Abstract:
This work examines under what circumstances adaptivity for truncated SVD estimation can be achieved by an early stopping rule based on the smoothed residuals $ \| (A A^{\top})^{α/2} (Y - A \hat{μ}^{(m)}) \|^{2} $. Lower and upper bounds for the risk are derived, which show that moderate smoothing of the residuals can be used to adapt over classes of signals with varying smoothness, while oversmoothing yields suboptimal convergence rates. The theoretical results are illustrated by Monte Carlo simulations.
Submitted 30 August, 2020; v1 submitted 30 September, 2019;
originally announced September 2019.
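To make the smoothed residual in this abstract concrete, here is a minimal sketch under a simplifying assumption: the operator $ A $ is diagonal with singular values $ λ_i $, so the truncated SVD estimator keeps the first $ m $ coefficients $ Y_i / λ_i $ and the smoothed residual reduces to $ \sum_{i > m} λ_i^{2α} Y_i^{2} $. The function name, the threshold `kappa`, and the toy numbers are hypothetical and not taken from the paper; the rule shown (stop at the first $ m $ whose smoothed residual falls below a threshold) is one natural reading of a residual-based stopping rule.

```python
def smoothed_residual_stop(observations, singular_values, alpha, kappa):
    """Return the first truncation level m whose smoothed residual is <= kappa.

    Diagonal model (an assumption of this sketch): the residual of the
    truncated SVD estimator at level m has entries Y_i for i > m and 0
    otherwise, so the smoothed squared residual is
        sum_{i > m} lambda_i^(2 * alpha) * Y_i^2.
    """
    weighted = [
        (lam ** (2 * alpha)) * y * y
        for lam, y in zip(singular_values, observations)
    ]
    residual = sum(weighted)  # smoothed residual at level m = 0
    for m in range(len(weighted)):
        if residual <= kappa:
            return m
        # Moving from level m to m + 1 removes the (m + 1)-th component.
        residual -= weighted[m]
    return len(weighted)


# Toy input (hypothetical): polynomially decaying singular values.
singular_values = [1.0 / i for i in range(1, 7)]
observations = [1.0, 0.5, 0.25, 0.1, 0.1, 0.1]

stop_plain = smoothed_residual_stop(observations, singular_values, alpha=0, kappa=0.05)
stop_smoothed = smoothed_residual_stop(observations, singular_values, alpha=1, kappa=0.05)
```

With `alpha=0` the rule uses the plain residual; with `alpha=1` the later, noisier components are downweighted by $ λ_i^{2} $. On this toy input and with the same threshold, the smoothed rule stops earlier, but in practice the threshold would be chosen as a function of $ α $ and the noise level, which this sketch does not attempt.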