Search | arXiv e-print repository

Criterion Collapse and Loss Distribution Control

Abstract: In this work, we consider the notion of "criterion collapse," in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literat… ▽ More In this work, we consider the notion of "criterion collapse," in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literature (Flooding, SoftAD). We show how collapse in the context of losses with a Bernoulli distribution goes far beyond existing results for CVaR and DRO, then expand our scope to include surrogate losses, showing conditions where monotonic criteria such as tilted ERM cannot avoid collapse, whereas non-monotonic alternatives can. △ Less

Submitted 21 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: Revised version accepted to ICML 2024

arXiv:2310.10006 [pdf, other]

Soft ascent-descent as a stable and flexible alternative to flooding

Authors: Matthew J. Holland, Kosuke Nakatani

Abstract: As a heuristic for improving test accuracy in classification, the "flooding" method proposed by Ishida et al. (2020) sets a threshold for the average surrogate loss at training time; above the threshold, gradient descent is run as usual, but below the threshold, a switch to gradient ascent is made. While setting the threshold is non-trivial and is usually done with validation data, this simple tec… ▽ More As a heuristic for improving test accuracy in classification, the "flooding" method proposed by Ishida et al. (2020) sets a threshold for the average surrogate loss at training time; above the threshold, gradient descent is run as usual, but below the threshold, a switch to gradient ascent is made. While setting the threshold is non-trivial and is usually done with validation data, this simple technique has proved remarkably effective in terms of accuracy. On the other hand, what if we are also interested in other metrics such as model complexity or average surrogate loss at test time? As an attempt to achieve better overall performance with less fine-tuning, we propose a softened, pointwise mechanism called SoftAD (soft ascent-descent) that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead. We contrast formal stationarity guarantees with those for flooding, and empirically demonstrate how SoftAD can realize classification accuracy competitive with flooding (and the more expensive alternative SAM) while enjoying a much smaller loss generalization gap and model norm. △ Less

Submitted 21 October, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

Comments: Revised version accepted to NeurIPS 2024

arXiv:2301.11584 [pdf, other]

Robust variance-regularized risk minimization with concomitant scaling

Authors: Matthew J. Holland

Abstract: Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in tradit… ▽ More Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets. △ Less

Submitted 8 February, 2024; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: Revised version accepted to AISTATS 2024

arXiv:2203.14434 [pdf, other]

Flexible risk design using bi-directional dispersion

Authors: Matthew J. Holland

Abstract: Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is off… ▽ More Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners. △ Less

Submitted 16 February, 2023; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: Final revision, just minor typos corrected for camera-ready at AISTATS 2023

arXiv:2110.04996 [pdf, ps, other]

doi 10.1613/jair.1.15000

A Survey of Learning Criteria Going Beyond the Usual Risk

Authors: Matthew J. Holland, Kazuki Tanabe

Abstract: Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, w… ▽ More Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss. △ Less

Submitted 29 November, 2023; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: Final version published in JAIR

Journal ref: Journal of Artificial Intelligence Research, 78:781-821, 2023

arXiv:2105.11135 [pdf, other]

doi 10.1609/aaai.v36i6.20649

Robust learning with anytime-guaranteed feedback

Authors: Matthew J. Holland

Abstract: Under data distributions which may be heavy-tailed, many stochastic gradient-based learning algorithms are driven by feedback queried at points with almost no performance guarantees on their own. Here we explore a modified "anytime online-to-batch" mechanism which for smooth objectives admits high-probability error bounds while requiring only lower-order moment bounds on the stochastic gradients.… ▽ More Under data distributions which may be heavy-tailed, many stochastic gradient-based learning algorithms are driven by feedback queried at points with almost no performance guarantees on their own. Here we explore a modified "anytime online-to-batch" mechanism which for smooth objectives admits high-probability error bounds while requiring only lower-order moment bounds on the stochastic gradients. Using this conversion, we can derive a wide variety of "anytime robust" procedures, for which the task of performance analysis can be effectively reduced to regret control, meaning that existing regret bounds (for the bounded gradient case) can be robustified and leveraged in a straightforward manner. As a direct takeaway, we obtain an easily implemented stochastic gradient-based algorithm for which all queried points formally enjoy sub-Gaussian error bounds, and in practice show noteworthy gains on real-world data applications. △ Less

Submitted 24 May, 2021; originally announced May 2021.

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6918-6925, 2022

arXiv:2105.04816 [pdf, other]

Spectral risk-based learning using unbounded losses

Authors: Matthew J. Holland, El Mehdi Haress

Abstract: In this work, we consider the setting of learning problems under a wide class of spectral risk (or "L-risk") functions, where a Lipschitz-continuous spectral density is used to flexibly assign weight to extreme loss values. We obtain excess risk guarantees for a derivative-free learning procedure under unbounded heavy-tailed loss distributions, and propose a computationally efficient implementatio… ▽ More In this work, we consider the setting of learning problems under a wide class of spectral risk (or "L-risk") functions, where a Lipschitz-continuous spectral density is used to flexibly assign weight to extreme loss values. We obtain excess risk guarantees for a derivative-free learning procedure under unbounded heavy-tailed loss distributions, and propose a computationally efficient implementation which empirically outperforms traditional risk minimizers in terms of balancing spectral risk and misclassification error. △ Less

Submitted 11 May, 2021; originally announced May 2021.

arXiv:2012.07346 [pdf, other]

Better scalability under potentially heavy-tailed feedback

Authors: Matthew J. Holland

Abstract: We study scalable alternatives to robust gradient descent (RGD) techniques that can be used when the losses and/or gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we instead focus computational effort… ▽ More We study scalable alternatives to robust gradient descent (RGD) techniques that can be used when the losses and/or gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we instead focus computational effort on robustly choosing (or newly constructing) a strong candidate based on a collection of cheap stochastic sub-processes which can be run in parallel. The exact selection process depends on the convexity of the underlying objective, but in all cases, our selection technique amounts to a robust form of boosting the confidence of weak learners. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data, along with applications to a variety of benchmark datasets. The overall take-away is an extensible procedure that is simple to implement, trivial to parallelize, which keeps the formal merits of RGD methods but scales much better to large learning problems. △ Less

Submitted 14 December, 2020; originally announced December 2020.

Comments: This work merges arXiv:2006.00784 and arXiv:2006.01364, providing additional empirical analysis using real-world benchmark datasets

arXiv:2012.02424 [pdf, other]

doi 10.1007/s10994-022-06217-5

Learning with risks based on M-location

Authors: Matthew J. Holland

Abstract: In this work, we study a new class of risks defined in terms of the location and deviation of the loss distribution, generalizing far beyond classical mean-variance risk functions. The class is easily implemented as a wrapper around any smooth loss, it admits finite-sample stationarity guarantees for stochastic gradient methods, it is straightforward to interpret and adjust, with close links to M-… ▽ More In this work, we study a new class of risks defined in terms of the location and deviation of the loss distribution, generalizing far beyond classical mean-variance risk functions. The class is easily implemented as a wrapper around any smooth loss, it admits finite-sample stationarity guarantees for stochastic gradient methods, it is straightforward to interpret and adjust, with close links to M-estimators of the loss location, and has a salient effect on the test loss distribution. △ Less

Submitted 25 April, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: Substantial update to initial version; refined theory, improved exposition, added experimental analysis

Journal ref: Machine Learning, 111:4679-4718, 2022

arXiv:2007.04486 [pdf, other]

Making learning more transparent using conformalized performance prediction

Authors: Matthew J. Holland

Abstract: In this work, we study some novel applications of conformal inference techniques to the problem of providing machine learning procedures with more transparent, accurate, and practical performance guarantees. We provide a natural extension of the traditional conformal prediction framework, done in such a way that we can make valid and well-calibrated predictive statements about the future performan… ▽ More In this work, we study some novel applications of conformal inference techniques to the problem of providing machine learning procedures with more transparent, accurate, and practical performance guarantees. We provide a natural extension of the traditional conformal prediction framework, done in such a way that we can make valid and well-calibrated predictive statements about the future performance of arbitrary learning algorithms, when passed an as-yet unseen training set. In addition, we include some nascent empirical examples to illustrate potential applications. △ Less

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.02001 [pdf, other]

Learning with CVaR-based feedback under potentially heavy tails

Authors: Matthew J. Holland, El Mehdi Haress

Abstract: We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not ch… ▽ More We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes. For this procedure we provide high-probability excess CVaR bounds, and to complement the theory we conduct empirical tests of the underlying CVaR estimator and the learning algorithm derived from it. △ Less

Submitted 2 June, 2020; originally announced June 2020.

arXiv:2006.01364

Improved scalability under heavy tails, without strong convexity

Authors: Matthew J. Holland

Abstract: Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust v… ▽ More Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust validation sub-routine, which is used to boost the confidence of inexpensive gradient-based sub-processes. Compared with recent robust gradient descent methods from the literature, dimension dependence (both risk bounds and cost) is substantially improved, without relying upon strong convexity or expensive per-step robustification. Empirically, we also show that under heavy-tailed losses, the proposed procedure cannot simply be replaced with naive cross-validation. Taken together, we have a scalable method with transparent guarantees, which performs well without prior knowledge of how "convenient" the feedback it receives will be. △ Less

Submitted 14 December, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: This paper has been superseded by arXiv:2012.07346 (a merge and extension of this article and arXiv:2006.00784)

arXiv:2006.00784

Better scalability under potentially heavy-tailed gradients

Authors: Matthew J. Holland

Abstract: We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too fa… ▽ More We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too far from the majority of cheap stochastic sub-processes run for a single pass over partitioned data. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement, trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems. △ Less

Submitted 14 December, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: This paper has been superseded by arXiv:2012.07346 (a merge and extension of this article and arXiv:2006.01364)

arXiv:1905.07900 [pdf, other]

PAC-Bayes under potentially heavy tails

Authors: Matthew J. Holland

Abstract: We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the first three moments of the loss d… ▽ More We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the first three moments of the loss distribution are bounded, the learning algorithm derived from this estimator achieves nearly sub-Gaussian statistical error, up to the quality of the prior. △ Less

Submitted 18 December, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

arXiv:1810.06207 [pdf, other]

Robust descent using smoothed multiplicative noise

Authors: Matthew J. Holland

Abstract: To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying lo… ▽ More To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying loss, but typically suffer from a large computational overhead and substantial bias when the data happens to be sub-Gaussian, limiting their utility. In this work, we propose a novel robust gradient descent procedure which makes use of a smoothed multiplicative noise applied directly to observations before constructing a sum of soft-truncated gradient coordinates. We show that the procedure has competitive theoretical guarantees, with the major advantage of a simple implementation that does not require an iterative sub-routine for robustification. Empirical tests reinforce the theory, showing more efficient generalization over a much wider class of data distributions. △ Less

Submitted 15 October, 2018; originally announced October 2018.

arXiv:1810.04863 [pdf, other]

Classification using margin pursuit

Authors: Matthew J. Holland

Abstract: In this work, we study a new approach to optimizing the margin distribution realized by binary classifiers. The classical approach to this problem is simply maximization of the expected margin, while more recent proposals consider simultaneous variance control and proxy objectives based on robust location estimates, in the vein of keeping the margin distribution sharply concentrated in a desirable… ▽ More In this work, we study a new approach to optimizing the margin distribution realized by binary classifiers. The classical approach to this problem is simply maximization of the expected margin, while more recent proposals consider simultaneous variance control and proxy objectives based on robust location estimates, in the vein of keeping the margin distribution sharply concentrated in a desirable region. While conceptually appealing, these new approaches are often computationally unwieldy, and theoretical guarantees are limited. Given this context, we propose an algorithm which searches the hypothesis space in such a way that a pre-set "margin level" ends up being a distribution-robust estimator of the margin location. This procedure is easily implemented using gradient descent, and admits finite-sample bounds on the excess risk under unbounded inputs. Empirical tests on real-world benchmark data reinforce the basic principles highlighted by the theory, and are suggestive of a promising new technique for classification. △ Less

Submitted 11 October, 2018; originally announced October 2018.

arXiv:1706.00182 [pdf, other]

Efficient learning with robust gradient descent

Authors: Matthew J. Holland, Kazushi Ikeda

Abstract: Minimizing the empirical risk is a popular training strategy, but for learning tasks where the data may be noisy or heavy-tailed, one may require many observations in order to generalize well. To achieve better performance under less stringent requirements, we introduce a procedure which constructs a robust approximation of the risk gradient for use in an iterative learning routine. Using high-pro… ▽ More Minimizing the empirical risk is a popular training strategy, but for learning tasks where the data may be noisy or heavy-tailed, one may require many observations in order to generalize well. To achieve better performance under less stringent requirements, we introduce a procedure which constructs a robust approximation of the risk gradient for use in an iterative learning routine. Using high-probability bounds on the excess risk of this algorithm, we show that our update does not deviate far from the ideal gradient-based update. Empirical tests using both controlled simulations and real-world benchmark data show that in diverse settings, the proposed procedure can learn more efficiently, using less resources (iterations and observations) while generalizing better. △ Less

Submitted 14 October, 2018; v1 submitted 1 June, 2017; originally announced June 2017.

Showing 1–17 of 17 results for author: Holland, M J