-
On the symmetric $q$-analog on the bi-univalent functions with respect to symmetric points
Authors:
Pinhong Long,
Huili Han,
Halit Orhan,
Huo Tang
Abstract:
Our objective is to introduce and investigate the subclass $\widetilde{\mathcal{S^{*}_{\sum}}}^η_{q}(μ,λ;φ)$ of the function class $\sum$ of analytic and bi-univalent functions related to the symmetric $q$-derivative operator and the generalized Bernardi integral operator. On the one hand, without the generalized Bernardi integral operator, we estimate the second Hankel determinants for the reduced subclasses $\widetilde{\mathcal{S^{*}_{\sum}}}_{q}(λ;φ)$ with respect to symmetric points. On the other hand, we also give the corresponding results for the Fekete-Szegö functional inequalities and the upper bounds of the coefficients $a_2$ and $a_3$ for these subclasses.
Submitted 15 December, 2023;
originally announced December 2023.
-
The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
Authors:
Peter L. Bartlett,
Philip M. Long,
Olivier Bousquet
Abstract:
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence.
In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative -- the derivative of the Hessian in the leading eigenvector direction -- that encourages drift toward wider minima.
Submitted 11 April, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Deep Linear Networks can Benignly Overfit when Shallow Ones Do
Authors:
Niladri S. Chatterji,
Philip M. Long
Abstract:
We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.
Submitted 6 February, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Foolish Crowds Support Benign Overfitting
Authors:
Niladri S. Chatterji,
Philip M. Long
Abstract:
We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the "wisdom of the crowd", except here the harm arising from fitting the $\textit{noise}$ is ameliorated by spreading it among many directions -- the variance reduction arises from a $\textit{foolish}$ crowd.
Submitted 17 March, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks
Authors:
Niladri S. Chatterji,
Philip M. Long,
Peter L. Bartlett
Abstract:
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization and the properties of the data covariance matrix in achieving low excess risk.
Submitted 9 September, 2022; v1 submitted 25 August, 2021;
originally announced August 2021.
-
When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?
Authors:
Niladri S. Chatterji,
Philip M. Long,
Peter L. Bartlett
Abstract:
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
Submitted 1 July, 2021; v1 submitted 9 February, 2021;
originally announced February 2021.
-
When does gradient descent with logistic loss find interpolating two-layer networks?
Authors:
Niladri S. Chatterji,
Philip M. Long,
Peter L. Bartlett
Abstract:
We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
Submitted 1 July, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Failures of model-dependent generalization bounds for least-norm interpolation
Authors:
Peter L. Bartlett,
Philip M. Long
Abstract:
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose -- it can be bounded below by a constant when the true excess risk goes to zero.
Submitted 20 January, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime
Authors:
Niladri S. Chatterji,
Philip M. Long
Abstract:
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
Submitted 1 June, 2021; v1 submitted 24 April, 2020;
originally announced April 2020.
-
On the Global Convergence of Training Deep Linear ResNets
Authors:
Difan Zou,
Philip M. Long,
Quanquan Gu
Abstract:
We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu 2019), our condition on the neural network width is sharper by a factor of $O(κL)$, where $κ$ denotes the condition number of the covariance matrix of the training data. We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the input and output dimensions, respectively.
Submitted 2 March, 2020;
originally announced March 2020.
-
Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms
Authors:
Niladri S. Chatterji,
Peter L. Bartlett,
Philip M. Long
Abstract:
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms.
We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\varepsilon$ away from the target in total variation distance if the number of gradient queries is less than $Ω(σ^2 d/\varepsilon^2)$, where $σ^2 d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.
Submitted 3 July, 2021; v1 submitted 1 February, 2020;
originally announced February 2020.
-
Benign Overfitting in Linear Regression
Authors:
Peter L. Bartlett,
Philip M. Long,
Gábor Lugosi,
Alexander Tsigler
Abstract:
The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when the data lies in a finite dimensional space whose dimension grows faster than the sample size.
Submitted 29 January, 2020; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Generalization bounds for deep convolutional neural networks
Authors:
Philip M. Long,
Hanie Sedghi
Abstract:
We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps.
Submitted 8 April, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
On the effect of the activation function on the distribution of hidden nodes in a deep network
Authors:
Philip M. Long,
Hanie Sedghi
Abstract:
We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$. We show that, if the activation function $φ$ satisfies a minimal set of assumptions, satisfied by all activation functions that we know to be used in practice, then, as the width of the network gets large, the `length process' converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases, and the activation function $φ$. We also show that this convergence may fail for $φ$ that violate our assumptions.
Submitted 7 January, 2019;
originally announced January 2019.
-
A characterization of rough fractional type integral operators and Campanato estimates for their commutators on the variable exponent vanishing generalized Morrey spaces
Authors:
Ferit Gürbüz,
Shenghu Ding,
Huili Han,
Pinhong Long
Abstract:
In this paper, applying some properties of variable exponent analysis, we first consider Adams type and Spanne type estimates for a class of fractional type integral operators of variable orders, and then obtain variable exponent generalized Campanato estimates for the corresponding commutators on the vanishing generalized Morrey spaces $VL_{Π}^{p\left( \cdot \right) ,w\left( \cdot \right) }\left( E\right) $ with variable exponent $p(\cdot )$ and bounded set $E$.
Submitted 16 November, 2018;
originally announced November 2018.
-
Learning Sums of Independent Random Variables with Sparse Collective Support
Authors:
Anindya De,
Philip M. Long,
Rocco A. Servedio
Abstract:
We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For $\mathcal{A} \subset \mathbf{Z}_{+}$, a sum of independent random variables with collective support $\mathcal{A}$ (called an $\mathcal{A}$-sum in this paper) is a distribution $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$ where the $\mathbf{X}_i$'s are mutually independent (but not necessarily identically distributed) integer random variables with $\cup_i \mathsf{supp}(\mathbf{X}_i) \subseteq \mathcal{A}.$ We give two main algorithmic results for learning such distributions:
1. For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $ε$ that uses $\mathsf{poly}(1/ε)$ samples and runs in time $\mathsf{poly}(1/ε)$, independent of $N$ and of the elements of $\mathcal{A}$.
2. For an arbitrary constant $k \geq 4$, if $\mathcal{A} = \{ a_1,...,a_k\}$ with $0 \leq a_1 < ... < a_k$, we give an algorithm that uses $\mathsf{poly}(1/ε) \cdot \log \log a_k$ samples (independent of $N$) and runs in time $\mathsf{poly}(1/ε, \log a_k).$
We prove an essentially matching lower bound: if $|\mathcal{A}| = 4$, then any algorithm must use $Ω(\log \log a_4) $ samples even for learning to constant accuracy. We also give similar-in-spirit (but quantitatively very different) algorithmic results, and essentially matching lower bounds, for the case in which $\mathcal{A}$ is not known to the learner.
Submitted 12 November, 2020; v1 submitted 18 July, 2018;
originally announced July 2018.
-
Representing smooth functions as compositions of near-identity functions with implications for deep network optimization
Authors:
Peter L. Bartlett,
Steven N. Evans,
Philip M. Long
Abstract:
We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the $h_1,...,h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
Submitted 16 April, 2018; v1 submitted 13 April, 2018;
originally announced April 2018.
-
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
Authors:
Peter L. Bartlett,
David P. Helmbold,
Philip M. Long
Abstract:
We analyze algorithms for approximating a function $f(x) = Φx$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $Θ_1,...,Θ_L$ and defined by $h(x) = Θ_L Θ_{L-1} ... Θ_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic.
We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $Φ$, in the case where the initial hypothesis $Θ_1 = ... = Θ_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $Φ$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help.
If $Φ$ is symmetric positive definite, we show that an algorithm that initializes $Θ_i = I$ learns an $ε$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $Φ$, and $\log(d/ε)$. In contrast, we show that if the least squares matrix $Φ$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge.
We analyze an algorithm for the case that $Φ$ satisfies $u^{\top} Φu > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} Θ_L Θ_{L-1} ... Θ_1 u > 0$ for all $u$, and another that "balances" $Θ_1, ..., Θ_L$ so that they have the same singular values.
Submitted 18 June, 2018; v1 submitted 16 February, 2018;
originally announced February 2018.
-
A Flexible Procedure for Mixture Proportion Estimation in Positive-Unlabeled Learning
Authors:
Zhenfeng Lin,
James P. Long
Abstract:
Positive-unlabeled (PU) learning considers two samples, a positive set P with observations from only one class and an unlabeled set U with observations from two classes. The goal is to classify observations in U. Class mixture proportion estimation (MPE) in U is a key step in PU learning. Blanchard et al. [2010] showed that MPE in PU learning is a generalization of the problem of estimating the proportion of true null hypotheses in multiple testing problems. Motivated by this idea, we propose reducing the problem to one dimension via construction of a probabilistic classifier trained on the P and U data sets followed by application of a one-dimensional mixture proportion method from the multiple testing literature to the observation class probabilities. The flexibility of this framework lies in the freedom to choose the classifier and the one-dimensional MPE method. We prove consistency of two mixture proportion estimators using bounds from empirical process theory, develop tuning parameter free implementations, and demonstrate that they have competitive performance on simulated waveform data and a protein signaling problem.
Submitted 9 January, 2020; v1 submitted 29 January, 2018;
originally announced January 2018.
-
Surprising properties of dropout in deep networks
Authors:
David P. Helmbold,
Philip M. Long
Abstract:
We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various re-scalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress.
Submitted 19 April, 2017; v1 submitted 14 February, 2016;
originally announced February 2016.
-
On the Inductive Bias of Dropout
Authors:
David P. Helmbold,
Philip M. Long
Abstract:
Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager et al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function.
In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization.
Submitted 17 February, 2015; v1 submitted 15 December, 2014;
originally announced December 2014.
-
Active and passive learning of linear separators under log-concave distributions
Authors:
Maria Florina Balcan,
Philip M. Long
Abstract:
We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators. We prove that active learning provides an exponential improvement over PAC (passive) learning of homogeneous linear separators under nearly log-concave distributions. Building on this, we provide a computationally efficient PAC algorithm with optimal (up to a constant factor) sample complexity for such problems. This resolves an open question concerning the sample complexity of efficient PAC algorithms under the uniform distribution in the unit ball. Moreover, it provides the first bound for a polynomial-time PAC algorithm that is tight for an interesting infinite class of hypothesis functions under a general and natural class of data-distributions, providing significant progress towards a longstanding open question.
We also provide new bounds for active and passive learning in the case that the data might not be linearly separable, both in the agnostic case and under the Tsybakov low-noise condition. To derive our results, we provide new structural results for (nearly) log-concave distributions, which might be of independent interest as well.
Submitted 26 April, 2013; v1 submitted 5 November, 2012;
originally announced November 2012.
-
Criterions of Wiener type for minimally thin sets and rarefied sets associated with the stationary Schrödinger operator in a cone
Authors:
Pinhong Long,
Zhiqiang Gao,
Guantie Deng
Abstract:
In this paper we give some criteria for a-minimally thin sets and a-rarefied sets associated with the stationary Schrödinger operator at a fixed Martin boundary point or $\infty$ with respect to a cone. Moreover, we show that a positive superfunction on a cone behaves regularly outside an a-rarefied set. Finally, we illustrate the relation between a-minimally thin sets and a-rarefied sets in a cone.
Submitted 28 May, 2012;
originally announced May 2012.