Search | arXiv e-print repository

SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network

Authors: Tomer Galanti, Zachary S. Siegel, Aparna Gupte, Tomaso Poggio

Abstract: We investigate the inherent bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Our results demonstrate that training with mini-batch SGD and weight decay induces a bias toward rank minimization in the weight matrices. Specifically, we show both theoretically and empirically that this bias becomes more pronounced with smal… ▽ More We investigate the inherent bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Our results demonstrate that training with mini-batch SGD and weight decay induces a bias toward rank minimization in the weight matrices. Specifically, we show both theoretically and empirically that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Additionally, we predict and empirically confirm that weight decay is essential for this bias to occur. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, making it applicable to a wide range of neural network architectures of any width or depth. Finally, we empirically explore the connection between this bias and generalization, finding that it has a marginal effect on the test performance. △ Less

Submitted 18 October, 2024; v1 submitted 12 June, 2022; originally announced June 2022.

arXiv:2003.12193 [pdf, other]

On Infinite-Width Hypernetworks

Authors: Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang

Abstract: {\em Hypernetworks} are architectures that produce the weights of a task-specific {\em primary network}. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such repr… ▽ More {\em Hypernetworks} are architectures that produce the weights of a task-specific {\em primary network}. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they remain lacking in the theoretical guarantees in the wide regime of the standard architectures. In this work, we study wide over-parameterized hypernetworks. We show that unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minima under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, to represent wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the {\em hyperkernel}. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks. △ Less

Submitted 22 February, 2021; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: The first two authors contributed equally

arXiv:2002.10007 [pdf, other]

A Critical View of the Structural Causal Model

Authors: Tomer Galanti, Ofir Nabati, Lior Wolf

Abstract: In the univariate case, we show that by comparing the individual complexities of univariate cause and effect, one can identify the cause and the effect, without considering their interaction at all. In our framework, complexities are captured by the reconstruction error of an autoencoder that operates on the quantiles of the distribution. Comparing the reconstruction errors of the two autoencoders… ▽ More In the univariate case, we show that by comparing the individual complexities of univariate cause and effect, one can identify the cause and the effect, without considering their interaction at all. In our framework, complexities are captured by the reconstruction error of an autoencoder that operates on the quantiles of the distribution. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform surprisingly well on the accepted causality directionality benchmarks. Hence, the decision as to which of the two is the cause and which is the effect may not be based on causality but on complexity. In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new adversarial training method that mimics the disentangled structure of the causal model. We prove that in the multidimensional case, such modeling is likely to fit the data only in the direction of causality. Furthermore, a uniqueness result shows that the learned model is able to identify the underlying causal and residual (noise) components. Our multidimensional method outperforms the literature methods on both synthetic and real world datasets. △ Less

Submitted 23 February, 2020; originally announced February 2020.

arXiv:2002.10006 [pdf, other]

On the Modularity of Hypernetworks

Authors: Tomer Galanti, Lior Wolf

Abstract: In the context of learning to map an input $I$ to a function $h_I:\mathcal{X}\to \mathbb{R}$, two alternative methods are compared: (i) an embedding-based method, which learns a fixed function in which $I$ is encoded as a conditioning signal $e(I)$ and the learned function takes the form $h_I(x) = q(x,e(I))$, and (ii) hypernetworks, in which the weights $θ_I$ of the function $h_I(x) = g(x;θ_I)$ ar… ▽ More In the context of learning to map an input $I$ to a function $h_I:\mathcal{X}\to \mathbb{R}$, two alternative methods are compared: (i) an embedding-based method, which learns a fixed function in which $I$ is encoded as a conditioning signal $e(I)$ and the learned function takes the form $h_I(x) = q(x,e(I))$, and (ii) hypernetworks, in which the weights $θ_I$ of the function $h_I(x) = g(x;θ_I)$ are given by a hypernetwork $f$ as $θ_I=f(I)$. In this paper, we define the property of modularity as the ability to effectively learn a different function for each input instance $I$. For this purpose, we adopt an expressivity perspective of this property and extend the theory of Devore et al. 1996 and provide a lower bound on the complexity (number of trainable parameters) of neural networks as function approximators, by eliminating the requirements for the approximation method to be robust. Our results are then used to compare the complexities of $q$ and $g$, showing that under certain conditions and when letting the functions $e$ and $f$ be as large as we wish, $g$ can be smaller than $q$ by orders of magnitude. This sheds light on the modularity of hypernetworks in comparison with the embedding-based method. Besides, we show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method. △ Less

Submitted 2 November, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

Comments: Accepted to Advances in Neural Information Processing Systems (NeurIPS) 2020

arXiv:2001.10460 [pdf, other]

On Random Kernels of Residual Architectures

Authors: Etai Littwin, Tomer Galanti, Lior Wolf

Abstract: We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets. Our analysis reveals that finite size residual architectures are initialized much closer to the "kernel regime" than their vanilla counterparts: while in networks that do not use skip connections, convergence to the NTK requires one to fix the depth, while increasing the layers' width. Our fi… ▽ More We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets. Our analysis reveals that finite size residual architectures are initialized much closer to the "kernel regime" than their vanilla counterparts: while in networks that do not use skip connections, convergence to the NTK requires one to fix the depth, while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided with a proper initialization. In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed, at a rate that is independent of both the depth and scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features. △ Less

Submitted 17 June, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

arXiv:2001.05207 [pdf, ps, other]

A Formal Approach to Explainability

Authors: Lior Wolf, Tomer Galanti, Tamir Hazan

Abstract: We regard explanations as a blending of the input sample and the model's output and offer a few definitions that capture various desired properties of the function that generates these explanations. We study the links between these properties and between explanation-generating functions and intermediate representations of learned models and are able to show, for example, that if the activations of… ▽ More We regard explanations as a blending of the input sample and the model's output and offer a few definitions that capture various desired properties of the function that generates these explanations. We study the links between these properties and between explanation-generating functions and intermediate representations of learned models and are able to show, for example, that if the activations of a given layer are consistent with an explanation, then so do all other subsequent layers. In addition, we study the intersection and union of explanations as a way to construct new explanations. △ Less

Submitted 15 January, 2020; originally announced January 2020.

Journal ref: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, January 2019, Pages 255-261

arXiv:2001.05026 [pdf, other]

Unsupervised Learning of the Set of Local Maxima

Authors: Lior Wolf, Sagie Benaim, Tomer Galanti

Abstract: This paper describes a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that given two nearby samples, predicts which sample has the higher value of th… ▽ More This paper describes a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that given two nearby samples, predicts which sample has the higher value of the unknown function v. Loss terms are used to ensure that all training samples x are a local maxima of v, according to h and satisfy c(x)=1. Therefore, c and h provide training signals to each other: a point x' in the vicinity of x satisfies c(x)=-1 or is deemed by h to be lower in value than x. We present an algorithm, show an example where it is more efficient to use local maxima as an indicator function than to employ conventional classification, and derive a suitable generalization bound. Our experiments show that the method is able to outperform one-class classification algorithms in the task of anomaly detection and also provide an additional signal that is extracted in a completely unsupervised way. △ Less

Submitted 14 January, 2020; originally announced January 2020.

Comments: ICLR 2019

arXiv:1807.08501 [pdf, other]

Risk Bounds for Unsupervised Cross-Domain Mapping with IPMs

Authors: Tomer Galanti, Sagie Benaim, Lior Wolf

Abstract: The recent empirical success of unsupervised cross-domain mapping algorithms, between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings. We work with adversarial training methods based on IPMs and derive a novel risk bound, which upper bounds the risk between the lear… ▽ More The recent empirical success of unsupervised cross-domain mapping algorithms, between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings. We work with adversarial training methods based on IPMs and derive a novel risk bound, which upper bounds the risk between the learned mapping $h$ and the target mapping $y$, by a sum of three terms: (i) the risk between $h$ and the most distant alternative mapping that was learned by the same cross-domain mapping algorithm, (ii) the minimal discrepancy between the target domain and the domain obtained by applying a hypothesis $h^*$ on the samples of the source domain, where $h^*$ is a hypothesis selectable by the same algorithm. The bound is directly related to Occam's razor and encourages the selection of the minimal architecture that supports a small mapping discrepancy and (iii) an approximation error term that decreases as the complexity of the class of discriminators increases and is empirically shown to be small. The bound leads to multiple algorithmic consequences, including a method for hyperparameters selection and for early stopping in cross-domain mapping GANs. We also demonstrate a novel capability for unsupervised learning of estimating confidence in the mapping of every specific sample. △ Less

Submitted 2 November, 2020; v1 submitted 23 July, 2018; originally announced July 2018.

Comments: arXiv admin note: text overlap with arXiv:1709.00074

arXiv:1703.01606 [pdf, ps, other]

A Theory of Output-Side Unsupervised Domain Adaptation

Authors: Tomer Galanti, Lior Wolf

Abstract: When learning a mapping from an input space to an output space, the assumption that the sample distribution of the training data is the same as that of the test data is often violated. Unsupervised domain shift methods adapt the learned function in order to correct for this shift. Previous work has focused on utilizing unlabeled samples from the target distribution. We consider the complementary p… ▽ More When learning a mapping from an input space to an output space, the assumption that the sample distribution of the training data is the same as that of the test data is often violated. Unsupervised domain shift methods adapt the learned function in order to correct for this shift. Previous work has focused on utilizing unlabeled samples from the target distribution. We consider the complementary problem in which the unlabeled samples are given post mapping, i.e., we are given the outputs of the mapping of unknown samples from the shifted domain. Two other variants are also studied: the two sided version, in which unlabeled samples are give from both the input and the output spaces, and the Domain Transfer problem, which was recently formalized. In all cases, we derive generalization bounds that employ discrepancy terms. △ Less

Submitted 5 March, 2017; originally announced March 2017.

Showing 1–9 of 9 results for author: Galanti, T