-
Bipartite Matching is in Catalytic Logspace
Authors:
Aryan Agarwala,
Ian Mertz
Abstract:
Matching is a central problem in theoretical computer science, with a large body of work spanning the last five decades. However, understanding matching in the time-space bounded setting remains a longstanding open question, even in the presence of additional resources such as randomness or non-determinism.
In this work we study space-bounded machines with access to catalytic space, which is add…
▽ More
Matching is a central problem in theoretical computer science, with a large body of work spanning the last five decades. However, understanding matching in the time-space bounded setting remains a longstanding open question, even in the presence of additional resources such as randomness or non-determinism.
In this work we study space-bounded machines with access to catalytic space, which is additional working memory that is full with arbitrary data that must be preserved at the end of its computation. Despite this heavy restriction, many recent works have shown the power of catalytic space, its utility in designing classical space-bounded algorithms, and surprising connections between catalytic computation and derandomization.
Our main result is that bipartite maximum matching ($MATCH$) can be computed in catalytic logspace ($CL$) with a polynomial time bound ($CLP$). Moreover, we show that $MATCH$ can be reduced to the lossy coding problem for $NC$ circuits ($LOSSY[NC]$). This has consequences for matching, catalytic space, and derandomization:
- Matching: this is the first well studied subclass of $P$ which is known to compute $MATCH$, as well as the first algorithm simultaneously using sublinear free space and polynomial time with any additional resources.
- Catalytic space: this is the first new problem shown to be in $CL$ since the model was defined, and one which is extremely central and well-studied.
- Derandomization: we give the first class $\mathcal{C}$ beyond $L$ for which we exhibit a natural problem in $LOSSY[\mathcal{C}]$ which is not known to be in $\mathcal{C}$, as well as a full derandomization of the isolation lemma in $CL$ in the context of $MATCH$.
Our proof combines a number of strengthened ideas from isolation-based algorithms for matching alongside the compress-or-random framework in catalytic computation.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Avoiding spurious sharpness minimization broadens applicability of SAM
Authors:
Sidak Pal Singh,
Hossein Mobahi,
Atish Agarwala,
Yann Dauphin
Abstract:
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominat…
▽ More
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics -- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
A Space Lower Bound for Approximate Membership with Duplicate Insertions or Deletions of Nonelements
Authors:
Aryan Agarwala,
Guy Even
Abstract:
Designs of data structures for approximate membership queries with false-positive errors that support both insertions and deletions stipulate the following two conditions: (1) Duplicate insertions are prohibited, i.e., it is prohibited to insert an element $x$ if $x$ is currently a member of the dataset. (2) Deletions of nonelements are prohibited, i.e., it is prohibited to delete $x$ if $x$ is no…
▽ More
Designs of data structures for approximate membership queries with false-positive errors that support both insertions and deletions stipulate the following two conditions: (1) Duplicate insertions are prohibited, i.e., it is prohibited to insert an element $x$ if $x$ is currently a member of the dataset. (2) Deletions of nonelements are prohibited, i.e., it is prohibited to delete $x$ if $x$ is not currently a member of the dataset. Under these conditions, the space required for the approximate representation of a datasets of cardinality $n$ with a false-positive probability of $ε^{+}$ is at most $(1+o(1))n\cdot\log_2 (1/ε^{+}) + O(n)$ bits [Bender et al., 2018; Bercea and Even, 2019].
We prove that if these conditions are lifted, then the space required for the approximate representation of datasets of cardinality $n$ from a universe of cardinality $u$ is at least $\frac 12 \cdot (1-ε^{+} -\frac 1n)\cdot \log \binom{u}{n} -O(n)$ bits.
△ Less
Submitted 26 December, 2024;
originally announced December 2024.
-
Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Authors:
Ke Liang Xiao,
Noah Marshall,
Atish Agarwala,
Elliot Paquette
Abstract:
In recent years, signSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high dimensio…
▽ More
In recent years, signSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
△ Less
Submitted 21 February, 2025; v1 submitted 18 November, 2024;
originally announced November 2024.
-
CultureVo: The Serious Game of Utilizing Gen AI for Enhancing Cultural Intelligence
Authors:
Ajita Agarwala,
Anupam Purwar,
Viswanadhasai Rao
Abstract:
CultureVo, Inc. has developed the Integrated Culture Learning Suite (ICLS) to deliver foundational knowledge of world cultures through a combination of interactive lessons and gamified experiences. This paper explores how Generative AI powered by open source Large Langauge Models are utilized within the ICLS to enhance cultural intelligence. The suite employs Generative AI techniques to automate t…
▽ More
CultureVo, Inc. has developed the Integrated Culture Learning Suite (ICLS) to deliver foundational knowledge of world cultures through a combination of interactive lessons and gamified experiences. This paper explores how Generative AI powered by open source Large Langauge Models are utilized within the ICLS to enhance cultural intelligence. The suite employs Generative AI techniques to automate the assessment of learner knowledge, analyze behavioral patterns, and manage interactions with non-player characters using real time learner assessment. Additionally, ICLS provides contextual hint and recommend course content by assessing learner proficiency, while Generative AI facilitates the automated creation and validation of educational content.
△ Less
Submitted 1 August, 2024; v1 submitted 30 July, 2024;
originally announced July 2024.
-
Stepping on the Edge: Curvature Aware Learning Rate Tuners
Authors:
Vincent Roulet,
Atish Agarwala,
Jean-Bastien Grill,
Grzegorz Swirszcz,
Mathieu Blondel,
Fabian Pedregosa
Abstract:
Curvature information -- particularly, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that the curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tu…
▽ More
Curvature information -- particularly, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that the curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tuning and curvature. We find that classical learning rate tuners may yield greater one-step loss reduction, yet they ultimately underperform in the long term when compared to constant learning rates in the full batch regime. These models break the stabilization of the sharpness, which we explain using a simplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuning method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long term curvature stabilization over instantaneous progress on the objective. In the full batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deep learning objectives, outperforming tuned constant learning rates. In the mini batch regime, we observe that stochasticity introduces confounding effects that explain the previous success of some learning rate tuners at appropriate batch sizes. Our findings highlight the critical role of understanding the joint dynamics of the learning rate and curvature, beyond greedy minimization, to diagnose failures and design effective adaptive learning rate tuners.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
Authors:
Noah Marshall,
Ke Liang Xiao,
Atish Agarwala,
Elliot Paquette
Abstract:
The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical a…
▽ More
The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension-a model and dataset dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss and demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping threshold. We propose a simple heuristic for near optimal scheduling of the clipping threshold which requires the tuning of only one hyperparameter. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
△ Less
Submitted 6 October, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
High dimensional analysis reveals conservative sharpening and a stochastic edge of stability
Authors:
Atish Agarwala,
Jeffrey Pennington
Abstract:
Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work…
▽ More
Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work showed that in the stochastic setting, the eigenvalues increase more slowly - a phenomenon we call conservative sharpening. We provide a theoretical analysis of a simple high-dimensional model which shows the origin of this slowdown. We also show that there is an alternative stochastic edge of stability which arises at small batch size that is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. We conduct an experimental study which highlights the qualitative differences from the full batch phenomenology, and suggests that controlling the stochastic edge of stability can help optimization.
△ Less
Submitted 31 January, 2025; v1 submitted 30 April, 2024;
originally announced April 2024.
-
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Authors:
Daniel Beaglehole,
Ioannis Mitliagkas,
Atish Agarwala
Abstract:
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as…
▽ More
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. We prove the derivative alignment occurs almost surely in specific high dimensional settings. Finally, we introduce a simple optimization rule motivated by our analysis of the centered correlation which dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
△ Less
Submitted 17 November, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Neglected Hessian component explains mysteries in Sharpness regularization
Authors:
Yann N. Dauphin,
Atish Agarwala,
Hossein Mobahi
Abstract:
Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition…
▽ More
Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating the feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight we design interventions to improve performance. We also provide evidence that challenges the long held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
△ Less
Submitted 24 January, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
On the Interplay Between Stepsize Tuning and Progressive Sharpening
Authors:
Vincent Roulet,
Atish Agarwala,
Fabian Pedregosa
Abstract:
Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners…
▽ More
Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch in the deterministic setting may be well explained by its tendency to ever-increase the sharpness of the objective. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, outperforming its Armijo and constant stepsizes counterparts in the deterministic setting. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
△ Less
Submitted 29 December, 2023; v1 submitted 30 November, 2023;
originally announced December 2023.
-
Domain specificity and data efficiency in typo tolerant spell checkers: the case of search in online marketplaces
Authors:
Dayananda Ubrangala,
Juhi Sharma,
Ravi Prasad Kondapalli,
Kiran R,
Amit Agarwala,
Laurent Boué
Abstract:
Typographical errors are a major source of frustration for visitors of online marketplaces. Because of the domain-specific nature of these marketplaces and the very short queries users tend to search for, traditional spell cheking solutions do not perform well in correcting typos. We present a data augmentation method to address the lack of annotated typo data and train a recurrent neural network…
▽ More
Typographical errors are a major source of frustration for visitors of online marketplaces. Because of the domain-specific nature of these marketplaces and the very short queries users tend to search for, traditional spell cheking solutions do not perform well in correcting typos. We present a data augmentation method to address the lack of annotated typo data and train a recurrent neural network to learn context-limited domain-specific embeddings. Those embeddings are deployed in a real-time inferencing API for the Microsoft AppSource marketplace to find the closest match between a misspelled user query and the available product names. Our data efficient solution shows that controlled high quality synthetic data may be a powerful tool especially considering the current climate of large language models which rely on prohibitively huge and often uncontrolled datasets.
△ Less
Submitted 3 August, 2023;
originally announced August 2023.
-
SAM operates far from home: eigenvalue regularization as a dynamical phenomenon
Authors:
Atish Agarwala,
Yann N. Dauphin
Abstract:
The Sharpness Aware Minimization (SAM) optimization algorithm has been shown to control large eigenvalues of the loss Hessian and provide generalization benefits in a variety of settings. The original motivation for SAM was a modified loss function which penalized sharp minima; subsequent analyses have also focused on the behavior near minima. However, our work reveals that SAM provides a strong r…
▽ More
The Sharpness Aware Minimization (SAM) optimization algorithm has been shown to control large eigenvalues of the loss Hessian and provide generalization benefits in a variety of settings. The original motivation for SAM was a modified loss function which penalized sharp minima; subsequent analyses have also focused on the behavior near minima. However, our work reveals that SAM provides a strong regularization of the eigenvalues throughout the learning trajectory. We show that in a simplified setting, SAM dynamically induces a stabilization related to the edge of stability (EOS) phenomenon observed in large learning rate gradient descent. Our theory predicts the largest eigenvalue as a function of the learning rate and SAM radius parameters. Finally, we show that practical models can also exhibit this EOS stabilization, and that understanding SAM must account for these dynamics far away from any minima.
△ Less
Submitted 16 February, 2023;
originally announced February 2023.
-
Second-order regression models exhibit progressive sharpening to the edge of stability
Authors:
Atish Agarwala,
Fabian Pedregosa,
Jeffrey Pennington
Abstract:
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant N…
▽ More
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.
△ Less
Submitted 10 October, 2022;
originally announced October 2022.
-
Deep equilibrium networks are sensitive to initialization statistics
Authors:
Atish Agarwala,
Samuel S. Schoenholz
Abstract:
Deep equilibrium networks (DEQs) are a promising way to construct models which trade off memory for compute. However, theoretical understanding of these models is still lacking compared to traditional networks, in part because of the repeated application of a single set of weights. We show that DEQs are sensitive to the higher order statistics of the matrix families from which they are initialized…
▽ More
Deep equilibrium networks (DEQs) are a promising way to construct models which trade off memory for compute. However, theoretical understanding of these models is still lacking compared to traditional networks, in part because of the repeated application of a single set of weights. We show that DEQs are sensitive to the higher order statistics of the matrix families from which they are initialized. In particular, initializing with orthogonal or symmetric matrices allows for greater stability in training. This gives us a practical prescription for initializations which allow for training with a broader range of initial weight scales.
△ Less
Submitted 19 July, 2022;
originally announced July 2022.
-
Neural Volumetric Object Selection
Authors:
Zhongzheng Ren,
Aseem Agarwala,
Bryan Russell,
Alexander G. Schwing,
Oliver Wang
Abstract:
We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF). Our approach takes a set of foreground and background 2D user scribbles in one view and automatically estimates a 3D segmentation of the desired object, which can be rendered into novel views. To achieve this result, we propose a novel voxel fe…
▽ More
We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF). Our approach takes a set of foreground and background 2D user scribbles in one view and automatically estimates a 3D segmentation of the desired object, which can be rendered into novel views. To achieve this result, we propose a novel voxel feature embedding that incorporates the neural volumetric 3D representation and multi-view image features from all input views. To evaluate our approach, we introduce a new dataset of human-provided segmentation masks for depicted objects in real-world multi-view scene captures. We show that our approach out-performs strong baselines, including 2D segmentation and 3D segmentation approaches adapted to our task.
△ Less
Submitted 30 May, 2022;
originally announced May 2022.
-
One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks
Authors:
Atish Agarwala,
Abhimanyu Das,
Brendan Juba,
Rina Panigrahy,
Vatsal Sharan,
Xin Wang,
Qiuyi Zhang
Abstract:
Can deep learning solve multiple tasks simultaneously, even when they are unrelated and very different? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a vari…
▽ More
Can deep learning solve multiple tasks simultaneously, even when they are unrelated and very different? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a variety of methods for representing tasks -- for example, when the distinct tasks are encoded by well-separated clusters or decision trees over certain task-code attributes. More concretely, we present a novel analysis that shows that families of simple programming-like constructs for the codes encoding the tasks are learnable by two-layer neural networks with standard training. We study more generally how the complexity of learning such combined tasks grows with the complexity of the task codes; we find that combining many tasks may incur a sample complexity penalty, even though the individual tasks are easy to learn. We provide empirical support for the usefulness of the learning bounds by training networks on clusters, decision trees, and SQL-style aggregation.
△ Less
Submitted 28 March, 2021;
originally announced March 2021.
-
Temperature check: theory and practice for training models with softmax-cross-entropy losses
Authors:
Atish Agarwala,
Jeffrey Pennington,
Yann Dauphin,
Sam Schoenholz
Abstract:
The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynam…
▽ More
The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $β$ as well as the magnitude of the logits at initialization, $||β{\bf z}||_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on $β$ is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of $β$ as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal $β$ to be sensitive to the architecture, our results suggest that tuning $β$ over the range $10^{-2}$ to $10^1$ improves performance over all architectures studied. We find that smaller $β$ may lead to better peak performance at the cost of learning stability.
△ Less
Submitted 14 October, 2020;
originally announced October 2020.
-
Learning the gravitational force law and other analytic functions
Authors:
Atish Agarwala,
Abhimanyu Das,
Rina Panigrahy,
Qiuyi Zhang
Abstract:
Large neural network models have been successful in learning functions of importance in many branches of science, including physics, chemistry and biology. Recent theoretical work has shown explicit learning bounds for wide networks and kernel methods on some simple classes of functions, but not on more complex functions which arise in practice. We extend these techniques to provide learning bound…
▽ More
Large neural network models have been successful in learning functions of importance in many branches of science, including physics, chemistry and biology. Recent theoretical work has shown explicit learning bounds for wide networks and kernel methods on some simple classes of functions, but not on more complex functions which arise in practice. We extend these techniques to provide learning bounds for analytic functions on the sphere for any kernel method or equivalent infinitely-wide network with the corresponding activation function trained with SGD. We show that a wide, one-hidden layer ReLU network can learn analytic functions with a number of samples proportional to the derivative of a related function. Many functions important in the sciences are therefore efficiently learnable. As an example, we prove explicit bounds on learning the many-body gravitational force function given by Newton's law of gravitation. Our theoretical bounds suggest that very wide ReLU networks (and the corresponding NTK kernel) are better at learning analytic functions as compared to kernel learning with Gaussian kernels. We present experimental evidence that the many-body gravitational force function is easier to learn with ReLU networks as compared to networks with exponential activations.
△ Less
Submitted 15 May, 2020;
originally announced May 2020.
-
Deep Homography Estimation for Dynamic Scenes
Authors:
Hoang Le,
Feng Liu,
Shu Zhang,
Aseem Agarwala
Abstract:
Homography estimation is an important step in many computer vision problems. Recently, deep neural network methods have shown to be favorable for this problem when compared to traditional methods. However, these new methods do not consider dynamic content in input images. They train neural networks with only image pairs that can be perfectly aligned using homographies. This paper investigates and…
▽ More
Homography estimation is an important step in many computer vision problems. Recently, deep neural network methods have shown to be favorable for this problem when compared to traditional methods. However, these new methods do not consider dynamic content in input images. They train neural networks with only image pairs that can be perfectly aligned using homographies. This paper investigates and discusses how to design and train a deep neural network that handles dynamic scenes. We first collect a large video dataset with dynamic content. We then develop a multi-scale neural network and show that when properly trained using our new dataset, this neural network can already handle dynamic scenes to some extent. To estimate a homography of a dynamic scene in a more principled way, we need to identify the dynamic content. Since dynamic content detection and homography estimation are two tightly coupled tasks, we follow the multi-task learning principles and augment our multi-scale network such that it jointly estimates the dynamics masks and homographies. Our experiments show that our method can robustly estimate homography for challenging scenarios with dynamic scenes, blur artifacts, or lack of textures.
△ Less
Submitted 5 April, 2020;
originally announced April 2020.
-
A Compact Embedding for Facial Expression Similarity
Authors:
Raviteja Vemulapalli,
Aseem Agarwala
Abstract:
Most of the existing work on automatic facial expression analysis focuses on discrete emotion recognition, or facial action unit detection. However, facial expressions do not always fall neatly into pre-defined semantic categories. Also, the similarity between expressions measured in the action unit space need not correspond to how humans perceive expression similarity. Different from previous wor…
▽ More
Most of the existing work on automatic facial expression analysis focuses on discrete emotion recognition, or facial action unit detection. However, facial expressions do not always fall neatly into pre-defined semantic categories. Also, the similarity between expressions measured in the action unit space need not correspond to how humans perceive expression similarity. Different from previous work, our goal is to describe facial expressions in a continuous fashion using a compact embedding space that mimics human visual preferences. To achieve this goal, we collect a large-scale faces-in-the-wild dataset with human annotations in the form: Expressions A and B are visually more similar when compared to expression C, and use this dataset to train a neural network that produces a compact (16-dimensional) expression embedding. We experimentally demonstrate that the learned embedding can be successfully used for various applications such as expression retrieval, photo album summarization, and emotion recognition. We also show that the embedding learned using the proposed dataset performs better than several other embeddings learned using existing emotion or action unit datasets.
△ Less
Submitted 9 January, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
Video Frame Synthesis using Deep Voxel Flow
Authors:
Ziwei Liu,
Raymond A. Yeh,
Xiaoou Tang,
Yiming Liu,
Aseem Agarwala
Abstract:
We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucina…
▽ More
We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
△ Less
Submitted 5 August, 2017; v1 submitted 8 February, 2017;
originally announced February 2017.
-
Semantic Facial Expression Editing using Autoencoded Flow
Authors:
Raymond Yeh,
Ziwei Liu,
Dan B Goldman,
Aseem Agarwala
Abstract:
High-level manipulation of facial expressions in images --- such as changing a smile to a neutral expression --- is challenging because facial expression changes are highly non-linear, and vary depending on the appearance of the face. We present a fully automatic approach to editing faces that combines the advantages of flow-based face manipulation with the more recent generative capabilities of V…
▽ More
High-level manipulation of facial expressions in images --- such as changing a smile to a neutral expression --- is challenging because facial expression changes are highly non-linear, and vary depending on the appearance of the face. We present a fully automatic approach to editing faces that combines the advantages of flow-based face manipulation with the more recent generative capabilities of Variational Autoencoders (VAEs). During training, our model learns to encode the flow from one expression to another over a low-dimensional latent space. At test time, expression editing can be done simply using latent vector arithmetic. We evaluate our methods on two applications: 1) single-image facial expression editing, and 2) facial expression interpolation between two images. We demonstrate that our method generates images of higher perceptual quality than previous VAE and flow-based methods.
△ Less
Submitted 29 November, 2016;
originally announced November 2016.
-
Performance metrics in a hybrid MPI-OpenMP based molecular dynamics simulation with short-range interactions
Authors:
Anirban Pal,
Abhishek Agarwala,
Soumyendu Raha,
Baidurya Bhattacharya
Abstract:
We discuss the computational bottlenecks in molecular dynamics (MD) and describe the challenges in parallelizing the computation intensive tasks. We present a hybrid algorithm using MPI (Message Passing Interface) with OpenMP threads for parallelizing a generalized MD computation scheme for systems with short range interatomic interactions. The algorithm is discussed in the context of nanoindentat…
▽ More
We discuss the computational bottlenecks in molecular dynamics (MD) and describe the challenges in parallelizing the computation intensive tasks. We present a hybrid algorithm using MPI (Message Passing Interface) with OpenMP threads for parallelizing a generalized MD computation scheme for systems with short range interatomic interactions. The algorithm is discussed in the context of nanoindentation of Chromium films with carbon indenters using the Embedded Atom Method potential for Cr Cr interaction and the Morse potential for Cr C interactions. We study the performance of our algorithm for a range of MPIthread combinations and find the performance to depend strongly on the computational task and load sharing in the multicore processor. The algorithm scaled poorly with MPI and our hybrid schemes were observed to outperform the pure message passing scheme, despite utilizing the same number of processors or cores in the cluster. Speed-up achieved by our algorithm compared favourably with that achieved by standard MD packages.
△ Less
Submitted 25 July, 2015;
originally announced July 2015.
-
DeepFont: Identify Your Font from An Image
Authors:
Zhangyang Wang,
Jianchao Yang,
Hailin Jin,
Eli Shechtman,
Aseem Agarwala,
Jonathan Brandt,
Thomas S. Huang
Abstract:
As font is one of the core design concepts, automatic font identification and similar font suggestion from an image or photo has been on the wish list of many designers. We study the Visual Font Recognition (VFR) problem, and advance the state-of-the-art remarkably by developing the DeepFont system. First of all, we build up the first available large-scale VFR dataset, named AdobeVFR, consisting o…
▽ More
As font is one of the core design concepts, automatic font identification and similar font suggestion from an image or photo has been on the wish list of many designers. We study the Visual Font Recognition (VFR) problem, and advance the state-of-the-art remarkably by developing the DeepFont system. First of all, we build up the first available large-scale VFR dataset, named AdobeVFR, consisting of both labeled synthetic data and partially labeled real-world data. Next, to combat the domain mismatch between available training and testing data, we introduce a Convolutional Neural Network (CNN) decomposition approach, using a domain adaptation technique based on a Stacked Convolutional Auto-Encoder (SCAE) that exploits a large corpus of unlabeled real-world text images combined with synthetic data preprocessed in a specific way. Moreover, we study a novel learning-based model compression approach, in order to reduce the DeepFont model size without sacrificing its performance. The DeepFont system achieves an accuracy of higher than 80% (top-5) on our collected dataset, and also produces a good font similarity measure for font selection and suggestion. We also achieve around 6 times compression of the model without any visible loss of recognition accuracy.
△ Less
Submitted 12 July, 2015;
originally announced July 2015.
-
Real-World Font Recognition Using Deep Network and Domain Adaptation
Authors:
Zhangyang Wang,
Jianchao Yang,
Hailin Jin,
Eli Shechtman,
Aseem Agarwala,
Jonathan Brandt,
Thomas S. Huang
Abstract:
We address a challenging fine-grain classification problem: recognizing a font style from an image of text. In this task, it is very easy to generate lots of rendered font examples but very hard to obtain real-world labeled images. This real-to-synthetic domain gap caused poor generalization to new real data in previous methods (Chen et al. (2014)). In this paper, we refer to Convolutional Neural…
▽ More
We address a challenging fine-grain classification problem: recognizing a font style from an image of text. In this task, it is very easy to generate lots of rendered font examples but very hard to obtain real-world labeled images. This real-to-synthetic domain gap caused poor generalization to new real data in previous methods (Chen et al. (2014)). In this paper, we refer to Convolutional Neural Networks, and use an adaptation technique based on a Stacked Convolutional Auto-Encoder that exploits unlabeled real-world images combined with synthetic data. The proposed method achieves an accuracy of higher than 80% (top-5) on a real-world dataset.
△ Less
Submitted 31 March, 2015;
originally announced April 2015.
-
Decomposition-Based Domain Adaptation for Real-World Font Recognition
Authors:
Zhangyang Wang,
Jianchao Yang,
Hailin Jin,
Eli Shechtman,
Aseem Agarwala,
Jonathan Brandt,
Thomas S. Huang
Abstract:
We present a domain adaption framework to address a domain mismatch between synthetic training and real-world testing data. We demonstrate our method on a challenging fine-grain classification problem: recognizing a font style from an image of text. In this task, it is very easy to generate lots of rendered font examples but very hard to obtain real-world labeled images. This real-to-synthetic dom…
▽ More
We present a domain adaption framework to address a domain mismatch between synthetic training and real-world testing data. We demonstrate our method on a challenging fine-grain classification problem: recognizing a font style from an image of text. In this task, it is very easy to generate lots of rendered font examples but very hard to obtain real-world labeled images. This real-to-synthetic domain gap caused poor generalization to new real data in previous font recognition methods (Chen et al. (2014)). In this paper, we introduce a Convolutional Neural Network decomposition approach, leveraging a large training corpus of synthetic data to obtain effective features for classification. This is done using an adaptation technique based on a Stacked Convolutional Auto-Encoder that exploits a large collection of unlabeled real-world text images combined with synthetic data preprocessed in a specific way. The proposed DeepFont method achieves an accuracy of higher than 80% (top-5) on a new large labeled real-world dataset we collected.
△ Less
Submitted 1 April, 2015; v1 submitted 18 December, 2014;
originally announced December 2014.
-
Recognizing Image Style
Authors:
Sergey Karayev,
Matthew Trentacoste,
Helen Han,
Aseem Agarwala,
Trevor Darrell,
Aaron Hertzmann,
Holger Winnemoeller
Abstract:
The style of an image plays a significant role in how it is viewed, but style has received little attention in computer vision research. We describe an approach to predicting style of images, and perform a thorough evaluation of different image features for these tasks. We find that features learned in a multi-layer network generally perform best -- even when trained with object class (not style)…
▽ More
The style of an image plays a significant role in how it is viewed, but style has received little attention in computer vision research. We describe an approach to predicting style of images, and perform a thorough evaluation of different image features for these tasks. We find that features learned in a multi-layer network generally perform best -- even when trained with object class (not style) labels. Our large-scale learning methods results in the best published performance on an existing dataset of aesthetic ratings and photographic style annotations. We present two novel datasets: 80K Flickr photographs annotated with 20 curated style labels, and 85K paintings annotated with 25 style/genre labels. Our approach shows excellent classification performance on both datasets. We use the learned classifiers to extend traditional tag-based image search to consider stylistic constraints, and demonstrate cross-dataset understanding of style.
△ Less
Submitted 23 July, 2014; v1 submitted 14 November, 2013;
originally announced November 2013.