-
What Makes a Reward Model a Good Teacher? An Optimization Perspective
Authors:
Noam Razin,
Zixuan Wang,
Hubert Strauss,
Stanley Wei,
Jason D. Lee,
Sanjeev Arora
Abstract:
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model…
▽ More
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Authors:
Marko Medvedev,
Kaifeng Lyu,
Dingli Yu,
Sanjeev Arora,
Zhiyuan Li,
Nathan Srebro
Abstract:
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed…
▽ More
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set
Authors:
Yikai Wu,
Haoyu Zhao,
Sanjeev Arora
Abstract:
AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Experiments on standard graph families show that AI-based algorithms fail to outperform and, in many cases, to match the soluti…
▽ More
AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Experiments on standard graph families show that AI-based algorithms fail to outperform and, in many cases, to match the solution quality of the state-of-art classical solver KaMIS running on a single CPU. Some GPU-based methods even perform similarly to the simplest heuristic, degree-based greedy. Even with post-processing techniques like local search, AI-based methods still perform worse than CPU-based solvers.
We develop a new mode of analysis to reveal that non-backtracking AI methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy approach, and thus worse than KaMIS. We also find that CPU-based algorithms, notably KaMIS, have strong performance on sparse random graphs, which appears to refute a well-known conjectured upper bound for efficient algorithms from Coja-Oghlan & Efthymiou (2015).
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
LoLCATs: On Low-Rank Linearizing of Large Language Models
Authors:
Michael Zhang,
Simran Arora,
Rahul Chalamala,
Alan Wu,
Benjamin Spector,
Aaryan Singhal,
Krithik Ramesh,
Christopher Ré
Abstract:
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. W…
▽ More
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer"). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
△ Less
Submitted 5 March, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Authors:
Noam Razin,
Sadhika Malladi,
Adithya Bhaskar,
Danqi Chen,
Sanjeev Arora,
Boris Hanin
Abstract:
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on th…
▽ More
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
△ Less
Submitted 27 April, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
A Theory for Emergence of Complex Skills in Language Models
Authors:
Sanjeev Arora,
Anirudh Goyal
Abstract:
A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws…
▽ More
A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.
△ Less
Submitted 5 November, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound
Authors:
Arushi Gupta,
Nikunj Saunshi,
Dingli Yu,
Kaifeng Lyu,
Sanjeev Arora
Abstract:
Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninformative\textquotedblright\ pixels, and checking if th…
▽ More
Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninformative\textquotedblright\ pixels, and checking if the net's output is mostly unchanged. This is usually seen as an {\em explanation} of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of {\em completeness \& soundness}, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an {\em intrinsic} framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
△ Less
Submitted 5 November, 2022;
originally announced November 2022.
-
What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Authors:
Zhiyuan Li,
Tianhao Wang,
Sanjeev Arora
Abstract:
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $η$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further…
▽ More
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $η$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $η^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $η^{-1.6}$ steps and (2) allowing arbitrary noise covariance. As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires $O(κ\ln d)$ samples for learning an $κ$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $Ω(d)$ samples. This upper bound is minimax optimal and improves the previous $\tilde{O}(κ^2)$ upper bound (HaoChen et al., 2020).
△ Less
Submitted 28 July, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Review of Low Voltage Load Forecasting: Methods, Applications, and Recommendations
Authors:
Stephen Haben,
Siddharth Arora,
Georgios Giasemidis,
Marcus Voss,
Danica Vukadinovic Greetham
Abstract:
The increased digitalisation and monitoring of the energy system opens up numerous opportunities to decarbonise the energy system. Applications on low voltage, local networks, such as community energy markets and smart storage will facilitate decarbonisation, but they will require advanced control and management. Reliable forecasting will be a necessary component of many of these systems to antici…
▽ More
The increased digitalisation and monitoring of the energy system opens up numerous opportunities to decarbonise the energy system. Applications on low voltage, local networks, such as community energy markets and smart storage will facilitate decarbonisation, but they will require advanced control and management. Reliable forecasting will be a necessary component of many of these systems to anticipate key features and uncertainties. Despite this urgent need, there has not yet been an extensive investigation into the current state-of-the-art of low voltage level forecasts, other than at the smart meter level. This paper aims to provide a comprehensive overview of the landscape, current approaches, core applications, challenges and recommendations. Another aim of this paper is to facilitate the continued improvement and advancement in this area. To this end, the paper also surveys some of the most relevant and promising trends. It establishes an open, community-driven list of the known low voltage level open datasets to encourage further research and development.
△ Less
Submitted 5 September, 2021; v1 submitted 30 May, 2021;
originally announced June 2021.
-
Rip van Winkle's Razor: A Simple Estimate of Overfit to Test Data
Authors:
Sanjeev Arora,
Yi Zhang
Abstract:
Traditional statistics forbids use of test data (a.k.a. holdout data) during training. Dwork et al. 2015 pointed out that current practices in machine learning, whereby researchers build upon each other's models, copying hyperparameters and even computer code -- amounts to implicitly training on the test set. Thus error rate on test data may not reflect the true population error. This observation…
▽ More
Traditional statistics forbids use of test data (a.k.a. holdout data) during training. Dwork et al. 2015 pointed out that current practices in machine learning, whereby researchers build upon each other's models, copying hyperparameters and even computer code -- amounts to implicitly training on the test set. Thus error rate on test data may not reflect the true population error. This observation initiated {\em adaptive data analysis}, which provides evaluation mechanisms with guaranteed upper bounds on this difference. With statistical query (i.e. test accuracy) feedbacks, the best upper bound is fairly pessimistic: the deviation can hit a practically vacuous value if the number of models tested is quadratic in the size of the test set.
In this work, we present a simple new estimate, {\em Rip van Winkle's Razor}. It relies upon a new notion of \textquotedblleft information content\textquotedblright\ of a model: the amount of information that would have to be provided to an expert referee who is intimately familiar with the field and relevant science/math, and who has been just been woken up after falling asleep at the moment of the creation of the test data (like \textquotedblleft Rip van Winkle\textquotedblright\ of the famous fairy tale). This notion of information content is used to provide an estimate of the above deviation which is shown to be non-vacuous in many modern settings.
△ Less
Submitted 25 February, 2021;
originally announced February 2021.
-
On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)
Authors:
Zhiyuan Li,
Sadhika Malladi,
Sanjeev Arora
Abstract:
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019)) only applies to SGD with tiny LR. Experimental verificatio…
▽ More
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019)) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Ito SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
△ Less
Submitted 16 June, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
-
Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
Authors:
Zhiyuan Li,
Yi Zhang,
Sanjeev Arora
Abstract:
Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional net (for a fixed task). Thus the training algorith…
▽ More
Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional net (for a fixed task). Thus the training algorithm plays a role. The current work describes a natural task on which a provable sample complexity gap can be shown, for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d\times\{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e. fully-connected networks trained with most gradient-based methods from gaussian initialization) requires $Ω(d^2)$ samples to generalize while $O(1)$ samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs $Ω(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on fully-connected network is orthogonal equivariant. Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and AdaGrad, which are only permutation equivariant.
△ Less
Submitted 4 May, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
TextHide: Tackling Data Privacy in Language Understanding Tasks
Authors:
Yangsibo Huang,
Zhao Song,
Danqi Chen,
Kai Li,
Sanjeev Arora
Abstract:
An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data. Su…
▽ More
An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data. Such an encryption step is efficient and only affects the task performance slightly. In addition, TextHide fits well with the popular framework of fine-tuning pre-trained language models (e.g., BERT) for any sentence or sentence-pair task. We evaluate TextHide on the GLUE benchmark, and our experiments show that TextHide can effectively defend attacks on shared gradients or representations and the averaged accuracy reduction is only $1.9\%$. We also present an analysis of the security of TextHide using a conjecture about the computational intractability of a mathematical problem.
Our code is available at https://github.com/Hazelsuko07/TextHide
△ Less
Submitted 12 October, 2020;
originally announced October 2020.
-
A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Authors:
Nikunj Saunshi,
Sadhika Malladi,
Sanjeev Arora
Abstract:
Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions…
▽ More
Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $ε$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $\mathcal{O}(\sqrtε)$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.
△ Less
Submitted 14 April, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
InstaHide: Instance-hiding Schemes for Private Distributed Learning
Authors:
Yangsibo Huang,
Zhao Song,
Kai Li,
Sanjeev Arora
Abstract:
How can multiple distributed entities collaboratively train a shared deep net on their private data while preserving privacy? This paper introduces InstaHide, a simple encryption of training images, which can be plugged into existing distributed deep learning pipelines. The encryption is efficient and applying it during training has minor effect on test accuracy.
InstaHide encrypts each training…
▽ More
How can multiple distributed entities collaboratively train a shared deep net on their private data while preserving privacy? This paper introduces InstaHide, a simple encryption of training images, which can be plugged into existing distributed deep learning pipelines. The encryption is efficient and applying it during training has minor effect on test accuracy.
InstaHide encrypts each training image with a "one-time secret key" which consists of mixing a number of randomly chosen images and applying a random pixel-wise mask. Other contributions of this paper include: (a) Using a large public dataset (e.g. ImageNet) for mixing during its encryption, which improves security. (b) Experimental results to show effectiveness in preserving privacy against known attacks with only minor effects on accuracy. (c) Theoretical analysis showing that successfully attacking privacy requires attackers to solve a difficult computational problem. (d) Demonstrating that use of the pixel-wise mask is important for security, since Mixup alone is shown to be insecure to some some efficient attacks. (e) Release of a challenge dataset https://github.com/Hazelsuko07/InstaHide_Challenge
Our code is available at https://github.com/Hazelsuko07/InstaHide
△ Less
Submitted 24 February, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
-
IterefinE: Iterative KG Refinement Embeddings using Symbolic Knowledge
Authors:
Siddhant Arora,
Srikanta Bedathur,
Maya Ramanath,
Deepak Sharma
Abstract:
Knowledge Graphs (KGs) extracted from text sources are often noisy and lead to poor performance in downstream application tasks such as KG-based question answering.While much of the recent activity is focused on addressing the sparsity of KGs by using embeddings for inferring new facts, the issue of cleaning up of noise in KGs through KG refinement task is not as actively studied. Most successful…
▽ More
Knowledge Graphs (KGs) extracted from text sources are often noisy and lead to poor performance in downstream application tasks such as KG-based question answering.While much of the recent activity is focused on addressing the sparsity of KGs by using embeddings for inferring new facts, the issue of cleaning up of noise in KGs through KG refinement task is not as actively studied. Most successful techniques for KG refinement make use of inference rules and reasoning over ontologies. Barring a few exceptions, embeddings do not make use of ontological information, and their performance in KG refinement task is not well understood. In this paper, we present a KG refinement framework called IterefinE which iteratively combines the two techniques - one which uses ontological information and inferences rules, PSL-KGI, and the KG embeddings such as ComplEx and ConvE which do not. As a result, IterefinE is able to exploit not only the ontological information to improve the quality of predictions, but also the power of KG embeddings which (implicitly) perform longer chains of reasoning. The IterefinE framework, operates in a co-training mode and results in explicit type-supervised embedding of the refined KG from PSL-KGI which we call as TypeE-X. Our experiments over a range of KG benchmarks show that the embeddings that we produce are able to reject noisy facts from KG and at the same time infer higher quality new facts resulting in up to 9% improvement of overall weighted F1 score
△ Less
Submitted 3 June, 2020;
originally announced June 2020.
-
Probabilistic Forecasting of Patient Waiting Times in an Emergency Department
Authors:
Siddharth Arora,
James W. Taylor,
Ho-Yin Mak
Abstract:
We study the estimation of the probability distribution of individual patient waiting times in an emergency department (ED). Our feature-rich modelling allows for dynamic updating and refinement of waiting time estimates as patient- and ED-specific information (e.g., patient condition, ED congestion levels) is revealed during the waiting process. Aspects relating to communicating forecast uncertai…
▽ More
We study the estimation of the probability distribution of individual patient waiting times in an emergency department (ED). Our feature-rich modelling allows for dynamic updating and refinement of waiting time estimates as patient- and ED-specific information (e.g., patient condition, ED congestion levels) is revealed during the waiting process. Aspects relating to communicating forecast uncertainty to patients, and implementing this methodology in practice, are also discussed.
△ Less
Submitted 30 May, 2020;
originally announced June 2020.
-
Maximum Entropy Multi-Task Inverse RL
Authors:
Saurabh Arora,
Bikramjit Banerjee,
Prashant Doshi
Abstract:
Multi-task IRL allows for the possibility that the expert could be switching between multiple ways of solving the same problem, or interleaving demonstrations of multiple tasks. The learner aims to learn the multiple reward functions that guide these ways of solving the problem. We present a new method for multi-task IRL that generalizes the well-known maximum entropy approach to IRL by combining…
▽ More
Multi-task IRL allows for the possibility that the expert could be switching between multiple ways of solving the same problem, or interleaving demonstrations of multiple tasks. The learner aims to learn the multiple reward functions that guide these ways of solving the problem. We present a new method for multi-task IRL that generalizes the well-known maximum entropy approach to IRL by combining it with the Dirichlet process based clustering of the observed input. This yields a single nonlinear optimization problem, called MaxEnt Multi-task IRL, which can be solved using the Lagrangian relaxation and gradient descent methods. We evaluate MaxEnt Multi-task IRL in simulation on the robotic task of sorting onions on a processing line where the expert utilizes multiple ways of detecting and removing blemished onions. The method is able to learn the underlying reward functions to a high level of accuracy and it improves on the previous approaches to multi-task IRL.
△ Less
Submitted 27 April, 2020;
originally announced April 2020.
-
Privacy-preserving Learning via Deep Net Pruning
Authors:
Yangsibo Huang,
Yushan Su,
Sachin Ravi,
Zhao Song,
Sanjeev Arora,
Kai Li
Abstract:
This paper attempts to answer the question whether neural network pruning can be used as a tool to achieve differential privacy without losing much data utility. As a first step towards understanding the relationship between neural network pruning and differential privacy, this paper proves that pruning a given layer of the neural network is equivalent to adding a certain amount of differentially…
▽ More
This paper attempts to answer the question whether neural network pruning can be used as a tool to achieve differential privacy without losing much data utility. As a first step towards understanding the relationship between neural network pruning and differential privacy, this paper proves that pruning a given layer of the neural network is equivalent to adding a certain amount of differentially private noise to its hidden-layer activations. The paper also presents experimental results to show the practical implications of the theoretical finding and the key parameter values in a simple practical setting. These results show that neural network pruning can be a more effective alternative to adding differentially private noise for neural networks.
△ Less
Submitted 3 March, 2020;
originally announced March 2020.
-
A Sample Complexity Separation between Non-Convex and Convex Meta-Learning
Authors:
Nikunj Saunshi,
Yi Zhang,
Mikhail Khodak,
Sanjeev Arora
Abstract:
One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex models. This work shows that convex-case analysis…
▽ More
One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any initialization-based meta-learning algorithm is $Ω(d)$, where $d$ is the input dimension. In contrast, for the non-convex formulation of a two layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of $\mathcal{O}(1)$, demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.
△ Less
Submitted 25 February, 2020;
originally announced February 2020.
-
Provable Representation Learning for Imitation Learning via Bi-level Optimization
Authors:
Sanjeev Arora,
Simon S. Du,
Sham Kakade,
Yuping Luo,
Nikunj Saunshi
Abstract:
A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem where the "outer" optimization tries to learn…
▽ More
A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem where the "outer" optimization tries to learn the joint representation and the "inner" optimization encodes the imitation learning setup and tries to learn task-specific parameters. We instantiate this framework for the imitation learning settings of behavior cloning and observation-alone. Theoretically, we show using our framework that representation learning can provide sample complexity benefits for imitation learning in both settings. We also provide proof-of-concept experiments to verify our theory.
△ Less
Submitted 24 February, 2020;
originally announced February 2020.
-
Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality
Authors:
Yi Zhang,
Orestis Plevrakis,
Simon S. Du,
Xingguo Li,
Zhao Song,
Sanjeev Arora
Abstract:
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised training was developed by various groups for {\em very ov…
▽ More
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised training was developed by various groups for {\em very overparametrized} nets. It is unclear how to extend these results to adversarial training because of the min-max objective. Recently, a first step towards this direction was made by Gao et al. using tools from online learning, but they require the width of the net to be \emph{exponential} in input dimension $d$, and with an unnatural activation function. Our work proves convergence to low robust training loss for \emph{polynomial} width instead of exponential, under natural assumptions and with the ReLU activation. Key element of our proof is showing that ReLU networks near initialization can approximate the step function, which may be of independent interest.
△ Less
Submitted 23 February, 2020; v1 submitted 16 February, 2020;
originally announced February 2020.
-
Newtonian Monte Carlo: single-site MCMC meets second-order gradient methods
Authors:
Nimar S. Arora,
Nazanin Khosravani Tehrani,
Kinjal Divesh Shah,
Michael Tingley,
Yucen Lily Li,
Narjes Torabi,
David Noursi,
Sepehr Akhavan Masouleh,
Eric Lippert,
Erik Meijer
Abstract:
Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in which a single coordinate in the state space is modified in each step. Structured relational models are a good candidate for this style of inference. In the single-site context, second order methods become feasible because the typical cubic costs associated with these methods is now restricted to the dimension of each coordinate.…
▽ More
Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in which a single coordinate in the state space is modified in each step. Structured relational models are a good candidate for this style of inference. In the single-site context, second order methods become feasible because the typical cubic costs associated with these methods is now restricted to the dimension of each coordinate. Our work, which we call Newtonian Monte Carlo (NMC), is a method to improve MCMC convergence by analyzing the first and second order gradients of the target density to determine a suitable proposal density at each point. Existing first order gradient-based methods suffer from the problem of determining an appropriate step size. Too small a step size and it will take a large number of steps to converge, while a very large step size will cause it to overshoot the high density region. NMC is similar to the Newton-Raphson update in optimization where the second order gradient is used to automatically scale the step size in each dimension. However, our objective is to find a parameterized proposal density rather than the maxima.
As a further improvement on existing first and second order methods, we show that random variables with constrained supports don't need to be transformed before taking a gradient step. We demonstrate the efficiency of NMC on a number of different domains. For statistical models where the prior is conjugate to the likelihood, our method recovers the posterior quite trivially in one step. However, we also show results on fairly large non-conjugate models, where NMC performs better than adaptive first order methods such as NUTS or other inexact scalable inference methods such as Stochastic Variational Inference or bootstrapping.
△ Less
Submitted 15 January, 2020;
originally announced January 2020.
-
Enhanced Convolutional Neural Tangent Kernels
Authors:
Zhiyuan Li,
Ruosong Wang,
Dingli Yu,
Simon S. Du,
Wei Hu,
Ruslan Salakhutdinov,
Sanjeev Arora
Abstract:
Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are tr…
▽ More
Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained. An exact algorithm to compute CNTK (Arora et al., 2019) yielded the finding that classification accuracy of CNTK on CIFAR-10 is within 6-7% of that of that of the corresponding CNN architecture (best figure being around 78%) which is interesting performance for a fixed kernel. Here we show how to significantly enhance the performance of these kernels using two ideas. (1) Modifying the kernel using a new operation called Local Average Pooling (LAP) which preserves efficient computability of the kernel and inherits the spirit of standard data augmentation using pixel shifts. Earlier papers were unable to incorporate naive data augmentation because of the quadratic training cost of kernel regression. This idea is inspired by Global Average Pooling (GAP), which we show for CNN-GP and CNTK is equivalent to full translation data augmentation. (2) Representing the input image using a pre-processing technique proposed by Coates et al. (2011), which uses a single convolutional layer composed of random image patches. On CIFAR-10, the resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet (Krizhevsky et al., 2012). Note that this is the best such result we know of for a classifier that is not a trained neural network. Similar improvements are obtained for Fashion-MNIST.
△ Less
Submitted 2 November, 2019;
originally announced November 2019.
-
An Exponential Learning Rate Schedule for Deep Learning
Authors:
Zhiyuan Li,
Sanjeev Arora
Abstract:
Intriguing empirical evidence exists that deep learning can work well with exoticschedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN, which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other…
▽ More
Intriguing empirical evidence exists that deep learning can work well with exoticschedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN, which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other words, the typical use case which was not considered in earlier theoretical analyses of stand-alone BN.
1. Training can be done using SGD with momentum and an exponentially increasing learning rate schedule, i.e., learning rate increases by some $(1 +α)$ factor in every epoch for some $α>0$. (Precise statement in the paper.) To the best of our knowledge this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. As expected, such training rapidly blows up network weights, but the net stays well-behaved due to normalization.
2. Mathematical explanation of the success of the above rate schedule: a rigorous proof that it is equivalent to the standard setting of BN + SGD + StandardRate Tuning + Weight Decay + Momentum. This equivalence holds for other normalization layers as well, Group Normalization, LayerNormalization, Instance Norm, etc.
3. A worked-out toy example illustrating the above linkage of hyper-parameters. Using either weight decay or BN alone reaches global minimum, but convergence fails when both are used.
△ Less
Submitted 21 November, 2019; v1 submitted 16 October, 2019;
originally announced October 2019.
-
Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
Authors:
Sanjeev Arora,
Simon S. Du,
Zhiyuan Li,
Ruslan Salakhutdinov,
Ruosong Wang,
Dingli Yu
Abstract:
Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et a…
▽ More
Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks.
1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets.
2. On CIFAR-10 with 10 - 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%.
3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance.
4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis(Arora et al., 2019a). NTK's efficacy may trace to lower variance of output.
△ Less
Submitted 27 October, 2019; v1 submitted 3 October, 2019;
originally announced October 2019.
-
Learning to Rank Broad and Narrow Queries in E-Commerce
Authors:
Siddhartha Devapujula,
Sagar Arora,
Sumit Borar
Abstract:
Search is a prominent channel for discovering products on an e-commerce platform. Ranking products retrieved from search becomes crucial to address customer's need and optimize for business metrics. While learning to Rank (LETOR) models have been extensively studied and have demonstrated efficacy in the context of web search; it is a relatively new research area to be explored in the e-commerce. I…
▽ More
Search is a prominent channel for discovering products on an e-commerce platform. Ranking products retrieved from search becomes crucial to address customer's need and optimize for business metrics. While learning to Rank (LETOR) models have been extensively studied and have demonstrated efficacy in the context of web search; it is a relatively new research area to be explored in the e-commerce. In this paper, we present a framework for building LETOR model for an e-commerce platform. We analyze user queries and propose a mechanism to segment queries between broad and narrow based on user's intent. We discuss different types of features - query, product and query-product and discuss challenges in using them. We show that sparsity in product features can be tackled through a denoising auto-encoder while skip-gram based word embeddings help solve the query-product sparsity issues. We also present various target metrics that can be employed for evaluating search results and compare their robustness. Further, we build and compare performances of both pointwise and pairwise LETOR models on fashion category data set. We also build and compare distinct models for broad and narrow queries, analyze feature importance across these and show that these specialized models perform better than a combined model in the fashion world.
△ Less
Submitted 15 July, 2019; v1 submitted 1 July, 2019;
originally announced July 2019.
-
One Embedding To Do Them All
Authors:
Loveperteek Singh,
Shreya Singh,
Sagar Arora,
Sumit Borar
Abstract:
Online shopping caters to the needs of millions of users daily. Search, recommendations, personalization have become essential building blocks for serving customer needs. Efficacy of such systems is dependent on a thorough understanding of products and their representation. Multiple information sources and data types provide a complete picture of the product on the platform. While each of these ta…
▽ More
Online shopping caters to the needs of millions of users daily. Search, recommendations, personalization have become essential building blocks for serving customer needs. Efficacy of such systems is dependent on a thorough understanding of products and their representation. Multiple information sources and data types provide a complete picture of the product on the platform. While each of these tasks shares some common characteristics, typically product embeddings are trained and used in isolation.
In this paper, we propose a framework to combine multiple data sources and learn unified embeddings for products on our e-commerce platform. Our product embeddings are built from three types of data sources - catalog text data, a user's clickstream session data and product images. We use various techniques like denoising auto-encoders for text, Bayesian personalized ranking (BPR) for clickstream data, Siamese neural network architecture for image data and combined ensemble over the above methods for unified embeddings. Further, we compare and analyze the performance of these embeddings across three unrelated real-world e-commerce tasks specifically checking product attribute coverage, finding similar products and predicting returns. We show that unified product embeddings perform uniformly well across all these tasks.
△ Less
Submitted 28 June, 2019;
originally announced June 2019.
-
Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets
Authors:
Rohith Kuditipudi,
Xiang Wang,
Holden Lee,
Yi Zhang,
Zhiyuan Li,
Wei Hu,
Sanjeev Arora,
Rong Ge
Abstract:
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic pr…
▽ More
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
△ Less
Submitted 6 January, 2020; v1 submitted 14 June, 2019;
originally announced June 2019.
-
Implicit Regularization in Deep Matrix Factorization
Authors:
Sanjeev Arora,
Nadav Cohen,
Wei Hu,
Yuping Luo
Abstract:
Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding…
▽ More
Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.
△ Less
Submitted 26 October, 2019; v1 submitted 31 May, 2019;
originally announced May 2019.
-
A Simple Saliency Method That Passes the Sanity Checks
Authors:
Arushi Gupta,
Sanjeev Arora
Abstract:
There is great interest in "saliency methods" (also called "attribution methods"), which give "explanations" for a deep net's decision, by assigning a "score" to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned the validity of many of these methods since they…
▽ More
There is great interest in "saliency methods" (also called "attribution methods"), which give "explanations" for a deep net's decision, by assigning a "score" to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned the validity of many of these methods since they do not pass simple *sanity checks* which test whether the scores shift/vanish when layers of the trained net are randomized, or when the net is retrained using random labels for inputs.
We propose a simple fix to existing saliency methods that helps them pass sanity checks, which we call "competition for pixels". This involves computing saliency maps for all possible labels in the classification task, and using a simple competition among them to identify and remove less relevant pixels from the map. The simplest variant of this is "Competitive Gradient $\odot$ Input (CGI)": it is efficient, requires no additional training, and uses only the input and gradient. Some theoretical justification is provided for it (especially for ReLU networks) and its performance is empirically demonstrated.
△ Less
Submitted 6 June, 2019; v1 submitted 27 May, 2019;
originally announced May 2019.
-
Developing a large scale population screening tool for the assessment of Parkinson's disease using telephone-quality voice
Authors:
Siddharth Arora,
Ladan Baghai-Ravary,
Athanasios Tsanas
Abstract:
Recent studies have demonstrated that analysis of laboratory-quality voice recordings can be used to accurately differentiate people diagnosed with Parkinson's disease (PD) from healthy controls (HC). These findings could help facilitate the development of remote screening and monitoring tools for PD. In this study, we analyzed 2759 telephone-quality voice recordings from 1483 PD and 15321 recordi…
▽ More
Recent studies have demonstrated that analysis of laboratory-quality voice recordings can be used to accurately differentiate people diagnosed with Parkinson's disease (PD) from healthy controls (HC). These findings could help facilitate the development of remote screening and monitoring tools for PD. In this study, we analyzed 2759 telephone-quality voice recordings from 1483 PD and 15321 recordings from 8300 HC participants. To account for variations in phonetic backgrounds, we acquired data from seven countries. We developed a statistical framework for analyzing voice, whereby we computed 307 dysphonia measures that quantify different properties of voice impairment, such as, breathiness, roughness, monopitch, hoarse voice quality, and exaggerated vocal tremor. We used feature selection algorithms to identify robust parsimonious feature subsets, which were used in combination with a Random Forests (RF) classifier to accurately distinguish PD from HC. The best 10-fold cross-validation performance was obtained using Gram-Schmidt Orthogonalization (GSO) and RF, leading to mean sensitivity of 64.90% (standard deviation, SD 2.90%) and mean specificity of 67.96% (SD 2.90%). This large-scale study is a step forward towards assessing the development of a reliable, cost-effective and practical clinical decision support tool for screening the population at large for PD using telephone-quality voice.
△ Less
Submitted 1 May, 2019;
originally announced May 2019.
-
On Exact Computation with an Infinitely Wide Neural Net
Authors:
Sanjeev Arora,
Simon S. Du,
Wei Hu,
Zhiyuan Li,
Ruslan Salakhutdinov,
Ruosong Wang
Abstract:
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its width --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries abo…
▽ More
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its width --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width.
The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being $10\%$ higher than the methods reported in [Novak et al., 2019], and only $6\%$ lower than the performance of the corresponding finite deep net architecture (once batch normalization, etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
△ Less
Submitted 4 November, 2019; v1 submitted 26 April, 2019;
originally announced April 2019.
-
A Theoretical Analysis of Contrastive Unsupervised Representation Learning
Authors:
Sanjeev Arora,
Hrishikesh Khandeparkar,
Mikhail Khodak,
Orestis Plevrakis,
Nikunj Saunshi
Abstract:
Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of sim…
▽ More
Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.
△ Less
Submitted 25 February, 2019;
originally announced February 2019.
-
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Authors:
Sanjeev Arora,
Simon S. Du,
Wei Hu,
Zhiyuan Li,
Ruosong Wang
Abstract:
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than recent papers, an explanation for why trai…
▽ More
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17].
(ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size.
(iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent.
The key idea is to track dynamics of training and generalization via properties of a related kernel.
△ Less
Submitted 27 May, 2019; v1 submitted 24 January, 2019;
originally announced January 2019.
-
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
Authors:
Sanjeev Arora,
Zhiyuan Li,
Kaifeng Lyu
Abstract:
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuni…
▽ More
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
△ Less
Submitted 10 December, 2018;
originally announced December 2018.
-
Investigating Voice as a Biomarker for leucine-rich repeat kinase 2-Associated Parkinson's Disease
Authors:
S. Arora,
N. P. Visanji,
T. A. Mestre,
A. Tsanas,
A. AlDakheel,
B. S. Connolly,
C. Gasca-Salas,
D. S. Kern,
J. Jain,
E. J. Slow,
A. Faust-Socher,
A. E. Lang,
M. A. Little,
C. Marras
Abstract:
We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD,…
▽ More
We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD, the mean sensitivity was 95.4% (SD 17.8%) and mean specificity was 89.6% (SD 26.5%). Voice features for non-manifesting carriers, related non-carriers, and controls were much less discriminatory. Vocal deficits in LRRK2-associated PD may be different than those in iPD. These preliminary results warrant longitudinal analyses and replication in larger cohorts
△ Less
Submitted 20 October, 2018;
originally announced October 2018.
-
A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
Authors:
Sanjeev Arora,
Nadav Cohen,
Noah Golowich,
Wei Hu
Abstract:
We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices a…
▽ More
We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).
△ Less
Submitted 26 October, 2019; v1 submitted 4 October, 2018;
originally announced October 2018.
-
A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress
Authors:
Saurabh Arora,
Prashant Doshi
Abstract:
Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an agent, given its policy or observed behavior. Analogous to RL, IRL is perceived both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners of machine learning and beyond to understand the challeng…
▽ More
Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an agent, given its policy or observed behavior. Analogous to RL, IRL is perceived both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners of machine learning and beyond to understand the challenges of IRL and select the approaches best suited for the problem on hand. The survey formally introduces the IRL problem along with its central challenges such as the difficulty in performing accurate inference and its generalizability, its sensitivity to prior knowledge, and the disproportionate growth in solution complexity with problem size. The article elaborates how the current methods mitigate these challenges. We further discuss the extensions to traditional IRL methods for handling: inaccurate and incomplete perception, an incomplete model, multiple reward functions, and nonlinear reward functions. This survey concludes the discussion with some broad advances in the research area and currently open research questions.
△ Less
Submitted 18 November, 2020; v1 submitted 18 June, 2018;
originally announced June 2018.
-
A Framework and Method for Online Inverse Reinforcement Learning
Authors:
Saurabh Arora,
Prashant Doshi,
Bikramjit Banerjee
Abstract:
Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from the observations of its behavior on a task. While this problem has been well investigated, the related problem of {\em online} IRL---where the observations are incrementally accrued, yet the demands of the application often prohibit a full rerun of an IRL method---has received relatively less attention…
▽ More
Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from the observations of its behavior on a task. While this problem has been well investigated, the related problem of {\em online} IRL---where the observations are incrementally accrued, yet the demands of the application often prohibit a full rerun of an IRL method---has received relatively less attention. We introduce the first formal framework for online IRL, called incremental IRL (I2RL), and a new method that advances maximum entropy IRL with hidden variables, to this setting. Our formal analysis shows that the new method has a monotonically improving performance with more demonstration data, as well as probabilistically bounded error, both under full and partial observability. Experiments in a simulated robotic application of penetrating a continuous patrol under occlusion shows the relatively improved performance and speed up of the new method and validates the utility of online IRL.
△ Less
Submitted 20 May, 2018;
originally announced May 2018.
-
Short Term Load Forecasts of Low Voltage Demand and the Effects of Weather
Authors:
Stephen Haben,
Georgios Giasemidis,
Florian Ziel,
Siddharth Arora
Abstract:
Short term load forecasts will play a key role in the implementation of smart electricity grids. They are required to optimise a wide range of potential network solutions on the low voltage (LV) grid, including integrating low carbon technologies (such as photovoltaics) and utilising battery storage devices. Despite the need for accurate LV level load forecasts, previous studies have mostly focuse…
▽ More
Short term load forecasts will play a key role in the implementation of smart electricity grids. They are required to optimise a wide range of potential network solutions on the low voltage (LV) grid, including integrating low carbon technologies (such as photovoltaics) and utilising battery storage devices. Despite the need for accurate LV level load forecasts, previous studies have mostly focused on forecasting at the individual household or building level using data from smart meters. In this study we provide detailed analysis of a variety of methods in terms of both point and probabilistic forecasting accuracy using data from 100 real LV feeders. Moreover, we investigate the effect of temperature (both actual and forecasts) on the accuracy of load forecasts. We present some important results on the drivers of LV forecasting accuracy that are crucial for the management of LV networks, along with an empirical comparison of forecast measures.
△ Less
Submitted 6 April, 2018;
originally announced April 2018.
-
Rule-based Autoregressive Moving Average Models for Forecasting Load on Special Days: A Case Study for France
Authors:
Siddharth Arora,
James W. Taylor
Abstract:
This paper presents a case study on short-term load forecasting for France, with emphasis on special days, such as public holidays. We investigate the generalisability to French data of a recently proposed approach, which generates forecasts for normal and special days in a coherent and unified framework, by incorporating subjective judgment in univariate statistical models using a rule-based meth…
▽ More
This paper presents a case study on short-term load forecasting for France, with emphasis on special days, such as public holidays. We investigate the generalisability to French data of a recently proposed approach, which generates forecasts for normal and special days in a coherent and unified framework, by incorporating subjective judgment in univariate statistical models using a rule-based methodology. The intraday, intraweek, and intrayear seasonality in load are accommodated using a rule-based triple seasonal adaptation of a seasonal autoregressive moving average (SARMA) model. We find that, for application to French load, the method requires an important adaption. We also adapt a recently proposed SARMA model that accommodates special day effects on an hourly basis using indicator variables. Using a rule formulated specifically for the French load, we compare the SARMA models with a range of different benchmark methods based on an evaluation of their point and density forecast accuracy. As sophisticated benchmarks, we employ the rule-based triple seasonal adaptations of Holt-Winters-Taylor (HWT) exponential smoothing and artificial neural networks (ANNs). We use nine years of half-hourly French load data, and consider lead times ranging from one half-hour up to a day ahead. The rule-based SARMA approach generated the most accurate forecasts.
△ Less
Submitted 26 March, 2018;
originally announced March 2018.
-
Theoretical limitations of Encoder-Decoder GAN architectures
Authors:
Sanjeev Arora,
Andrej Risteski,
Yi Zhang
Abstract:
Encoder-decoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the data-point and vice-versa, which should improv…
▽ More
Encoder-decoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the data-point and vice-versa, which should improve the learning of the target distribution and ameliorate mode-collapse. It should also yield meaningful codes that are useful as features for downstream tasks. The current paper shows rigorously that even on real-life distributions of images, the encode-decoder GAN training objectives (a) cannot prevent mode collapse; i.e. the objective can be near-optimal even when the generated distribution has low and finite support (b) cannot prevent learning meaningless codes for data -- essentially white noise. Thus if encoder-decoder GANs do indeed work then it must be due to reasons as yet not understood, since the training objective can be low even for meaningless solutions.
△ Less
Submitted 7 November, 2017;
originally announced November 2017.
-
Provable benefits of representation learning
Authors:
Sanjeev Arora,
Andrej Risteski
Abstract:
There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc.
To study the relative merits of these…
▽ More
There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc.
To study the relative merits of these techniques, it's essential to formalize the definition and goals of representation learning, so that they are all become instances of the same definition. This paper introduces such a formal framework that also formalizes the utility of learning the representation. It is related to previous Bayesian notions, but with some new twists. We show the usefulness of our framework by exhibiting simple and natural settings -- linear mixture models and loglinear models, where the power of representation learning can be formally shown. In these examples, representation learning can be performed provably and efficiently under plausible assumptions (despite being NP-hard), and furthermore: (i) it greatly reduces the need for labeled data (semi-supervised learning) and (ii) it allows solving classification tasks when simpler approaches like nearest neighbors require too much data (iii) it is more powerful than manifold learning methods.
△ Less
Submitted 14 June, 2017;
originally announced June 2017.
-
Generalization and Equilibrium in Generative Adversarial Nets (GANs)
Authors:
Sanjeev Arora,
Rong Ge,
Yingyu Liang,
Tengyu Ma,
Yi Zhang
Abstract:
We show that training of generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/gener…
▽ More
We show that training of generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a special class of generators with natural training objectives when generator capacity and training set sizes are moderate.
This existence of equilibrium inspires MIX+GAN protocol, which can be combined with any existing GAN training, and empirically shown to improve some of them.
△ Less
Submitted 1 August, 2017; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Provable learning of Noisy-or Networks
Authors:
Sanjeev Arora,
Rong Ge,
Tengyu Ma,
Andrej Risteski
Abstract:
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models w…
▽ More
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structures: topic models, mixture models, hidden markov models, etc. These algorithms use matrix or tensor decomposition, and make some reasonable assumptions about the parameters of the underlying model.
But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied for learning the single-layer {\em noisy or} network, which is a textbook example of a Bayes net, and used for example in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits.
The technical novelty here, which should be useful in other settings in future, is analysis of tensor decomposition in presence of systematic error (i.e., where the noise/error is correlated with the signal, and doesn't decrease as number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up.
For simplicity our analysis is stated assuming that the network parameters were chosen from a probability distribution but the method seems more generally applicable.
△ Less
Submitted 27 December, 2016;
originally announced December 2016.
-
Provable Algorithms for Inference in Topic Models
Authors:
Sanjeev Arora,
Rong Ge,
Frederic Koehler,
Tengyu Ma,
Ankur Moitra
Abstract:
Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us…
▽ More
Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a {\em single} iteration of Gibbs sampling.
△ Less
Submitted 26 May, 2016;
originally announced May 2016.
-
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Authors:
Sanjeev Arora,
Yuanzhi Li,
Yingyu Liang,
Tengyu Ma,
Andrej Risteski
Abstract:
Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathemat…
▽ More
Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathematically explained using a variant of the random walk on discourses model (Arora et al., 2016). A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense. Discourse atoms can be of independent interest, and make the method potentially more useful. Empirical tests are used to verify and support the theory.
△ Less
Submitted 7 December, 2018; v1 submitted 14 January, 2016;
originally announced January 2016.
-
Simple, Efficient, and Neural Algorithms for Sparse Coding
Authors:
Sanjeev Arora,
Rong Ge,
Tengyu Ma,
Ankur Moitra
Abstract:
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Re- cent work has resulted in several a…
▽ More
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Re- cent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.
△ Less
Submitted 2 March, 2015;
originally announced March 2015.
-
A Latent Variable Model Approach to PMI-based Word Embeddings
Authors:
Sanjeev Arora,
Yuanzhi Li,
Yingyu Liang,
Tengyu Ma,
Andrej Risteski
Abstract:
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the log-linear topic model of~\citet{mnih2007three}. The methodological novelty is to use the prior to com…
▽ More
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the log-linear topic model of~\citet{mnih2007three}. The methodological novelty is to use the prior to compute closed form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by~\citet{mikolov2013efficient} and many subsequent papers.
Experimental support is provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.
△ Less
Submitted 19 June, 2019; v1 submitted 11 February, 2015;
originally announced February 2015.