Showing 1–50 of 56 results for author: Arora, S

Searching in archive stat.
  1. arXiv:2503.15477  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

    Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Code available at https://github.com/princeton-pli/what-makes-good-rm

  2. arXiv:2503.02877  [pdf, other]

    cs.LG stat.ML

    Weak-to-Strong Generalization Even in Random Feature Networks, Provably

    Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro

    Abstract: Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed…

    Submitted 4 March, 2025; originally announced March 2025.

  3. arXiv:2502.03669  [pdf, other]

    cs.LG cs.AI cs.DM math.OC stat.ML

    Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

    Authors: Yikai Wu, Haoyu Zhao, Sanjeev Arora

    Abstract: AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Experiments on standard graph families show that AI-based algorithms fail to outperform and, in many cases, to match the soluti…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: 24 pages, 7 figures, 8 tables

  4. arXiv:2410.10254  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    LoLCATs: On Low-Rank Linearizing of Large Language Models

    Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré

    Abstract: Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. W…

    Submitted 5 March, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: 58 pages, 25 figures, 26 tables, ICLR 2025

  5. arXiv:2410.08847  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

    Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

    Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on th…

    Submitted 27 April, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: Accepted to ICLR 2025; Code available at https://github.com/princeton-nlp/unintentional-unalignment

  6. arXiv:2307.15936  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    A Theory for Emergence of Complex Skills in Language Models

    Authors: Sanjeev Arora, Anirudh Goyal

    Abstract: A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws…

    Submitted 5 November, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

  7. arXiv:2211.02912  [pdf, other]

    stat.ML cs.LG

    New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

    Authors: Arushi Gupta, Nikunj Saunshi, Dingli Yu, Kaifeng Lyu, Sanjeev Arora

    Abstract: Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with "uninformative" pixels, and checking if th…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: NeurIPS 2022 (Oral)

  8. arXiv:2110.06914  [pdf, other]

    cs.LG stat.ML

    What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

    Authors: Zhiyuan Li, Tianhao Wang, Sanjeev Arora

    Abstract: Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $η$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further…

    Submitted 28 July, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 56 pages, 2 figures; ICLR 2022

  9. Review of Low Voltage Load Forecasting: Methods, Applications, and Recommendations

    Authors: Stephen Haben, Siddharth Arora, Georgios Giasemidis, Marcus Voss, Danica Vukadinovic Greetham

    Abstract: The increased digitalisation and monitoring of the energy system opens up numerous opportunities to decarbonise the energy system. Applications on low voltage, local networks, such as community energy markets and smart storage will facilitate decarbonisation, but they will require advanced control and management. Reliable forecasting will be a necessary component of many of these systems to antici…

    Submitted 5 September, 2021; v1 submitted 30 May, 2021; originally announced June 2021.

    Comments: 37 pages, 6 figures, 2 tables, review paper

    Journal ref: Applied Energy 304 (2021) 117798

  10. arXiv:2102.13189  [pdf, ps, other]

    cs.LG stat.ML

    Rip van Winkle's Razor: A Simple Estimate of Overfit to Test Data

    Authors: Sanjeev Arora, Yi Zhang

    Abstract: Traditional statistics forbids use of test data (a.k.a. holdout data) during training. Dwork et al. 2015 pointed out that current practices in machine learning, whereby researchers build upon each other's models, copying hyperparameters and even computer code -- amounts to implicitly training on the test set. Thus error rate on test data may not reflect the true population error. This observation…

    Submitted 25 February, 2021; originally announced February 2021.

  11. arXiv:2102.12470  [pdf, other]

    cs.LG stat.ML

    On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

    Authors: Zhiyuan Li, Sadhika Malladi, Sanjeev Arora

    Abstract: It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019)) only applies to SGD with tiny LR. Experimental verificatio…

    Submitted 16 June, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: 36 pages, 20 figures

  12. arXiv:2010.08515  [pdf, other]

    cs.LG stat.ML

    Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

    Authors: Zhiyuan Li, Yi Zhang, Sanjeev Arora

    Abstract: Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional net (for a fixed task). Thus the training algorith…

    Submitted 4 May, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

    Comments: 24 pages, 1 figure; Accepted by ICLR 2021

  13. arXiv:2010.06053  [pdf, other]

    cs.CL cs.CR cs.DS cs.LG stat.ML

    TextHide: Tackling Data Privacy in Language Understanding Tasks

    Authors: Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, Sanjeev Arora

    Abstract: An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data. Su…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020

  14. arXiv:2010.03648  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

    Authors: Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora

    Abstract: Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions…

    Submitted 14 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

    Comments: This version is the camera-ready version for ICLR 2021. Main changes include a detailed discussion about natural tasks, more detailed proof sketch and updated experimental evaluations

  15. arXiv:2010.02772  [pdf, other]

    cs.CR cs.CC cs.DS cs.LG stat.ML

    InstaHide: Instance-hiding Schemes for Private Distributed Learning

    Authors: Yangsibo Huang, Zhao Song, Kai Li, Sanjeev Arora

    Abstract: How can multiple distributed entities collaboratively train a shared deep net on their private data while preserving privacy? This paper introduces InstaHide, a simple encryption of training images, which can be plugged into existing distributed deep learning pipelines. The encryption is efficient and applying it during training has minor effect on test accuracy. InstaHide encrypts each training…

    Submitted 24 February, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: ICML 2020

  16. arXiv:2006.04509  [pdf, other]

    cs.AI cs.DB cs.LG stat.ML

    IterefinE: Iterative KG Refinement Embeddings using Symbolic Knowledge

    Authors: Siddhant Arora, Srikanta Bedathur, Maya Ramanath, Deepak Sharma

    Abstract: Knowledge Graphs (KGs) extracted from text sources are often noisy and lead to poor performance in downstream application tasks such as KG-based question answering. While much of the recent activity is focused on addressing the sparsity of KGs by using embeddings for inferring new facts, the issue of cleaning up of noise in KGs through KG refinement task is not as actively studied. Most successful…

    Submitted 3 June, 2020; originally announced June 2020.

    Comments: 16 pages, 7 figures, AKBC 2020 Conference

  17. Probabilistic Forecasting of Patient Waiting Times in an Emergency Department

    Authors: Siddharth Arora, James W. Taylor, Ho-Yin Mak

    Abstract: We study the estimation of the probability distribution of individual patient waiting times in an emergency department (ED). Our feature-rich modelling allows for dynamic updating and refinement of waiting time estimates as patient- and ED-specific information (e.g., patient condition, ED congestion levels) is revealed during the waiting process. Aspects relating to communicating forecast uncertai…

    Submitted 30 May, 2020; originally announced June 2020.

  18. arXiv:2004.12873  [pdf, ps, other]

    cs.LG cs.AI cs.RO stat.ML

    Maximum Entropy Multi-Task Inverse RL

    Authors: Saurabh Arora, Bikramjit Banerjee, Prashant Doshi

    Abstract: Multi-task IRL allows for the possibility that the expert could be switching between multiple ways of solving the same problem, or interleaving demonstrations of multiple tasks. The learner aims to learn the multiple reward functions that guide these ways of solving the problem. We present a new method for multi-task IRL that generalizes the well-known maximum entropy approach to IRL by combining…

    Submitted 27 April, 2020; originally announced April 2020.

  19. arXiv:2003.01876  [pdf, other]

    cs.LG cs.CR stat.ML

    Privacy-preserving Learning via Deep Net Pruning

    Authors: Yangsibo Huang, Yushan Su, Sachin Ravi, Zhao Song, Sanjeev Arora, Kai Li

    Abstract: This paper attempts to answer the question whether neural network pruning can be used as a tool to achieve differential privacy without losing much data utility. As a first step towards understanding the relationship between neural network pruning and differential privacy, this paper proves that pruning a given layer of the neural network is equivalent to adding a certain amount of differentially…

    Submitted 3 March, 2020; originally announced March 2020.

  20. arXiv:2002.11172  [pdf, other]

    cs.LG math.OC stat.ML

    A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

    Authors: Nikunj Saunshi, Yi Zhang, Mikhail Khodak, Sanjeev Arora

    Abstract: One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex models. This work shows that convex-case analysis…

    Submitted 25 February, 2020; originally announced February 2020.

    Comments: 34 pages

  21. arXiv:2002.10544  [pdf, other]

    cs.LG cs.AI stat.ML

    Provable Representation Learning for Imitation Learning via Bi-level Optimization

    Authors: Sanjeev Arora, Simon S. Du, Sham Kakade, Yuping Luo, Nikunj Saunshi

    Abstract: A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem where the "outer" optimization tries to learn…

    Submitted 24 February, 2020; originally announced February 2020.

    Comments: 26 pages

  22. arXiv:2002.06668  [pdf, other]

    cs.LG stat.ML

    Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

    Authors: Yi Zhang, Orestis Plevrakis, Simon S. Du, Xingguo Li, Zhao Song, Sanjeev Arora

    Abstract: Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised training was developed by various groups for very ov…

    Submitted 23 February, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

  23. arXiv:2001.05567  [pdf, other]

    cs.LG stat.ML

    Newtonian Monte Carlo: single-site MCMC meets second-order gradient methods

    Authors: Nimar S. Arora, Nazanin Khosravani Tehrani, Kinjal Divesh Shah, Michael Tingley, Yucen Lily Li, Narjes Torabi, David Noursi, Sepehr Akhavan Masouleh, Eric Lippert, Erik Meijer

    Abstract: Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in which a single coordinate in the state space is modified in each step. Structured relational models are a good candidate for this style of inference. In the single-site context, second order methods become feasible because the typical cubic costs associated with these methods are now restricted to the dimension of each coordinate.…

    Submitted 15 January, 2020; originally announced January 2020.

    Comments: StarAI has a 6 page limit excluding references

  24. arXiv:1911.00809  [pdf, other]

    cs.LG cs.CV cs.NE stat.ML

    Enhanced Convolutional Neural Tangent Kernels

    Authors: Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora

    Abstract: Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are tr…

    Submitted 2 November, 2019; originally announced November 2019.

  25. arXiv:1910.07454  [pdf, other]

    cs.LG stat.ML

    An Exponential Learning Rate Schedule for Deep Learning

    Authors: Zhiyuan Li, Sanjeev Arora

    Abstract: Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN, which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other…

    Submitted 21 November, 2019; v1 submitted 16 October, 2019; originally announced October 2019.

  26. arXiv:1910.01663  [pdf, ps, other]

    cs.LG stat.ML

    Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

    Authors: Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu

    Abstract: Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et a…

    Submitted 27 October, 2019; v1 submitted 3 October, 2019; originally announced October 2019.

    Comments: Code for UCI experiments: https://github.com/LeoYu/neural-tangent-kernel-UCI

  27. arXiv:1907.01549  [pdf, other]

    cs.IR cs.CL cs.LG stat.ML

    Learning to Rank Broad and Narrow Queries in E-Commerce

    Authors: Siddhartha Devapujula, Sagar Arora, Sumit Borar

    Abstract: Search is a prominent channel for discovering products on an e-commerce platform. Ranking products retrieved from search becomes crucial to address customers' needs and optimize for business metrics. While Learning to Rank (LETOR) models have been extensively studied and have demonstrated efficacy in the context of web search, it is a relatively new research area to be explored in e-commerce. I…

    Submitted 15 July, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 7+1 pages

  28. arXiv:1906.12120  [pdf, other]

    cs.LG cs.IR stat.ML

    One Embedding To Do Them All

    Authors: Loveperteek Singh, Shreya Singh, Sagar Arora, Sumit Borar

    Abstract: Online shopping caters to the needs of millions of users daily. Search, recommendations, personalization have become essential building blocks for serving customer needs. Efficacy of such systems is dependent on a thorough understanding of products and their representation. Multiple information sources and data types provide a complete picture of the product on the platform. While each of these ta…

    Submitted 28 June, 2019; originally announced June 2019.

  29. arXiv:1906.06247  [pdf, other]

    cs.LG stat.ML

    Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

    Authors: Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Sanjeev Arora, Rong Ge

    Abstract: Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic pr…

    Submitted 6 January, 2020; v1 submitted 14 June, 2019; originally announced June 2019.

  30. arXiv:1905.13655  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    Implicit Regularization in Deep Matrix Factorization

    Authors: Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo

    Abstract: Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding…

    Submitted 26 October, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

    Comments: Published at the conference on Neural Information Processing Systems (NeurIPS) 2019

  31. arXiv:1905.12152  [pdf, other]

    cs.LG cs.AI stat.ML

    A Simple Saliency Method That Passes the Sanity Checks

    Authors: Arushi Gupta, Sanjeev Arora

    Abstract: There is great interest in "saliency methods" (also called "attribution methods"), which give "explanations" for a deep net's decision, by assigning a "score" to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned the validity of many of these methods since they…

    Submitted 6 June, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Small typo on paragraph 3 of section 3 fixed

  32. arXiv:1905.00377  [pdf]

    stat.AP cs.SD eess.AS

    Developing a large scale population screening tool for the assessment of Parkinson's disease using telephone-quality voice

    Authors: Siddharth Arora, Ladan Baghai-Ravary, Athanasios Tsanas

    Abstract: Recent studies have demonstrated that analysis of laboratory-quality voice recordings can be used to accurately differentiate people diagnosed with Parkinson's disease (PD) from healthy controls (HC). These findings could help facilitate the development of remote screening and monitoring tools for PD. In this study, we analyzed 2759 telephone-quality voice recordings from 1483 PD and 15321 recordi…

    Submitted 1 May, 2019; originally announced May 2019.

    Comments: 43 pages, 5 figures, 6 tables

  33. arXiv:1904.11955  [pdf, ps, other]

    cs.LG cs.CV cs.NE stat.ML

    On Exact Computation with an Infinitely Wide Neural Net

    Authors: Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang

    Abstract: How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its width --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries abo…

    Submitted 4 November, 2019; v1 submitted 26 April, 2019; originally announced April 2019.

    Comments: In NeurIPS 2019. Code available: https://github.com/ruosongwang/cntk

  34. arXiv:1902.09229  [pdf, other]

    cs.LG cs.AI stat.ML

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    Authors: Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, Nikunj Saunshi

    Abstract: Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of sim…

    Submitted 25 February, 2019; originally announced February 2019.

    Comments: 19 pages, 5 figures

  35. arXiv:1901.08584  [pdf, ps, other]

    cs.LG cs.NE stat.ML

    Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

    Authors: Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruosong Wang

    Abstract: Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why trai…

    Submitted 27 May, 2019; v1 submitted 24 January, 2019; originally announced January 2019.

    Comments: In ICML 2019

  36. arXiv:1812.03981  [pdf, other]

    cs.LG stat.ML

    Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

    Authors: Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu

    Abstract: Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuni…

    Submitted 10 December, 2018; originally announced December 2018.

    Comments: 22 pages

  37. arXiv:1810.08807  [pdf]

    stat.AP

    Investigating Voice as a Biomarker for leucine-rich repeat kinase 2-Associated Parkinson's Disease

    Authors: S. Arora, N. P. Visanji, T. A. Mestre, A. Tsanas, A. AlDakheel, B. S. Connolly, C. Gasca-Salas, D. S. Kern, J. Jain, E. J. Slow, A. Faust-Socher, A. E. Lang, M. A. Little, C. Marras

    Abstract: We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD,…

    Submitted 20 October, 2018; originally announced October 2018.

    Comments: 27 pages including supplemental information, Journal of Parkinson's Disease, 2018

  38. arXiv:1810.02281  [pdf, other]

    cs.LG cs.NE stat.ML

    A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

    Authors: Sanjeev Arora, Nadav Cohen, Noah Golowich, Wei Hu

    Abstract: We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices a…

    Submitted 26 October, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

    Comments: Published as a conference paper at ICLR 2019

  39. arXiv:1806.06877  [pdf, other]

    cs.LG stat.ML

    A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress

    Authors: Saurabh Arora, Prashant Doshi

    Abstract: Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an agent, given its policy or observed behavior. Analogous to RL, IRL is perceived both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners of machine learning and beyond to understand the challeng…

    Submitted 18 November, 2020; v1 submitted 18 June, 2018; originally announced June 2018.

  40. arXiv:1805.07871  [pdf, other]

    cs.LG cs.AI stat.ML

    A Framework and Method for Online Inverse Reinforcement Learning

    Authors: Saurabh Arora, Prashant Doshi, Bikramjit Banerjee

    Abstract: Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from the observations of its behavior on a task. While this problem has been well investigated, the related problem of online IRL---where the observations are incrementally accrued, yet the demands of the application often prohibit a full rerun of an IRL method---has received relatively less attention…

    Submitted 20 May, 2018; originally announced May 2018.

    Journal ref: Journal of Autonomous Agents and Multi-Agent Systems, Volume 35, Article number: 4 (2021)

  41. arXiv:1804.02955  [pdf, other]

    stat.AP physics.soc-ph

    Short Term Load Forecasts of Low Voltage Demand and the Effects of Weather

    Authors: Stephen Haben, Georgios Giasemidis, Florian Ziel, Siddharth Arora

    Abstract: Short term load forecasts will play a key role in the implementation of smart electricity grids. They are required to optimise a wide range of potential network solutions on the low voltage (LV) grid, including integrating low carbon technologies (such as photovoltaics) and utilising battery storage devices. Despite the need for accurate LV level load forecasts, previous studies have mostly focuse…

    Submitted 6 April, 2018; originally announced April 2018.

    Journal ref: International Journal of Forecasting, 35.4 (2019) 1469-1484

  42. arXiv:1803.09590  [pdf]

    stat.AP

    Rule-based Autoregressive Moving Average Models for Forecasting Load on Special Days: A Case Study for France

    Authors: Siddharth Arora, James W. Taylor

    Abstract: This paper presents a case study on short-term load forecasting for France, with emphasis on special days, such as public holidays. We investigate the generalisability to French data of a recently proposed approach, which generates forecasts for normal and special days in a coherent and unified framework, by incorporating subjective judgment in univariate statistical models using a rule-based meth…

    Submitted 26 March, 2018; originally announced March 2018.

    Comments: 11 figures, 3 tables

  43. arXiv:1711.02651  [pdf, other]

    cs.LG stat.ML

    Theoretical limitations of Encoder-Decoder GAN architectures

    Authors: Sanjeev Arora, Andrej Risteski, Yi Zhang

    Abstract: Encoder-decoder GAN architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GAN setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the data-point and vice-versa, which should improv…

    Submitted 7 November, 2017; originally announced November 2017.

  44. arXiv:1706.04601  [pdf, ps, other]

    cs.LG stat.ML

    Provable benefits of representation learning

    Authors: Sanjeev Arora, Andrej Risteski

    Abstract: There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc. To study the relative merits of these…

    Submitted 14 June, 2017; originally announced June 2017.

    Comments: 22 pages

  45. arXiv:1703.00573  [pdf, other]

    cs.LG cs.NE stat.ML

    Generalization and Equilibrium in Generative Adversarial Nets (GANs)

    Authors: Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang

    Abstract: We show that training of a generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/gener…

    Submitted 1 August, 2017; v1 submitted 1 March, 2017; originally announced March 2017.

    Comments: This is an updated version of an ICML'17 paper with the same title. The main difference is that in the ICML'17 version the pure equilibrium result was only proved for Wasserstein GAN. In the current version the result applies to most reasonable training objectives. In particular, Theorem 4.3 now applies to both original GAN and Wasserstein GAN

  46. arXiv:1612.08795  [pdf, ps, other]

    cs.LG cs.DS stat.ML

    Provable learning of Noisy-or Networks

    Authors: Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski

    Abstract: Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models w…

    Submitted 27 December, 2016; originally announced December 2016.

  47. arXiv:1605.08491  [pdf, other]

    cs.LG stat.ML

    Provable Algorithms for Inference in Topic Models

    Authors: Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, Ankur Moitra

    Abstract: Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us…

    Submitted 26 May, 2016; originally announced May 2016.

    Comments: to appear at ICML'2016

  48. arXiv:1601.03764  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Linear Algebraic Structure of Word Senses, with Applications to Polysemy

    Authors: Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

    Abstract: Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathemat…

    Submitted 7 December, 2018; v1 submitted 14 January, 2016; originally announced January 2016.

    Comments: Appears in the Transactions of the Association for Computational Linguistics 2018, link: https://transacl.org/ojs/index.php/tacl/article/view/1346

  49. arXiv:1503.00778  [pdf, other]

    cs.LG cs.DS cs.NE stat.ML

    Simple, Efficient, and Neural Algorithms for Sparse Coding

    Authors: Sanjeev Arora, Rong Ge, Tengyu Ma, Ankur Moitra

    Abstract: Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several a…

    Submitted 2 March, 2015; originally announced March 2015.

    Comments: 37 pages, 1 figure

  50. arXiv:1502.03520  [pdf, other]

    cs.LG cs.CL stat.ML

    A Latent Variable Model Approach to PMI-based Word Embeddings

    Authors: Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

    Abstract: Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The methodological novelty is to use the prior to com…

    Submitted 19 June, 2019; v1 submitted 11 February, 2015; originally announced February 2015.

    Comments: Appears in Transactions of the Association for Computational Linguistics (TACL), 2016