Showing 1–43 of 43 results for author: Bhojanapalli, S

  1. arXiv:2410.15787  [pdf, other]

    cs.LG cs.AI

    Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

    Authors: Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

    Abstract: Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (…

    Submitted 17 April, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: 44 pages, 20 figures, 26 tables, accepted to ICLR 2025

  2. arXiv:2410.11135  [pdf, other]

    cs.LG cs.CL

    Mimetic Initialization Helps State Space Models Learn to Recall

    Authors: Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli

    Abstract: Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks due to the fact that their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We inve…

    Submitted 14 October, 2024; originally announced October 2024.

  3. arXiv:2405.20671  [pdf, other]

    cs.LG cs.AI cs.CL

    Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

    Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

    Abstract: Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absol…

    Submitted 30 October, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024. 76 pages. 23 figures. 90 tables

  4. arXiv:2403.08100  [pdf, other]

    cs.LG cs.CR cs.DC

    Efficient Language Model Architectures for Differentially Private Federated Learning

    Authors: Jae Hun Ro, Srinadh Bhojanapalli, Zheng Xu, Yanxiang Zhang, Ananda Theertha Suresh

    Abstract: Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data leaving the devices. SGD is the standard client optimizer for on device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred as th…

    Submitted 12 March, 2024; originally announced March 2024.

  5. arXiv:2402.09360  [pdf, other]

    cs.LG cs.AI

    HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

    Authors: Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

    Abstract: Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model…

    Submitted 14 February, 2024; originally announced February 2024.

  6. arXiv:2310.10636  [pdf, other]

    cs.LG

    Dual-Encoders for Extreme Multi-Label Classification

    Authors: Nilesh Gupta, Devvrit Khatri, Ankit S Rawat, Srinadh Bhojanapalli, Prateek Jain, Inderjit Dhillon

    Abstract: Dual-encoder (DE) models are widely used in retrieval tasks, most commonly studied on open QA benchmarks that are often characterized by multi-class and limited training data. In contrast, their performance in multi-label and data-rich retrieval settings like extreme multi-label classification (XMC), remains under-explored. Current empirical evidence indicates that DE models fall significantly sho…

    Submitted 17 March, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: 27 pages, 8 figures

    Journal ref: ICLR 2024 camera-ready publication

  7. arXiv:2310.04418  [pdf, other]

    cs.LG

    Functional Interpolation for Relative Positions Improves Long Context Transformers

    Authors: Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

    Abstract: Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. W…

    Submitted 2 March, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 26 pages; ICLR 2024 camera ready version

  8. arXiv:2305.07810  [pdf, ps, other]

    cs.LG stat.ML

    Depth Dependence of $μ$P Learning Rates in ReLU MLPs

    Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

    Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A…

    Submitted 12 May, 2023; originally announced May 2023.

  9. arXiv:2301.12923  [pdf, other]

    cs.LG cs.AI stat.ML

    On student-teacher deviations in distillation: does it pay to disobey?

    Authors: Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

    Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in…

    Submitted 18 March, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

  10. arXiv:2210.10253  [pdf, other]

    cs.LG cs.AI cs.CR cs.CV

    On the Adversarial Robustness of Mixture of Experts

    Authors: Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli

    Abstract: Adversarial robustness is a key desirable property of neural networks. It has been empirically shown to be affected by their sizes, with larger networks being typically more robust. Recently, Bubeck and Sellke proved a lower bound on the Lipschitz constant of functions that fit the training data in terms of their number of parameters. This raises an interesting open question, do -- and can -- func…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS 2022

  11. arXiv:2210.06313  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

    Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

    Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP…

    Submitted 9 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: A short version was presented at ICLR 2023. Previous title: Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

  12. arXiv:2208.09015  [pdf, other]

    cs.CL cs.LG

    Treeformer: Dense Gradient Trees for Efficient Attention Computation

    Authors: Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

    Abstract: Standard inference and training with transformer based architectures scale quadratically with input sequence length. This is prohibitively large for a variety of applications especially in web-page translation, query-answering etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures such as sparsity, low-rank, a…

    Submitted 17 March, 2023; v1 submitted 18 August, 2022; originally announced August 2022.

    Comments: ICLR 2023

  13. arXiv:2202.00980  [pdf, other]

    cs.LG stat.ML

    Robust Training of Neural Networks Using Scale Invariant Architectures

    Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

    Abstract: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve…

    Submitted 18 July, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: 36 pages, 7 figures; ICML 2022

  14. arXiv:2110.06821  [pdf, other]

    cs.LG cs.CL cs.CV

    Leveraging redundancy in attention with Reuse Transformers

    Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar

    Abstract: Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads in multiple layers. We systematically analyze the empirical similari…

    Submitted 13 October, 2021; originally announced October 2021.

  15. arXiv:2106.10494  [pdf, other]

    cs.LG

    Teacher's pet: understanding and mitigating biases in distillation

    Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain s…

    Submitted 8 July, 2021; v1 submitted 19 June, 2021; originally announced June 2021.

    Comments: 21 pages, 8 figures

  16. arXiv:2106.08823  [pdf, other]

    cs.LG

    Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation

    Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar, Michal Lukasik, Andreas Veit

    Abstract: State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length. In this paper, we investigate the global structure of attention scores computed using this dot product mechanism on a typical distribution of inputs, and study the principal components of their variation. Through eigen analysis of full atten…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: 14 pages

  17. arXiv:2104.08698  [pdf, other]

    cs.CL cs.LG

    A Simple and Effective Positional Encoding for Transformers

    Authors: Pu-Chin Chen, Henry Tsai, Srinadh Bhojanapalli, Hyung Won Chung, Yin-Wen Chang, Chun-Sung Ferng

    Abstract: Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works proposed variations of positional encodings with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving positional information to attention layer from…

    Submitted 3 November, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: Accepted by EMNLP

  18. arXiv:2103.14586  [pdf, other]

    cs.CV cs.AI cs.LG

    Understanding Robustness of Transformers for Image Classification

    Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit

    Abstract: Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robus…

    Submitted 8 October, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Accepted for publication at ICCV 2021. Rewrote Section 5 and made other minor changes throughout

  19. arXiv:2102.03349  [pdf, other]

    cs.LG

    On the Reproducibility of Neural Network Predictions

    Authors: Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar

    Abstract: Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause churn -- for the same input, disagreements between predictions of the two models independently trained by the same algorithm, con…

    Submitted 5 February, 2021; originally announced February 2021.

    Comments: 19 pages, 7 figures

  20. arXiv:2012.00363  [pdf, other]

    cs.CL cs.LG

    Modifying Memories in Transformer Models

    Authors: Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

    Abstract: Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transfor…

    Submitted 1 December, 2020; originally announced December 2020.

  21. arXiv:2010.14322  [pdf, other]

    math.OC cs.AI cs.LG cs.NE

    An efficient nonconvex reformulation of stagewise convex optimization problems

    Authors: Rudy Bunel, Oliver Hinder, Srinadh Bhojanapalli, Krishnamurthy Dvijotham

    Abstract: Convex optimization problems with staged structure appear in several contexts, including optimal control, verification of deep neural networks, and isotonic regression. Off-the-shelf solvers can solve these problems but may scale poorly. We develop a nonconvex reformulation designed to exploit this staged structure. Our reformulation has only simple bound constraints, enabling solution via project…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: First and second authors made equal contribution. To appear in NeurIPS 2020

  22. arXiv:2010.12230  [pdf, other]

    cs.LG cs.CV math.OC

    Coping with Label Shift via Distributionally Robust Optimisation

    Authors: Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

    Abstract: The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, thei…

    Submitted 17 August, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

  23. arXiv:2010.07447  [pdf, ps, other]

    cs.CL cs.LG

    Semantic Label Smoothing for Sequence to Sequence Problems

    Authors: Michal Lukasik, Himanshu Jain, Aditya Krishna Menon, Seungyeon Kim, Srinadh Bhojanapalli, Felix Yu, Sanjiv Kumar

    Abstract: Label smoothing has been shown to be an effective regularization strategy in classification, that prevents overfitting and helps in label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, is challenging: the large target output space of such problems makes it intractable to apply label smoothing over all possible outputs. Most existing approache…

    Submitted 14 October, 2020; originally announced October 2020.

  24. arXiv:2006.04862  [pdf, other]

    cs.LG stat.ML

    $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

    Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…

    Submitted 19 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 31 pages, NeurIPS 2020 Camera-ready

  25. arXiv:2003.02819  [pdf, other]

    cs.LG stat.ML

    Does label smoothing mitigate label noise?

    Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem --…

    Submitted 5 March, 2020; originally announced March 2020.

  26. arXiv:2002.07028  [pdf, other]

    cs.LG stat.ML

    Low-Rank Bottleneck in Multi-head Attention Models

    Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 17 pages, 4 figures

  27. arXiv:1912.10077  [pdf, other]

    cs.LG stat.ML

    Are Transformers universal approximators of sequence-to-sequence functions?

    Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…

    Submitted 24 February, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

    Comments: 23 pages, ICLR 2020 camera-ready version

  28. arXiv:1904.00962  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

    Abstract: Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly fo…

    Submitted 3 January, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: Published as a conference paper at ICLR 2020

  29. arXiv:1805.12076  [pdf, other]

    cs.LG stat.ML

    Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

    Authors: Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro

    Abstract: Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound…

    Submitted 30 May, 2018; originally announced May 2018.

    Comments: 19 pages, 8 figures

  30. arXiv:1803.00186  [pdf, ps, other]

    stat.ML cs.LG math.OC

    Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form

    Authors: Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain, Praneeth Netrapalli

    Abstract: Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer-Monteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rank-constrained SDPs as long as the number o…

    Submitted 28 February, 2018; originally announced March 2018.

    Comments: 24 pages

  31. arXiv:1711.02524  [pdf, other]

    quant-ph cs.DS

    Provable quantum state tomography via non-convex methods

    Authors: Anastasios Kyrillidis, Amir Kalev, Dohyung Park, Srinadh Bhojanapalli, Constantine Caramanis, Sujay Sanghavi

    Abstract: With nowadays steadily growing quantum processors, it is required to develop new quantum tomography tools that are tailored for high-dimensional systems. In this work, we describe such a computational tool, based on recent ideas from non-convex optimization. The algorithm excels in the compressed-sensing-like setting, where only a few data points are measured from a low-rank or highly-pure quantum…

    Submitted 18 November, 2017; v1 submitted 4 November, 2017; originally announced November 2017.

    Comments: 21 pages, 26 figures, code included

  32. arXiv:1707.09564  [pdf, ps, other]

    cs.LG

    A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro

    Abstract: We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.

    Submitted 23 February, 2018; v1 submitted 29 July, 2017; originally announced July 2017.

    Comments: Accepted to ICLR 2018

  33. arXiv:1706.08947  [pdf, other]

    cs.LG

    Exploring Generalization in Deep Learning

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro

    Abstract: With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures expl…

    Submitted 6 July, 2017; v1 submitted 27 June, 2017; originally announced June 2017.

    Comments: 19 pages, 8 figures

  34. arXiv:1705.09280  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization in Matrix Factorization

    Authors: Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

    Submitted 25 May, 2017; originally announced May 2017.

  35. arXiv:1705.07831  [pdf, other]

    cs.LG cs.CV

    Stabilizing GAN Training with Multiple Random Projections

    Authors: Behnam Neyshabur, Srinadh Bhojanapalli, Ayan Chakrabarti

    Abstract: Training generative adversarial networks is unstable in high-dimensions as the true data distribution tends to be concentrated in a small fraction of the ambient space. The discriminator is then quickly able to classify nearly all generated samples as fake, leaving the generator without meaningful gradients and causing it to deteriorate after a point in training. In this work, we propose training…

    Submitted 22 June, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

  36. arXiv:1610.06656  [pdf, ps, other]

    stat.ML cs.DS cs.IT cs.LG

    Single Pass PCA of Matrix Products

    Authors: Shanshan Wu, Srinadh Bhojanapalli, Sujay Sanghavi, Alexandros G. Dimakis

    Abstract: In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass of the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm in contrast retains additional summary information about $A,B$ (e.g. row and…

    Submitted 26 October, 2016; v1 submitted 20 October, 2016; originally announced October 2016.

    Comments: 24 pages, 4 figures, NIPS 2016

  37. arXiv:1606.01316  [pdf, other]

    stat.ML cs.DS cs.IT math.NA math.OC

    Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems

    Authors: Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, Sujay Sanghavi

    Abstract: We study the projected gradient descent method on low-rank matrix problems with a strongly convex objective. We use the Burer-Monteiro factorization approach to implicitly enforce low-rankness; such factorization introduces non-convexity in the objective. We focus on constraint sets that include both positive semi-definite (PSD) constraints and specific matrix norm-constraints. Such criteria appea…

    Submitted 1 October, 2016; v1 submitted 3 June, 2016; originally announced June 2016.

    Comments: 28 pages

  38. arXiv:1605.07221  [pdf, other]

    stat.ML cs.LG math.OC

    Global Optimality of Local Search for Low Rank Matrix Recovery

    Authors: Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random…

    Submitted 26 May, 2016; v1 submitted 23 May, 2016; originally announced May 2016.

    Comments: 21 pages, 3 figures

  39. arXiv:1509.03917  [pdf, other]

    stat.ML cs.DS cs.IT cs.LG math.NA math.OC

    Dropping Convexity for Faster Semi-definite Optimization

    Authors: Srinadh Bhojanapalli, Anastasios Kyrillidis, Sujay Sanghavi

    Abstract: We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$, which we refer to as Factored Gradient Descent (FGD), under standard assumptions on the original function $f$. W…

    Submitted 15 April, 2016; v1 submitted 13 September, 2015; originally announced September 2015.

    Comments: 40 pages

  40. arXiv:1502.05023  [pdf, ps, other]

    stat.ML cs.DS cs.IT cs.LG

    A New Sampling Technique for Tensors

    Authors: Srinadh Bhojanapalli, Sujay Sanghavi

    Abstract: In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only $O(n^{1.5}/ε^2)$ of the possible $n^3$ elements while still achieving each of the three goals: (a) tensor sparsification: for…

    Submitted 19 February, 2015; v1 submitted 17 February, 2015; originally announced February 2015.

    Comments: 29 pages, 3 figures

  41. arXiv:1410.3886  [pdf, ps, other]

    cs.DS cs.LG stat.ML

    Tighter Low-rank Approximation via Sampling the Leveraged Element

    Authors: Srinadh Bhojanapalli, Prateek Jain, Sujay Sanghavi

    Abstract: In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank…

    Submitted 14 October, 2014; originally announced October 2014.

    Comments: 36 pages, 3 figures, Extended abstract to appear in the proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA15)

  42. arXiv:1402.2324  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    Universal Matrix Completion

    Authors: Srinadh Bhojanapalli, Prateek Jain

    Abstract: The problem of low-rank matrix completion has recently generated a lot of interest leading to several results that offer exact solutions to the problem. However, in order to do so, these methods make assumptions that can be quite restrictive in practice. More specifically, the methods assume that: a) the observed indices are sampled uniformly at random, and b) for every new matrix, the observed in…

    Submitted 11 July, 2014; v1 submitted 10 February, 2014; originally announced February 2014.

    Comments: 22 pages, 2 figures

  43. arXiv:1306.2979  [pdf, other]

    stat.ML cs.IT cs.LG

    Completing Any Low-rank Matrix, Provably

    Authors: Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

    Abstract: Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint, known as incoherence, on its row and column spaces. In these cases, the subset of elements is sampled uniformly at random. In this paper, we show that any rank-$r$…

    Submitted 21 July, 2014; v1 submitted 12 June, 2013; originally announced June 2013.

    Comments: Added a new necessary condition (Theorem 6) and a result on completion of row coherent matrices (Corollary 4). Partial results appeared in the International Conference on Machine Learning 2014, under the title 'Coherent Matrix Completion'. (34 pages, 4 figures)