Showing 1–44 of 44 results for author: Yun, C

Searching in archive cs.
  1. arXiv:2506.07438  [pdf, ps, other]

    cs.CL

    LGAI-EMBEDDING-Preview Technical Report

    Authors: Jooyoung Choi, Hyun Kim, Hansol Jang, Changwook Jun, Kyunghoon Bae, Hyewon Choi, Stanley Jungkyu Choi, Honglak Lee, Chulmin Yun

    Abstract: This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structur…

    Submitted 22 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: 10 pages

  2. arXiv:2506.06940  [pdf, ps, other]

    cs.LG

    Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More

    Authors: Geonhui Yoo, Minhak Song, Chulhee Yun

    Abstract: When training deep neural networks with gradient descent, sharpness often increases -- a phenomenon known as progressive sharpening -- before saturating at the edge of stability. Although commonly observed in practice, the underlying mechanisms behind progressive sharpening remain poorly understood. In this work, we study this phenomenon using a minimalist model: a deep linear network with a singl…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  3. arXiv:2506.04126  [pdf, ps, other]

    cs.LG math.OC

    Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems

    Authors: Yujun Kim, Jaeyoung Cha, Chulhee Yun

    Abstract: Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $κ$. In contrast, little is known when $K$ is smaller than $κ$, and it is still a challenging open question whether p…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted to ICML 2025, 56 pages, 6 figures

  4. arXiv:2504.12712  [pdf, other]

    cs.LG math.OC

    Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification

    Authors: Hyunji Jung, Hanseul Cho, Chulhee Yun

    Abstract: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training o…

    Submitted 26 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: 67 pages, 11 figures, accepted to ICLR 2025, Camera-ready version

  5. arXiv:2504.06033  [pdf, ps, other]

    cs.DS

    Parallel Small Vertex Connectivity in Near-Linear Work and Polylogarithmic Depth

    Authors: Yonggang Jiang, Changki Yun

    Abstract: We present a randomized parallel algorithm in the PRAM model for $k$-vertex connectivity. Given an undirected simple graph, our algorithm either finds a set of fewer than $k$ vertices whose removal disconnects the graph or reports that no such set exists. The algorithm runs in $O(m \cdot \text{poly}(k, \log n))$ work and $O(\text{poly}(k, \log n))$ depth, which is nearly optimal for any…

    Submitted 8 April, 2025; originally announced April 2025.

  6. arXiv:2503.00699  [pdf, other]

    cs.LG cs.AI stat.ML

    Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo

    Authors: Hyunsu Kim, Giung Nam, Chulhee Yun, Hongseok Yang, Juho Lee

    Abstract: Bayesian Neural Networks (BNNs) provide a promising framework for modeling predictive uncertainty and enhancing out-of-distribution (OOD) robustness by estimating the posterior distribution of network parameters. Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) is one of the most powerful methods for scalable posterior sampling in BNNs, achieving efficiency by combining stochastic gradient de…

    Submitted 1 March, 2025; originally announced March 2025.

    Journal ref: ICLR 2025

  7. arXiv:2502.06905  [pdf, ps, other]

    cs.LG cs.AI

    Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty

    Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun

    Abstract: Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model with a full dataset over a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensiv…

    Submitted 11 June, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

  8. arXiv:2501.00511  [pdf, other]

    cs.LG math.OC

    Stochastic Extragradient with Flip-Flop Shuffling & Anchoring: Provable Improvements

    Authors: Jiseok Chae, Chulhee Yun, Donghwan Kim

    Abstract: In minimax optimization, the extragradient (EG) method has been extensively studied because it outperforms the gradient descent-ascent method in convex-concave (C-C) problems. Yet, stochastic EG (SEG) has seen limited success in C-C problems, especially for unconstrained cases. Motivated by the recent progress of shuffling-based stochastic methods, we investigate the convergence of shuffling-based…

    Submitted 31 December, 2024; originally announced January 2025.

    Comments: 73+7 pages, 4 figures. Published in NeurIPS 2024

  9. arXiv:2410.23672  [pdf, other]

    cs.LG cs.AI stat.ML

    Provable Benefit of Cutout and CutMix for Feature Learning

    Authors: Junsoo Oh, Chulhee Yun

    Abstract: Patch-level data augmentation techniques such as Cutout and CutMix have demonstrated significant efficacy in enhancing the performance of vision tasks. However, a comprehensive theoretical understanding of these methods remains elusive. In this paper, we study two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 camera-ready version, 81 pages

  10. arXiv:2410.23495  [pdf, other]

    cs.LG cs.AI

    DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity

    Authors: Baekrok Shin, Junsoo Oh, Hanseul Cho, Chulhee Yun

    Abstract: Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary…

    Submitted 1 November, 2024; v1 submitted 30 October, 2024; originally announced October 2024.

    Comments: Published at NeurIPS 2024

  11. arXiv:2410.15787  [pdf, other]

    cs.LG cs.AI

    Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

    Authors: Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

    Abstract: Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (…

    Submitted 17 April, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: 44 pages, 20 figures, 26 tables, accepted to ICLR 2025

  12. arXiv:2405.20671  [pdf, other]

    cs.LG cs.AI cs.CL

    Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

    Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

    Abstract: Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absol…

    Submitted 30 October, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024. 76 pages. 23 figures. 90 tables

  13. arXiv:2405.16002  [pdf, other]

    cs.LG math.OC stat.ML

    Does SGD really happen in tiny subspaces?

    Authors: Minhak Song, Kwangjun Ahn, Chulhee Yun

    Abstract: Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural n…

    Submitted 10 March, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Published at ICLR 2025

  14. Optimal Switching Networks for Paired-Egress Bell State Analyzer Pools

    Authors: Marii Koyama, Claire Yun, Amin Taherkhani, Naphan Benchasattabuse, Bernard Ousmane Sane, Michal Hajdušek, Shota Nagayama, Rodney Van Meter

    Abstract: To scale quantum computers to useful levels, we must build networks of quantum computational nodes that can share entanglement for use in distributed forms of quantum algorithms. In one proposed architecture, node-to-node entanglement is created when nodes emit photons entangled with stationary memories, with the photons routed through a switched interconnect to a shared pool of Bell state analyze…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 11 pages, 8 figures, 1 table

    Journal ref: 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), pp. 1897-1907 (2024)

  15. arXiv:2402.10475  [pdf, other]

    math.OC cs.LG

    Fundamental Benefit of Alternating Updates in Minimax Optimization

    Authors: Jaewook Lee, Hanseul Cho, Chulhee Yun

    Abstract: The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA). While Alt-GDA is commonly observed to converge faster, the performance gap between the two is not yet well understood theoretically, especially in terms of global convergence rates. To address this theory-practice…

    Submitted 15 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024 (Spotlight). 76 pages, 2 figures. Additional experiments (quadratic game, GAN) and proofs
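
    The two update orders named in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's algorithm or analysis: the scalar setting, the function names, the step sizes `lr_x` and `lr_y`, and the toy objective f(x, y) = x * y are all assumptions made here. The only point shown is that Sim-GDA evaluates both gradients at the old iterate, while Alt-GDA lets the ascent step see the freshly updated x.

```python
def sim_gda_step(x, y, grad_x, grad_y, lr_x, lr_y):
    """Simultaneous GDA: both players use gradients at the same old point (x, y)."""
    gx = grad_x(x, y)
    gy = grad_y(x, y)
    return x - lr_x * gx, y + lr_y * gy


def alt_gda_step(x, y, grad_x, grad_y, lr_x, lr_y):
    """Alternating GDA: the descent step runs first, then ascent uses the new x."""
    x_new = x - lr_x * grad_x(x, y)
    y_new = y + lr_y * grad_y(x_new, y)
    return x_new, y_new


# Toy bilinear objective f(x, y) = x * y (hypothetical, for illustration only).
def toy_grad_x(x, y):
    return y


def toy_grad_y(x, y):
    return x


sim_point = sim_gda_step(1.0, 1.0, toy_grad_x, toy_grad_y, 0.1, 0.1)
alt_point = alt_gda_step(1.0, 1.0, toy_grad_x, toy_grad_y, 0.1, 0.1)
```

    Starting from the same point, the two variants agree on the x update and differ only in the y update, which is where the alternation enters.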

  16. arXiv:2401.15554  [pdf]

    cs.CV

    Pericoronary adipose tissue feature analysis in CT calcium score images with comparison to coronary CTA

    Authors: Yingnan Song, Hao Wu, Juhwan Lee, Justin Kim, Ammar Hoori, Tao Hu, Vladislav Zimin, Mohamed Makhlouf, Sadeer Al-Kindi, Sanjay Rajagopalan, Chun-Ho Yun, Chung-Lieh Hung, David L. Wilson

    Abstract: We investigated the feasibility and advantages of using non-contrast CT calcium score (CTCS) images to assess pericoronary adipose tissue (PCAT) and its association with major adverse cardiovascular events (MACE). PCAT features from coronary CTA (CCTA) have been shown to be associated with cardiovascular risk but are potentially confounded by iodine. If PCAT in CTCS images can be similarly analyze…

    Submitted 27 January, 2024; originally announced January 2024.

    Comments: 24 pages,10 figures

  17. arXiv:2311.15051  [pdf, other]

    cs.LG math.OC stat.ML

    Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

    Authors: Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

    Abstract: Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much fla…

    Submitted 29 May, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

    Comments: v3: major updates; 25 pages, 17 figures; the first two authors contributed equally. The preliminary version was accepted to the NeurIPS 2023 M3L Workshop (oral) under the title "Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study."

  18. arXiv:2310.18593  [pdf, other]

    stat.ML cs.CY cs.LG

    Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint

    Authors: Junghyun Lee, Hanseul Cho, Se-Young Yun, Chulhee Yun

    Abstract: Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another. However, existing approaches to fair PCA have two main problems: theoretically, there has been no statistical foundation of fair PCA in terms of learnability; practica…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: 42 pages, 5 figures, 4 tables. Accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  19. arXiv:2310.01082  [pdf, other]

    cs.LG cs.AI math.OC

    Linear attention is (maybe) all you need (to understand transformer optimization)

    Authors: Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

    Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and…

    Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Published at ICLR 2024

  20. arXiv:2307.15777  [pdf, other]

    cs.PL

    Error Localization for Sequential Effect Systems (Extended Version)

    Authors: Colin S. Gordon, Chaewon Yun

    Abstract: We describe a new concrete approach to giving predictable error locations for sequential (flow-sensitive) effect systems. Prior implementations of sequential effect systems rely on either computing a bottom-up effect and comparing it to a declaration (e.g., method annotation) or leaning on constraint-based type inference. These approaches do not necessarily report program locations that precisely…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Extended report of upcoming Static Analysis Symposium 2023 paper

  21. arXiv:2307.04204  [pdf, other]

    cs.LG math.OC stat.ML

    Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory

    Authors: Minhak Song, Chulhee Yun

    Abstract: Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this…

    Submitted 26 October, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023 camera-ready; 51 pages

  22. arXiv:2306.15593  [pdf]

    cs.CV

    Cardiac CT perfusion imaging of pericoronary adipose tissue (PCAT) highlights potential confounds in coronary CTA

    Authors: Hao Wu, Yingnan Song, Ammar Hoori, Ananya Subramaniam, Juhwan Lee, Justin Kim, Tao Hu, Sadeer Al-Kindi, Wei-Ming Huang, Chun-Ho Yun, Chung-Lieh Hung, Sanjay Rajagopalan, David L. Wilson

    Abstract: Features of pericoronary adipose tissue (PCAT) assessed from coronary computed tomography angiography (CCTA) are associated with inflammation and cardiovascular risk. As PCAT is vascularly connected with coronary vasculature, the presence of iodine is a potential confounding factor on PCAT HU and textures that has not been adequately investigated. Use dynamic cardiac CT perfusion (CCTP) to inform…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: 13 pages, 8 figures

  23. arXiv:2306.10711  [pdf, other]

    cs.LG

    PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning

    Authors: Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, Chulhee Yun

    Abstract: In Reinforcement Learning (RL), enhancing sample efficiency is crucial, particularly in scenarios where data acquisition is costly and risky. In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction. However, these multiple updates often lead the model to overfit to earlier interactions, which is referred to as the loss of plastic…

    Submitted 8 December, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

    Comments: 26 pages, 6 figures, accepted to NeurIPS 2023

  24. arXiv:2306.09850  [pdf, other]

    cs.LG math.OC stat.ML

    Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima

    Authors: Dongkuk Si, Chulhee Yun

    Abstract: Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + ρ\frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming decaying perturbation size $ρ$ and/or no gradient normalization in $y_t$, which is detached from pr…

    Submitted 27 October, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: 39 pages. v3 NeurIPS 2023 camera ready version
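
    The update described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the step size `lr`, the function names, the zero-gradient guard, and the toy objective f(x) = 0.5 * ||x||^2 are introduced here for the example. The perturbed point follows the abstract's formula for $y_t$, with a constant $ρ$ and a normalized gradient.

```python
import math


def sam_step(x, grad_f, lr, rho):
    """One SAM-style update: descend along the gradient taken at the perturbed point y_t."""
    g = grad_f(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # The normalized perturbation is undefined at a stationary point
        # (this guard is an assumption of the sketch); leave x unchanged.
        return list(x)
    # y_t = x_t + rho * grad f(x_t) / ||grad f(x_t)||
    y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    g_at_y = grad_f(y)
    # x_{t+1} = x_t - lr * grad f(y_t)
    return [xi - lr * gyi for xi, gyi in zip(x, g_at_y)]


def toy_grad(x):
    # Gradient of the hypothetical toy objective f(x) = 0.5 * ||x||^2.
    return list(x)


x_next = sam_step([3.0, 4.0], toy_grad, lr=0.1, rho=0.5)
```

    Here the gradient at [3, 4] has norm 5, so the perturbed point is [3.3, 4.4] and the descent step is taken with the gradient evaluated there rather than at the current iterate.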

  25. arXiv:2306.00267  [pdf, other]

    cs.LG math.OC stat.ML

    Provable Benefit of Mixup for Finding Optimal Decision Boundaries

    Authors: Junsoo Oh, Chulhee Yun

    Abstract: We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem. For a family of data distributions with a separability constant $κ$, we analyze how well the optimal classifier in terms of training loss aligns with the optimal one in test accuracy (i.e., Bayes optimal classifier). For…

    Submitted 5 June, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: ICML 2023 camera-ready version; 48 pages

  26. arXiv:2303.07160  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond

    Authors: Jaeyoung Cha, Jaewook Lee, Chulhee Yun

    Abstract: We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems. Unlike most existing results focusing on final iterate lower bounds in terms of the number of components $n$ and the number of epochs $K$, we seek bounds for arbitrary weighted average iterates that are tight in all factors including the…

    Submitted 9 June, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: 58 pages

  27. arXiv:2302.12444  [pdf, other]

    cs.LG math.OC

    On the Training Instability of Shuffling SGD with Batch Normalization

    Authors: David X. Wu, Chulhee Yun, Suvrit Sra

    Abstract: We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for r…

    Submitted 14 August, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: ICML 2023 camera-ready version, added references; 75 pages
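
    The difference between the two sampling schemes named in the abstract above comes down to how often the data order is redrawn. The sketch below only generates per-epoch visiting orders and assumes the usual reading of the terms: SS draws one permutation and reuses it every epoch, while RR draws a fresh permutation each epoch. The function names and the `seed` parameter are hypothetical, and no training or batch normalization is modeled.

```python
import random


def single_shuffle_orders(n, epochs, seed=0):
    """Single Shuffle (SS): shuffle once, then reuse the same order every epoch."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return [list(perm) for _ in range(epochs)]


def random_reshuffle_orders(n, epochs, seed=0):
    """Random Reshuffle (RR): draw an independent permutation for each epoch."""
    rng = random.Random(seed)
    orders = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        orders.append(perm)
    return orders


ss_orders = single_shuffle_orders(n=5, epochs=3)
rr_orders = random_reshuffle_orders(n=5, epochs=3)
```

    In both schemes every example is visited exactly once per epoch, which is what distinguishes them from with-replacement sampling.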

  28. arXiv:2110.10342  [pdf, other]

    cs.LG math.OC stat.ML

    Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

    Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra

    Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients…

    Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: ICLR 2022 camera-ready (selected for an oral presentation); 76 pages, 3 figures

  29. arXiv:2103.07079  [pdf, other]

    cs.LG math.OC

    Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restri…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 26 pages, 2 figures

  30. arXiv:2010.13363  [pdf, other]

    cs.LG

    Provable Memorization via Deep Neural Networks using Sub-linear Parameters

    Authors: Sejun Park, Jaeho Lee, Chulhee Yun, Jinwoo Shin

    Abstract: It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the re…

    Submitted 2 November, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

  31. DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval

    Authors: Yun Cao, Yuebin Wang, Junhuan Peng, Liqiang Zhang, Linlin Xu, Kai Yan, Lihua Li

    Abstract: Training with a small number of labeled samples can save considerable manpower and material resources, especially when the amount of high spatial resolution remote sensing images (HSR-RSIs) increases considerably. However, many deep models face the problem of overfitting when using a small number of labeled samples. This might degrade HSR-RSI retrieval accuracy. Aiming at obtaining more acc…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: 17 pages

  32. SLCRF: Subspace Learning with Conditional Random Field for Hyperspectral Image Classification

    Authors: Yun Cao, Jie Mei, Yuebin Wang, Liqiang Zhang, Junhuan Peng, Bing Zhang, Lihua Li, Yibo Zheng

    Abstract: Subspace learning (SL) plays an important role in hyperspectral image (HSI) classification, since it can provide an effective solution to reduce the redundant information in the image pixels of HSIs. Previous work on SL aims to improve the accuracy of HSI recognition. Using a large number of labeled samples, related methods can train the parameters of the proposed solutions to obtain better rep…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: 13 pages, 6 figures

  33. arXiv:2010.02501  [pdf, other]

    cs.LG math.OC stat.ML

    A Unifying View on Implicit Bias in Training Linear Neural Networks

    Authors: Chulhee Yun, Shankar Krishnan, Hossein Mobahi

    Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize th…

    Submitted 10 September, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: 38 pages, 7 figures. Revision after ICLR 2021 camera-ready version. Figure 2 newly added, theorem statements revised, including correction of Theorem 2

  34. arXiv:2006.08859  [pdf, other]

    cs.LG stat.ML

    Minimum Width for Universal Approximation

    Authors: Sejun Park, Chulhee Yun, Jaeho Lee, Jinwoo Shin

    Abstract: The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension $d_x$ and the output dimension $d_y$. In this work, we provide the first definitive result in this directi…

    Submitted 15 June, 2020; originally announced June 2020.

  35. arXiv:2006.04862  [pdf, other]

    cs.LG stat.ML

    $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

    Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…

    Submitted 19 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 31 pages, NeurIPS 2020 Camera-ready

  36. arXiv:2002.09635  [pdf, other]

    eess.IV cs.CV cs.LG

    Towards Label-Free 3D Segmentation of Optical Coherence Tomography Images of the Optic Nerve Head Using Deep Learning

    Authors: Sripad Krishna Devalla, Tan Hung Pham, Satish Kumar Panda, Liang Zhang, Giridhar Subramanian, Anirudh Swaminathan, Chin Zhi Yun, Mohan Rajan, Sujatha Mohan, Ramaswami Krishnadas, Vijayalakshmi Senthil, John Mark S. de Leon, Tin A. Tun, Ching-Yu Cheng, Leopold Schmetterer, Shamira Perera, Tin Aung, Alexandre H. Thiery, Michael J. A. Girard

    Abstract: Since the introduction of optical coherence tomography (OCT), it has been possible to study the complex 3D morphological changes of the optic nerve head (ONH) tissues that occur along with the progression of glaucoma. Although several deep learning (DL) techniques have been recently proposed for the automated extraction (segmentation) and quantification of these morphological changes, the device s…

    Submitted 22 February, 2020; originally announced February 2020.

  37. arXiv:2002.07028  [pdf, other]

    cs.LG stat.ML

    Low-Rank Bottleneck in Multi-head Attention Models

    Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: The attention-based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 17 pages, 4 figures

  38. arXiv:1912.10077  [pdf, other]

    cs.LG stat.ML

    Are Transformers universal approximators of sequence-to-sequence functions?

    Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…

    Submitted 24 February, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

    Comments: 23 pages, ICLR 2020 camera-ready version

  39. arXiv:1907.03922  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Are deep ResNets provably better than linear predictors?

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residu…

    Submitted 29 October, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

    Comments: 15 pages. NeurIPS 2019 Camera-ready version

  40. arXiv:1810.07770  [pdf, ps, other]

    cs.LG stat.ML

    Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $Ω(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $Θ(\sqrt{N})$ is necessary and sufficient for mem…

    Submitted 29 October, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

    Comments: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version

  41. arXiv:1809.10858  [pdf, ps, other]

    math.OC cs.LG stat.ML

    Efficiently testing local optimality and escaping saddles for ReLU networks

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into a…

    Submitted 28 May, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

    Comments: 23 pages, appeared at ICLR 2019

  42. arXiv:1802.03487  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Small nonlinearities in activation functions create bad local minima in neural networks

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like…

    Submitted 28 May, 2019; v1 submitted 9 February, 2018; originally announced February 2018.

    Comments: 33 pages, appeared at ICLR 2019

  43. arXiv:1707.02444  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Global optimality conditions for deep neural networks

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We study the error landscape of deep linear and nonlinear neural networks with the squared error loss. Minimizing the loss of a deep linear neural network is a nonconvex problem, and despite recent progress, our understanding of this loss surface is still incomplete. For deep linear networks, we present necessary and sufficient conditions for a critical point of the risk function to be a global mi…

    Submitted 24 March, 2018; v1 submitted 8 July, 2017; originally announced July 2017.

    Comments: 14 pages. A camera-ready version that will appear at ICLR 2018

  44. arXiv:1304.2014  [pdf]

    math.DS cs.CV math.GT

    Image Compression predicated on Recurrent Iterated Function Systems

    Authors: Chol-Hui Yun, W. Metzler, M. Barski

    Abstract: Recurrent iterated function systems (RIFSs) are improvements of iterated function systems (IFSs) using elements of the theory of Markovian stochastic processes which can produce more natural-looking images. We construct new RIFSs consisting substantially of a vertical contraction factor function and nonlinear transformations. These RIFSs are applied to image compression.

    Submitted 7 April, 2013; originally announced April 2013.

    Comments: 11 pages, presented at 2nd International Conference on Mathematics & Statistics, 16-19 June, 2008, Athens, Greece

    Report number: KISU-MATH-2008-E-C-001