Search | arXiv e-print repository

Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems

Authors: Yujun Kim, Jaeyoung Cha, Chulhee Yun

Abstract: Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $κ$. In contrast, little is known when $K$ is smaller than $κ$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted to ICML 2025, 56 pages, 6 figures

arXiv:2505.23152

Provable Benefit of Random Permutations over Uniform Sampling in Stochastic Coordinate Descent

Authors: Donghwa Kim, Jaewook Lee, Chulhee Yun

Abstract: We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms' performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. To this end, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD for every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture extends our results to the general class of positive-definite quadratic functions.

Submitted 29 May, 2025; originally announced May 2025.

Comments: Accepted to ICML 2025. 68 pages, 15 figures
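
The two index-selection rules named in the abstract above (RCD and RPCD) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the exact coordinate-minimization update, the 5-dimensional test matrix, and the starting point. The sketch only shows how the two sampling schemes differ; it does not reproduce or verify the paper's contraction-rate bounds.

```python
import random


def coordinate_descent(A, x0, epochs, use_permutation, seed=0):
    """Exact coordinate minimization of f(x) = 0.5 * x^T A x.

    use_permutation=False: RCD, each index drawn uniformly with replacement.
    use_permutation=True:  RPCD, a fresh random permutation every epoch.
    """
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    for _ in range(epochs):
        if use_permutation:
            order = list(range(n))
            rng.shuffle(order)
        else:
            order = [rng.randrange(n) for _ in range(n)]
        for i in order:
            grad_i = sum(A[i][j] * x[j] for j in range(n))
            x[i] -= grad_i / A[i][i]  # exact minimization along coordinate i
    return x


def objective(A, x):
    n = len(x)
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical test problem: a positive-definite Hessian with equal diagonal
# entries and equal off-diagonal entries (so it is permutation-invariant).
n = 5
A = [[1.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
x0 = [1.0, -2.0, 3.0, -4.0, 5.0]

f0 = objective(A, x0)
f_rcd = objective(A, coordinate_descent(A, x0, epochs=10, use_permutation=False))
f_rpcd = objective(A, coordinate_descent(A, x0, epochs=10, use_permutation=True))
```

Comparing `f_rcd` and `f_rpcd` on a single seed says nothing about worst-case rates; a fair empirical comparison would average over many seeds and problem instances.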

arXiv:2504.12712

Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification

Authors: Hyunji Jung, Hanseul Cho, Chulhee Yun

Abstract: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.

Submitted 26 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: 67 pages, 11 figures, accepted to ICLR 2025, Camera-ready version

arXiv:2501.00511

Stochastic Extragradient with Flip-Flop Shuffling & Anchoring: Provable Improvements

Authors: Jiseok Chae, Chulhee Yun, Donghwan Kim

Abstract: In minimax optimization, the extragradient (EG) method has been extensively studied because it outperforms the gradient descent-ascent method in convex-concave (C-C) problems. Yet, stochastic EG (SEG) has seen limited success in C-C problems, especially for unconstrained cases. Motivated by the recent progress of shuffling-based stochastic methods, we investigate the convergence of shuffling-based SEG in unconstrained finite-sum minimax problems, in search of convergent shuffling-based SEG. Our analysis reveals that both random reshuffling and the recently proposed flip-flop shuffling alone can suffer divergence in C-C problems. However, with an additional simple trick called anchoring, we develop the SEG with flip-flop anchoring (SEG-FFA) method which successfully converges in C-C problems. We also show upper and lower bounds in the strongly-convex-strongly-concave setting, demonstrating that SEG-FFA has a provably faster convergence rate compared to other shuffling-based methods.

Submitted 31 December, 2024; originally announced January 2025.

Comments: 73+7 pages, 4 figures. Published in NeurIPS 2024

arXiv:2405.16002

Does SGD really happen in tiny subspaces?

Authors: Minhak Song, Kwangjun Ahn, Chulhee Yun

Abstract: Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. We observe similar behavior across practical setups, including the large learning rate regime (also known as Edge of Stability), Sharpness-Aware Minimization, momentum, and adaptive optimizers. We discuss the main causes and implications of this spurious alignment, shedding light on the dynamics of neural network training.

Submitted 10 March, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: Published at ICLR 2025

arXiv:2403.06624

On the topology of the moduli of tropical unramified p-covers

Authors: Yassine El Maazouz, Paul Alexander Helminck, Felix Röhrle, Pedro Souza, Claudia He Yun

Abstract: We study the topology of the moduli space of unramified $\mathbb{Z}/p$-covers of tropical curves of genus $g \geq 2$, where $p$ is a prime number. We use recent techniques by Chan–Galatius–Payne to identify contractible subcomplexes of the moduli space. We then use this contractibility result to show that this moduli space is simply connected. In the case of genus 2, we determine the homotopy type of this moduli space for all primes $p$. This work is motivated by prospective applications to the top-weight cohomology of the space of prime cyclic étale covers of smooth algebraic curves.

Submitted 3 October, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 39 pages, 11 figures, 5 tables

MSC Class: 14T20; 05E14; 14H10

arXiv:2402.10475

Fundamental Benefit of Alternating Updates in Minimax Optimization

Authors: Jaewook Lee, Hanseul Cho, Chulhee Yun

Abstract: The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA). While Alt-GDA is commonly observed to converge faster, the performance gap between the two is not yet well understood theoretically, especially in terms of global convergence rates. To address this theory-practice gap, we present fine-grained convergence analyses of both algorithms for strongly-convex-strongly-concave and Lipschitz-gradient objectives. Our new iteration complexity upper bound of Alt-GDA is strictly smaller than the lower bound of Sim-GDA; i.e., Alt-GDA is provably faster. Moreover, we propose Alternating-Extrapolation GDA (Alex-GDA), a general algorithmic framework that subsumes Sim-GDA and Alt-GDA, for which the main idea is to alternately take gradients from extrapolations of the iterates. We show that Alex-GDA satisfies a smaller iteration complexity bound, identical to that of the Extra-gradient method, while requiring fewer gradient computations. We also prove that Alex-GDA enjoys linear convergence for bilinear problems, for which both Sim-GDA and Alt-GDA fail to converge at all.

Submitted 15 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted to ICML 2024 (Spotlight). 76 pages, 2 figures. Additional experiments (quadratic game, GAN) and proofs
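
The difference between simultaneous and alternating updates described in the abstract above can be sketched on the simplest bilinear problem. The objective $f(x, y) = xy$ (minimize over $x$, maximize over $y$), the step size, the starting point, and all function names are assumptions chosen for illustration and do not come from the paper. Alex-GDA is deliberately not implemented, since its exact update is not given in the abstract.

```python
def sim_gda_step(x, y, eta):
    # Sim-GDA: both players use gradients evaluated at the current (x, y).
    return x - eta * y, y + eta * x


def alt_gda_step(x, y, eta):
    # Alt-GDA: descent on x first, then ascent on y using the updated x.
    x_new = x - eta * y
    return x_new, y + eta * x_new


def run(step, x, y, eta, num_steps):
    for _ in range(num_steps):
        x, y = step(x, y, eta)
    return x, y


eta = 0.1
xs, ys = run(sim_gda_step, 1.0, 1.0, eta, 100)
xa, ya = run(alt_gda_step, 1.0, 1.0, eta, 100)
sim_norm2 = xs * xs + ys * ys  # squared distance of Sim-GDA iterate from (0, 0)
alt_norm2 = xa * xa + ya * ya  # squared distance of Alt-GDA iterate from (0, 0)
```

On this toy instance each Sim-GDA step multiplies $x^2 + y^2$ by exactly $1 + \eta^2$, so the iterates spiral outward, while the Alt-GDA iterates stay on a bounded orbit around the saddle point $(0, 0)$ without approaching it. This matches the abstract's remark that neither method converges on bilinear problems.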

arXiv:2311.15051

Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Authors: Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Abstract: Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum "prolonging" the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.

Submitted 29 May, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: v3: major updates; 25 pages, 17 figures; the first two authors contributed equally. The preliminary version was accepted to the NeurIPS 2023 M3L Workshop (oral) under the title "Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study."

arXiv:2310.01082

Linear attention is (maybe) all you need (to understand transformer optimization)

Authors: Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.

Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: Published at ICLR 2024

arXiv:2307.09265

PGL orbits in tree varieties

Authors: Izzet Coskun, Demir Eken, Chris Yun

Abstract: In this paper, we introduce tree varieties as a natural generalization of products of partial flag varieties. We study orbits of the PGL action on tree varieties. We characterize tree varieties with finitely many PGL orbits, generalizing a celebrated theorem of Magyar, Weyman and Zelevinsky. We give criteria that guarantee that a tree variety has a dense PGL orbit and provide many examples of tree varieties that do not have dense PGL orbits. We show that a triple of two-step flag varieties $F(k_1, k_2; n)^3$ has a dense PGL orbit if and only if $k_1 + k_2 \not= n$.

Submitted 18 July, 2023; originally announced July 2023.

Comments: 25 pages

MSC Class: Primary: 14L30; 14M15; 14M17. Secondary: 14L35; 51N30

arXiv:2307.04204

Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory

Authors: Minhak Song, Chulhee Yun

Abstract: Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.

Submitted 26 October, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023 camera-ready; 51 pages

arXiv:2307.01960

A Serre spectral sequence for the moduli space of tropical curves

Authors: Christin Bibby, Melody Chan, Nir Gadish, Claudia He Yun

Abstract: We construct, for all $g\geq 2$ and $n\geq 0$, a spectral sequence of rational $S_n$-representations which computes the $S_n$-equivariant reduced rational cohomology of the tropical moduli spaces of curves $Δ_{g,n}$ in terms of compactly supported cohomology groups of configuration spaces of $n$ points on graphs of genus $g$. Using the canonical $S_n$-equivariant isomorphisms $\widetilde{H}^{i-1}(Δ_{g,n};\mathbb{Q}) \cong W_0 H^i_c(\mathcal{M}_{g,n};\mathbb{Q})$, we calculate the weight $0$, compactly supported rational cohomology of the moduli spaces $\mathcal{M}_{g,n}$ in the range $g=3$ and $n\leq 9$, with partial computations available for $n\leq 13$.

Submitted 15 April, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: 24 pages plus appendix

MSC Class: 14H10; 14Q05; 14T20; 55N30; 55R80; 55T10

arXiv:2306.13604

Positive del Pezzo Geometry

Authors: Nick Early, Alheydis Geiger, Marta Panizzut, Bernd Sturmfels, Claudia He Yun

Abstract: Real, complex, and tropical algebraic geometry join forces in a new branch of mathematical physics called positive geometry. We develop the positive geometry of del Pezzo surfaces and their moduli spaces, viewed as very affine varieties. Their connected components are derived from polyhedral spaces with Weyl group symmetries. We study their canonical forms and scattering amplitudes, and we solve the likelihood equations.

Submitted 6 January, 2025; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: 37 pages, 4 figures

arXiv:2306.09850

Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima

Authors: Dongkuk Si, Chulhee Yun

Abstract: Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + ρ\frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming decaying perturbation size $ρ$ and/or no gradient normalization in $y_t$, which is detached from practice. To address this gap, we study deterministic/stochastic versions of SAM with practical configurations (i.e., constant $ρ$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions. Perhaps surprisingly, in many scenarios, we find out that SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde Θ(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers an inevitable additive term $O(ρ^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(ρ^2)$ factors arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.

Submitted 27 October, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: 39 pages. v3 NeurIPS 2023 camera ready version
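
The SAM update stated in the abstract above, with constant $ρ$ and gradient normalization, can be written out as a short sketch. The perturbed point $y_t$ follows the formula in the abstract; the step-size symbol `eta`, the function names, the handling of a zero gradient, and the quadratic test function $f(x) = \frac{1}{2}\lVert x \rVert^2$ are assumptions introduced here for illustration.

```python
import math


def sam_step(x, grad_f, rho, eta):
    """One deterministic SAM step: x_next = x - eta * grad_f(y),
    where y = x + rho * grad_f(x) / ||grad_f(x)||."""
    g = grad_f(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Assumption: the normalized direction is undefined at a stationary
        # point, so this sketch skips the perturbation there.
        y = list(x)
    else:
        y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    g_y = grad_f(y)
    return [xi - eta * gi for xi, gi in zip(x, g_y)]


def grad_quadratic(x):
    # Gradient of the hypothetical test function f(x) = 0.5 * ||x||^2.
    return list(x)


x_next = sam_step([3.0, 4.0], grad_quadratic, rho=0.5, eta=0.1)
```

For this input the gradient at $(3, 4)$ has norm 5, so the perturbed point is $(3.3, 4.4)$ and the step lands at $(2.67, 3.56)$. The sketch shows a single step only and makes no claim about the convergence behaviour analysed in the paper.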

arXiv:2306.00267

Provable Benefit of Mixup for Finding Optimal Decision Boundaries

Authors: Junsoo Oh, Chulhee Yun

Abstract: We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem. For a family of data distributions with a separability constant $κ$, we analyze how well the optimal classifier in terms of training loss aligns with the optimal one in test accuracy (i.e., Bayes optimal classifier). For vanilla training without augmentation, we uncover an interesting phenomenon named the curse of separability. As we increase $κ$ to make the data distribution more separable, the sample complexity of vanilla training increases exponentially in $κ$; perhaps surprisingly, the task of finding optimal decision boundaries becomes harder for more separable distributions. For Mixup training, we show that Mixup mitigates this problem by significantly reducing the sample complexity. To this end, we develop new concentration results applicable to $n^2$ pair-wise augmented data points constructed from $n$ independent data, by carefully dealing with dependencies between overlapping pairs. Lastly, we study other masking-based Mixup-style techniques and show that they can distort the training loss and make its minimizer converge to a suboptimal classifier in terms of test accuracy.

Submitted 5 June, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

Comments: ICML 2023 camera-ready version; 48 pages

arXiv:2304.08365

doi 10.1007/s13366-023-00709-8

Some thoughts and experiments on Bergman's compact amalgamation problem

Authors: Michael Joswig, Mario Kummer, Andreas Thom, Claudia He Yun

Abstract: We study the question whether copies of $S^1$ in $\mathrm{SU}(3)$ can be amalgamated in a compact group. This is the simplest instance of a fundamental open problem in the theory of compact groups raised by George Bergman in 1987. Considerable computational experiments suggest that the answer is positive in this case. We obtain a positive answer for a relaxed problem using theoretical considerations.

Submitted 13 July, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 15 pages, 2 figures, 3 tables; update contains minor changes that address referee comments

MSC Class: 22C05; 18B99; 90-05; 90C90

arXiv:2303.07160

Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond

Authors: Jaeyoung Cha, Jaewook Lee, Chulhee Yun

Abstract: We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems. Unlike most existing results focusing on final iterate lower bounds in terms of the number of components $n$ and the number of epochs $K$, we seek bounds for arbitrary weighted average iterates that are tight in all factors including the condition number $κ$. For SGD with Random Reshuffling, we present lower bounds that have tighter $κ$ dependencies than existing bounds. Our results are the first to perfectly close the gap between lower and upper bounds for weighted average iterates in both strongly-convex and convex cases. We also prove weighted average iterate lower bounds for arbitrary permutation-based SGD, which apply to all variants that carefully choose the best permutation. Our bounds improve the existing bounds in factors of $n$ and $κ$ and thereby match the upper bounds shown for a recently proposed algorithm called GraB.

Submitted 9 June, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: 58 pages

arXiv:2302.12444

On the Training Instability of Shuffling SGD with Batch Normalization

Authors: David X. Wu, Chulhee Yun, Suvrit Sra

Abstract: We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.

Submitted 14 August, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: ICML 2023 camera-ready version, added references; 75 pages

arXiv:2210.05995

SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization

Authors: Hanseul Cho, Chulhee Yun

Abstract: Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.

Submitted 20 February, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: ICLR 2023 camera-ready version; 46 pages

arXiv:2209.01070 [pdf, ps, other]

Discrete Morse theory for symmetric Delta-complexes

Authors: Claudia He Yun

Abstract: We generalize Forman's discrete Morse theory to the context of symmetric $Δ$-complexes. As an application, we prove that the coloop subcomplex of the link of the origin $LA^{\mathrm{trop},\mathrm{P}}_g$ in the moduli space of principally polarized tropical abelian varieties of dimension $g$ with respect to the perfect cone decomposition is contractible.

Submitted 2 September, 2022; originally announced September 2022.

Comments: 16 pages, 5 figures

MSC Class: 57Q70; 14T15

arXiv:2207.02800 [pdf, ps, other]

Equivariant Hodge polynomials of heavy/light moduli spaces

Authors: Siddarth Kannan, Stefano Serpente, Claudia He Yun

Abstract: Let $\bar{\mathcal{M}}_{g, m|n}$ denote Hassett's moduli space of weighted pointed stable curves of genus $g$ for the heavy/light weight data $\left(1^{(m)}, 1/n^{(n)}\right)$, and let $\mathcal{M}_{g, m|n} \subset \bar{\mathcal{M}}_{g, m|n}$ be the locus parameterizing smooth, not necessarily distinctly marked curves. We give a change-of-variables formula which computes the generating function for $(S_m\times S_n)$-equivariant Hodge-Deligne polynomials of these spaces in terms of the generating functions for $S_{n}$-equivariant Hodge-Deligne polynomials of $\bar{\mathcal{M}}_{g,n}$ and $\mathcal{M}_{g,n}$.

Submitted 22 April, 2024; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: 21 pages, 3 tables. Edits based on referee suggestions

MSC Class: 14H10

arXiv:2110.10342 [pdf, other]

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra

Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.

Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: ICLR 2022 camera-ready (selected for an oral presentation); 76 pages, 3 figures

arXiv:2110.03357 [pdf, other]

doi 10.1016/j.mbs.2020.108520

The role of viral infectivity in oncolytic virotherapy outcomes: A mathematical study

Authors: Pantea Pooladvand, Chae-Ok Yun, A-Rum Yoon, Peter S. Kim, Federico Frascoli

Abstract: A model capturing the dynamics between virus and tumour cells in the context of oncolytic virotherapy is presented and analysed. The ability of the virus to be internalised by uninfected cells is described by an infectivity parameter, which is inferred from available experimental data. The parameter is also able to describe the effects of changes in the tumour environment that affect viral uptake from tumour cells. Results show that when a virus is inoculated inside a growing tumour, strategies for enhancing infectivity do not lead to a complete eradication of the tumour. Within typical times of experiments and treatments, we observe the onset of oscillations, which always prevent a full destruction of the tumour mass. These findings are in good agreement with available laboratory results. Further analysis shows why a fully successful therapy cannot exist for the proposed model and that care must be taken when designing and engineering viral vectors with enhanced features. In particular, bifurcation analysis reveals that creating longer lasting virus particles or using strategies for reducing infected cell lifespan can cause unexpected and unwanted surges in the overall tumour load over time. Our findings suggest that virotherapy alone seems unlikely to be effective in clinical settings unless adjuvant strategies are included.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 29 pages, 13 figures, 1 table

MSC Class: 92-10

Journal ref: Mathematical Biosciences, 334: 108520 (2021)

arXiv:2109.03302 [pdf, ps, other]

Homology representations of compactified configurations on graphs applied to $\mathcal{M}_{2,n}$

Authors: Christin Bibby, Melody Chan, Nir Gadish, Claudia He Yun

Abstract: We obtain new calculations of the top weight rational cohomology of the moduli spaces $\mathcal{M}_{2,n}$, equivalently the rational homology of the tropical moduli spaces $Δ_{2,n}$, as a representation of $S_n$. These calculations are achieved fully for all $n\leq 10$, and partially -- for specific irreducible representations of $S_n$ -- for $n\le 22$. We also present conjectures, verified up to $n=22$, for the multiplicities of the irreducible representations $\mathrm{std}_n$ and $\mathrm{std}_n\otimes \mathrm{sgn}_n$. We achieve our calculations via a comparison with the homology of compactified configuration spaces of graphs. These homology groups are equipped with commuting actions of a symmetric group and the outer automorphism group of a free group. In this paper, we construct an efficient free resolution for these homology representations, from which we extract calculations on irreducible representations one at a time, simplifying the calculation of these homology representations.

Submitted 25 April, 2023; v1 submitted 7 September, 2021; originally announced September 2021.

Comments: 18 pages, minor edits

MSC Class: 05C10 (primary); 14H10; 14Q05; 14T20; 55R80; 55P65

arXiv:2103.07079 [pdf, other]

Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities that can be summarized as: "single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD." We present theorems that support our conjecture by proving several special cases.

Submitted 11 March, 2021; originally announced March 2021.

Comments: 26 pages, 2 figures

arXiv:2010.11767 [pdf, other]

Topology of tropical moduli spaces of weighted stable curves in higher genus

Authors: Siddarth Kannan, Shiyue Li, Stefano Serpente, Claudia He Yun

Abstract: Given integers $g \geq 0$, $n \geq 1$, and a vector $w \in (\mathbb{Q} \cap (0, 1])^n$ such that ${2g - 2 + \sum w_i > 0}$, we study the topology of the moduli space $Δ_{g, w}$ of $w$-stable tropical curves of genus $g$ with volume 1. The space $Δ_{g, w}$ is the dual complex of the divisor of singular curves in Hassett's moduli space of $w$-stable genus $g$ curves $\overline{\mathcal{M}}_{g, w}$. When $g \geq 1$, we show that $Δ_{g, w}$ is simply connected for all values of $w$. We also give a formula for the Euler characteristic of $Δ_{g, w}$ in terms of the combinatorics of $w$.

Submitted 15 March, 2022; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: 14 pages; 1 figure; final version accepted at Advances in Geometry

MSC Class: 14T05

arXiv:2010.03116 [pdf, other]

doi 10.1109/TGRS.2020.2991545

DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval

Authors: Yun Cao, Yuebin Wang, Junhuan Peng, Liqiang Zhang, Linlin Xu, Kai Yan, Lihua Li

Abstract: Training with a small number of labeled samples can save considerable manpower and material resources, especially when the amount of high spatial resolution remote sensing images (HSR-RSIs) increases considerably. However, many deep models face the problem of overfitting when using a small number of labeled samples. This might degrade HSR-RSI retrieval accuracy. Aiming at obtaining more accurate HSR-RSI retrieval performance with small training samples, we develop a deep metric learning approach with generative adversarial network regularization (DML-GANR) for HSR-RSI retrieval. The DML-GANR starts from a high-level feature extraction (HFE) to extract high-level features, which includes convolutional layers and fully connected (FC) layers. Each of the FC layers is constructed by deep metric learning (DML) to maximize the interclass variations and minimize the intraclass variations. The generative adversarial network (GAN) is adopted to mitigate the overfitting problem and validate the qualities of extracted high-level features. DML-GANR is optimized through a customized approach, and the optimal parameters are obtained. The experimental results on the three data sets demonstrate the superior performance of DML-GANR over state-of-the-art techniques in HSR-RSI retrieval.

Submitted 6 October, 2020; originally announced October 2020.

Comments: 17 pages

arXiv:2010.03115 [pdf]

doi 10.1109/TGRS.2020.3011429

SLCRF: Subspace Learning with Conditional Random Field for Hyperspectral Image Classification

Authors: Yun Cao, Jie Mei, Yuebin Wang, Liqiang Zhang, Junhuan Peng, Bing Zhang, Lihua Li, Yibo Zheng

Abstract: Subspace learning (SL) plays an important role in hyperspectral image (HSI) classification, since it can provide an effective solution to reduce the redundant information in the image pixels of HSIs. Previous works about SL aim to improve the accuracy of HSI recognition. Using a large number of labeled samples, related methods can train the parameters of the proposed solutions to obtain better representations of HSI pixels. However, the data instances may not be sufficient to learn a precise model for HSI classification in real applications. Moreover, it is well-known that it takes much time, labor and human expertise to label HSI images. To avoid the aforementioned problems, a novel SL method that includes the probability assumption called subspace learning with conditional random field (SLCRF) is developed. In SLCRF, first, the 3D convolutional autoencoder (3DCAE) is introduced to remove the redundant information in HSI pixels. In addition, the relationships are also constructed using the spectral-spatial information among the adjacent pixels. Then, the conditional random field (CRF) framework can be constructed and further embedded into the HSI SL procedure with the semi-supervised approach. Through the linearized alternating direction method termed LADMAP, the objective function of SLCRF is optimized using a defined iterative algorithm. The proposed method is comprehensively evaluated using the challenging public HSI datasets. We can achieve state-of-the-art performance using these HSI sets.

Submitted 6 October, 2020; originally announced October 2020.

Comments: 13 pages, 6 figures

arXiv:2010.02501 [pdf, other]

A Unifying View on Implicit Bias in Training Linear Neural Networks

Authors: Chulhee Yun, Shankar Krishnan, Hossein Mobahi

Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.

Submitted 10 September, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: 38 pages, 7 figures. Revision after ICLR 2021 camera-ready version. Figure 2 newly added, theorem statements revised, including correction of Theorem 2

arXiv:2008.04426 [pdf, ps, other]

The $S_n$-equivariant rational homology of the tropical moduli spaces $Δ_{2,n}$

Authors: Claudia He Yun

Abstract: We compute the $S_n$-equivariant rational homology of the tropical moduli spaces $Δ_{2,n}$ for $n\leq 8$ using a cellular chain complex for symmetric $Δ$-complexes in Sage.

Submitted 10 August, 2020; originally announced August 2020.

Comments: 17 pages, 2 figures, 6 tables

MSC Class: 14T10 (Primary); 14Q05 (Secondary)

arXiv:2006.14759 [pdf]

Existence and convergence theorems for monotone generalized alpa-nonexpansive mappings in uniformly convex partially ordered hyperbolic metric spaces and its application

Authors: Chang Il Rim, Jong Gyong Kim, Chol-Hui Yun

Abstract: In this paper, we generalize the existence result in [14] and prove convergence theorems of the iterative scheme in [12, 16] for monotone generalized alpa-nonexpansive mappings in uniformly convex partially ordered hyperbolic metric spaces. And we also give a numerical example to show that this scheme converges faster than the scheme in [14] and apply the result to the integral equation.

Submitted 25 June, 2020; originally announced June 2020.

arXiv:2006.06946 [pdf, other]

SGD with shuffling: optimal rates without component convexity and large epoch requirements

Authors: Kwangjun Ahn, Chulhee Yun, Suvrit Sra

Abstract: We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Secondly, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior arts: large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.

Submitted 21 June, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 53 pages; supersedes the preprint arXiv:2004.08657; v2 corrects an erroneous claim about SingleShuffle and newly adds Theorem 24 and Appendix F for SingleShuffle

arXiv:1911.12876 [pdf, ps, other]

doi 10.1007/s11538-018-0424-4

Mathematical modelling of the interaction between cancer cells and an oncolytic virus: insights into the effects of treatment protocols

Authors: Adrianne L. Jenner, Chae-Ok Yun, Peter S. Kim, Adelle C. F. Coster

Abstract: Oncolytic virotherapy is an experimental cancer treatment that uses genetically engineered viruses to target and kill cancer cells. One major limitation of this treatment is that virus particles are rapidly cleared by the immune system, preventing them from arriving at the tumour site. To improve virus survival and infectivity modified virus particles with the polymer polyethylene glycol (PEG) and the monoclonal antibody herceptin. While PEG modification appeared to improve plasma retention and initial infectivity it also increased the virus particle arrival time. We derive a mathematical model that describes the interaction between tumour cells and an oncolytic virus. We tune our model to represent the experimental data by Kim et al. (2011) and obtain optimised parameters. Our model provides a platform from which predictions may be made about the response of cancer growth to other treatment protocols beyond those in the experiments. Through model simulations we find that the treatment protocol affects the outcome dramatically. We quantify the effects of dosage strategy as a function of tumour cell replication and tumour carrying capacity on the outcome of oncolytic virotherapy as a treatment. The relative significance of the modification of the virus and the crucial role it plays in optimising treatment efficacy is explored.

Submitted 28 November, 2019; originally announced November 2019.

Comments: 15 pages, 6 figures

Journal ref: Bulletin of Mathematical Biology 80: 1615-1629 (2018)

arXiv:1907.03922 [pdf, ps, other]

Are deep ResNets provably better than linear predictors?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet does not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows under simple geometric conditions that any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

Submitted 29 October, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

Comments: 15 pages. NeurIPS 2019 Camera-ready version

arXiv:1906.01355 [pdf]

Estimation of errors on perturbation of function contractivity factors and box-counting dimension of hidden variable recurrent fractal interpolation function

Authors: Mi-Kyong Ri, Chol-Hui Yun

Abstract: In this paper, we study errors on perturbation of function contractivity factors and box-counting dimension of hidden variable recurrent fractal interpolation function (HVRFIF). The HVRFIF is a hidden variable fractal interpolation function (HVFIF) constructed by recurrent iterated function system (RIFS) with function contractivity factors. The contractivity factors of RIFS determine fractal characteristics and shape of its attractor, so that the HVRFIF with function contractivity factors has more flexibility and diversity than the HVFIF with constant contractivity factors. Stability of interpolation function according to perturbation of the contractivity factors and the box-counting dimension of interpolation function plays very important roles in determining whether these functions can be applied to practical problems or not. We first estimate errors on perturbation of function contractivity factors and then obtain the upper and lower bounds of the box-counting dimension of one variable HVRFIF. Finally, in the similar way, we get the lower and upper bounds of box-counting dimension of hidden variable bivariable recurrent fractal interpolation function (HVBRFIF).

Submitted 4 June, 2019; originally announced June 2019.

Comments: arXiv admin note: text overlap with arXiv:1904.11884

arXiv:1904.11884 [pdf]

Analytic properties of hidden variable recurrent fractal interpolation function with function contractivity factors

Authors: Mi-Kyong Ri, Chol-Hui Yun

Abstract: In this paper, we analyze the smoothness and stability of hidden variable recurrent fractal interpolation functions (HVRFIF) with function contractivity factors introduced in Ref. 1. The HVRFIF is a hidden variable fractal interpolation function (HVFIF) constructed by recurrent iterated function system (RIFS) with function contractivity factors. An attractor of RIFS has a local self-similar or self-affine structure and looks more natural than one of IFS. The contractivity factors of IFS (RIFS) determine fractal characteristic and shape of its attractor. Therefore, the HVRFIF with function contractivity factors has more flexibility and diversity than the HVFIF constructed by iterated function system (IFS) with constant contractivity factors. The analytic properties of the interpolation functions play very important roles in determining whether these functions can be applied to the practical problems or not. We analyze the smoothness of the one variable HVRFIFs in Ref. 1 and prove their stability according to perturbation of the interpolation dataset.

Submitted 23 April, 2019; originally announced April 2019.

arXiv:1904.10617 [pdf]

doi 10.1016/j.chaos.2020.109700

Box-counting dimension and analytic properties of hidden variable fractal interpolation functions with function contractivity factors

Authors: Chol-Hui Yun, Mi-Kyong Ri

Abstract: We estimate the bounds of box-counting dimension of hidden variable fractal interpolation functions (HVFIFs) and hidden variable bivariate fractal interpolation functions (HVBFIFs) with four function contractivity factors and present analytic properties of HVFIFs which are constructed to ensure more flexibility and diversity in modeling natural phenomena. Firstly, we construct the HVFIFs and analyze their smoothness and stability. Secondly, we obtain the lower and upper bounds of box-counting dimension of the HVFIFs. Finally, in the similar way, we get the lower and upper bounds of box-counting dimension of HVBFIFs constructed in [21].

Submitted 23 April, 2019; originally announced April 2019.

arXiv:1904.09110 [pdf]

doi 10.1142/S0218348X1950018X

Hidden variable recurrent fractal interpolation function with four function contractivity factors

Authors: Chol-Hui Yun

Abstract: In this paper, we introduce a construction of hidden variable recurrent fractal interpolation functions (HVRFIF) with four function contractivity factors. In the fractal interpolation theory, it is very important to ensure flexibility and diversity of the construction of interpolation function. Recurrent iterated function systems (RIFS) produce fractal sets with local self-similarity structure. Therefore the RIFS can describe the irregular and complicated objects in nature better than the iterated function system (IFS). Hidden variable fractal interpolation function (HVFIF) is neither self-similar nor self-affine. The HVFIF is more complicated, diverse and irregular than the fractal interpolation function (FIF). The contractivity factor is an important one that determines characteristics of FIFs. We present constructions of one variable HVRFIFs and bivariable HVRFIFs using RIFS with four function contractivity factors.

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1809.10858 [pdf, ps, other]

Efficiently testing local optimality and escaping saddles for ReLU networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into at most $2^M$ regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, $O(M)$ (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.

Submitted 28 May, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

Comments: 23 pages, appeared at ICLR 2019

arXiv:1802.03487 [pdf, ps, other]

Small nonlinearities in activation functions create bad local minima in neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.

Submitted 28 May, 2019; v1 submitted 9 February, 2018; originally announced February 2018.

Comments: 33 pages, appeared at ICLR 2019

arXiv:1707.02444 [pdf, ps, other]

Global optimality conditions for deep neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We study the error landscape of deep linear and nonlinear neural networks with the squared error loss. Minimizing the loss of a deep linear neural network is a nonconvex problem, and despite recent progress, our understanding of this loss surface is still incomplete. For deep linear networks, we present necessary and sufficient conditions for a critical point of the risk function to be a global minimum. Surprisingly, our conditions provide an efficiently checkable test for global optimality, while such tests are typically intractable in nonconvex optimization. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality, albeit in a more limited function space setting.

Submitted 24 March, 2018; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: 14 pages. A camera-ready version that will appear at ICLR 2018

arXiv:1404.1300 [pdf, other]

A construction of fractal surfaces with function scaling factors on a rectangular grid

Authors: Chol-Hui Yun, Hui-Chol Choi, Hyong-Chol O

Abstract: A fractal surface is a set which is the graph of a bivariate continuous function. In the construction of fractal surfaces using an IFS, the vertical scaling factors of the IFS are important because they characterize the fractal features of the constructed surfaces. We construct an IFS with function vertical scaling factors which are 0 on the boundaries of a rectangular grid, using an arbitrary data set on the rectangular grid, and give a condition for the attractor of the constructed IFS to be a surface. Finally, lower and upper bounds on the box-counting dimension of the constructed surface are estimated.

Submitted 3 April, 2014; originally announced April 2014.

Comments: 9 pages, 2 figures

Report number: KISU-MATH-2014-E-R-008

MSC Class: 37C45; 28A80; 41A05

arXiv:1307.3229 [pdf, other]

Construction of Recurrent Fractal Interpolation Surfaces with Function Scaling Factors and Estimation of Box-counting Dimension on Rectangular Grids

Authors: Chol-Hui Yun, Hui-Chol Choi, Hyong-Chol O

Abstract: We consider a construction of recurrent fractal interpolation surfaces with function vertical scaling factors and the estimation of their box-counting dimension. A recurrent fractal interpolation surface (RFIS) is an attractor of a recurrent iterated function system (RIFS) which is the graph of a bivariate interpolation function. For any given data set on rectangular grids, we construct general recurrent iterated function systems with function vertical scaling factors and prove the existence of bivariate functions whose graphs are attractors of the constructed RIFSs. Finally, we estimate lower and upper bounds for the box-counting dimension of the constructed RFISs.

Submitted 9 July, 2013; originally announced July 2013.

Comments: 12 pages, 3 figures

Report number: KISU-MATH-2013-E-R-004

MSC Class: 37C45; 28A80; 41A05

arXiv:1305.3365 [pdf, other]

A Construction of the Best Fractal Approximation

Authors: Yong-Suk Kang, Chol-Hui Yun, Dong-Hyok Kim

Abstract: In this paper we present a method for constructing the continuous best fractal approximation in the space of bounded functions. We construct a finite-dimensional subspace of the space of bounded functions whose basis consists of continuous fractal functions, and propose how to find the best approximation of a given continuous function by an element of the constructed space.

Submitted 28 March, 2014; v1 submitted 15 May, 2013; originally announced May 2013.

Comments: 9 pages

Report number: KISU-MATH-2013-E-R-007

MSC Class: Primary 37C45; 28A80; Secondary 41A05

Journal ref: Electronic Journal of Mathematical Analysis and Applications, Vol.2(2) July 2014, pp.144-151

arXiv:1304.2014 [pdf]

Image Compression predicated on Recurrent Iterated Function Systems

Authors: Chol-Hui Yun, W. Metzler, M. Barski

Abstract: Recurrent iterated function systems (RIFSs) are improvements of iterated function systems (IFSs) that use elements of the theory of Markovian stochastic processes and can produce more natural looking images. We construct new RIFSs consisting substantially of a vertical contraction factor function and nonlinear transformations. These RIFSs are applied to image compression.

Submitted 7 April, 2013; originally announced April 2013.

Comments: 11 pages, presented at 2nd International Conference on Mathematics & Statistics, 16-19 June, 2008, Athens, Greece

Report number: KISU-MATH-2008-E-C-001

arXiv:1303.0615 [pdf, other]

doi 10.1016/j.chaos.2014.06.001

Construction of Fractal Surfaces by Recurrent Fractal Interpolation Curves

Authors: Chol-hui Yun, Hyong-chol O., Hui-chol Choi

Abstract: A method to construct fractal surfaces by recurrent fractal curves is provided. First we construct fractal interpolation curves using a recurrent iterated function system (RIFS) with function scaling factors and estimate their box-counting dimension. Then we present a method for constructing a wider class of fractal surfaces from fractal curves and Lipschitz functions and calculate the box-counting dimension of the constructed surfaces. Finally, we combine both methods to obtain more flexible constructions of fractal surfaces.

Submitted 11 August, 2014; v1 submitted 4 March, 2013; originally announced March 2013.

Comments: 14 pages, 2 figures

Report number: KISU-MATH-2013-E-R-003

MSC Class: Primary 37C45; 28A80; Secondary 41A05

Journal ref: Chaos, Solitons & Fractals, 66(2014), 136-143

arXiv:1208.2081 [pdf]

Box-counting dimension of a kind of fractal interpolation surface on rectangular grids

Authors: CholHui Yun, MunChol Kim

Abstract: We estimate the box-counting dimension of fractal surfaces which are generated by iterated function systems with a vertical contraction factor function on an arbitrary data set over rectangular grids and which can express well many natural surfaces with very complicated structures.

Submitted 9 August, 2012; originally announced August 2012.

Report number: KISU-MATH-2012-E-R-011

Journal ref: Romanian Journal of Mathematics and Computer Science, Vol. 2, No. 2, 2012, 61-69

Showing 1–47 of 47 results for author: Yun, C