-
Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems
Authors:
Yujun Kim,
Jaeyoung Cha,
Chulhee Yun
Abstract:
Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $κ$. In contrast, little is known when $K$ is smaller than $κ$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
Submitted 4 June, 2025;
originally announced June 2025.
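As a rough illustration of the algorithm named in the abstract above, the sketch below runs IGD (one pass over the components in the same fixed order every epoch) on a one-dimensional toy problem. The component functions $f_i(x) = \frac{a_i}{2}(x - b_i)^2$, the constant step size `eta`, and the function name `igd` are assumptions made for this example and are not taken from the paper.

```python
def igd(a, b, x0, eta, epochs):
    """Incremental gradient descent on f(x) = sum_i 0.5 * a[i] * (x - b[i])**2.

    Toy setting (an assumption for illustration): scalar x, constant step size.
    Components are visited in the same fixed order 0, 1, ..., n-1 every epoch.
    """
    x = x0
    for _ in range(epochs):
        for ai, bi in zip(a, b):
            # Gradient of the single component 0.5 * ai * (x - bi)**2.
            x -= eta * ai * (x - bi)
    return x


# Two components whose sum is minimized at x = 0.
x_final = igd(a=[1.0, 1.0], b=[-1.0, 1.0], x0=5.0, eta=0.1, epochs=200)
print(x_final)
```

In this toy instance each epoch maps $x$ to $0.81x + 0.01$, so with a constant step size the iterates settle at $0.01/0.19$ rather than at the minimizer $0$. This is a property of the example only and says nothing about the rates proved in the paper.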
-
Provable Benefit of Random Permutations over Uniform Sampling in Stochastic Coordinate Descent
Authors:
Donghwa Kim,
Jaewook Lee,
Chulhee Yun
Abstract:
We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms' performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. To this end, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD for every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture extends our results to the general class of positive-definite quadratic functions.
Submitted 29 May, 2025;
originally announced May 2025.
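The difference between the two index-selection rules in the abstract above can be sketched as follows. This is a minimal example, assuming exact coordinate minimization on the quadratic $f(x) = \frac{1}{2}x^\top A x$ with a small permutation-invariant matrix `A` (unit diagonal, constant off-diagonal); the helper names and the choice of update rule are illustrative and may differ from the setting analyzed in the paper.

```python
import random


def quad(A, x):
    """f(x) = 0.5 * x^T A x for a symmetric positive-definite A (list of lists)."""
    n = len(x)
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def cd_update(A, x, i):
    """Exactly minimize f along coordinate i, in place."""
    grad_i = sum(A[i][j] * x[j] for j in range(len(x)))
    x[i] -= grad_i / A[i][i]


def rcd_epoch(A, x, rng):
    """Random CD: n updates, each index sampled uniformly with replacement."""
    n = len(x)
    for _ in range(n):
        cd_update(A, x, rng.randrange(n))


def rpcd_epoch(A, x, rng):
    """Random-permutation CD: a fresh permutation each epoch, every index once."""
    order = list(range(len(x)))
    rng.shuffle(order)
    for i in order:
        cd_update(A, x, i)


# Permutation-invariant Hessian: unit diagonal, constant off-diagonal entries.
A = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
rng = random.Random(0)
x_rcd, x_rpcd = [1.0, -2.0, 3.0], [1.0, -2.0, 3.0]
for _ in range(20):
    rcd_epoch(A, x_rcd, rng)
    rpcd_epoch(A, x_rpcd, rng)
print(quad(A, x_rcd), quad(A, x_rpcd))
```

Both loops perform the same number of coordinate updates per epoch; the only difference is whether an index can repeat within an epoch.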
-
Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification
Authors:
Hyunji Jung,
Hanseul Cho,
Chulhee Yun
Abstract:
We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
Submitted 26 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Stochastic Extragradient with Flip-Flop Shuffling & Anchoring: Provable Improvements
Authors:
Jiseok Chae,
Chulhee Yun,
Donghwan Kim
Abstract:
In minimax optimization, the extragradient (EG) method has been extensively studied because it outperforms the gradient descent-ascent method in convex-concave (C-C) problems. Yet, stochastic EG (SEG) has seen limited success in C-C problems, especially for unconstrained cases. Motivated by the recent progress of shuffling-based stochastic methods, we investigate the convergence of shuffling-based SEG in unconstrained finite-sum minimax problems, in search of convergent shuffling-based SEG. Our analysis reveals that both random reshuffling and the recently proposed flip-flop shuffling alone can suffer divergence in C-C problems. However, with an additional simple trick called anchoring, we develop the SEG with flip-flop anchoring (SEG-FFA) method which successfully converges in C-C problems. We also show upper and lower bounds in the strongly-convex-strongly-concave setting, demonstrating that SEG-FFA has a provably faster convergence rate compared to other shuffling-based methods.
Submitted 31 December, 2024;
originally announced January 2025.
-
Does SGD really happen in tiny subspaces?
Authors:
Minhak Song,
Kwangjun Ahn,
Chulhee Yun
Abstract:
Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. We observe similar behavior across practical setups, including the large learning rate regime (also known as Edge of Stability), Sharpness-Aware Minimization, momentum, and adaptive optimizers. We discuss the main causes and implications of this spurious alignment, shedding light on the dynamics of neural network training.
Submitted 10 March, 2025; v1 submitted 24 May, 2024;
originally announced May 2024.
-
On the topology of the moduli of tropical unramified p-covers
Authors:
Yassine El Maazouz,
Paul Alexander Helminck,
Felix Röhrle,
Pedro Souza,
Claudia He Yun
Abstract:
We study the topology of the moduli space of unramified $\mathbb{Z}/p$-covers of tropical curves of genus $g \geq 2$, where $p$ is a prime number. We use recent techniques by Chan--Galatius--Payne to identify contractible subcomplexes of the moduli space. We then use this contractibility result to show that this moduli space is simply connected. In the case of genus 2, we determine the homotopy type of this moduli space for all primes $p$. This work is motivated by prospective applications to the top-weight cohomology of the space of prime cyclic étale covers of smooth algebraic curves.
Submitted 3 October, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Fundamental Benefit of Alternating Updates in Minimax Optimization
Authors:
Jaewook Lee,
Hanseul Cho,
Chulhee Yun
Abstract:
The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA). While Alt-GDA is commonly observed to converge faster, the performance gap between the two is not yet well understood theoretically, especially in terms of global convergence rates. To address this theory-practice gap, we present fine-grained convergence analyses of both algorithms for strongly-convex-strongly-concave and Lipschitz-gradient objectives. Our new iteration complexity upper bound of Alt-GDA is strictly smaller than the lower bound of Sim-GDA; i.e., Alt-GDA is provably faster. Moreover, we propose Alternating-Extrapolation GDA (Alex-GDA), a general algorithmic framework that subsumes Sim-GDA and Alt-GDA, for which the main idea is to alternately take gradients from extrapolations of the iterates. We show that Alex-GDA satisfies a smaller iteration complexity bound, identical to that of the Extra-gradient method, while requiring fewer gradient computations. We also prove that Alex-GDA enjoys linear convergence for bilinear problems, for which both Sim-GDA and Alt-GDA fail to converge at all.
Submitted 15 July, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
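The simultaneous and alternating updates described in the abstract above can be sketched on the scalar bilinear problem $f(x, y) = xy$. This is a minimal example, assuming a shared constant step size `step` for both players and scalar iterates; the function names are illustrative, and Alex-GDA is not implemented because the abstract does not give its update rule.

```python
import math


def sim_gda_step(x, y, step):
    """Simultaneous GDA on f(x, y) = x * y: both players use the old iterates."""
    # df/dx = y (descent for x), df/dy = x (ascent for y).
    return x - step * y, y + step * x


def alt_gda_step(x, y, step):
    """Alternating GDA: y's ascent step uses the freshly updated x."""
    x_new = x - step * y
    return x_new, y + step * x_new


def final_norm(step_fn, x, y, step, iters):
    """Run `iters` steps and return the distance to the saddle point (0, 0)."""
    for _ in range(iters):
        x, y = step_fn(x, y, step)
    return math.hypot(x, y)


print(final_norm(sim_gda_step, 1.0, 1.0, 0.1, 1000))  # grows with the iteration count
print(final_norm(alt_gda_step, 1.0, 1.0, 0.1, 1000))  # stays bounded, does not shrink to 0
```

On this toy problem each Sim-GDA step multiplies $x^2 + y^2$ by $1 + \text{step}^2$, while Alt-GDA preserves $x^2 + y^2 - \text{step}\cdot xy$, so the first run moves away from the saddle point and the second cycles around it. Both behaviors match the abstract's remark that neither method converges on bilinear problems.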
-
Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults
Authors:
Prin Phunyaphibarn,
Junghyun Lee,
Bohan Wang,
Huishuai Zhang,
Chulhee Yun
Abstract:
Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum "prolonging" the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.
Submitted 29 May, 2024; v1 submitted 25 November, 2023;
originally announced November 2023.
-
Linear attention is (maybe) all you need (to understand transformer optimization)
Authors:
Kwangjun Ahn,
Xiang Cheng,
Minhak Song,
Chulhee Yun,
Ali Jadbabaie,
Suvrit Sra
Abstract:
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023) and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
Submitted 13 March, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
PGL orbits in tree varieties
Authors:
Izzet Coskun,
Demir Eken,
Chris Yun
Abstract:
In this paper, we introduce tree varieties as a natural generalization of products of partial flag varieties. We study orbits of the PGL action on tree varieties. We characterize tree varieties with finitely many PGL orbits, generalizing a celebrated theorem of Magyar, Weyman and Zelevinsky. We give criteria that guarantee that a tree variety has a dense PGL orbit and provide many examples of tree varieties that do not have dense PGL orbits. We show that a triple of two-step flag varieties $F(k_1, k_2; n)^3$ has a dense PGL orbit if and only if $k_1 + k_2 \not= n$.
Submitted 18 July, 2023;
originally announced July 2023.
-
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory
Authors:
Minhak Song,
Chulhee Yun
Abstract:
Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.
Submitted 26 October, 2023; v1 submitted 9 July, 2023;
originally announced July 2023.
-
A Serre spectral sequence for the moduli space of tropical curves
Authors:
Christin Bibby,
Melody Chan,
Nir Gadish,
Claudia He Yun
Abstract:
We construct, for all $g\geq 2$ and $n\geq 0$, a spectral sequence of rational $S_n$-representations which computes the $S_n$-equivariant reduced rational cohomology of the tropical moduli spaces of curves $Δ_{g,n}$ in terms of compactly supported cohomology groups of configuration spaces of $n$ points on graphs of genus $g$. Using the canonical $S_n$-equivariant isomorphisms $\widetilde{H}^{i-1}(Δ_{g,n};\mathbb{Q}) \cong W_0 H^i_c(\mathcal{M}_{g,n};\mathbb{Q})$, we calculate the weight $0$, compactly supported rational cohomology of the moduli spaces $\mathcal{M}_{g,n}$ in the range $g=3$ and $n\leq 9$, with partial computations available for $n\leq 13$.
Submitted 15 April, 2024; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Positive del Pezzo Geometry
Authors:
Nick Early,
Alheydis Geiger,
Marta Panizzut,
Bernd Sturmfels,
Claudia He Yun
Abstract:
Real, complex, and tropical algebraic geometry join forces in a new branch of mathematical physics called positive geometry. We develop the positive geometry of del Pezzo surfaces and their moduli spaces, viewed as very affine varieties. Their connected components are derived from polyhedral spaces with Weyl group symmetries. We study their canonical forms and scattering amplitudes, and we solve the likelihood equations.
Submitted 6 January, 2025; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima
Authors:
Dongkuk Si,
Chulhee Yun
Abstract:
Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + ρ\frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming decaying perturbation size $ρ$ and/or no gradient normalization in $y_t$, which is detached from practice. To address this gap, we study deterministic/stochastic versions of SAM with practical configurations (i.e., constant $ρ$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions. Perhaps surprisingly, in many scenarios, we find out that SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde Θ(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers an inevitable additive term $O(ρ^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(ρ^2)$ factors arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
Submitted 27 October, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
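The "practical" SAM configuration described in the abstract above (constant $ρ$, normalized gradient in $y_t$) can be sketched as a single deterministic step. This is a minimal example, assuming the update $x_{t+1} = x_t - η\nabla f(y_t)$ with a constant step size `eta`; the quadratic test function, the handling of a zero gradient, and all function names are assumptions made for illustration.

```python
import math


def quad_grad(x, a=(1.0, 10.0)):
    """Gradient of the illustrative quadratic f(x) = 0.5 * sum(a_i * x_i**2)."""
    return [ai * xi for ai, xi in zip(a, x)]


def sam_step(x, eta, rho, grad_fn=quad_grad):
    """One deterministic SAM step with constant rho and gradient normalization.

    y_t     = x_t + rho * grad f(x_t) / ||grad f(x_t)||
    x_{t+1} = x_t - eta * grad f(y_t)      (eta is an assumed step size)
    """
    g = grad_fn(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # The normalized perturbation is undefined at a stationary point;
        # staying put is a convention chosen for this sketch.
        return list(x)
    y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    gy = grad_fn(y)
    return [xi - eta * gi for xi, gi in zip(x, gy)]


x = [1.0, 0.0]
for _ in range(5):
    x = sam_step(x, eta=0.1, rho=0.5)
    print(x)
```

Because the perturbation always has length `rho`, the gradient is evaluated at a point a fixed distance from $x_t$ no matter how close $x_t$ is to a minimizer. That fixed offset is the feature the abstract contrasts with a decaying $ρ$ or an unnormalized perturbation.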
-
Provable Benefit of Mixup for Finding Optimal Decision Boundaries
Authors:
Junsoo Oh,
Chulhee Yun
Abstract:
We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem. For a family of data distributions with a separability constant $κ$, we analyze how well the optimal classifier in terms of training loss aligns with the optimal one in test accuracy (i.e., Bayes optimal classifier). For vanilla training without augmentation, we uncover an interesting phenomenon named the curse of separability. As we increase $κ$ to make the data distribution more separable, the sample complexity of vanilla training increases exponentially in $κ$; perhaps surprisingly, the task of finding optimal decision boundaries becomes harder for more separable distributions. For Mixup training, we show that Mixup mitigates this problem by significantly reducing the sample complexity. To this end, we develop new concentration results applicable to $n^2$ pair-wise augmented data points constructed from $n$ independent data, by carefully dealing with dependencies between overlapping pairs. Lastly, we study other masking-based Mixup-style techniques and show that they can distort the training loss and make its minimizer converge to a suboptimal classifier in terms of test accuracy.
Submitted 5 June, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Some thoughts and experiments on Bergman's compact amalgamation problem
Authors:
Michael Joswig,
Mario Kummer,
Andreas Thom,
Claudia He Yun
Abstract:
We study the question whether copies of $S^1$ in $\mathrm{SU}(3)$ can be amalgamated in a compact group. This is the simplest instance of a fundamental open problem in the theory of compact groups raised by George Bergman in 1987. Considerable computational experiments suggest that the answer is positive in this case. We obtain a positive answer for a relaxed problem using theoretical considerations.
Submitted 13 July, 2023; v1 submitted 17 April, 2023;
originally announced April 2023.
-
Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond
Authors:
Jaeyoung Cha,
Jaewook Lee,
Chulhee Yun
Abstract:
We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems. Unlike most existing results focusing on final iterate lower bounds in terms of the number of components $n$ and the number of epochs $K$, we seek bounds for arbitrary weighted average iterates that are tight in all factors including the condition number $κ$. For SGD with Random Reshuffling, we present lower bounds that have tighter $κ$ dependencies than existing bounds. Our results are the first to perfectly close the gap between lower and upper bounds for weighted average iterates in both strongly-convex and convex cases. We also prove weighted average iterate lower bounds for arbitrary permutation-based SGD, which apply to all variants that carefully choose the best permutation. Our bounds improve the existing bounds in factors of $n$ and $κ$ and thereby match the upper bounds shown for a recently proposed algorithm called GraB.
Submitted 9 June, 2023; v1 submitted 13 March, 2023;
originally announced March 2023.
-
On the Training Instability of Shuffling SGD with Batch Normalization
Authors:
David X. Wu,
Chulhee Yun,
Suvrit Sra
Abstract:
We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can, and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.
Submitted 14 August, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization
Authors:
Hanseul Cho,
Chulhee Yun
Abstract:
Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.
Submitted 20 February, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Discrete Morse theory for symmetric Delta-complexes
Authors:
Claudia He Yun
Abstract:
We generalize Forman's discrete Morse theory to the context of symmetric $Δ$-complexes. As an application, we prove that the coloop subcomplex of the link of the origin $LA^{\mathrm{trop},\mathrm{P}}_g$ in the moduli space of principally polarized tropical abelian varieties of dimension $g$ with respect to the perfect cone decomposition is contractible.
Submitted 2 September, 2022;
originally announced September 2022.
-
Equivariant Hodge polynomials of heavy/light moduli spaces
Authors:
Siddarth Kannan,
Stefano Serpente,
Claudia He Yun
Abstract:
Let $\bar{\mathcal{M}}_{g, m|n}$ denote Hassett's moduli space of weighted pointed stable curves of genus $g$ for the heavy/light weight data $\left(1^{(m)}, 1/n^{(n)}\right)$, and let $\mathcal{M}_{g, m|n} \subset \bar{\mathcal{M}}_{g, m|n}$ be the locus parameterizing smooth, not necessarily distinctly marked curves. We give a change-of-variables formula which computes the generating function for $(S_m\times S_n)$-equivariant Hodge-Deligne polynomials of these spaces in terms of the generating functions for $S_{n}$-equivariant Hodge-Deligne polynomials of $\bar{\mathcal{M}}_{g,n}$ and $\mathcal{M}_{g,n}$.
Submitted 22 April, 2024; v1 submitted 6 July, 2022;
originally announced July 2022.
-
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond
Authors:
Chulhee Yun,
Shashank Rajput,
Suvrit Sra
Abstract:
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
Submitted 23 March, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
The role of viral infectivity in oncolytic virotherapy outcomes: A mathematical study
Authors:
Pantea Pooladvand,
Chae-Ok Yun,
A-Rum Yoon,
Peter S. Kim,
Federico Frascoli
Abstract:
A model capturing the dynamics between virus and tumour cells in the context of oncolytic virotherapy is presented and analysed. The ability of the virus to be internalised by uninfected cells is described by an infectivity parameter, which is inferred from available experimental data. The parameter is also able to describe the effects of changes in the tumour environment that affect viral uptake from tumour cells. Results show that when a virus is inoculated inside a growing tumour, strategies for enhancing infectivity do not lead to a complete eradication of the tumour. Within typical times of experiments and treatments, we observe the onset of oscillations, which always prevent a full destruction of the tumour mass. These findings are in good agreement with available laboratory results. Further analysis shows why a fully successful therapy cannot exist for the proposed model and that care must be taken when designing and engineering viral vectors with enhanced features. In particular, bifurcation analysis reveals that creating longer lasting virus particles or using strategies for reducing infected cell lifespan can cause unexpected and unwanted surges in the overall tumour load over time. Our findings suggest that virotherapy alone seems unlikely to be effective in clinical settings unless adjuvant strategies are included.
Submitted 7 October, 2021;
originally announced October 2021.
-
Homology representations of compactified configurations on graphs applied to $\mathcal{M}_{2,n}$
Authors:
Christin Bibby,
Melody Chan,
Nir Gadish,
Claudia He Yun
Abstract:
We obtain new calculations of the top weight rational cohomology of the moduli spaces $\mathcal{M}_{2,n}$, equivalently the rational homology of the tropical moduli spaces $Δ_{2,n}$, as a representation of $S_n$. These calculations are achieved fully for all $n\leq 10$, and partially -- for specific irreducible representations of $S_n$ -- for $n\le 22$. We also present conjectures, verified up to $n=22$, for the multiplicities of the irreducible representations $\mathrm{std}_n$ and $\mathrm{std}_n\otimes \mathrm{sgn}_n$.
We achieve our calculations via a comparison with the homology of compactified configuration spaces of graphs. These homology groups are equipped with commuting actions of a symmetric group and the outer automorphism group of a free group. In this paper, we construct an efficient free resolution for these homology representations, from which we extract calculations on irreducible representations one at a time, simplifying the calculation of these homology representations.
Submitted 25 April, 2023; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities that can be summarized as: "single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD." We present theorems that support our conjecture by proving several special cases.
Submitted 11 March, 2021;
originally announced March 2021.
-
Topology of tropical moduli spaces of weighted stable curves in higher genus
Authors:
Siddarth Kannan,
Shiyue Li,
Stefano Serpente,
Claudia He Yun
Abstract:
Given integers $g \geq 0$, $n \geq 1$, and a vector $w \in (\mathbb{Q} \cap (0, 1])^n$ such that ${2g - 2 + \sum w_i > 0}$, we study the topology of the moduli space $Δ_{g, w}$ of $w$-stable tropical curves of genus $g$ with volume 1. The space $Δ_{g, w}$ is the dual complex of the divisor of singular curves in Hassett's moduli space of $w$-stable genus $g$ curves $\overline{\mathcal{M}}_{g, w}$. When $g \geq 1$, we show that $Δ_{g, w}$ is simply connected for all values of $w$. We also give a formula for the Euler characteristic of $Δ_{g, w}$ in terms of the combinatorics of $w$.
Submitted 15 March, 2022; v1 submitted 22 October, 2020;
originally announced October 2020.
-
DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval
Authors:
Yun Cao,
Yuebin Wang,
Junhuan Peng,
Liqiang Zhang,
Linlin Xu,
Kai Yan,
Lihua Li
Abstract:
Training with a small number of labeled samples can save considerable manpower and material resources, especially when the amount of high spatial resolution remote sensing images (HSR-RSIs) increases considerably. However, many deep models face the problem of overfitting when using a small number of labeled samples. This might degrade HSR-RSI retrieval accuracy. Aiming at obtaining more accurate HSR-RSI retrieval performance with small training samples, we develop a deep metric learning approach with generative adversarial network regularization (DML-GANR) for HSR-RSI retrieval. The DML-GANR starts from a high-level feature extraction (HFE) to extract high-level features, which includes convolutional layers and fully connected (FC) layers. Each of the FC layers is constructed by deep metric learning (DML) to maximize the interclass variations and minimize the intraclass variations. The generative adversarial network (GAN) is adopted to mitigate the overfitting problem and validate the qualities of extracted high-level features. DML-GANR is optimized through a customized approach, and the optimal parameters are obtained. The experimental results on the three data sets demonstrate the superior performance of DML-GANR over state-of-the-art techniques in HSR-RSI retrieval.
Submitted 6 October, 2020;
originally announced October 2020.
-
SLCRF: Subspace Learning with Conditional Random Field for Hyperspectral Image Classification
Authors:
Yun Cao,
Jie Mei,
Yuebin Wang,
Liqiang Zhang,
Junhuan Peng,
Bing Zhang,
Lihua Li,
Yibo Zheng
Abstract:
Subspace learning (SL) plays an important role in hyperspectral image (HSI) classification, since it can provide an effective solution to reduce the redundant information in the image pixels of HSIs. Previous works about SL aim to improve the accuracy of HSI recognition. Using a large number of labeled samples, related methods can train the parameters of the proposed solutions to obtain better representations of HSI pixels. However, the data instances may not be sufficient to learn a precise model for HSI classification in real applications. Moreover, it is well-known that it takes much time, labor and human expertise to label HSI images. To avoid the aforementioned problems, a novel SL method that includes the probability assumption called subspace learning with conditional random field (SLCRF) is developed. In SLCRF, first, the 3D convolutional autoencoder (3DCAE) is introduced to remove the redundant information in HSI pixels. In addition, the relationships are also constructed using the spectral-spatial information among the adjacent pixels. Then, the conditional random field (CRF) framework can be constructed and further embedded into the HSI SL procedure with the semi-supervised approach. Through the linearized alternating direction method termed LADMAP, the objective function of SLCRF is optimized using a defined iterative algorithm. The proposed method is comprehensively evaluated using the challenging public HSI datasets. We can achieve state-of-the-art performance using these HSI sets.
Submitted 6 October, 2020;
originally announced October 2020.
-
A Unifying View on Implicit Bias in Training Linear Neural Networks
Authors:
Chulhee Yun,
Shankar Krishnan,
Hossein Mobahi
Abstract:
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.
Submitted 10 September, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
-
The $S_n$-equivariant rational homology of the tropical moduli spaces $Δ_{2,n}$
Authors:
Claudia He Yun
Abstract:
We compute the $S_n$-equivariant rational homology of the tropical moduli spaces $Δ_{2,n}$ for $n\leq 8$ using a cellular chain complex for symmetric $Δ$-complexes in Sage.
Submitted 10 August, 2020;
originally announced August 2020.
-
Existence and convergence theorems for monotone generalized alpa-nonexpansive mappings in uniformly convex partially ordered hyperbolic metric spaces and its application
Authors:
Chang Il Rim,
Jong Gyong Kim,
Chol-Hui Yun
Abstract:
In this paper, we generalize the existence result in [14] and prove convergence theorems of the iterative scheme in [12, 16] for monotone generalized alpa-nonexpansive mappings in uniformly convex partially ordered hyperbolic metric spaces. And we also give a numerical example to show that this scheme converges faster than the scheme in [14] and apply the result to the integral equation.
Submitted 25 June, 2020;
originally announced June 2020.
-
SGD with shuffling: optimal rates without component convexity and large epoch requirements
Authors:
Kwangjun Ahn,
Chulhee Yun,
Suvrit Sra
Abstract:
We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Secondly, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior arts: large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.
Submitted 21 June, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Mathematical modelling of the interaction between cancer cells and an oncolytic virus: insights into the effects of treatment protocols
Authors:
Adrianne L. Jenner,
Chae-Ok Yun,
Peter S. Kim,
Adelle C. F. Coster
Abstract:
Oncolytic virotherapy is an experimental cancer treatment that uses genetically engineered viruses to target and kill cancer cells. One major limitation of this treatment is that virus particles are rapidly cleared by the immune system, preventing them from arriving at the tumour site. To improve virus survival and infectivity, virus particles have been modified with the polymer polyethylene glycol (PEG) and the monoclonal antibody herceptin. While PEG modification appeared to improve plasma retention and initial infectivity, it also increased the virus particle arrival time. We derive a mathematical model that describes the interaction between tumour cells and an oncolytic virus. We tune our model to represent the experimental data by Kim et al. (2011) and obtain optimised parameters. Our model provides a platform from which predictions may be made about the response of cancer growth to other treatment protocols beyond those in the experiments. Through model simulations we find that the treatment protocol affects the outcome dramatically. We quantify the effects of dosage strategy as a function of tumour cell replication and tumour carrying capacity on the outcome of oncolytic virotherapy as a treatment. The relative significance of the modification of the virus and the crucial role it plays in optimising treatment efficacy is explored.
Submitted 28 November, 2019;
originally announced November 2019.
-
Are deep ResNets provably better than linear predictors?
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet does not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows, under simple geometric conditions, that any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.
Submitted 29 October, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
Estimation of errors on perturbation of function contractivity factors and box-counting dimension of hidden variable recurrent fractal interpolation function
Authors:
Mi-Kyong Ri,
Chol-Hui Yun
Abstract:
In this paper, we study errors on perturbation of function contractivity factors and box-counting dimension of hidden variable recurrent fractal interpolation function (HVRFIF). The HVRFIF is a hidden variable fractal interpolation function (HVFIF) constructed by recurrent iterated function system (RIFS) with function contractivity factors. The contractivity factors of RIFS determine fractal characteristics and shape of its attractor, so that the HVRFIF with function contractivity factors has more flexibility and diversity than the HVFIF with constant contractivity factors. Stability of interpolation function according to perturbation of the contractivity factors and the box-counting dimension of interpolation function plays very important roles in determining whether these functions can be applied to practical problems or not. We first estimate errors on perturbation of function contractivity factors and then obtain the upper and lower bounds of the box-counting dimension of one variable HVRFIF. Finally, in the similar way, we get the lower and upper bounds of box-counting dimension of hidden variable bivariable recurrent fractal interpolation function (HVBRFIF).
Submitted 4 June, 2019;
originally announced June 2019.
-
Analytic properties of hidden variable recurrent fractal interpolation function with function contractivity factors
Authors:
Mi-Kyong Ri,
Chol-Hui Yun
Abstract:
In this paper, we analyze the smoothness and stability of hidden variable recurrent fractal interpolation functions (HVRFIF) with function contractivity factors introduced in Ref. 1. The HVRFIF is a hidden variable fractal interpolation function (HVFIF) constructed by recurrent iterated function system (RIFS) with function contractivity factors. An attractor of RIFS has a local self-similar or self-affine structure and looks more natural than one of IFS. The contractivity factors of IFS (RIFS) determine the fractal characteristics and shape of its attractor. Therefore, the HVRFIF with function contractivity factors has more flexibility and diversity than the HVFIF constructed by iterated function system (IFS) with constant contractivity factors. The analytic properties of the interpolation functions play very important roles in determining whether these functions can be applied to practical problems or not. We analyze the smoothness of the one variable HVRFIFs in Ref. 1 and prove their stability according to perturbation of the interpolation dataset.
Submitted 23 April, 2019;
originally announced April 2019.
-
Box-counting dimension and analytic properties of hidden variable fractal interpolation functions with function contractivity factors
Authors:
Chol-Hui Yun,
Mi-Kyong Ri
Abstract:
We estimate the bounds of box-counting dimension of hidden variable fractal interpolation functions (HVFIFs) and hidden variable bivariate fractal interpolation functions (HVBFIFs) with four function contractivity factors and present analytic properties of HVFIFs which are constructed to ensure more flexibility and diversity in modeling natural phenomena. Firstly, we construct the HVFIFs and analyze their smoothness and stability. Secondly, we obtain the lower and upper bounds of box-counting dimension of the HVFIFs. Finally, in the similar way, we get the lower and upper bounds of box-counting dimension of HVBFIFs constructed in [21].
Submitted 23 April, 2019;
originally announced April 2019.
-
Hidden variable recurrent fractal interpolation function with four function contractivity factors
Authors:
Chol-Hui Yun
Abstract:
In this paper, we introduce a construction of hidden variable recurrent fractal interpolation functions (HVRFIF) with four function contractivity factors. In the fractal interpolation theory, it is very important to ensure flexibility and diversity of the construction of interpolation function. Recurrent iterated function systems (RIFS) produce fractal sets with local self-similarity structure. Therefore the RIFS can describe the irregular and complicated objects in nature better than the iterated function system (IFS). A hidden variable fractal interpolation function (HVFIF) is neither self-similar nor self-affine. The HVFIF is more complicated, diverse and irregular than the fractal interpolation function (FIF). The contractivity factor is an important quantity that determines the characteristics of FIFs. We present a construction of one variable HVRFIFs and bivariable HVRFIFs using RIFS with four function contractivity factors.
Submitted 19 April, 2019;
originally announced April 2019.
-
Efficiently testing local optimality and escaping saddles for ReLU networks
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into at most $2^M$ regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, $O(M)$ (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.
Submitted 28 May, 2019; v1 submitted 28 September, 2018;
originally announced September 2018.
-
Small nonlinearities in activation functions create bad local minima in neural networks
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.
Submitted 28 May, 2019; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Global optimality conditions for deep neural networks
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We study the error landscape of deep linear and nonlinear neural networks with the squared error loss. Minimizing the loss of a deep linear neural network is a nonconvex problem, and despite recent progress, our understanding of this loss surface is still incomplete. For deep linear networks, we present necessary and sufficient conditions for a critical point of the risk function to be a global minimum. Surprisingly, our conditions provide an efficiently checkable test for global optimality, while such tests are typically intractable in nonconvex optimization. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality, albeit in a more limited function space setting.
Submitted 24 March, 2018; v1 submitted 8 July, 2017;
originally announced July 2017.
-
A construction of fractal surfaces with function scaling factors on a rectangular grid
Authors:
Chol-Hui Yun,
Hui-Chol Choi,
Hyong-Chol O
Abstract:
A fractal surface is a set that is the graph of a bivariate continuous function. In the construction of fractal surfaces using an iterated function system (IFS), the vertical scaling factors are important because they characterize the fractal features of the constructed surfaces. We construct an IFS with function vertical scaling factors that are 0 on the boundaries of a rectangular grid, using an arbitrary data set on the grid, and give a condition for the attractor of the constructed IFS to be a surface. Finally, we estimate lower and upper bounds for the box-counting dimension of the constructed surface.
Submitted 3 April, 2014;
originally announced April 2014.
-
Construction of Recurrent Fractal Interpolation Surfaces with Function Scaling Factors and Estimation of Box-counting Dimension on Rectangular Grids
Authors:
Chol-Hui Yun,
Hui-Chol Choi,
Hyong-Chol O
Abstract:
We consider the construction of recurrent fractal interpolation surfaces with function vertical scaling factors and the estimation of their box-counting dimension. A recurrent fractal interpolation surface (RFIS) is an attractor of a recurrent iterated function system (RIFS) that is the graph of a bivariate interpolation function. For any given data set on a rectangular grid, we construct general recurrent iterated function systems with function vertical scaling factors and prove the existence of bivariate functions whose graphs are attractors of the constructed RIFSs. Finally, we estimate lower and upper bounds for the box-counting dimension of the constructed RFISs.
Submitted 9 July, 2013;
originally announced July 2013.
-
A Construction of the Best Fractal Approximation
Authors:
Yong-Suk Kang,
Chol-Hui Yun,
Dong-Hyok Kim
Abstract:
In this paper we present a method for constructing the continuous best fractal approximation in the space of bounded functions. We construct a finite-dimensional subspace of the space of bounded functions whose basis consists of continuous fractal functions, and propose how to find the best approximation of a given continuous function by an element of the constructed space.
Submitted 28 March, 2014; v1 submitted 15 May, 2013;
originally announced May 2013.
-
Image Compression predicated on Recurrent Iterated Function Systems
Authors:
Chol-Hui Yun,
W. Metzler,
M. Barski
Abstract:
Recurrent iterated function systems (RIFSs) improve on iterated function systems (IFSs) by using elements of the theory of Markovian stochastic processes, and can produce more natural-looking images. We construct new RIFSs consisting substantially of a vertical contraction factor function and nonlinear transformations. These RIFSs are applied to image compression.
Submitted 7 April, 2013;
originally announced April 2013.
-
Construction of Fractal Surfaces by Recurrent Fractal Interpolation Curves
Authors:
Chol-hui Yun,
Hyong-chol O.,
Hui-chol Choi
Abstract:
A method to construct fractal surfaces from recurrent fractal curves is provided. First, we construct fractal interpolation curves using a recurrent iterated function system (RIFS) with function scaling factors and estimate their box-counting dimension. Then we present a method for constructing a wider class of fractal surfaces from fractal curves and Lipschitz functions, and calculate the box-counting dimension of the constructed surfaces. Finally, we combine both methods to obtain more flexible constructions of fractal surfaces.
Submitted 11 August, 2014; v1 submitted 4 March, 2013;
originally announced March 2013.
-
Box-counting dimension of a kind of fractal interpolation surface on rectangular grids
Authors:
CholHui Yun,
MunChol Kim
Abstract:
We estimate the box-counting dimension of fractal surfaces that are generated by iterated function systems with a vertical contraction factor function on an arbitrary data set over rectangular grids. Such surfaces can represent many natural surfaces with very complicated structures.
Submitted 9 August, 2012;
originally announced August 2012.