Showing 1–12 of 12 results for author: Dereich, S

Searching in archive cs.
  1. arXiv:2505.09572

    cs.LG math.LO math.OC stat.ML

    SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

    Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen

    Abstract: We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove t…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 27 pages, 4 figures

    MSC Class: Primary 68T05; Secondary 68T07; 26B40; 03C64; 03C98

  2. arXiv:2504.19426

    math.OC cs.AI

    Sharp higher order convergence rates for the Adam optimizer

    Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

    Abstract: Gradient descent based optimization methods are the methods of choice to train deep neural networks in machine learning. Beyond the standard gradient descent method, also suitable modified variants of standard gradient descent involving acceleration techniques such as the momentum method and/or adaptivity techniques such as the RMSprop method are frequently considered optimization methods. These d…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: 27 pages

    MSC Class: 68T05; 65K05; 90C25

    ACM Class: I.2.0

  3. arXiv:2504.08867

    cs.LG math.PR stat.ML

    In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods

    Authors: Felix Benning, Steffen Dereich

    Abstract: Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, 1-hidden layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divid…

    Submitted 11 April, 2025; originally announced April 2025.

    MSC Class: 60G15; 60G60; 62J02; 62M45; 68T07

  4. arXiv:2501.15646

    cs.LG math.NA

    Mathematical analysis of the gradients in deep learning

    Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Frederic Weber

    Abstract: Deep learning algorithms -- typically consisting of a class of deep artificial neural networks (ANNs) trained by a stochastic gradient descent (SGD) optimization method -- are nowadays an integral part in many areas of science, industry, and also our day to day life. Roughly speaking, in their most basic form, ANNs can be regarded as functions that consist of a series of compositions of affine-lin…

    Submitted 26 January, 2025; originally announced January 2025.

    Comments: 38 pages

    MSC Class: 68T07

    ACM Class: I.2.6

  5. arXiv:2501.06081

    math.OC cs.LG math.NA

    Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problems

    Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

    Abstract: Deep learning methods - usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, often not the plain-vanilla…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: 25 pages, 10 figures

  6. arXiv:2407.21078

    math.OC cs.LG math.PR stat.ML

    Convergence rates for the Adam optimizer

    Authors: Steffen Dereich, Arnulf Jentzen

    Abstract: Stochastic gradient descent (SGD) optimization methods are nowadays the method of choice for the training of deep neural networks (DNNs) in artificial intelligence systems. In practically relevant training problems, usually not the plain vanilla standard SGD method is the employed optimization scheme but instead suitably accelerated and adaptive SGD optimization methods are applied. As of today, m…

    Submitted 29 July, 2024; originally announced July 2024.

  7. arXiv:2407.08100

    cs.LG math.OC math.PR

    Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

    Authors: Steffen Dereich, Robin Graeber, Arnulf Jentzen

    Abstract: Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versi…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 54 pages

    MSC Class: 60J22 (Primary); 65K10; 60J20; 65C40 (Secondary)

    ACM Class: G.1.6; F.2.0; G.3

  8. arXiv:2406.14340

    math.OC cs.LG math.NA

    Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

    Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

    Abstract: It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 68 pages, 8 figures

  9. arXiv:2303.03950

    cs.LG math.NA math.OC stat.ML

    On the existence of optimal shallow feedforward networks with ReLU activation

    Authors: Steffen Dereich, Sebastian Kassing

    Abstract: We prove existence of global minima in the loss landscape for the approximation of continuous target functions using shallow feedforward artificial neural networks with ReLU activation. This property is one of the fundamental artifacts separating ReLU from other commonly used activation functions. We propose a kind of closure of the search space so that in the extended space minimizers exist. In a…

    Submitted 6 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2302.14690

    MSC Class: Primary 68T07; Secondary 68T05; 41A50

  10. arXiv:2302.14690

    math.OC cs.LG math.NA stat.ML

    On the existence of minimizers in shallow residual ReLU neural network optimization landscapes

    Authors: Steffen Dereich, Arnulf Jentzen, Sebastian Kassing

    Abstract: In this article, we show existence of minimizers in the loss landscape for residual artificial neural networks (ANNs) with multi-dimensional input layer and one hidden layer with ReLU activation. Our work contrasts earlier results in [D. Gallon, A. Jentzen, and F. Lindner, preprint, arXiv:2211.15641, 2022] and [P. Petersen, M. Raslan, and F. Voigtlaender, Found. Comput. Math., 21 (2021), pp. 375-4…

    Submitted 19 November, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: Author's Accepted Manuscript version. To appear in SINUM

    MSC Class: Primary 68T07; Secondary 68T05; 41A50

  11. arXiv:2108.05643

    cs.LG math.ST

    On minimal representations of shallow ReLU networks

    Authors: S. Dereich, S. Kassing

    Abstract: The realization function of a shallow ReLU network is a continuous and piecewise affine function $f:\mathbb R^d\to \mathbb R$, where the domain $\mathbb R^{d}$ is partitioned by a set of $n$ hyperplanes into cells on which $f$ is affine. We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons and we characterize each of the three cases. In the particular case, where…

    Submitted 12 August, 2021; originally announced August 2021.

    Comments: 16 pages

    MSC Class: Primary 68T05; Secondary 68T07; 26B40

  12. arXiv:2102.09385

    cs.LG math.PR math.ST

    Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes

    Authors: Steffen Dereich, Sebastian Kassing

    Abstract: In this article, we consider convergence of stochastic gradient descent schemes (SGD), including momentum stochastic gradient descent (MSGD), under weak assumptions on the underlying landscape. More explicitly, we show that on the event that the SGD stays bounded we have convergence of the SGD if there is only a countable number of critical points or if the objective function satisfies Lojasiewicz…

    Submitted 9 January, 2024; v1 submitted 16 February, 2021; originally announced February 2021.

    MSC Class: 62L20 (Primary) 60J05; 60J20; 65C05 (Secondary)