Search | arXiv e-print repository

arXiv:2505.22085 [pdf, other]

PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning

Authors: Arnulf Jentzen, Julian Kranz, Adrian Riekert

Abstract: Averaging techniques such as Ruppert--Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate optimization procedures of stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type and the parameters for the averaging need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute in parallel different averaged variants of ADAM and during the training process dynamically select the variant with the smallest optimization error. A central feature of this approach is that this procedure requires no more gradient evaluations than the usual ADAM optimizer as each of the averaged trajectories relies on the same underlying ADAM trajectory and thus on the same underlying gradients. We test the proposed PADAM optimizer in 13 stochastic optimization and deep neural network (DNN) learning problems and compare its performance with known optimizers from the literature such as standard SGD, momentum SGD, Adam with and without EMA, and ADAMW. In particular, we apply the compared optimizers to physics-informed neural network, deep Galerkin, deep backward stochastic differential equation and deep Kolmogorov approximations for boundary value partial differential equation problems from scientific machine learning, as well as to DNN approximations for optimal control and optimal stopping problems.
In nearly all of the considered examples PADAM achieves, sometimes among others and sometimes exclusively, essentially the smallest optimization error. This work thus strongly suggests considering PADAM for scientific machine learning problems and also motivates further research on adaptive averaging procedures within the training of DNNs.

Submitted 28 May, 2025; originally announced May 2025.

Comments: 38 pages, 13 figures
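
The idea described in the PADAM abstract above (several averaged variants computed from one shared Adam trajectory, with the best variant selected during training) can be illustrated with a small sketch. This is not the authors' implementation: the scalar toy objective, the choice of EMA decay rates, the checkpoint interval `select_every`, and the use of the exact objective as the selection criterion are all assumptions made for illustration.

```python
import math
import random


def padam_sketch(steps=2000, lr=0.01, decays=(0.0, 0.9, 0.99, 0.999),
                 select_every=500, seed=0):
    """Toy sketch: one Adam trajectory, several EMA-averaged copies of it."""
    rng = random.Random(seed)
    beta1, beta2, eps = 0.9, 0.999, 1e-8   # common Adam defaults
    theta = 5.0                            # Adam iterate (scalar parameter)
    m = v = 0.0                            # first and second moment estimates
    averages = [theta] * len(decays)       # one averaged iterate per decay rate
    selected = []                          # decay rate chosen at each checkpoint

    def objective(x):
        # Exact objective of the toy problem E[(x - Z)^2] / 2 with Z ~ N(0, 1).
        # Assumption: a real problem would use an estimate of the error instead.
        return 0.5 * (x * x + 1.0)

    for t in range(1, steps + 1):
        z = rng.gauss(0.0, 1.0)
        grad = theta - z                   # the only gradient evaluation per step
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** t)     # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)

        # Every averaged variant is updated from the same Adam iterate,
        # so no extra gradients are needed. Decay 0.0 is plain Adam.
        averages = [d * a + (1.0 - d) * theta for d, a in zip(decays, averages)]

        if t % select_every == 0:
            losses = [objective(a) for a in averages]
            best = min(range(len(decays)), key=losses.__getitem__)
            selected.append(decays[best])

    losses = [objective(a) for a in averages]
    return selected, averages, losses


selected, averages, losses = padam_sketch()
print("decay selected at each checkpoint:", selected)
```

The decay rate `0.0` reproduces the unaveraged Adam iterate, so the selection step can always fall back to plain Adam when averaging does not help.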

arXiv:2505.17032 [pdf, ps, other]

A brief review of the Deep BSDE method for solving high-dimensional partial differential equations

Authors: Jiequn Han, Arnulf Jentzen, Weinan E

Abstract: High-dimensional partial differential equations (PDEs) pose significant challenges for numerical computation due to the curse of dimensionality, which limits the applicability of traditional mesh-based methods. Since 2017, the Deep BSDE method has introduced deep learning techniques that enable the effective solution of nonlinear PDEs in very high dimensions. This innovation has sparked considerable interest in using neural networks for high-dimensional PDEs, making it an active area of research. In this short review, we briefly sketch the Deep BSDE method, its subsequent developments, and future directions for the field.

Submitted 7 May, 2025; originally announced May 2025.

Journal ref: ICBS proceedings of Frontiers of Science Awards (2024)

arXiv:2505.09572 [pdf, other]

SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen

Abstract: We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.

Submitted 14 May, 2025; originally announced May 2025.

Comments: 27 pages, 4 figures

MSC Class: Primary 68T05; Secondary 68T07; 26B40; 03C64; 03C98

arXiv:2504.19426 [pdf, ps, other]

Sharp higher order convergence rates for the Adam optimizer

Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent based optimization methods are the methods of choice to train deep neural networks in machine learning. Beyond the standard gradient descent method, suitably modified variants of standard gradient descent involving acceleration techniques such as the momentum method and/or adaptivity techniques such as the RMSprop method are also frequently considered optimization methods. These days the most popular of such sophisticated optimization schemes is presumably the Adam optimizer that was proposed in 2014 by Kingma and Ba. A highly relevant topic of research is to investigate the speed of convergence of such optimization methods. In particular, in 1964 Polyak showed that the standard gradient descent method converges in a neighborhood of a strict local minimizer with rate $(x - 1)(x + 1)^{-1}$ while momentum achieves the (optimal) strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ where $x \in (1,\infty)$ is the condition number (the ratio of the largest and the smallest eigenvalue) of the Hessian of the objective function at the local minimizer. It is the key contribution of this work to reveal that Adam also converges with the strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ while RMSprop only converges with the convergence rate $(x - 1)(x + 1)^{-1}$.

Submitted 27 April, 2025; originally announced April 2025.

Comments: 27 pages

MSC Class: 68T05; 65K05; 90C25 ACM Class: I.2.0
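
To make the two convergence rates quoted in the abstract above concrete, the following sketch evaluates $(x - 1)(x + 1)^{-1}$ and $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ for a given condition number. The function names and the helper `steps_to_reduce`, which treats the rate as a per-step contraction factor of the error, are illustrative assumptions and do not come from the paper.

```python
import math


def gd_rate(cond):
    """Rate (x - 1) / (x + 1) for condition number x = cond > 1."""
    return (cond - 1.0) / (cond + 1.0)


def momentum_rate(cond):
    """Rate (sqrt(x) - 1) / (sqrt(x) + 1) for condition number x = cond > 1."""
    root = math.sqrt(cond)
    return (root - 1.0) / (root + 1.0)


def steps_to_reduce(rate, factor=1e-6):
    """Smallest n with rate**n <= factor, assuming the error shrinks by
    `rate` in every step (an idealization for illustration)."""
    return math.ceil(math.log(factor) / math.log(rate))


for cond in (10.0, 100.0, 1000.0):
    slow, fast = gd_rate(cond), momentum_rate(cond)
    print(f"x = {cond:6.0f}: rate {slow:.4f} needs {steps_to_reduce(slow)} steps, "
          f"rate {fast:.4f} needs {steps_to_reduce(fast)} steps")
```

For $x = 100$ the two rates are $99/101$ and $9/11$, so the gap between them widens quickly as the problem becomes more ill-conditioned.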

arXiv:2503.01660 [pdf, ps, other]

Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks

Authors: Thang Do, Arnulf Jentzen, Adrian Riekert

Abstract: Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.

Submitted 3 March, 2025; originally announced March 2025.

Comments: 42 pages

MSC Class: 68T07; 65K10; 60G60; 65D15 ACM Class: G.1.6; F.2.0; G.3

arXiv:2502.14180 [pdf]

On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck

Abstract: We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we do extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets, along with the code used for generating them and all data from the evaluations, are publicly available at https://github.com/bkuckuck/logical-skills-of-llms.

Submitted 19 February, 2025; originally announced February 2025.

Comments: 67 pages, 24 figures

ACM Class: I.2.6

arXiv:2501.15646 [pdf, ps, other]

Mathematical analysis of the gradients in deep learning

Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Frederic Weber

Abstract: Deep learning algorithms -- typically consisting of a class of deep artificial neural networks (ANNs) trained by a stochastic gradient descent (SGD) optimization method -- are nowadays an integral part of many areas of science, industry, and also our day to day life. Roughly speaking, in their most basic form, ANNs can be regarded as functions that consist of a series of compositions of affine-linear functions with multidimensional versions of so-called activation functions. One of the most popular of such activation functions is the rectified linear unit (ReLU) function $\mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R}$. The ReLU function is, however, not differentiable and, typically, this lack of regularity transfers to the cost function of the supervised learning problem under consideration. Regardless of this lack of differentiability issue, deep learning practitioners apply SGD methods based on suitably generalized gradients in standard deep learning libraries like TensorFlow or PyTorch. In this work we reveal an accurate and concise mathematical description of such generalized gradients in the training of deep fully-connected feedforward ANNs and we also study the resulting generalized gradient function analytically.
Specifically, we provide an appropriate approximation procedure that uniquely describes the generalized gradient function, we prove that the generalized gradients are limiting Fréchet subgradients of the cost functional, and we conclude that the generalized gradients must coincide with the standard gradient of the cost functional on every open set on which the cost functional is continuously differentiable.

Submitted 26 January, 2025; originally announced January 2025.

Comments: 38 pages

MSC Class: 68T07 ACM Class: I.2.6

arXiv:2501.06081 [pdf, other]

Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problems

Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

Abstract: Deep learning methods - usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, often not the plain-vanilla standard SGD optimization method is employed to train the considered class of DNNs but instead more sophisticated adaptive and accelerated variants of the standard SGD method such as the popular Adam optimizer are used. Inspired by the classical Polyak-Ruppert averaging approach, in this work we apply averaged variants of the Adam optimizer to train DNNs to approximately solve exemplary scientific computing problems in the form of PDEs and OC problems. We test the averaged variants of Adam in a series of learning problems including physics-informed neural network (PINN), deep backward stochastic differential equation (deep BSDE), and deep Kolmogorov approximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn PDEs), including DNN approximations for OC problems, and including DNN approximations for image classification problems (ResNet for CIFAR-10). In each of the numerical examples the employed averaged variants of Adam outperform the standard Adam and the standard SGD optimizers, particularly, in the situation of the scientific machine learning problems.
The Python source codes for the numerical experiments associated with this work can be found on GitHub at https://github.com/deeplearningmethods/averaged-adam.

Submitted 10 January, 2025; originally announced January 2025.

Comments: 25 pages, 10 figures

arXiv:2412.01371 [pdf, other]

An overview of diffusion models for generative artificial intelligence

Authors: Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.

Submitted 2 December, 2024; originally announced December 2024.

Comments: 56 pages, 5 figures

arXiv:2410.10533 [pdf, ps, other]

Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

Authors: Thang Do, Sonja Hannibal, Arnulf Jentzen

Abstract: Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.

Submitted 14 February, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

Comments: 91 pages. arXiv admin note: text overlap with arXiv:2310.20360

MSC Class: 68T07; 65K10; 60G60; 65D15 ACM Class: G.1.6; F.2.0; G.3

arXiv:2408.13222 [pdf, other]

An Overview on Machine Learning Methods for Partial Differential Equations: from Physics Informed Neural Networks to Deep Operator Learning

Authors: Lukas Gonon, Arnulf Jentzen, Benno Kuckuck, Siyu Liang, Adrian Riekert, Philippe von Wurstemberger

Abstract: The approximation of solutions of partial differential equations (PDEs) with numerical algorithms is a central topic in applied mathematics. For many decades, various types of methods for this purpose have been developed and extensively studied. One class of methods which has received a lot of attention in recent years are machine learning-based methods, which typically involve the training of artificial neural networks (ANNs) by means of stochastic gradient descent type optimization methods. While approximation methods for PDEs using ANNs were first proposed in the 1990s, they have only gained wide popularity in the last decade with the rise of deep learning. This article aims to provide an introduction to some of these methods and the mathematical theory on which they are based. We discuss methods such as physics-informed neural networks (PINNs) and deep BSDE methods and consider several operator learning approaches.

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2407.21078 [pdf, ps, other]

Convergence rates for the Adam optimizer

Authors: Steffen Dereich, Arnulf Jentzen

Abstract: Stochastic gradient descent (SGD) optimization methods are nowadays the method of choice for the training of deep neural networks (DNNs) in artificial intelligence systems. In practically relevant training problems, usually not the plain vanilla standard SGD method is the employed optimization scheme but instead suitably accelerated and adaptive SGD optimization methods are applied. As of today, maybe the most popular variant of such accelerated and adaptive SGD optimization methods is the famous Adam optimizer proposed by Kingma & Ba in 2014. Despite the popularity of the Adam optimizer in implementations, it remained an open problem of research to provide a convergence analysis for the Adam optimizer even in the situation of simple quadratic stochastic optimization problems where the objective function (the function one intends to minimize) is strongly convex. In this work we solve this problem by establishing optimal convergence rates for the Adam optimizer for a large class of stochastic optimization problems, in particular, covering simple quadratic stochastic optimization problems. The key ingredient of our convergence analysis is a new vector field function which we propose to refer to as the Adam vector field. This Adam vector field accurately describes the macroscopic behaviour of the Adam optimization process but differs from the negative gradient of the objective function (the function we intend to minimize) of the considered stochastic optimization problem.
In particular, our convergence analysis reveals that the Adam optimizer does typically not converge to critical points of the objective function (zeros of the gradient of the objective function) of the considered optimization problem but converges with rates to zeros of this Adam vector field.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.08100 [pdf, ps, other]

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Authors: Steffen Dereich, Robin Graeber, Arnulf Jentzen

Abstract: Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stopping problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several convex optimization problems if the learning rates are bounded away from zero. However, in many practically relevant training scenarios, often not the plain vanilla standard SGD method but instead adaptive SGD methods such as the RMSprop and the Adam optimizers, in which the learning rates are modified adaptively during the training process, are employed. This naturally raises the question whether such adaptive optimizers, in which the learning rates are modified adaptively during the training process, do converge in the situation of non-vanishing learning rates.
In this work we answer this question negatively by proving that adaptive SGD methods such as the popular Adam optimizer fail to converge to any possible random limit point if the learning rates are asymptotically bounded away from zero. In our proof of this non-convergence result we establish suitable pathwise a priori bounds for a class of accelerated and adaptive SGD methods, which are also of independent interest.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 54 pages

MSC Class: 60J22 (Primary); 65K10; 60J20; 65C40 (Secondary) ACM Class: G.1.6; F.2.0; G.3
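
The known fact cited in the abstract above, that plain SGD does not converge when the learning rates stay bounded away from zero, can be observed on a one-dimensional example. The sketch below concerns plain SGD only, not Adam or the proof technique of the paper; the toy objective $\mathbb{E}[(\theta - Z)^2]/2$ with standard normal $Z$, the two learning rate schedules, and the function name `sgd_tail_error` are assumptions chosen for illustration.

```python
import random


def sgd_tail_error(learning_rate, steps=10000, tail=1000, seed=0):
    """Run plain SGD on E[(theta - Z)^2] / 2 with Z ~ N(0, 1), whose
    minimizer is theta = 0, and return the mean squared distance to the
    minimizer over the last `tail` iterates."""
    rng = random.Random(seed)
    theta = 1.0
    squared_error = 0.0
    for n in range(1, steps + 1):
        z = rng.gauss(0.0, 1.0)
        theta -= learning_rate(n) * (theta - z)   # stochastic gradient step
        if n > steps - tail:
            squared_error += theta * theta
    return squared_error / tail


# Learning rate bounded away from zero: the iterates keep fluctuating.
constant = sgd_tail_error(lambda n: 0.1)
# Learning rate 1/n: the iterate equals the running sample mean of Z.
decaying = sgd_tail_error(lambda n: 1.0 / n)
print(f"constant rate: {constant:.5f}, decaying rate: {decaying:.5f}")
```

With the constant rate the iterates settle into a stationary distribution around the minimizer instead of converging to it, whereas with the rate $1/n$ the iterate is the sample mean and its error shrinks as more samples arrive.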

arXiv:2406.14340 [pdf, other]

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

Abstract: It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer, fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and PyTorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in case of several neural network learning problems, particularly, in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer reduces the value of the objective function faster than the Adam optimizer with the default learning rate.
For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 68 pages, 8 figures

arXiv:2406.10876 [pdf, ps, other]

Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for space-time solutions of semilinear partial differential equations

Authors: Julia Ackermann, Arnulf Jentzen, Benno Kuckuck, Joshua Lee Padgett

Abstract: It is a challenging topic in applied mathematics to solve high-dimensional nonlinear partial differential equations (PDEs). Standard approximation methods for nonlinear PDEs suffer from the curse of dimensionality (COD) in the sense that the number of computational operations of the approximation method grows at least exponentially in the PDE dimension and with such methods it is essentially impossible to approximately solve high-dimensional PDEs even when the fastest currently available computers are used. However, in recent years great progress has been made in this area of research through suitable deep learning (DL) based methods for PDEs in which deep neural networks (DNNs) are used to approximate solutions of PDEs. Despite the remarkable success of such DL methods in simulations, it remains a fundamental open problem of research to prove (or disprove) that such methods can overcome the COD in the approximation of PDEs. However, there are nowadays several partial error analysis results for DL methods for high-dimensional nonlinear PDEs in the literature which prove that DNNs can overcome the COD in the sense that the number of parameters of the approximating DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon>0$ and the PDE dimension $d\in\mathbb{N}$.
In the main result of this article we prove that for all $T,p\in(0,\infty)$ it holds that solutions $u_d\colon[0,T]\times\mathbb{R}^d\to\mathbb{R}$, $d\in\mathbb{N}$, of semilinear heat equations with Lipschitz continuous nonlinearities can be approximated in the $L^p$-sense on space-time regions without the COD by DNNs with the rectified linear unit (ReLU), the leaky ReLU, or the softplus activation function. In previous articles similar results have been established not for space-time regions but for the solutions $u_d(T,\cdot)$, $d\in\mathbb{N}$, at the terminal time $T$.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 64 pages. arXiv admin note: text overlap with arXiv:2309.13722, arXiv:2310.20360

MSC Class: 65M15; 65C05; 68T07 (Primary) 60H35 (Secondary)

arXiv:2402.05155 [pdf, other]

Non-convergence to global minimizers for Adam and stochastic gradient descent optimization and constructions of local minimizers in the training of artificial neural networks

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Stochastic gradient descent (SGD) optimization methods such as the plain vanilla SGD method and the popular Adam optimizer are nowadays the method of choice in the training of artificial neural networks (ANNs). Despite the remarkable success of SGD methods in the ANN training in numerical simulations, it remains in essentially all practically relevant scenarios an open problem to rigorously explain why SGD methods seem to succeed in training ANNs. In particular, in most practically relevant supervised learning problems, it seems that SGD methods do with high probability not converge to global minimizers in the optimization landscape of the ANN training problem. Nevertheless, it remains an open problem of research to disprove the convergence of SGD methods to global minimizers. In this work we solve this research problem in the situation of shallow ANNs with the rectified linear unit (ReLU) and related activations with the standard mean square error loss by disproving in the training of such ANNs that SGD methods (such as the plain vanilla SGD, the momentum SGD, the AdaGrad, the RMSprop, and the Adam optimizers) can find a global minimizer with high probability. Even stronger, we reveal in the training of such ANNs that SGD methods do with high probability fail to converge to global minimizers in the optimization landscape. 
The findings of this work do, however, not disprove that SGD methods succeed in training ANNs since they do not exclude the possibility that SGD methods find good local minimizers whose risk values are close to the risk values of the global minimizers. In this context, another key contribution of this work is to establish the existence of a hierarchical structure of local minimizers with distinct risk values in the optimization landscape of ANN training problems with ReLU and related activations.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 36 pages

arXiv:2310.20360 [pdf, other]

Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory

Authors: Arnulf Jentzen, Benno Kuckuck, Philippe von Wurstemberger

Abstract: This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors. In the last part of the book some deep learning approximation methods for PDEs are reviewed including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning at all and would like to gain a solid foundation as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning.

Submitted 25 February, 2025; v1 submitted 31 October, 2023; originally announced October 2023.

Comments: 712 pages, 36 figures, 45 source codes, 87 exercises. In v2, the material on optimization algorithms/methods has been significantly expanded

MSC Class: 68T07

arXiv:2309.13722 [pdf, ps, other]

Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for Kolmogorov partial differential equations with Lipschitz nonlinearities in the $L^p$-sense

Authors: Julia Ackermann, Arnulf Jentzen, Thomas Kruse, Benno Kuckuck, Joshua Lee Padgett

Abstract: Recently, several deep learning (DL) methods for approximating high-dimensional partial differential equations (PDEs) have been proposed. The interest that these methods have generated in the literature is in large part due to simulations which appear to demonstrate that such DL methods have the capacity to overcome the curse of dimensionality (COD) for PDEs in the sense that the number of computational operations they require to achieve a certain approximation accuracy $\varepsilon\in(0,\infty)$ grows at most polynomially in the PDE dimension $d\in\mathbb N$ and the reciprocal of $\varepsilon$. While there is thus far no mathematical result that proves that one of such methods is indeed capable of overcoming the COD, there are now a number of rigorous results in the literature that show that deep neural networks (DNNs) have the expressive power to approximate PDE solutions without the COD in the sense that the number of parameters used to describe the approximating DNN grows at most polynomially in both the PDE dimension $d\in\mathbb N$ and the reciprocal of the approximation accuracy $\varepsilon>0$. 
Roughly speaking, in the literature it has been proved for every $T>0$ that solutions $u_d\colon [0,T]\times\mathbb R^d\to \mathbb R$, $d\in\mathbb N$, of semilinear heat PDEs with Lipschitz continuous nonlinearities can be approximated by DNNs with ReLU activation at the terminal time in the $L^2$-sense without the COD provided that the initial value functions $\mathbb R^d\ni x\mapsto u_d(0,x)\in\mathbb R$, $d\in\mathbb N$, can be approximated by ReLU DNNs without the COD. It is the key contribution of this work to generalize this result by establishing this statement in the $L^p$-sense with $p\in(0,\infty)$ and by allowing the activation function to be more general covering the ReLU, the leaky ReLU, and the softplus activation functions as special cases.

Submitted 24 September, 2023; originally announced September 2023.

Comments: 52 pages

MSC Class: 65M15; 65C05; 68T07 (Primary) 60H35 (Secondary)
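
Note: the phrase "without the COD" in the abstract above can be written out as a polynomial bound on the number of DNN parameters. The display below is only a sketch of that reading; the symbols $c$, $\Phi_{d,\varepsilon}$, and $\mathcal{P}$ and the restriction to $\varepsilon\in(0,1]$ are illustrative assumptions and not necessarily the notation or the exact statement of the paper.

$$\exists\, c\in(0,\infty)\ \ \forall\, d\in\mathbb N,\ \varepsilon\in(0,1]\colon\qquad \mathcal{P}(\Phi_{d,\varepsilon})\le c\, d^{c}\,\varepsilon^{-c},$$

where $\Phi_{d,\varepsilon}$ denotes a DNN approximating $u_d$ up to accuracy $\varepsilon$ and $\mathcal{P}(\Phi_{d,\varepsilon})$ denotes its number of parameters. A method suffering from the COD would instead need a parameter count growing exponentially in $d$.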

arXiv:2303.03390 [pdf, ps, other]

Nonlinear Monte Carlo methods with polynomial runtime for Bellman equations of discrete time high-dimensional stochastic optimal control problems

Authors: Christian Beck, Arnulf Jentzen, Konrad Kleinberg, Thomas Kruse

Abstract: Discrete time stochastic optimal control problems and Markov decision processes (MDPs), respectively, serve as fundamental models for problems that involve sequential decision making under uncertainty and as such constitute the theoretical foundation of reinforcement learning. In this article we study the numerical approximation of MDPs with infinite time horizon, finite control set, and general state spaces. Our set-up in particular covers infinite-horizon optimal stopping problems of discrete time Markov processes. A key tool to solve MDPs are Bellman equations which characterize the value functions of the MDPs and determine the optimal control strategies. By combining ideas from the full-history recursive multilevel Picard approximation method, which was recently introduced to solve certain nonlinear partial differential equations, and ideas from $Q$-learning we introduce a class of suitable nonlinear Monte Carlo methods and prove that the proposed methods do overcome the curse of dimensionality in the numerical approximation of the solutions of Bellman equations and the associated discrete time stochastic optimal control problems.

Submitted 3 March, 2023; originally announced March 2023.

MSC Class: 90C40; 90C39; 60J05; 93E20; 65C05

arXiv:2302.14690 [pdf, other]

On the existence of minimizers in shallow residual ReLU neural network optimization landscapes

Authors: Steffen Dereich, Arnulf Jentzen, Sebastian Kassing

Abstract: In this article, we show existence of minimizers in the loss landscape for residual artificial neural networks (ANNs) with multi-dimensional input layer and one hidden layer with ReLU activation. Our work contrasts earlier results in [D. Gallon, A. Jentzen, and F. Lindner, preprint, arXiv:2211.15641, 2022] and [P. Petersen, M. Raslan, and F. Voigtlaender, Found. Comput. Math., 21 (2021), pp. 375-444] which showed that in many situations minimizers do not exist for common smooth activation functions even in the case where the target functions are polynomials. The proof of the existence property makes use of a closure of the search space containing all functions generated by ANNs and additional discontinuous generalized responses. As we will show, the additional generalized responses in this larger space are suboptimal so that the minimum is attained in the original function class.

Submitted 19 November, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: Author's Accepted Manuscript version. To appear in SINUM

MSC Class: Primary 68T07; Secondary 68T05; 41A50

arXiv:2302.03286 [pdf, other]

Algorithmically Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning for parametric partial differential equations

Authors: Arnulf Jentzen, Adrian Riekert, Philippe von Wurstemberger

Abstract: In this article we propose a new deep learning approach to approximate operators related to parametric partial differential equations (PDEs). In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular approximation problem under consideration. In the proposed approach we combine efficient classical numerical approximation techniques with deep operator learning methodologies. Specifically, we introduce customized adaptions of existing ANN architectures together with specialized initializations for these ANN architectures so that at initialization we have that the ANNs closely mimic a chosen efficient classical numerical algorithm for the considered approximation problem. The obtained ANN architectures and their initialization schemes are thus strongly inspired by numerical algorithms as well as by popular deep learning methodologies from the literature and in that sense we refer to the introduced ANNs in conjunction with their tailor-made initialization schemes as Algorithmically Designed Artificial Neural Networks (ADANNs). We numerically test the proposed ADANN methodology in the case of several parametric PDEs. In the tested numerical examples the ADANN methodology significantly outperforms existing traditional approximation algorithms as well as existing deep operator learning methodologies from the literature.

Submitted 29 May, 2024; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: 39 pages, 16 Figures

arXiv:2301.08284 [pdf, ps, other]

The necessity of depth for artificial neural networks to approximate certain classes of smooth and bounded functions without the curse of dimensionality

Authors: Lukas Gonon, Robin Graeber, Arnulf Jentzen

Abstract: In this article we study high-dimensional approximation capacities of shallow and deep artificial neural networks (ANNs) with the rectified linear unit (ReLU) activation. In particular, it is a key contribution of this work to reveal that for all $a,b\in\mathbb{R}$ with $b-a\geq 7$ we have that the functions $[a,b]^d\ni x=(x_1,\dots,x_d)\mapsto\prod_{i=1}^d x_i\in\mathbb{R}$ for $d\in\mathbb{N}$ as well as the functions $[a,b]^d\ni x =(x_1,\dots, x_d)\mapsto\sin(\prod_{i=1}^d x_i) \in \mathbb{R} $ for $ d \in \mathbb{N} $ can neither be approximated without the curse of dimensionality by means of shallow ANNs nor insufficiently deep ANNs with ReLU activation but can be approximated without the curse of dimensionality by sufficiently deep ANNs with ReLU activation. We show that the product functions and the sine of the product functions are polynomially tractable approximation problems among the approximating class of deep ReLU ANNs with the number of hidden layers being allowed to grow in the dimension $ d \in \mathbb{N} $. We establish the above outlined statements not only for the product functions and the sine of the product functions but also for other classes of target functions, in particular, for classes of uniformly globally bounded $ C^{ \infty } $-functions with compact support on any $[a,b]^d$ with $a\in\mathbb{R}$, $b\in(a,\infty)$. 
Roughly speaking, in this work we lay open that simple approximation problems such as approximating the sine or cosine of products cannot be solved in standard implementation frameworks by shallow or insufficiently deep ANNs with ReLU activation in polynomial time, but can be approximated by sufficiently deep ReLU ANNs with the number of parameters growing at most polynomially.

Submitted 19 January, 2023; originally announced January 2023.

Comments: 101 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:2112.14523

MSC Class: 65D40; 68T07

arXiv:2212.13111 [pdf, other]

Convergence to good non-optimal critical points in the training of neural networks: Gradient descent optimization with one random initialization overcomes all bad non-global local minima with high probability

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) methods for the training of artificial neural networks (ANNs) belong nowadays to the most heavily employed computational schemes in the digital world. Despite the compelling success of such methods, it remains an open problem to provide a rigorous theoretical justification for the success of GD methods in the training of ANNs. The main difficulty is that the optimization risk landscapes associated to ANNs usually admit many non-optimal critical points (saddle points as well as non-global local minima) whose risk values are strictly larger than the optimal risk value. It is a key contribution of this article to overcome this obstacle in certain simplified shallow ANN training situations. In such simplified ANN training scenarios we prove that the gradient flow (GF) dynamics with only one random initialization overcomes with high probability all bad non-global local minima (all non-global local minima whose risk values are much larger than the risk value of the global minima) and converges with high probability to a good critical point (a critical point whose risk value is very close to the optimal risk value of the global minima). This analysis allows us to establish convergence in probability to zero of the risk value of the GF trajectories with convergence rates as the ANN training time and the width of the ANN increase to infinity. 
We complement the analytical findings of this work with extensive numerical simulations for shallow and deep ANNs: All these numerical simulations strongly suggest that with high probability the considered GD method (stochastic GD or Adam) overcomes all bad non-global local minima, does not converge to a global minimum, but does converge to a good non-optimal critical point whose risk value is very close to the optimal risk value.

Submitted 26 December, 2022; originally announced December 2022.

Comments: 98 pages, 15 figures, 10 Python codes

MSC Class: 65K10; 65C50; 68T05; 60H35

arXiv:2211.15641 [pdf, ps, other]

Blow up phenomena for gradient descent optimization methods in the training of artificial neural networks

Authors: Davide Gallon, Arnulf Jentzen, Felix Lindner

Abstract: In this article we investigate blow up phenomena for gradient descent optimization methods in the training of artificial neural networks (ANNs). Our theoretical analysis is focused on shallow ANNs with one neuron on the input layer, one neuron on the output layer, and one hidden layer. For ANNs with ReLU activation and at least two neurons on the hidden layer we establish the existence of a target function such that there exists a lower bound for the risk values of the critical points of the associated risk function which is strictly greater than the infimum of the image of the risk function. This allows us to demonstrate that every gradient flow trajectory with an initial risk smaller than this lower bound diverges. Furthermore, we analyze and compare various popular types of activation functions with regard to the divergence of gradient flow trajectories and gradient descent trajectories in the training of ANNs and with regard to the closely related question concerning the existence of global minimum points of the risk function.

Submitted 28 November, 2022; originally announced November 2022.

Comments: 84 pages, one figure

arXiv:2210.13530 [pdf, other]

doi 10.1016/j.cnsns.2023.107438

An efficient Monte Carlo scheme for Zakai equations

Authors: Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, Ariel Neufeld

Abstract: In this paper we develop a numerical method for efficiently approximating solutions of certain Zakai equations in high dimensions. The key idea is to transform a given Zakai SPDE into a PDE with random coefficients. We show that under suitable regularity assumptions on the coefficients of the Zakai equation, the corresponding random PDE admits a solution random field which, for almost all realizations of the random coefficients, can be written as a classical solution of a linear parabolic PDE. This makes it possible to apply the Feynman-Kac formula to obtain an efficient Monte Carlo scheme for computing approximate solutions of Zakai equations. The approach achieves good results in up to 25 dimensions with fast run times.

Submitted 20 August, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

MSC Class: 65C05; 65M75; 60H15; 62M20

arXiv:2208.02083 [pdf, ps, other]

doi 10.1007/s10957-024-02513-3

Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms bypass so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. We explore its relevance for various machine learning tasks, with a particular focus on shallow rectified linear unit (ReLU) and leaky ReLU networks with scalar input. Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks relative to an affine target function, we show that gradient descent circumvents most saddle points. Furthermore, we prove convergence to global minima under favourable initialization conditions, quantified by an explicit threshold on the limiting loss.

Submitted 11 September, 2024; v1 submitted 3 August, 2022; originally announced August 2022.

MSC Class: 68T07; 37D10 ACM Class: I.2.6; G.1.6

Journal ref: J Optim Theory Appl (2024)

arXiv:2207.06246 [pdf, ps, other]

Normalized gradient flow optimization in the training of ReLU artificial neural networks

Authors: Simon Eberle, Arnulf Jentzen, Adrian Riekert, Georg Weiss

Abstract: The training of artificial neural networks (ANNs) is nowadays a highly relevant algorithmic procedure with many applications in science and industry. Roughly speaking, ANNs can be regarded as iterated compositions between affine linear functions and certain fixed nonlinear functions, which are usually multidimensional versions of a one-dimensional so-called activation function. The most popular choice of such a one-dimensional activation function is the rectified linear unit (ReLU) activation function which maps a real number to its positive part $ \mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R} $. In this article we propose and analyze a modified variant of the standard training procedure of such ReLU ANNs in the sense that we propose to restrict the negative gradient flow dynamics to a large submanifold of the ANN parameter space, which is a strict $ C^{ \infty } $-submanifold of the entire ANN parameter space that seems to enjoy better regularity properties than the entire ANN parameter space but which is also sufficiently large and sufficiently high dimensional so that it can represent all ANN realization functions that can be represented through the entire ANN parameter space. In the special situation of shallow ANNs with just one-dimensional ANN layers we also prove for every Lipschitz continuous target function that every gradient flow trajectory on this large submanifold of the ANN parameter space is globally bounded. 
For the standard gradient flow on the entire ANN parameter space with Lipschitz continuous target functions it remains an open problem of research to prove or disprove the global boundedness of gradient flow trajectories even in the situation of shallow ANNs with just one-dimensional ANN layers.

Submitted 13 July, 2022; originally announced July 2022.

Comments: 26 pages, 1 figure
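
Note: the abstract above describes ANNs as iterated compositions of affine linear functions and a componentwise ReLU activation $x \mapsto \max\{x, 0\}$. The following is a minimal sketch of that description for a shallow network, not code from the paper; the function names (`relu`, `affine`, `shallow_relu_ann`) and the example weights are hypothetical and chosen only for illustration.

```python
def relu(x):
    # ReLU activation: maps a real number to its positive part max{x, 0}.
    return max(x, 0.0)


def affine(weights, bias, inputs):
    # Affine linear map: one output per weight row, computed as row . inputs + bias entry.
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def shallow_relu_ann(x, W1, b1, W2, b2):
    # Realization of a one-hidden-layer ANN: affine map, componentwise ReLU, affine map.
    hidden = [relu(h) for h in affine(W1, b1, x)]
    return affine(W2, b2, hidden)


# Hypothetical example parameters: two hidden neurons realizing |x| = relu(x) + relu(-x).
W1 = [[1.0], [-1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]
b2 = [0.0]

print(shallow_relu_ann([-3.0], W1, b1, W2, b2))  # prints [3.0]
```

With these weights the network reproduces the absolute value function exactly, which illustrates how a piecewise linear target can be represented by a shallow ReLU ANN.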

arXiv:2206.13646 [pdf, ps, other]

On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vector

Authors: Arnulf Jentzen, Timo Kröger

Abstract: It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents $ 1/2 $ and $ 1 $). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but holds neither for Hölder norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents $ 1/2 $ and $ 1 $ but does not hold for the Lipschitz norm alone.

Submitted 27 June, 2022; originally announced June 2022.

Comments: 39 pages, 1 figure

arXiv:2205.03672 [pdf, other]

Deep learning approximations for non-local nonlinear PDEs with Neumann boundary conditions

Authors: Victor Boussange, Sebastian Becker, Arnulf Jentzen, Benno Kuckuck, Loïc Pellissier

Abstract: Nonlinear partial differential equations (PDEs) are used to model dynamical processes in a large number of scientific fields, ranging from finance to biology. In many applications standard local models are not sufficient to accurately account for certain non-local phenomena such as, e.g., interactions at a distance. In order to properly capture these phenomena non-local nonlinear PDE models are frequently employed in the literature. In this article we propose two numerical methods based on machine learning and on Picard iterations, respectively, to approximately solve non-local nonlinear PDEs. The proposed machine learning-based method is an extended variant of a deep learning-based splitting-up type approximation method previously introduced in the literature and utilizes neural networks to provide approximate solutions on a subset of the spatial domain of the solution. The Picard iterations-based method is an extended variant of the so-called full history recursive multilevel Picard approximation scheme previously introduced in the literature and provides an approximate solution for a single point of the domain. Both methods are mesh-free and allow non-local nonlinear PDEs with Neumann boundary conditions to be solved in high dimensions. 
In the two methods, the numerical difficulties arising due to the dimensionality of the PDEs are avoided by (i) using the correspondence between the expected trajectory of reflected stochastic processes and the solution of PDEs (given by the Feynman-Kac formula) and by (ii) using a plain vanilla Monte Carlo integration to handle the non-local term. We evaluate the performance of the two methods on five different PDEs arising in physics and biology. In all cases, the methods yield good results in up to 10 dimensions with short run times. Our work extends recently developed methods to overcome the curse of dimensionality in solving PDEs.

Submitted 7 May, 2022; originally announced May 2022.

Comments: 59 pages

MSC Class: 35R09 (Primary) 65M75; 45K05; 35K20; 65C05; 65M22; 68T07 (Secondary)

arXiv:2202.11481 [pdf, other]

On the existence of infinitely many realization functions of non-global local minima in the training of artificial neural networks with ReLU activation

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Timo Kröger, Adrian Riekert

Abstract: Gradient descent (GD) type optimization schemes are the standard instruments to train fully connected feedforward artificial neural networks (ANNs) with rectified linear unit (ReLU) activation and can be considered as temporal discretizations of solutions of gradient flow (GF) differential equations. It has recently been proved that the risk of every bounded GF trajectory converges in the training of ANNs with one hidden layer and ReLU activation to the risk of a critical point. Taking this into account it is one of the key research issues in the mathematical convergence analysis of GF trajectories and GD type optimization schemes, respectively, to study sufficient and necessary conditions for critical points of the risk function and, thereby, to obtain an understanding about the appearance of critical points in dependence of the problem parameters such as the target function. In the first main result of this work we prove in the training of ANNs with one hidden layer and ReLU activation that for every $ a, b \in \mathbb{R} $ with $ a < b $ and every arbitrarily large $ \delta > 0 $ we have that there exists a Lipschitz continuous target function $ f \colon [a,b] \to \mathbb{R} $ such that for every number $ H > 1 $ of neurons on the hidden layer we have that the risk function has uncountably many different realization functions of non-global local minimum points whose risks are strictly larger than the sum of the risk of the global minimum points and the arbitrarily large $ \delta $. 
In the second main result of this work we show in the training of ANNs with one hidden layer and ReLU activation in the special situation where there is only one neuron on the hidden layer and where the target function is continuous and piecewise polynomial that there exist at most finitely many different realization functions of critical points.

Submitted 23 February, 2022; originally announced February 2022.

Comments: 49 pages, 1 figure

MSC Class: 68T07

arXiv:2202.02717 [pdf, other]

doi 10.1111/mafi.12405

Learning the random variables in Monte Carlo simulations with stochastic gradient descent: Machine learning for parametric PDEs and financial derivative pricing

Authors: Sebastian Becker, Arnulf Jentzen, Marvin S. Müller, Philippe von Wurstemberger

Abstract: In financial engineering, prices of financial products are computed approximately many times each trading day with (slightly) different parameters in each calculation. In many financial models such prices can be approximated by means of Monte Carlo (MC) simulations. To obtain a good approximation the MC sample size usually needs to be considerably large resulting in a long computing time to obtain a single approximation. In this paper we introduce a new approximation strategy for parametric approximation problems including the parametric financial pricing problems described above. A central aspect of the approximation strategy proposed in this article is to combine MC algorithms with machine learning techniques to, roughly speaking, learn the random variables (LRV) in MC simulations. In other words, we employ stochastic gradient descent (SGD) optimization methods not to train parameters of standard artificial neural networks (ANNs) but to learn random variables appearing in MC approximations. We numerically test the LRV strategy on various parametric problems with convincing results when compared with standard MC simulations, Quasi-Monte Carlo simulations, SGD-trained shallow ANNs, and SGD-trained deep ANNs. Our numerical simulations strongly indicate that the LRV strategy might be capable of overcoming the curse of dimensionality in the $L^\infty$-norm in several cases where the standard deep learning approach has been proven not to be able to do so.
This is not a contradiction to lower bounds established in the scientific literature because this new LRV strategy is outside of the class of algorithms for which lower bounds have been established in the scientific literature. The proposed LRV strategy is of general nature and not only restricted to the parametric financial pricing problems described above, but applicable to a large class of approximation problems.

Submitted 8 June, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: 71 pages, 4 Figures, 14 Tables; to appear in Math. Finance

MSC Class: 35K15; 65C05; 65M75; 68T99; 91G20

arXiv:2112.14523 [pdf, ps, other]

Deep neural network approximation theory for high-dimensional functions

Authors: Pierfrancesco Beneventano, Patrick Cheridito, Robin Graeber, Arnulf Jentzen, Benno Kuckuck

Abstract: The purpose of this article is to develop machinery to study the capacity of deep neural networks (DNNs) to approximate high-dimensional functions. In particular, we show that DNNs have the expressive power to overcome the curse of dimensionality in the approximation of a large class of functions. More precisely, we prove that these functions can be approximated by DNNs on compact sets such that the number of parameters necessary to represent the approximating DNNs grows at most polynomially in the reciprocal $1/\varepsilon$ of the approximation accuracy $\varepsilon>0$ and in the input dimension $d\in \mathbb{N} =\{1,2,3,\dots\}$. To this end, we introduce certain approximation spaces, consisting of sequences of functions that can be efficiently approximated by DNNs. We then establish closure properties which we combine with known and new bounds on the number of parameters necessary to approximate locally Lipschitz continuous functions, maximum functions, and product functions by DNNs. The main result of this article demonstrates that DNNs have sufficient expressiveness to approximate certain sequences of functions which can be constructed by means of a finite number of compositions using locally Lipschitz continuous functions, maxima, and products without the curse of dimensionality.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 82 pages, 1 figure

arXiv:2112.09684 [pdf, other]

doi 10.4208/jml.220114a

On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, under the assumption that the target function (describing the relationship between input data and the output data) is piecewise polynomial, and under the assumption that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we also verify this assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fréchet subdifferentiability).
Our mathematical convergence analysis builds up on ideas from our previous article Eberle et al., on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Lojasiewicz inequalities, on tools from functional analysis such as the Arzelà-Ascoli theorem, on tools from nonsmooth analysis such as the concept of limiting Fréchet subgradients, as well as on the fact that the set of realization functions of shallow ReLU ANNs with fixed architecture forms a closed subset of the set of continuous functions revealed by Petersen et al.

Submitted 13 July, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 89 pages, 15 figures

Journal ref: Journal of Machine Learning, 1 (2022), pp. 141-246

arXiv:2112.07369 [pdf, other]

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

Authors: Martin Hutzenthaler, Arnulf Jentzen, Katharina Pohl, Adrian Riekert, Luca Scarpa

Abstract: In many numerical simulations stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs) but till this day it remains an open problem of research to provide a mathematical convergence analysis which rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation. We first establish general regularity properties for the risk functions and their generalized gradient functions appearing in the training of such DNNs and, thereafter, we investigate the plain vanilla SGD optimization method in the training of such DNNs under the assumption that the target function under consideration is a constant function. Specifically, we prove under the assumption that the learning rates (the step sizes of the SGD optimization method) are sufficiently small but not $L^1$-summable and under the assumption that the target function is a constant function that the expectation of the risk of the considered SGD process converges in the training of such DNNs to zero as the number of SGD steps increases to infinity.

Submitted 22 June, 2023; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: 71 pages, 5 figures, 2 tables, 4 Python source codes. To appear in Electronic Research Archive

arXiv:2110.08297 [pdf, ps, other]

Strong $L^p$-error analysis of nonlinear Monte Carlo approximations for high-dimensional semilinear partial differential equations

Authors: Martin Hutzenthaler, Arnulf Jentzen, Benno Kuckuck, Joshua Lee Padgett

Abstract: Full-history recursive multilevel Picard (MLP) approximation schemes have been shown to overcome the curse of dimensionality in the numerical approximation of high-dimensional semilinear partial differential equations (PDEs) with general time horizons and Lipschitz continuous nonlinearities. However, each of the error analyses for MLP approximation schemes in the existing literature studies the $L^2$-root-mean-square distance between the exact solution of the PDE under consideration and the considered MLP approximation and none of the error analyses in the existing literature provides an upper bound for the more general $L^p$-distance between the exact solution of the PDE under consideration and the considered MLP approximation. It is the key contribution of this article to extend the $L^2$-error analysis for MLP approximation schemes in the literature to a more general $L^p$-error analysis with $p\in (0,\infty)$. In particular, the main result of this article proves that the proposed MLP approximation scheme indeed overcomes the curse of dimensionality in the numerical approximation of high-dimensional semilinear PDEs with the approximation error measured in the $L^p$-sense with $p \in (0,\infty)$.

Submitted 15 October, 2021; originally announced October 2021.

Comments: 42 pages.

arXiv:2108.10602 [pdf, ps, other]

Overcoming the curse of dimensionality in the numerical approximation of backward stochastic differential equations

Authors: Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen

Abstract: Backward stochastic differential equations (BSDEs) belong nowadays to the most frequently studied equations in stochastic analysis and computational stochastics. BSDEs in applications are often nonlinear and high-dimensional. In nearly all cases such nonlinear high-dimensional BSDEs cannot be solved explicitly and it has been and still is a very active topic of research to design and analyze numerical approximation methods to approximatively solve nonlinear high-dimensional BSDEs. Although there are a large number of research articles in the scientific literature which analyze numerical approximation methods for nonlinear BSDEs, until today there has been no numerical approximation method in the scientific literature which has been proven to overcome the curse of dimensionality in the numerical approximation of nonlinear BSDEs in the sense that the number of computational operations of the numerical approximation method to approximatively compute one sample path of the BSDE solution grows at most polynomially in both the reciprocal $1/ \varepsilon$ of the prescribed approximation accuracy $\varepsilon \in (0,\infty)$ and the dimension $d\in \mathbb N=\{1,2,3,\ldots\}$ of the BSDE. It is the key contribution of this article to overcome this obstacle by introducing a new Monte Carlo-type numerical approximation method for high-dimensional BSDEs and by proving that this Monte Carlo-type numerical approximation method does indeed overcome the curse of dimensionality in the approximative computation of solution paths of BSDEs.

Submitted 24 August, 2021; originally announced August 2021.

arXiv:2108.08106 [pdf, other]

doi 10.3934/era.2023128

Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation

Authors: Simon Eberle, Arnulf Jentzen, Adrian Riekert, Georg S. Weiss

Abstract: The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type optimization schemes in the training of ANNs with ReLU activation. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions.
In the second main result of this article we prove in the training of such ANNs under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point.

Submitted 18 August, 2021; originally announced August 2021.

Comments: 30 pages. arXiv admin note: text overlap with arXiv:2107.04479, arXiv:2108.04620

Journal ref: Electronic Research Archive 2023, Volume 31, Issue 5: 2519-2554

arXiv:2108.04620 [pdf, other]

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains - even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer - an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity. In this article we prove this conjecture in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval, where the probability distributions for the random initializations of the ANN parameters are standard normal distributions, and where the target function under consideration is continuous and piecewise affine linear.
Roughly speaking, the key ingredients in our mathematical convergence analysis are (i) to prove that suitable sets of global minima of the risk functions are twice continuously differentiable submanifolds of the ANN parameter spaces, (ii) to prove that the Hessians of the risk functions on these sets of global minima satisfy an appropriate maximal rank condition, and, thereafter, (iii) to apply the machinery in [Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21(136): 1-48, 2020] to establish convergence of the GD optimization method with random initializations.

Submitted 10 August, 2021; originally announced August 2021.

Comments: 44 pages. arXiv admin note: text overlap with arXiv:2107.04479

Journal ref: Journal of Machine Learning Research 23, 260 (2022), pp. 1-50

arXiv:2107.04479 [pdf, ps, other]

doi 10.1016/j.jmaa.2022.126601

Convergence analysis for gradient flows in the training of artificial neural networks with ReLU activation

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Such schemes can be considered as discretizations of gradient flows (GFs) associated to the training of ANNs with ReLU activation and most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in the training of ANNs with ReLU activation seem to be already present in the dynamics of the corresponding GF differential equations. It is the key subject of this work to analyze such GF differential equations in the training of ANNs with ReLU activation and three layers (one input layer, one hidden layer, and one output layer). In particular, in this article we prove in the case where the target function is possibly multi-dimensional and continuous and in the case where the probability distribution of the input data is absolutely continuous with respect to the Lebesgue measure that the risk of every bounded GF trajectory converges to the risk of a critical point. In addition, in this article we show in the case of a 1-dimensional affine linear target function and in the case where the probability distribution of the input data coincides with the standard uniform distribution that the risk of every bounded GF trajectory converges to zero if the initial risk is sufficiently small.
Finally, in the special situation where there is only one neuron on the hidden layer (1-dimensional hidden layer) we strengthen the above named result for affine linear target functions by proving that the risk of every (not necessarily bounded) GF trajectory converges to zero if the initial risk is sufficiently small.

Submitted 9 July, 2021; originally announced July 2021.

Comments: 37 pages

Journal ref: Journal of Mathematical Analysis and Applications 517, 2 (2023)

arXiv:2104.00277 [pdf, ps, other]

doi 10.1007/s00033-022-01716-w

A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in \mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small and the input data used in the SGD process to train the artificial neural networks is assumed to be independent and identically distributed.

Submitted 1 April, 2021; originally announced April 2021.

Comments: 29 pages

Journal ref: Zeitschrift für angewandte Mathematik und Physik 73 (2022)

arXiv:2103.10922 [pdf, other]

doi 10.1007/s00332-022-09823-8

Landscape analysis for shallow neural networks: complete classification of critical points for affine target functions

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: In this paper, we analyze the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. In particular, we show that there exist no local maxima and clarify the structure of saddle points. Moreover, we prove that non-global local minima can only be caused by 'dead' ReLU neurons. In particular, they do not appear in the case of leaky ReLU or quadratic activation. Our approach is of a combinatorial nature and builds on a careful analysis of the different types of hidden neurons that can occur.

Submitted 6 July, 2022; v1 submitted 19 March, 2021; originally announced March 2021.

MSC Class: 68T07; ACM Class: I.2.6

Journal ref: J Nonlinear Sci 32, 64 (2022)

arXiv:2103.04488 [pdf, ps, other]

Lower bounds for artificial neural network approximations: A proof that shallow neural networks fail to overcome the curse of dimensionality

Authors: Philipp Grohs, Shokhrukh Ibragimov, Arnulf Jentzen, Sarah Koppensteiner

Abstract: Artificial neural networks (ANNs) have become a very powerful tool in the approximation of high-dimensional functions. Especially, deep ANNs, consisting of a large number of hidden layers, have been very successfully used in a series of practically relevant computational problems involving high-dimensional input data ranging from classification tasks in supervised learning to optimal decision problems in reinforcement learning. There are also a number of mathematical results in the scientific literature which study the approximation capacities of ANNs in the context of high-dimensional target functions. In particular, there are a series of mathematical results in the scientific literature which show that sufficiently deep ANNs have the capacity to overcome the curse of dimensionality in the approximation of certain target function classes in the sense that the number of parameters of the approximating ANNs grows at most polynomially in the dimension $d \in \mathbb{N}$ of the target functions under consideration. In the proofs of several of such high-dimensional approximation results it is crucial that the involved ANNs are sufficiently deep and consist of a sufficiently large number of hidden layers which grows in the dimension of the considered target functions. It is the topic of this work to look in a bit more detail at the depth of the involved ANNs in the approximation of high-dimensional target functions.
In particular, the main result of this work proves that there exists a concretely specified sequence of functions which can be approximated without the curse of dimensionality by sufficiently deep ANNs but which cannot be approximated without the curse of dimensionality if the involved ANNs are shallow or not deep enough.

Submitted 7 March, 2021; originally announced March 2021.

Comments: 53 pages

arXiv:2103.02350 [pdf, ps, other]

Full history recursive multilevel Picard approximations for ordinary differential equations with expectations

Authors: Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, Emilia Magnani

Abstract: We consider ordinary differential equations (ODEs) which involve expectations of a random variable. These ODEs are special cases of McKean-Vlasov stochastic differential equations (SDEs). A plain vanilla Monte Carlo approximation method for such ODEs requires a computational cost of order $\varepsilon^{-3}$ to achieve a root-mean-square error of size $\varepsilon$. In this work we adapt recently introduced full history recursive multilevel Picard (MLP) algorithms to reduce this computational complexity. Our main result shows for every $δ>0$ that the proposed MLP approximation algorithm requires only a computational effort of order $\varepsilon^{-(2+δ)}$ to achieve a root-mean-square error of size $\varepsilon$.

Submitted 3 March, 2021; originally announced March 2021.

Comments: 24 pages. arXiv admin note: substantial text overlap with arXiv:1903.05985

MSC Class: 65Lxx; 65Mxx; 65Cxx; 65M75; ACM Class: G.1.0; G.1.7; G.1.m; G.3

arXiv:2102.11840 [pdf, ps, other]

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases

Authors: Arnulf Jentzen, Timo Kröger

Abstract: In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why randomly initialized gradient descent optimization algorithms, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this problem in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article we provide a further contribution to this area of research by considering overparameterized fully-connected rectified artificial neural networks with biases. Specifically, we show that for a fixed number of training data the mean squared error using batch gradient descent optimization applied to such a randomly initialized artificial neural network converges to zero at a linear convergence rate as long as the width of the artificial neural network is large enough, the learning rate is small enough, and the training input data are pairwise linearly independent.

Submitted 23 February, 2021; originally announced February 2021.

Comments: 38 pages

arXiv:2102.09924 [pdf, ps, other]

doi 10.1016/j.jco.2022.101646

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

Authors: Patrick Cheridito, Arnulf Jentzen, Adrian Riekert, Florian Rossmannek

Abstract: Gradient descent optimization algorithms are the standard ingredients that are used to train artificial neural networks (ANNs). Even though a huge number of numerical simulations indicate that gradient descent optimization methods do indeed converge in the training of ANNs, until today there is no rigorous theoretical analysis which proves (or disproves) this conjecture. In particular, even in the case of the most basic variant of gradient descent optimization algorithms, the plain vanilla gradient descent method, it remains an open problem to prove or disprove the conjecture that gradient descent converges in the training of ANNs. In this article we solve this problem in the special situation where the target function under consideration is a constant function. More specifically, in the case of constant target functions we prove in the training of rectified fully-connected feedforward ANNs with one hidden layer that the risk function of the gradient descent method does indeed converge to zero. Our mathematical analysis strongly exploits the property that the rectifier function is the activation function used in the considered ANNs. A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters. This Lyapunov function is the central tool in our convergence proof of the gradient descent method.

Submitted 19 February, 2021; originally announced February 2021.

Comments: 23 pages

Journal ref: Journal of Complexity (2022)

arXiv:2012.12348 [pdf, ps, other]

doi 10.3934/dcdsb.2022238

An overview on deep learning-based approximation methods for partial differential equations

Authors: Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, Benno Kuckuck

Abstract: It is one of the most challenging problems in applied mathematics to approximatively solve high-dimensional partial differential equations (PDEs). Recently, several deep learning-based approximation algorithms for attacking this problem have been proposed and tested numerically on a number of examples of high-dimensional PDEs. This has given rise to a lively field of research in which deep learning-based methods and related Monte Carlo methods are applied to the approximation of high-dimensional PDEs. In this article we offer an introduction to this field of research by revisiting selected mathematical results related to deep learning approximation methods for PDEs and reviewing the main ideas of their proofs. We also provide a short overview of the recent literature in this area of research.

Submitted 18 November, 2022; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: 49 pages. Compared to the first version, the manuscript has been significantly expanded. In particular, Python source code implementing several of the presented methods using PyTorch, as well as numerical simulations have been added

MSC Class: 65M99 (Primary); 35-02; 65-02; 68T07 (Secondary)

Journal ref: Discrete Contin. Dyn. Syst. Ser. B 28 (2023), no. 6, 3697-3746

arXiv:2012.08443 [pdf, ps, other]

doi 10.1007/s40304-022-00292-9

Strong overall error analysis for the training of artificial neural networks via random initializations

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Although deep learning based approximation algorithms have been applied very successfully to numerous problems, at the moment the reasons for their performance are not entirely understood from a mathematical point of view. Recently, estimates for the convergence of the overall error have been obtained in the situation of deep supervised learning, but with an extremely slow rate of convergence. In this note we partially improve on these estimates. More specifically, we show that the depth of the neural network only needs to increase much slower in order to obtain the same rate of approximation. The results hold in the case of an arbitrary stochastic optimization algorithm with i.i.d. random initializations.

Submitted 15 December, 2020; originally announced December 2020.

Comments: 40 pages

Journal ref: Communications in Mathematics and Statistics (2023)

arXiv:2012.04326 [pdf, other]

High-dimensional approximation spaces of artificial neural networks and applications to partial differential equations

Authors: Pierfrancesco Beneventano, Patrick Cheridito, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: In this paper we develop a new machinery to study the capacity of artificial neural networks (ANNs) to approximate high-dimensional functions without suffering from the curse of dimensionality. Specifically, we introduce a concept which we refer to as approximation spaces of artificial neural networks and we present several tools to handle those spaces. Roughly speaking, approximation spaces consist of sequences of functions which can, in a suitable way, be approximated by ANNs without curse of dimensionality in the sense that the number of required ANN parameters to approximate a function of the sequence with an accuracy $\varepsilon > 0$ grows at most polynomially both in the reciprocal $1/\varepsilon$ of the required accuracy and in the dimension $d \in \mathbb{N} = \{1, 2, 3, \ldots \}$ of the function. We show that these approximation spaces are closed under various operations including linear combinations, formations of limits, and infinite compositions. To illustrate the utility of the machinery proposed in this paper, we employ the developed theory to prove that ANNs have the capacity to overcome the curse of dimensionality in the numerical approximation of certain first order transport partial differential equations (PDEs). We even prove that approximation spaces are closed under flows of first order transport PDEs.

Submitted 28 January, 2025; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: 31 pages

arXiv:2012.01194 [pdf, ps, other]

Deep learning based numerical approximation algorithms for stochastic partial differential equations and high-dimensional nonlinear filtering problems

Authors: Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, Ariel Neufeld

Abstract: In this article we introduce and study a deep learning based approximation algorithm for solutions of stochastic partial differential equations (SPDEs). In the proposed approximation algorithm we employ a deep neural network for every realization of the driving noise process of the SPDE to approximate the solution process of the SPDE under consideration. We test the performance of the proposed approximation algorithm in the case of stochastic heat equations with additive noise, stochastic heat equations with multiplicative noise, stochastic Black--Scholes equations with multiplicative noise, and Zakai equations from nonlinear filtering. In each of these SPDEs the proposed approximation algorithm produces accurate results with short run times in up to 50 space dimensions.

Submitted 2 December, 2020; originally announced December 2020.

arXiv:2009.13989 [pdf, ps, other]

Nonlinear Monte Carlo methods with polynomial runtime for high-dimensional iterated nested expectations

Authors: Christian Beck, Arnulf Jentzen, Thomas Kruse

Abstract: The approximative calculation of iterated nested expectations is a recurring challenging problem in applications. Nested expectations appear, for example, in the numerical approximation of solutions of backward stochastic differential equations (BSDEs), in the numerical approximation of solutions of semilinear parabolic partial differential equations (PDEs), in statistical physics, in optimal stopping problems such as the approximative pricing of American or Bermudan options, in risk measure estimation in mathematical finance, or in decision-making under uncertainty. Nested expectations which arise in the above named applications often consist of a large number of nestings. However, the computational effort of standard nested Monte Carlo approximations for iterated nested expectations grows exponentially in the number of nestings and it remained an open question whether it is possible to approximately calculate multiply iterated high-dimensional nested expectations in polynomial time. In this article we tackle this problem by proposing and studying a new class of full-history recursive multilevel Picard (MLP) approximation schemes for iterated nested expectations. In particular, we prove under suitable assumptions that these MLP approximation schemes can approximately calculate multiply iterated nested expectations with a computational effort growing at most polynomially in the number of nestings $K \in \mathbb{N} = \{1, 2, 3, \ldots \}$, in the problem dimension $d \in \mathbb{N}$, and in the reciprocal $\frac{1}{\varepsilon}$ of the desired approximation accuracy $\varepsilon \in (0, \infty)$.

Submitted 29 September, 2020; originally announced September 2020.

Comments: 47 pages

MSC Class: 65C05 (Primary) 65M75; 68Q25 (Secondary)

Showing 1–50 of 139 results for author: Jentzen, A