Search | arXiv e-print repository

arXiv:2506.22048 [pdf, ps, other]

Schoenberg characterization of continuous non-stationary isotropic positive definite kernels

Authors: Felix Benning, Max David Schölpple

Abstract: We provide a characterization for the continuous positive definite kernels on $\mathbb R^d$ that are invariant to linear isometries, i.e. invariant under the orthogonal group $O(d)$. Furthermore, we provide necessary and sufficient conditions for these kernels to be strictly positive definite. This class of isotropic kernels is fairly general: First, it unifies stationary isotropic and dot product kernels, and second, it includes neural network kernels that arise from infinite-width limits of neural networks.

Submitted 27 June, 2025; originally announced June 2025.

MSC Class: 33C50; 33C55; 42A82; 42C10; 43A35; 60G15; 68T07
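The invariance property named in the abstract above, $k(Qx, Qy) = k(x, y)$ for all $Q \in O(d)$, can be checked numerically. The following is a minimal sketch and is not taken from the paper: the three isotropic example kernels, the anisotropic counterexample, and all function names are illustrative assumptions. Orthogonal maps are built as products of Householder reflections, and the check covers invariance only, not positive definiteness.

```python
import math
import random


def dot(x, y):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def reflect(v, x):
    """Householder reflection of x across the hyperplane orthogonal to v."""
    scale = 2.0 * dot(v, x) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]


def random_isometry(d, rng, n_reflections=3):
    """Return a map x -> Qx with Q in O(d), composed of random reflections."""
    normals = [[rng.gauss(0.0, 1.0) for _ in range(d)]
               for _ in range(n_reflections)]

    def apply(x):
        for v in normals:
            x = reflect(v, x)
        return x

    return apply


def rbf_kernel(x, y):
    # Stationary isotropic example: depends only on |x - y|.
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-0.5 * dot(diff, diff))


def dot_product_kernel(x, y):
    # Dot product example: depends only on <x, y>.
    return (1.0 + dot(x, y)) ** 2


def mixed_kernel(x, y):
    # Hypothetical example depending on |x|, |y| and <x, y> jointly,
    # so it is neither stationary nor a pure dot product kernel.
    return math.exp(-dot(x, x) - dot(y, y) + dot(x, y))


def anisotropic_kernel(x, y):
    # Counterexample: singles out the first coordinate axis.
    return x[0] * y[0]


def max_invariance_gap(kernel, d=3, trials=200, seed=0):
    """Largest observed |k(Qx, Qy) - k(x, y)| over random x, y and Q."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        q = random_isometry(d, rng)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]
        gap = max(gap, abs(kernel(q(x), q(y)) - kernel(x, y)))
    return gap


if __name__ == "__main__":
    for k in (rbf_kernel, dot_product_kernel, mixed_kernel, anisotropic_kernel):
        print(k.__name__, max_invariance_gap(k))
```

The first three kernels depend on their arguments only through $\|x\|$, $\|y\|$ and $\langle x, y\rangle$, all of which are preserved by orthogonal maps, so their reported gap should be at the level of floating-point rounding. The anisotropic kernel should show a clearly nonzero gap.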

arXiv:2504.08867 [pdf, ps, other]

In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods

Authors: Felix Benning, Steffen Dereich

Abstract: Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, 1-hidden layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the 'redundant domain' (the remaining parameters). In almost all regression problems on the efficient domain the optimization landscape only features local minima that are strongly convex. Formally, we will show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain and on this domain, potential local minima are never isolated.

Submitted 11 April, 2025; originally announced April 2025.

MSC Class: 60G15; 60G60; 62J02; 62M45; 68T07

arXiv:2504.08513 [pdf, ps, other]

Measure Theory of Conditionally Independent Random Function Evaluation

Authors: Felix Benning

Abstract: The next evaluation point $x_{n+1}$ of a random function $\mathbf f = (\mathbf f(x))_{x\in \mathbb X}$ (a.k.a. stochastic process or random field) is often chosen based on the filtration of previously seen evaluations $\mathcal F_n := \sigma(\mathbf f(x_0),\dots, \mathbf f(x_n))$. This turns $x_{n+1}$ into a random variable $X_{n+1}$ and thereby $\mathbf f(X_{n+1})$ into a complex measure theoretical object. In applications, like geostatistics or Bayesian optimization, the evaluation locations $X_n$ are often treated as deterministic during the calculation of the conditional distribution $\mathbb P(\mathbf f(X_{n+1}) \in A \mid \mathcal F_n)$. We provide a framework to prove that the results obtained by this treatment are typically correct. We also treat the more general case where $X_{n+1}$ is not 'previsible' but independent from $\mathbf f$ conditional on $\mathcal F_n$ and the case of noisy evaluations.

Submitted 11 April, 2025; originally announced April 2025.

MSC Class: 60A10; 60G05; 60G15; 60G60

arXiv:2410.09973 [pdf, other]

Gradient Span Algorithms Make Predictable Progress in High Dimension

Authors: Felix Benning, Leif Döring

Abstract: We prove that all 'gradient span algorithms' have asymptotically deterministic behavior on scaled Gaussian random functions as the dimension tends to infinity. In particular, this result explains the counterintuitive phenomenon that different training runs of many large machine learning models result in approximately equal cost curves despite random initialization on a complicated non-convex landscape. The distributional assumption of (non-stationary) isotropic Gaussian random functions we use is sufficiently general to serve as a realistic model for machine learning training but also encompasses spin glasses and random quadratic functions.

Submitted 13 October, 2024; originally announced October 2024.

MSC Class: 60F99; 68T01; 82D30

arXiv:2305.01377 [pdf, other]

Random Function Descent

Authors: Felix Benning, Leif Döring

Abstract: Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.

Submitted 15 October, 2024; v1 submitted 2 May, 2023; originally announced May 2023.

Journal ref: Advances in Neural Information Processing Systems, Vol. 37. Vancouver, Canada: Curran Associates, Inc., 2024

arXiv:2112.15392 [pdf, other]

High Dimensional Optimization through the Lens of Machine Learning

Authors: Felix Benning

Abstract: This thesis reviews numerical optimization methods with machine learning problems in mind. Since machine learning models are highly parametrized, we focus on methods suited for high dimensional optimization. We build intuition on quadratic models to figure out which methods are suited for non-convex optimization, and develop convergence proofs on convex functions for this selection of methods. With this theoretical foundation for stochastic gradient descent and momentum methods, we try to explain why the methods used commonly in the machine learning field are so successful. Besides explaining successful heuristics, the last chapter also provides a less extensive review of more theoretical methods, which are not quite as popular in practice. So in some sense this work attempts to answer the question: Why are the default TensorFlow optimizers included in the defaults?

Submitted 31 December, 2021; originally announced December 2021.

Comments: arXiv admin note: text overlap with arXiv:1606.04838 by other authors

Showing 1–6 of 6 results for author: Benning, F