-
Schoenberg characterization of continuous non-stationary isotropic positive definite kernels
Authors:
Felix Benning,
Max David Schölpple
Abstract:
We provide a characterization for the continuous positive definite kernels on $\mathbb R^d$ that are invariant to linear isometries, i.e. invariant under the orthogonal group $O(d)$. Furthermore, we provide necessary and sufficient conditions for these kernels to be strictly positive definite. This class of isotropic kernels is fairly general: First, it unifies stationary isotropic and dot product kernels, and second, it includes neural network kernels that arise from infinite-width limits of neural networks.
Submitted 27 June, 2025;
originally announced June 2025.
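The invariance property named in the abstract above, $k(Ux, Uy) = k(x, y)$ for every $U \in O(d)$, can be checked numerically. The following is a minimal sketch, not taken from the paper: the two example kernels (a squared-exponential kernel and a quadratic dot product kernel), the helper names, and the use of a Householder reflection as a sample element of $O(d)$ are all illustrative assumptions.

```python
import math
import random


def dot(x, y):
    """Euclidean inner product of two equally long lists."""
    return sum(a * b for a, b in zip(x, y))


def reflect(v, x):
    """Apply the Householder reflection I - 2 v v^T / <v, v> to x.

    Every such reflection is a linear isometry, i.e. an element of O(d).
    """
    c = 2.0 * dot(v, x) / dot(v, v)
    return [xi - c * vi for xi, vi in zip(x, v)]


def rbf_kernel(x, y):
    """Illustrative stationary isotropic kernel: depends only on |x - y|."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist)


def poly_kernel(x, y):
    """Illustrative dot product kernel: depends only on <x, y>."""
    return (1.0 + dot(x, y)) ** 2


def invariance_gap(kernel, d=5, seed=0):
    """Return |k(Ux, Uy) - k(x, y)| for random x, y and a random reflection U."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = [rng.gauss(0.0, 1.0) for _ in range(d)]
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return abs(kernel(reflect(v, x), reflect(v, y)) - kernel(x, y))


if __name__ == "__main__":
    for name, kernel in [("rbf", rbf_kernel), ("poly", poly_kernel)]:
        print(name, invariance_gap(kernel))
```

Both gaps should vanish up to floating-point rounding, since a reflection preserves inner products and hence distances. This only illustrates that both kernel families share the invariance. It says nothing about positive definiteness, which is the subject of the characterization.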
-
In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods
Authors:
Felix Benning,
Steffen Dereich
Abstract:
Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, one-hidden-layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the 'redundant domain' (the remaining parameters). In almost all regression problems, the optimization landscape on the efficient domain only features local minima with strongly convex neighborhoods. Formally, we show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain, and on this domain potential local minima are never isolated.
Submitted 11 April, 2025;
originally announced April 2025.
-
Measure Theory of Conditionally Independent Random Function Evaluation
Authors:
Felix Benning
Abstract:
The next evaluation point $x_{n+1}$ of a random function $\mathbf f = (\mathbf f(x))_{x\in \mathbb X}$ (a.k.a. stochastic process or random field) is often chosen based on the filtration of previously seen evaluations $\mathcal F_n := \sigma(\mathbf f(x_0),\dots, \mathbf f(x_n))$. This turns $x_{n+1}$ into a random variable $X_{n+1}$ and thereby $\mathbf f(X_{n+1})$ into a complex measure theoretical object. In applications, like geostatistics or Bayesian optimization, the evaluation locations $X_n$ are often treated as deterministic during the calculation of the conditional distribution $\mathbb P(\mathbf f(X_{n+1}) \in A \mid \mathcal F_n)$. We provide a framework to prove that the results obtained by this treatment are typically correct. We also treat the more general case where $X_{n+1}$ is not 'previsible' but independent from $\mathbf f$ conditional on $\mathcal F_n$ and the case of noisy evaluations.
Submitted 11 April, 2025;
originally announced April 2025.
-
Gradient Span Algorithms Make Predictable Progress in High Dimension
Authors:
Felix Benning,
Leif Döring
Abstract:
We prove that all 'gradient span algorithms' have asymptotically deterministic behavior on scaled Gaussian random functions as the dimension tends to infinity. In particular, this result explains the counterintuitive phenomenon that different training runs of many large machine learning models result in approximately equal cost curves despite random initialization on a complicated non-convex landscape.
The distributional assumption of (non-stationary) isotropic Gaussian random functions we use is sufficiently general to serve as a realistic model for machine learning training but also encompasses spin glasses and random quadratic functions.
Submitted 13 October, 2024;
originally announced October 2024.
-
Random Function Descent
Authors:
Felix Benning,
Leif Döring
Abstract:
Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.
Submitted 15 October, 2024; v1 submitted 2 May, 2023;
originally announced May 2023.
-
High Dimensional Optimization through the Lens of Machine Learning
Authors:
Felix Benning
Abstract:
This thesis reviews numerical optimization methods with machine learning problems in mind. Since machine learning models are highly parametrized, we focus on methods suited for high dimensional optimization. We build intuition on quadratic models to figure out which methods are suited for non-convex optimization, and develop convergence proofs on convex functions for this selection of methods. With this theoretical foundation for stochastic gradient descent and momentum methods, we try to explain why the methods commonly used in the machine learning field are so successful. Besides explaining successful heuristics, the last chapter also provides a less extensive review of more theoretical methods, which are not quite as popular in practice. So in some sense this work attempts to answer the question: Why are the default TensorFlow optimizers included in the defaults?
Submitted 31 December, 2021;
originally announced December 2021.