-
Statistically guided deep learning
Authors:
Michael Kohler,
Adam Krzyzak
Abstract:
We present a theoretically well-founded deep learning algorithm for nonparametric regression. It uses over-parametrized deep neural networks with logistic activation function, which are fitted to the given data via gradient descent. We propose a special topology of these networks, a special random initialization of the weights, and a data-dependent choice of the learning rate and the number of gradient descent steps. We prove a theoretical bound on the expected $L_2$ error of this estimate, and illustrate its finite sample size performance by applying it to simulated data. Our results show that a theoretical analysis of deep learning which takes into account simultaneously optimization, generalization and approximation can result in a new deep learning estimate which has an improved finite sample performance.
Submitted 11 April, 2025;
originally announced April 2025.
-
Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent
Authors:
Michael Kohler,
Adam Krzyzak,
Benjamin Walter
Abstract:
Image classification based on over-parametrized convolutional neural networks with a global average-pooling layer is considered. The weights of the network are learned by gradient descent. A bound on the rate of convergence of the difference between the misclassification risk of the newly introduced convolutional neural network estimate and the minimal possible value is derived.
Submitted 13 May, 2024;
originally announced May 2024.
-
Learning of deep convolutional network image classifiers via stochastic gradient descent and over-parametrization
Authors:
Michael Kohler,
Adam Krzyzak,
Alisha Sänger
Abstract:
Image classification from independent and identically distributed random variables is considered. Image classifiers are defined which are based on a linear combination of deep convolutional networks with max-pooling layer. Here all the weights are learned by stochastic gradient descent. A general result is presented which shows that the image classifiers are able to approximate the best possible deep convolutional network. In case that the a posteriori probability satisfies a suitable hierarchical composition model it is shown that the corresponding deep convolutional neural network image classifier achieves a rate of convergence which is independent of the dimension of the images.
Submitted 5 March, 2025; v1 submitted 10 April, 2024;
originally announced April 2024.
-
On the rate of convergence of an over-parametrized Transformer classifier learned by gradient descent
Authors:
Michael Kohler,
Adam Krzyzak
Abstract:
One of the most recent and fascinating breakthroughs in artificial intelligence is ChatGPT, a chatbot which can simulate human conversation. ChatGPT is an instance of GPT4, which is a language model based on generative pre-trained transformers. So if one wants to study from a theoretical point of view how powerful such artificial intelligence can be, one approach is to consider transformer networks and to study which problems one can solve with these networks theoretically. Here it is not only important what kind of models these networks can approximate, or how they can generalize their knowledge learned by choosing the best possible approximation to a concrete data set, but also how well optimization of such a transformer network based on a concrete data set works. In this article we consider all three of these aspects simultaneously and show a theoretical upper bound on the misclassification probability of a transformer network fitted to the observed data. For simplicity we focus in this context on transformer encoder networks which can be applied to define an estimate in the context of a classification problem involving natural language.
Submitted 20 June, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Analysis of the rate of convergence of an over-parametrized deep neural network estimate learned by gradient descent
Authors:
Michael Kohler,
Adam Krzyzak
Abstract:
Estimation of a regression function from independent and identically distributed random variables is considered. The $L_2$ error with integration with respect to the design measure is used as an error criterion. Over-parametrized deep neural network estimates are defined where all the weights are learned by gradient descent. It is shown that the expected $L_2$ error of these estimates converges to zero at a rate close to $n^{-1/(1+d)}$ in case that the regression function is Hölder smooth with Hölder exponent $p \in [1/2,1]$. In case of an interaction model where the regression function is assumed to be a sum of Hölder smooth functions where each of the functions depends only on $d^*$ many of $d$ components of the design variable, it is shown that these estimates achieve the corresponding $d^*$-dimensional rate of convergence.
Submitted 4 October, 2022;
originally announced October 2022.
-
On the rate of convergence of a deep recurrent neural network estimate in a regression problem with dependent data
Authors:
Michael Kohler,
Adam Krzyzak
Abstract:
A regression problem with dependent data is considered. Regularity assumptions on the dependency of the data are introduced, and it is shown that under suitable structural assumptions on the regression function a deep recurrent neural network estimate is able to circumvent the curse of dimensionality.
Submitted 31 October, 2020;
originally announced November 2020.
-
On the rate of convergence of image classifiers based on convolutional neural networks
Authors:
M. Kohler,
A. Krzyzak,
B. Walter
Abstract:
Image classifiers based on convolutional neural networks are defined, and the rate of convergence of the misclassification risk of the estimates towards the optimal misclassification risk is analyzed. Under suitable assumptions on the smoothness and structure of the a posteriori probability a rate of convergence is shown which is independent of the dimension of the image. This proves that in image classification it is possible to circumvent the curse of dimensionality by convolutional neural networks.
Submitted 14 October, 2020; v1 submitted 3 March, 2020;
originally announced March 2020.
-
Analysis of the rate of convergence of neural network regression estimates which are easy to implement
Authors:
Alina Braun,
Michael Kohler,
Adam Krzyzak
Abstract:
Recent results in nonparametric regression show that for deep learning, i.e., for neural network estimates with many hidden layers, we are able to achieve good rates of convergence even in case of high-dimensional predictor variables, provided suitable assumptions on the structure of the regression function are imposed. The estimates are defined by minimizing the empirical $L_2$ risk over a class of neural networks. In practice it is not clear how this can be done exactly. In this article we introduce a new neural network regression estimate where most of the weights are chosen regardless of the data, motivated by some recent approximation results for neural networks, and which is therefore easy to implement. We show that for this estimate we can derive rate of convergence results in case the regression function is smooth. We combine this estimate with projection pursuit, where we choose the directions randomly, and we show that for sufficiently many repetitions we get a neural network regression estimate which is easy to implement and which achieves the one-dimensional rate of convergence (up to some logarithmic factor) in case that the regression function satisfies the assumptions of projection pursuit.
Submitted 9 December, 2019;
originally announced December 2019.
-
Over-parametrized deep neural networks do not generalize well
Authors:
Michael Kohler,
Adam Krzyzak
Abstract:
Recently it was shown in several papers that backpropagation is able to find the global minimum of the empirical risk on the training data using over-parametrized deep neural networks. In this paper a similar result is shown for deep neural networks with the sigmoidal squasher activation function in a regression setting, and a lower bound is presented which proves that these networks do not generalize well on new data in the sense that they do not achieve the optimal minimax rate of convergence for estimation of smooth regression functions.
Submitted 14 January, 2020; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Estimation of a function of low local dimensionality by deep neural networks
Authors:
Michael Kohler,
Adam Krzyzak,
Sophie Langer
Abstract:
Deep neural networks (DNNs) achieve impressive results for complicated tasks like object detection on images and speech recognition. Motivated by this practical success, there is now a strong interest in showing good theoretical properties of DNNs. To describe for which tasks DNNs perform well and when they fail, it is a key challenge to understand their performance. The aim of this paper is to contribute to the current statistical theory of DNNs. We apply DNNs on high dimensional data and we show that the least squares regression estimates using DNNs are able to achieve dimensionality reduction in case that the regression function has locally low dimensionality. Consequently, the rate of convergence of the estimate does not depend on its input dimension $d$, but on its local dimension $d^*$ and the DNNs are able to circumvent the curse of dimensionality in case that $d^*$ is much smaller than $d$. In our simulation study we provide numerical experiments to support our theoretical result and we compare our estimate with other conventional nonparametric regression estimates. The performance of our estimates is also validated in experiments with real data.
Submitted 15 June, 2020; v1 submitted 29 August, 2019;
originally announced August 2019.
-
Discretization of quaternionic continuous wavelet transforms
Authors:
A. Askari Hemmat,
K. Thirulogasanthar,
A. Krzyzak
Abstract:
A scheme to form a basis and a frame for a Hilbert space of quaternion valued square integrable functions from a basis and a frame, respectively, of a Hilbert space of complex valued square integrable functions is introduced. Using the discretization techniques for the 2D continuous wavelet transform of the $SIM(2)$ group, the quaternionic continuous wavelet transform of the quaternion wavelet group, living in a complex valued Hilbert space of square integrable functions, is discretized, and thereby a discrete frame for the quaternion valued Hilbert space of square integrable functions is obtained.
Submitted 8 December, 2016;
originally announced December 2016.
-
An Affine Invariant $k$-Nearest Neighbor Regression Estimate
Authors:
Gérard Biau,
Luc Devroye,
Vida Dujmovic,
Adam Krzyzak
Abstract:
We design a data-dependent metric in $\mathbb R^d$ and use it to define the $k$-nearest neighbors of a given point. Our metric is invariant under all affine transformations. We show that, with this metric, the standard $k$-nearest neighbor regression estimate is asymptotically consistent under the usual conditions on $k$, and minimal requirements on the input data.
Submitted 18 May, 2012; v1 submitted 3 January, 2012;
originally announced January 2012.
-
Multi Matrix Vector Coherent States
Authors:
K. Thirulogasanthar,
G. Honnouvo,
A. Krzyzak
Abstract:
A class of vector coherent states is derived with multiple of matrices as vectors in a Hilbert space, where the Hilbert space is taken to be the tensor product of several other Hilbert spaces. As examples vector coherent states with multiple of quaternions and octonions are given. The resulting generalized oscillator algebra is briefly discussed. Further, vector coherent states for a tensored Hamiltonian system are obtained by the same method. As particular cases, coherent states are obtained for tensored Jaynes-Cummings type Hamiltonians and for a two-level two-mode generalization of the Jaynes-Cummings model.
Submitted 2 September, 2004; v1 submitted 29 August, 2003;
originally announced August 2003.