-
Pathological spectra of the Fisher information metric and its variants in deep neural networks
Authors:
Ryo Karakida,
Shotaro Akaho,
Shun-ichi Amari
Abstract:
The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth and sample size when the network has random weights and is sufficiently wide. This study covers two widely-used FIMs for regression with linear output and for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size while the others are much smaller. It implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to a more quantitative understanding of learning in large-scale DNNs.
Submitted 27 September, 2020; v1 submitted 14 October, 2019;
originally announced October 2019.
-
The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
Authors:
Ryo Karakida,
Shotaro Akaho,
Shun-ichi Amari
Abstract:
Normalization methods play an important role in enhancing the performance of deep learning, while their theoretical understanding has been limited. To theoretically elucidate the effectiveness of normalization, we quantify the geometry of the parameter space determined by the Fisher information matrix (FIM), which also corresponds to the local shape of the loss landscape under certain conditions. We analyze deep neural networks with random initialization, which is known to suffer from a pathologically sharp shape of the landscape when the network becomes sufficiently wide. We reveal that batch normalization in the last layer contributes to drastically decreasing such pathological sharpness if the width and sample number satisfy a specific condition. In contrast, it is hard for batch normalization in the middle hidden layers to alleviate pathological sharpness in many settings. We also found that layer normalization cannot alleviate pathological sharpness either. Thus, we can conclude that batch normalization in the last layer significantly contributes to decreasing the sharpness induced by the FIM.
Submitted 28 October, 2019; v1 submitted 7 June, 2019;
originally announced June 2019.
-
Unified framework for the entropy production and the stochastic interaction based on information geometry
Authors:
Sosuke Ito,
Masafumi Oizumi,
Shun-ichi Amari
Abstract:
We show a relationship between the entropy production in stochastic thermodynamics and the stochastic interaction in the integrated information theory. To clarify this relationship, we newly introduce an information geometric interpretation of the entropy production for a total system and the partial entropy productions for subsystems. We show that the violation of the additivity of the entropy productions is related to the stochastic interaction. This framework is a thermodynamic foundation of the integrated information theory. We also show that our information geometric formalism leads to a novel expression of the entropy production related to an optimization problem minimizing the Kullback-Leibler divergence. We analytically illustrate this interpretation by using the spin model.
Submitted 6 April, 2020; v1 submitted 22 October, 2018;
originally announced October 2018.
-
Fisher Information and Natural Gradient Learning of Random Deep Networks
Authors:
Shun-ichi Amari,
Ryo Karakida,
Masafumi Oizumi
Abstract:
A deep neural network is a hierarchical nonlinear model transforming input signals to output signals. Its input-output relation is considered to be stochastic, being described for a given input by a parameterized conditional probability distribution of outputs. The space of parameters consisting of weights and biases is a Riemannian manifold, where the metric is defined by the Fisher information matrix. The natural gradient method uses the steepest descent direction in a Riemannian manifold, so it is effective in learning, avoiding plateaus. It requires inversion of the Fisher information matrix, however, which is practically impossible when the matrix has a huge number of dimensions. Many methods for approximating the natural gradient have therefore been introduced. The present paper uses a statistical neurodynamical method to reveal the properties of the Fisher information matrix in a net of random connections under the mean field approximation. We prove that the Fisher information matrix is unit-wise block diagonal supplemented by small order terms of off-block-diagonal elements, which provides a justification for the quasi-diagonal natural gradient method by Y. Ollivier. A unit-wise block-diagonal Fisher matrix reduces to the tensor product of the Fisher information matrices of single units. We further prove that the Fisher information matrix of a single unit has a simple reduced form, a sum of a diagonal matrix and a rank 2 matrix of weight-bias correlations. We obtain the inverse of Fisher information explicitly. We then have an explicit form of the natural gradient, without relying on the numerical matrix inversion, which drastically speeds up stochastic gradient learning.
Submitted 21 August, 2018;
originally announced August 2018.
-
Statistical Neurodynamics of Deep Networks: Geometry of Signal Spaces
Authors:
Shun-ichi Amari,
Ryo Karakida,
Masafumi Oizumi
Abstract:
Statistical neurodynamics studies macroscopic behaviors of randomly connected neural networks. We consider a deep layered feedforward network where input signals are processed layer by layer. The manifold of input signals is embedded in a higher dimensional manifold of the next layer as a curved submanifold, provided the number of neurons is larger than that of inputs. We show geometrical features of the embedded manifold, proving that the manifold enlarges or shrinks locally isotropically so that it is always embedded conformally. We study the curvature of the embedded manifold. The scalar curvature converges to a constant or diverges to infinity slowly. The distance between two signals also changes, converging eventually to a stable fixed value, provided both the number of neurons in a layer and the number of layers tend to infinity. This causes a problem, since when we consider a curve in the input space, it is mapped as a continuous curve of fractal nature, but our theory contradictorily suggests that the curve eventually converges to a discrete set of equally spaced points. In reality, the numbers of neurons and layers are finite and thus, it is expected that the finite size effect causes the discrepancies between our theory and reality. We need to further study the discrepancies to understand their implications on information processing.
Submitted 21 August, 2018;
originally announced August 2018.
-
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
Authors:
Ryo Karakida,
Shotaro Akaho,
Shun-ichi Amari
Abstract:
The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value. Because the landscape of the parameter space is defined by the FIM, it is locally flat in most dimensions, but strongly distorted in others. Moreover, we demonstrate the potential usage of the derived statistics in learning strategies. First, small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability. Second, the maximum eigenvalue that induces the distortion enables us to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.
Submitted 8 October, 2019; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Spontaneous Motion on Two-dimensional Continuous Attractors
Authors:
C. C. Alan Fung,
S. -I. Amari
Abstract:
Attractor models are simplified models used to describe the dynamics of firing rate profiles of a pool of neurons. The firing rate profile, or the neuronal activity, is thought to carry information. Continuous attractor neural networks (CANNs) describe the neural processing of continuous information such as object position, object orientation and direction of object motion. Recently, it was found that, in one-dimensional CANNs, short-term synaptic depression can destabilize bump-shaped neuronal attractor activity profiles. In this paper, we study two-dimensional CANNs with short-term synaptic depression and with spike frequency adaptation. We found that the dynamics of CANNs with short-term synaptic depression and CANNs with spike frequency adaptation are qualitatively similar. We also found that in both kinds of CANNs the perturbative approach can be used to predict phase diagrams, dynamical variables and speed of spontaneous motion.
Submitted 31 January, 2015;
originally announced February 2015.
-
State Concentration Exponent as a Measure of Quickness in Kauffman-type Networks
Authors:
Shun-ichi Amari,
Hiroyasu Ando,
Taro Toyoizumi,
Naoki Masuda
Abstract:
We study the dynamics of randomly connected networks composed of binary Boolean elements and those composed of binary majority vote elements. We elucidate their differences in both sparsely and densely connected cases. The quickness of large network dynamics is usually quantified by the length of transient paths, an analytically intractable measure. For discrete-time dynamics of networks of binary elements, we address this dilemma with an alternative unified framework by using a concept termed state concentration, defined as the exponent of the average number of t-step ancestors in state transition graphs. The state transition graph is defined by nodes corresponding to network states and directed links corresponding to transitions. Using this exponent, we interrogate the dynamics of random Boolean and majority vote networks. We find that extremely sparse Boolean networks and majority vote networks with arbitrary density achieve quickness, owing in part to long-tailed in-degree distributions. As a corollary, only relatively dense majority vote networks can achieve both quickness and robustness.
Submitted 4 March, 2013; v1 submitted 29 February, 2012;
originally announced February 2012.
-
Dually flat structure with escort probability and its application to alpha-Voronoi diagrams
Authors:
Atsumi Ohara,
Hiroshi Matsuzoe,
Shun-ichi Amari
Abstract:
This paper studies the geometrical structure of the manifold of escort probability distributions and shows its new applicability to information science. In order to realize escort probabilities we use a conformal transformation that flattens the so-called alpha-geometry of the space of discrete probability distributions, which well characterizes nonadditive statistics on the space. As a result, escort probabilities are proved to be flat coordinates of the usual probabilities for the derived dually flat structure. Finally, we demonstrate that escort probabilities with the new structure admit a simple algorithm to compute Voronoi diagrams and centroids with respect to alpha-divergences.
Submitted 24 October, 2010;
originally announced October 2010.
-
Efficiency of Energy Transduction in a Molecular Chemical Engine
Authors:
Kazuo Sasaki,
Ryo Kanada,
Satoshi Amari
Abstract:
A simple model of the two-state ratchet type is proposed for molecular chemical engines that convert chemical free energy into mechanical work and vice versa. The engine works by catalyzing a chemical reaction and turning a rotor. Analytical expressions are obtained for the dependences of rotation and reaction rates on the concentrations of reactant and product molecules, from which the performance of the engine is analyzed. In particular, the efficiency of energy transduction is discussed in some detail.
Submitted 28 December, 2006; v1 submitted 19 July, 2006;
originally announced July 2006.
-
Diffusion Coefficient and Mobility of a Brownian Particle in a Tilted Periodic Potential
Authors:
Kazuo Sasaki,
Satoshi Amari
Abstract:
The Brownian motion of a particle in a one-dimensional periodic potential subjected to a uniform external force $F$ is studied. Using the formula for the diffusion coefficient $D$ obtained by other authors and an alternative one derived from the Fokker-Planck equation in the present work, $D$ is compared with the differential mobility $\mu = dv/dF$, where $v$ is the average velocity of the particle. Analytical and numerical calculations indicate that the inequality $D \ge \mu k_{B} T$, with $k_{B}$ the Boltzmann constant and $T$ the temperature, holds if the periodic potential is symmetric, while it is violated for asymmetric potentials when $F$ is small but nonzero.
Submitted 1 February, 2005;
originally announced February 2005.
-
Mutual Information of Three-State Low Activity Diluted Neural Networks with Self-Control
Authors:
D. Bolle',
D. R. C. Dominguez,
S. Amari
Abstract:
The influence of a macroscopic time-dependent threshold on the retrieval process of three-state extremely diluted neural networks is examined. If the threshold is chosen appropriately as a function of the noise and the pattern activity of the network, adapting itself in the course of the time evolution, it guarantees an autonomous functioning of the network. It is found that this self-control mechanism considerably improves the retrieval quality, especially in the limit of low activity, including the storage capacity, the basins of attraction and the information content. The mutual information is shown to be the relevant parameter to study the retrieval quality of such low activity models. Numerical results confirm these observations.
Submitted 21 August, 2000; v1 submitted 5 June, 1998;
originally announced June 1998.