-
From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning
Authors:
Noa Rubin,
Kirsten Fischer,
Javed Lindner,
David Dahmen,
Inbar Seroussi,
Zohar Ringel,
Michael Krämer,
Moritz Helias
Abstract:
Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel…
▽ More
Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these two views. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. However, for linear and non-linear networks, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
△ Less
Submitted 28 May, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Field theory for optimal signal propagation in ResNets
Authors:
Kirsten Fischer,
David Dahmen,
Moritz Helias
Abstract:
Residual networks have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range…
▽ More
Residual networks have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters yet need to be understood. For feed-forward networks, finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. We here derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling for the residual branch. We derive analytical expressions for the response function, a measure for the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a theoretical framework to study ResNets at finite size.
△ Less
Submitted 26 August, 2024; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Decomposing neural networks as mappings of correlation functions
Authors:
Kirsten Fischer,
Alexandre René,
Christian Keup,
Moritz Layer,
David Dahmen,
Moritz Helias
Abstract:
Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non…
▽ More
Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions. This allows us to identify essential statistics in the data, as well as different information representations that can be used by neural networks. Applied to an XOR task and to MNIST, we show that correlations up to second order predominantly capture the information processing in the internal layers, while the input layer also extracts higher-order correlations from the data. This analysis provides a quantitative and explainable perspective on classification.
△ Less
Submitted 1 December, 2022; v1 submitted 10 February, 2022;
originally announced February 2022.
-
On Feature Relevance Uncertainty: A Monte Carlo Dropout Sampling Approach
Authors:
Kai Fischer,
Jonas Schneider
Abstract:
Understanding decisions made by neural networks is key for the deployment of intelligent systems in real world applications. However, the opaque decision making process of these systems is a disadvantage where interpretability is essential. Many feature-based explanation techniques have been introduced over the last few years in the field of machine learning to better understand decisions made by…
▽ More
Understanding decisions made by neural networks is key for the deployment of intelligent systems in real world applications. However, the opaque decision making process of these systems is a disadvantage where interpretability is essential. Many feature-based explanation techniques have been introduced over the last few years in the field of machine learning to better understand decisions made by neural networks and have become an important component to verify their reasoning capabilities. However, existing methods do not allow statements to be made about the uncertainty regarding a feature's relevance for the prediction. In this paper, we introduce Monte Carlo Relevance Propagation (MCRP) for feature relevance uncertainty estimation. A simple but powerful method based on Monte Carlo estimation of the feature relevance distribution to compute feature relevance uncertainty scores that allow a deeper understanding of a neural network's perception and reasoning.
△ Less
Submitted 11 April, 2023; v1 submitted 4 August, 2020;
originally announced August 2020.
-
Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs
Authors:
Liu Yang,
Sean Treichler,
Thorsten Kurth,
Keno Fischer,
David Barajas-Solano,
Josh Romero,
Valentin Churavy,
Alexandre Tartakovsky,
Michael Houston,
Prabhat,
George Karniadakis
Abstract:
Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length s…
▽ More
Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the Hanford Site require training a computationally intensive GAN model to thousands of dimensions. We develop a hierarchical scheme for exploiting domain parallelism, map discriminators and generators to multiple GPUs, and employ efficient communication schemes to ensure training stability and convergence. We developed a highly optimized implementation of this scheme that scales to 27,500 NVIDIA Volta GPUs and 4584 nodes on the Summit supercomputer with a 93.1% scaling efficiency, achieving peak and sustained half-precision rates of 1228 PF/s and 1207 PF/s.
△ Less
Submitted 28 October, 2019;
originally announced October 2019.
-
Meta-Learning Acquisition Functions for Transfer Learning in Bayesian Optimization
Authors:
Michael Volpp,
Lukas P. Fröhlich,
Kirsten Fischer,
Andreas Doerr,
Stefan Falkner,
Frank Hutter,
Christian Daniel
Abstract:
Transferring knowledge across tasks to improve data-efficiency is one of the open key challenges in the field of global black-box optimization. Readily available algorithms are typically designed to be universal optimizers and, therefore, often suboptimal for specific tasks. We propose a novel transfer learning method to obtain customized optimizers within the well-established framework of Bayesia…
▽ More
Transferring knowledge across tasks to improve data-efficiency is one of the open key challenges in the field of global black-box optimization. Readily available algorithms are typically designed to be universal optimizers and, therefore, often suboptimal for specific tasks. We propose a novel transfer learning method to obtain customized optimizers within the well-established framework of Bayesian optimization, allowing our algorithm to utilize the proven generalization capabilities of Gaussian processes. Using reinforcement learning to meta-train an acquisition function (AF) on a set of related tasks, the proposed method learns to extract implicit structural information and to exploit it for improved data-efficiency. We present experiments on a simulation-to-real transfer task as well as on several synthetic functions and on two hyperparameter search problems. The results show that our algorithm (1) automatically identifies structural properties of objective functions from available source tasks or simulations, (2) performs favourably in settings with both scarse and abundant source data, and (3) falls back to the performance level of general AFs if no particular structure is present.
△ Less
Submitted 14 February, 2020; v1 submitted 4 April, 2019;
originally announced April 2019.
-
Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs
Authors:
Keno Fischer,
Elliot Saba
Abstract:
Google's Cloud TPUs are a promising new hardware architecture for machine learning workloads. They have powered many of Google's milestone machine learning achievements in recent years. Google has now made TPUs available for general use on their cloud platform and as of very recently has opened them up further to allow use by non-TensorFlow frontends. We describe a method and implementation for of…
▽ More
Google's Cloud TPUs are a promising new hardware architecture for machine learning workloads. They have powered many of Google's milestone machine learning achievements in recent years. Google has now made TPUs available for general use on their cloud platform and as of very recently has opened them up further to allow use by non-TensorFlow frontends. We describe a method and implementation for offloading suitable sections of Julia programs to TPUs via this new API and the Google XLA compiler. Our method is able to completely fuse the forward pass of a VGG19 model expressed as a Julia program into a single TPU executable to be offloaded to the device. Our method composes well with existing compiler-based automatic differentiation techniques on Julia code, and we are thus able to also automatically obtain the VGG19 backwards pass and similarly offload it to the TPU. Targeting TPUs using our compiler, we are able to evaluate the VGG19 forward pass on a batch of 100 images in 0.23s which compares favorably to the 52.4s required for the original model on the CPU. Our implementation is less than 1000 lines of Julia, with no TPU specific changes made to the core Julia compiler or any other Julia packages.
△ Less
Submitted 23 October, 2018;
originally announced October 2018.
-
Assessing the association between pre-course metrics of student preparation and student performance in introductory statistics: Results from early data on simulation-based inference vs. nonsimulation based inference
Authors:
Nathan Tintle,
Jake Clark,
Karen Fischer,
Beth Chance,
George Cobb,
Soma Roy,
Todd Swanson,
Jill VanderStoep
Abstract:
The recent simulation-based inference (SBI) movement in algebra-based introductory statistics courses (Stat 101) has provided preliminary evidence of improved student conceptual understanding and retention. However, little is known about whether these positive effects are preferentially distributed across types of students entering the course. We consider how two metrics of Stat 101 student prepar…
▽ More
The recent simulation-based inference (SBI) movement in algebra-based introductory statistics courses (Stat 101) has provided preliminary evidence of improved student conceptual understanding and retention. However, little is known about whether these positive effects are preferentially distributed across types of students entering the course. We consider how two metrics of Stat 101 student preparation (pre-course performance on concept inventory and math ACT score) may or may not be associated with end of course student performance on conceptual inventories. Students across all preparation levels tended to show improvement in Stat 101, but more improvement was observed across all student preparation levels in early versions of a SBI course. Furthermore, students' gains tended to be similar regardless of whether students entered the course with more preparation or less. Recent data on a sample of students using a current version of an SBI course showed similar results, though direct comparison with non-SBI students was not possible. Overall, our analysis provides additional evidence that SBI curricula are effective at improving students' conceptual understanding of statistical ideas post-course regardless student preparation. Further work is needed to better understand nuances of student improvement based on other student demographics, prior coursework, as well as instructor and institutional variables.
△ Less
Submitted 26 February, 2018;
originally announced February 2018.