Search | arXiv e-print repository

Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

Authors: Mana Sakai, Ryo Karakida, Masaaki Imaizumi

Abstract: In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or… ▽ More In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard $1/\sqrt{n}$-scaling with $n$ dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. This limiting distribution exhibits non-Gaussianity from a hierarchical structure, being Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and accurate description of finite-head attentions. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime. △ Less

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.14102 [pdf, ps, other]

High-dimensional Nonparametric Contextual Bandit Problem

Authors: Shogo Iwazaki, Junpei Komiyama, Masaaki Imaizumi

Abstract: We consider the kernelized contextual bandit problem with a large feature space. This problem involves $K$ arms, and the goal of the forecaster is to maximize the cumulative rewards through learning the relationship between the contexts and the rewards. It serves as a general framework for various decision-making scenarios, such as personalized online advertising and recommendation systems. Kernel… ▽ More We consider the kernelized contextual bandit problem with a large feature space. This problem involves $K$ arms, and the goal of the forecaster is to maximize the cumulative rewards through learning the relationship between the contexts and the rewards. It serves as a general framework for various decision-making scenarios, such as personalized online advertising and recommendation systems. Kernelized contextual bandits generalize the linear contextual bandit problem and offers a greater modeling flexibility. Existing methods, when applied to Gaussian kernels, yield a trivial bound of $O(T)$ when we consider $Ω(\log T)$ feature dimensions. To address this, we introduce stochastic assumptions on the context distribution and show that no-regret learning is achievable even when the number of dimensions grows up to the number of samples. Furthermore, we analyze lenient regret, which allows a per-round regret of at most $Δ> 0$. We derive the rate of lenient regret in terms of $Δ$. △ Less

Submitted 20 May, 2025; originally announced May 2025.

Comments: 38 pages

arXiv:2505.13570 [pdf, ps, other]

Minimax Rates of Estimation for Optimal Transport Map between Infinite-Dimensional Spaces

Authors: Donlapark Ponnoprat, Masaaki Imaizumi

Abstract: We investigate the estimation of an optimal transport map between probability measures on an infinite-dimensional space and reveal its minimax optimal rate. Optimal transport theory defines distances within a space of probability measures, utilizing an optimal transport map as its key component. Estimating the optimal transport map from samples finds several applications, such as simulating dynami… ▽ More We investigate the estimation of an optimal transport map between probability measures on an infinite-dimensional space and reveal its minimax optimal rate. Optimal transport theory defines distances within a space of probability measures, utilizing an optimal transport map as its key component. Estimating the optimal transport map from samples finds several applications, such as simulating dynamics between probability measures and functional data analysis. However, some transport maps on infinite-dimensional spaces require exponential-order data for estimation, which undermines their applicability. In this paper, we investigate the estimation of an optimal transport map between infinite-dimensional spaces, focusing on optimal transport maps characterized by the notion of $γ$-smoothness. Consequently, we show that the order of the minimax risk is polynomial rate in the sample size even in the infinite-dimensional setup. We also develop an estimator whose estimation error matches the minimax optimal rate. With these results, we obtain a class of reasonably estimable optimal transport maps on infinite-dimensional spaces and a method for their estimation. Our experiments validate the theory and practical utility of our approach with application to functional data analysis. △ Less

Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: 60 pages, 5 figures

MSC Class: 62G05

arXiv:2505.04898 [pdf, other]

Precise gradient descent training dynamics for finite-width multi-layer neural networks

Authors: Qiyang Han, Masaaki Imaizumi

Abstract: In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the `finite-width proportional regime' where the sample size and feature dimension grow proportionally while the network width and depth remain bounded. Our non-asymptotic state evolution theory captur… ▽ More In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the `finite-width proportional regime' where the sample size and feature dimension grow proportionally while the network width and depth remain bounded. Our non-asymptotic state evolution theory captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights, and remains valid for non-Gaussian features. Our theory differs from existing neural tangent kernel (NTK), mean-field (MF) theories and tensor program (TP) in several key aspects. First, our theory operates in the finite-width regime whereas these existing theories are fundamentally infinite-width. Second, our theory allows weights to evolve from individual initializations beyond the lazy training regime, whereas NTK and MF are either frozen at or only weakly sensitive to initialization, and TP relies on special initialization schemes. Third, our theory characterizes both training and generalization errors for general multi-layer neural networks beyond the uniform convergence regime, whereas existing theories study generalization almost exclusively in two-layer settings. As a statistical application, we show that vanilla gradient descent can be augmented to yield consistent estimates of the generalization error at each iteration, which can be used to guide early stopping and hyperparameter tuning. As a further theoretical implication, we show that despite model misspecification, the model learned by gradient descent retains the structure of a single-index function with an effective signal determined by a linear combination of the true signal and the initialization. △ Less

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2503.23642 [pdf, other]

Learning a Single Index Model from Anisotropic Data with vanilla Stochastic Gradient Descent

Authors: Guillaume Braun, Minh Ha Quang, Masaaki Imaizumi

Abstract: We investigate the problem of learning a Single Index Model (SIM)- a popular model for studying the ability of neural networks to learn features - from anisotropic Gaussian inputs by training a neuron using vanilla Stochastic Gradient Descent (SGD). While the isotropic case has been extensively studied, the anisotropic case has received less attention and the impact of the covariance matrix on the… ▽ More We investigate the problem of learning a Single Index Model (SIM)- a popular model for studying the ability of neural networks to learn features - from anisotropic Gaussian inputs by training a neuron using vanilla Stochastic Gradient Descent (SGD). While the isotropic case has been extensively studied, the anisotropic case has received less attention and the impact of the covariance matrix on the learning dynamics remains unclear. For instance, Mousavi-Hosseini et al. (2023b) proposed a spherical SGD that requires a separate estimation of the data covariance matrix, thereby oversimplifying the influence of covariance. In this study, we analyze the learning dynamics of vanilla SGD under the SIM with anisotropic input data, demonstrating that vanilla SGD automatically adapts to the data's covariance structure. Leveraging these results, we derive upper and lower bounds on the sample complexity using a notion of effective dimension that is determined by the structure of the covariance matrix instead of the input data dimension. △ Less

Submitted 30 March, 2025; originally announced March 2025.

Journal ref: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025

arXiv:2503.06393 [pdf, ps, other]

Landscape computations for the edge of chaos in nonlinear dynamical systems

Authors: Motoki Nakata, Masaaki Imaizumi

Abstract: We propose a stochastic sampling approach to identify stability boundaries in general dynamical systems. The global landscape of Lyapunov exponent in multi-dimensional parameter space provides transition boundaries for stable/unstable trajectories, i.e., the edge of chaos. Despite its usefulness, it is generally difficult to derive analytically. In this study, we reveal the transition boundaries b… ▽ More We propose a stochastic sampling approach to identify stability boundaries in general dynamical systems. The global landscape of Lyapunov exponent in multi-dimensional parameter space provides transition boundaries for stable/unstable trajectories, i.e., the edge of chaos. Despite its usefulness, it is generally difficult to derive analytically. In this study, we reveal the transition boundaries by leveraging the Markov chain Monte Carlo algorithm coupled directly with the numerical integration of nonlinear differential/difference equation. It is demonstrated that a posteriori modeling for parameter subspace along the edge of chaos determines an inherent constrained dynamical system to flexibly activate or de-activate the chaotic tra jectories. △ Less

Submitted 8 March, 2025; originally announced March 2025.

Comments: 9 pages, 5 figures

Report number: RIKEN-iTHEMS-Report-24

arXiv:2411.01161 [pdf, other]

Federated Learning with Relative Fairness

Authors: Shogo Nakakita, Tatsuya Kaneko, Shinya Takamaeda-Yamazaki, Masaaki Imaizumi

Abstract: This paper proposes a federated learning framework designed to achieve \textit{relative fairness} for clients. Traditional federated learning frameworks typically ensure absolute fairness by guaranteeing minimum performance across all client subgroups. However, this approach overlooks disparities in model performance between subgroups. The proposed framework uses a minimax problem approach to mini… ▽ More This paper proposes a federated learning framework designed to achieve \textit{relative fairness} for clients. Traditional federated learning frameworks typically ensure absolute fairness by guaranteeing minimum performance across all client subgroups. However, this approach overlooks disparities in model performance between subgroups. The proposed framework uses a minimax problem approach to minimize relative unfairness, extending previous methods in distributionally robust optimization (DRO). A novel fairness index, based on the ratio between large and small losses among clients, is introduced, allowing the framework to assess and improve the relative fairness of trained models. Theoretical guarantees demonstrate that the framework consistently reduces unfairness. We also develop an algorithm, named \textsc{Scaff-PD-IA}, which balances communication and computational efficiency while maintaining minimax-optimal convergence rates. Empirical evaluations on real-world datasets confirm its effectiveness in maintaining model performance while reducing disparity. △ Less

Submitted 2 November, 2024; originally announced November 2024.

Comments: 43 pages

arXiv:2410.08709 [pdf, other]

Distillation of Discrete Diffusion through Dimensional Correlations

Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in image, sequential dependencies in la… ▽ More Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in image, sequential dependencies in language) mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require {\it many sampling steps}. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. The code used in the paper is available at https://github.com/sony/di4c . △ Less

Submitted 8 May, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

Comments: 39 pages, ICML 2025 accepted

arXiv:2407.08228 [pdf, ps, other]

Wasserstein $k$-Centers Clustering for Distributional Data

Authors: Ryo Okano, Masaaki Imaizumi

Abstract: We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes modes of variation of the data because the space of probability distributions lacks a vector space structure, preventing the application of existing methods devis… ▽ More We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes modes of variation of the data because the space of probability distributions lacks a vector space structure, preventing the application of existing methods devised for functional data. Our clustering method for distributional data takes account of the differences in both means and modes of variation of clusters, in the spirit of the $k$-centers clustering approach proposed for functional data. Specifically, we consider the space of distributions equipped with the Wasserstein metric and define geodesic modes of variation of distributional data using the notion of geodesic principal component analysis. Then, we utilize geodesic modes of clusters to predict the cluster membership of each distribution. We theoretically show the validity of the proposed clustering criterion by studying the probability of correct membership. Through a simulation study and real data application, we demonstrate that the proposed distributional clustering method can improve the quality of the cluster compared to conventional clustering algorithms. △ Less

Submitted 21 June, 2025; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: 40 pages

arXiv:2406.16032 [pdf, other]

Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

Authors: Naoki Yoshida, Shogo Nakakita, Masaaki Imaizumi

Abstract: We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updat… ▽ More We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerated parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions on a loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error using this method. As a proof technique, we approximate the distribution by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using the theoretical advance of the piece-wise deterministic Markov process (PDMP). △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: 28 pages

arXiv:2405.16819 [pdf, other]

Automatic Domain Adaptation by Transformers in In-Context Learning

Authors: Ryuichiro Hataya, Kota Matsui, Masaaki Imaizumi

Abstract: Selecting or designing an appropriate domain adaptation algorithm for a given problem remains challenging. This paper presents a Transformer model that can provably approximate and opt for domain adaptation methods for a given dataset in the in-context learning framework, where a foundation model performs new tasks without updating its parameters at test time. Specifically, we prove that Transform… ▽ More Selecting or designing an appropriate domain adaptation algorithm for a given problem remains challenging. This paper presents a Transformer model that can provably approximate and opt for domain adaptation methods for a given dataset in the in-context learning framework, where a foundation model performs new tasks without updating its parameters at test time. Specifically, we prove that Transformers can approximate instance-based and feature-based unsupervised domain adaptation algorithms and automatically select an algorithm suited for a given dataset. Numerical results indicate that in-context learning demonstrates an adaptive domain adaptation surpassing existing methods. △ Less

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2404.17812 [pdf, other]

High-Dimensional Single-Index Models: Link Estimation and Marginal Inference

Authors: Kazuma Sawaya, Yoshimasa Uematsu, Masaaki Imaizumi

Abstract: This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging… ▽ More This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging the information from the estimated link function, we propose more efficient estimators that are better aligned with the underlying model. Furthermore, we rigorously establish the asymptotic normality of each coordinate of the estimator. This provides a valid construction of confidence intervals and $p$-values for any finite collection of coordinates. Numerical experiments validate our theoretical results. △ Less

Submitted 27 April, 2024; originally announced April 2024.

Comments: 42 pages

arXiv:2401.17269 [pdf, other]

Effect of Weight Quantization on Learning Models by Typical Case Analysis

Authors: Shuhei Kashiwamura, Ayaka Sakata, Masaaki Imaizumi

Abstract: This paper examines the quantization methods used in large-scale data analysis models and their hyperparameter choices. The recent surge in data analysis scale has significantly increased computational resource requirements. To address this, quantizing model weights has become a prevalent practice in data analysis applications such as deep learning. Quantization is particularly vital for deploying… ▽ More This paper examines the quantization methods used in large-scale data analysis models and their hyperparameter choices. The recent surge in data analysis scale has significantly increased computational resource requirements. To address this, quantizing model weights has become a prevalent practice in data analysis applications such as deep learning. Quantization is particularly vital for deploying large models on devices with limited computational resources. However, the selection of quantization hyperparameters, like the number of bits and value range for weight quantization, remains an underexplored area. In this study, we employ the typical case analysis from statistical physics, specifically the replica method, to explore the impact of hyperparameters on the quantization of simple learning models. Our analysis yields three key findings: (i) an unstable hyperparameter phase, known as replica symmetry breaking, occurs with a small number of bits and a large quantization width; (ii) there is an optimal quantization width that minimizes error; and (iii) quantization delays the onset of overparameterization, helping to mitigate overfitting as indicated by the double descent phenomenon. We also discover that non-uniform quantization can enhance stability. Additionally, we develop an approximate message-passing algorithm to validate our theoretical results. △ Less

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2310.16819 [pdf, other]

CATE Lasso: Conditional Average Treatment Effect Estimation with High-Dimensional Linear Regression

Authors: Masahiro Kato, Masaaki Imaizumi

Abstract: In causal inference about two treatments, Conditional Average Treatment Effects (CATEs) play an important role as a quantity representing an individualized causal effect, defined as a difference between the expected outcomes of the two treatments conditioned on covariates. This study assumes two linear regression models between a potential outcome and covariates of the two treatments and defines C… ▽ More In causal inference about two treatments, Conditional Average Treatment Effects (CATEs) play an important role as a quantity representing an individualized causal effect, defined as a difference between the expected outcomes of the two treatments conditioned on covariates. This study assumes two linear regression models between a potential outcome and covariates of the two treatments and defines CATEs as a difference between the linear regression models. Then, we propose a method for consistently estimating CATEs even under high-dimensional and non-sparse parameters. In our study, we demonstrate that desirable theoretical properties, such as consistency, remain attainable even without assuming sparsity explicitly if we assume a weaker assumption called implicit sparsity originating from the definition of CATEs. In this assumption, we suppose that parameters of linear models in potential outcomes can be divided into treatment-specific and common parameters, where the treatment-specific parameters take difference values between each linear regression model, while the common parameters remain identical. Thus, in a difference between two linear regression models, the common parameters disappear, leaving only differences in the treatment-specific parameters. Consequently, the non-zero parameters in CATEs correspond to the differences in the treatment-specific parameters. Leveraging this assumption, we develop a Lasso regression method specialized for CATE estimation and present that the estimator is consistent. Finally, we confirm the soundness of the proposed method by simulation studies. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2307.06137 [pdf, other]

Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case

Authors: Ryo Okano, Masaaki Imaizumi

Abstract: Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce mod… ▽ More Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes. △ Less

Submitted 8 February, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: 34 pages

arXiv:2307.04042 [pdf, other]

Sup-Norm Convergence of Deep Neural Network Estimator for Nonparametric Regression by Adversarial Training

Authors: Masaaki Imaizumi

Abstract: We show the sup-norm convergence of deep neural network estimators with a novel adversarial training scheme. For the nonparametric regression problem, it has been shown that an estimator using deep neural networks can achieve better performances in the sense of the $L2$-norm. In contrast, it is difficult for the neural estimator with least-squares to achieve the sup-norm convergence, due to the de… ▽ More We show the sup-norm convergence of deep neural network estimators with a novel adversarial training scheme. For the nonparametric regression problem, it has been shown that an estimator using deep neural networks can achieve better performances in the sense of the $L2$-norm. In contrast, it is difficult for the neural estimator with least-squares to achieve the sup-norm convergence, due to the deep structure of neural network models. In this study, we develop an adversarial training scheme and investigate the sup-norm convergence of deep neural network estimators. First, we find that ordinary adversarial training makes neural estimators inconsistent. Second, we show that a deep neural network estimator achieves the optimal rate in the sup-norm sense by the proposed adversarial training with correction. We extend our adversarial training to general setups of a loss function and a data-generating function. Our experiments support the theoretical findings. △ Less

Submitted 8 July, 2023; originally announced July 2023.

Comments: 38 pages

arXiv:2306.11017 [pdf, ps, other]

High-dimensional Contextual Bandit Problem without Sparsity

Authors: Junpei Komiyama, Masaaki Imaizumi

Abstract: In this research, we investigate the high-dimensional linear contextual bandit problem where the number of features $p$ is greater than the budget $T$, or it may even be infinite. Differing from the majority of previous works in this field, we do not impose sparsity on the regression coefficients. Instead, we rely on recent findings on overparameterized models, which enables us to analyze the perf… ▽ More In this research, we investigate the high-dimensional linear contextual bandit problem where the number of features $p$ is greater than the budget $T$, or it may even be infinite. Differing from the majority of previous works in this field, we do not impose sparsity on the regression coefficients. Instead, we rely on recent findings on overparameterized models, which enables us to analyze the performance of the minimum-norm interpolating estimator when data distributions have small effective ranks. We propose an explore-then-commit (EtC) algorithm to address this problem and examine its performance. Through our analysis, we derive the optimal rate of the ETC algorithm in terms of $T$ and show that this rate can be achieved by balancing exploration and exploitation. Moreover, we introduce an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance. We assess the performance of the proposed algorithms through a series of simulations. △ Less

Submitted 25 June, 2025; v1 submitted 19 June, 2023; originally announced June 2023.

Journal ref: Advances in Neural Information Processing Systems, 36, 49416-49427, 2023

arXiv:2305.15754 [pdf, other]

Bayesian Analysis for Over-parameterized Linear Model via Effective Spectra

Authors: Tomoya Wakayama, Masaaki Imaizumi

Abstract: In high-dimensional Bayesian statistics, various methods have been developed, including prior distributions that induce parameter sparsity to handle many parameters. Yet, these approaches often overlook the rich spectral structure of the covariate matrix, which can be crucial when true signals are not sparse. To address this gap, we introduce a data-adaptive Gaussian prior whose covariance is alig… ▽ More In high-dimensional Bayesian statistics, various methods have been developed, including prior distributions that induce parameter sparsity to handle many parameters. Yet, these approaches often overlook the rich spectral structure of the covariate matrix, which can be crucial when true signals are not sparse. To address this gap, we introduce a data-adaptive Gaussian prior whose covariance is aligned with the leading eigenvectors of the sample covariance. This prior design targets the data's intrinsic complexity rather than its ambient dimension by concentrating the parameter search along principal data directions. We establish contraction rates of the corresponding posterior distribution, which reveal how the mass in the spectrum affects the prediction error bounds. Furthermore, we derive a truncated Gaussian approximation to the posterior (i.e., a Bernstein-von Mises-type result), which allows for uncertainty quantification with a reduced computational burden. Our findings demonstrate that Bayesian methods leveraging spectral information of the data are effective for estimation in non-sparse, high-dimensional settings. △ Less

Submitted 5 May, 2025; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 46 pages

arXiv:2302.02988 [pdf, other]

Asymptotically Optimal Fixed-Budget Best Arm Identification with Variance-Dependent Bounds

Authors: Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, Toru Kitagawa

Abstract: We investigate the problem of fixed-budget best arm identification (BAI) for minimizing expected simple regret. In an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and observes the outcome of the drawn arm. After the experiment, the decision maker recommends the treatment arm with the highest expected outcome. We evaluate the decision based o… ▽ More We investigate the problem of fixed-budget best arm identification (BAI) for minimizing expected simple regret. In an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and observes the outcome of the drawn arm. After the experiment, the decision maker recommends the treatment arm with the highest expected outcome. We evaluate the decision based on the expected simple regret, which is the difference between the expected outcomes of the best arm and the recommended arm. Due to inherent uncertainty, we evaluate the regret using the minimax criterion. First, we derive asymptotic lower bounds for the worst-case expected simple regret, which are characterized by the variances of potential outcomes (leading factor). Based on the lower bounds, we propose the Two-Stage (TS)-Hirano-Imbens-Ridder (HIR) strategy, which utilizes the HIR estimator (Hirano et al., 2003) in recommending the best arm. Our theoretical analysis shows that the TS-HIR strategy is asymptotically minimax optimal, meaning that the leading factor of its worst-case expected simple regret matches our derived worst-case lower bound. Additionally, we consider extensions of our method, such as the asymptotic optimality for the probability of misidentification. Finally, we validate the proposed method's effectiveness through simulations. △ Less

Submitted 12 July, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

arXiv:2209.07330 [pdf, other]

Best Arm Identification with Contextual Information under a Small Gap

Authors: Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, Toru Kitagawa

Abstract: We study the best-arm identification (BAI) problem with a fixed budget and contextual (covariate) information. In each round of an adaptive experiment, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, which is a treatment arm with the maximal expected reward marginalized over the contextua… ▽ More We study the best-arm identification (BAI) problem with a fixed budget and contextual (covariate) information. In each round of an adaptive experiment, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, which is a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. In this study, we consider a class of nonparametric bandit models that converge to location-shift models when the gaps go to zero. First, we derive lower bounds of the misidentification probability for a certain class of strategies and bandit models (probabilistic models of potential outcomes) under a small-gap regime. A small-gap regime is a situation where gaps of the expected rewards between the best and suboptimal treatment arms go to zero, which corresponds to one of the worst cases in identifying the best treatment arm. We then develop the ``Random Sampling (RS)-Augmented Inverse Probability weighting (AIPW) strategy,'' which is asymptotically optimal in the sense that the probability of misidentification under the strategy matches the lower bound when the budget goes to infinity in the small-gap regime. The RS-AIPW strategy consists of the RS rule tracking a target sample allocation ratio and the recommendation rule using the AIPW estimator. △ Less

Submitted 4 January, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: For the sake of completeness, we show a part of the results of Kato et al. (arXiv:2201.04469). arXiv admin note: text overlap with arXiv:2201.04469

arXiv:2204.08369 [pdf, ps, other]

Benign Overfitting in Time Series Linear Models with Over-Parameterization

Authors: Shogo Nakakita, Masaaki Imaizumi

Abstract: The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data, which may not be sparse; however, existing results rely on the assumption of sample independence. In this study, we analyze a linear regression model with dependent time-series data… ▽ More The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data, which may not be sparse; however, existing results rely on the assumption of sample independence. In this study, we analyze a linear regression model with dependent time-series data in an over-parameterized setting. We consider an estimator using interpolation and develop a theory for the excess risk of the estimator. Then, we derive non-asymptotic risk bounds for the estimator for cases with dependent data. This analysis reveals that the coherence of the temporal covariance plays a key role; the risk bound is influenced by the product of temporal covariance matrices at different time steps. Moreover, we show the convergence rate of the risk bound and demonstrate that it is also influenced by the coherence of the temporal covariance. Finally, we provide several examples of specific dependent processes applicable to our setting. △ Less

Submitted 13 March, 2025; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: Accepted at Bernoulli

arXiv:2202.05495 [pdf, other]

Inference for Projection-Based Wasserstein Distances on Finite Spaces

Authors: Ryo Okano, Masaaki Imaizumi

Abstract: The Wasserstein distance is a distance between two probability distributions and has recently gained increasing popularity in statistics and machine learning, owing to its attractive properties. One important approach to extending this distance is using low-dimensional projections of distributions to avoid a high computational cost and the curse of dimensionality in empirical estimation, such as t… ▽ More The Wasserstein distance is a distance between two probability distributions and has recently gained increasing popularity in statistics and machine learning, owing to its attractive properties. One important approach to extending this distance is using low-dimensional projections of distributions to avoid a high computational cost and the curse of dimensionality in empirical estimation, such as the sliced Wasserstein or max-sliced Wasserstein distances. Despite their practical success in machine learning tasks, the availability of statistical inferences for projection-based Wasserstein distances is limited owing to the lack of distributional limit results. In this paper, we consider distances defined by integrating or maximizing Wasserstein distances between low-dimensional projections of two probability distributions. Then we derive limit distributions regarding these distances when the two distributions are supported on finite points. We also propose a bootstrap procedure to estimate quantiles of limit distributions from data. This facilitates asymptotically exact interval estimation and hypothesis testing for these distances. Our theoretical results are based on the arguments of Sommerfeld and Munk (2018) for deriving distributional limits regarding the original Wasserstein distance on finite spaces and the theory of sensitivity analysis in nonlinear programming. Finally, we conduct numerical experiments to illustrate the theoretical results and demonstrate the applicability of our inferential methods to real data analysis. △ Less

Submitted 11 February, 2022; originally announced February 2022.

arXiv:2202.05245 [pdf, ps, other]

Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

Authors: Masahiro Kato, Masaaki Imaizumi

Abstract: We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE), with linear regression models. As the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is that suspicions have been raised that the large-scale models are prone to overfitting to observations with sampl… ▽ More We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE), with linear regression models. As the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is that suspicions have been raised that the large-scale models are prone to overfitting to observations with sample selection, hence the large models may not be suitable for causal prediction. In this study, to resolve the suspicious, we investigate on the validity of causal inference methods for overparameterized models, by applying the recent theory of benign overfitting (Bartlett et al., 2020). Specifically, we consider samples whose distribution switches depending on an assignment rule, and study the prediction of CATE with linear models whose dimension diverges to infinity. We focus on two methods: the T-learner, which based on a difference between separately constructed estimators with each treatment group, and the inverse probability weight (IPW)-learner, which solves another regression problem approximated by a propensity score. In both methods, the estimator consists of interpolators that fit the samples perfectly. As a result, we show that the T-learner fails to achieve the consistency except the random assignment, while the IPW-learner converges the risk to zero if the propensity score is known. This difference stems from that the T-learner is unable to preserve eigenspaces of the covariances, which is necessary for benign overfitting in the overparameterized setting. Our result provides new insights into the usage of causal inference methods in the overparameterizated setting, in particular, doubly robust estimators. △ Less

Submitted 11 February, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:1906.11300 by other authors

arXiv:2201.13127 [pdf, other]

Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics

Authors: Masahiro Kato, Masaaki Imaizumi, Kentaro Minami

Abstract: This paper provides a unified perspective for the Kullback-Leibler (KL)-divergence and the integral probability metrics (IPMs) from the perspective of maximum likelihood density-ratio estimation (DRE). Both the KL-divergence and the IPMs are widely used in various fields in applications such as generative modeling. However, a unified understanding of these concepts has still been unexplored. In th… ▽ More This paper provides a unified perspective for the Kullback-Leibler (KL)-divergence and the integral probability metrics (IPMs) from the perspective of maximum likelihood density-ratio estimation (DRE). Both the KL-divergence and the IPMs are widely used in various fields in applications such as generative modeling. However, a unified understanding of these concepts has still been unexplored. In this paper, we show that the KL-divergence and the IPMs can be represented as maximal likelihoods differing only by sampling schemes, and use this result to derive a unified form of the IPMs and a relaxed estimation method. To develop the estimation problem, we construct an unconstrained maximum likelihood estimator to perform DRE with a stratified sampling scheme. We further propose a novel class of probability divergences, called the Density Ratio Metrics (DRMs), that interpolates the KL-divergence and the IPMs. In addition to these findings, we also introduce some applications of the DRMs, such as DRE and generative adversarial networks. In experiments, we validate the effectiveness of our proposed methods. △ Less

Submitted 31 January, 2022; originally announced January 2022.

arXiv:2201.04545 [pdf, other]

doi 10.1109/TIT.2022.3215088

On generalization bounds for deep networks based on loss surface implicit regularization

Authors: Masaaki Imaizumi, Johannes Schmidt-Hieber

Abstract: The classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. While previous work focuses on the implicit regularization induced by sto… ▽ More The classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. While previous work focuses on the implicit regularization induced by stochastic gradient descent (SGD), we study here how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces another form of implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. △ Less

Submitted 16 October, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: To appear in IEEE Transaction on Information Theory

arXiv:2201.04469 [pdf, other]

Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

Authors: Masahiro Kato, Kaito Ariu, Masaaki Imaizumi, Masahiro Nomura, Chao Qin

Abstract: We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First,… ▽ More We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero. △ Less

Submitted 28 December, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

arXiv:2112.00213 [pdf, other]

Minimax Analysis for Inverse Risk in Nonparametric Planer Invertible Regression

Authors: Akifumi Okuno, Masaaki Imaizumi

Abstract: We study a minimax risk of estimating inverse functions on a plane, while keeping an estimator is also invertible. Learning invertibility from data and exploiting an invertible estimator are used in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, analysis of the efficiency of these… ▽ More We study a minimax risk of estimating inverse functions on a plane, while keeping an estimator is also invertible. Learning invertibility from data and exploiting an invertible estimator are used in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, analysis of the efficiency of these methods is still under development. In this study, we study a minimax risk for estimating invertible bi-Lipschitz functions on a square in a $2$-dimensional plane. We first introduce two types of $L^2$-risks to evaluate an estimator which preserves invertibility. Then, we derive lower and upper rates for minimax values for the risks associated with inverse functions. For the derivation, we exploit a representation of invertible functions using level-sets. Specifically, to obtain the upper rate, we develop an estimator asymptotically almost everywhere invertible, whose risk attains the derived minimax lower rate up to logarithmic factors. The derived minimax rate corresponds to that of the non-invertible bi-Lipschitz function, which shows that the invertibility does not reduce the complexity of the estimation problem in terms of the rate. % the minimax rate, similar to other shape constraints. △ Less

Submitted 25 December, 2023; v1 submitted 30 November, 2021; originally announced December 2021.

Comments: 34 pages, 34 figures, accepted to Electronic Journal of Statistics

arXiv:2108.01312 [pdf, other]

Learning Causal Models from Conditional Moment Restrictions by Importance Weighting

Authors: Masahiro Kato, Masaaki Imaizumi, Kenichiro McAlinn, Haruo Kakehi, Shota Yasui

Abstract: We consider learning causal relationships under conditional moment restrictions. Unlike causal inference under unconditional moment restrictions, conditional moment restrictions pose serious challenges for causal inference, especially in high-dimensional settings. To address this issue, we propose a method that transforms conditional moment restrictions to unconditional moment restrictions through… ▽ More We consider learning causal relationships under conditional moment restrictions. Unlike causal inference under unconditional moment restrictions, conditional moment restrictions pose serious challenges for causal inference, especially in high-dimensional settings. To address this issue, we propose a method that transforms conditional moment restrictions to unconditional moment restrictions through importance weighting, using a conditional density ratio estimator. Using this transformation, we successfully estimate nonparametric functions defined under conditional moment restrictions. Our proposed framework is general and can be applied to a wide range of methods, including neural networks. We analyze the estimation error, providing theoretical support for our proposed method. In experiments, we confirm the soundness of our proposed method. △ Less

Submitted 28 September, 2022; v1 submitted 3 August, 2021; originally announced August 2021.

arXiv:2103.00500 [pdf, other]

Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks

Authors: Ryumei Nakada, Masaaki Imaizumi

Abstract: We investigate the asymptotic risk of a general class of overparameterized likelihood models, including deep models. The recent empirical success of large-scale models has motivated several theoretical studies to investigate a scenario wherein both the number of samples, $n$, and parameters, $p$, diverge to infinity and derive an asymptotic risk at the limit. However, these theorems are only valid… ▽ More We investigate the asymptotic risk of a general class of overparameterized likelihood models, including deep models. The recent empirical success of large-scale models has motivated several theoretical studies to investigate a scenario wherein both the number of samples, $n$, and parameters, $p$, diverge to infinity and derive an asymptotic risk at the limit. However, these theorems are only valid for linear-in-feature models, such as generalized linear regression, kernel regression, and shallow neural networks. Hence, it is difficult to investigate a wider class of nonlinear models, including deep neural networks with three or more layers. In this study, we consider a likelihood maximization problem without the model constraints and analyze the upper bound of an asymptotic risk of an estimator with penalization. Technically, we combine a property of the Fisher information matrix with an extended Marchenko-Pastur law and associate the combination with empirical process techniques. The derived bound is general, as it describes both the double descent and the regularized risk curves, depending on the penalization. Our results are valid without the linear-in-feature constraints on models and allow us to derive the general spectral distributions of a Fisher information matrix from the likelihood. We demonstrate that several explicit models, such as parallel deep neural networks, ensemble learning, and residual networks, are in agreement with our theory. This result indicates that even large and deep models have a small asymptotic risk if they exhibit a specific structure, such as divisibility. To verify this finding, we conduct a real-data experiment with parallel deep neural networks. Our results expand the applicability of the asymptotic risk analysis, and may also contribute to the understanding and application of deep learning. △ Less

Submitted 15 March, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: 36 pages

arXiv:2102.03609 [pdf, other]

Understanding Higher-order Structures in Evolving Graphs: A Simplicial Complex based Kernel Estimation Approach

Authors: Manohar Kaul, Masaaki Imaizumi

Abstract: Dynamic graphs are rife with higher-order interactions, such as co-authorship relationships and protein-protein interactions in biological networks, that naturally arise between more than two nodes at once. In spite of the ubiquitous presence of such higher-order interactions, limited attention has been paid to the higher-order counterpart of the popular pairwise link prediction problem. Existing… ▽ More Dynamic graphs are rife with higher-order interactions, such as co-authorship relationships and protein-protein interactions in biological networks, that naturally arise between more than two nodes at once. In spite of the ubiquitous presence of such higher-order interactions, limited attention has been paid to the higher-order counterpart of the popular pairwise link prediction problem. Existing higher-order structure prediction methods are mostly based on heuristic feature extraction procedures, which work well in practice but lack theoretical guarantees. Such heuristics are primarily focused on predicting links in a static snapshot of the graph. Moreover, these heuristic-based methods fail to effectively utilize and benefit from the knowledge of latent substructures already present within the higher-order structures. In this paper, we overcome these obstacles by capturing higher-order interactions succinctly as \textit{simplices}, model their neighborhood by face-vectors, and develop a nonparametric kernel estimator for simplices that views the evolving graph from the perspective of a time process (i.e., a sequence of graph snapshots). Our method substantially outperforms several baseline higher-order prediction methods. As a theoretical achievement, we prove the consistency and asymptotic normality in terms of the Wasserstein distance of our estimator using Stein's method. △ Less

Submitted 6 February, 2021; originally announced February 2021.

arXiv:2102.02981 [pdf, ps, other]

Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Authors: Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, Tengyang Xie

Abstract: We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights… ▽ More We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions, characterized by the critical inequality \citep{bartlett2005}. Based on this result, we analyze convergence rates for OPE. In particular, we introduce novel alternative completeness conditions under which OPE is feasible and we present the first finite-sample result with first-order efficiency in non-tabular environments, i.e., having the minimal coefficient in the leading term. △ Less

Submitted 24 July, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: Under Review

arXiv:2011.02256 [pdf, other]

Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces

Authors: Masaaki Imaizumi, Kenji Fukumizu

Abstract: We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fil… ▽ More We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure. △ Less

Submitted 8 February, 2022; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Complete version of arXiv:1802.04474

arXiv:1910.06552 [pdf, other]

Improved Generalization Bounds of Group Invariant / Equivariant Deep Networks via Quotient Feature Spaces

Authors: Akiyoshi Sannai, Masaaki Imaizumi, Makoto Kawano

Abstract: Numerous invariant (or equivariant) neural networks have succeeded in handling invariant data such as point clouds and graphs. However, a generalization theory for the neural networks has not been well developed, because several essential factors for the theory, such as network size and margin distribution, are not deeply connected to the invariance and equivariance. In this study, we develop a no… ▽ More Numerous invariant (or equivariant) neural networks have succeeded in handling invariant data such as point clouds and graphs. However, a generalization theory for the neural networks has not been well developed, because several essential factors for the theory, such as network size and margin distribution, are not deeply connected to the invariance and equivariance. In this study, we develop a novel generalization error bound for invariant and equivariant deep neural networks. To describe the effect of invariance and equivariance on generalization, we develop a notion of a \textit{quotient feature space}, which measures the effect of group actions for the properties. Our main result proves that the volume of quotient feature spaces can describe the generalization error. Furthermore, the bound shows that the invariance and equivariance significantly improve the leading term of the bound. We apply our result to specific invariant and equivariant networks, such as DeepSets (Zaheer et al. (2017)), and show that their generalization bound is considerably improved by $\sqrt{n!}$, where $n!$ is the number of permutations. We also discuss the expressive power of invariant DNNs and show that they can achieve an optimal approximation rate. Our experimental result supports our theoretical claims. △ Less

Submitted 19 June, 2021; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: Old title: "Improved Generalization Bound of Permutation Invariant Deep Neural Networks"

arXiv:1907.02177 [pdf, other]

Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality

Authors: Ryumei Nakada, Masaaki Imaizumi

Abstract: In this study, we prove that an intrinsic low dimensionality of covariates is the main factor that determines the performance of deep neural networks (DNNs). DNNs generally provide outstanding empirical performance. Hence, numerous studies have actively investigated the theoretical properties of DNNs to understand their underlying mechanisms. In particular, the behavior of DNNs in terms of high-di… ▽ More In this study, we prove that an intrinsic low dimensionality of covariates is the main factor that determines the performance of deep neural networks (DNNs). DNNs generally provide outstanding empirical performance. Hence, numerous studies have actively investigated the theoretical properties of DNNs to understand their underlying mechanisms. In particular, the behavior of DNNs in terms of high-dimensional data is one of the most critical questions. However, this issue has not been sufficiently investigated from the aspect of covariates, although high-dimensional data have practically low intrinsic dimensionality. In this study, we derive bounds for an approximation error and a generalization error regarding DNNs with intrinsically low dimensional covariates. We apply the notion of the Minkowski dimension and develop a novel proof technique. Consequently, we show that convergence rates of the errors by DNNs do not depend on the nominal high dimensionality of data, but on its lower intrinsic dimension. We further prove that the rate is optimal in the minimax sense. We identify an advantage of DNNs by showing that DNNs can handle a broader class of intrinsic low dimensional data than other adaptive estimators. Finally, we conduct a numerical simulation to validate the theoretical results. △ Less

Submitted 17 September, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: 38 pages

Journal ref: Journal of Machine Learning Research, 21(174), 2020

arXiv:1901.09541 [pdf, other]

On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis

Authors: Kohei Hayashi, Masaaki Imaizumi, Yuichi Yoshida

Abstract: In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of training data, we show provable guarantees on the accuracy of the predictive mean/variance and its generalization ability. For analysis, we consider embedding kernel matrices into graphons, which encaps… ▽ More In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of training data, we show provable guarantees on the accuracy of the predictive mean/variance and its generalization ability. For analysis, we consider embedding kernel matrices into graphons, which encapsulate the difference of the sample size and enables us to evaluate the approximation and generalization errors in a unified manner. The experimental results show that the subsampling approximation achieves a better trade-off regarding accuracy and runtime than the Nyström and random Fourier expansion methods. △ Less

Submitted 28 January, 2019; originally announced January 2019.

arXiv:1802.04474 [pdf, other]

Deep Neural Networks Learn Non-Smooth Functions Effectively

Authors: Masaaki Imaizumi, Kenji Fukumizu

Abstract: We theoretically discuss why deep neural networks (DNNs) performs better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding its mechanism is still a challenging problem. From an aspect of the statistical theory, it is known many standard methods att… ▽ More We theoretically discuss why deep neural networks (DNNs) performs better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding its mechanism is still a challenging problem. From an aspect of the statistical theory, it is known many standard methods attain the optimal rate of generalization errors for smooth functions in large sample asymptotics, and thus it has not been straightforward to find theoretical advantages of DNNs. This paper fills this gap by considering learning of a certain class of non-smooth functions, which was not covered by the previous theory. We derive the generalization error of estimators by DNNs with a ReLU activation, and show that convergence rates of the generalization by DNNs are almost optimal to estimate the non-smooth functions, while some of the popular models do not attain the optimal rate. In addition, our theoretical result provides guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments to support the theoretical results. △ Less

Submitted 7 July, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

Comments: 31 pages

arXiv:1708.00132 [pdf, other]

On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm

Authors: Masaaki Imaizumi, Takanori Maehara, Kohei Hayashi

Abstract: Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite its advantage, we face two crucial limitations when we apply the TT decomposition to machine learning problems: the lack of statistical theory and of scalable algorithms. In this paper, we address the limitations. First, we introduce a convex relaxation of the TT decomposition problem and de… ▽ More Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite its advantage, we face two crucial limitations when we apply the TT decomposition to machine learning problems: the lack of statistical theory and of scalable algorithms. In this paper, we address the limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. Next, we develop an alternating optimization method with a randomization technique, in which the time complexity is as efficient as the space complexity is. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method with a real higher-order tensor. △ Less

Submitted 1 August, 2017; v1 submitted 31 July, 2017; originally announced August 2017.

Comments: 24 pages

arXiv:1707.09688 [pdf, ps, other]

Consistent Nonparametric Different-Feature Selection via the Sparsest $k$-Subgraph Problem

Authors: Satoshi Hara, Takayuki Katsuki, Hiroki Yanagisawa, Masaaki Imaizumi, Takafumi Ono, Ryo Okamoto, Shigeki Takeuchi

Abstract: Two-sample feature selection is the problem of finding features that describe a difference between two probability distributions, which is a ubiquitous problem in both scientific and engineering studies. However, existing methods have limited applicability because of their restrictive assumptions on data distributoins or computational difficulty. In this paper, we resolve these difficulties by for… ▽ More Two-sample feature selection is the problem of finding features that describe a difference between two probability distributions, which is a ubiquitous problem in both scientific and engineering studies. However, existing methods have limited applicability because of their restrictive assumptions on data distributoins or computational difficulty. In this paper, we resolve these difficulties by formulating the problem as a sparsest $k$-subgraph problem. The proposed method is nonparametric and does not assume any specific parametric models on the data distributions. We show that the proposed method is computationally efficient and does not require any extra computation for model selection. Moreover, we prove that the proposed method provides a consistent estimator of features under mild conditions. Our experimental results show that the proposed method outperforms the current method with regard to both accuracy and computation time. △ Less

Submitted 31 July, 2017; v1 submitted 30 July, 2017; originally announced July 2017.

Comments: 32 pages

arXiv:1506.06722 [pdf, ps, other]

Approximation method for discrete Markov decision models with a large state space

Authors: Masaaki Imaizumi

Abstract: To solve discrete Markov decision models with a large number of dimensions is always difficult (and at times, impossible), because size of state space and computation cost increases exponentially with the number of dimensions. This phenomenon is called "The Curse of Dimensionality," and it prevents us from using models with many state variables. To overcome this problem, we propose a new approxima… ▽ More To solve discrete Markov decision models with a large number of dimensions is always difficult (and at times, impossible), because size of state space and computation cost increases exponentially with the number of dimensions. This phenomenon is called "The Curse of Dimensionality," and it prevents us from using models with many state variables. To overcome this problem, we propose a new approximation method, named statistical least square temporal difference (SLSTD) method, that can solve discrete Markov decision models with large state space. SLSTD method approximate the value function on the whole state space at once, and obtain optimal approximation weight by minimizing temporal residuals of the Bellman equation. Furthermore, a stochastic approximation method enables us to optimize the problem with low computational cost. SLSTD method can solve DMD models with large state space more easily than other existing methods, and in some cases, reduces the computation time by over 99 percent. We also show the parameter estimated by SLSTD is consistent and asymptotically normally distributed. △ Less

Submitted 11 November, 2016; v1 submitted 22 June, 2015; originally announced June 2015.

Comments: 21 pages

arXiv:1506.05967 [pdf, ps, other]

Doubly Decomposing Nonparametric Tensor Regression

Authors: Masaaki Imaizumi, Kohei Hayashi

Abstract: Nonparametric extension of tensor regression is proposed. Nonlinearity in a high-dimensional tensor space is broken into simple local functions by incorporating low-rank tensor decomposition. Compared to naive nonparametric approaches, our formulation considerably improves the convergence rate of estimation while maintaining consistency with the same function class under specific conditions. To es… ▽ More Nonparametric extension of tensor regression is proposed. Nonlinearity in a high-dimensional tensor space is broken into simple local functions by incorporating low-rank tensor decomposition. Compared to naive nonparametric approaches, our formulation considerably improves the convergence rate of estimation while maintaining consistency with the same function class under specific conditions. To estimate local functions, we develop a Bayesian estimator with the Gaussian process prior. Experimental results show its theoretical properties and high performance in terms of predicting a summary statistic of a real complex network. △ Less

Submitted 8 March, 2016; v1 submitted 19 June, 2015; originally announced June 2015.

Comments: 21 pages

Showing 1–40 of 40 results for author: Imaizumi, M