Minimax Rates of Estimation for Optimal Transport Map between Infinite-Dimensional Spaces

Authors: Donlapark Ponnoprat, Masaaki Imaizumi

Abstract: We investigate the estimation of an optimal transport map between probability measures on an infinite-dimensional space and reveal its minimax optimal rate. Optimal transport theory defines distances within a space of probability measures, utilizing an optimal transport map as its key component. Estimating the optimal transport map from samples finds several applications, such as simulating dynamics between probability measures and functional data analysis. However, some transport maps on infinite-dimensional spaces require exponential-order data for estimation, which undermines their applicability. In this paper, we investigate the estimation of an optimal transport map between infinite-dimensional spaces, focusing on optimal transport maps characterized by the notion of $\gamma$-smoothness. Consequently, we show that the order of the minimax risk is polynomial in the sample size even in the infinite-dimensional setup. We also develop an estimator whose estimation error matches the minimax optimal rate. With these results, we obtain a class of reasonably estimable optimal transport maps on infinite-dimensional spaces and a method for their estimation. Our experiments validate the theory and practical utility of our approach with application to functional data analysis.

Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: 60 pages, 5 figures

MSC Class: 62G05

arXiv:2505.04898 [pdf, other]

Precise gradient descent training dynamics for finite-width multi-layer neural networks

Authors: Qiyang Han, Masaaki Imaizumi

Abstract: In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the 'finite-width proportional regime' where the sample size and feature dimension grow proportionally while the network width and depth remain bounded. Our non-asymptotic state evolution theory captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights, and remains valid for non-Gaussian features. Our theory differs from the existing neural tangent kernel (NTK), mean-field (MF), and tensor program (TP) theories in several key aspects. First, our theory operates in the finite-width regime whereas these existing theories are fundamentally infinite-width. Second, our theory allows weights to evolve from individual initializations beyond the lazy training regime, whereas NTK and MF are either frozen at or only weakly sensitive to initialization, and TP relies on special initialization schemes. Third, our theory characterizes both training and generalization errors for general multi-layer neural networks beyond the uniform convergence regime, whereas existing theories study generalization almost exclusively in two-layer settings. As a statistical application, we show that vanilla gradient descent can be augmented to yield consistent estimates of the generalization error at each iteration, which can be used to guide early stopping and hyperparameter tuning. As a further theoretical implication, we show that despite model misspecification, the model learned by gradient descent retains the structure of a single-index function with an effective signal determined by a linear combination of the true signal and the initialization.

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2502.11467 [pdf, other]

Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size

Authors: Naoki Takeshita, Masaaki Imaizumi

Abstract: Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of transformers has garnered significant attention. A notable example is the mathematical analysis of their approximation power, which validates the empirical expressive capability of transformers. In this study, we investigate the ability of transformers to approximate column-symmetric polynomials, an extension of symmetric polynomials that take matrices as input. Consequently, we establish an explicit relationship between the size of the transformer network and its approximation capability, leveraging the parameter efficiency of transformers and their compatibility with symmetry by focusing on the algebraic properties of symmetric polynomials.

Submitted 17 February, 2025; originally announced February 2025.

Comments: 29 pages

arXiv:2410.19244 [pdf, ps, other]

Universality of Estimator for High-Dimensional Linear Models with Block Dependency

Authors: Toshiki Tsuda, Masaaki Imaizumi

Abstract: We study the universality property of estimators for high-dimensional linear models, under which the distribution of an estimator does not depend on whether the covariates follow a Gaussian distribution. Recent results in high-dimensional statistics require covariates to strictly follow a Gaussian distribution to reveal precise properties of estimators. To relax the Gaussianity requirement, the existing literature has studied the conditions that allow estimators to achieve universality. In particular, independence of each element of the high-dimensional covariates plays an important role. In this study, we focus on high-dimensional linear models and covariates with block dependencies, where only elements of covariates within the same block can be dependent, and show that estimators for the model maintain universality. Specifically, we prove that, under block dependence, the distribution of an estimator with Gaussian covariates is approximated by that of an estimator with non-Gaussian covariates with the same moments. To establish the result, we develop a generalized Lindeberg principle to handle block dependencies and derive new error bounds for correlated elements of covariates. We also apply our universality result to the distribution of robust estimators.

Submitted 24 October, 2024; originally announced October 2024.

Comments: 35 pages

arXiv:2410.08709 [pdf, other]

Distillation of Discrete Diffusion through Dimensional Correlations

Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language) mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require many sampling steps. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. The code used in the paper is available at https://github.com/sony/di4c.

Submitted 8 May, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

Comments: 39 pages, ICML 2025 accepted

arXiv:2404.17812 [pdf, other]

High-Dimensional Single-Index Models: Link Estimation and Marginal Inference

Authors: Kazuma Sawaya, Yoshimasa Uematsu, Masaaki Imaizumi

Abstract: This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging the information from the estimated link function, we propose more efficient estimators that are better aligned with the underlying model. Furthermore, we rigorously establish the asymptotic normality of each coordinate of the estimator. This provides a valid construction of confidence intervals and $p$-values for any finite collection of coordinates. Numerical experiments validate our theoretical results.

Submitted 27 April, 2024; originally announced April 2024.

Comments: 42 pages

arXiv:2307.09257 [pdf, other]

doi 10.1214/23-EJS2211

Uniform Confidence Band for Optimal Transport Map on One-Dimensional Data

Authors: Donlapark Ponnoprat, Ryo Okano, Masaaki Imaizumi

Abstract: We develop a statistical inference method for an optimal transport map between distributions on real numbers with uniform confidence bands. The concept of optimal transport (OT) is used to measure distances between distributions, and OT maps are used to construct the distance. OT has been applied in many fields in recent years, and its statistical properties have attracted much interest. In particular, since the OT map is a function, a uniform norm-based statistical inference is significant for visualization and interpretation. In this study, we derive a limit distribution of a uniform norm of an estimation error for the OT map, and then develop a uniform confidence band based on it. In addition to our limit theorem, we develop a bootstrap method with kernel smoothing, and establish its validity and a guarantee on the asymptotic coverage probability of the confidence band. Our proof is based on the functional delta method and the representation of OT maps on the reals.

Submitted 15 February, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: 37 pages

MSC Class: 62G15; 62G20

Journal ref: Electronic Journal of Statistics 2024, Vol. 18, No. 1, 515-552
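
The "representation of OT maps on the reals" mentioned in the abstract above is, for quadratic cost and a source distribution without atoms, the composition of the target quantile function with the source CDF. The following is a minimal sketch of the plain plug-in estimator built from that representation. It is not the paper's kernel-smoothed bootstrap procedure, and the function name `empirical_ot_map_1d` is a hypothetical name chosen for illustration.

```python
from bisect import bisect_right


def empirical_ot_map_1d(source, target):
    """Plug-in estimate of the 1-D OT map T = F_target^{-1} o F_source.

    Illustrative sketch only: uses the empirical CDF of `source` and the
    generalized inverse of the empirical CDF of `target`, with no smoothing.
    """
    src = sorted(source)
    tgt = sorted(target)
    n, m = len(src), len(tgt)

    def transport(x):
        # Number of source points <= x, i.e. n times the empirical CDF at x.
        count = bisect_right(src, x)
        # Smallest k with k/m >= count/n (integer ceiling), clamped to >= 1.
        k = max(1, -(-count * m // n))
        return tgt[k - 1]

    return transport


# Usage: the target is an affine image of the source, so T(x) is close to 3 + 2x.
source = [i / 100 for i in range(101)]
target = [3 + 2 * s for s in source]
T = empirical_ot_map_1d(source, target)
print(T(0.5))
```

Because both empirical CDFs are step functions, the estimate is piecewise constant and monotone non-decreasing in x.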

arXiv:2307.06137 [pdf, other]

Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case

Authors: Ryo Okano, Masaaki Imaizumi

Abstract: Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes.

Submitted 8 February, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: 34 pages
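
For reference, the "analytical solution between Gaussian distributions" mentioned in the abstract above is the standard closed form for the 2-Wasserstein distance and the optimal transport map between two Gaussians. The notation below ($m_i$, $\Sigma_i$, $A$) is chosen here for illustration and is not taken from the paper; $\Sigma_1$ is assumed positive definite.

```latex
% Squared 2-Wasserstein distance between N(m_1, \Sigma_1) and N(m_2, \Sigma_2)
W_2^2\bigl(N(m_1,\Sigma_1),\, N(m_2,\Sigma_2)\bigr)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Bigr)

% Optimal transport map from N(m_1, \Sigma_1) to N(m_2, \Sigma_2)
T(x) = m_2 + A\,(x - m_1),
\qquad
A = \Sigma_1^{-1/2}\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Sigma_1^{-1/2}
```

The map is affine with a symmetric positive semidefinite matrix $A$, which is what allows Gaussian distributions to be handled through matrix-valued quantities.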

arXiv:2305.17731 [pdf, other]

Moment-Based Adjustments of Statistical Inference in High-Dimensional Generalized Linear Models

Authors: Kazuma Sawaya, Yoshimasa Uematsu, Masaaki Imaizumi

Abstract: We develop a statistical inference method applicable to a broad range of generalized linear models (GLMs) in high-dimensional settings, where the number of unknown coefficients scales proportionally with the sample size. Although a pioneering inference method has been developed for logistic regression, which is a specific instance of GLMs, we cannot apply this method directly to other GLMs because of unknown hyper-parameters. In this study, we address this limitation by developing a new inference method designed for a certain class of GLMs. Our method is based on the adjustment of asymptotic normality in high dimensions and is feasible in the sense that it is possible even with unknown hyper-parameters. Specifically, we introduce a novel convex loss-based estimator and its associated system, which are essential components of inference. Next, we devise a moment-based method for estimating the system parameters required by the method. Consequently, we construct confidence intervals for GLMs in a high-dimensional regime. We prove that our proposed method has desirable theoretical properties, such as strong consistency and exact coverage probability. Finally, we experimentally confirm its validity.

Submitted 23 May, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

Comments: 33 pages

arXiv:2305.15754 [pdf, other]

Bayesian Analysis for Over-parameterized Linear Model via Effective Spectra

Authors: Tomoya Wakayama, Masaaki Imaizumi

Abstract: In high-dimensional Bayesian statistics, various methods have been developed, including prior distributions that induce parameter sparsity to handle many parameters. Yet, these approaches often overlook the rich spectral structure of the covariate matrix, which can be crucial when true signals are not sparse. To address this gap, we introduce a data-adaptive Gaussian prior whose covariance is aligned with the leading eigenvectors of the sample covariance. This prior design targets the data's intrinsic complexity rather than its ambient dimension by concentrating the parameter search along principal data directions. We establish contraction rates of the corresponding posterior distribution, which reveal how the mass in the spectrum affects the prediction error bounds. Furthermore, we derive a truncated Gaussian approximation to the posterior (i.e., a Bernstein-von Mises-type result), which allows for uncertainty quantification with a reduced computational burden. Our findings demonstrate that Bayesian methods leveraging spectral information of the data are effective for estimation in non-sparse, high-dimensional settings.

Submitted 5 May, 2025; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 46 pages

arXiv:2304.04037 [pdf, other]

Benign Overfitting of Non-Sparse High-Dimensional Linear Regression with Correlated Noise

Authors: Toshiki Tsuda, Masaaki Imaizumi

Abstract: We investigate the high-dimensional linear regression problem in the presence of noise correlated with Gaussian covariates. This correlation, known as endogeneity in regression models, often arises from unobserved variables and other factors. It has been a major challenge in causal inference and econometrics. When the covariates are high-dimensional, it has been common to assume sparsity on the true parameters and estimate them using regularization, even with the endogeneity. However, when sparsity does not hold, it has not been well understood how to control endogeneity and high dimensionality simultaneously. This study demonstrates that an estimator without regularization can achieve consistency, that is, benign overfitting, under certain assumptions on the covariance matrix. Specifically, our results show that the error of this estimator converges to zero when the covariance matrices of correlated noise and instrumental variables satisfy a condition on their eigenvalues. We consider several extensions relaxing these conditions and conduct experiments to support our theoretical findings. As a technical contribution, we utilize the convex Gaussian minimax theorem (CGMT) in our dual problem and extend CGMT itself.

Submitted 20 October, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

Comments: 73 pages

arXiv:2302.02988 [pdf, other]

Asymptotically Optimal Fixed-Budget Best Arm Identification with Variance-Dependent Bounds

Authors: Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, Toru Kitagawa

Abstract: We investigate the problem of fixed-budget best arm identification (BAI) for minimizing expected simple regret. In an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and observes the outcome of the drawn arm. After the experiment, the decision maker recommends the treatment arm with the highest expected outcome. We evaluate the decision based on the expected simple regret, which is the difference between the expected outcomes of the best arm and the recommended arm. Due to inherent uncertainty, we evaluate the regret using the minimax criterion. First, we derive asymptotic lower bounds for the worst-case expected simple regret, which are characterized by the variances of potential outcomes (leading factor). Based on the lower bounds, we propose the Two-Stage (TS)-Hirano-Imbens-Ridder (HIR) strategy, which utilizes the HIR estimator (Hirano et al., 2003) in recommending the best arm. Our theoretical analysis shows that the TS-HIR strategy is asymptotically minimax optimal, meaning that the leading factor of its worst-case expected simple regret matches our derived worst-case lower bound. Additionally, we consider extensions of our method, such as the asymptotic optimality for the probability of misidentification. Finally, we validate the proposed method's effectiveness through simulations.

Submitted 12 July, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

arXiv:2210.09756 [pdf, ps, other]

Dimension-free Bounds for Sum of Dependent Matrices and Operators with Heavy-Tailed Distribution

Authors: Shogo Nakakita, Pierre Alquier, Masaaki Imaizumi

Abstract: We prove deviation inequalities for sums of high-dimensional random matrices and operators with dependence and heavy tails. Estimation of high-dimensional matrices is a concern for numerous modern applications. However, most results are stated for independent observations. Therefore, it is critical to derive results for dependent and heavy-tailed matrices. In this paper, we derive a dimension-free upper bound on the deviation of the sums. Thus, the bound does not depend explicitly on the dimension of the matrices but rather on their effective rank. Our result generalizes several existing studies on the deviation of sums of matrices. It relies on two techniques: (i) a variational approximation of the dual of moment generating functions, and (ii) robustification through the truncation of the eigenvalues of the matrices. We reveal that our results are applicable to several problems, such as covariance matrix estimation, hidden Markov models, and overparameterized linear regression. A corrigendum to the original paper is attached at the beginning. We correct Theorem 4 of the original paper by introducing a log-Sobolev inequality in place of the boundedness condition. We show that the examples discussed in the original paper can be recovered under new conditions. The original, uncorrected version of the paper, which includes the aforementioned error, is appended after this corrigendum for transparency and comparison.

Submitted 25 June, 2025; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: 40 pages in total

Journal ref: Electron. J. Statist. 18(1): 1130-1159 (2024)

arXiv:2209.07330 [pdf, other]

Best Arm Identification with Contextual Information under a Small Gap

Authors: Masahiro Kato, Masaaki Imaizumi, Takuya Ishihara, Toru Kitagawa

Abstract: We study the best-arm identification (BAI) problem with a fixed budget and contextual (covariate) information. In each round of an adaptive experiment, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, which is a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. In this study, we consider a class of nonparametric bandit models that converge to location-shift models when the gaps go to zero. First, we derive lower bounds of the misidentification probability for a certain class of strategies and bandit models (probabilistic models of potential outcomes) under a small-gap regime. A small-gap regime is a situation where gaps of the expected rewards between the best and suboptimal treatment arms go to zero, which corresponds to one of the worst cases in identifying the best treatment arm. We then develop the "Random Sampling (RS)-Augmented Inverse Probability weighting (AIPW) strategy," which is asymptotically optimal in the sense that the probability of misidentification under the strategy matches the lower bound when the budget goes to infinity in the small-gap regime. The RS-AIPW strategy consists of the RS rule tracking a target sample allocation ratio and the recommendation rule using the AIPW estimator.

Submitted 4 January, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: For the sake of completeness, we show a part of the results of Kato et al. (arXiv:2201.04469). arXiv admin note: text overlap with arXiv:2201.04469

arXiv:2204.08369 [pdf, ps, other]

Benign Overfitting in Time Series Linear Models with Over-Parameterization

Authors: Shogo Nakakita, Masaaki Imaizumi

Abstract: The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data, which may not be sparse; however, existing results rely on the assumption of sample independence. In this study, we analyze a linear regression model with dependent time-series data in an over-parameterized setting. We consider an estimator using interpolation and develop a theory for the excess risk of the estimator. Then, we derive non-asymptotic risk bounds for the estimator for cases with dependent data. This analysis reveals that the coherence of the temporal covariance plays a key role; the risk bound is influenced by the product of temporal covariance matrices at different time steps. Moreover, we show the convergence rate of the risk bound and demonstrate that it is also influenced by the coherence of the temporal covariance. Finally, we provide several examples of specific dependent processes applicable to our setting.

Submitted 13 March, 2025; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: Accepted at Bernoulli

arXiv:2202.05245 [pdf, ps, other]

Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

Authors: Masahiro Kato, Masaaki Imaizumi

Abstract: We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE), with linear regression models. With the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is that suspicions have been raised that the large-scale models are prone to overfitting to observations with sample selection, hence the large models may not be suitable for causal prediction. In this study, to resolve this suspicion, we investigate the validity of causal inference methods for overparameterized models, by applying the recent theory of benign overfitting (Bartlett et al., 2020). Specifically, we consider samples whose distribution switches depending on an assignment rule, and study the prediction of CATE with linear models whose dimension diverges to infinity. We focus on two methods: the T-learner, which is based on a difference between estimators constructed separately for each treatment group, and the inverse probability weight (IPW)-learner, which solves another regression problem approximated by a propensity score. In both methods, the estimator consists of interpolators that fit the samples perfectly. As a result, we show that the T-learner fails to achieve consistency except under random assignment, while the risk of the IPW-learner converges to zero if the propensity score is known. This difference stems from the fact that the T-learner is unable to preserve eigenspaces of the covariances, which is necessary for benign overfitting in the overparameterized setting. Our result provides new insights into the usage of causal inference methods in the overparameterized setting, in particular, doubly robust estimators.

Submitted 11 February, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:1906.11300 by other authors

arXiv:2201.04469 [pdf, other]

Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

Authors: Masahiro Kato, Kaito Ariu, Masaaki Imaizumi, Masahiro Nomura, Chao Qin

Abstract: We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero.

Submitted 28 December, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

arXiv:2112.00213 [pdf, other]

Minimax Analysis for Inverse Risk in Nonparametric Planer Invertible Regression

Authors: Akifumi Okuno, Masaaki Imaizumi

Abstract: We study the minimax risk of estimating inverse functions on a plane while requiring the estimator itself to be invertible. Learning invertibility from data and exploiting an invertible estimator are used in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, analysis of the efficiency of these methods is still under development. In this study, we analyze the minimax risk for estimating invertible bi-Lipschitz functions on a square in a $2$-dimensional plane. We first introduce two types of $L^2$-risks to evaluate an estimator which preserves invertibility. Then, we derive lower and upper rates for minimax values for the risks associated with inverse functions. For the derivation, we exploit a representation of invertible functions using level-sets. Specifically, to obtain the upper rate, we develop an estimator asymptotically almost everywhere invertible, whose risk attains the derived minimax lower rate up to logarithmic factors. The derived minimax rate corresponds to that of the non-invertible bi-Lipschitz function, which shows that the invertibility does not reduce the complexity of the estimation problem in terms of the rate.

Submitted 25 December, 2023; v1 submitted 30 November, 2021; originally announced December 2021.

Comments: 34 pages, 34 figures, accepted to Electronic Journal of Statistics

arXiv:2111.04004 [pdf, other]

Exponential escape efficiency of SGD from sharp minima in non-stationary regime

Authors: Hikaru Ibayashi, Masaaki Imaizumi

Abstract: We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before SGD reaches the stationary distribution. SGD has been a de facto standard training algorithm for various machine learning tasks. However, there still exists an open question as to why SGD finds highly generalizable parameters from non-convex target functions, such as the loss function of neural networks. An "escape efficiency" has been an attractive notion to tackle this question, which measures how efficiently SGD escapes from sharp minima with potentially low generalization performance. Despite its importance, the notion has the limitation that it works only when SGD reaches a stationary distribution after sufficient updates. In this paper, we develop a new theory to investigate escape efficiency of SGD with Gaussian noise, by introducing the Large Deviation Theory for dynamical systems. Based on the theory, we prove that the fast escape from sharp minima, named exponential escape, occurs in a non-stationary setting, and that it holds not only for continuous SGD but also for discrete SGD. A key notion for the result is a quantity called "steepness," which describes the SGD's stochastic behavior throughout its training process. Our experiments are consistent with our theory.

Submitted 18 March, 2022; v1 submitted 7 November, 2021; originally announced November 2021.

arXiv:2104.02978 [pdf, other]

Fast Convergence on Perfect Classification for Functional Data

Authors: Tomoya Wakayama, Masaaki Imaizumi

Abstract: We investigate the availability of approaching perfect classification on functional data with finite samples. The seminal work (Delaigle and Hall (2012)) showed that perfect classification for functional data is easier to achieve than for finite-dimensional data. This result is based on their finding that a sufficient condition for the existence of a perfect classifier, named a Delaigle--Hall condition, is only available for functional data. However, there is a danger that a large sample size is required to achieve the perfect classification even though the Delaigle--Hall condition holds, because a minimax convergence rate of errors with functional data has a logarithm order in sample size. This study solves this complication by proving that the Delaigle--Hall condition also achieves fast convergence of the misclassification error in sample size, under the bounded entropy condition on functional data. We study a reproducing kernel Hilbert space-based classifier under the Delaigle--Hall condition, and show that a convergence rate of its misclassification error has an exponential order in sample size. Technically, our proof is based on (i) connecting the Delaigle--Hall condition and a margin of classifiers, and (ii) handling metric entropy of functional data. Our experiments support our result, and also illustrate that some other classifiers for functional data have a similar property.

Submitted 7 January, 2023; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: 32 pages, accepted by Statistica Sinica

arXiv:2103.00500 [pdf, other]

Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks

Authors: Ryumei Nakada, Masaaki Imaizumi

Abstract: We investigate the asymptotic risk of a general class of overparameterized likelihood models, including deep models. The recent empirical success of large-scale models has motivated several theoretical studies to investigate a scenario wherein both the number of samples, $n$, and parameters, $p$, diverge to infinity and derive an asymptotic risk at the limit. However, these theorems are only valid for linear-in-feature models, such as generalized linear regression, kernel regression, and shallow neural networks. Hence, it is difficult to investigate a wider class of nonlinear models, including deep neural networks with three or more layers. In this study, we consider a likelihood maximization problem without the model constraints and analyze the upper bound of an asymptotic risk of an estimator with penalization. Technically, we combine a property of the Fisher information matrix with an extended Marchenko-Pastur law and associate the combination with empirical process techniques. The derived bound is general, as it describes both the double descent and the regularized risk curves, depending on the penalization. Our results are valid without the linear-in-feature constraints on models and allow us to derive the general spectral distributions of a Fisher information matrix from the likelihood. We demonstrate that several explicit models, such as parallel deep neural networks, ensemble learning, and residual networks, are in agreement with our theory. This result indicates that even large and deep models have a small asymptotic risk if they exhibit a specific structure, such as divisibility. To verify this finding, we conduct a real-data experiment with parallel deep neural networks. Our results expand the applicability of the asymptotic risk analysis, and may also contribute to the understanding and application of deep learning.

Submitted 15 March, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: 36 pages

arXiv:2102.02981 [pdf, ps, other]

Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Authors: Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, Tengyang Xie

Abstract: We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions, characterized by the critical inequality \citep{bartlett2005}. Based on this result, we analyze convergence rates for OPE. In particular, we introduce novel alternative completeness conditions under which OPE is feasible and we present the first finite-sample result with first-order efficiency in non-tabular environments, i.e., having the minimal coefficient in the leading term.

Submitted 24 July, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: Under Review

arXiv:2012.15678 [pdf, ps, other]

On Gaussian Approximation for M-Estimator

Authors: Masaaki Imaizumi, Taisuke Otsu

Abstract: This study develops a non-asymptotic Gaussian approximation theory for distributions of M-estimators, which are defined as maximizers of empirical criterion functions. In the existing mathematical statistics literature, numerous studies have focused on approximating the distributions of M-estimators for statistical inference. In contrast to the existing approaches, which mainly focus on limiting behaviors, this study employs a non-asymptotic approach, establishes abstract Gaussian approximation results for maximizers of empirical criteria, and proposes a Gaussian multiplier bootstrap approximation method. Our developments can be considered as extensions of the seminal works (Chernozhukov, Chetverikov and Kato (2013, 2014, 2015)) on the approximation theory for distributions of suprema of empirical processes toward their maximizers. Through this work, we shed new light on the statistical theory of M-estimators. Our theory covers not only regular estimators, such as the least absolute deviations, but also some non-regular cases where it is difficult to derive or to approximate numerically the limiting distributions, such as non-Donsker classes and cube root estimators.

Submitted 2 January, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: 48 pages

arXiv:1910.07773 [pdf, other]

Hypothesis Test and Confidence Analysis with Wasserstein Distance on General Dimension

Authors: Masaaki Imaizumi, Hirofumi Ota, Takuo Hamaguchi

Abstract: We develop a general framework for statistical inference with the 1-Wasserstein distance. Recently, the Wasserstein distance has attracted considerable attention and has been widely applied to various machine learning tasks because of its excellent properties. However, hypothesis tests and a confidence analysis for the Wasserstein distance have not been established in a general multivariate setting. This is because the limit distribution of the empirical distribution with the Wasserstein distance is unavailable without strong restriction. To address this problem, in this study, we develop a novel non-asymptotic Gaussian approximation for the empirical 1-Wasserstein distance. Using the approximation method, we develop a hypothesis test and confidence analysis for the empirical 1-Wasserstein distance. Additionally, we provide a theoretical guarantee and an efficient algorithm for the proposed approximation. Our experiments validate its performance numerically.

Submitted 15 February, 2022; v1 submitted 17 October, 2019; originally announced October 2019.

Comments: 36 pages

arXiv:1612.07490 [pdf, other]

A simple method to construct confidence bands in functional linear regression

Authors: Masaaki Imaizumi, Kengo Kato

Abstract: This paper develops a simple method to construct confidence bands, centered at a principal component analysis (PCA) based estimator, for the slope function in a functional linear regression model with a scalar response variable and a functional predictor variable. The PCA-based estimator is a series estimator with estimated basis functions, and so construction of valid confidence bands for it is a non-trivial challenge. We propose a confidence band that aims at covering the slope function at "most" of points with a prespecified probability (level), and prove its asymptotic validity under suitable regularity conditions. Importantly, this is the first paper that derives confidence bands having theoretical justifications for the PCA-based estimator. We also propose a practical method to choose the cut-off level used in PCA-based estimation, and conduct numerical studies to verify the finite sample performance of the proposed confidence band. Finally, we apply our methodology to spectrometric data, and discuss extensions of our methodology to cases where additional vector-valued regressors are present.

Submitted 1 May, 2017; v1 submitted 22 December, 2016; originally announced December 2016.

Comments: 29 pages

arXiv:1609.00286 [pdf, other]

PCA-based estimation for functional linear regression with functional responses

Authors: Masaaki Imaizumi, Kengo Kato

Abstract: This paper studies a regression model where both predictor and response variables are random functions. We consider a functional linear model where the conditional mean of the response variable at each time point is given by a linear functional of the predictor variable. In this paper, we are interested in estimation of the integral kernel $b(s,t)$ of the conditional expectation operator, where $s$ is an output variable while $t$ is a variable that interacts with the predictor variable. This problem is an ill-posed inverse problem, and we consider two estimators based on the functional principal component analysis (PCA). We show that under suitable regularity conditions, an estimator based on the single truncation attains the convergence rate for the integrated squared error that is characterized by smoothness of the function $b(s,t)$ in $t$ together with the decay rate of the eigenvalues of the covariance operator, but the rate does not depend on smoothness of $b(s,t)$ in $s$. This rate is shown to be minimax optimal, and consequently smoothness of $b(s,t)$ in $s$ does not affect difficulty of estimating $b$. We also consider an alternative estimator based on the double truncation, and provide conditions under which the alternative estimator attains the optimal rate. We conduct simulations to verify the performance of PCA-based estimators in the finite sample. Finally, we apply our estimators to investigate the relation between the lifetime pattern of working hours and total income, and the relation between the electricity spot price and the wind power infeed.

Submitted 22 March, 2017; v1 submitted 1 September, 2016; originally announced September 2016.

Comments: 30 pages

Showing 1–26 of 26 results for author: Imaizumi, M