-
Statistical learnability of smooth boundaries via pairwise binary classification with deep ReLU networks
Authors:
Hiroki Waida,
Takafumi Kanamori
Abstract:
The topic of nonparametric estimation of smooth boundaries is extensively studied in the conventional setting where pairs of a single covariate and a response variable are observed. However, this traditional setting often suffers from the cost of data collection. Recent years have witnessed the consistent development of learning algorithms for binary classification problems where one can instead observe paired covariates and a binary variable representing the statistical relationship between the covariates. In this work, we theoretically study the question of whether multiple smooth boundaries are learnable if the pairwise binary classification setting is considered. We investigate the question with the statistical dependence of paired covariates to develop a learning algorithm using vector-valued functions. The main theorem shows that there is an empirical risk minimization algorithm in a class of deep ReLU networks such that it produces a consistent estimator for indicator functions defined with smooth boundaries. We also discuss how the pairwise binary classification setting is different from the conventional settings, focusing on the structural condition of function classes. As a by-product, we apply the main theorem to a multiclass nonparametric classification problem where the estimation performance is measured by the excess risk in terms of misclassification.
Submitted 19 January, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
-
Estimating Density Models with Truncation Boundaries using Score Matching
Authors:
Song Liu,
Takafumi Kanamori,
Daniel J. Williams
Abstract:
Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for fitting parameters using only unnormalized models. However, it cannot be directly applied here as boundary conditions used to derive a tractable SM objective are not satisfied by truncated densities. In this paper, we study parameter estimation for truncated probability densities using SM. The estimator minimizes a weighted Fisher divergence. The weight function is simply the shortest distance from a data point to the boundary of the domain. We show this choice of weight function naturally arises from minimizing the Stein discrepancy as well as upper-bounding the finite-sample estimation error. The usefulness of our method is demonstrated by numerical experiments and a study on the Chicago crime data set. We also show that the proposed density estimation can correct the outlier-trimming bias caused by aggressive outlier detection methods.
Submitted 20 April, 2022; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Graph-based Composite Local Bregman Divergences on Discrete Sample Spaces
Authors:
Takafumi Kanamori,
Takashi Takenouchi
Abstract:
One of the most common methods for statistical inference is the maximum likelihood estimator (MLE). The MLE needs to compute the normalization constant in statistical models, which is often intractable. Using unnormalized statistical models and replacing the likelihood with another scoring rule is a good way to circumvent such high computational cost, where the scoring rule measures the goodness of fit of the model to observed samples. The scoring rule is closely related to the Bregman divergence, which is a discrepancy measure between two probability distributions. In this paper, the purpose is to provide a general framework of statistical inference using unnormalized statistical models on discrete sample spaces. A localized version of scoring rules is important to obtain computationally efficient estimators. We show that the local scoring rules are related to the localized version of Bregman divergences. Through the localized Bregman divergence, we investigate the statistical consistency of local scoring rules. We show that the consistency is determined by the structure of the neighborhood system defined on discrete sample spaces. In addition, we show a way of applying local scoring rules to classification problems. In numerical experiments, we investigate the relation between the neighborhood system and the estimation accuracy.
Submitted 24 April, 2016; v1 submitted 22 April, 2016;
originally announced April 2016.
-
Robust Estimation under Heavy Contamination using Enlarged Models
Authors:
Takafumi Kanamori,
Hironori Fujisawa
Abstract:
In data analysis, contamination caused by outliers is inevitable, and robust statistical methods are in strong demand. In this paper, our concern is to develop a new approach for robust data analysis based on scoring rules. The scoring rule is a discrepancy measure to assess the quality of probabilistic forecasts. We propose a simple way of estimating not only the parameter in the statistical model but also the contamination ratio of outliers. Estimating the contamination ratio is important, since one can detect outliers among the training samples based on the estimated contamination ratio. For this purpose, we use scoring rules with extended statistical models, called enlarged models. Regression problems are also considered. We study a complex heterogeneous contamination, in which the contamination ratio of outliers in the dependent variable may depend on the independent variable. We propose a simple method to obtain a robust regression estimator under heterogeneous contamination. In addition, we show that our method also provides an estimator of the expected contamination ratio that can be used to detect outliers among the training samples. Numerical experiments demonstrate the effectiveness of our methods compared to the conventional estimators.
Submitted 20 November, 2013;
originally announced November 2013.
-
Affine Invariant Divergences associated with Composite Scores and its Applications
Authors:
Takafumi Kanamori,
Hironori Fujisawa
Abstract:
In statistical analysis, measuring a score of predictive performance is an important task. In many scientific fields, appropriate scores were tailored to tackle the problems at hand. A proper score is a popular tool to obtain statistically consistent forecasts. Furthermore, a mathematical characterization of the proper score was studied. As a result, it was revealed that the proper score corresponds to a Bregman divergence, which is an extension of the squared distance over the set of probability distributions. In the present paper, we introduce composite scores as an extension of the typical scores in order to obtain a wider class of probabilistic forecasting. Then, we propose a class of composite scores, named Holder scores, that induce equivariant estimators. The equivariant estimators have a favorable property, implying that the estimator is transformed in a consistent way when the data is transformed. In particular, we deal with the affine transformation of the data. By using the equivariant estimators under the affine transformation, one can obtain estimators that do not essentially depend on the choice of the system of units in the measurement. Conversely, we prove that the Holder score is characterized by the invariance property under the affine transformations. Furthermore, we investigate statistical properties of the estimators using Holder scores for the statistical problems including estimation of regression functions and robust parameter estimation, and illustrate the usefulness of the newly introduced scores for statistical forecasting.
Submitted 11 May, 2013;
originally announced May 2013.
-
Relative Density-Ratio Estimation for Robust Distribution Comparison
Authors:
Makoto Yamada,
Taiji Suzuki,
Takafumi Kanamori,
Hirotaka Hachiya,
Masashi Sugiyama
Abstract:
Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.
Submitted 23 June, 2011;
originally announced June 2011.
-
Condition Number Analysis of Kernel-based Density Ratio Estimation
Authors:
Takafumi Kanamori,
Taiji Suzuki,
Masashi Sugiyama
Abstract:
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques; the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator which is shown to possess an even smaller condition number. We show that numerical studies agree with our theoretical analysis.
Submitted 15 December, 2009;
originally announced December 2009.