-
TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation
Authors:
Yuhui Zhang,
Dongshen Wu,
Yuichiro Wada,
Takafumi Kanamori
Abstract:
A reliable uncertainty estimation method is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for safe deployments of deep learning models in the open world. In this work, we propose TULiP, a theoretically-driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based o…
▽ More
A reliable uncertainty estimation method is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for safe deployments of deep learning models in the open world. In this work, we propose TULiP, a theoretically-driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based on linearized training dynamics, we bound the effect of such perturbation, resulting in an uncertainty score computable by perturbing model parameters. Ultimately, our approach computes uncertainty from a set of sampled predictions. We visualize our bound on synthetic regression and classification datasets. Furthermore, we demonstrate the effectiveness of TULiP using large-scale OOD detection benchmarks for image classification. Our method exhibits state-of-the-art performance, particularly for near-distribution samples.
△ Less
Submitted 23 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Scaling-based Data Augmentation for Generative Models and its Theoretical Extension
Authors:
Yoshitaka Koike,
Takumi Nakagawa,
Hiroki Waida,
Takafumi Kanamori
Abstract:
This paper studies stable learning methods for generative models that enable high-quality data generation. Noise injection is commonly used to stabilize learning. However, selecting a suitable noise distribution is challenging. Diffusion-GAN, a recently developed method, addresses this by using the diffusion process with a timestep-dependent discriminator. We investigate Diffusion-GAN and reveal t…
▽ More
This paper studies stable learning methods for generative models that enable high-quality data generation. Noise injection is commonly used to stabilize learning. However, selecting a suitable noise distribution is challenging. Diffusion-GAN, a recently developed method, addresses this by using the diffusion process with a timestep-dependent discriminator. We investigate Diffusion-GAN and reveal that data scaling is a key component for stable learning and high-quality data generation. Building on our findings, we propose a learning algorithm, Scale-GAN, that uses data scaling and variance-based regularization. Furthermore, we theoretically prove that data scaling controls the bias-variance trade-off of the estimation error bound. As a theoretical extension, we consider GAN with invertible data augmentations. Comparative evaluations on benchmark datasets demonstrate the effectiveness of our method in improving stability and accuracy.
△ Less
Submitted 28 October, 2024;
originally announced October 2024.
-
Robust Estimation for Kernel Exponential Families with Smoothed Total Variation Distances
Authors:
Takafumi Kanamori,
Kodai Yokoyama,
Takayuki Kawashima
Abstract:
In statistical inference, we commonly assume that samples are independent and identically distributed from a probability distribution included in a pre-specified statistical model. However, such an assumption is often violated in practice. Even an unexpected extreme sample called an {\it outlier} can significantly impact classical estimators. Robust statistics studies how to construct reliable sta…
▽ More
In statistical inference, we commonly assume that samples are independent and identically distributed from a probability distribution included in a pre-specified statistical model. However, such an assumption is often violated in practice. Even an unexpected extreme sample called an {\it outlier} can significantly impact classical estimators. Robust statistics studies how to construct reliable statistical methods that efficiently work even when the ideal assumption is violated. Recently, some works revealed that robust estimators such as Tukey's median are well approximated by the generative adversarial net (GAN), a popular learning method for complex generative models using neural networks. GAN is regarded as a learning method using integral probability metrics (IPM), which is a discrepancy measure for probability distributions. In most theoretical analyses of Tukey's median and its GAN-based approximation, however, the Gaussian or elliptical distribution is assumed as the statistical model. In this paper, we explore the application of GAN-like estimators to a general class of statistical models. As the statistical model, we consider the kernel exponential family that includes both finite and infinite-dimensional models. To construct a robust estimator, we propose the smoothed total variation (STV) distance as a class of IPMs. Then, we theoretically investigate the robustness properties of the STV-based estimators. Our analysis reveals that the STV-based estimator is robust against the distribution contamination for the kernel exponential family. Furthermore, we analyze the prediction accuracy of a Monte Carlo approximation method, which circumvents the computational difficulty of the normalization constant.
△ Less
Submitted 28 October, 2024;
originally announced October 2024.
-
A Convex Framework for Confounding Robust Inference
Authors:
Kei Ishikawa,
Niao He,
Takafumi Kanamori
Abstract:
We study policy evaluation of offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to some coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimation of the poli…
▽ More
We study policy evaluation of offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to some coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimation of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound of the policy value using convex programming. The generality of our estimator enables various extensions such as sensitivity analysis with f-divergence, model selection with cross validation and information criterion, and robust policy learning with the sharp lower bound. Furthermore, our estimation method can be reformulated as an empirical risk minimization problem thanks to the strong duality, which enables us to provide strong theoretical guarantees of the proposed estimator using techniques of the M-estimation.
△ Less
Submitted 1 November, 2023; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Denoising Cosine Similarity: A Theory-Driven Approach for Efficient Representation Learning
Authors:
Takumi Nakagawa,
Yutaro Sanada,
Hiroki Waida,
Yuhui Zhang,
Yuichiro Wada,
Kōsaku Takanashi,
Tomonori Yamada,
Takafumi Kanamori
Abstract:
Representation learning has been increasing its impact on the research and practice of machine learning, since it enables to learn representations that can apply to various downstream tasks efficiently. However, recent works pay little attention to the fact that real-world datasets used during the stage of representation learning are commonly contaminated by noise, which can degrade the quality of…
▽ More
Representation learning has been increasing its impact on the research and practice of machine learning, since it enables to learn representations that can apply to various downstream tasks efficiently. However, recent works pay little attention to the fact that real-world datasets used during the stage of representation learning are commonly contaminated by noise, which can degrade the quality of learned representations. This paper tackles the problem to learn robust representations against noise in a raw dataset. To this end, inspired by recent works on denoising and the success of the cosine-similarity-based objective functions in representation learning, we propose the denoising Cosine-Similarity (dCS) loss. The dCS loss is a modified cosine-similarity loss and incorporates a denoising property, which is supported by both our theoretical and empirical findings. To make the dCS loss implementable, we also construct the estimators of the dCS loss with statistical guarantees. Finally, we empirically show the efficiency of the dCS loss over the baseline objective functions in vision and speech domains.
△ Less
Submitted 19 April, 2023;
originally announced April 2023.
-
Towards Understanding the Mechanism of Contrastive Learning via Similarity Structure: A Theoretical Analysis
Authors:
Hiroki Waida,
Yuichiro Wada,
Léo Andéol,
Takumi Nakagawa,
Yuhui Zhang,
Takafumi Kanamori
Abstract:
Contrastive learning is an efficient approach to self-supervised representation learning. Although recent studies have made progress in the theoretical understanding of contrastive learning, the investigation of how to characterize the clusters of the learned representations is still limited. In this paper, we aim to elucidate the characterization from theoretical perspectives. To this end, we con…
▽ More
Contrastive learning is an efficient approach to self-supervised representation learning. Although recent studies have made progress in the theoretical understanding of contrastive learning, the investigation of how to characterize the clusters of the learned representations is still limited. In this paper, we aim to elucidate the characterization from theoretical perspectives. To this end, we consider a kernel-based contrastive learning framework termed Kernel Contrastive Learning (KCL), where kernel functions play an important role when applying our theoretical results to other frameworks. We introduce a formulation of the similarity structure of learned representations by utilizing a statistical dependency viewpoint. We investigate the theoretical properties of the kernel-based contrastive loss via this formulation. We first prove that the formulation characterizes the structure of representations learned with the kernel-based contrastive learning framework. We show a new upper bound of the classification error of a downstream task, which explains that our theory is consistent with the empirical success of contrastive learning. We also establish a generalization error bound of KCL. Finally, we show a guarantee for the generalization ability of KCL to the downstream classification task via a surrogate bound.
△ Less
Submitted 1 April, 2023;
originally announced April 2023.
-
Deep Clustering with a Constraint for Topological Invariance based on Symmetric InfoNCE
Authors:
Yuhui Zhang,
Yuichiro Wada,
Hiroki Waida,
Kaito Goto,
Yusaku Hino,
Takafumi Kanamori
Abstract:
We consider the scenario of deep clustering, in which the available prior knowledge is limited. In this scenario, few existing state-of-the-art deep clustering methods can perform well for both non-complex topology and complex topology datasets. To address the problem, we propose a constraint utilizing symmetric InfoNCE, which helps an objective of deep clustering method in the scenario train the…
▽ More
We consider the scenario of deep clustering, in which the available prior knowledge is limited. In this scenario, few existing state-of-the-art deep clustering methods can perform well for both non-complex topology and complex topology datasets. To address the problem, we propose a constraint utilizing symmetric InfoNCE, which helps an objective of deep clustering method in the scenario train the model so as to be efficient for not only non-complex topology but also complex topology datasets. Additionally, we provide several theoretical explanations of the reason why the constraint can enhances performance of deep clustering methods. To confirm the effectiveness of the proposed constraint, we introduce a deep clustering method named MIST, which is a combination of an existing deep clustering method and our constraint. Our numerical experiments via MIST demonstrate that the constraint is effective. In addition, MIST outperforms other state-of-the-art deep clustering methods for most of the commonly used ten benchmark datasets.
△ Less
Submitted 6 March, 2023;
originally announced March 2023.
-
Learning Domain Invariant Representations by Joint Wasserstein Distance Minimization
Authors:
Léo Andeol,
Yusei Kawakami,
Yuichiro Wada,
Takafumi Kanamori,
Klaus-Robert Müller,
Grégoire Montavon
Abstract:
Domain shifts in the training data are common in practical applications of machine learning; they occur for instance when the data is coming from different sources. Ideally, a ML model should work well independently of these shifts, for example, by learning a domain-invariant representation. However, common ML losses do not give strong guarantees on how consistently the ML model performs for diffe…
▽ More
Domain shifts in the training data are common in practical applications of machine learning; they occur for instance when the data is coming from different sources. Ideally, a ML model should work well independently of these shifts, for example, by learning a domain-invariant representation. However, common ML losses do not give strong guarantees on how consistently the ML model performs for different domains, in particular, whether the model performs well on a domain at the expense of its performance on another domain. In this paper, we build new theoretical foundations for this problem, by contributing a set of mathematical relations between classical losses for supervised ML and the Wasserstein distance in joint space (i.e. representation and output space). We show that classification or regression losses, when combined with a GAN-type discriminator between domains, form an upper-bound to the true Wasserstein distance between domains. This implies a more invariant representation and also more stable prediction performance across domains. Theoretical results are corroborated empirically on several image datasets. Our proposed approach systematically produces the highest minimum classification accuracy across domains, and the most invariant representation.
△ Less
Submitted 21 August, 2023; v1 submitted 9 June, 2021;
originally announced June 2021.
-
Robust modal regression with direct log-density derivative estimation
Authors:
Hiroaki Sasaki,
Tomoya Sakai,
Takafumi Kanamori
Abstract:
Modal regression is aimed at estimating the global mode (i.e., global maximum) of the conditional density function of the output variable given input variables, and has led to regression methods robust against heavy-tailed or skewed noises. The conditional mode is often estimated through maximization of the modal regression risk (MRR). In order to apply a gradient method for the maximization, the…
▽ More
Modal regression is aimed at estimating the global mode (i.e., global maximum) of the conditional density function of the output variable given input variables, and has led to regression methods robust against heavy-tailed or skewed noises. The conditional mode is often estimated through maximization of the modal regression risk (MRR). In order to apply a gradient method for the maximization, the fundamental challenge is accurate approximation of the gradient of MRR, not MRR itself. To overcome this challenge, in this paper, we take a novel approach of directly approximating the gradient of MRR. To approximate the gradient, we develop kernelized and neural-network-based versions of the least-squares log-density derivative estimator, which directly approximates the derivative of the log-density without density estimation. With direct approximation of the MRR gradient, we first propose a modal regression method with kernels, and derive a new parameter update rule based on a fixed-point method. Then, the derived update rule is theoretically proved to have a monotonic hill-climbing property towards the conditional mode. Furthermore, we indicate that our approach of directly approximating the gradient is compatible with recent sophisticated stochastic gradient methods (e.g., Adam), and then propose another modal regression method based on neural networks. Finally, the superior performance of the proposed methods is demonstrated on various artificial and benchmark datasets.
△ Less
Submitted 18 October, 2019;
originally announced October 2019.
-
Estimating Density Models with Truncation Boundaries using Score Matching
Authors:
Song Liu,
Takafumi Kanamori,
Daniel J. Williams
Abstract:
Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for…
▽ More
Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for fitting parameters using only unnormalized models. However, it cannot be directly applied here as boundary conditions used to derive a tractable SM objective are not satisfied by truncated densities. In this paper, we study parameter estimation for truncated probability densities using SM. The estimator minimizes a weighted Fisher divergence. The weight function is simply the shortest distance from a data point to the boundary of the domain. We show this choice of weight function naturally arises from minimizing the Stein discrepancy as well as upperbounding the finite-sample estimation error. The usefulness of our method is demonstrated by numerical experiments and a study on the Chicago crime data set. We also show that the proposed density estimation can correct the outlier-trimming bias caused by aggressive outlier detection methods.
△ Less
Submitted 20 April, 2022; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Unified estimation framework for unnormalized models with statistical efficiency
Authors:
Masatoshi Uehara,
Takafumi Kanamori,
Takashi Takenouchi,
Takeru Matsuda
Abstract:
The parameter estimation of unnormalized models is a challenging problem. The maximum likelihood estimation (MLE) is computationally infeasible for these models since normalizing constants are not explicitly calculated. Although some consistent estimators have been proposed earlier, the problem of statistical efficiency remains. In this study, we propose a unified, statistically efficient estimati…
▽ More
The parameter estimation of unnormalized models is a challenging problem. The maximum likelihood estimation (MLE) is computationally infeasible for these models since normalizing constants are not explicitly calculated. Although some consistent estimators have been proposed earlier, the problem of statistical efficiency remains. In this study, we propose a unified, statistically efficient estimation framework for unnormalized models and several efficient estimators, whose asymptotic variance is the same as the MLE. The computational cost of these estimators is also reasonable and they can be employed whether the sample space is discrete or continuous. The loss functions of the proposed estimators are derived by combining the following two methods: (1) density-ratio matching using Bregman divergence, and (2) plugging-in nonparametric estimators. We also analyze the properties of the proposed estimators when the unnormalized models are misspecified. The experimental results demonstrate the advantages of our method over existing approaches.
△ Less
Submitted 5 June, 2020; v1 submitted 22 January, 2019;
originally announced January 2019.
-
Variable Selection for Nonparametric Learning with Power Series Kernels
Authors:
Kota Matsui,
Wataru Kumagai,
Kenta Kanamori,
Mitsuaki Nishikimi,
Takafumi Kanamori
Abstract:
In this paper, we propose a variable selection method for general nonparametric kernel-based estimation. The proposed method consists of two-stage estimation: (1) construct a consistent estimator of the target function, (2) approximate the estimator using a few variables by l1-type penalized estimation. We see that the proposed method can be applied to various kernel nonparametric estimation such…
▽ More
In this paper, we propose a variable selection method for general nonparametric kernel-based estimation. The proposed method consists of two-stage estimation: (1) construct a consistent estimator of the target function, (2) approximate the estimator using a few variables by l1-type penalized estimation. We see that the proposed method can be applied to various kernel nonparametric estimation such as kernel ridge regression, kernel-based density and density-ratio estimation. We prove that the proposed method has the property of the variable selection consistency when the power series kernel is used. This result is regarded as an extension of the variable selection consistency for the non-negative garrote to the kernel-based estimators. Several experiments including simulation studies and real data applications show the effectiveness of the proposed method.
△ Less
Submitted 4 December, 2018; v1 submitted 1 June, 2018;
originally announced June 2018.
-
Fisher Efficient Inference of Intractable Models
Authors:
Song Liu,
Takafumi Kanamori,
Wittawat Jitkrittum,
Yu Chen
Abstract:
Maximum Likelihood Estimators (MLE) has many good properties. For example, the asymptotic variance of MLE solution attains equality of the asymptotic Cram{é}r-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such MLE solution requires calculating the likelihood function which may not be tractable due to the normalization term…
▽ More
Maximum Likelihood Estimators (MLE) has many good properties. For example, the asymptotic variance of MLE solution attains equality of the asymptotic Cram{é}r-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such MLE solution requires calculating the likelihood function which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can attain the equality of the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.
△ Less
Submitted 1 November, 2019; v1 submitted 18 May, 2018;
originally announced May 2018.
-
Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios
Authors:
Hiroaki Sasaki,
Takafumi Kanamori,
Aapo Hyvärinen,
Gang Niu,
Masashi Sugiyama
Abstract:
Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge both in mode-seeking clustering and density ridge estimation is accurate…
▽ More
Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge both in mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive approach takes a three-step approach of first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the \emph{density-derivative-ratios}. The proposed estimator does not involve density estimation, but rather \emph{directly} approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods both for mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.
△ Less
Submitted 30 March, 2018; v1 submitted 6 July, 2017;
originally announced July 2017.
-
Parallel Distributed Block Coordinate Descent Methods based on Pairwise Comparison Oracle
Authors:
Kota Matsui,
Wataru Kumagai,
Takafumi Kanamori
Abstract:
This paper provides a block coordinate descent algorithm to solve unconstrained optimization problems. In our algorithm, computation of function values or gradients is not required. Instead, pairwise comparison of function values is used. Our algorithm consists of two steps; one is the direction estimate step and the other is the search step. Both steps require only pairwise comparison of function…
▽ More
This paper provides a block coordinate descent algorithm to solve unconstrained optimization problems. In our algorithm, computation of function values or gradients is not required. Instead, pairwise comparison of function values is used. Our algorithm consists of two steps; one is the direction estimate step and the other is the search step. Both steps require only pairwise comparison of function values, which tells us only the order of function values over two points. In the direction estimate step, a Newton type search direction is estimated. A computation method like block coordinate descent methods is used with the pairwise comparison. In the search step, a numerical solution is updated along the estimated direction. The computation in the direction estimate step can be easily parallelized, and thus, the algorithm works efficiently to find the minimizer of the objective function. Also, we show an upper bound of the convergence rate. In numerical experiments, we show that our method efficiently finds the optimal solution compared to some existing methods based on the pairwise comparison.
△ Less
Submitted 13 September, 2014;
originally announced September 2014.
-
Breakdown Point of Robust Support Vector Machine
Authors:
Takafumi Kanamori,
Shuhei Fujiwara,
Akiko Takeda
Abstract:
The support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has a serious drawback, that is sensitivity to outliers in training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of the convex loss causes the sensitivity to outliers. To deal wit…
▽ More
The support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has a serious drawback, that is sensitivity to outliers in training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of the convex loss causes the sensitivity to outliers. To deal with outliers, robust variants of SVM have been proposed, such as the robust outlier detection algorithm and an SVM with a bounded loss called the ramp loss. In this paper, we propose a robust variant of SVM and investigate its robustness in terms of the breakdown point. The breakdown point is a robustness measure that is the largest amount of contamination such that the estimated classifier still gives information about the non-contaminated data. The main contribution of this paper is to show an exact evaluation of the breakdown point for the robust SVM. For learning parameters such as the regularization parameter in our algorithm, we derive a simple formula that guarantees the robustness of the classifier. When the learning parameters are determined with a grid search using cross validation, our formula works to reduce the number of candidate search points. The robustness of the proposed method is confirmed in numerical experiments. We show that the statistical properties of the robust SVM are well explained by a theoretical analysis of the breakdown point.
△ Less
Submitted 2 September, 2014;
originally announced September 2014.
-
Affine Invariant Divergences associated with Composite Scores and its Applications
Authors:
Takafumi Kanamori,
Hironori Fujisawa
Abstract:
In statistical analysis, measuring a score of predictive performance is an important task. In many scientific fields, appropriate scores were tailored to tackle the problems at hand. A proper score is a popular tool to obtain statistically consistent forecasts. Furthermore, a mathematical characterization of the proper score was studied. As a result, it was revealed that the proper score correspon…
▽ More
In statistical analysis, measuring a score of predictive performance is an important task. In many scientific fields, appropriate scores were tailored to tackle the problems at hand. A proper score is a popular tool to obtain statistically consistent forecasts. Furthermore, a mathematical characterization of the proper score was studied. As a result, it was revealed that the proper score corresponds to a Bregman divergence, which is an extension of the squared distance over the set of probability distributions. In the present paper, we introduce composite scores as an extension of the typical scores in order to obtain a wider class of probabilistic forecasting. Then, we propose a class of composite scores, named Holder scores, that induce equivariant estimators. The equivariant estimators have a favorable property, implying that the estimator is transformed in a consistent way, when the data is transformed. In particular, we deal with the affine transformation of the data. By using the equivariant estimators under the affine transformation, one can obtain estimators that do no essentially depend on the choice of the system of units in the measurement. Conversely, we prove that the Holder score is characterized by the invariance property under the affine transformations. Furthermore, we investigate statistical properties of the estimators using Holder scores for the statistical problems including estimation of regression functions and robust parameter estimation, and illustrate the usefulness of the newly introduced scores for statistical forecasting.
△ Less
Submitted 11 May, 2013;
originally announced May 2013.
-
Density-Difference Estimation
Authors:
Masashi Sugiyama,
Takafumi Kanamori,
Taiji Suzuki,
Marthinus Christoffel du Plessis,
Song Liu,
Ichiro Takeuchi
Abstract:
We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small error incurred in the first stage can…
▽ More
We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. The usefulness of the proposed method is also demonstrated experimentally.
△ Less
Submitted 30 June, 2012;
originally announced July 2012.
-
A Unified Robust Classification Model
Authors:
Akiko Takeda,
Hiroyuki Mitsugi,
Takafumi Kanamori
Abstract:
A wide variety of machine learning algorithms such as support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA), exist for binary classification. The purpose of this paper is to provide a unified classification model that includes the above models through a robust optimization approach. This unified model has several benefits. One is that the extension…
▽ More
A wide variety of machine learning algorithms such as support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA), exist for binary classification. The purpose of this paper is to provide a unified classification model that includes the above models through a robust optimization approach. This unified model has several benefits. One is that the extensions and improvements intended for SVM become applicable to MPM and FDA, and vice versa. Another benefit is to provide theoretical results to above learning methods at once by dealing with the unified model. We give a statistical interpretation of the unified classification model and propose a non-convex optimization algorithm that can be applied to non-convex variants of existing learning methods.
△ Less
Submitted 18 June, 2012;
originally announced June 2012.
-
A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems
Authors:
Takafumi Kanamori,
Akiko Takeda,
Taiji Suzuki
Abstract:
In binary classification problems, mainly two approaches have been proposed; one is loss function approach and the other is uncertainty set approach. The loss function approach is applied to major learning algorithms such as support vector machine (SVM) and boosting methods. The loss function represents the penalty of the decision function on the training samples. In the learning algorithm, the em…
▽ More
In binary classification problems, mainly two approaches have been proposed; one is loss function approach and the other is uncertainty set approach. The loss function approach is applied to major learning algorithms such as support vector machine (SVM) and boosting methods. The loss function represents the penalty of the decision function on the training samples. In the learning algorithm, the empirical mean of the loss function is minimized to obtain the classifier. Against a backdrop of the development of mathematical programming, nowadays learning algorithms based on loss functions are widely applied to real-world data analysis. In addition, statistical properties of such learning algorithms are well-understood based on a lots of theoretical works. On the other hand, the learning method using the so-called uncertainty set is used in hard-margin SVM, mini-max probability machine (MPM) and maximum margin MPM. In the learning algorithm, firstly, the uncertainty set is defined for each binary label based on the training samples. Then, the best separating hyperplane between the two uncertainty sets is employed as the decision function. This is regarded as an extension of the maximum-margin approach. The uncertainty set approach has been studied as an application of robust optimization in the field of mathematical programming. The statistical properties of learning algorithms with uncertainty sets have not been intensively studied. In this paper, we consider the relation between the above two approaches. We point out that the uncertainty set is described by using the level set of the conjugate of the loss function. Based on such relation, we study statistical properties of learning algorithms using uncertainty sets.
△ Less
Submitted 30 April, 2012;
originally announced April 2012.
-
Semi-Supervised learning with Density-Ratio Estimation
Authors:
Masanori Kawakita,
Takafumi Kanamori
Abstract:
In this paper, we study statistical properties of semi-supervised learning, which is considered as an important problem in the community of machine learning. In the standard supervised learning, only the labeled data is observed. The classification and regression problems are formalized as the supervised learning. In semi-supervised learning, unlabeled data is also obtained in addition to labeled…
▽ More
In this paper, we study statistical properties of semi-supervised learning, which is considered as an important problem in the community of machine learning. In the standard supervised learning, only the labeled data is observed. The classification and regression problems are formalized as the supervised learning. In semi-supervised learning, unlabeled data is also obtained in addition to labeled data. Hence, exploiting unlabeled data is important to improve the prediction accuracy in semi-supervised learning. This problems is regarded as a semiparametric estimation problem with missing data. Under the the discriminative probabilistic models, it had been considered that the unlabeled data is useless to improve the estimation accuracy. Recently, it was revealed that the weighted estimator using the unlabeled data achieves better prediction accuracy in comparison to the learning method using only labeled data, especially when the discriminative probabilistic model is misspecified. That is, the improvement under the semiparametric model with missing data is possible, when the semiparametric model is misspecified. In this paper, we apply the density-ratio estimator to obtain the weight function in the semi-supervised learning. The benefit of our approach is that the proposed estimator does not require well-specified probabilistic models for the probability of the unlabeled data. Based on the statistical asymptotic theory, we prove that the estimation accuracy of our method outperforms the supervised learning using only labeled data. Some numerical experiments present the usefulness of our methods.
△ Less
Submitted 17 April, 2012;
originally announced April 2012.
-
Relative Density-Ratio Estimation for Robust Distribution Comparison
Authors:
Makoto Yamada,
Taiji Suzuki,
Takafumi Kanamori,
Hirotaka Hachiya,
Masashi Sugiyama
Abstract:
Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, di…
▽ More
Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.
△ Less
Submitted 23 June, 2011;
originally announced June 2011.
-
f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models
Authors:
Takafumi Kanamori,
Taiji Suzuki,
Masashi Sugiyama
Abstract:
A density ratio is defined by the ratio of two probability densities. We study the inference problem of density ratios and apply a semi-parametric density-ratio estimator to the two-sample homogeneity test. In the proposed test procedure, the f-divergence between two probability densities is estimated using a density-ratio estimator. The f-divergence estimator is then exploited for the two-sample…
▽ More
A density ratio is defined by the ratio of two probability densities. We study the inference problem of density ratios and apply a semi-parametric density-ratio estimator to the two-sample homogeneity test. In the proposed test procedure, the f-divergence between two probability densities is estimated using a density-ratio estimator. The f-divergence estimator is then exploited for the two-sample homogeneity test. We derive the optimal estimator of f-divergence in the sense of the asymptotic variance, and then investigate the relation between the proposed test procedure and the existing score test based on empirical likelihood estimator. Through numerical studies, we illustrate the adequacy of the asymptotic theory for finite-sample inference.
△ Less
Submitted 24 October, 2010;
originally announced October 2010.
-
A Bregman Extension of quasi-Newton updates I: An Information Geometrical framework
Authors:
Takafumi Kanamori,
Atsumi Ohara
Abstract:
We study quasi-Newton methods from the viewpoint of information geometry induced associated with Bregman divergences. Fletcher has studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that the variational problem is identical to optimization of the Kullback-Leibler divergence, which is a discrepancy measure between two probabi…
▽ More
We study quasi-Newton methods from the viewpoint of information geometry induced associated with Bregman divergences. Fletcher has studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that the variational problem is identical to optimization of the Kullback-Leibler divergence, which is a discrepancy measure between two probability distributions. The Kullback-Leibler divergence for the multinomial normal distribution corresponds to the objective function Fletcher has considered. We introduce the Bregman divergence as an extension of the Kullback-Leibler divergence, and derive extended quasi-Newton update formulae based on the variational problem with the Bregman divergence. As well as the Kullback-Leibler divergence, the Bregman divergence introduces the information geometrical structure on the set of positive definite matrices. From the geometrical viewpoint, we study the approximation Hessian update, the invariance property of the update formulae, and the sparse quasi-Newton methods. Especially, we point out that the sparse quasi-Newton method is closely related to statistical methods such as the EM-algorithm and the boosting algorithm. Information geometry is useful tool not only to better understand the quasi-Newton methods but also to design new update formulae.
△ Less
Submitted 14 October, 2010;
originally announced October 2010.
-
A Bregman Extension of quasi-Newton updates II: Convergence and Robustness Properties
Authors:
Takafumi Kanamori,
Atsumi Ohara
Abstract:
We propose an extension of quasi-Newton methods, and investigate the convergence and the robustness properties of the proposed update formulae for the approximate Hessian matrix. Fletcher has studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that the variational problem is identical to optimization of the Kullback-Leibler d…
▽ More
We propose an extension of quasi-Newton methods, and investigate the convergence and the robustness properties of the proposed update formulae for the approximate Hessian matrix. Fletcher has studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that the variational problem is identical to optimization of the Kullback-Leibler divergence, which is a discrepancy measure between two probability distributions. Then, we introduce the Bregman divergence as an extension of the Kullback-Leibler divergence, and derive extended quasi-Newton update formulae based on the variational problem with the Bregman divergence. The proposed update formulae belong to a class of self-scaling quasi-Newton methods. We study the convergence property of the proposed quasi-Newton method, and moreover, we apply the tools in the robust statistics to analyze the robustness property of the Hessian update formulae against the numerical rounding errors included in the line search for the step length. As the result, we found that the influence of the inexact line search is bounded only for the standard BFGS formula for the Hessian approximation. Numerical studies are conducted to verify the usefulness of the tools borrowed from robust statistics.
△ Less
Submitted 14 October, 2010;
originally announced October 2010.
-
Pooling Design and Bias Correction in DNA Library Screening
Authors:
Takafumi Kanamori,
Hiroaki Uehara,
Masakazu Jimbo
Abstract:
We study the group test for DNA library screening based on probabilistic approach. Group test is a method of detecting a few positive items from among a large number of items, and has wide range of applications. In DNA library screening, positive item corresponds to the clone having a specified DNA segment, and it is necessary to identify and isolate the positive clones for compiling the librarie…
▽ More
We study the group test for DNA library screening based on probabilistic approach. Group test is a method of detecting a few positive items from among a large number of items, and has wide range of applications. In DNA library screening, positive item corresponds to the clone having a specified DNA segment, and it is necessary to identify and isolate the positive clones for compiling the libraries. In the group test, a group of items, called pool, is assayed in a lump in order to save the cost of testing, and positive items are detected based on the observation from each pool. It is known that the design of grouping, that is, pooling design is important to %reduce the estimation bias and achieve accurate detection. In the probabilistic approach, positive clones are picked up based on the posterior probability. Naive methods of computing the posterior, however, involves exponentially many sums, and thus we need a device. Loopy belief propagation (loopy BP) algorithm is one of popular methods to obtain approximate posterior probability efficiently. There are some works investigating the relation between the accuracy of the loopy BP and the pooling design. Based on these works, we develop pooling design with small estimation bias of posterior probability, and we show that the balanced incomplete block design (BIBD) has nice property for our purpose. Some numerical experiments show that the bias correction under the BIBD is useful to improve the estimation accuracy.
△ Less
Submitted 25 April, 2010; v1 submitted 22 April, 2010;
originally announced April 2010.
-
Condition Number Analysis of Kernel-based Density Ratio Estimation
Authors:
Takafumi Kanamori,
Taiji Suzuki,
Masashi Sugiyama
Abstract:
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation…
▽ More
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques--the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator which is shown to possess an even smaller condition number. We show that numerical studies meet our theoretical analysis.
△ Less
Submitted 15 December, 2009;
originally announced December 2009.