Search | arXiv e-print repository

Bandit and Delayed Feedback in Online Structured Prediction

Authors: Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue, Kenji Yamanishi

Abstract: Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full information setup, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full information fe… ▽ More Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full information setup, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, bandit and delayed feedback. For the bandit setting, using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, making this result less desirable. To address this, we propose an algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is enabled with a carefully designed pseudo-inverse matrix estimator. Furthermore, for the delayed full information feedback setup, we obtain a surrogate regret bound of $O(D^{2/3} T^{1/3})$ for the delay time $D$. We also provide algorithms for the delayed bandit feedback setup. Finally, we numerically evaluate the performance of the proposed algorithms in online classification with bandit feedback. △ Less

Submitted 25 February, 2025; originally announced February 2025.

Comments: 33 pages

arXiv:2412.01163 [pdf, other]

Graph Community Augmentation with GMM-based Modeling in Latent Space

Authors: Shintaro Fukushima, Kenji Yamanishi

Abstract: This study addresses the issue of graph generation with generative models. In particular, we are concerned with graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. The graph community augmentation means that the generated graphs have a new community… ▽ More This study addresses the issue of graph generation with generative models. In particular, we are concerned with graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. The graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important structure of graphs with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also be helpful for generalization of data mining models in a case where it is difficult to collect real graph data enough. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while we keep the structure of the original graphs to some extent for the generated graphs to be realistic. To this end, we propose an algorithm called the graph community augmentation (GCA). The key ideas of GCA are (i) to fit Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in the new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets. △ Less

Submitted 2 December, 2024; originally announced December 2024.

Comments: IEEE Copyright. Accepted to 24th IEEE International Conference on Data Mining (ICDM). 10pages

arXiv:2409.08387 [pdf, ps, other]

Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models

Authors: Atsushi Suzuki, Kota Fukuzawa, Kenji Yamanishi

Abstract: The normalized maximum likelihood (NML) code length is widely used as a model selection criterion based on the minimum description length principle, where the model with the shortest NML code length is selected. A common method to calculate the NML code length is to use the sum (for a discrete model) or integral (for a continuous model) of a function defined by the distribution of the maximum like… ▽ More The normalized maximum likelihood (NML) code length is widely used as a model selection criterion based on the minimum description length principle, where the model with the shortest NML code length is selected. A common method to calculate the NML code length is to use the sum (for a discrete model) or integral (for a continuous model) of a function defined by the distribution of the maximum likelihood estimator. While this method has been proven to correctly calculate the NML code length of discrete models, no proof has been provided for continuous cases. Consequently, it has remained unclear whether the method can accurately calculate the NML code length of continuous models. In this paper, we solve this problem affirmatively, proving that the method is also correct for continuous cases. Remarkably, completing the proof for continuous cases is non-trivial in that it cannot be achieved by merely replacing the sums in discrete cases with integrals, as the decomposition trick applied to sums in the discrete model case proof is not applicable to integrals in the continuous model case proof. To overcome this, we introduce a novel decomposition approach based on the coarea formula from geometric measure theory, which is essential to establishing our proof for continuous cases. △ Less

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2403.18269 [pdf, other]

Clustering Change Sign Detection by Fusing Mixture Complexity

Authors: Kento Urano, Ryo Yuki, Kenji Yamanishi

Abstract: This paper proposes an early detection method for cluster structural changes. Cluster structure refers to discrete structural characteristics, such as the number of clusters, when data are represented using finite mixture models, such as Gaussian mixture models. We focused on scenarios in which the cluster structure gradually changed over time. For finite mixture models, the concept of mixture com… ▽ More This paper proposes an early detection method for cluster structural changes. Cluster structure refers to discrete structural characteristics, such as the number of clusters, when data are represented using finite mixture models, such as Gaussian mixture models. We focused on scenarios in which the cluster structure gradually changed over time. For finite mixture models, the concept of mixture complexity (MC) measures the continuous cluster size by considering the cluster proportion bias and overlap between clusters. In this paper, we propose MC fusion as an extension of MC to handle situations in which multiple mixture numbers are possible in a finite mixture model. By incorporating the fusion of multiple models, our approach accurately captured the cluster structure during transitional periods of gradual change. Moreover, we introduce a method for detecting changes in the cluster structure by examining the transition of MC fusion. We demonstrate the effectiveness of our method through empirical analysis using both artificial and real-world datasets. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: 23 pages

arXiv:2311.18694 [pdf, other]

Balancing Summarization and Change Detection in Graph Streams

Authors: Shintaro Fukushima, Kenji Yamanishi

Abstract: This study addresses the issue of balancing graph summarization and graph change detection. Graph summarization compresses large-scale graphs into a smaller scale. However, the question remains: To what extent should the original graph be compressed? This problem is solved from the perspective of graph change detection, aiming to detect statistically significant changes using a stream of summary g… ▽ More This study addresses the issue of balancing graph summarization and graph change detection. Graph summarization compresses large-scale graphs into a smaller scale. However, the question remains: To what extent should the original graph be compressed? This problem is solved from the perspective of graph change detection, aiming to detect statistically significant changes using a stream of summary graphs. If the compression rate is extremely high, important changes can be ignored, whereas if the compression rate is extremely low, false alarms may increase with more memory. This implies that there is a trade-off between compression rate in graph summarization and accuracy in change detection. We propose a novel quantitative methodology to balance this trade-off to simultaneously realize reliable graph summarization and change detection. We introduce a probabilistic structure of hierarchical latent variable model into a graph, thereby designing a parameterized summary graph on the basis of the minimum description length principle. The parameter specifying the summary graph is then optimized so that the accuracy of change detection is guaranteed to suppress Type I error probability (probability of raising false alarms) to be less than a given confidence level. First, we provide a theoretical framework for connecting graph summarization with change detection. Then, we empirically demonstrate its effectiveness on synthetic and real datasets. △ Less

Submitted 12 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: 6 pages, Accepted to 23rd IEEE International Conference on Data Mining (ICDM2023)

arXiv:2307.09259 [pdf, other]

Adaptive Topological Feature via Persistent Homology: Filtration Learning for Point Clouds

Authors: Naoki Nishikawa, Yuichi Ike, Kenji Yamanishi

Abstract: Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. For enhancing the accuracy of such machine learning methods, it is often effective to incorporate global topological features, which are typically extracted by persistent homology. In the calculation of persistent homology for a point cloud… ▽ More Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. For enhancing the accuracy of such machine learning methods, it is often effective to incorporate global topological features, which are typically extracted by persistent homology. In the calculation of persistent homology for a point cloud, we choose a filtration for the point cloud, an increasing sequence of spaces. Since the performance of machine learning methods combined with persistent homology is highly affected by the choice of a filtration, we need to tune it depending on data and tasks. In this paper, we propose a framework that learns a filtration adaptively with the use of neural networks. In order to make the resulting persistent homology isometry-invariant, we develop a neural network architecture with such invariance. Additionally, we show a theoretical result on a finite-dimensional approximation of filtration functions, which justifies the proposed network architecture. Experimental results demonstrated the efficacy of our framework in several classification tasks. △ Less

Submitted 24 December, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: 24 pages with 6 figures

arXiv:2305.07971 [pdf, ps, other]

Tight and fast generalization error bound of graph embedding in metric space

Authors: Atsushi Suzuki, Atsushi Nitanda, Taiji Suzuki, Jing Wang, Feng Tian, Kenji Yamanishi

Abstract: Recent studies have experimentally shown that we can achieve in non-Euclidean metric space effective and efficient graph embedding, which aims to obtain the vertices' representations reflecting the graph's structure in the metric space. Specifically, graph embedding in hyperbolic space has experimentally succeeded in embedding graphs with hierarchical-tree structure, e.g., data in natural language… ▽ More Recent studies have experimentally shown that we can achieve in non-Euclidean metric space effective and efficient graph embedding, which aims to obtain the vertices' representations reflecting the graph's structure in the metric space. Specifically, graph embedding in hyperbolic space has experimentally succeeded in embedding graphs with hierarchical-tree structure, e.g., data in natural languages, social networks, and knowledge bases. However, recent theoretical analyses have shown a much higher upper bound on non-Euclidean graph embedding's generalization error than Euclidean one's, where a high generalization error indicates that the incompleteness and noise in the data can significantly damage learning performance. It implies that the existing bound cannot guarantee the success of graph embedding in non-Euclidean metric space in a practical training data size, which can prevent non-Euclidean graph embedding's application in real problems. This paper provides a novel upper bound of graph embedding's generalization error by evaluating the local Rademacher complexity of the model as a function set of the distances of representation couples. Our bound clarifies that the performance of graph embedding in non-Euclidean metric space, including hyperbolic space, is better than the existing upper bounds suggest. Specifically, our new upper bound is polynomial in the metric space's geometric radius $R$ and can be $O(\frac{1}{S})$ at the fastest, where $S$ is the training data size. Our bound is significantly tighter and faster than the existing one, which can be exponential to $R$ and $O(\frac{1}{\sqrt{S}})$ at the fastest. Specific calculations on example cases show that graph embedding in non-Euclidean metric space can outperform that in Euclidean space with much smaller training data than the existing bound has suggested. △ Less

Submitted 13 May, 2023; originally announced May 2023.

arXiv:2302.12127 [pdf, other]

Detecting Signs of Model Change with Continuous Model Selection Based on Descriptive Dimensionality

Authors: Kenji Yamanishi, So Hirai

Abstract: We address the issue of detecting changes of models that lie behind a data stream. The model refers to an integer-valued structural information such as the number of free parameters in a parametric model. Specifically we are concerned with the problem of how we can detect signs of model changes earlier than they are actualized. To this end, we employ {\em continuous model selection} on the basis o… ▽ More We address the issue of detecting changes of models that lie behind a data stream. The model refers to an integer-valued structural information such as the number of free parameters in a parametric model. Specifically we are concerned with the problem of how we can detect signs of model changes earlier than they are actualized. To this end, we employ {\em continuous model selection} on the basis of the notion of {\em descriptive dimensionality}~(Ddim). It is a real-valued model dimensionality, which is designed for quantifying the model dimensionality in the model transition period. Continuous model selection is to determine the real-valued model dimensionality in terms of Ddim from a given data. We propose a novel methodology for detecting signs of model changes by tracking the rise-up of Ddim in a data stream. We apply this methodology to detecting signs of changes of the number of clusters in a Gaussian mixture model and those of the order in an auto regression model. With synthetic and real data sets, we empirically demonstrate its effectiveness by showing that it is able to visualize well how rapidly model dimensionality moves in the transition period and to raise early warning signals of model changes earlier than they are detected with existing methods. △ Less

Submitted 23 February, 2023; originally announced February 2023.

arXiv:2105.10475 [pdf, ps, other]

Generalization Error Bound for Hyperbolic Ordinal Embedding

Authors: Atsushi Suzuki, Atsushi Nitanda, Jing Wang, Linchuan Xu, Marc Cavazza, Kenji Yamanishi

Abstract: Hyperbolic ordinal embedding (HOE) represents entities as points in hyperbolic space so that they agree as well as possible with given constraints in the form of entity i is more similar to entity j than to entity k. It has been experimentally shown that HOE can obtain representations of hierarchical data such as a knowledge base and a citation network effectively, owing to hyperbolic space's expo… ▽ More Hyperbolic ordinal embedding (HOE) represents entities as points in hyperbolic space so that they agree as well as possible with given constraints in the form of entity i is more similar to entity j than to entity k. It has been experimentally shown that HOE can obtain representations of hierarchical data such as a knowledge base and a citation network effectively, owing to hyperbolic space's exponential growth property. However, its theoretical analysis has been limited to ideal noiseless settings, and its generalization error in compensation for hyperbolic space's exponential representation ability has not been guaranteed. The difficulty is that existing generalization error bound derivations for ordinal embedding based on the Gramian matrix do not work in HOE, since hyperbolic space is not inner-product space. In this paper, through our novel characterization of HOE with decomposed Lorentz Gramian matrices, we provide a generalization error bound of HOE for the first time, which is at most exponential with respect to the embedding space's radius. Our comparison between the bounds of HOE and Euclidean ordinal embedding shows that HOE's generalization error is reasonable as a cost for its exponential representation ability. △ Less

Submitted 21 May, 2021; originally announced May 2021.

arXiv:2011.09465 [pdf, other]

Detecting Hierarchical Changes in Latent Variable Models

Authors: Shintaro Fukushima, Kenji Yamanishi

Abstract: This paper addresses the issue of detecting hierarchical changes in latent variable models (HCDL) from data streams. There are three different levels of changes for latent variable models: 1) the first level is the change in data distribution for fixed latent variables, 2) the second one is that in the distribution over latent variables, and 3) the third one is that in the number of latent variabl… ▽ More This paper addresses the issue of detecting hierarchical changes in latent variable models (HCDL) from data streams. There are three different levels of changes for latent variable models: 1) the first level is the change in data distribution for fixed latent variables, 2) the second one is that in the distribution over latent variables, and 3) the third one is that in the number of latent variables. It is important to detect these changes because we can analyze the causes of changes by identifying which level a change comes from (change interpretability). This paper proposes an information-theoretic framework for detecting changes of the three levels in a hierarchical way. The key idea to realize it is to employ the MDL (minimum description length) change statistics for measuring the degree of change, in combination with DNML (decomposed normalized maximum likelihood) code-length calculation. We give a theoretical basis for making reliable alarms for changes. Focusing on stochastic block models, we employ synthetic and benchmark datasets to empirically demonstrate the effectiveness of our framework in terms of change interpretability as well as change detection. △ Less

Submitted 22 November, 2020; v1 submitted 18 November, 2020; originally announced November 2020.

Comments: 12pages, Accepted to 20th IEEE International Conference on Data Mining (ICDM2020)

arXiv:2008.07720 [pdf, other]

Word2vec Skip-gram Dimensionality Selection via Sequential Normalized Maximum Likelihood

Authors: Pham Thuc Hung, Kenji Yamanishi

Abstract: In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of the probability theory, SG is considered as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply information criteria with the aim of selecting the… ▽ More In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of the probability theory, SG is considered as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply information criteria with the aim of selecting the best dimensionality so that the corresponding model can be as close as possible to the true distribution. We examine the following information criteria for the dimensionality selection problem: the Akaike Information Criterion, Bayesian Information Criterion, and Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length. The proposed approach is applied to both the original SG model and the SG Negative Sampling model to clarify the idea of using information criteria. Additionally, as the original SNML suffers from computational disadvantages, we introduce novel heuristics for its efficient computation. Moreover, we empirically demonstrate that SNML outperforms both BIC and AIC. In comparison with other evaluation methods for word embedding, the dimensionality selected by SNML is significantly closer to the optimal dimensionality obtained by word analogy or word similarity tasks. △ Less

Submitted 24 August, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

arXiv:2007.15897 [pdf, other]

A Novel Global Spatial Attention Mechanism in Convolutional Neural Network for Medical Image Classification

Authors: Linchuan Xu, Jun Huang, Atsushi Nitanda, Ryo Asaoka, Kenji Yamanishi

Abstract: Spatial attention has been introduced to convolutional neural networks (CNNs) for improving both their performance and interpretability in visual tasks including image classification. The essence of the spatial attention is to learn a weight map which represents the relative importance of activations within the same layer or channel. All existing attention mechanisms are local attentions in the se… ▽ More Spatial attention has been introduced to convolutional neural networks (CNNs) for improving both their performance and interpretability in visual tasks including image classification. The essence of the spatial attention is to learn a weight map which represents the relative importance of activations within the same layer or channel. All existing attention mechanisms are local attentions in the sense that weight maps are image-specific. However, in the medical field, there are cases that all the images should share the same weight map because the set of images record the same kind of symptom related to the same object and thereby share the same structural content. In this paper, we thus propose a novel global spatial attention mechanism in CNNs mainly for medical image classification. The global weight map is instantiated by a decision boundary between important pixels and unimportant pixels. And we propose to realize the decision boundary by a binary classifier in which the intensities of all images at a pixel are the features of the pixel. The binary classification is integrated into an image classification CNN and is to be optimized together with the CNN. Experiments on two medical image datasets and one facial expression dataset showed that with the proposed attention, not only the performance of four powerful CNNs which are GoogleNet, VGG, ResNet, and DenseNet can be improved, but also meaningful attended regions can be obtained, which is beneficial for understanding the content of images of a domain. △ Less

Submitted 31 July, 2020; originally announced July 2020.

arXiv:2007.15179 [pdf, other]

Detecting Change Signs with Differential MDL Change Statistics for COVID-19 Pandemic Analysis

Authors: Kenji Yamanishi, Linchuan Xu, Ryo Yuki, Shintaro Fukushima, Chuan-hao Lin

Abstract: We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given time series of COVID-19 cases in a region, we may raise early warning signals of outbreaks by detecting signs of changes in the cases. We propose a novel methodology to address this issue. The key idea is to employ a new information-theoretic notion, which we call the differential minim… ▽ More We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given time series of COVID-19 cases in a region, we may raise early warning signals of outbreaks by detecting signs of changes in the cases. We propose a novel methodology to address this issue. The key idea is to employ a new information-theoretic notion, which we call the differential minimum description length change statistics (D-MDL), for measuring the scores of change sign. We first give a fundamental theory for D-MDL. We then demonstrate its effectiveness using synthetic datasets. We apply it to detecting early warning signals of the COVID-19 epidemic. We empirically demonstrate that D-MDL is able to raise early warning signals of events such as significant increase/decrease of cases. Remarkably, for about $64\%$ of the events of significant increase of cases in 37 studied countries, our method can detect warning signals as early as nearly six days on average before the events, buying considerably long time for making responses. We further relate the warning signals to the basic reproduction number $R0$ and the timing of social distancing. The results showed that our method can effectively monitor the dynamics of $R0$, and confirmed the effectiveness of social distancing at containing the epidemic in a region. We conclude that our method is a promising approach to the pandemic analysis from a data science viewpoint. The software for the experiments is available at https://github.com/IbarakikenYukishi/differential-mdl-change-statistics. An online detection system is available at https://ibarakikenyukishi.github.io/d-mdl-html/index.html △ Less

Submitted 19 February, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:2007.12160 [pdf, other]

Online Robust and Adaptive Learning from Data Streams

Authors: Shintaro Fukushima, Atsushi Nitanda, Kenji Yamanishi

Abstract: In online learning from non-stationary data streams, it is necessary to learn robustly to outliers and to adapt quickly to changes in the underlying data generating mechanism. In this paper, we refer to the former attribute of online learning algorithms as robustness and to the latter as adaptivity. There is an obvious tradeoff between the two attributes. It is a fundamental issue to quantify and… ▽ More In online learning from non-stationary data streams, it is necessary to learn robustly to outliers and to adapt quickly to changes in the underlying data generating mechanism. In this paper, we refer to the former attribute of online learning algorithms as robustness and to the latter as adaptivity. There is an obvious tradeoff between the two attributes. It is a fundamental issue to quantify and evaluate the tradeoff because it provides important information on the data generating mechanism. However, no previous work has considered the tradeoff quantitatively. We propose a novel algorithm called the stochastic approximation-based robustness-adaptivity algorithm (SRA) to evaluate the tradeoff. The key idea of SRA is to update parameters of distribution or sufficient statistics with the biased stochastic approximation scheme, while dropping data points with large values of the stochastic update. We address the relation between the two parameters: one is the step size of the stochastic approximation, and the other is the threshold parameter of the norm of the stochastic update. The former controls the adaptivity and the latter does the robustness. We give a theoretical analysis for the non-asymptotic convergence of SRA in the presence of outliers, which depends on both the step size and threshold parameter. Because SRA is formulated on the majorization-minimization principle, it is a general algorithm that includes many algorithms, such as the online EM algorithm and stochastic gradient descent. Empirical experiments for both synthetic and real datasets demonstrated that SRA was superior to previous methods. △ Less

Submitted 27 September, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

Comments: 42 pages

arXiv:2007.07467 [pdf, ps, other]

Mixture Complexity and Its Application to Gradual Clustering Change Detection

Authors: Shunki Kyoya, Kenji Yamanishi

Abstract: In model-based clustering using finite mixture models, it is a significant challenge to determine the number of clusters (cluster size). It used to be equal to the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. In this study, we propose to continuously measure the cluster size in a mixture model by a new concept called mixt… ▽ More In model-based clustering using finite mixture models, it is a significant challenge to determine the number of clusters (cluster size). It used to be equal to the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. In this study, we propose to continuously measure the cluster size in a mixture model by a new concept called mixture complexity (MC). It is formally defined from the viewpoint of information theory and can be seen as a natural extension of the cluster size considering overlap and weight bias. Subsequently, we apply MC to the issue of gradual clustering change detection. Conventionally, clustering changes has been considered to be abrupt, induced by the changes in the mixture size or cluster size. Meanwhile, we consider the clustering changes to be gradual in terms of MC; it has the benefits of finding the changes earlier and discerning the significant and insignificant changes. We further demonstrate that the MC can be decomposed according to the hierarchical structures of the mixture models; it helps us to analyze the detail of substructures. △ Less

Submitted 14 July, 2020; originally announced July 2020.

arXiv:1910.11540 [pdf, other]

Descriptive Dimensionality and Its Characterization of MDL-based Learning and Change Detection

Authors: Kenji Yamanishi

Abstract: This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic view point. We call it the "descriptive dimension"(Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mixed. The paper then derives the rate of convergence… ▽ More This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic view point. We call it the "descriptive dimension"(Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mixed. The paper then derives the rate of convergence of the MDL (Minimum Description Length) learning algorithm which outputs a normalized maximum likelihood (NML) distribution with model of the shortest NML codelength. The paper proves that the rate is governed by Ddim. The paper also derives error probabilities of the MDL-based test for multiple model change detection. It proves that they are also governed by Ddim. Through the analysis, we demonstrate that Ddim is an intrinsic quantity which characterizes the performance of the MDL-based learning and change detection. △ Less

Submitted 25 October, 2019; originally announced October 2019.

arXiv:1905.00699 [pdf, ps, other]

doi 10.1098/rsos.191643

Long-tailed distributions of inter-event times as mixtures of exponential distributions

Authors: Makoto Okada, Kenji Yamanishi, Naoki Masuda

Abstract: Inter-event times of various human behavior are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states in each of which they are regarded to generate events obeying a Poisson process. If this is the case, inter-event times should app… ▽ More Inter-event times of various human behavior are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states in each of which they are regarded to generate events obeying a Poisson process. If this is the case, inter-event times should approximately obey a mixture of exponential distributions with different parameter values. In the present study, we introduce the minimum description length principle to compare mixtures of exponential distributions with different numbers of components (i.e., constituent exponential distributions). Because these distributions violate the identifiability property, one is mathematically not allowed to apply the Akaike or Bayes information criteria to their maximum likelihood estimator to carry out model selection. We overcome this theoretical barrier by applying a minimum description principle to joint likelihoods of the data and latent variables. We show that mixtures of exponential distributions with a few components are selected as opposed to more complex mixtures in various data sets and that the fitting accuracy is comparable to that of state-of-the-art algorithms to fit power-law distributions to data. Our results lend support to Poissonian explanations of apparently non-Poissonian human behavior. △ Less

Submitted 26 February, 2020; v1 submitted 30 April, 2019; originally announced May 2019.

Comments: 2 figures, 4 tables, SI and code are available here: https://github.com/naokimas/exp_mixture_model

Journal ref: Royal Society Open Science, 7, 191643 (2020)

arXiv:1810.03825 [pdf, other]

Adaptive Minimax Regret against Smooth Logarithmic Losses over High-Dimensional $\ell_1$-Balls via Envelope Complexity

Authors: Kohei Miyaguchi, Kenji Yamanishi

Abstract: We develop a new theoretical framework, the \emph{envelope complexity}, to analyze the minimax regret with logarithmic loss functions and derive a Bayesian predictor that adaptively achieves the minimax regret over high-dimensional $\ell_1$-balls within a factor of two. The prior is newly derived for achieving the minimax regret and called the \emph{spike-and-tails~(ST) prior} as it looks like. Th… ▽ More We develop a new theoretical framework, the \emph{envelope complexity}, to analyze the minimax regret with logarithmic loss functions and derive a Bayesian predictor that adaptively achieves the minimax regret over high-dimensional $\ell_1$-balls within a factor of two. The prior is newly derived for achieving the minimax regret and called the \emph{spike-and-tails~(ST) prior} as it looks like. The resulting regret bound is so simple that it is completely determined with the smoothness of the loss function and the radius of the balls except with logarithmic factors, and it has a generalized form of existing regret/risk bounds. In the preliminary experiment, we confirm that the ST prior outperforms the conventional minimax-regret prior under non-high-dimensional asymptotics. △ Less

Submitted 13 October, 2018; v1 submitted 9 October, 2018; originally announced October 2018.

arXiv:1805.10487 [pdf, other]

Stable Geodesic Update on Hyperbolic Space and its Application to Poincare Embeddings

Authors: Yosuke Enokida, Atsushi Suzuki, Kenji Yamanishi

Abstract: A hyperbolic space has been shown to be more capable of modeling complex networks than a Euclidean space. This paper proposes an explicit update rule along geodesics in a hyperbolic space. The convergence of our algorithm is theoretically guaranteed, and the convergence rate is better than the conventional Euclidean gradient descent algorithm. Moreover, our algorithm avoids the "bias" problem of e… ▽ More A hyperbolic space has been shown to be more capable of modeling complex networks than a Euclidean space. This paper proposes an explicit update rule along geodesics in a hyperbolic space. The convergence of our algorithm is theoretically guaranteed, and the convergence rate is better than the conventional Euclidean gradient descent algorithm. Moreover, our algorithm avoids the "bias" problem of existing methods using the Riemannian gradient. Experimental results demonstrate the good performance of our algorithm in the \Poincare embeddings of knowledge base data. △ Less

Submitted 26 May, 2018; originally announced May 2018.

arXiv:1804.09904 [pdf, other]

High-dimensional Penalty Selection via Minimum Description Length Principle

Authors: Kohei Miyaguchi, Kenji Yamanishi

Abstract: We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of… ▽ More We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of penalty functions in view of the minimum description length principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This solves the above challenges in an approximate manner because it allows us to accurately approximate LNML and then efficiently minimize it. The experimental results show that MDL-RS improves the generalization performance of regularized estimates specifically when the model has redundant parameters. △ Less

Submitted 26 April, 2018; originally announced April 2018.

Comments: Preprint before review

arXiv:1801.03705 [pdf, ps, other]

Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis

Authors: Atsushi Suzuki, Kenji Yamanishi

Abstract: The normalized maximum likelihood code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the normalized maximum likelihood code length on the basis of Fourier analysis. Our methodology provides an efficient non-asymptotic c… ▽ More The normalized maximum likelihood code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the normalized maximum likelihood code length on the basis of Fourier analysis. Our methodology provides an efficient non-asymptotic calculation formula for exponential family models and an asymptotic calculation formula for general parametric models with a weaker assumption compared to that in previous work. △ Less

Submitted 11 January, 2018; originally announced January 2018.

arXiv:1711.02478 [pdf, ps, other]

doi 10.1007/s10618-019-00657-9

Grafting for Combinatorial Boolean Model using Frequent Itemset Mining

Authors: Taito Lee, Shin Matsushima, Kenji Yamanishi

Abstract: This paper introduces the combinatorial Boolean model (CBM), which is defined as the class of linear combinations of conjunctions of Boolean attributes. This paper addresses the issue of learning CBM from labeled data. CBM is of high knowledge interpretability but naïve learning of it requires exponentially large computation time with respect to data dimension and sample size. To overcome this com… ▽ More This paper introduces the combinatorial Boolean model (CBM), which is defined as the class of linear combinations of conjunctions of Boolean attributes. This paper addresses the issue of learning CBM from labeled data. CBM is of high knowledge interpretability but naïve learning of it requires exponentially large computation time with respect to data dimension and sample size. To overcome this computational difficulty, we propose an algorithm GRAB (GRAfting for Boolean datasets), which efficiently learns CBM within the $L_1$-regularized loss minimization framework. The key idea of GRAB is to reduce the loss minimization problem to the weighted frequent itemset mining, in which frequent patterns are efficiently computable. We employ benchmark datasets to empirically demonstrate that GRAB is effective in terms of computational efficiency, prediction accuracy and knowledge discovery. △ Less

Submitted 13 November, 2017; v1 submitted 7 November, 2017; originally announced November 2017.

Journal ref: Data Min Knowl Disc 34, 101-123 (2020)

arXiv:1709.00925 [pdf, other]

Upper Bound on Normalized Maximum Likelihood Codes for Gaussian Mixture Models

Authors: So Hirai, Kenji Yamanishi

Abstract: This paper shows that the normalized maximum likelihood~(NML) code-length calculated in [1] is an upper bound on the NML code-length strictly calculated for the Gaussian Mixture Model. When we use this upper bound on the NML code-length, we must change the scale of the data sequence to satisfy the restricted domain. However, we also show that the algorithm for model selection is essentially univer… ▽ More This paper shows that the normalized maximum likelihood~(NML) code-length calculated in [1] is an upper bound on the NML code-length strictly calculated for the Gaussian Mixture Model. When we use this upper bound on the NML code-length, we must change the scale of the data sequence to satisfy the restricted domain. However, we also show that the algorithm for model selection is essentially universal, regardless of the scale conversion of the data in Gaussian Mixture Models, and that, consequently, the experimental results in [1] can be used as they are. In addition to this, we correct the NML code-length in [1] for generalized logistic distributions. △ Less

Submitted 18 November, 2018; v1 submitted 4 September, 2017; originally announced September 2017.

arXiv:1603.07094 [pdf, ps, other]

Predicting Glaucoma Visual Field Loss by Hierarchically Aggregating Clustering-based Predictors

Authors: Motohide Higaki, Kai Morino, Hiroshi Murata, Ryo Asaoka, Kenji Yamanishi

Abstract: This study addresses the issue of predicting the glaucomatous visual field loss from patient disease datasets. Our goal is to accurately predict the progress of the disease in individual patients. As very few measurements are available for each patient, it is difficult to produce good predictors for individuals. A recently proposed clustering-based method enhances the power of prediction using pat… ▽ More This study addresses the issue of predicting the glaucomatous visual field loss from patient disease datasets. Our goal is to accurately predict the progress of the disease in individual patients. As very few measurements are available for each patient, it is difficult to produce good predictors for individuals. A recently proposed clustering-based method enhances the power of prediction using patient data with similar spatiotemporal patterns. Each patient is categorized into a cluster of patients, and a predictive model is constructed using all of the data in the class. Predictions are highly dependent on the quality of clustering, but it is difficult to identify the best clustering method. Thus, we propose a method for aggregating cluster-based predictors to obtain better prediction accuracy than from a single cluster-based prediction. Further, the method shows very high performances by hierarchically aggregating experts generated from several cluster-based methods. We use real datasets to demonstrate that our method performs significantly better than conventional clustering-based and patient-wise regression methods, because the hierarchical aggregating strategy has a mechanism whereby good predictors in a small community can thrive. △ Less

Submitted 23 March, 2016; originally announced March 2016.

arXiv:1205.3549 [pdf, ps, other]

Normalized Maximum Likelihood Coding for Exponential Family with Its Applications to Optimal Clustering

Authors: So Hirai, Kenji Yamanishi

Abstract: We are concerned with the issue of how to calculate the normalized maximum likelihood (NML) code-length. There is a problem that the normalization term of the NML code-length may diverge when it is continuous and unbounded and a straightforward computation of it is highly expensive when the data domain is finite . In previous works it has been investigated how to calculate the NML code-length for… ▽ More We are concerned with the issue of how to calculate the normalized maximum likelihood (NML) code-length. There is a problem that the normalization term of the NML code-length may diverge when it is continuous and unbounded and a straightforward computation of it is highly expensive when the data domain is finite . In previous works it has been investigated how to calculate the NML code-length for specific types of distributions. We first propose a general method for computing the NML code-length for the exponential family. Then we specifically focus on Gaussian mixture model (GMM), and propose a new efficient method for computing the NML to them. We develop it by generalizing Rissanen's re-normalizing technique. Then we apply this method to the clustering issue, in which a clustering structure is modeled using a GMM, and the main task is to estimate the optimal number of clusters on the basis of the NML code-length. We demonstrate using artificial data sets the superiority of the NML-based clustering over other criteria such as AIC, BIC in terms of the data size required for high accuracy rate to be achieved. △ Less

Submitted 16 May, 2012; v1 submitted 15 May, 2012; originally announced May 2012.

arXiv:1110.2899 [pdf, ps, other]

Discovering Emerging Topics in Social Streams via Link Anomaly Detection

Authors: Toshimitsu Takahashi, Ryota Tomioka, Kenji Yamanishi

Abstract: Detection of emerging topics are now receiving renewed interest motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged are not only texts but also images, URLs, and videos. We focus on the social aspects of theses networks. That is, the links between users that are generated dynamicall… ▽ More Detection of emerging topics are now receiving renewed interest motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged are not only texts but also images, URLs, and videos. We focus on the social aspects of theses networks. That is, the links between users that are generated dynamically intentionally or unintentionally through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML), or with Kleinberg's burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics only based on the reply/mention relationships in social network posts. We demonstrate our technique in a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined. △ Less

Submitted 13 October, 2011; originally announced October 2011.

Comments: 10 pages, 6 figures

arXiv:cmp-lg/9705005 [pdf, ps, other]

Document Classification Using a Finite Mixture Model

Authors: Hang Li, Kenji Yamanishi

Abstract: We propose a new method of classifying documents into categories. The simple method of conducting hypothesis testing over word-based distributions in categories suffers from the data sparseness problem. In order to address this difficulty, Guthrie et.al. have developed a method using distributions based on hard clustering of words, i.e., in which a word is assigned to a single cluster and words… ▽ More We propose a new method of classifying documents into categories. The simple method of conducting hypothesis testing over word-based distributions in categories suffers from the data sparseness problem. In order to address this difficulty, Guthrie et.al. have developed a method using distributions based on hard clustering of words, i.e., in which a word is assigned to a single cluster and words in the same cluster are treated uniformly. This method might, however, degrade classification results, since the distributions it employs are not always precise enough for representing the differences between categories. We propose here the use of soft clustering of words, i.e., in which a word can be assigned to several different clusters and each cluster is characterized by a specific word probability distribution. We define for each document category a finite mixture model, which is a linear combination of the probability distributions of the clusters. We thereby treat the problem of classifying documents as that of conducting statistical hypothesis testing over finite mixture models. In order to accomplish this testing, we employ the EM algorithm which helps efficiently estimate parameters in a finite mixture model. Experimental results indicate that our method outperforms not only the method using distributions based on hard clustering, but also the method using word-based distributions and the method based on cosine-similarity. △ Less

Submitted 6 May, 1997; originally announced May 1997.

Comments: latex file, uses aclap.sty and epsf.sty, 9 pages, to appear ACL/EACL-97

Showing 1–27 of 27 results for author: Yamanishi, K