Search | arXiv e-print repository

ExplainReduce: Summarising local explanations via proxies

Authors: Lauri Seppäläinen, Mudong Guo, Kai Puolamäki

Abstract: Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach… ▽ More Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small "proxy set" of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics. △ Less

Submitted 14 February, 2025; originally announced February 2025.

Comments: 22 pages with a 7 page appendix, 7 + 5 figures, 2 tables. The datasets and source code used in the paper are available at https://github.com/edahelsinki/explainreduce

ACM Class: I.2.4

arXiv:2406.00502 [pdf, other]

Non-geodesically-convex optimization in the Wasserstein space

Authors: Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami

Abstract: We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive m… ▽ More We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler whose convergence is -- to our knowledge -- still unknown in our very general non-geodesically-convex setting. △ Less

Submitted 7 January, 2025; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.08486 [pdf, other]

Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction

Authors: Anri Patron, Ayush Prasad, Hoang Phuc Hau Luu, Kai Puolamäki

Abstract: A fundamental problem in supervised learning is to find a good set of features or distance measures. If the new set of features is of lower dimensionality and can be obtained by a simple transformation of the original data, they can make the model understandable, reduce overfitting, and even help to detect distribution drift. We propose a supervised dimensionality reduction method Gradient Boostin… ▽ More A fundamental problem in supervised learning is to find a good set of features or distance measures. If the new set of features is of lower dimensionality and can be obtained by a simple transformation of the original data, they can make the model understandable, reduce overfitting, and even help to detect distribution drift. We propose a supervised dimensionality reduction method Gradient Boosting Mapping (GBMAP), where the outputs of weak learners -- defined as one-layer perceptrons -- define the embedding. We show that the embedding coordinates provide better features for the supervised learning task, making simple linear models competitive with the state-of-the-art regressors and classifiers. We also use the embedding to find a principled distance measure between points. The features and distance measures automatically ignore directions irrelevant to the supervised learning task. We also show that we can reliably detect out-of-distribution data points with potentially large regression or classification errors. GBMAP is fast and works in seconds for dataset of million data points or hundreds of features. As a bonus, GBMAP provides a regression and classification performance comparable to the state-of-the-art supervised learning methods. △ Less

Submitted 14 May, 2024; originally announced May 2024.

Comments: 32 pages, 8 figures, 5 tables

arXiv:2310.15610 [pdf, other]

doi 10.1371/journal.pone.0297714

Using Slisemap to interpret physical data

Authors: Lauri Seppäläinen, Anton Björklund, Vitus Besel, Kai Puolamäki

Abstract: Manifold visualisation techniques are commonly used to visualise high-dimensional datasets in physical sciences. In this paper we apply a recently introduced manifold visualisation method, called Slise, on datasets from physics and chemistry. Slisemap combines manifold visualisation with explainable artificial intelligence. Explainable artificial intelligence is used to investigate the decision pr… ▽ More Manifold visualisation techniques are commonly used to visualise high-dimensional datasets in physical sciences. In this paper we apply a recently introduced manifold visualisation method, called Slise, on datasets from physics and chemistry. Slisemap combines manifold visualisation with explainable artificial intelligence. Explainable artificial intelligence is used to investigate the decision processes of black box machine learning models and complex simulators. With Slisemap we find an embedding such that data items with similar local explanations are grouped together. Hence, Slisemap gives us an overview of the different behaviours of a black box model. This makes Slisemap into a supervised manifold visualisation method, where the patterns in the embedding reflect a target property. In this paper we show how Slisemap can be used and evaluated on physical data and that Slisemap is helpful in finding meaningful information on classification and regression models trained on these datasets. △ Less

Submitted 24 October, 2023; originally announced October 2023.

Comments: 17 pages, 5 + 1 figures, 1 table. The datasets and source code used in the paper are available at https://www.edahelsinki.fi/papers/slisemap_phys

Journal ref: PLOS ONE 19 (2024) 1-16

arXiv:2306.12110 [pdf, other]

doi 10.1007/978-3-031-43430-3_26

$χ$iplot: web-first visualisation platform for multidimensional data

Authors: Akihiro Tanaka, Juniper Tyree, Anton Björklund, Jarmo Mäkelä, Kai Puolamäki

Abstract: $χ$iplot is an HTML5-based system for interactive exploration of data and machine learning models. A key aspect is interaction, not only for the interactive plots but also between plots. Even though $χ$iplot is not restricted to any single application domain, we have developed and tested it with domain experts in quantum chemistry to study molecular interactions and regression models. $χ… ▽ More $χ$iplot is an HTML5-based system for interactive exploration of data and machine learning models. A key aspect is interaction, not only for the interactive plots but also between plots. Even though $χ$iplot is not restricted to any single application domain, we have developed and tested it with domain experts in quantum chemistry to study molecular interactions and regression models. $χ$iplot can be run both locally and online in a web browser (keeping the data local). The plots and data can also easily be exported and shared. A modular structure also makes $χ$iplot optimal for developing machine learning and new interaction methods. △ Less

Submitted 21 June, 2023; originally announced June 2023.

Comments: 5 pages, 1 figure, accepted to the demo track of ECML PKDD 2023, https://github.com/edahelsinki/xiplot

ACM Class: J.0; J.2; H.5

Journal ref: Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14175 pp. 335-339

arXiv:2201.04455 [pdf, other]

doi 10.1007/s10994-022-06261-1

SLISEMAP: Supervised dimensionality reduction through local explanations

Authors: Anton Björklund, Jarmo Mäkelä, Kai Puolamäki

Abstract: Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local expla… ▽ More Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local explanations for all data items and builds a (typically) two-dimensional global visualisation of the black box model such that data items with similar local explanations are projected nearby. We provide a mathematical derivation of our problem and an open source implementation implemented using the GPU-optimised PyTorch library. We compare SLISEMAP to multiple popular dimensionality reduction methods and find that SLISEMAP is able to utilise labelled data to create embeddings with consistent local white box models. We also compare SLISEMAP to other model-agnostic local explanation methods and show that SLISEMAP provides comparable explanations and that the visualisations can give a broader understanding of black box regression and classification models. △ Less

Submitted 18 May, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: 26 pages, 10 figures, 3 tables. This revision replaces the $λ_z$ parameter with $z_{radius}$, which is more intuitive and stable. There are also many typographical and clarity improvements. The source code for our implementation of the algorithm (and the experiments) are available from GitHub at https://github.com/edahelsinki/slisemap . Machine Learning (2022)

Journal ref: Machine Learning 112, 1-43 (2023)

arXiv:2107.01126 [pdf, other]

Interactive Causal Structure Discovery in Earth System Sciences

Authors: Laila Melkas, Rafael Savvides, Suyog Chandramouli, Jarmo Mäkelä, Tuomo Nieminen, Ivan Mammarella, Kai Puolamäki

Abstract: Causal structure discovery (CSD) models are making inroads into several domains, including Earth system sciences. Their widespread adaptation is however hampered by the fact that the resulting models often do not take into account the domain knowledge of the experts and that it is often necessary to modify the resulting models iteratively. We present a workflow that is required to take this knowle… ▽ More Causal structure discovery (CSD) models are making inroads into several domains, including Earth system sciences. Their widespread adaptation is however hampered by the fact that the resulting models often do not take into account the domain knowledge of the experts and that it is often necessary to modify the resulting models iteratively. We present a workflow that is required to take this knowledge into account and to apply CSD algorithms in Earth system sciences. At the same time, we describe open research questions that still need to be addressed. We present a way to interactively modify the outputs of the CSD algorithms and argue that the user interaction can be modelled as a greedy finding of the local maximum-a-posteriori solution of the likelihood function, which is composed of the likelihood of the causal model and the prior distribution representing the knowledge of the expert user. We use a real-world data set for examples constructed in collaboration with our co-authors, who are the domain area experts. We show that finding maximally usable causal models in the Earth system sciences or other similar domains is a difficult task which contains many interesting open research questions. We argue that taking the domain knowledge into account has a substantial effect on the final causal models discovered. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: 23 pages, 8 figures, to be published in Proceedings of the 2021 KDD Workshop on Causal Discovery

arXiv:2006.09467 [pdf, other]

doi 10.1145/1557019.1557065

Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

Authors: Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, Heikki Mannila

Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. Fo… ▽ More There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery. △ Less

Submitted 16 June, 2020; originally announced June 2020.

Journal ref: KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

arXiv:1912.06384 [pdf, other]

Low-Cost Outdoor Air Quality Monitoring and Sensor Calibration: A Survey and Critical Analysis

Authors: Francesco Concas, Julien Mineraud, Eemil Lagerspetz, Samu Varjonen, Xiaoli Liu, Kai Puolamäki, Petteri Nurmi, Sasu Tarkoma

Abstract: The significance of air pollution and the problems associated with it are fueling deployments of air quality monitoring stations worldwide. The most common approach for air quality monitoring is to rely on environmental monitoring stations, which unfortunately are very expensive both to acquire and to maintain. Hence environmental monitoring stations are typically sparsely deployed, resulting in l… ▽ More The significance of air pollution and the problems associated with it are fueling deployments of air quality monitoring stations worldwide. The most common approach for air quality monitoring is to rely on environmental monitoring stations, which unfortunately are very expensive both to acquire and to maintain. Hence environmental monitoring stations are typically sparsely deployed, resulting in limited spatial resolution for measurements. Recently, low-cost air quality sensors have emerged as an alternative that can improve the granularity of monitoring. The use of low-cost air quality sensors, however, presents several challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors, such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Periodic re-calibration can improve the accuracy of low-cost sensors, particularly with machine-learning-based calibration, which has shown great promise due to its capability to calibrate sensors in-field. In this article, we survey the rapidly growing research landscape of low-cost sensor technologies for air quality monitoring and their calibration using machine learning techniques. We also identify open research challenges and present directions for future research. △ Less

Submitted 25 January, 2021; v1 submitted 13 December, 2019; originally announced December 2019.

arXiv:1910.04069 [pdf, other]

Estimating regression errors without ground truth values

Authors: Henri Tiittanen, Emilia Oikarinen, Andreas Henelius, Kai Puolamäki

Abstract: Regression analysis is a standard supervised machine learning method used to model an outcome variable in terms of a set of predictor variables. In most real-world applications we do not know the true value of the outcome variable being predicted outside the training data, i.e., the ground truth is unknown. It is hence not straightforward to directly observe when the estimate from a model potentia… ▽ More Regression analysis is a standard supervised machine learning method used to model an outcome variable in terms of a set of predictor variables. In most real-world applications we do not know the true value of the outcome variable being predicted outside the training data, i.e., the ground truth is unknown. It is hence not straightforward to directly observe when the estimate from a model potentially is wrong, due to phenomena such as overfitting and concept drift. In this paper we present an efficient framework for estimating the generalization error of regression functions, applicable to any family of regression functions when the ground truth is unknown. We present a theoretical derivation of the framework and empirically evaluate its strengths and limitations. We find that it performs robustly and is useful for detecting concept drift in datasets in several real-world domains. △ Less

Submitted 9 October, 2019; originally announced October 2019.

Comments: 33 pages, 9 figures, 2 tables

arXiv:1905.02515 [pdf, other]

Guided Visual Exploration of Relations in Data Sets

Authors: Kai Puolamäki, Emilia Oikarinen, Andreas Henelius

Abstract: Efficient explorative data analysis systems must take into account both what a user knows and wants to know. This paper proposes a principled framework for interactive visual exploration of relations in data, through views most informative given the user's current knowledge and objectives. The user can input pre-existing knowledge of relations in the data and also formulate specific exploration in… ▽ More Efficient explorative data analysis systems must take into account both what a user knows and wants to know. This paper proposes a principled framework for interactive visual exploration of relations in data, through views most informative given the user's current knowledge and objectives. The user can input pre-existing knowledge of relations in the data and also formulate specific exploration interests, which are then taken into account in the exploration. The idea is to steer the exploration process towards the interests of the user, instead of showing uninteresting or already known relations. The user's knowledge is modelled by a distribution over data sets parametrised by subsets of rows and columns of data, called tile constraints. We provide a computationally efficient implementation of this concept based on constrained randomisation. Furthermore, we describe a novel dimensionality reduction method for finding the views most informative to the user, which at the limit of no background knowledge and with generic objectives reduces to PCA. We show that the method is suitable for interactive use and is robust to noise, outperforms standard projection pursuit visualisation methods, and gives understandable and useful results in analysis of real-world data. We provide an open-source implementation of the framework. △ Less

Submitted 1 July, 2021; v1 submitted 7 May, 2019; originally announced May 2019.

Comments: 32 pages, 13 figures. This article extends arXiv:1804.03194 and arXiv:1805.07725

Journal ref: Journal of Machine Learning Research 22(96):1-32, 2021

arXiv:1811.05974 [pdf, other]

doi 10.1103/PhysRevE.99.053311

Randomisation Algorithms for Large Sparse Matrices

Authors: Kai Puolamäki, Andreas Henelius, Antti Ukkonen

Abstract: In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Furthermore, generating surrogate networks typically requires that different properties of the network is preserved, e.g., edges may not be added or deleted and the edge weights may be restricted to certain intervals. In this paper we introduce a novel efficient proper… ▽ More In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Furthermore, generating surrogate networks typically requires that different properties of the network is preserved, e.g., edges may not be added or deleted and the edge weights may be restricted to certain intervals. In this paper we introduce a novel efficient property-preserving Markov Chain Monte Carlo method termed CycleSampler for generating surrogate networks in which (i) edge weights are constrained to an interval and node weights are preserved exactly, and (ii) edge and node weights are both constrained to intervals. These two types of constraints cover a wide variety of practical use-cases. The method is applicable to both undirected and directed graphs. We empirically demonstrate the efficiency of the CycleSampler method on real-world datasets. We provide an implementation of CycleSampler in R, with parts implemented in C. △ Less

Submitted 14 November, 2018; originally announced November 2018.

Journal ref: Phys. Rev. E 99, 053311 (2019)

arXiv:1805.07725 [pdf, other]

Human-guided data exploration using randomisation

Authors: Kai Puolamäki, Emilia Oikarinen, Buse Atli, Andreas Henelius

Abstract: An explorative data analysis system should be aware of what the user already knows and what the user wants to know of the data: otherwise the system cannot provide the user with the most informative and useful views of the data. We propose a principled way to do exploratory data analysis, where the user's background knowledge is modeled by a distribution parametrised by subsets of rows and columns… ▽ More An explorative data analysis system should be aware of what the user already knows and what the user wants to know of the data: otherwise the system cannot provide the user with the most informative and useful views of the data. We propose a principled way to do exploratory data analysis, where the user's background knowledge is modeled by a distribution parametrised by subsets of rows and columns in the data, called tiles. The user can also use tiles to describe his or her interests concerning relations in the data. We provide a computationally efficient implementation of this concept based on constrained randomisation. The implementation is used to model both the background knowledge and the user's information request and is a necessary prerequisite for any interactive system. Furthermore, we describe a novel linear projection pursuit method to find and show the views most informative to the user, which at the limit of no background knowledge and with generic objectives reduces to PCA. We show that our method is robust under noise and fast enough for interactive use. We also show that the method gives understandable and useful results when analysing real-world data sets. We will release an open source library implementing the idea, including the experiments presented in this paper. We show that our method can outperform standard projection pursuit visualisation methods in exploration tasks. Our framework makes it possible to construct human-guided data exploration systems which are fast, powerful, and give results that are easy to comprehend. △ Less

Submitted 30 December, 2018; v1 submitted 20 May, 2018; originally announced May 2018.

Comments: 14 pages, 8 figures

arXiv:1804.03194 [pdf, other]

Human-Guided Data Exploration

Authors: Andreas Henelius, Emilia Oikarinen, Kai Puolamäki

Abstract: The outcome of the explorative data analysis (EDA) phase is vital for successful data analysis. EDA is more effective when the user interacts with the system used to carry out the exploration. In the recently proposed paradigm of iterative data mining the user controls the exploration by inputting knowledge in the form of patterns observed during the process. The system then shows the user views o… ▽ More The outcome of the explorative data analysis (EDA) phase is vital for successful data analysis. EDA is more effective when the user interacts with the system used to carry out the exploration. In the recently proposed paradigm of iterative data mining the user controls the exploration by inputting knowledge in the form of patterns observed during the process. The system then shows the user views of the data that are maximally informative given the user's current knowledge. Although this scheme is good at showing surprising views of the data to the user, there is a clear shortcoming: the user cannot steer the process. In many real cases we want to focus on investigating specific questions concerning the data. This paper presents the Human Guided Data Exploration framework, generalising previous research. This framework allows the user to incorporate existing knowledge into the exploration process, focus on exploring a subset of the data, and compare different complex hypotheses concerning relations in the data. The framework utilises a computationally efficient constrained randomisation scheme. To showcase the framework, we developed a free open-source tool, using which the empirical evaluation on real-world datasets was carried out. Our evaluation shows that the ability to focus on particular subsets and being able to compare hypotheses are important additions to the interactive iterative data mining process. △ Less

Submitted 9 April, 2018; originally announced April 2018.

arXiv:1710.08167 [pdf, other]

doi 10.1007/s10618-019-00655-x

Interactive Visual Data Exploration with Subjective Feedback: An Information-Theoretic Approach

Authors: Kai Puolamäki, Emilia Oikarinen, Bo Kang, Jefrey Lijffijt, Tijl De Bie

Abstract: Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing methods use predefined criteria to choose the representation of data. There is a lack of methods that (i) elicit from the user what she has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns c… ▽ More Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing methods use predefined criteria to choose the representation of data. There is a lack of methods that (i) elicit from the user what she has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as "this set of points forms a cluster", and requires no knowledge of maths. This background knowledge is used to find a Maximum Entropy distribution of the data, after which the system provides the user data projections in which the data and the Maximum Entropy distribution differ the most, hence showing the user aspects of the data that are maximally informative given the user's current knowledge. We provide an open source EDA system with tailored interactive visualizations to demonstrate these concepts. We study the performance of the system and present use cases on both synthetic and real data. We find that the model and the prototype system allow the user to learn information efficiently from various data sources and the system works sufficiently fast in practice. We conclude that the information theoretic approach to exploratory data analysis where patterns observed by a user are formalized as constraints provides a principled, intuitive, and efficient basis for constructing an EDA system. △ Less

Submitted 23 October, 2017; originally announced October 2017.

Comments: 12 pages, 9 figures, 2 tables, conference submission

Journal ref: Data Mining and Knowledge Discovery 34 (2020) 21-49

arXiv:1710.04521 [pdf, other]

doi 10.1109/ICDE.2018.00148

Subjectively Interesting Subgroup Discovery on Real-valued Targets

Authors: Jefrey Lijffijt, Bo Kang, Wouter Duivesteijn, Kai Puolamäki, Emilia Oikarinen, Tijl De Bie

Abstract: Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to potentially consider, and there are infinitely many if we consider weighted combinations, even for linear combinations. Hence, an obvious question is whether we can automate the search for interesting patterns… ▽ More Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to potentially consider, and there are infinitely many if we consider weighted combinations, even for linear combinations. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. For example, to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal Information Theoretic sense) with respect to a single or set of real-valued target attributes. The subgroup descriptions are in terms of a succinct set of arbitrarily-typed other attributes. The approach is based on the Subjective Interestingness framework FORSIED to enable the use of prior knowledge when finding most informative non-redundant patterns, and hence the method also supports iterative data mining. △ Less

Submitted 12 October, 2017; originally announced October 2017.

Comments: 12 pages, 10 figures, 2 tables, conference submission

arXiv:1707.07576 [pdf, ps, other]

Interpreting Classifiers through Attribute Interactions in Datasets

Authors: Andreas Henelius, Kai Puolamäki, Antti Ukkonen

Abstract: In this work we present the novel ASTRID method for investigating which attribute interactions classifiers exploit when making predictions. Attribute interactions in classification tasks mean that two or more attributes together provide stronger evidence for a particular class label. Knowledge of such interactions makes models more interpretable by revealing associations between attributes. This h… ▽ More In this work we present the novel ASTRID method for investigating which attribute interactions classifiers exploit when making predictions. Attribute interactions in classification tasks mean that two or more attributes together provide stronger evidence for a particular class label. Knowledge of such interactions makes models more interpretable by revealing associations between attributes. This has applications, e.g., in pharmacovigilance to identify interactions between drugs or in bioinformatics to investigate associations between single nucleotide polymorphisms. We also show how the found attribute partitioning is related to a factorisation of the data generating distribution and empirically demonstrate the utility of the proposed method. △ Less

Submitted 24 July, 2017; originally announced July 2017.

Comments: presented at 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017), Sydney, NSW, Australia

arXiv:1612.08714 [pdf, other]

Clustering with Confidence: Finding Clusters with Statistical Guarantees

Authors: Andreas Henelius, Kai Puolamäki, Henrik Boström, Panagiotis Papapetrou

Abstract: Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or re-running a clustering algorithm involving some stochastic component may lead to completely different clusters. There is, hence, a need for techniques that can quant… ▽ More Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or re-running a clustering algorithm involving some stochastic component may lead to completely different clusters. There is, hence, a need for techniques that can quantify the instability of the generated clusters. In this study, we propose a technique for quantifying the instability of a clustering solution and for finding robust clusters, termed core clusters, which correspond to clusters where the co-occurrence probability of each data item within a cluster is at least $1 - α$. We demonstrate how solving the core clustering problem is linked to finding the largest maximal cliques in a graph. We show that the method can be used with both clustering and classification algorithms. The proposed method is tested on both simulated and real datasets. The results show that the obtained clusters indeed meet the guarantees on robustness. △ Less

Submitted 30 December, 2016; v1 submitted 27 December, 2016; originally announced December 2016.

Comments: 30 pages, 5 figures, 5 tables. Added URL to the source code

arXiv:1612.07597 [pdf, other]

Finding Statistically Significant Attribute Interactions

Authors: Andreas Henelius, Antti Ukkonen, Kai Puolamäki

Abstract: In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These… ▽ More In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These interactions are useful in several practical applications, for example, to gain insight into the structure of the data, in feature selection, and in data anonymisation. We present a novel method, based on statistical significance testing, that can be used to verify if the data set has been created by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes. Furthermore, we provide a method, named ASTRID, for automatically finding a partition of attributes describing the distribution that has generated the data. State-of-the-art classifiers are utilised to capture the interactions present in the data by systematically breaking attribute interactions and observing the effect of this breaking on classifier performance. We empirically demonstrate the utility of the proposed method with examples using real and synthetic data. △ Less

Submitted 16 March, 2017; v1 submitted 22 December, 2016; originally announced December 2016.

Comments: 9 pages, 4 tables, 1 figure

arXiv:1207.1414 [pdf]

Two-Way Latent Grouping Model for User Preference Prediction

Authors: Eerika Savia, Kai Puolamaki, Janne Sinkkonen, Samuel Kaski

Abstract: We introduce a novel latent grouping model for predicting the relevance of a new document to a user. The model assumes a latent group structure for both users and documents. We compared the model against a state-of-the-art method, the User Rating Profile model, where only users have a latent group structure. We estimate both models by Gibbs sampling. The new method predicts relevance more accurate… ▽ More We introduce a novel latent grouping model for predicting the relevance of a new document to a user. The model assumes a latent group structure for both users and documents. We compared the model against a state-of-the-art method, the User Rating Profile model, where only users have a latent group structure. We estimate both models by Gibbs sampling. The new method predicts relevance more accurately for new documents that have few known ratings. The reason is that generalization over documents then becomes necessary and hence the twoway grouping is profitable. △ Less

Submitted 4 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

Report number: UAI-P-2005-PG-518-524

arXiv:0712.2682 [pdf, other]

doi 10.1016/j.ipl.2008.03.013

An Approximation Ratio for Biclustering

Authors: Kai Puolamäki, Sami Hanhijärvi, Gemma C. Garriga

Abstract: The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal biclustering by applying one-way clustering algorithms independently on the rows and on the columns of the input matrix. We show that such a solution y… ▽ More The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal biclustering by applying one-way clustering algorithms independently on the rows and on the columns of the input matrix. We show that such a solution yields a worst-case approximation ratio of 1+sqrt(2) under L1-norm for 0-1 valued matrices, and of 2 under L2-norm for real valued matrices. △ Less

Submitted 22 August, 2008; v1 submitted 17 December, 2007; originally announced December 2007.

Comments: 9 pages, 2 figures; presentation clarified, replaced to match the version to be published in IPL

Report number: Publications in Computer and Information Science E13

Journal ref: Information Processing Letters 108 (2008) 45-49

Showing 1–21 of 21 results for author: Puolamäki, K