Showing 1–21 of 21 results for author: Puolamäki, K

  1. arXiv:2502.10311

    cs.LG cs.AI cs.HC

    ExplainReduce: Summarising local explanations via proxies

    Authors: Lauri Seppäläinen, Mudong Guo, Kai Puolamäki

    Abstract: Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: 22 pages with a 7 page appendix, 7 + 5 figures, 2 tables. The datasets and source code used in the paper are available at https://github.com/edahelsinki/explainreduce

    ACM Class: I.2.4

  2. arXiv:2406.00502

    math.OC cs.LG

    Non-geodesically-convex optimization in the Wasserstein space

    Authors: Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami

    Abstract: We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive m…

    Submitted 7 January, 2025; v1 submitted 1 June, 2024; originally announced June 2024.

  3. arXiv:2405.08486

    cs.LG

    Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction

    Authors: Anri Patron, Ayush Prasad, Hoang Phuc Hau Luu, Kai Puolamäki

    Abstract: A fundamental problem in supervised learning is to find a good set of features or distance measures. If the new set of features is of lower dimensionality and can be obtained by a simple transformation of the original data, they can make the model understandable, reduce overfitting, and even help to detect distribution drift. We propose a supervised dimensionality reduction method Gradient Boostin…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 32 pages, 8 figures, 5 tables

  4. Using Slisemap to interpret physical data

    Authors: Lauri Seppäläinen, Anton Björklund, Vitus Besel, Kai Puolamäki

    Abstract: Manifold visualisation techniques are commonly used to visualise high-dimensional datasets in physical sciences. In this paper we apply a recently introduced manifold visualisation method, called Slisemap, on datasets from physics and chemistry. Slisemap combines manifold visualisation with explainable artificial intelligence. Explainable artificial intelligence is used to investigate the decision pr…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: 17 pages, 5 + 1 figures, 1 table. The datasets and source code used in the paper are available at https://www.edahelsinki.fi/papers/slisemap_phys

    Journal ref: PLOS ONE 19 (2024) 1-16

  5. $χ$iplot: web-first visualisation platform for multidimensional data

    Authors: Akihiro Tanaka, Juniper Tyree, Anton Björklund, Jarmo Mäkelä, Kai Puolamäki

    Abstract: $χ$iplot is an HTML5-based system for interactive exploration of data and machine learning models. A key aspect is interaction, not only for the interactive plots but also between plots. Even though $χ$iplot is not restricted to any single application domain, we have developed and tested it with domain experts in quantum chemistry to study molecular interactions and regression models. $χ…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: 5 pages, 1 figure, accepted to the demo track of ECML PKDD 2023, https://github.com/edahelsinki/xiplot

    ACM Class: J.0; J.2; H.5

    Journal ref: Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14175 pp. 335-339

  6. SLISEMAP: Supervised dimensionality reduction through local explanations

    Authors: Anton Björklund, Jarmo Mäkelä, Kai Puolamäki

    Abstract: Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local expla…

    Submitted 18 May, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

    Comments: 26 pages, 10 figures, 3 tables. This revision replaces the $λ_z$ parameter with $z_{radius}$, which is more intuitive and stable. There are also many typographical and clarity improvements. The source code for our implementation of the algorithm (and the experiments) are available from GitHub at https://github.com/edahelsinki/slisemap . Machine Learning (2022)

    Journal ref: Machine Learning 112, 1-43 (2023)

  7. arXiv:2107.01126

    physics.data-an cs.HC cs.LG

    Interactive Causal Structure Discovery in Earth System Sciences

    Authors: Laila Melkas, Rafael Savvides, Suyog Chandramouli, Jarmo Mäkelä, Tuomo Nieminen, Ivan Mammarella, Kai Puolamäki

    Abstract: Causal structure discovery (CSD) models are making inroads into several domains, including Earth system sciences. Their widespread adaptation is however hampered by the fact that the resulting models often do not take into account the domain knowledge of the experts and that it is often necessary to modify the resulting models iteratively. We present a workflow that is required to take this knowle…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: 23 pages, 8 figures, to be published in Proceedings of the 2021 KDD Workshop on Causal Discovery

  8. Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

    Authors: Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, Heikki Mannila

    Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. Fo…

    Submitted 16 June, 2020; originally announced June 2020.

    Journal ref: KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

  9. arXiv:1912.06384

    eess.SP cs.LG stat.ML

    Low-Cost Outdoor Air Quality Monitoring and Sensor Calibration: A Survey and Critical Analysis

    Authors: Francesco Concas, Julien Mineraud, Eemil Lagerspetz, Samu Varjonen, Xiaoli Liu, Kai Puolamäki, Petteri Nurmi, Sasu Tarkoma

    Abstract: The significance of air pollution and the problems associated with it are fueling deployments of air quality monitoring stations worldwide. The most common approach for air quality monitoring is to rely on environmental monitoring stations, which unfortunately are very expensive both to acquire and to maintain. Hence environmental monitoring stations are typically sparsely deployed, resulting in l…

    Submitted 25 January, 2021; v1 submitted 13 December, 2019; originally announced December 2019.

  10. arXiv:1910.04069

    stat.ML cs.LG

    Estimating regression errors without ground truth values

    Authors: Henri Tiittanen, Emilia Oikarinen, Andreas Henelius, Kai Puolamäki

    Abstract: Regression analysis is a standard supervised machine learning method used to model an outcome variable in terms of a set of predictor variables. In most real-world applications we do not know the true value of the outcome variable being predicted outside the training data, i.e., the ground truth is unknown. It is hence not straightforward to directly observe when the estimate from a model potentia…

    Submitted 9 October, 2019; originally announced October 2019.

    Comments: 33 pages, 9 figures, 2 tables

  11. arXiv:1905.02515

    stat.ML cs.LG

    Guided Visual Exploration of Relations in Data Sets

    Authors: Kai Puolamäki, Emilia Oikarinen, Andreas Henelius

    Abstract: Efficient explorative data analysis systems must take into account both what a user knows and wants to know. This paper proposes a principled framework for interactive visual exploration of relations in data, through views most informative given the user's current knowledge and objectives. The user can input pre-existing knowledge of relations in the data and also formulate specific exploration in…

    Submitted 1 July, 2021; v1 submitted 7 May, 2019; originally announced May 2019.

    Comments: 32 pages, 13 figures. This article extends arXiv:1804.03194 and arXiv:1805.07725

    Journal ref: Journal of Machine Learning Research 22(96):1-32, 2021

  12. arXiv:1811.05974

    cs.DS physics.data-an q-bio.QM

    Randomisation Algorithms for Large Sparse Matrices

    Authors: Kai Puolamäki, Andreas Henelius, Antti Ukkonen

    Abstract: In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Furthermore, generating surrogate networks typically requires that different properties of the network are preserved, e.g., edges may not be added or deleted and the edge weights may be restricted to certain intervals. In this paper we introduce a novel efficient proper…

    Submitted 14 November, 2018; originally announced November 2018.

    Journal ref: Phys. Rev. E 99, 053311 (2019)

  13. arXiv:1805.07725

    stat.ML cs.LG

    Human-guided data exploration using randomisation

    Authors: Kai Puolamäki, Emilia Oikarinen, Buse Atli, Andreas Henelius

    Abstract: An explorative data analysis system should be aware of what the user already knows and what the user wants to know of the data: otherwise the system cannot provide the user with the most informative and useful views of the data. We propose a principled way to do exploratory data analysis, where the user's background knowledge is modeled by a distribution parametrised by subsets of rows and columns…

    Submitted 30 December, 2018; v1 submitted 20 May, 2018; originally announced May 2018.

    Comments: 14 pages, 8 figures

  14. arXiv:1804.03194

    stat.ML cs.HC cs.LG

    Human-Guided Data Exploration

    Authors: Andreas Henelius, Emilia Oikarinen, Kai Puolamäki

    Abstract: The outcome of the explorative data analysis (EDA) phase is vital for successful data analysis. EDA is more effective when the user interacts with the system used to carry out the exploration. In the recently proposed paradigm of iterative data mining the user controls the exploration by inputting knowledge in the form of patterns observed during the process. The system then shows the user views o…

    Submitted 9 April, 2018; originally announced April 2018.

  15. arXiv:1710.08167

    stat.ML cs.IT cs.LG

    Interactive Visual Data Exploration with Subjective Feedback: An Information-Theoretic Approach

    Authors: Kai Puolamäki, Emilia Oikarinen, Bo Kang, Jefrey Lijffijt, Tijl De Bie

    Abstract: Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing methods use predefined criteria to choose the representation of data. There is a lack of methods that (i) elicit from the user what she has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns c…

    Submitted 23 October, 2017; originally announced October 2017.

    Comments: 12 pages, 9 figures, 2 tables, conference submission

    Journal ref: Data Mining and Knowledge Discovery 34 (2020) 21-49

  16. Subjectively Interesting Subgroup Discovery on Real-valued Targets

    Authors: Jefrey Lijffijt, Bo Kang, Wouter Duivesteijn, Kai Puolamäki, Emilia Oikarinen, Tijl De Bie

    Abstract: Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to potentially consider, and there are infinitely many if we consider weighted combinations, even for linear combinations. Hence, an obvious question is whether we can automate the search for interesting patterns…

    Submitted 12 October, 2017; originally announced October 2017.

    Comments: 12 pages, 10 figures, 2 tables, conference submission

  17. arXiv:1707.07576

    stat.ML cs.LG

    Interpreting Classifiers through Attribute Interactions in Datasets

    Authors: Andreas Henelius, Kai Puolamäki, Antti Ukkonen

    Abstract: In this work we present the novel ASTRID method for investigating which attribute interactions classifiers exploit when making predictions. Attribute interactions in classification tasks mean that two or more attributes together provide stronger evidence for a particular class label. Knowledge of such interactions makes models more interpretable by revealing associations between attributes. This h…

    Submitted 24 July, 2017; originally announced July 2017.

    Comments: presented at 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017), Sydney, NSW, Australia

  18. arXiv:1612.08714

    stat.ML cs.LG

    Clustering with Confidence: Finding Clusters with Statistical Guarantees

    Authors: Andreas Henelius, Kai Puolamäki, Henrik Boström, Panagiotis Papapetrou

    Abstract: Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or re-running a clustering algorithm involving some stochastic component may lead to completely different clusters. There is, hence, a need for techniques that can quant…

    Submitted 30 December, 2016; v1 submitted 27 December, 2016; originally announced December 2016.

    Comments: 30 pages, 5 figures, 5 tables. Added URL to the source code

  19. arXiv:1612.07597

    stat.ML cs.LG

    Finding Statistically Significant Attribute Interactions

    Authors: Andreas Henelius, Antti Ukkonen, Kai Puolamäki

    Abstract: In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These…

    Submitted 16 March, 2017; v1 submitted 22 December, 2016; originally announced December 2016.

    Comments: 9 pages, 4 tables, 1 figure

  20. arXiv:1207.1414

    cs.IR cs.LG stat.ML

    Two-Way Latent Grouping Model for User Preference Prediction

    Authors: Eerika Savia, Kai Puolamaki, Janne Sinkkonen, Samuel Kaski

    Abstract: We introduce a novel latent grouping model for predicting the relevance of a new document to a user. The model assumes a latent group structure for both users and documents. We compared the model against a state-of-the-art method, the User Rating Profile model, where only users have a latent group structure. We estimate both models by Gibbs sampling. The new method predicts relevance more accurate…

    Submitted 4 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

    Report number: UAI-P-2005-PG-518-524

  21. An Approximation Ratio for Biclustering

    Authors: Kai Puolamäki, Sami Hanhijärvi, Gemma C. Garriga

    Abstract: The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal biclustering by applying one-way clustering algorithms independently on the rows and on the columns of the input matrix. We show that such a solution y…

    Submitted 22 August, 2008; v1 submitted 17 December, 2007; originally announced December 2007.

    Comments: 9 pages, 2 figures; presentation clarified, replaced to match the version to be published in IPL

    Report number: Publications in Computer and Information Science E13

    Journal ref: Information Processing Letters 108 (2008) 45-49