
Showing 1–17 of 17 results for author: Dembczynski, K

  1. arXiv:2410.08994  [pdf, other]

    stat.ML cs.LG

    Optimal Downsampling for Imbalanced Classification with Generalized Linear Models

    Authors: Yan Chen, Jose Blanchet, Krzysztof Dembczynski, Laura Fee Nern, Aaron Flores

    Abstract: Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly la…

    Submitted 11 October, 2024; originally announced October 2024.

    Journal ref: Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:1306-1314, 2025

  2. arXiv:2406.14743  [pdf, other]

    cs.LG stat.ML

    A General Online Algorithm for Optimizing Complex Performance Metrics

    Authors: Wojciech Kotłowski, Marek Wydmuch, Erik Schultheis, Rohit Babbar, Krzysztof Dembczyński

    Abstract: We consider sequential maximization of performance metrics that are general functions of a confusion matrix of a classifier (such as precision, F-measure, or G-mean). Such metrics are, in general, non-decomposable over individual instances, making their optimization very challenging. While they have been extensively studied under different frameworks in the batch setting, their analysis in the onl…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: This is the authors' version of the work accepted to ICML 2024

  3. arXiv:2401.16594  [pdf, other]

    cs.LG

    Consistent algorithms for multi-label classification with macro-at-$k$ metrics

    Authors: Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński

    Abstract: We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly $k$ labels predicted for each instance. These "macro-at-$k$" metrics possess desired properties for…

    Submitted 29 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: This is the authors' version of the work accepted to ICLR 2024; the final version of the paper, errors and typos corrected, and minor modifications to improve clarity

  4. arXiv:2311.05081  [pdf, other]

    cs.LG

    Generalized test utilities for long-tail performance in extreme multi-label classification

    Authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński

    Abstract: Extreme multi-label classification (XMLC) is the task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued…

    Submitted 17 January, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: This is the authors' version of the work accepted to NeurIPS 2023; the final version of the paper, errors and typos corrected, and minor modifications to improve clarity

  5. On Missing Labels, Long-tails and Propensities in Extreme Multi-label Classification

    Authors: Erik Schultheis, Marek Wydmuch, Rohit Babbar, Krzysztof Dembczyński

    Abstract: The propensity model introduced by Jain et al. 2016 has become a standard approach for dealing with missing and long-tail labels in extreme multi-label classification (XMLC). In this paper, we critically revise this approach showing that despite its theoretical soundness, its application in contemporary XMLC works is debatable. We exhaustively discuss the flaws of the propensity-based approach, an…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: This is the author's version of the work accepted at KDD '22

  6. arXiv:2203.06676  [pdf, other]

    cs.LG cs.AI stat.ML

    Set-valued prediction in hierarchical classification with constrained representation complexity

    Authors: Thomas Mortier, Eyke Hüllermeier, Krzysztof Dembczyński, Willem Waegeman

    Abstract: Set-valued prediction is a well-known concept in multi-class classification. When a classifier is uncertain about the class label for a test instance, it can predict a set of classes instead of a single class. In this paper, we focus on hierarchical multi-class classification problems, where valid sets (typically) correspond to internal nodes of the hierarchy. We argue that this is a very strong r…

    Submitted 13 March, 2022; originally announced March 2022.

  7. Propensity-scored Probabilistic Label Trees

    Authors: Marek Wydmuch, Kalina Jasinska-Kobus, Rohit Babbar, Krzysztof Dembczyński

    Abstract: Extreme multi-label classification (XMLC) refers to the task of tagging instances with small subsets of relevant labels coming from an extremely large set of all possible labels. Recently, XMLC has been widely applied to diverse web applications such as automatic content labeling, online advertising, or recommendation systems. In such environments, label distribution is often highly imbalanced, co…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: The extended version of SIGIR '21 Short Research Paper

  8. arXiv:2009.11218  [pdf, ps, other]

    cs.LG stat.ML

    Probabilistic Label Trees for Extreme Multi-label Classification

    Authors: Kalina Jasinska-Kobus, Marek Wydmuch, Krzysztof Dembczynski, Mikhail Kuznetsov, Robert Busa-Fekete

    Abstract: Extreme multi-label classification (XMLC) is a learning task of tagging instances with a small subset of relevant labels chosen from an extremely large pool of possible labels. Problems of this scale can be efficiently handled by organizing labels as a tree, like in hierarchical softmax used for multi-class problems. In this paper, we thoroughly investigate probabilistic label trees (PLTs) which c…

    Submitted 23 September, 2020; originally announced September 2020.

  9. arXiv:2007.04451  [pdf, ps, other]

    cs.LG stat.ML

    Online probabilistic label trees

    Authors: Kalina Jasinska-Kobus, Marek Wydmuch, Devanathan Thiruvenkatachari, Krzysztof Dembczyński

    Abstract: We introduce online probabilistic label trees (OPLTs), an algorithm that trains a label tree classifier in a fully online manner without any prior knowledge about the number of training instances, their features and labels. OPLTs are characterized by low time and space complexity as well as strong theoretical guarantees. They can be used for online multi-label and multi-class classification, inclu…

    Submitted 26 March, 2021; v1 submitted 8 July, 2020; originally announced July 2020.

    Comments: Accepted at AISTATS 2021

  10. arXiv:1906.08129  [pdf, other]

    cs.LG stat.ML

    Efficient Set-Valued Prediction in Multi-Class Classification

    Authors: Thomas Mortier, Marek Wydmuch, Krzysztof Dembczyński, Eyke Hüllermeier, Willem Waegeman

    Abstract: In cases of uncertainty, a multi-class classifier preferably returns a set of candidate classes instead of predicting a single class label with little guarantee. More precisely, the classifier should strive for an optimal balance between the correctness (the true class is among the candidates) and the precision (the candidates are not too many) of its prediction. We formalize this problem within a…

    Submitted 27 May, 2020; v1 submitted 19 June, 2019; originally announced June 2019.

  11. arXiv:1906.00294  [pdf, ps, other]

    cs.LG cs.CC stat.ML

    On the computational complexity of the probabilistic label tree algorithms

    Authors: Robert Busa-Fekete, Krzysztof Dembczynski, Alexander Golovnev, Kalina Jasinska, Mikhail Kuznetsov, Maxim Sviridenko, Chao Xu

    Abstract: Label tree-based algorithms are widely used to tackle multi-class and multi-label problems with a large number of labels. We focus on a particular subclass of these algorithms that use probabilistic classifiers in the tree nodes. Examples of such algorithms are hierarchical softmax (HSM), designed for multi-class classification, and probabilistic label trees (PLTs) that generalize HSM to multi-lab…

    Submitted 1 June, 2019; originally announced June 2019.

  12. arXiv:1810.11671  [pdf, ps, other]

    cs.LG stat.ML

    A no-regret generalization of hierarchical softmax to extreme multi-label classification

    Authors: Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, Krzysztof Dembczyński

    Abstract: Extreme multi-label classification (XMLC) is a problem of tagging an instance with a small subset of relevant labels chosen from an extremely large pool of possible labels. Large label spaces can be efficiently handled by organizing labels as a tree, like in the hierarchical softmax (HSM) approach commonly used for multi-class problems. In this paper, we investigate probabilistic label trees (PLTs…

    Submitted 27 October, 2018; originally announced October 2018.

    Comments: Accepted at NIPS 2018

  13. arXiv:1809.02352  [pdf, other]

    stat.ML cs.LG

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Authors: Willem Waegeman, Krzysztof Dembczynski, Eyke Huellermeier

    Abstract: Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot…

    Submitted 7 September, 2018; originally announced September 2018.

  14. Exact and efficient top-K inference for multi-target prediction by querying separable linear relational models

    Authors: Michiel Stock, Krzysztof Dembczynski, Bernard De Baets, Willem Waegeman

    Abstract: Many complex multi-target prediction problems that concern large target spaces are characterised by a need for efficient prediction strategies that avoid the computation of predictions for all targets explicitly. Examples of such problems emerge in several subfields of machine learning, such as collaborative filtering, multi-label classification, dyadic prediction and biological network inference.…

    Submitted 14 June, 2016; originally announced June 2016.

    Journal ref: Data Min Knowl Disc (2016) 30:1370-1394

  15. arXiv:1504.07272  [pdf, other]

    cs.LG

    Surrogate regret bounds for generalized classification performance metrics

    Authors: Wojciech Kotłowski, Krzysztof Dembczyński

    Abstract: We consider optimization of generalized performance metrics for binary classification by means of surrogate losses. We focus on a class of metrics, which are linear-fractional functions of the false positive and false negative rates (examples of which include $F_β$-measure, Jaccard similarity coefficient, AM measure, and many others). Our analysis concerns the following two-step procedure. First,…

    Submitted 7 October, 2016; v1 submitted 27 April, 2015; originally announced April 2015.

    Comments: 22 pages

  16. arXiv:1310.4849  [pdf, other]

    stat.ML cs.LG

    On the Bayes-optimality of F-measure maximizers

    Authors: Willem Waegeman, Krzysztof Dembczynski, Arkadiusz Jachnik, Weiwei Cheng, Eyke Hullermeier

    Abstract: The F-measure, which has originally been introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective,…

    Submitted 6 March, 2015; v1 submitted 17 October, 2013; originally announced October 2013.

    Journal ref: JMLR 15 (2014) 3333-3388

  17. arXiv:1206.6401  [pdf]

    cs.LG stat.ML

    Consistent Multilabel Ranking through Univariate Losses

    Authors: Krzysztof Dembczynski, Wojciech Kotlowski, Eyke Huellermeier

    Abstract: We consider the problem of rank loss minimization in the setting of multilabel classification, which is usually tackled by means of convex surrogate losses defined on pairs of labels. Very recently, this approach was put into question by a negative result showing that commonly used pairwise surrogate losses, such as exponential and logistic losses, are inconsistent. In this paper, we show a positi…

    Submitted 27 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)