Search | arXiv e-print repository

Diversity-aware clustering: Computational Complexity and Approximation Algorithms

Authors: Suhas Thejaswi, Ameet Gadekar, Bruno Ordozgoiti, Aristides Gionis

Abstract: In this work, we study diversity-aware clustering problems where the data points are associated with multiple attributes resulting in intersecting groups. A clustering solution must ensure that the number of chosen cluster centers from each group lies within the range defined by a lower-bound and an upper-bound threshold for that group, while simultaneously minimizing the clustering objective, which can be either $k$-median, $k$-means or $k$-supplier. We study the computational complexity of the proposed problems, offering insights into their NP-hardness, polynomial-time inapproximability, and fixed-parameter intractability. We present parameterized approximation algorithms with approximation ratios $1+\frac{2}{e}+\epsilon \approx 1.736$, $1+\frac{8}{e}+\epsilon \approx 3.943$, and $5$ for diversity-aware $k$-median, diversity-aware $k$-means and diversity-aware $k$-supplier, respectively. Assuming Gap-ETH, the approximation ratios are tight for the diversity-aware $k$-median and diversity-aware $k$-means problems. Our results imply the same approximation factors for their respective fair variants with disjoint groups (fair $k$-median, fair $k$-means, and fair $k$-supplier) with lower-bound requirements.
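The numeric values quoted for the approximation ratios can be checked with a few lines of arithmetic. The sketch below is only a sanity check of the constants in the abstract; the function name `approximation_ratios` and the `eps` parameter are illustrative assumptions, not part of the paper.

```python
import math


def approximation_ratios(eps: float = 0.0) -> dict[str, float]:
    """Evaluate the ratios stated in the abstract for a given epsilon.

    `eps` stands in for the arbitrarily small constant in the abstract
    (hypothetical parameter name, chosen for illustration).
    """
    return {
        "k-median": 1 + 2 / math.e + eps,
        "k-means": 1 + 8 / math.e + eps,
        "k-supplier": 5.0,
    }


ratios = approximation_ratios()
for name, value in ratios.items():
    # With eps = 0 the first two values round to 1.736 and 3.943.
    print(f"diversity-aware {name}: {value:.3f}")
```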

Submitted 20 May, 2025; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: Algorithmic Fairness, Fair Clustering, Diversity-aware Clustering, Intersectionality, Subgroup fairness

arXiv:2306.04489

Fair Column Subset Selection

Authors: Antonis Matakos, Bruno Ordozgoiti, Suhas Thejaswi

Abstract: The problem of column subset selection asks for a subset of columns from an input matrix such that the matrix can be reconstructed as accurately as possible within the span of the selected columns. A natural extension is to consider a setting where the matrix rows are partitioned into two groups, and the goal is to choose a subset of columns that minimizes the maximum reconstruction error of both groups, relative to their respective best rank-k approximation. Extending the known results of column subset selection to this fair setting is not straightforward: in certain scenarios it is unavoidable to choose columns separately for each group, resulting in double the expected column count. We propose a deterministic leverage-score sampling strategy for the fair setting and show that sampling a column subset of minimum size becomes NP-hard in the presence of two groups. Despite these negative results, we give an approximation algorithm that guarantees a solution within 1.5 times the optimal solution size. We also present practical heuristic algorithms based on rank-revealing QR factorization. Finally, we validate our methods through an extensive set of experiments using real-world data.

Submitted 12 August, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: KDD 2024

arXiv:2206.08054

Generalized Leverage Scores: Geometric Interpretation and Applications

Authors: Bruno Ordozgoiti, Antonis Matakos, Aristides Gionis

Abstract: In problems involving matrix computations, the concept of leverage has found a large number of applications. In particular, leverage scores, which relate the columns of a matrix to the subspaces spanned by its leading singular vectors, are helpful in revealing column subsets to approximately factorize a matrix with quality guarantees. As such, they provide a solid foundation for a variety of machine-learning methods. In this paper we extend the definition of leverage scores to relate the columns of a matrix to arbitrary subsets of singular vectors. We establish a precise connection between column and singular-vector subsets, by relating the concepts of leverage scores and principal angles between subspaces. We employ this result to design approximation algorithms with provable guarantees for two well-known problems: generalized column subset selection and sparse canonical correlation analysis. We run numerical experiments to provide further insight on the proposed methods. The novel bounds we derive improve our understanding of fundamental concepts in matrix approximations. In addition, our insights may serve as building blocks for further contributions.

Submitted 16 June, 2022; originally announced June 2022.

Comments: ICML 2022

arXiv:2112.07030

Clustering with fair-center representation: parameterized approximation algorithms and heuristics

Authors: Suhas Thejaswi, Ameet Gadekar, Bruno Ordozgoiti, Michal Osadnik

Abstract: We study a variant of classical clustering formulations in the context of algorithmic fairness, known as diversity-aware clustering. In this variant we are given a collection of facility subsets, and a solution must contain at least a specified number of facilities from each subset while simultaneously minimizing the clustering objective ($k$-median or $k$-means). We investigate the fixed-parameter tractability of these problems and show several negative hardness and inapproximability results, even when we afford exponential running time with respect to some parameters. Motivated by these results we identify natural parameters of the problem, and present fixed-parameter approximation algorithms with approximation ratios $\big(1+\frac{2}{e}+\epsilon\big)$ and $\big(1+\frac{8}{e}+\epsilon\big)$ for diversity-aware $k$-median and diversity-aware $k$-means respectively, and argue that these ratios are essentially tight assuming the gap-exponential time hypothesis. We also present a simple and more practical bicriteria approximation algorithm with better running time bounds. We finally propose efficient and practical heuristics. We evaluate the scalability and effectiveness of our methods in a wide variety of rigorously conducted experiments, on both real and synthetic data.

Submitted 24 October, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

ACM Class: G.2.1; F.2.0; F.1.3

arXiv:2106.11696

Diversity-aware $k$-median: Clustering with fair center representation

Authors: Suhas Thejaswi, Bruno Ordozgoiti, Aristides Gionis

Abstract: We introduce a novel problem for diversity-aware clustering. We assume that the potential cluster centers belong to a set of groups defined by protected attributes, such as ethnicity, gender, etc. We then ask to find a minimum-cost clustering of the data into $k$ clusters so that a specified minimum number of cluster centers are chosen from each group. We thus require that all groups are represented in the clustering solution as cluster centers, according to specified requirements. More precisely, we are given a set of clients $C$, a set of facilities $\pazocal{F}$, a collection $\mathcal{F}=\{F_1,\dots,F_t\}$ of facility groups $F_i \subseteq \pazocal{F}$, budget $k$, and a set of lower-bound thresholds $R=\{r_1,\dots,r_t\}$, one for each group in $\mathcal{F}$. The diversity-aware $k$-median problem asks to find a set $S$ of $k$ facilities in $\pazocal{F}$ such that $|S \cap F_i| \geq r_i$, that is, at least $r_i$ centers in $S$ are from group $F_i$, and the $k$-median cost $\sum_{c \in C} \min_{s \in S} d(c,s)$ is minimized. We show that in the general case where the facility groups may overlap, the diversity-aware $k$-median problem is NP-hard, fixed-parameter intractable, and inapproximable to any multiplicative factor. On the other hand, when the facility groups are disjoint, approximation algorithms can be obtained by reduction to the matroid median and red-blue median problems.
Experimentally, we evaluate our approximation methods for the tractable cases, and present a relaxation-based heuristic for the theoretically intractable case, which can provide high-quality and efficient solutions for real-world datasets.
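To make the problem statement concrete, the following is a minimal brute-force sketch of the diversity-aware $k$-median definition given in the abstract. It is not the authors' algorithm: it enumerates every $k$-subset of facilities, so it only runs on toy instances. All function and variable names are hypothetical, and the one-dimensional instance is invented for illustration.

```python
from itertools import combinations


def kmedian_cost(clients, centers, dist):
    """k-median cost: each client pays the distance to its nearest center."""
    return sum(min(dist(c, s) for s in centers) for c in clients)


def diversity_aware_kmedian(clients, facilities, groups, requirements, k, dist):
    """Exhaustive search over all k-subsets S of the facilities.

    A subset is feasible when |S ∩ F_i| >= r_i for every group F_i.
    Returns (best_set, best_cost), or (None, inf) if nothing is feasible.
    """
    best_set, best_cost = None, float("inf")
    for subset in combinations(facilities, k):
        chosen = set(subset)
        feasible = all(
            len(chosen & group) >= r for group, r in zip(groups, requirements)
        )
        if not feasible:
            continue
        cost = kmedian_cost(clients, subset, dist)
        if cost < best_cost:
            best_set, best_cost = subset, cost
    return best_set, best_cost


# Toy instance on the real line (invented data).
clients = [0, 1, 2, 10, 11, 12, 13]
facilities = [1, 6, 11]
groups = [{1, 11}, {6}]   # facility groups F_1, F_2
requirements = [1, 1]     # lower bounds r_1, r_2


def line_distance(a, b):
    return abs(a - b)


best_set, best_cost = diversity_aware_kmedian(
    clients, facilities, groups, requirements, k=2, dist=line_distance
)
print(best_set, best_cost)
```

On this instance the unconstrained optimum would pick facilities 1 and 11, but that choice contains no facility from the second group, so the constrained search has to settle for a costlier feasible pair.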

Submitted 24 October, 2022; v1 submitted 22 June, 2021; originally announced June 2021.

Comments: To appear in ECML-PKDD 2021

arXiv:2006.13567

Off-the-grid: Fast and Effective Hyperparameter Search for Kernel Clustering

Authors: Bruno Ordozgoiti, Lluís A. Belanche Muñoz

Abstract: Kernel functions are a powerful tool to enhance the $k$-means clustering algorithm via the kernel trick. It is known that the parameters of the chosen kernel function can have a dramatic impact on the result. In supervised settings, these can be tuned via cross-validation, but for clustering this is not straightforward and heuristics are usually employed. In this paper we study the impact of kernel parameters on kernel $k$-means. In particular, we derive a lower bound, tight up to constant factors, below which the parameter of the RBF kernel will render kernel $k$-means meaningless. We argue that grid search can be ineffective for hyperparameter search in this context and propose an alternative algorithm for this purpose. In addition, we offer an efficient implementation based on fast approximate exponentiation with provable quality guarantees. Our experimental results demonstrate the ability of our method to efficiently reveal a rich and useful set of hyperparameter values.

Submitted 24 June, 2020; originally announced June 2020.

Comments: ECML-PKDD 2020

arXiv:2002.00775

Finding large balanced subgraphs in signed networks

Authors: Bruno Ordozgoiti, Antonis Matakos, Aristides Gionis

Abstract: Signed networks are graphs whose edges are labelled with either a positive or a negative sign, and can be used to capture nuances in interactions that are missed by their unsigned counterparts. The concept of balance in signed graph theory determines whether a network can be partitioned into two perfectly opposing subsets, and is therefore useful for modelling phenomena such as the existence of polarized communities in social networks. While determining whether a graph is balanced is easy, finding a large balanced subgraph is hard. The few heuristics available in the literature for this purpose are either ineffective or non-scalable. In this paper we propose an efficient algorithm for finding large balanced subgraphs in signed networks. The algorithm relies on signed spectral theory and a novel bound for perturbations of the graph Laplacian. In a wide variety of experiments on real-world data we show that our algorithm can find balanced subgraphs much larger than those detected by existing methods, and in addition, it is faster. We test its scalability on graphs of up to 34 million edges.
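The remark that checking balance is easy can be illustrated with a standard breadth-first two-coloring: a signed graph is balanced exactly when its vertices can be split into two sides so that positive edges stay within a side and negative edges cross sides. The sketch below shows only that textbook check, not the paper's algorithm for finding large balanced subgraphs; the function name and the edge format `(u, v, sign)` are assumptions made for illustration.

```python
from collections import deque


def is_balanced(num_vertices, signed_edges):
    """Return True if the signed graph is balanced.

    `signed_edges` is a list of (u, v, sign) with sign +1 or -1
    (hypothetical input format). Runs in time linear in the graph size.
    """
    adjacency = [[] for _ in range(num_vertices)]
    for u, v, sign in signed_edges:
        adjacency[u].append((v, sign))
        adjacency[v].append((u, sign))

    side = [None] * num_vertices  # side[v] is 0 or 1 once v is assigned
    for start in range(num_vertices):
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                # Positive edge: same side. Negative edge: opposite side.
                expected = side[u] if sign > 0 else 1 - side[u]
                if side[v] is None:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False  # found a cycle with an odd number of negative edges
    return True


# Triangle with two negative edges: vertices {0, 1} oppose vertex {2}.
balanced_triangle = [(0, 1, +1), (0, 2, -1), (1, 2, -1)]
# Triangle with a single negative edge cannot be split consistently.
unbalanced_triangle = [(0, 1, +1), (1, 2, +1), (0, 2, -1)]

print(is_balanced(3, balanced_triangle))
print(is_balanced(3, unbalanced_triangle))
```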

Submitted 3 February, 2020; originally announced February 2020.

Comments: 11 pages, 6 figures, The Web Conference 2020

arXiv:2001.09410

doi 10.1145/3366423.3380121

Searching for polarization in signed graphs: a local spectral approach

Authors: Han Xiao, Bruno Ordozgoiti, Aristides Gionis

Abstract: Signed graphs have been used to model interactions in social networks, which can be either positive (friendly) or negative (antagonistic). The model has been used to study polarization and other related phenomena in social networks, which can be harmful to the process of democratic deliberation in our society. An interesting and challenging task in this application domain is to detect polarized communities in signed graphs. A number of different methods have been proposed for this task. However, existing approaches aim at finding globally optimal solutions. Instead, in this paper we are interested in finding polarized communities that are related to a small set of seed nodes provided as input. Seed nodes may consist of two sets, which constitute the two sides of a polarized structure. In this paper we formulate the problem of finding local polarized communities in signed graphs as a locally-biased eigen-problem. By viewing the eigenvector associated with the smallest eigenvalue of the Laplacian matrix as the solution of a constrained optimization problem, we are able to incorporate the local information as an additional constraint. In addition, we show that the locally-biased vector can be used to find communities with an approximation guarantee with respect to a local analogue of the Cheeger constant on signed graphs. By exploiting the sparsity in the input graph, an indicator vector for the polarized communities can be found in time linear in the graph size.
Our experiments on real-world networks validate the proposed algorithm and demonstrate its usefulness in finding local structures in this semi-supervised manner.

Submitted 26 January, 2020; originally announced January 2020.

Comments: 11 pages, 6 figures, accepted by WWW 2020, April 20-24, 2020, Taipei, Taiwan

arXiv:1910.02438

Discovering Polarized Communities in Signed Networks

Authors: Francesco Bonchi, Edoardo Galimberti, Aristides Gionis, Bruno Ordozgoiti, Giancarlo Ruffo

Abstract: Signed networks contain edge annotations to indicate whether each interaction is friendly (positive edge) or antagonistic (negative edge). The model is simple but powerful and it can capture novel and interesting structural properties of real-world phenomena. The analysis of signed networks has many applications from modeling discussions in social media, to mining user reviews, and to recommending products in e-commerce sites. In this paper we consider the problem of discovering polarized communities in signed networks. In particular, we search for two communities (subsets of the network vertices) where within communities there are mostly positive edges while across communities there are mostly negative edges. We formulate this novel problem as a "discrete eigenvector" problem, which we show to be NP-hard. We then develop two intuitive spectral algorithms: one deterministic, and one randomized with quality guarantee $\sqrt{n}$ (where $n$ is the number of vertices in the graph), tight up to constant factors. We validate our algorithms against non-trivial baselines on real-world signed networks. Our experiments confirm that our algorithms produce higher quality solutions, are much faster and can scale to much larger networks than the baselines, and are able to detect ground-truth polarized communities.

Submitted 6 October, 2019; originally announced October 2019.

Journal ref: CIKM 2019, November 3-7, 2019, Beijing, China

arXiv:1902.10419

doi 10.1145/3308558.3313475

Reconciliation k-median: Clustering with Non-Polarized Representatives

Authors: Bruno Ordozgoiti, Aristides Gionis

Abstract: We propose a new variant of the k-median problem, where the objective function models not only the cost of assigning data points to cluster representatives, but also a penalty term for disagreement among the representatives. We motivate this novel problem by applications where we are interested in clustering data while avoiding selecting representatives that are too far from each other. For example, we may want to summarize a set of news sources, but avoid selecting ideologically-extreme articles in order to reduce polarization. To solve the proposed k-median formulation we adopt the local-search algorithm of Arya et al. We show that the algorithm provides a provable approximation guarantee, which becomes constant under an assumption on the minimum number of points for each cluster. We experimentally evaluate our problem formulation and proposed algorithm on datasets inspired by the motivating applications. In particular, we experiment with data extracted from Twitter, the US Congress voting records, and popular news sources. The results show that our objective can lead to choosing less polarized groups of representatives without significant loss in representation fidelity.

Submitted 28 July, 2021; v1 submitted 27 February, 2019; originally announced February 2019.

Comments: The Web Conference 2019

arXiv:1804.04421

Regularized Greedy Column Subset Selection

Authors: Bruno Ordozgoiti, Alberto Mozo, Jesús García López de Lacalle

Abstract: The Column Subset Selection Problem provides a natural framework for unsupervised feature selection. Despite being a hard combinatorial optimization problem, there exist efficient algorithms that provide good approximations. The drawback of the problem formulation is that it incorporates no form of regularization, and is therefore very sensitive to noise when presented with scarce data. In this paper we propose a regularized formulation of this problem, and derive a correct greedy algorithm that is similar in efficiency to existing greedy methods for the unregularized problem. We study its adequacy for feature selection and propose suitable formulations. Additionally, we derive a lower bound for the error of the proposed problems. Through various numerical experiments on real and synthetic data, we demonstrate the significantly increased robustness and stability of our method, as well as the improved conditioning of its output, all while remaining efficient for practical use.

Submitted 12 April, 2018; originally announced April 2018.

arXiv:1610.07419

Using Machine Learning to Detect Noisy Neighbors in 5G Networks

Authors: Udi Margolin, Alberto Mozo, Bruno Ordozgoiti, Danny Raz, Elisha Rosensweig, Itai Segall

Abstract: 5G networks are expected to be more dynamic and chaotic in their structure than current networks. With the advent of Network Function Virtualization (NFV), Network Functions (NF) will no longer be tightly coupled with the hardware they are running on, which poses new challenges in network management. Noisy neighbor is a term commonly used to describe situations in NFV infrastructure where an application experiences degradation in performance due to the fact that some of the resources it needs are occupied by other applications in the same cloud node. These situations cannot be easily identified using straightforward approaches, which calls for the use of sophisticated methods for NFV infrastructure management. In this paper we demonstrate how Machine Learning (ML) techniques can be used to identify such events. Through experiments using data collected at real NFV infrastructure, we show that standard models for automated classification can detect the noisy neighbor phenomenon with an accuracy of more than 90% in a simple scenario.

Submitted 24 October, 2016; originally announced October 2016.

Showing 1–12 of 12 results for author: Ordozgoiti, B