-
On Finding Randomly Planted Cliques in Arbitrary Graphs
Authors:
Francesco Agrimonti,
Marco Bressan,
Tommaso d'Orsi
Abstract:
We study a planted clique model introduced by Feige where a complete graph of size $c\cdot n$ is planted uniformly at random in an arbitrary $n$-vertex graph. We give a simple deterministic algorithm that, in almost linear time, recovers a clique of size $(c/3)^{O(1/c)} \cdot n$ as long as the original graph has maximum degree at most $(1-p)n$ for some fixed $p>0$. The proof hinges on showing that the degrees of the final graph are correlated with the planted clique, in a way similar to (but more intricate than) the classical $G(n,\frac{1}{2})+K_{\sqrt{n}}$ planted clique model. Our algorithm suggests a separation from the worst-case model, where, assuming the Unique Games Conjecture, no polynomial algorithm can find cliques of size $Ω(n)$ for every fixed $c>0$, even if the input graph has maximum degree $(1-p)n$. Our techniques extend beyond the planted clique model. For example, when the planted graph is a balanced biclique, we recover a balanced biclique of size larger than the best guarantees known for the worst case.
Submitted 10 May, 2025;
originally announced May 2025.
-
Of Dice and Games: A Theory of Generalized Boosting
Authors:
Marco Bressan,
Nataly Brukhim,
Nicolò Cesa-Bianchi,
Emmanuel Esposito,
Yishay Mansour,
Shay Moran,
Maximilian Thiessen
Abstract:
Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.
Submitted 10 December, 2024;
originally announced December 2024.
-
A Theory of Interpretable Approximations
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Emmanuel Esposito,
Yishay Mansour,
Shay Moran,
Maximilian Thiessen
Abstract:
Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are *interpretable* by humans. In this work we study such questions by introducing *interpretable approximations*, a notion that captures the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$. In particular, we consider the approximation of a binary concept $c$ by decision trees based on a simple class $\mathcal{H}$ (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of $\mathcal{H}$ and $c$, exactly one of these cases holds: (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy; (ii) $c$ can be approximated by $\mathcal{H}$ with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant $κ$ that depends only on $\mathcal{H}$ and $c$ such that, for *any* data distribution and *any* desired accuracy level, $c$ can be approximated by $\mathcal{H}$ with a complexity not exceeding $κ$. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. We extend our trichotomy to classes $\mathcal{H}$ of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by $\mathcal{H}$.
Submitted 15 June, 2024;
originally announced June 2024.
-
Efficient Algorithms for Learning Monophonic Halfspaces in Graphs
Authors:
Marco Bressan,
Emmanuel Esposito,
Maximilian Thiessen
Abstract:
We study the problem of learning a binary classifier on the vertices of a graph. In particular, we consider classifiers given by monophonic halfspaces, partitions of the vertices that are convex in a certain abstract sense. Monophonic halfspaces, and related notions such as geodesic halfspaces, have recently attracted interest, and several connections have been drawn between their properties (e.g., their VC dimension) and the structure of the underlying graph $G$. We prove several novel results for learning monophonic halfspaces in the supervised, online, and active settings. Our main result is that a monophonic halfspace can be learned with near-optimal passive sample complexity in time polynomial in $n = |V(G)|$. This requires us to devise a polynomial-time algorithm for consistent hypothesis checking, based on several structural insights on monophonic halfspaces and on a reduction to $2$-satisfiability. We prove similar results for the online and active settings. We also show that the concept class can be enumerated with delay $\operatorname{poly}(n)$, and that empirical risk minimization can be performed in time $2^{ω(G)}\operatorname{poly}(n)$ where $ω(G)$ is the clique number of $G$. These results answer open questions from the literature (González et al., 2020), and show a contrast with geodesic halfspaces, for which some of the said problems are NP-hard (Seiffarth et al., 2023).
Submitted 17 June, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Fully-Dynamic Approximate Decision Trees With Worst-Case Update Time Guarantees
Authors:
Marco Bressan,
Mauro Sozio
Abstract:
We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $α$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{α^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $ε$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $ε> 0$, maintains an $ε$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}ε\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
Submitted 10 February, 2023; v1 submitted 8 February, 2023;
originally announced February 2023.
-
Fully-Dynamic Decision Trees
Authors:
Marco Bressan,
Gabriel Damay,
Mauro Sozio
Abstract:
We develop the first fully dynamic algorithm that maintains a decision tree over an arbitrary sequence of insertions and deletions of labeled examples. Given $ε > 0$, our algorithm guarantees that, at every point in time, every node of the decision tree uses a split with Gini gain within an additive $ε$ of the optimum. For real-valued features the algorithm has an amortized running time per insertion/deletion of $O\big(\frac{d \log^3 n}{ε^2}\big)$, which improves to $O\big(\frac{d \log^2 n}{ε}\big)$ for binary or categorical features, while it uses space $O(n d)$, where $n$ is the maximum number of examples at any point in time and $d$ is the number of features. Our algorithm is nearly optimal, as we show that any algorithm with similar guarantees uses amortized running time $Ω(d)$ and space $\tilde{Ω}(n d)$. We complement our theoretical results with an extensive experimental evaluation on real-world data, showing the effectiveness of our algorithm.
Submitted 1 December, 2022;
originally announced December 2022.
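As an illustration of the quantity that the abstract above bounds, here is a minimal sketch of the Gini gain of a binary split and of the "within an additive $ε$ of the optimum" check. It is a static, from-scratch computation and not the paper's dynamic algorithm; the function names `gini`, `gini_gain`, and `is_eps_optimal` are hypothetical and chosen only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels: 1 - sum_c p_c^2 (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(labels, left_mask):
    """Impurity reduction obtained by splitting `labels` according to `left_mask`.

    left_mask[i] is True if example i goes to the left child, False otherwise.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def is_eps_optimal(gain, best_gain, eps):
    """True if `gain` is within an additive `eps` of `best_gain`."""
    return gain >= best_gain - eps
```

For example, on labels `[0, 0, 1, 1]` a split that separates the two classes has gain 0.5, while a split that puts one example of each class on either side has gain 0.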
-
The Complexity of Pattern Counting in Directed Graphs, Parameterised by the Outdegree
Authors:
Marco Bressan,
Matthias Lanzinger,
Marc Roth
Abstract:
We study the fixed-parameter tractability of the following fundamental problem: given two directed graphs $\vec H$ and $\vec G$, count the number of copies of $\vec H$ in $\vec G$. The standard setting, where the tractability is well understood, uses only $|\vec H|$ as a parameter. In this paper we take a step forward, and adopt as a parameter $|\vec H|+d(\vec G)$, where $d(\vec G)$ is the maximum outdegree of $\vec G$. Under this parameterization, we completely characterize the fixed-parameter tractability of the problem in both its non-induced and induced versions through two novel structural parameters, the fractional cover number $ρ^*$ and the source number $α_s$. On the one hand we give algorithms with running time $f(|\vec H|,d(\vec G)) \cdot |\vec G|^{ρ^*\!(\vec H)+O(1)}$ and $f(|\vec H|,d(\vec G)) \cdot |\vec G|^{α_s(\vec H)+O(1)}$ for counting respectively the copies and induced copies of $\vec H$ in $\vec G$; on the other hand we show that, unless the Exponential Time Hypothesis fails, for any class $\vec C$ of directed graphs the (induced) counting problem is fixed-parameter tractable if and only if $ρ^*(\vec C)$ ($α_s(\vec C)$) is bounded. These results explain how the orientation of the pattern can make counting easy or hard, and prove that a classic algorithm by Chiba and Nishizeki and its extensions (Chiba, Nishizeki SICOMP 85; Bressan Algorithmica 21) are optimal unless ETH fails.
Submitted 3 November, 2022;
originally announced November 2022.
-
Active Learning of Classifiers with Label and Seed Queries
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Silvio Lattanzi,
Andrea Paudice,
Maximilian Thiessen
Abstract:
We study exact active learning of binary and multiclass classifiers with margin. Given an $n$-point set $X \subset \mathbb{R}^m$, we want to learn any unknown classifier on $X$ whose classes have finite strong convex hull margin, a new notion extending the SVM margin. In the standard active learning setting, where only label queries are allowed, learning a classifier with strong convex hull margin $γ$ requires in the worst case $Ω\big(1+\frac{1}γ\big)^{(m-1)/2}$ queries. On the other hand, using the more powerful seed queries (a variant of equivalence queries), the target classifier could be learned in $O(m \log n)$ queries via Littlestone's Halving algorithm; however, Halving is computationally inefficient. In this work we show that, by carefully combining the two types of queries, a binary classifier can be learned in time $\operatorname{poly}(n+m)$ using only $O(m^2 \log n)$ label queries and $O\big(m \log \frac{m}γ\big)$ seed queries; the result extends to $k$-class classifiers at the price of a $k!k^2$ multiplicative overhead. Similar results hold when the input points have bounded bit complexity, or when only one class has strong convex hull margin against the rest. We complement the upper bounds by showing that in the worst case any algorithm needs $Ω\big(k m \log \frac{1}γ\big)$ seed and label queries to learn a $k$-class classifier with strong convex hull margin $γ$.
Submitted 8 September, 2022;
originally announced September 2022.
-
Counting Subgraphs in Somewhere Dense Graphs
Authors:
Marco Bressan,
Leslie Ann Goldberg,
Kitty Meeks,
Marc Roth
Abstract:
We study the problems of counting copies and induced copies of a small pattern graph $H$ in a large host graph $G$. Recent work fully classified the complexity of those problems according to structural restrictions on the patterns $H$. In this work, we address the more challenging task of analysing the complexity for restricted patterns and restricted hosts. Specifically we ask which families of allowed patterns and hosts imply fixed-parameter tractability, i.e., the existence of an algorithm running in time $f(H)\cdot |G|^{O(1)}$ for some computable function $f$. Our main results present exhaustive and explicit complexity classifications for families that satisfy natural closure properties. Among others, we identify the problems of counting small matchings and independent sets in subgraph-closed graph classes $\mathcal{G}$ as our central objects of study and establish the following crisp dichotomies as consequences of the Exponential Time Hypothesis: (1) Counting $k$-matchings in a graph $G\in\mathcal{G}$ is fixed-parameter tractable if and only if $\mathcal{G}$ is nowhere dense. (2) Counting $k$-independent sets in a graph $G\in\mathcal{G}$ is fixed-parameter tractable if and only if $\mathcal{G}$ is nowhere dense. Moreover, we obtain almost tight conditional lower bounds if $\mathcal{G}$ is somewhere dense, i.e., not nowhere dense. These base cases of our classifications subsume a wide variety of previous results on the matching and independent set problem, such as counting $k$-matchings in bipartite graphs (Curticapean, Marx; FOCS 14), in $F$-colourable graphs (Roth, Wellnitz; SODA 20), and in degenerate graphs (Bressan, Roth; FOCS 21), as well as counting $k$-independent sets in bipartite graphs (Curticapean et al.; Algorithmica 19).
Submitted 12 April, 2024; v1 submitted 7 September, 2022;
originally announced September 2022.
-
Complex Network-Based Approach for Feature Extraction and Classification of Musical Genres
Authors:
Matheus Henrique Pimenta-Zanon,
Glaucia Maria Bressan,
Fabrício Martins Lopes
Abstract:
Musical genre classification has been a relevant research topic. The association between music and genres is fundamental for the media industry, which manages musical recommendation systems, and for music streaming services, whose catalogs may be organized by genre. In this context, this work presents a feature extraction method for the automatic classification of musical genres, based on complex networks and their topological measurements. The proposed method first converts each piece of music into a sequence of musical notes and then maps the sequences to complex networks. Topological measurements are extracted to characterize the network topology, and these compose a feature vector that is used for the classification of musical genres. The method was evaluated in the classification of 10 musical genres by adopting the GTZAN dataset and 8 musical genres by adopting the FMA dataset. The results were compared with methods in the literature. The proposed method outperformed all compared methods by presenting high accuracy and low standard deviation, showing its suitability for musical genre classification, which contributes to the media industry through accurate and robust automatic classification. The proposed method is implemented as open-source software in Python and is freely available at https://github.com/omatheuspimenta/examinner.
Submitted 9 October, 2021;
originally announced October 2021.
-
On Margin-Based Cluster Recovery with Oracle Queries
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Silvio Lattanzi,
Andrea Paudice
Abstract:
We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters, in polynomial time, and with a number of queries that is lower than the best existing algorithm by $Θ(m^m)$ factors. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability.
Submitted 9 June, 2021;
originally announced June 2021.
-
Exact and Approximate Pattern Counting in Degenerate Graphs: New Algorithms, Hardness Results, and Complexity Dichotomies
Authors:
Marco Bressan,
Marc Roth
Abstract:
We study the problems of counting the homomorphisms, counting the copies, and counting the induced copies of a $k$-vertex graph $H$ in a $d$-degenerate $n$-vertex graph $G$. Our main result establishes exhaustive and explicit complexity classifications for counting subgraphs and induced subgraphs. We show that the (not necessarily induced) copies of $H$ in $G$ can be counted in time $f(k,d)\cdot n^{\max(\mathsf{imn}(H),1)}\cdot \log n$, where $f$ is some computable function and $\mathsf{imn}(H)$ is the size of the largest induced matching of $H$. Whenever the class of allowed patterns has unbounded induced matching number, this algorithm is essentially optimal: Unless the Exponential Time Hypothesis (ETH) fails, there is no algorithm running in time $f(k,d)\cdot n^{o(\mathsf{imn}(H)/\log \mathsf{imn}(H))}$ for any function $f$. In case of counting induced subgraphs, we obtain a similar classification along the independence number $α$: we can count the induced copies of $H$ in $G$ in time $f(k,d)\cdot n^{α(H)}\cdot \log n$, and if the class of allowed patterns has unbounded independence number, an algorithm running in time $f(k,d)\cdot n^{o(α(H)/\log α(H))}$ is impossible, unless ETH fails. In the language of parameterized complexity, our results yield dichotomies in fixed-parameter tractable and $\#\mathsf{W}[1]$-hard cases if we parameterize by the size of the pattern and the degeneracy of the host graph. Our results imply that several patterns cannot be counted in time $f(k,d)\cdot n^{o(k/\log k)}$, including $k$-matchings, $k$-independent sets, (induced) $k$-paths, (induced) $k$-cycles, and induced $(k,k)$-bicliques, unless ETH fails. Those lower bounds for exact counting are complemented with new algorithms for approximate counting of subgraphs and induced subgraphs in degenerate graphs.
Submitted 1 June, 2021; v1 submitted 9 March, 2021;
originally announced March 2021.
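The notion of a $d$-degenerate host graph used in the abstract above can be made concrete with a short sketch. The code below computes a degeneracy ordering by repeatedly removing a minimum-degree vertex, then orients every edge from the earlier to the later vertex in that order, so that each vertex has at most $d$ out-neighbours. This is the standard textbook construction, not code from the paper; the function names `degeneracy_ordering` and `orient_by_order` and the adjacency-dict input format are assumptions made for this example.

```python
import heapq


def degeneracy_ordering(adj):
    """Return (order, degeneracy) for an undirected simple graph.

    adj maps each vertex to the set of its neighbours. Vertices must be
    mutually comparable (e.g. all ints), since they are stored in a heap.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    order = []
    degeneracy = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry left over from an earlier degree update
        removed.add(v)
        order.append(v)
        degeneracy = max(degeneracy, d)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order, degeneracy


def orient_by_order(adj, order):
    """Orient each edge towards the endpoint that appears later in `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return {v: {u for u in adj[v] if pos[u] > pos[v]} for v in adj}
```

For instance, a triangle with one pendant vertex attached has degeneracy 2, and after orientation no vertex has more than 2 out-neighbours.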
-
Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Silvio Lattanzi,
Andrea Paudice
Abstract:
We investigate the problem of exact cluster recovery using oracle queries. Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study this problem in the more challenging non-convex setting. We introduce a structural characterization of clusters, called $(β,γ)$-convexity, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed). Using $(β,γ)$-convexity, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $\mathbb{R}^d$) into a graph-theoretic notion of convexity. By exploiting this convexity notion, we design a deterministic algorithm that recovers $(β,γ)$-convex clusters using $O(k^2 \log n + k^2 (6/βγ)^{dens(X)})$ same-cluster queries, where $k$ is the number of clusters and $dens(X)$ is the density dimension of the semimetric. We show that an exponential dependence on the density dimension is necessary, and we also show that, if we are allowed to make $O(k^2 + k\log n)$ additional queries to a "cluster separation" oracle, then we can recover clusters that have different and arbitrary scales, even when the scale of each cluster is unknown.
Submitted 13 July, 2021; v1 submitted 31 January, 2021;
originally announced February 2021.
-
Faster motif counting via succinct color coding and adaptive sampling
Authors:
Marco Bressan,
Stefano Leucci,
Alessandro Panconesi
Abstract:
We address the problem of computing the distribution of induced connected subgraphs, aka *graphlets* or *motifs*, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling, by leveraging the color coding technique by Alon, Yuster and Zwick. In this work we extend the applicability of this approach, by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For $8$-node motifs, we can build such a table in one hour for a graph with $65$M nodes and $1.8$B edges, which is $2000$ times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the "additive error barrier" of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly $10{,}000$ distinct $8$-node motifs whose relative frequency is so small that uniform sampling would literally take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
Submitted 17 July, 2021; v1 submitted 4 September, 2020;
originally announced September 2020.
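The color coding technique of Alon, Yuster and Zwick mentioned in the abstract above can be sketched on a simpler pattern. The code below counts simple paths on $k$ vertices rather than graphlets: it colors vertices uniformly at random with $k$ colors, counts the "colorful" paths (all colors distinct) by dynamic programming over color subsets, and rescales by the probability $k!/k^k$ that a fixed path becomes colorful. It is a textbook-style illustration only, with none of the succinct data structures or adaptive sampling described in the paper; the function names and the adjacency-dict input are assumptions made for this example.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count simple paths on k vertices whose vertices all have distinct colors.

    adj maps each vertex to the set of its neighbours; colors maps each vertex
    to an integer in range(k). Each undirected path is counted once.
    """
    # table[v][mask] = number of colorful vertex sequences ending at v
    # whose set of colors is encoded by the bitmask `mask`.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for u in adj:
            for mask, cnt in table[u].items():
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if mask & bit:
                        continue  # color already used, extension is not colorful
                    nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    total = sum(cnt for v in adj for cnt in table[v].values())
    # For k > 1 every undirected path was counted once from each endpoint.
    return total // 2 if k > 1 else total


def estimate_k_paths(adj, k, trials, seed=0):
    """Unbiased estimate of the number of k-vertex simple paths via color coding."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k ** k  # chance that a fixed k-path is colorful
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)
```

Colorfulness is what makes the dynamic program sound: a sequence with distinct colors cannot revisit a vertex, so the table only needs to remember the set of colors used, not the set of vertices.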
-
Efficient and near-optimal algorithms for sampling small connected subgraphs
Authors:
Marco Bressan
Abstract:
We study the following problem: given an integer $k \ge 3$ and a simple graph $G$, sample a connected induced $k$-node subgraph of $G$ uniformly at random. This is a fundamental graph mining primitive with applications in social network analysis, bioinformatics, and more. Surprisingly, no efficient algorithm is known for uniform sampling; the only somewhat efficient algorithms available yield samples that are only approximately uniform, with running times that are unclear or suboptimal. In this work we provide: (i) a near-optimal mixing time bound for a well-known random walk technique, (ii) the first efficient algorithm for truly uniform graphlet sampling, and (iii) the first sublinear-time algorithm for $ε$-uniform graphlet sampling.
Submitted 28 October, 2021; v1 submitted 23 July, 2020;
originally announced July 2020.
-
Exact Recovery of Mangled Clusters with Same-Cluster Queries
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Silvio Lattanzi,
Andrea Paudice
Abstract:
We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we relax the spherical $k$-means cluster assumption of Ashtiani et al. to allow for arbitrary ellipsoidal clusters with margin. This removes the assumption that the clustering is center-based (i.e., defined through an optimization problem), and includes all those cases where spherical clusters are individually transformed by any combination of rotations, axis scalings, and point deletions. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points. More precisely, we design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error. The $O(\cdot)$ notation hides an exponential dependence on the dimensionality of the clusters, which we show to be necessary, thus characterizing the query complexity of the problem. Our algorithm is simple, easy to implement, and can also learn the clusters using low-stretch separators, a class of ellipsoids with additional theoretical guarantees. Experiments on large synthetic datasets confirm that we can reconstruct clusterings exactly and efficiently.
Submitted 30 October, 2020; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Motivo: fast motif counting via succinct color coding and adaptive sampling
Authors:
Marco Bressan,
Stefano Leucci,
Alessandro Panconesi
Abstract:
The randomized technique of color coding is behind state-of-the-art algorithms for estimating graph motif counts. Those algorithms, however, are not yet capable of scaling well to very large graphs with billions of edges. In this paper we develop novel tools for the `motif counting via color coding' framework. As a result, our new algorithm, Motivo, is able to scale well to larger graphs while at the same time providing more accurate graphlet counts than ever before. This is achieved thanks to two types of improvements. First, we design new succinct data structures that support fast common color coding operations, and a biased coloring trick that trades accuracy versus running time and memory usage. These adaptations drastically reduce the time and memory requirements of color coding. Second, we develop an adaptive graphlet sampling strategy, based on a fractional set cover problem, that breaks the additive approximation barrier of standard sampling. This strategy gives multiplicative approximations for all graphlets at once, allowing us to count not only the most frequent graphlets but also extremely rare ones.
To give an idea of the improvements, in $40$ minutes Motivo counts $7$-node motifs on a graph with $65$M nodes and $1.8$B edges; this is $30$ and $500$ times larger than the state of the art, respectively in terms of nodes and edges. On the accuracy side, in one hour Motivo produces accurate counts of $\approx \! 10{,}000$ distinct $8$-node motifs on graphs where state-of-the-art algorithms fail even to find the second most frequent motif. Our method requires just a high-end desktop machine. These results show how color coding can bring motif mining to the realm of truly massive graphs using only ordinary hardware.
Submitted 4 June, 2019;
originally announced June 2019.
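To make the color coding idea named in the Motivo abstract more concrete, here is a minimal sketch of the textbook technique applied to the simplest motif, a path on $k$ nodes. It is not Motivo itself: the plain dictionaries stand in for the paper's succinct data structures, the coloring is uniform rather than biased, and the function names `count_colorful_paths` and `estimate_paths` are hypothetical. The only facts relied on are the standard ones: a dynamic program over (node, color set) pairs counts colorful paths, and a fixed $k$-node path becomes colorful with probability $k!/k^k$ under a uniform random $k$-coloring.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count k-node paths (k >= 2) whose nodes all have distinct colors.

    adj:    dict mapping each node to a list of its neighbors (undirected graph)
    colors: dict mapping each node to a color in range(k)
    """
    # table[v][S] = number of colorful paths ending at v whose color set is
    # the bitmask S. Initially every node is a 1-node path.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for v in adj:
            bit = 1 << colors[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    # Extend a path ending at u only if v's color is unused.
                    if not mask & bit:
                        nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    total = sum(cnt for v in adj for cnt in table[v].values())
    return total // 2  # each undirected path was counted once per endpoint


def estimate_paths(adj, k, trials, rng=random):
    """Estimate the number of k-node paths by averaging over random colorings."""
    p_colorful = factorial(k) / k ** k  # chance that a fixed path is colorful
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)
```

The dynamic program never revisits a node, because distinct colors imply distinct nodes; this is what lets color coding replace a search over node subsets with a search over the $2^k$ color subsets.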
-
Correlation Clustering with Adaptive Similarity Queries
Authors:
Marco Bressan,
Nicolò Cesa-Bianchi,
Andrea Paudice,
Fabio Vitale
Abstract:
In correlation clustering, we are given $n$ objects together with a binary similarity score between each pair of them. The goal is to partition the objects into clusters so as to minimise the disagreements with the scores. In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries. On the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets. On the other hand, we prove information-theoretical bounds on the number of queries necessary to guarantee a prescribed disagreement bound. These results give a rich characterization of the trade-off between queries and clustering error.
Submitted 14 January, 2020; v1 submitted 28 May, 2019;
originally announced May 2019.
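The query-based setting in the abstract above can be illustrated with the classical random-pivot scheme for correlation clustering, in which similarity scores are queried only between the current pivot and the remaining objects. This is a sketch under assumptions: the abstract does not specify the algorithms, so the pivot scheme is a stand-in, and `pivot_clustering` and `similar` are hypothetical names.

```python
import random


def pivot_clustering(nodes, similar, rng=random):
    """Cluster `nodes` by repeatedly picking a random pivot.

    similar(u, v) is the query oracle: it returns True when the binary
    similarity score of the pair (u, v) is positive. Returns the list of
    clusters and the number of oracle queries made.
    """
    remaining = list(nodes)
    clusters = []
    queries = 0
    while remaining:
        # Pick a pivot uniformly at random among the unclustered objects.
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster, rest = [pivot], []
        for v in remaining:
            queries += 1  # one query per (pivot, v) pair
            (cluster if similar(pivot, v) else rest).append(v)
        clusters.append(cluster)
        remaining = rest
    return clusters, queries
```

When the scores are consistent with some ground-truth partition, this scheme recovers that partition exactly and queries far fewer than all $\binom{n}{2}$ pairs if the clusters are few; with noisy scores it only approximates the minimum number of disagreements.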
-
Faster algorithms for counting subgraphs in sparse graphs
Authors:
Marco Bressan
Abstract:
Given a $k$-node pattern graph $H$ and an $n$-node host graph $G$, the subgraph counting problem asks to compute the number of copies of $H$ in $G$. In this work we address the following question: can we count the copies of $H$ faster if $G$ is sparse? We answer in the affirmative by introducing a novel tree-like decomposition for directed acyclic graphs, inspired by the classic tree decomposition for undirected graphs. This decomposition gives a dynamic program for counting the homomorphisms of $H$ in $G$ by exploiting the degeneracy of $G$, which allows us to beat the state-of-the-art subgraph counting algorithms when $G$ is sparse enough. For example, we can count the induced copies of any $k$-node pattern $H$ in time $2^{O(k^2)} O(n^{0.25k + 2} \log n)$ if $G$ has bounded degeneracy, and in time $2^{O(k^2)} O(n^{0.625k + 1} \log n)$ if $G$ has bounded average degree. These bounds are instantiations of a more general result, parameterized by the degeneracy of $G$ and the structure of $H$, which generalizes classic bounds on counting cliques and complete bipartite graphs. We also give lower bounds based on the Exponential Time Hypothesis, showing that our results are actually a characterization of the complexity of subgraph counting in bounded-degeneracy graphs.
Submitted 30 August, 2020; v1 submitted 5 May, 2018;
originally announced May 2018.
-
On approximating the stationary distribution of time-reversible Markov chains
Authors:
Marco Bressan,
Enoch Peserico,
Luca Pretto
Abstract:
Approximating the stationary probability of a state in a Markov chain through Markov chain Monte Carlo techniques is, in general, inefficient. Standard random walk approaches require $\tilde{O}(τ/π(v))$ operations to approximate the probability $π(v)$ of a state $v$ in a chain with mixing time $τ$, and even the best available techniques still have complexity $\tilde{O}(τ^{1.5}/π(v)^{0.5})$. Since these complexities depend inversely on $π(v)$, they can grow beyond any bound in the size of the chain or in its mixing time. In this paper we show that, for time-reversible Markov chains, there exists a simple randomized approximation algorithm that breaks this "small-$π(v)$ barrier".
Submitted 30 December, 2017;
originally announced January 2018.
-
The Limits of Popularity-Based Recommendations, and the Role of Social Ties
Authors:
Marco Bressan,
Stefano Leucci,
Alessandro Panconesi,
Prabhakar Raghavan,
Erisa Terolli
Abstract:
In this paper we introduce a mathematical model that captures some of the salient features of recommender systems that are based on popularity and that try to exploit social ties among the users. We show that, under very general conditions, the market always converges to a steady state, for which we are able to give an explicit form. Thanks to this we can tell rather precisely how much a market is altered by a recommendation system, and determine the power of users to influence others. Our theoretical results are complemented by experiments with real world social networks showing that social graphs prevent large market distortions in spite of the presence of highly influential users.
Submitted 14 July, 2016;
originally announced July 2016.
-
The Power of Local Information in PageRank
Authors:
Marco Bressan,
Enoch Peserico,
Luca Pretto
Abstract:
How large a fraction of a graph must one explore to rank a small set of nodes according to their PageRank scores? We show that the answer is quite nuanced, and depends crucially on the interplay between the correctness guarantees one requires and the way one can access the graph. On the one hand, assuming the graph can be accessed only via "natural" exploration queries that reveal small pieces of its topology, we prove that deterministic and Las Vegas algorithms must in the worst case perform $n - o(n)$ queries and explore essentially the entire graph, independently of the specific types of query employed. On the other hand we show that, depending on the types of query available, Monte Carlo algorithms can perform asymptotically better: if allowed to both explore the local topology around single nodes and access nodes at random in the graph they need $Ω(n^{2/3})$ queries in the worst case, otherwise they still need $Ω(n)$ queries similarly to Las Vegas algorithms. All our bounds generalize and tighten those already known, cover the different types of graph exploration queries appearing in the literature, and immediately apply also to the problem of approximating the PageRank score of single nodes.
Submitted 1 April, 2016;
originally announced April 2016.
-
Simple set cardinality estimation through random sampling
Authors:
Marco Bressan,
Enoch Peserico,
Luca Pretto
Abstract:
We present a simple algorithm that estimates the cardinality $n$ of a set $V$ when allowed to sample elements of $V$ uniformly and independently at random. Our algorithm with probability $(1-δ)$ returns a $(1\pmε)$-approximation of $n$ drawing $O\big(\sqrt{n} \cdot ε^{-1}\sqrt{\log(δ^{-1})}\big)$ samples (for $ε^{-1}\sqrt{\log(δ^{-1})} = O(\sqrt{n})$).
Submitted 11 April, 2018; v1 submitted 24 December, 2015;
originally announced December 2015.
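One standard way to estimate a set's cardinality from uniform samples, as in the abstract above, is to count collisions (the birthday-paradox estimator). The sketch below is an assumption, not necessarily the authors' algorithm, and `estimate_cardinality` and `sample` are hypothetical names. With $s$ independent uniform samples the expected number of colliding pairs is $s(s-1)/(2n)$, so $s(s-1)/(2C)$ is a natural estimate of $n$ when $C$ collisions are observed; collisions only start to appear once $s$ is on the order of $\sqrt{n}$, which is consistent with the $\sqrt{n}$ factor in the stated sample bound.

```python
import random
from collections import Counter


def estimate_cardinality(sample, num_samples):
    """Estimate |V| from `num_samples` independent uniform draws.

    sample() must return one element of V chosen uniformly at random.
    Returns None when no collision is observed (too few samples).
    """
    counts = Counter(sample() for _ in range(num_samples))
    # Number of unordered pairs of draws that returned the same element.
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None
    return num_samples * (num_samples - 1) / (2 * collisions)


if __name__ == "__main__":
    n = 10_000  # true cardinality, unknown to the estimator
    print(estimate_cardinality(lambda: random.randrange(n), 2_000))
```

The accuracy improves as the number of observed collisions grows, so the sample size controls the trade-off between cost and approximation quality.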
-
Sublinear algorithms for local graph centrality estimation
Authors:
Marco Bressan,
Enoch Peserico,
Luca Pretto
Abstract:
We study the complexity of local graph centrality estimation, with the goal of approximating the centrality score of a given target node while exploring only a sublinear number of nodes/arcs of the graph and performing a sublinear number of elementary operations. We develop a technique, that we apply to the PageRank and Heat Kernel centralities, for building a low-variance score estimator through a local exploration of the graph. We obtain an algorithm that, given any node in any graph of $m$ arcs, with probability $(1-δ)$ computes a multiplicative $(1\pmε)$-approximation of its score by examining only $\tilde{O}(\min(m^{2/3} Δ^{1/3} d^{-2/3},\, m^{4/5} d^{-3/5}))$ nodes/arcs, where $Δ$ and $d$ are respectively the maximum and average outdegree of the graph (omitting for readability $\operatorname{poly}(ε^{-1})$ and $\operatorname{polylog}(δ^{-1})$ factors). A similar bound holds for computational complexity. We also prove a lower bound of $Ω(\min(m^{1/2} Δ^{1/2} d^{-1/2}, \, m^{2/3} d^{-1/3}))$ for both query complexity and computational complexity. Moreover, our technique yields a $\tilde{O}(n^{2/3})$ query complexity algorithm for the graph access model of [Brautbar et al., 2010], widely used in social network mining; we show this algorithm is optimal up to a sublogarithmic factor. These are the first algorithms yielding worst-case sublinear bounds for general directed graphs and any choice of the target node.
Submitted 4 August, 2018; v1 submitted 7 April, 2014;
originally announced April 2014.