-
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Matteo Riondato
Abstract:
We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.
Submitted 11 June, 2025;
originally announced June 2025.
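The link between the Bipartite Joint Degree Matrix (BJDM) and the caterpillar count mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a simple bipartite graph given as (transaction, item) pairs and takes the BJDM entry (i, j) to be the number of edges joining a left vertex of degree i to a right vertex of degree j. The function names are hypothetical and this is not the Alice code.

```python
from collections import Counter

def bjdm(edges):
    """Map (left_degree, right_degree) -> number of edges with those endpoint degrees.

    Assumes `edges` is a collection of distinct (left, right) pairs,
    i.e., a simple bipartite graph.
    """
    left_deg = Counter(u for u, _ in edges)
    right_deg = Counter(v for _, v in edges)
    return Counter((left_deg[u], right_deg[v]) for u, v in edges)

def caterpillars_from_bjdm(matrix):
    """Count paths of length three using only the BJDM.

    An edge (u, v) is the middle edge of (deg(u) - 1) * (deg(v) - 1) such
    paths, because in a bipartite graph the two outer vertices lie on
    opposite sides and can never coincide.
    """
    return sum(n * (dl - 1) * (dr - 1) for (dl, dr), n in matrix.items())

edges = [("t1", "a"), ("t1", "b"), ("t2", "b"), ("t2", "c")]
matrix = bjdm(edges)
n_caterpillars = caterpillars_from_bjdm(matrix)
```

Because the count depends on the edges only through their endpoint degrees, any dataset with the same BJDM has the same number of caterpillars under these assumptions.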
-
Polaris: Sampling from the Multigraph Configuration Model with Prescribed Color Assortativity
Authors:
Giulia Preti,
Matteo Riondato,
Aristides Gionis,
Gianmarco De Francisci Morales
Abstract:
We introduce Polaris, a network null model for colored multi-graphs that preserves the Joint Color Matrix. Polaris is specifically designed for studying network polarization, where vertices belong to a side in a debate or a partisan group, represented by a vertex color, and relations have different strengths, represented by an integer-valued edge multiplicity. The key feature of Polaris is preserving the Joint Color Matrix (JCM) of the multigraph, which specifies the number of edges connecting vertices of any two given colors. The JCM is the basic property that determines color assortativity, a fundamental aspect in studying homophily and segregation in polarized networks. By using Polaris, network scientists can test whether a phenomenon is entirely explained by the JCM of the observed network or whether other phenomena might be at play. Technically, our null model is an extension of the configuration model: an ensemble of colored multigraphs characterized by the same degree sequence and the same JCM. To sample from this ensemble, we develop a suite of Markov Chain Monte Carlo algorithms, collectively named Polaris-*. It includes Polaris-B, an adaptation of a generic Metropolis-Hastings algorithm, and Polaris-C, a faster, specialized algorithm with higher acceptance probabilities. This new null model and the associated algorithms provide a more nuanced toolset for examining polarization in social networks, thus enabling statistically sound conclusions.
Submitted 18 December, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
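The Joint Color Matrix described in the Polaris abstract above can be computed as in the following sketch. It assumes an undirected multigraph stored as a mapping from vertex pairs to integer multiplicities and a mapping from vertices to colors. The names `joint_color_matrix`, `edges`, and `color` are illustrative and not taken from the paper.

```python
from collections import Counter

def joint_color_matrix(edges, color):
    """Map an unordered color pair -> total multiplicity of edges joining those colors.

    `edges` maps (u, v) to a positive integer multiplicity (assumed layout);
    `color` maps each vertex to its color label.
    """
    jcm = Counter()
    for (u, v), multiplicity in edges.items():
        # Sort the pair so that (red, blue) and (blue, red) share one entry.
        pair = tuple(sorted((color[u], color[v])))
        jcm[pair] += multiplicity
    return jcm

color = {1: "red", 2: "red", 3: "blue"}
edges = {(1, 2): 3, (1, 3): 1, (2, 3): 2}
jcm = joint_color_matrix(edges, color)
```

A sampler for the null model would have to keep this matrix and the degree sequence unchanged at every step. The sketch only computes the invariant and does not sample.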
-
An impossibility result for Markov Chain Monte Carlo sampling from micro-canonical bipartite graph ensembles
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Matteo Riondato
Abstract:
Markov Chain Monte Carlo (MCMC) algorithms are commonly used to sample from graph ensembles. Two graphs are neighbors in the state space if one can be obtained from the other with only a few modifications, e.g., edge rewirings. For many common ensembles, e.g., those preserving the degree sequences of bipartite graphs, rewiring operations involving two edges are sufficient to create a fully-connected state space, and they can be performed efficiently. We show that, for ensembles of bipartite graphs with fixed degree sequences and number of butterflies ($K_{2,2}$ bi-cliques), there is no universal constant c such that a rewiring of at most c edges at every step is sufficient for any such ensemble to be fully connected. Our proof relies on an explicit construction of a family of pairs of graphs with the same degree sequences and number of butterflies, with each pair indexed by a natural number c, and such that any sequence of rewiring operations transforming one graph into the other must include at least one rewiring operation involving at least c edges. Whether rewiring this many edges is sufficient to guarantee the full connectivity of the state space of any such ensemble remains an open question. Our result implies the impossibility of developing efficient, graph-agnostic MCMC algorithms for these ensembles, as the necessity to rewire an impractically large number of edges may hinder taking a step on the state space.
Submitted 10 September, 2024; v1 submitted 21 August, 2023;
originally announced August 2023.
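The two-edge rewiring that the abstract above cites as sufficient for degree-preserving bipartite ensembles can be sketched as follows. This is a generic edge swap on a simple bipartite graph stored as a set of (left, right) pairs. The function names are hypothetical, and the sketch makes no attempt to preserve the number of butterflies, which is exactly the constraint the paper shows cannot be handled with a bounded number of edges.

```python
import random
from collections import Counter

def degree_sequences(edges):
    """Return the (left, right) degree counters of a bipartite edge set."""
    return Counter(u for u, _ in edges), Counter(v for _, v in edges)

def try_swap(edges, rng):
    """Attempt one two-edge swap in place: (a, x), (b, y) -> (a, y), (b, x).

    The swap is rejected when it would leave the graph unchanged or create
    a duplicate edge. Returns True if the edge set was modified.
    """
    (a, x), (b, y) = rng.sample(sorted(edges), 2)
    if a == b or x == y or (a, y) in edges or (b, x) in edges:
        return False
    edges.difference_update({(a, x), (b, y)})
    edges.update({(a, y), (b, x)})
    return True

rng = random.Random(0)
edges = {("a", "x"), ("a", "y"), ("b", "y"), ("c", "z"), ("c", "x")}
before = degree_sequences(edges)
for _ in range(100):
    try_swap(edges, rng)
after = degree_sequences(edges)
```

Each accepted swap removes one edge at every touched vertex and adds one back, so both degree sequences are unchanged.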
-
RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions
Authors:
Shahrzad Haddadan,
Cristina Menghini,
Matteo Riondato,
Eli Upfal
Abstract:
The topology of the hyperlink graph among pages expressing different opinions may influence the exposure of readers to diverse content. Structural bias may trap a reader in a polarized bubble with no access to other opinions. We model readers' behavior as random walks. A node is in a polarized bubble if the expected length of a random walk from it to a page of different opinion is large. The structural bias of a graph is the sum of the radii of highly-polarized bubbles. We study the problem of decreasing the structural bias through edge insertions. Healing all nodes with high polarized bubble radius is hard to approximate within a logarithmic factor, so we focus on finding the best $k$ edges to insert to maximally reduce the structural bias. We present RePBubLik, an algorithm that leverages a variant of the random walk closeness centrality to select the edges to insert. RePBubLik obtains, under mild conditions, a constant-factor approximation. It reduces the structural bias faster than existing edge-recommendation methods, including some designed to reduce the polarization of a graph.
Submitted 12 January, 2021;
originally announced January 2021.
-
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
Authors:
Leonardo Pellegrina,
Cyrus Cousins,
Fabio Vandin,
Matteo Riondato
Abstract:
We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
Submitted 16 June, 2020;
originally announced June 2020.
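For a small finite family, the Monte-Carlo Empirical Rademacher Average named in the MCRapper abstract above can be computed by brute force, as in this sketch. It assumes each function is given by its list of values on the m sample points, and the name `mcera` is illustrative. MCRapper itself avoids this exhaustive maximization by pruning the search space, which the sketch does not attempt.

```python
import random

def mcera(function_values, n_trials, rng):
    """Average, over n_trials random sign vectors, of the largest signed sample mean.

    `function_values[k][i]` is the value of the k-th function on the i-th
    sample point (assumed layout). Every function is scanned for each sign
    vector, so this is only practical for small families.
    """
    m = len(function_values[0])
    total = 0.0
    for _ in range(n_trials):
        # One vector of independent Rademacher (+1 / -1) variables.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / m
            for values in function_values
        )
    return total / n_trials

rng = random.Random(1)
zero_family = mcera([[0, 0, 0, 0]], n_trials=10, rng=rng)
sign_family = mcera([[1, 1, 1, 1], [-1, -1, -1, -1]], n_trials=50, rng=rng)
```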
-
TRIÈST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size
Authors:
Lorenzo De Stefani,
Alessandro Epasto,
Matteo Riondato,
Eli Upfal
Abstract:
We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches which use hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they will use. We show a full analysis of the variance of the estimations and novel concentration bounds for these quantities. Our experimental results on very large graphs show that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits a small update time.
Submitted 28 June, 2016; v1 submitted 24 February, 2016;
originally announced February 2016.
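The idea of estimating the global triangle count from a fixed-size reservoir of edges, as described in the TRIÈST abstract above, can be sketched for the insertion-only case. The sketch assumes a stream of distinct undirected edges without self-loops and a memory budget of at least three edges. The class and method names are hypothetical, and edge deletions, which the paper also handles, are out of scope here.

```python
import random
from collections import defaultdict

class ReservoirTriangleCounter:
    """Estimate the global triangle count from a reservoir of at most `memory` edges."""

    def __init__(self, memory, rng):
        assert memory >= 3
        self.memory = memory
        self.rng = rng
        self.t = 0                      # edges seen so far
        self.sample = []                # edges currently in the reservoir
        self.adj = defaultdict(set)     # adjacency restricted to the reservoir
        self.tau = 0                    # triangles inside the reservoir

    def _shared_neighbors(self, u, v):
        return len(self.adj[u] & self.adj[v])

    def _add(self, u, v):
        # Every common neighbor closes a new triangle inside the reservoir.
        self.tau += self._shared_neighbors(u, v)
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.sample.append((u, v))

    def _evict_random(self):
        idx = self.rng.randrange(len(self.sample))
        self.sample[idx], self.sample[-1] = self.sample[-1], self.sample[idx]
        u, v = self.sample.pop()
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Triangles that used the evicted edge are no longer in the reservoir.
        self.tau -= self._shared_neighbors(u, v)

    def process(self, u, v):
        self.t += 1
        if self.t <= self.memory:
            self._add(u, v)
        elif self.rng.random() < self.memory / self.t:
            # Standard reservoir step: keep the new edge with probability memory / t.
            self._evict_random()
            self._add(u, v)

    def estimate(self):
        m, t = self.memory, self.t
        if t <= m:
            return float(self.tau)
        # Inverse of the probability that three given edges are all sampled.
        return self.tau * (t * (t - 1) * (t - 2)) / (m * (m - 1) * (m - 2))

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
exact = ReservoirTriangleCounter(memory=10, rng=random.Random(0))
for u, v in k4:
    exact.process(u, v)
small = ReservoirTriangleCounter(memory=3, rng=random.Random(0))
for u, v in k4:
    small.process(u, v)
```

When the stream fits in memory the estimate equals the exact count. Otherwise the reservoir count is rescaled by the inverse of the probability that all three edges of a triangle are in the sample.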
-
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages
Authors:
Matteo Riondato,
Eli Upfal
Abstract:
We present ABRA, a suite of algorithms that compute and maintain probabilistically-guaranteed, high-quality approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms rely on random sampling and their analysis leverages Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. The results of our experimental evaluation show that our approach is much faster than exact methods, and vastly outperforms, in both speed and number of samples, current state-of-the-art algorithms with the same quality guarantees.
Submitted 18 February, 2016;
originally announced February 2016.
-
Wiggins: Detecting Valuable Information in Dynamic Networks Using Limited Resources
Authors:
Ahmad Mahmoody,
Matteo Riondato,
Eli Upfal
Abstract:
Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in networks, and detecting failure propagation in electronic circuits or infections in public drinking-water systems. In practice, it is infeasible for anyone but the owner of the network (if one exists) to monitor all nodes at all times. In this work we study the constrained setting in which the observer can only probe a small set of nodes at each time step to check whether new pieces of information (items) have reached those nodes.
We formally define the problem through an infinite time generating process that places new items in subsets of nodes according to an unknown probability distribution. Items have an exponentially decaying novelty, modeling their decreasing value. The observer uses a probing schedule (i.e., a probability distribution over the set of nodes) to choose, at each time step, a small set of nodes to check for new items. The goal is to compute a schedule that minimizes the average novelty of undetected items. We present an algorithm, WIGGINS, to compute the optimal schedule through convex optimization, and then show how it can be adapted when the parameters of the problem must be learned or change over time. We also present a scalable variant of WIGGINS for the MapReduce framework. The results of our experimental evaluation on real social networks demonstrate the practicality of our approach.
Submitted 29 July, 2015; v1 submitted 13 April, 2015;
originally announced April 2015.
-
Finding the True Frequent Itemsets
Authors:
Matteo Riondato,
Fabio Vandin
Abstract:
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction $θ$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $π$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $θ$) generated by $π$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold $\hat{θ}$ such that the collection of itemsets with frequency at least $\hat{θ}$ in $\mathcal{D}$ contains only TFIs with probability at least $1-δ$, for some user-specified $δ$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $θ$ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
Submitted 22 January, 2014; v1 submitted 7 January, 2013;
originally announced January 2013.
-
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
Authors:
Matteo Riondato,
Eli Upfal
Abstract:
The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-$K$) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, defined as the maximum integer $d$ such that the dataset contains at least $d$ transactions of length at least $d$, none of which is a superset of or equal to another. We show that this bound is strict for a large class of datasets.
Submitted 22 February, 2013; v1 submitted 29 November, 2011;
originally announced November 2011.
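The d-index defined in the abstract above can be bounded from above with a single pass over the transaction lengths, as in this sketch. It deliberately drops the requirement that no chosen transaction contain another and only discards exact duplicates, so the returned value is an upper bound on the d-index rather than the d-index itself. The function name is hypothetical and this is not claimed to be the procedure used in the paper.

```python
def d_index_upper_bound(transactions):
    """Largest d such that at least d distinct transactions have length >= d.

    Ignoring the no-superset condition can only increase d, so the result
    is an upper bound on the d-index of the dataset.
    """
    distinct = set(map(frozenset, transactions))
    lengths = sorted((len(t) for t in distinct), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank
        else:
            break
    return d

dataset = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"e"}]
bound = d_index_upper_bound(dataset)
```

Here the distinct transaction lengths are 3, 2, 2, and 1, so two transactions have length at least 2 but fewer than three have length at least 3.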
-
Space-Round Tradeoffs for MapReduce Computations
Authors:
Andrea Pietracaprina,
Geppino Pucci,
Matteo Riondato,
Francesco Silvestri,
Eli Upfal
Abstract:
This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and local memory constraints, thus favoring a more data-centric view. Second, we apply the model to the fundamental computation task of matrix multiplication presenting upper and lower bounds for both dense and sparse matrix multiplication, which highlight interesting tradeoffs between space and round complexity. Finally, building on the matrix multiplication results, we derive further space-round tradeoffs on matrix inversion and matching.
Submitted 9 November, 2011;
originally announced November 2011.
-
The VC-Dimension of Queries and Selectivity Estimation Through Sampling
Authors:
Matteo Riondato,
Mert Akdere,
Ugur Cetintemel,
Stanley B. Zdonik,
Eli Upfal
Abstract:
We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. We leverage this result and develop a method that, given a class of queries, builds a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.
Submitted 11 August, 2011; v1 submitted 30 January, 2011;
originally announced January 2011.
-
Mining Top-K Frequent Itemsets Through Progressive Sampling
Authors:
Andrea Pietracaprina,
Matteo Riondato,
Eli Upfal,
Fabio Vandin
Abstract:
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
Submitted 27 June, 2010;
originally announced June 2010.
-
Preliminary study of metabolic radiotherapy with 188Re via small animal imaging
Authors:
G. Baldazzi,
D. Bollini,
A. Muciaccio,
F. -L. Navarria,
G. Pancaldi,
A. Perrotta,
M. Zuffa,
P. Boccaccio,
N. Uzunov,
M. Bello,
D. Bernardini,
U. Mazzi,
G. Moschini,
M. Riondato,
A. Rosato,
F. Garibaldi,
R. Pani,
A. Antoccia,
F. de Notaristefani,
G. Hull,
V. Orsolini Cencelli,
A. Sgura,
C. Tanzarella
Abstract:
188Re is a beta- (Emax = 2.12 MeV) and gamma (155 keV) emitter. Since its chemistry is similar to that of the widely used tracer 99mTc, molecules of hyaluronic acid (HA) have been labelled with 188Re to produce a target-specific radiopharmaceutical. The radiolabelled compound, i.v. injected in healthy mice, accumulates in the liver after a few minutes. To study the effect of metabolic radiotherapy in mice, we have built a small gamma camera based on a matrix of YAP:Ce crystals, with 0.6x0.6x10 mm$^3$ pixels, read out by a R2486 Hamamatsu PSPMT. A high-sensitivity 20 mm thick lead parallel-hole collimator, with hole diameter 1.5 mm and septa of 0.18 mm, is placed in front of the YAP matrix. Preliminary results obtained with various phantoms containing a solution of 188Re and with C57 black mice injected with the 188Re-HA solution are presented. To increase the spatial resolution and to obtain two orthogonal projections simultaneously, we are building in parallel two new cameras to be positioned at 90 degrees. They use a CsI(Tl) matrix with 1x1x5 mm$^3$ pixels read out by a H8500 Hamamatsu flat-panel PMT.
Submitted 1 June, 2005;
originally announced June 2005.