-
To Store or Not to Store: a graph theoretical approach for Dataset Versioning
Authors:
Anxin Guo,
Jingwei Li,
Pattara Sukprasert,
Samir Khuller,
Amol Deshpande,
Koyel Mukherjee
Abstract:
In this work, we study the cost-efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.
We establish that the currently best-known heuristic, LMG, can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get $o(n)$-approximation for MSR on general graphs even if we relax the storage constraints by an $O(\log n)$ factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.
Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Second, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BoundedMin Retrieval (BMR).
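To make the storage/retrieval trade-off described in this abstract concrete, here is a minimal sketch of how one storage plan could be scored. It is a toy model, not the paper's algorithm: the function name `evaluate_plan`, the dictionary layout, and the convention that a materialized version has retrieval cost 0 are all assumptions made for illustration.

```python
def evaluate_plan(
    materialized: dict[str, float],
    deltas: dict[str, tuple[str, float, float]],
) -> tuple[float, float]:
    """Score a storage plan on a version graph (hypothetical toy model).

    materialized: version -> storage cost of keeping the full copy.
    deltas: version -> (parent, delta storage cost, delta retrieval cost).
    Assumes every delta chain ends at a materialized version (no cycles)
    and that a materialized version is retrieved at cost 0.
    """
    total_storage = sum(materialized.values()) + sum(
        storage for _, storage, _ in deltas.values()
    )

    cache: dict[str, float] = {}

    def retrieval(version: str) -> float:
        # Walk the delta chain back to the nearest materialized version.
        if version in materialized:
            return 0.0
        if version not in cache:
            parent, _, cost = deltas[version]
            cache[version] = retrieval(parent) + cost
        return cache[version]

    total_retrieval = sum(
        retrieval(v) for v in list(materialized) + list(deltas)
    )
    return total_storage, total_retrieval


# Plan A: keep only v1 in full; v2 and v3 are stored as deltas.
plan_a = evaluate_plan(
    {"v1": 100.0},
    {"v2": ("v1", 10.0, 4.0), "v3": ("v2", 5.0, 3.0)},
)
# Plan B: also keep v3 in full, paying more storage for faster retrieval.
plan_b = evaluate_plan(
    {"v1": 100.0, "v3": 100.0},
    {"v2": ("v1", 10.0, 4.0)},
)
```

In this toy instance, plan A uses less storage but has a higher total retrieval cost than plan B. MSR, as described above, asks for the plan with the smallest total retrieval cost among those whose storage stays within a given bound.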
Submitted 18 February, 2024;
originally announced February 2024.
-
Practical Parallel Algorithms for Near-Optimal Densest Subgraphs on Massive Graphs
Authors:
Pattara Sukprasert,
Quanquan C. Liu,
Laxman Dhulipala,
Julian Shun
Abstract:
The densest subgraph problem has received significant attention, both in theory and in practice, due to its applications in problems such as community detection, social network analysis, and spam detection. Due to the high cost of obtaining exact solutions, much attention has focused on designing approximate densest subgraph algorithms. However, existing approaches are not able to scale to massive graphs with billions of edges.
In this paper, we introduce a new framework that combines approximate densest subgraph algorithms with a pruning optimization. We design new parallel variants of the state-of-the-art sequential Greedy++ algorithm, and plug them into our framework in conjunction with a parallel pruning technique based on $k$-core decomposition to obtain parallel $(1+\varepsilon)$-approximate densest subgraph algorithms. On a single thread, our algorithms achieve $2.6$--$34\times$ speedup over Greedy++, and obtain up to $22.37\times$ self-relative parallel speedup on a 30-core machine with two-way hyper-threading. Compared with the state-of-the-art parallel algorithm by Harb et al. [NeurIPS'22], we achieve up to a $114\times$ speedup on the same machine. Finally, against the recent sequential algorithm of Xu et al. [PACMMOD'23], we achieve up to a $25.9\times$ speedup. The scalability of our algorithms enables us to obtain near-optimal density statistics on the hyperlink2012 (with roughly 113 billion edges) and clueweb (with roughly 37 billion edges) graphs for the first time in the literature.
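For readers unfamiliar with the objective, the sketch below illustrates the classic sequential greedy peeling heuristic for densest subgraph, where density is taken to be $|E(S)|/|S|$. This is not the parallel framework or the Greedy++ variants from the paper; it is the simple single-pass peeling idea, written quadratically for clarity. The function name `greedy_peeling` and the edge-list input format are assumptions.

```python
def greedy_peeling(edges: list[tuple[int, int]]) -> tuple[float, set[int]]:
    """Repeatedly remove a minimum-degree vertex and remember the densest
    intermediate subgraph. Assumes a simple undirected graph (no self-loops)."""
    adj: dict[int, set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return 0.0, set()

    remaining = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # edges still present
    best_density, best_set = m / len(remaining), set(remaining)

    while len(remaining) > 1:
        # Pick a vertex of minimum current degree (linear scan for clarity).
        v = min(remaining, key=lambda x: len(adj[x]))
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        adj[v] = set()
        remaining.remove(v)

        density = m / len(remaining)
        if density > best_density:
            best_density, best_set = density, set(remaining)
    return best_density, best_set


# A 4-clique on {0, 1, 2, 3} with one pendant vertex 4 attached to vertex 3.
k4_plus_pendant = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
density, vertices = greedy_peeling(k4_plus_pendant)
```

On this small input, peeling the pendant vertex first exposes the 4-clique, which has 6 edges on 4 vertices. A practical implementation would use bucketed degrees rather than a linear scan to find the minimum-degree vertex.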
Submitted 7 November, 2023;
originally announced November 2023.
-
Constant Approximation for Individual Preference Stable Clustering
Authors:
Anders Aamand,
Justin Y. Chen,
Allen Liu,
Sandeep Silwal,
Pattara Sukprasert,
Ali Vakilian,
Fred Zhang
Abstract:
Individual preference (IP) stability, introduced by Ahmadi et al. (ICML 2022), is a natural clustering objective inspired by stability and fairness constraints. A clustering is $α$-IP stable if the average distance of every data point to its own cluster is at most $α$ times the average distance to any other cluster. Unfortunately, determining if a dataset admits a $1$-IP stable clustering is NP-Hard. Moreover, before this work, it was unknown if an $o(n)$-IP stable clustering always \emph{exists}, as the prior state of the art only guaranteed an $O(n)$-IP stable clustering. We close this gap in understanding and show that an $O(1)$-IP stable clustering always exists for general metrics, and we give an efficient algorithm which outputs such a clustering. We also introduce generalizations of IP stability beyond average distance and give efficient, near-optimal algorithms in the cases where we consider the maximum and minimum distances within and between clusters.
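The definition of $α$-IP stability above can be checked directly for a given clustering. The sketch below is one possible checker, not code from the paper. It assumes that a point's average distance to its own cluster excludes the point itself, that singleton clusters count as stable, and that `dist(v, v) == 0`; the function name `is_ip_stable` is hypothetical.

```python
from typing import Callable, Sequence


def is_ip_stable(
    clusters: Sequence[Sequence[float]],
    dist: Callable[[float, float], float],
    alpha: float = 1.0,
) -> bool:
    """Return True if every point's average distance to its own cluster is
    at most alpha times its average distance to every other cluster."""
    for i, own in enumerate(clusters):
        if len(own) <= 1:
            continue  # assumption: a singleton cluster is trivially stable
        for v in own:
            # dist(v, v) == 0, so summing over the whole cluster is safe.
            own_avg = sum(dist(v, u) for u in own) / (len(own) - 1)
            for j, other in enumerate(clusters):
                if j == i or not other:
                    continue
                other_avg = sum(dist(v, u) for u in other) / len(other)
                if own_avg > alpha * other_avg:
                    return False
    return True


def line_metric(a: float, b: float) -> float:
    return abs(a - b)


# Points on the real line: a natural clustering and a deliberately bad one.
good = [[0, 1], [10, 11]]
bad = [[0, 10], [1, 11]]
```

Here `good` is 1-IP stable, while `bad` fails for $α = 1$ and only becomes stable once the constraint is relaxed to $α = 2$.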
Submitted 28 September, 2023;
originally announced September 2023.
-
Simple Dynamic Spanners with Near-optimal Recourse against an Adaptive Adversary
Authors:
Sayan Bhattacharya,
Thatchaphol Saranurak,
Pattara Sukprasert
Abstract:
Designing dynamic algorithms against an adaptive adversary whose performance matches the ones assuming an oblivious adversary is a major research program in the field of dynamic graph algorithms. One of the prominent examples whose oblivious-vs-adaptive gap remains maximally large is the \emph{fully dynamic spanner} problem; there exist algorithms assuming an oblivious adversary with near-optimal size-stretch trade-off using only $\operatorname{polylog}(n)$ update time [Baswana, Khurana, and Sarkar TALG'12; Forster and Goranci STOC'19; Bernstein, Forster, and Henzinger SODA'20], while against an adaptive adversary, even when we allow infinite time and only count recourse (i.e., the number of edge changes per update in the maintained spanner), all previous algorithms with stretch at most $\log^{5}(n)$ require at least $Ω(n)$ amortized recourse [Ausiello, Franciosa, and Italiano ESA'05].
In this paper, we completely close this gap with respect to recourse by showing algorithms against an adaptive adversary with near-optimal size-stretch trade-off and recourse. More precisely, for any $k\ge1$, our algorithm maintains a $(2k-1)$-spanner of size $O(n^{1+1/k}\log n)$ with $O(\log n)$ amortized recourse, which is optimal in all parameters up to an $O(\log n)$ factor. As a step toward algorithms with small update time (not just recourse), we show another algorithm that maintains a $3$-spanner of size $\tilde O(n^{1.5})$ with $\operatorname{polylog}(n)$ amortized recourse \emph{and} simultaneously $\tilde O(\sqrt{n})$ worst-case update time.
Submitted 11 July, 2022;
originally announced July 2022.
-
Individual Preference Stability for Clustering
Authors:
Saba Ahmadi,
Pranjal Awasthi,
Samir Khuller,
Matthäus Kleindessner,
Jamie Morgenstern,
Pattara Sukprasert,
Ali Vakilian
Abstract:
In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.
Submitted 7 July, 2022;
originally announced July 2022.
-
Approximating k-Edge-Connected Spanning Subgraphs via a Near-Linear Time LP Solver
Authors:
Parinya Chalermsook,
Chien-Chung Huang,
Danupon Nanongkai,
Thatchaphol Saranurak,
Pattara Sukprasert,
Sorrachai Yingchareonthawornchai
Abstract:
In the $k$-edge-connected spanning subgraph ($k$ECSS) problem, our goal is to compute a minimum-cost sub-network that is resilient against up to $k$ link failures: Given an $n$-node $m$-edge graph with a cost function on the edges, our goal is to compute a minimum-cost $k$-edge-connected spanning subgraph. This NP-hard problem generalizes the minimum spanning tree problem and is the "uniform case" of a much broader class of survivable network design problems (SNDP). A factor of two has remained the best approximation ratio for polynomial-time algorithms for the whole class of SNDP, even for a special case of $2$ECSS. The fastest $2$-approximation algorithm is however rather slow, taking $O(mnk)$ time [Khuller, Vishkin, STOC'92]. A faster time complexity of $O(n^2)$ can be obtained, but with a higher approximation guarantee of $(2k-1)$ [Gabow, Goemans, Williamson, IPCO'93].
Our main contribution is an algorithm that $(1+ε)$-approximates the optimal fractional solution in $\tilde O(m/ε^2)$ time (independent of $k$), which can be turned into a $(2+ε)$ approximation algorithm that runs in time $\tilde O\left(\frac{m}{ε^2} + \frac{k^2n^{1.5}}{ε^2}\right)$ for (integral) $k$ECSS; this improves the running time of the aforementioned results while keeping the approximation ratio arbitrarily close to a factor of two.
Submitted 30 May, 2022;
originally announced May 2022.
-
Retraction: Improved Approximation Schemes for Dominating Set Problems in Unit Disk Graphs
Authors:
Jittat Fakcharoenphol,
Pattara Sukprasert
Abstract:
Retraction note: After posting the manuscript on arXiv, we were informed by Erik Jan van Leeuwen that both results were known and that they appeared in his thesis [vL09]. A PTAS for MDS is at Theorem 6.3.21 on page 79 and a PTAS for MCDS is at Theorem 6.3.31 on page 82. The techniques used are very similar. He noted that the idea for dealing with the connected version using a constant number of extra layers in the shifting technique appeared not only in Zhang et al. [ZGWD09] but also in his 2005 paper [vL05]. Finally, van Leeuwen also informed us that the open problem that we posted has been resolved by Marx [Mar06, Mar07], who showed that an efficient PTAS for MDS does not exist [Mar06] and that, under ETH, the running time of $n^{O(1/ε)}$ is best possible [Mar07]. We thank Erik Jan van Leeuwen for the information and we regret that we made this mistake.
Abstract before retraction: We present two (exponentially) faster PTAS's for dominating set problems in unit disk graphs. Given a geometric representation of a unit disk graph, our PTAS's that find $(1+ε)$-approximate solutions to the Minimum Dominating Set (MDS) and the Minimum Connected Dominating Set (MCDS) of the input graph run in time $n^{O(1/ε)}$. This can be compared to the best known $n^{O(1/ε \log {1/ε})}$-time PTAS by Nieberg and Hurink [WAOA'05] for MDS that only uses graph structures and an $n^{O(1/ε^2)}$-time PTAS for MCDS by Zhang, Gao, Wu, and Du [J Glob Optim'09]. Our key ingredients are improved dynamic programming algorithms that depend exponentially on more essential 1-dimensional "widths" of the problems.
Submitted 17 September, 2021; v1 submitted 2 September, 2021;
originally announced September 2021.
-
Multi-transversals for Triangles and the Tuza's Conjecture
Authors:
Parinya Chalermsook,
Samir Khuller,
Pattara Sukprasert,
Sumedha Uniyal
Abstract:
In this paper, we study a primal and dual relationship about triangles: For any graph $G$, let $ν(G)$ be the maximum number of edge-disjoint triangles in $G$, and $τ(G)$ be the minimum size of a subset $F$ of edges such that $G \setminus F$ is triangle-free. It is easy to see that $ν(G) \leq τ(G) \leq 3 ν(G)$, and in fact, this rather obvious inequality holds for a much more general primal-dual relation between $k$-hyper matching and covering in hypergraphs. Tuza conjectured in 1981 that $τ(G) \leq 2 ν(G)$, and this question has received attention from various groups of researchers in discrete mathematics, settling various special cases such as planar graphs and generalized to bounded maximum average degree graphs, some cases of minor-free graphs, and very dense graphs. Despite these efforts, the conjecture in general graphs has remained wide open for almost four decades.
In this paper, we provide a proof of a non-trivial consequence of the conjecture; that is, for every $k \geq 2$, there exists a (multi)set $F \subseteq E(G): |F| \leq 2k ν(G)$ such that each triangle in $G$ overlaps at least $k$ elements in $F$. Our result can be seen as a strengthened statement of Krivelevich's result on the fractional version of Tuza's conjecture (and we give some examples illustrating this). The main technical ingredient of our result is a charging argument that locally identifies edges in $F$ based on a local view of the packing solution. This idea might be useful in further studying the primal-dual relations in general and Tuza's conjecture in particular.
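The "easy to see" inequality in this abstract follows from a standard two-line argument, sketched here for convenience. The symbol $\mathcal{P}$ for a maximum packing of edge-disjoint triangles is introduced only for this sketch and is not taken from the paper.

```latex
% Let \mathcal{P} be a maximum family of pairwise edge-disjoint triangles,
% so that |\mathcal{P}| = \nu(G).
\begin{align*}
\nu(G) &\le \tau(G)
  && \text{any $F$ hitting all triangles uses a distinct edge in each } T \in \mathcal{P}, \\
\tau(G) &\le 3\,\nu(G)
  && \text{$F = \textstyle\bigcup_{T \in \mathcal{P}} E(T)$ hits every triangle, else $\mathcal{P}$ could be extended.}
\end{align*}
```

The same reasoning shows why the multi-transversal statement is a consequence of the conjecture: if $τ(G) \leq 2 ν(G)$ held, taking $k$ copies of an optimal transversal would give a multiset of size at most $2k ν(G)$ that meets each triangle at least $k$ times.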
Submitted 3 February, 2021; v1 submitted 1 January, 2020;
originally announced January 2020.
-
Finding All Useless Arcs in Directed Planar Graphs
Authors:
Jittat Fakcharoenphol,
Bundit Laekhanukit,
Pattara Sukprasert
Abstract:
We present a linear-time algorithm for simplifying flow networks on directed planar graphs: Given a directed planar graph on $n$ vertices, a source vertex $s$ and a sink vertex $t$, our algorithm removes all the arcs that do not participate in any simple $s,t$-path in linear time. The output graph produced by our algorithm satisfies the prerequisite needed by the $O(n\log n)$-time algorithm of Weihe [FOCS'94 & JCSS'97] for computing maximum $s,t$-flow in directed planar graphs. Previously, Weihe's algorithm could not run in $O(n\log n)$ time due to the absence of the preprocessing step; all the preceding algorithms run in $\tilde Ω(n^2)$ time [Misiolek-Chen, COCOON'05 & IPL'06; Biedl, Brejová and Vinar, MFCS'00]. Consequently, this provides an alternative $O(n\log n)$-time algorithm for computing maximum $s,t$-flow in directed planar graphs in addition to the known $O(n\log n)$-time algorithms [Borradaile-Klein, SODA'06 & J.ACM'09; Erickson, SODA'10].
Our algorithm can be seen as a (truly) linear-time $s,t$-flow sparsifier for directed planar graphs, which runs faster than any maximum $s,t$-flow algorithm (which can also be seen as a sparsifier). The simplified structures of the resulting graph might be useful in future developments of maximum $s,t$-flow algorithms in both directed and undirected planar graphs.
Submitted 8 May, 2018; v1 submitted 15 February, 2017;
originally announced February 2017.