-
To Store or Not to Store: a graph theoretical approach for Dataset Versioning
Authors:
Anxin Guo,
Jingwei Li,
Pattara Sukprasert,
Samir Khuller,
Amol Deshpande,
Koyel Mukherjee
Abstract:
In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This…
▽ More
In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.
We establish that the currently best-known heuristic, LMG, can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get $o(n)$-approximation for MSR on general graphs even if we relax the storage constraints by an $O(\log n)$ factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.
Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential ``moves'', to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BoundedMin Retrieval (BMR).
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
Online Flexible Busy Time Scheduling on Heterogeneous Machines
Authors:
Gruia Calinescu,
Sami Davies,
Samir Khuller,
Shirley Zhang
Abstract:
We study the online busy time scheduling model on heterogeneous machines. In our setting, unit-length jobs arrive online with a deadline that is known to the algorithm at the job's arrival time. An algorithm has access to machines, each with different associated capacities and costs. The goal is to schedule jobs on machines before their deadline, so that the total cost incurred by the scheduling a…
▽ More
We study the online busy time scheduling model on heterogeneous machines. In our setting, unit-length jobs arrive online with a deadline that is known to the algorithm at the job's arrival time. An algorithm has access to machines, each with different associated capacities and costs. The goal is to schedule jobs on machines before their deadline, so that the total cost incurred by the scheduling algorithm is minimized. Relatively little is known about online busy time scheduling when machines are heterogeneous (i.e., have different costs and capacities), despite this being the most practical model for clients using cloud computing services. We make significant progress in understanding this model by designing an 8-competitive algorithm for the problem on unit-length jobs, and providing a lower bound on the competitive ratio of 2. We further prove that our lower bound is tight in the natural setting when jobs have agreeable deadlines.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Scalable Auction Algorithms for Bipartite Maximum Matching Problems
Authors:
Quanquan C. Liu,
Yiduo Ke,
Samir Khuller
Abstract:
In this paper, we give new auction algorithms for maximum weighted bipartite matching (MWM) and maximum cardinality bipartite $b$-matching (MCbM). Our algorithms run in $O\left(\log n/\varepsilon^8\right)$ and $O\left(\log n/\varepsilon^2\right)$ rounds, respectively, in the blackboard distributed setting. We show that our MWM algorithm can be implemented in the distributed, interactive setting us…
▽ More
In this paper, we give new auction algorithms for maximum weighted bipartite matching (MWM) and maximum cardinality bipartite $b$-matching (MCbM). Our algorithms run in $O\left(\log n/\varepsilon^8\right)$ and $O\left(\log n/\varepsilon^2\right)$ rounds, respectively, in the blackboard distributed setting. We show that our MWM algorithm can be implemented in the distributed, interactive setting using $O(\log^2 n)$ and $O(\log n)$ bit messages, respectively, directly answering the open question posed by Demange, Gale and Sotomayor [DNO14]. Furthermore, we implement our algorithms in a variety of other models including the the semi-streaming model, the shared-memory work-depth model, and the massively parallel computation model. Our semi-streaming MWM algorithm uses $O(1/\varepsilon^8)$ passes in $O(n \log n \cdot \log(1/\varepsilon))$ space and our MCbM algorithm runs in $O(1/\varepsilon^2)$ passes using $O\left(\left(\sum_{i \in L} b_i + |R|\right)\log(1/\varepsilon)\right)$ space (where parameters $b_i$ represent the degree constraints on the $b$-matching and $L$ and $R$ represent the left and right side of the bipartite graph, respectively). Both of these algorithms improves \emph{exponentially} the dependence on $\varepsilon$ in the space complexity in the semi-streaming model against the best-known algorithms for these problems, in addition to improvements in round complexity for MCbM. Finally, our algorithms eliminate the large polylogarithmic dependence on $n$ in depth and number of rounds in the work-depth and massively parallel computation models, respectively, improving on previous results which have large polylogarithmic dependence on $n$ (and exponential dependence on $\varepsilon$ in the MPC model).
△ Less
Submitted 18 July, 2023;
originally announced July 2023.
-
An Algorithmic Approach to Address Course Enrollment Challenges
Authors:
Arpita Biswas,
Yiduo Ke,
Samir Khuller,
Quanquan C. Liu
Abstract:
Massive surges of enrollments in courses have led to a crisis in several computer science departments - not only is the demand for certain courses extremely high from majors, but the demand from non-majors is also very high. Much of the time, this leads to significant frustration on the part of the students, and getting seats in desired courses is a rather ad-hoc process. One approach is to first…
▽ More
Massive surges of enrollments in courses have led to a crisis in several computer science departments - not only is the demand for certain courses extremely high from majors, but the demand from non-majors is also very high. Much of the time, this leads to significant frustration on the part of the students, and getting seats in desired courses is a rather ad-hoc process. One approach is to first collect information from students about which courses they want to take and to develop optimization models for assigning students to available seats in a fair manner. What makes this problem complex is that the courses themselves have time conflicts, and the students have credit caps (an upper bound on the number of courses they would like to enroll in). We model this problem as follows. We have $n$ agents (students), and there are "resources" (these correspond to courses). Each agent is only interested in a subset of the resources (courses of interest), and each resource can only be assigned to a bounded number of agents (available seats). In addition, each resource corresponds to an interval of time, and the objective is to assign non-overlapping resources to agents so as to produce "fair and high utility" schedules.
In this model, we provide a number of results under various settings and objective functions. Specifically, in this paper, we consider the following objective functions: total utility, max-min (Santa Claus objective), and envy-freeness. The total utility objective function maximizes the sum of the utilities of all courses assigned to students. The max-min objective maximizes the minimum utility obtained by any student. Finally, envy-freeness ensures that no student envies another student's allocation. Under these settings and objective functions, we show a number of theoretical results. Specifically, we show that the course allocation under [...]
△ Less
Submitted 17 April, 2023;
originally announced April 2023.
-
Individual Preference Stability for Clustering
Authors:
Saba Ahmadi,
Pranjal Awasthi,
Samir Khuller,
Matthäus Kleindessner,
Jamie Morgenstern,
Pattara Sukprasert,
Ali Vakilian
Abstract:
In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first…
▽ More
In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
Correlated Stochastic Knapsack with a Submodular Objective
Authors:
Sheng Yang,
Samir Khuller,
Sunav Choudhary,
Subrata Mitra,
Kanak Mahadik
Abstract:
We study the correlated stochastic knapsack problem of a submodular target function, with optional additional constraints. We utilize the multilinear extension of submodular function, and bundle it with an adaptation of the relaxed linear constraints from Ma [Mathematics of Operations Research, Volume 43(3), 2018] on correlated stochastic knapsack problem. The relaxation is then solved by the stoc…
▽ More
We study the correlated stochastic knapsack problem of a submodular target function, with optional additional constraints. We utilize the multilinear extension of submodular function, and bundle it with an adaptation of the relaxed linear constraints from Ma [Mathematics of Operations Research, Volume 43(3), 2018] on correlated stochastic knapsack problem. The relaxation is then solved by the stochastic continuous greedy algorithm, and rounded by a novel method to fit the contention resolution scheme (Feldman et al. [FOCS 2011]). We obtain a pseudo-polynomial time $(1 - 1/\sqrt{e})/2 \simeq 0.1967$ approximation algorithm with or without those additional constraints, eliminating the need of a key assumption and improving on the $(1 - 1/\sqrt[4]{e})/2 \simeq 0.1106$ approximation by Fukunaga et al. [AAAI 2019].
△ Less
Submitted 3 August, 2022; v1 submitted 4 July, 2022;
originally announced July 2022.
-
Balancing Flow Time and Energy Consumption
Authors:
Sami Davies,
Samir Khuller,
Shirley Zhang
Abstract:
In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for $n$ uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most $k$ synchronized batches of size at most $B$. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedu…
▽ More
In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for $n$ uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most $k$ synchronized batches of size at most $B$. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective -- such as ones that minimize the active time of the processor -- can result in schedules where the total flow time is arbitrarily high \cite{ChangGabowKhuller}. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of O$(B \cdot k \cdot n)$ for unit jobs and O$(B \cdot k \cdot n^5)$ for uniform length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime $O(B \cdot k^2 \cdot n^7)$ \cite{Baptiste00}.
△ Less
Submitted 2 June, 2022;
originally announced June 2022.
-
A Pairwise Fair and Community-preserving Approach to k-Center Clustering
Authors:
Brian Brubach,
Darshan Chakrabarti,
John P. Dickerson,
Samir Khuller,
Aravind Srinivasan,
Leonidas Tsepenekas
Abstract:
Clustering is a foundational problem in machine learning with numerous applications. As machine learning increases in ubiquity as a backend for automated systems, concerns about fairness arise. Much of the current literature on fairness deals with discrimination against protected classes in supervised learning (group fairness). We define a different notion of fair clustering wherein the probabilit…
▽ More
Clustering is a foundational problem in machine learning with numerous applications. As machine learning increases in ubiquity as a backend for automated systems, concerns about fairness arise. Much of the current literature on fairness deals with discrimination against protected classes in supervised learning (group fairness). We define a different notion of fair clustering wherein the probability that two points (or a community of points) become separated is bounded by an increasing function of their pairwise distance (or community diameter). We capture the situation where data points represent people who gain some benefit from being clustered together. Unfairness arises when certain points are deterministically separated, either arbitrarily or by someone who intends to harm them as in the case of gerrymandering election districts. In response, we formally define two new types of fairness in the clustering setting, pairwise fairness and community preservation. To explore the practicality of our fairness goals, we devise an approach for extending existing $k$-center algorithms to satisfy these fairness constraints. Analysis of this approach proves that reasonable approximations can be achieved while maintaining fairness. In experiments, we compare the effectiveness of our approach to classical $k$-center algorithms/heuristics and explore the tradeoff between optimal clustering and fairness.
△ Less
Submitted 14 July, 2020;
originally announced July 2020.
-
Multi-transversals for Triangles and the Tuza's Conjecture
Authors:
Parinya Chalermsook,
Samir Khuller,
Pattara Sukprasert,
Sumedha Uniyal
Abstract:
In this paper, we study a primal and dual relationship about triangles: For any graph $G$, let $ν(G)$ be the maximum number of edge-disjoint triangles in $G$, and $τ(G)$ be the minimum subset $F$ of edges such that $G \setminus F$ is triangle-free. It is easy to see that $ν(G) \leq τ(G) \leq 3 ν(G)$, and in fact, this rather obvious inequality holds for a much more general primal-dual relation bet…
▽ More
In this paper, we study a primal and dual relationship about triangles: For any graph $G$, let $ν(G)$ be the maximum number of edge-disjoint triangles in $G$, and $τ(G)$ be the minimum subset $F$ of edges such that $G \setminus F$ is triangle-free. It is easy to see that $ν(G) \leq τ(G) \leq 3 ν(G)$, and in fact, this rather obvious inequality holds for a much more general primal-dual relation between $k$-hyper matching and covering in hypergraphs. Tuza conjectured in $1981$ that $τ(G) \leq 2 ν(G)$, and this question has received attention from various groups of researchers in discrete mathematics, settling various special cases such as planar graphs and generalized to bounded maximum average degree graphs, some cases of minor-free graphs, and very dense graphs. Despite these efforts, the conjecture in general graphs has remained wide open for almost four decades.
In this paper, we provide a proof of a non-trivial consequence of the conjecture; that is, for every $k \geq 2$, there exist a (multi)-set $F \subseteq E(G): |F| \leq 2k ν(G)$ such that each triangle in $G$ overlaps at least $k$ elements in $F$. Our result can be seen as a strengthened statement of Krivelevich's result on the fractional version of Tuza's conjecture (and we give some examples illustrating this.) The main technical ingredient of our result is a charging argument, that locally identifies edges in $F$ based on a local view of the packing solution. This idea might be useful in further studying the primal-dual relations in general and the Tuza's conjecture in particular.
△ Less
Submitted 3 February, 2021; v1 submitted 1 January, 2020;
originally announced January 2020.
-
An Algorithm for Multi-Attribute Diverse Matching
Authors:
Saba Ahmadi,
Faez Ahmed,
John P. Dickerson,
Mark Fuge,
Samir Khuller
Abstract:
Bipartite b-matching, where agents on one side of a market are matched to one or more agents or items on the other, is a classical model that is used in myriad application areas such as healthcare, advertising, education, and general resource allocation. Traditionally, the primary goal of such models is to maximize a linear function of the constituent matches (e.g., linear social welfare maximizat…
▽ More
Bipartite b-matching, where agents on one side of a market are matched to one or more agents or items on the other, is a classical model that is used in myriad application areas such as healthcare, advertising, education, and general resource allocation. Traditionally, the primary goal of such models is to maximize a linear function of the constituent matches (e.g., linear social welfare maximization) subject to some constraints. Recent work has studied a new goal of balancing whole-match diversity and economic efficiency, where the objective is instead a monotone submodular function over the matching. Basic versions of this problem are solvable in polynomial time. In this work, we prove that the problem of simultaneously maximizing diversity along several features (e.g., country of citizenship, gender, skills) is NP-hard. To address this problem, we develop the first combinatorial algorithm that constructs provably-optimal diverse b-matchings in pseudo-polynomial time. We also provide a Mixed-Integer Quadratic formulation for the same problem and show that our method guarantees optimal solutions and takes less computation time for a reviewer assignment application.
△ Less
Submitted 12 February, 2020; v1 submitted 7 September, 2019;
originally announced September 2019.
-
Min-Max Correlation Clustering via MultiCut
Authors:
Saba Ahmadi,
Sainyam Galhotra,
Samir Khuller,
Barna Saha,
Roy Schwartz
Abstract:
Correlation clustering is a fundamental combinatorial optimization problem arising in many contexts and applications that has been the subject of dozens of papers in the literature. In this problem we are given a general weighted graph where each edge is labeled positive or negative. The goal is to obtain a partitioning (clustering) of the vertices that minimizes disagreements - weight of negative…
▽ More
Correlation clustering is a fundamental combinatorial optimization problem arising in many contexts and applications that has been the subject of dozens of papers in the literature. In this problem we are given a general weighted graph where each edge is labeled positive or negative. The goal is to obtain a partitioning (clustering) of the vertices that minimizes disagreements - weight of negative edges trapped inside a cluster plus positive edges between different clusters. Most of the papers on this topic mainly focus on minimizing total disagreement, a global objective for this problem. In this paper, we study a cluster-wise objective function that asks to minimize the maximum number of disagreements of each cluster, which we call min-max correlation clustering. The min-max objective is a natural objective that respects the quality of every cluster. In this paper, we provide the first nontrivial approximation algorithm for this problem achieving an $\mathcal{O}(\sqrt{\log n\cdot\max\{\log(|E^-|),\log(k)\}})$ approximation for general weighted graphs, where $|E^-|$ denotes the number of negative edges and $k$ is the number of clusters in the optimum solution. To do so, we also obtain a corresponding result for multicut where we wish to find a multicut solution while trying to minimize the total weight of cut edges on every component. The results are then further improved to obtain (i) $\mathcal{O}(r^2)$-approximation for min-max correlation clustering and min-max multicut for graphs that exclude $K_{r,r}$ minors (ii) a 14-approximation for the min-max correlation clustering on complete graphs.
△ Less
Submitted 28 June, 2019;
originally announced July 2019.
-
Near Optimal Coflow Scheduling in Networks
Authors:
Mosharaf Chowdhury,
Samir Khuller,
Manish Purohit,
Sheng Yang,
Jie You
Abstract:
The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This…
▽ More
The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently.
In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between data centers) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2 approximation algorithm for the single path and free path model, significantly improving prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement and performs well in practice.
△ Less
Submitted 17 June, 2019;
originally announced June 2019.
-
On the cost of essentially fair clusterings
Authors:
Ioana O. Bercea,
Martin Groß,
Samir Khuller,
Aounon Kumar,
Clemens Rösner,
Daniel R. Schmidt,
Melanie Schmidt
Abstract:
Clustering is a fundamental tool in data mining. It partitions points into groups (clusters) and may be used to make decisions for each point based on its group. However, this process may harm protected (minority) classes if the clustering algorithm does not adequately represent them in desirable clusters -- especially if the data is already biased.
At NIPS 2017, Chierichetti et al. proposed a m…
▽ More
Clustering is a fundamental tool in data mining. It partitions points into groups (clusters) and may be used to make decisions for each point based on its group. However, this process may harm protected (minority) classes if the clustering algorithm does not adequately represent them in desirable clusters -- especially if the data is already biased.
At NIPS 2017, Chierichetti et al. proposed a model for fair clustering requiring the representation in each cluster to (approximately) preserve the global fraction of each protected class. Restricting to two protected classes, they developed both a 4-approximation for the fair $k$-center problem and a $O(t)$-approximation for the fair $k$-median problem, where $t$ is a parameter for the fairness model. For multiple protected classes, the best known result is a 14-approximation for fair $k$-center.
We extend and improve the known results. Firstly, we give a 5-approximation for the fair $k$-center problem with multiple protected classes. Secondly, we propose a relaxed fairness notion under which we can give bicriteria constant-factor approximations for all of the classical clustering objectives $k$-center, $k$-supplier, $k$-median, $k$-means and facility location. The latter approximations are achieved by a framework that takes an arbitrary existing unfair (integral) solution and a fair (fractional) LP solution and combines them into an essentially fair clustering with a weakly supervised rounding scheme. In this way, a fair clustering can be established belatedly, in a situation where the centers are already fixed.
△ Less
Submitted 26 November, 2018;
originally announced November 2018.
-
Select and Permute: An Improved Online Framework for Scheduling to Minimize Weighted Completion Time
Authors:
Samir Khuller,
Jingling Li,
Pascal Sturmfels,
Kevin Sun,
Prayaag Venkat
Abstract:
In this paper, we introduce a new online scheduling framework for minimizing total weighted completion time in a general setting. The framework is inspired by the work of Hall et al. [Mathematics of Operations Research, Vol 22(3):513-544, 1997] and Garg et al. [Proc. of Foundations of Software Technology and Theoretical Computer Science, pp. 96-107, 2007], who show how to convert an offline approx…
▽ More
In this paper, we introduce a new online scheduling framework for minimizing total weighted completion time in a general setting. The framework is inspired by the work of Hall et al. [Mathematics of Operations Research, Vol 22(3):513-544, 1997] and Garg et al. [Proc. of Foundations of Software Technology and Theoretical Computer Science, pp. 96-107, 2007], who show how to convert an offline approximation to an online scheme. Our framework uses two offline approximation algorithms (one for the simpler problem of scheduling without release times, and another for the minimum unscheduled weight problem) to create an online algorithm with provably good competitive ratios.
We illustrate multiple applications of this method that yield improved competitive ratios. Our framework gives algorithms with the best or only known competitive ratios for the concurrent open shop, coflow, and concurrent cluster models. We also introduce a randomized variant of our framework based on the ideas of Chakrabarti et al. [Proc. of International Colloquium on Automata, Languages, and Programming, pp. 646-657, 1996] and use it to achieve improved competitive ratios for these same problems.
△ Less
Submitted 21 April, 2017;
originally announced April 2017.
-
Scheduling Distributed Clusters of Parallel Machines: Primal-Dual and LP-based Approximation Algorithms [Full Version]
Authors:
Riley Murray,
Samir Khuller,
Megan Chao
Abstract:
The Map-Reduce computing framework rose to prominence with datasets of such size that dozens of machines on a single cluster were needed for individual jobs. As datasets approach the exabyte scale, a single job may need distributed processing not only on multiple machines, but on multiple clusters. We consider a scheduling problem to minimize weighted average completion time of N jobs on M distrib…
▽ More
The Map-Reduce computing framework rose to prominence with datasets of such size that dozens of machines on a single cluster were needed for individual jobs. As datasets approach the exabyte scale, a single job may need distributed processing not only on multiple machines, but on multiple clusters. We consider a scheduling problem to minimize weighted average completion time of N jobs on M distributed clusters of parallel machines. In keeping with the scale of the problems motivating this work, we assume that (1) each job is divided into M "subjobs" and (2) distinct subjobs of a given job may be processed concurrently.
When each cluster is a single machine, this is the NP-Hard concurrent open shop problem. A clear limitation of such a model is that a serial processing assumption sidesteps the issue of how different tasks of a given subjob might be processed in parallel. Our algorithms explicitly model clusters as pools of resources and effectively overcome this issue.
Under a variety of parameter settings, we develop two constant factor approximation algorithms for this problem. The first algorithm uses an LP relaxation tailored to this problem from prior work. This LP-based algorithm provides strong performance guarantees. Our second algorithm exploits a surprisingly simple mapping to the special case of one machine per cluster. This mapping-based algorithm is combinatorial and extremely fast. These are the first constant factor approximations for this problem.
△ Less
Submitted 27 October, 2016;
originally announced October 2016.
-
LP Rounding and Combinatorial Algorithms for Minimizing Active and Busy Time
Authors:
Jessica Chang,
Samir Khuller,
Koyel Mukherjee
Abstract:
We consider fundamental scheduling problems motivated by energy issues. In this framework, we are given a set of jobs, each with a release time, deadline and required processing length. The jobs need to be scheduled on a machine so that at most g jobs are active at any given time. The duration for which a machine is active (i.e., "on") is referred to as its active time. The goal is to find a feasi…
▽ More
We consider fundamental scheduling problems motivated by energy issues. In this framework, we are given a set of jobs, each with a release time, deadline and required processing length. The jobs need to be scheduled on a machine so that at most g jobs are active at any given time. The duration for which a machine is active (i.e., "on") is referred to as its active time. The goal is to find a feasible schedule for all jobs, minimizing the total active time. When preemption is allowed at integer time points, we show that a minimal feasible schedule already yields a 3-approximation (and this bound is tight) and we further improve this to a 2-approximation via LP rounding techniques. Our second contribution is for the non-preemptive version of this problem. However, since even asking if a feasible schedule on one machine exists is NP-hard, we allow for an unbounded number of virtual machines, each having capacity of g. This problem is known as the busy time problem in the literature and a 4-approximation is known for this problem. We develop a new combinatorial algorithm that gives a 3-approximation. Furthermore, we consider the preemptive busy time problem, giving a simple and exact greedy algorithm when unbounded parallelism is allowed, i.e., g is unbounded. For arbitrary g, this yields an algorithm that is 2-approximate.
△ Less
Submitted 25 October, 2016;
originally announced October 2016.
-
Minimizing Uncertainty through Sensor Placement with Angle Constraints
Authors:
Ioana O. Bercea,
Volkan Isler,
Samir Khuller
Abstract:
We study the problem of sensor placement in environments in which localization is a necessity, such as ad-hoc wireless sensor networks that allow the placement of a few anchors that know their location or sensor arrays that are tracking a target. In most of these situations, the quality of localization depends on the relative angle between the target and the pair of sensors observing it. In this p…
▽ More
We study the problem of sensor placement in environments in which localization is a necessity, such as ad-hoc wireless sensor networks that allow the placement of a few anchors that know their location or sensor arrays that are tracking a target. In most of these situations, the quality of localization depends on the relative angle between the target and the pair of sensors observing it. In this paper, we consider placing a small number of sensors which ensure good angular $α$-coverage: given $α$ in $[0,π/2]$, for each target location $t$, there must be at least two sensors $s_1$ and $s_2$ such that the $\angle(s_1 t s_2)$ is in the interval $[α, π-α]$. One of the main difficulties encountered in such problems is that since the constraints depend on at least two sensors, building a solution must account for the inherent dependency between selected sensors, a feature that generic Set Cover techniques do not account for. We introduce a general framework that guarantees an angular coverage that is arbitrarily close to $α$ for any $α<= π/3$ and apply it to a variety of problems to get bi-criteria approximations. When the angular coverage is required to be at least a constant fraction of $α$, we obtain results that are strictly better than what standard geometric Set Cover methods give. When the angular coverage is required to be at least $(1-1/δ)\cdotα$, we obtain a $\mathcal{O}(\log δ)$- approximation for sensor placement with $α$-coverage on the plane. In the presence of additional distance or visibility constraints, the framework gives a $\mathcal{O}(\logδ\cdot\log k_{OPT})$-approximation, where $k_{OPT}$ is the size of the optimal solution. We also use our framework to give a $\mathcal{O}(\log δ)$-approximation that ensures $(1-1/δ)\cdot α$-coverage and covers every target within distance $3R$.
△ Less
Submitted 19 July, 2016;
originally announced July 2016.
-
Constant factor Approximation Algorithms for Uniform Hard Capacitated Facility Location Problems: Natural LP is not too bad
Authors:
Sapna Grover,
Neelima Gupta,
Samir Khuller,
Aditya Pancholi
Abstract:
In this paper, we give first constant factor approximation for capacitated knapsack median problem (CKM) for hard uniform capacities, violating the budget only by an additive factor of $f_{max}$ where $f_{max}$ is the maximum cost of a facility opened by the optimal and violating capacities by $(2+ε)$ factor. Natural LP for the problem is known to have an unbounded integrality gap when any one of…
▽ More
In this paper, we give first constant factor approximation for capacitated knapsack median problem (CKM) for hard uniform capacities, violating the budget only by an additive factor of $f_{max}$ where $f_{max}$ is the maximum cost of a facility opened by the optimal and violating capacities by $(2+ε)$ factor. Natural LP for the problem is known to have an unbounded integrality gap when any one of the two constraints is allowed to be violated by a factor less than $2$. Thus, we present a result which is very close to the best achievable from the natural LP. To the best of our knowledge, the problem has not been studied earlier.
For capacitated facility location problem with uniform capacities, a constant factor approximation algorithm is presented violating the capacities a little ($1 + ε$). Though constant factor results are known for the problem without violating the capacities, the result is interesting as it is obtained by rounding the solution to the natural LP, which is known to have an unbounded integrality gap without violating the capacities. Thus, we achieve the best possible from the natural LP for the problem. The result shows that natural LP is not too bad.
Finally, we raise some issues with the proofs of the results presented in \cite{capkmByrkaFRS2013} for capacitated $k$-facility location problem (C$k$FLP). \cite{capkmByrkaFRS2013} presents $O(1/ε^2)$ approximation violating the capacities by a factor of $(2 + ε)$ using dependent rounding. We first fix these issues using our techniques. Also, it can be argued that (deterministic) pipage rounding cannot be used to open the facilities instead of dependent rounding. Our techniques for CKM provide a constant factor approximation for CkFLP violating the capacities by $(2 + ε)$.
△ Less
Submitted 23 March, 2022; v1 submitted 26 June, 2016;
originally announced June 2016.
-
On Correcting Inputs: Inverse Optimization for Online Structured Prediction
Authors:
Hal Daumé III,
Samir Khuller,
Manish Purohit,
Gregory Sanders
Abstract:
Algorithm designers typically assume that the input data is correct, and then proceed to find "optimal" or "sub-optimal" solutions using this input data. However this assumption of correct data does not always hold in practice, especially in the context of online learning systems where the objective is to learn appropriate feature weights given some training samples. Such scenarios necessitate the…
▽ More
Algorithm designers typically assume that the input data is correct, and then proceed to find "optimal" or "sub-optimal" solutions using this input data. However this assumption of correct data does not always hold in practice, especially in the context of online learning systems where the objective is to learn appropriate feature weights given some training samples. Such scenarios necessitate the study of inverse optimization problems where one is given an input instance as well as a desired output and the task is to adjust the input data so that the given output is indeed optimal. Motivated by learning structured prediction models, in this paper we consider inverse optimization with a margin, i.e., we require the given output to be better than all other feasible outputs by a desired margin. We consider such inverse optimization problems for maximum weight matroid basis, matroid intersection, perfect matchings, minimum cost maximum flows, and shortest paths and derive the first known results for such problems with a non-zero margin. The effectiveness of these algorithmic approaches to online learning for structured prediction is also discussed.
△ Less
Submitted 11 October, 2015;
originally announced October 2015.
-
Analyzing the Optimal Neighborhood: Algorithms for Budgeted and Partial Connected Dominating Set Problems
Authors:
Samir Khuller,
Manish Purohit,
Kanthi Sarpatwar
Abstract:
We study partial and budgeted versions of the well studied connected dominating set problem. In the partial connected dominating set problem, we are given an undirected graph G = (V,E) and an integer n', and the goal is to find a minimum subset of vertices that induces a connected subgraph of G and dominates at least n' vertices. We obtain the first polynomial time algorithm with an O(\ln Δ) appro…
▽ More
We study partial and budgeted versions of the well studied connected dominating set problem. In the partial connected dominating set problem, we are given an undirected graph G = (V,E) and an integer n', and the goal is to find a minimum subset of vertices that induces a connected subgraph of G and dominates at least n' vertices. We obtain the first polynomial time algorithm with an O(\ln Δ) approximation factor for this problem, thereby significantly extending the results of Guha and Khuller (Algorithmica, Vol. 20(4), Pages 374-387, 1998) for the connected dominating set problem. We note that none of the methods developed earlier can be applied directly to solve this problem. In the budgeted connected dominating set problem, there is a budget on the number of vertices we can select, and the goal is to dominate as many vertices as possible. We obtain a (1/13)(1 - 1/e) approximation algorithm for this problem. Finally, we show that our techniques extend to a more general setting where the profit function associated with a subset of vertices is a monotone "special" submodular function. This generalization captures the connected dominating set problem with capacities and/or weighted profits as special cases. This implies a O(\ln q) approximation (where q denotes the quota) and an O(1) approximation algorithms for the partial and budgeted versions of these problems. While the algorithms are simple, the results make a surprising use of the greedy set cover framework in defining a useful profit function.
△ Less
Submitted 10 November, 2013;
originally announced November 2013.
-
Data Placement and Replica Selection for Improving Co-location in Distributed Environments
Authors:
K. Ashwin Kumar,
Amol Deshpande,
Samir Khuller
Abstract:
Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing contention on scarce data center resources like network bandwidth; further, the energy requirements for power…
▽ More
Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies, but in most cases, may increase the total resource and energy consumption significantly. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average query span, i.e., the average number of machines that are involved in processing of a query, through colocation of data items that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a historical query workload trace as a hypergraph over a set of data items, and formulate and analyze the problem of replica placement by drawing connections to several well-studied graph theoretic concepts. We develop a series of algorithms to decide which data items to replicate, and where to place the replicas. We show effectiveness of our proposed approach by presenting results on a collection of synthetic and real workloads. Our experiments show that careful data placement and replication can dramatically reduce the average query spans resulting in significant reductions in the resource consumption.
△ Less
Submitted 18 February, 2013;
originally announced February 2013.
-
LP Rounding for k-Centers with Non-uniform Hard Capacities
Authors:
Marek Cygan,
MohammadTaghi Hajiaghayi,
Samir Khuller
Abstract:
In this paper we consider a generalization of the classical k-center problem with capacities. Our goal is to select k centers in a graph, and assign each node to a nearby center, so that we respect the capacity constraints on centers. The objective is to minimize the maximum distance a node has to travel to get to its assigned center. This problem is NP-hard, even when centers have no capacity res…
▽ More
In this paper we consider a generalization of the classical k-center problem with capacities. Our goal is to select k centers in a graph, and assign each node to a nearby center, so that we respect the capacity constraints on centers. The objective is to minimize the maximum distance a node has to travel to get to its assigned center. This problem is NP-hard, even when centers have no capacity restrictions and optimal factor 2 approximation algorithms are known. With capacities, when all centers have identical capacities, a 6 approximation is known with no better lower bounds than for the infinite capacity version.
While many generalizations and variations of this problem have been studied extensively, no progress was made on the capacitated version for a general capacity function. We develop the first constant factor approximation algorithm for this problem. Our algorithm uses an LP rounding approach to solve this problem, and works for the case of non-uniform hard capacities, when multiple copies of a node may not be chosen and can be extended to the case when there is a hard bound on the number of copies of a node that may be selected. In addition we establish a lower bound on the integrality gap of 7(5) for non-uniform (uniform) hard capacities. In addition we prove that if there is a (3-eps)-factor approximation for this problem then P=NP.
Finally, for non-uniform soft capacities we present a much simpler 11-approximation algorithm, which we find as one more evidence that hard capacities are much harder to deal with.
△ Less
Submitted 15 August, 2012;
originally announced August 2012.
-
A Model for Minimizing Active Processor Time
Authors:
Jessica Chang,
Harold N. Gabow,
Samir Khuller
Abstract:
We introduce the following elementary scheduling problem. We are given a collection of n jobs, where each job has an integer length as well as a set Ti of time intervals in which it can be feasibly scheduled. Given a parameter B, the processor can schedule up to B jobs at a timeslot t so long as it is "active" at t. The goal is to schedule all the jobs in the fewest number of active timeslots. The…
▽ More
We introduce the following elementary scheduling problem. We are given a collection of n jobs, where each job has an integer length as well as a set Ti of time intervals in which it can be feasibly scheduled. Given a parameter B, the processor can schedule up to B jobs at a timeslot t so long as it is "active" at t. The goal is to schedule all the jobs in the fewest number of active timeslots. The machine consumes a fixed amount of energy per active timeslot, regardless of the number of jobs scheduled in that slot (as long as the number of jobs is non-zero). In other words, subject to all units of each job being scheduled in its feasible region and at each slot at most B jobs being scheduled, we are interested in minimizing the total time during which the machine is active. We present a linear time algorithm for the case where jobs are unit length and each Ti is a single interval. For general Ti, we show that the problem is NP-complete even for B = 3. However when B = 2, we show that it can be efficiently solved. In addition, we consider a version of the problem where jobs have arbitrary lengths and can be preempted at any point in time. For general B, the problem can be solved by linear programming. For B = 2, the problem amounts to finding a triangle-free 2-matching on a special graph. We extend the algorithm of Babenko et. al. to handle our variant, and also to handle non-unit length jobs. This yields an O(sqrt(L)m) time algorithm to solve the preemptive scheduling problem for B = 2, where L is the sum of the job lengths. We also show that for B = 2 and unit length jobs, the optimal non-preemptive schedule has at most 4/3 times the active time of the optimal preemptive schedule; this bound extends to several versions of the problem when jobs have arbitrary length.
△ Less
Submitted 1 August, 2012;
originally announced August 2012.
-
On Computing Compression Trees for Data Collection in Sensor Networks
Authors:
Jian Li,
Amol Deshpande,
Samir Khuller
Abstract:
We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding an optimal or a near-optimal {\em compression tree} for a given sensor network: a compression tre…
▽ More
We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding an optimal or a near-optimal {\em compression tree} for a given sensor network: a compression tree is a directed tree over the sensor network nodes such that the value of a node is compressed using the value of its parent. We consider this problem under different communication models, including the {\em broadcast communication} model that enables many new opportunities for energy-efficient data collection. We draw connections between the data collection problem and a previously studied graph concept, called {\em weakly connected dominating sets}, and we use this to develop novel approximation algorithms for the problem. We present comparative results on several synthetic and real-world datasets showing that our algorithms construct near-optimal compression trees that yield a significant reduction in the data collection cost.
△ Less
Submitted 30 July, 2009;
originally announced July 2009.
-
Designing Multi-Commodity Flow Trees
Authors:
Samir Khuller,
Balaji Raghavachari,
Neal E. Young
Abstract:
The traditional multi-commodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multi-commodity flow network-design problem: given a set of multi-commodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed.
This paper focuses on t…
▽ More
The traditional multi-commodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multi-commodity flow network-design problem: given a set of multi-commodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed.
This paper focuses on the case when the network is required to be a tree. The main result is an approximation algorithm for the case when the tree is required to be of constant degree. The algorithm reduces the problem to the minimum-weight balanced-separator problem; the performance guarantee of the algorithm is within a factor of 4 of the performance guarantee of the balanced-separator procedure. If Leighton and Rao's balanced-separator procedure is used, the performance guarantee is O(log n). This improves the O(log^2 n) approximation factor that is trivial to obtain by a direct application of the balanced-separator method.
△ Less
Submitted 30 May, 2002;
originally announced May 2002.
-
A Network-Flow Technique for Finding Low-Weight Bounded-Degree Spanning Trees
Authors:
S. Fekete,
S. Khuller,
M. Klemmstein,
B. Raghavachari,
Neal E. Young
Abstract:
The problem considered is the following. Given a graph with edge weights satisfying the triangle inequality, and a degree bound for each vertex, compute a low-weight spanning tree such that the degree of each vertex is at most its specified bound. The problem is NP-hard (it generalizes Traveling Salesman (TSP)). This paper describes a network-flow heuristic for modifying a given tree T to meet t…
▽ More
The problem considered is the following. Given a graph with edge weights satisfying the triangle inequality, and a degree bound for each vertex, compute a low-weight spanning tree such that the degree of each vertex is at most its specified bound. The problem is NP-hard (it generalizes Traveling Salesman (TSP)). This paper describes a network-flow heuristic for modifying a given tree T to meet the constraints. Choosing T to be a minimum spanning tree (MST) yields approximation algorithms with performance guarantee less than 2 for the problem on geometric graphs with L_p-norms. The paper also describes a Euclidean graph whose minimum TSP costs twice the MST, disproving a conjecture made in ``Low-Degree Spanning Trees of Small Weight'' (1996).
△ Less
Submitted 18 May, 2002;
originally announced May 2002.
-
Balancing Minimum Spanning and Shortest Path Trees
Authors:
Samir Khuller,
Balaji Raghavachari,
Neal E. Young
Abstract:
This paper give a simple linear-time algorithm that, given a weighted digraph, finds a spanning tree that simultaneously approximates a shortest-path tree and a minimum spanning tree. The algorithm provides a continuous trade-off: given the two trees and epsilon > 0, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortest-path tree is at most 1…
▽ More
This paper give a simple linear-time algorithm that, given a weighted digraph, finds a spanning tree that simultaneously approximates a shortest-path tree and a minimum spanning tree. The algorithm provides a continuous trade-off: given the two trees and epsilon > 0, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortest-path tree is at most 1+epsilon times the shortest-path distance, and yet the total weight of the tree is at most 1+2/epsilon times the weight of a minimum spanning tree. This is the best tradeoff possible. The paper also describes a fast parallel implementation.
△ Less
Submitted 18 May, 2002;
originally announced May 2002.
-
Low-Degree Spanning Trees of Small Weight
Authors:
Samir Khuller,
Balaji Raghavachari,
Neal E. Young
Abstract:
The degree-d spanning tree problem asks for a minimum-weight spanning tree in which the degree of each vertex is at most d. When d=2 the problem is TSP, and in this case, the well-known Christofides algorithm provides a 1.5-approximation algorithm (assuming the edge weights satisfy the triangle inequality).
In 1984, Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an al…
▽ More
The degree-d spanning tree problem asks for a minimum-weight spanning tree in which the degree of each vertex is at most d. When d=2 the problem is TSP, and in this case, the well-known Christofides algorithm provides a 1.5-approximation algorithm (assuming the edge weights satisfy the triangle inequality).
In 1984, Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an algorithm with performance guarantee less than 2 for Euclidean graphs (points in R^n) and d > 2. This paper gives the first answer to that challenge, presenting an algorithm to compute a degree-3 spanning tree of cost at most 5/3 times the MST. For points in the plane, the ratio improves to 3/2 and the algorithm can also find a degree-4 spanning tree of cost at most 5/4 times the MST.
△ Less
Submitted 18 May, 2002;
originally announced May 2002.
-
Approximating the Minimum Equivalent Digraph
Authors:
Samir Khuller,
Balaji Raghavachari,
Neal E. Young
Abstract:
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives an approximation algorithm with performance guarantee of pi^2/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles. (This result is strengthened slig…
▽ More
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives an approximation algorithm with performance guarantee of pi^2/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles. (This result is strengthened slightly in ``On strongly connected digraphs with bounded cycle length'' (1996).) The analysis applies directly to 2-Exchange, a simple ``local improvement'' algorithm, showing that its performance guarantee is 1.75.
△ Less
Submitted 18 May, 2002;
originally announced May 2002.
-
A Primal-Dual Parallel Approximation Technique Applied to Weighted Set and Vertex Cover
Authors:
Samir Khuller,
Uzi Vishkin,
Neal Young
Abstract:
The paper describes a simple deterministic parallel/distributed (2+epsilon)-approximation algorithm for the minimum-weight vertex-cover problem and its dual (edge/element packing).
The paper describes a simple deterministic parallel/distributed (2+epsilon)-approximation algorithm for the minimum-weight vertex-cover problem and its dual (edge/element packing).
△ Less
Submitted 18 May, 2002;
originally announced May 2002.
-
On Strongly Connected Digraphs with Bounded Cycle Length
Authors:
Samir Khuller,
Balaji Raghavachari,
Neal Young
Abstract:
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to…
▽ More
The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to an improvement in the performance guarantee of the previously best approximation algorithm for the general problem in ``Approximating the Minimum Equivalent Digraph'' (1995).
△ Less
Submitted 9 May, 2002;
originally announced May 2002.