-
Safe Subjoins in Acyclic Joins
Authors:
Foto N. Afrati
Abstract:
It is expensive to compute joins, often due to large intermediate relations. For acyclic joins, monotone join expressions are guaranteed to produce intermediate relations not larger than the size of the output of the join when it is computed on a fully reduced database. Any subexpression of an acyclic join does not offer this guarantee, as it is easy to prove. In this paper, we consider joins with…
▽ More
It is expensive to compute joins, often due to large intermediate relations. For acyclic joins, monotone join expressions are guaranteed to produce intermediate relations not larger than the size of the output of the join when it is computed on a fully reduced database. Any subexpression of an acyclic join does not offer this guarantee, as it is easy to prove. In this paper, we consider joins with projections too and we ask the question whether we can characterize join subexpressions that produce, on every fully reduced database, an output without dangling tuples (which translates, in the case of joins without projections, to an output of size not larger than the size of the output of the join). We call such a subexpression a safe subjoin. Surprisingly, we prove that there is a simple characterization which is the following: A subjoin is safe if and only if there is a parse tree of the join (a.k.a. join tree) such that the relations in the subjoin form a subtree of it. We provide an algorithm that finds such a parse tree, if there is one.
△ Less
Submitted 20 August, 2022;
originally announced August 2022.
-
Querying collections of tree-structured records in the presence of within-record referential constraints
Authors:
Foto N. Afrati,
Matthew Damigos
Abstract:
In this paper, we consider a tree-structured data model used in many commercial databases like Dremel, F1, JSON stores. We define identity and referential constraints within each tree-structured record. The query language is a variant of SQL and flattening is used as an evaluation mechanism. We investigate querying in the presence of these constraints, and point out the challenges that arise from…
▽ More
In this paper, we consider a tree-structured data model used in many commercial databases like Dremel, F1, JSON stores. We define identity and referential constraints within each tree-structured record. The query language is a variant of SQL and flattening is used as an evaluation mechanism. We investigate querying in the presence of these constraints, and point out the challenges that arise from taking them into account during query evaluation.
△ Less
Submitted 30 August, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
On the complexity of query containment and computing certain answers in the presence of ACs
Authors:
Foto N. Afrati,
Matthew Damigos
Abstract:
We often add arithmetic to extend the expressiveness of query languages and study the complexity of problems such as testing query containment and finding certain answers in the framework of answering queries using views. When adding arithmetic comparisons, the complexity of such problems is higher than the complexity of their counterparts without them. It has been observed that we can achieve low…
▽ More
We often add arithmetic to extend the expressiveness of query languages and study the complexity of problems such as testing query containment and finding certain answers in the framework of answering queries using views. When adding arithmetic comparisons, the complexity of such problems is higher than the complexity of their counterparts without them. It has been observed that we can achieve lower complexity if we restrict some of the comparisons in the containing query to be closed or open semi-interval comparisons. Here, focusing a) on the problem of containment for conjunctive queries with arithmetic comparisons (CQAC queries, for short), we prove upper bounds on its computational complexity and b) on the problem of computing certain answers, we find large classes of CQAC queries and views where this problem is polynomial.
△ Less
Submitted 18 November, 2020; v1 submitted 25 August, 2020;
originally announced August 2020.
-
SharesSkew: An Algorithm to Handle Skew for Joins in MapReduce
Authors:
Foto Afrati,
Nikos Stasinopoulos,
Jeffrey D. Ullman,
Angelos Vassilakopoulos
Abstract:
In this paper, we investigate the problem of computing a multiway join in one round of MapReduce when the data may be skewed. We optimize on communication cost, i.e., the amount of data that is transferred from the mappers to the reducers. We identify join attributes values that appear very frequently, Heavy Hitters (HH). We distribute HH valued records to reducers avoiding skew by using an adapta…
▽ More
In this paper, we investigate the problem of computing a multiway join in one round of MapReduce when the data may be skewed. We optimize on communication cost, i.e., the amount of data that is transferred from the mappers to the reducers. We identify join attributes values that appear very frequently, Heavy Hitters (HH). We distribute HH valued records to reducers avoiding skew by using an adaptation of the Shares~\cite{AfUl} algorithm to achieve minimum communication cost. Our algorithm is implemented for experimentation and is offered as open source software. Furthermore, we investigate a class of multiway joins for which a simpler variant of the algorithm can handle skew. We offer closed forms for computing the parameters of the algorithm for chain and symmetric joins.
△ Less
Submitted 12 December, 2015;
originally announced December 2015.
-
Computing Marginals Using MapReduce
Authors:
Foto Afrati,
Shantanu Sharma,
Jeffrey D. Ullman,
Jonathan R. Ullman
Abstract:
We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent). We show that the replication rate is minimized…
▽ More
We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent). We show that the replication rate is minimized when the reducers receive all the inputs necessary to compute one marginal of higher order. That observation lets us view the problem as one of covering sets of $k$ dimensions with sets of a larger size $m$, a problem that has been studied under the name "covering numbers." We offer a number of constructions that, for different values of $k$ and $m$ meet or come close to yielding the minimum possible replication rate for a given reducer size.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.
-
Meta-MapReduce: A Technique for Reducing Communication in MapReduce Computations
Authors:
Foto Afrati,
Shlomi Dolev,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
MapReduce has proven to be one of the most useful paradigms in the revolution of distributed computing, where cloud services and cluster computing become the standard venue for computing. The federation of cloud and big data activities is the next challenge where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research, where onl…
▽ More
MapReduce has proven to be one of the most useful paradigms in the revolution of distributed computing, where cloud services and cluster computing become the standard venue for computing. The federation of cloud and big data activities is the next challenge where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research, where only the very essential data for obtaining the result is transmitted, reducing communication, processing and preserving data privacy as much as possible. In this work, we propose an algorithmic technique for MapReduce algorithms, called Meta-MapReduce, that decreases the communication cost by allowing us to process and move metadata to clouds and from the map phase to reduce phase. In Meta-MapReduce, the reduce phase fetches only the required data at required iterations, which in turn, assists in preserving the data privacy.
△ Less
Submitted 28 July, 2016; v1 submitted 5 August, 2015;
originally announced August 2015.
-
Assignment Problems of Different-Sized Inputs in MapReduce
Authors:
Foto Afrati,
Shlomi Dolev,
Ephraim Korach,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for th…
▽ More
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for the first time, mapping schemas where input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is communication cost between the map and reduce phases. The communication cost can be optimized by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of reducers of constrained capacity that are used to accommodate appropriately the inputs, so that the requirement of how the inputs must meet in a reducer is satisfied. In this work, we consider a family of problems where it is required that each input meets with each other input in at least one reducer. We also consider a slightly different family of problems in which, each input of a list, X, is required to meet each input of another list, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and present a bin-packing-based approximation algorithm for finding a near optimal mapping schema.
△ Less
Submitted 20 October, 2016; v1 submitted 16 July, 2015;
originally announced July 2015.
-
Handling Skew in Multiway Joins in Parallel Processing
Authors:
Foto N. Afrati,
Jeffrey D. Ullman,
Angelos Vasilakopoulos
Abstract:
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In th…
▽ More
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In this paper we will introduce a novel technique for handling skew when we want to compute a multiway join in one MapReduce round with minimum communication cost. This technique is actually an adaptation of the Shares algorithm [Afrati et. al, TKDE 2011].
△ Less
Submitted 13 April, 2015;
originally announced April 2015.
-
Consistent Answers of Conjunctive Queries on Graphs
Authors:
Foto N. Afrati,
Phokion G. Kolaitis,
Angelos Vasilakopoulos
Abstract:
During the past decade, there has been an extensive investigation of the computational complexity of the consistent answers of Boolean conjunctive queries under primary key constraints. Much of this investigation has focused on self-join-free Boolean conjunctive queries. In this paper, we study the consistent answers of Boolean conjunctive queries involving a single binary relation, i.e., we consi…
▽ More
During the past decade, there has been an extensive investigation of the computational complexity of the consistent answers of Boolean conjunctive queries under primary key constraints. Much of this investigation has focused on self-join-free Boolean conjunctive queries. In this paper, we study the consistent answers of Boolean conjunctive queries involving a single binary relation, i.e., we consider arbitrary Boolean conjunctive queries on directed graphs. In the presence of a single key constraint, we show that for each such Boolean conjunctive query, either the problem of computing its consistent answers is expressible in first-order logic, or it is polynomial-time solvable, but not expressible in first-order logic.
△ Less
Submitted 2 March, 2015;
originally announced March 2015.
-
Assignment of Different-Sized Inputs in MapReduce
Authors:
Foto Afrati,
Shlomi Dolev,
Ephraim Korach,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for th…
▽ More
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for the first time, mapping schemas where input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is communication cost between the map and reduce phases. The communication cost can be optimized by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of reducers of constrained capacity that are used to accommodate appropriately the inputs, so that the requirement of how the inputs must meet in a reducer is satisfied. In this work, we consider a family of problems where it is required that each input meets with each other input in at least one reducer. We also consider a slightly different family of problems in which, each input of a set, X, is required to meet each input of another set, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problem is NP-hard, and present several approximation algorithms for finding a near optimal mapping schema.
△ Less
Submitted 27 January, 2015;
originally announced January 2015.
-
GYM: A Multiround Join Algorithm In MapReduce
Authors:
Foto Afrati,
Manas Joglekar,
Christopher RĂ©,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
Multiround algorithms are now commonly used in distributed data processing systems, yet the extent to which algorithms can benefit from running more rounds is not well understood. This paper answers this question for a spectrum of rounds for the problem of computing the equijoin of $n$ relations. Specifically, given any query $Q$ with width $\w$, {\em intersection width} $\iw$, input size…
▽ More
Multiround algorithms are now commonly used in distributed data processing systems, yet the extent to which algorithms can benefit from running more rounds is not well understood. This paper answers this question for a spectrum of rounds for the problem of computing the equijoin of $n$ relations. Specifically, given any query $Q$ with width $\w$, {\em intersection width} $\iw$, input size $\mathrm{IN}$, output size $\mathrm{OUT}$, and a cluster of machines with $M$ memory available per machine, we show that:
(1) $Q$ can be computed in $O(n)$ rounds with $O(n\frac{(\mathrm{IN}^{\w} + \mathrm{OUT})^2}{M})$ communication cost.
(2) $Q$ can be computed in $O(\log(n))$ rounds with $O(n\frac{(\mathrm{IN}^{\max(\w, 3\iw)} + \mathrm{OUT})^2}{M})$ communication cost. \end{itemize} Intersection width is a new notion of queries and generalized hypertree decompositions (GHDs) of queries we introduce to capture how connected the adjacent cyclic components of the GHDs are.
We achieve our first result by introducing a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any GHD of $Q$ with width $\w$ and depth $d$, and computes $Q$ in $O(d + \log(n))$ rounds and $O(n\frac{(\mathrm{IN}^{\w} + \mathrm{OUT})^2}{M})$ communication cost. We achieve our second result by showing how to construct GHDs of $Q$ with width $\max(\w, 3\iw)$ and depth $O(\log(n))$. We describe another technique to construct GHDs with longer widths and shorter depths, demonstrating a spectrum of tradeoffs one can make between communication and the number of rounds.
△ Less
Submitted 25 January, 2017; v1 submitted 15 October, 2014;
originally announced October 2014.
-
Efficient Lineage for SUM Aggregate Queries
Authors:
Foto N. Afrati,
Dimitris Fotakis,
Angelos Vasilakopoulos
Abstract:
AI systems typically make decisions and find patterns in data based on the computation of aggregate and specifically sum functions, expressed as queries, on data's attributes. This computation can become costly or even inefficient when these queries concern the whole or big parts of the data and especially when we are dealing with big data. New types of intelligent analytics require also the expla…
▽ More
AI systems typically make decisions and find patterns in data based on the computation of aggregate and specifically sum functions, expressed as queries, on data's attributes. This computation can become costly or even inefficient when these queries concern the whole or big parts of the data and especially when we are dealing with big data. New types of intelligent analytics require also the explanation of why something happened. In this paper we present a randomised algorithm that constructs a small summary of the data, called Aggregate Lineage, which can approximate well and explain all sums with large values in time that depends only on its size. The size of Aggregate Lineage is practically independent on the size of the original data. Our algorithm does not assume any knowledge on the set of sum queries to be approximated.
△ Less
Submitted 9 June, 2014; v1 submitted 10 December, 2013;
originally announced December 2013.
-
Enumerating Subgraph Instances Using Map-Reduce
Authors:
Foto N. Afrati,
Dimitris Fotakis,
Jeffrey D. Ullman
Abstract:
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize comm…
▽ More
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011)for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.
△ Less
Submitted 21 November, 2012; v1 submitted 2 August, 2012;
originally announced August 2012.
-
Upper and Lower Bounds on the Cost of a Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of…
▽ More
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance $d$, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding strings of Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round map-reduce algorithms. We are also able to explore two-round map-reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.
△ Less
Submitted 19 June, 2012;
originally announced June 2012.
-
Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Mos…
▽ More
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, some other problems do not lend themselves to a natural distribution, and have provable lower bounds. Clearly, the ease of "map-reducability" is closely related to whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms?
This is a vision paper that attempts to answer the questions described above.
△ Less
Submitted 8 April, 2012;
originally announced April 2012.
-
A New Framework for Join Product Skew
Authors:
Foto Afrati,
Victor Kyritsis,
Paraskevas V. Lekeas,
Dora Souliou
Abstract:
Different types of data skew can result in load imbalance in the context of parallel joins under the shared nothing architecture. We study one important type of skew, join product skew (JPS). A static approach based on frequency classes is proposed which takes for granted the data distribution of join attribute values. It comes from the observation that the join selectivity can be expressed as a s…
▽ More
Different types of data skew can result in load imbalance in the context of parallel joins under the shared nothing architecture. We study one important type of skew, join product skew (JPS). A static approach based on frequency classes is proposed which takes for granted the data distribution of join attribute values. It comes from the observation that the join selectivity can be expressed as a sum of products of frequencies of the join attribute values. As a consequence, an appropriate assignment of join sub-tasks, that takes into consideration the magnitude of the frequency products can alleviate the join product skew. Motivated by the aforementioned remark, we propose an algorithm, called Handling Join Product Skew (HJPS), to handle join product skew.
△ Less
Submitted 3 June, 2010; v1 submitted 31 May, 2010;
originally announced May 2010.
-
On relating CTL to Datalog
Authors:
Foto Afrati,
Theodore Andronikos,
Vassia Pavlaki,
Eugenie Foustoucos,
Irene Guessarian
Abstract:
CTL is the dominant temporal specification language in practice mainly due to the fact that it admits model checking in linear time. Logic programming and the database query language Datalog are often used as an implementation platform for logic languages. In this paper we present the exact relation between CTL and Datalog and moreover we build on this relation and known efficient algorithms for…
▽ More
CTL is the dominant temporal specification language in practice mainly due to the fact that it admits model checking in linear time. Logic programming and the database query language Datalog are often used as an implementation platform for logic languages. In this paper we present the exact relation between CTL and Datalog and moreover we build on this relation and known efficient algorithms for CTL to obtain efficient algorithms for fragments of stratified Datalog. The contributions of this paper are: a) We embed CTL into STD which is a proper fragment of stratified Datalog. Moreover we show that STD expresses exactly CTL -- we prove that by embedding STD into CTL. Both embeddings are linear. b) CTL can also be embedded to fragments of Datalog without negation. We define a fragment of Datalog with the successor build-in predicate that we call TDS and we embed CTL into TDS in linear time. We build on the above relations to answer open problems of stratified Datalog. We prove that query evaluation is linear and that containment and satisfiability problems are both decidable. The results presented in this paper are the first for fragments of stratified Datalog that are more general than those containing only unary EDBs.
△ Less
Submitted 6 October, 2005; v1 submitted 4 October, 2005;
originally announced October 2005.