-
Safe Subjoins in Acyclic Joins
Authors:
Foto N. Afrati
Abstract:
It is expensive to compute joins, often due to large intermediate relations. For acyclic joins, monotone join expressions are guaranteed to produce intermediate relations not larger than the size of the output of the join when it is computed on a fully reduced database. An arbitrary subexpression of an acyclic join does not offer this guarantee, as is easy to prove. In this paper, we also consider joins with projections and ask whether we can characterize join subexpressions that produce, on every fully reduced database, an output without dangling tuples (which translates, in the case of joins without projections, to an output of size not larger than the size of the output of the join). We call such a subexpression a safe subjoin. Surprisingly, we prove that there is a simple characterization: a subjoin is safe if and only if there is a parse tree of the join (a.k.a. join tree) such that the relations in the subjoin form a subtree of it. We provide an algorithm that finds such a parse tree, if there is one.
Submitted 20 August, 2022;
originally announced August 2022.
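The characterization in the abstract above can be illustrated with a minimal Python sketch. It only checks whether a chosen set of relations forms a connected subtree of one given join tree; it does not search over all join trees, so it is not the algorithm from the paper. The function name `forms_subtree` and the path join R(A,B), S(B,C), T(C,D) are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def forms_subtree(tree_edges, subjoin):
    """Return True if the relations in `subjoin` are connected in the given tree.

    `tree_edges` is a list of (relation, relation) pairs describing ONE join tree.
    This is an illustrative check for a single tree, not a search over all trees.
    """
    subjoin = set(subjoin)
    if not subjoin:
        return False
    # Keep only the tree edges whose endpoints both belong to the subjoin.
    adjacency = defaultdict(set)
    for a, b in tree_edges:
        if a in subjoin and b in subjoin:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # Depth-first search from an arbitrary relation of the subjoin.
    start = next(iter(subjoin))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    # The subjoin is a subtree exactly when every relation was reached.
    return seen == subjoin


# Hypothetical path join R(A,B), S(B,C), T(C,D) with the join tree R - S - T.
edges = [("R", "S"), ("S", "T")]
print(forms_subtree(edges, {"R", "S"}))
print(forms_subtree(edges, {"R", "T"}))
```

In this example, {R, S} is connected in the tree while {R, T} is not; R and T share no attribute, so their subjoin is a Cartesian product.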
-
Querying collections of tree-structured records in the presence of within-record referential constraints
Authors:
Foto N. Afrati,
Matthew Damigos
Abstract:
In this paper, we consider a tree-structured data model used in many commercial databases like Dremel, F1, JSON stores. We define identity and referential constraints within each tree-structured record. The query language is a variant of SQL and flattening is used as an evaluation mechanism. We investigate querying in the presence of these constraints, and point out the challenges that arise from taking them into account during query evaluation.
Submitted 30 August, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
On the complexity of query containment and computing certain answers in the presence of ACs
Authors:
Foto N. Afrati,
Matthew Damigos
Abstract:
We often add arithmetic to extend the expressiveness of query languages and study the complexity of problems such as testing query containment and finding certain answers in the framework of answering queries using views. When adding arithmetic comparisons, the complexity of such problems is higher than the complexity of their counterparts without them. It has been observed that we can achieve lower complexity if we restrict some of the comparisons in the containing query to be closed or open semi-interval comparisons. Here we focus on two problems: a) for containment of conjunctive queries with arithmetic comparisons (CQAC queries, for short), we prove upper bounds on its computational complexity, and b) for computing certain answers, we find large classes of CQAC queries and views for which the problem is polynomial.
Submitted 18 November, 2020; v1 submitted 25 August, 2020;
originally announced August 2020.
-
Handling Skew in Multiway Joins in Parallel Processing
Authors:
Foto N. Afrati,
Jeffrey D. Ullman,
Angelos Vasilakopoulos
Abstract:
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In this paper we introduce a novel technique for handling skew when we want to compute a multiway join in one MapReduce round with minimum communication cost. This technique is an adaptation of the Shares algorithm [Afrati et al., TKDE 2011].
Submitted 13 April, 2015;
originally announced April 2015.
-
Consistent Answers of Conjunctive Queries on Graphs
Authors:
Foto N. Afrati,
Phokion G. Kolaitis,
Angelos Vasilakopoulos
Abstract:
During the past decade, there has been an extensive investigation of the computational complexity of the consistent answers of Boolean conjunctive queries under primary key constraints. Much of this investigation has focused on self-join-free Boolean conjunctive queries. In this paper, we study the consistent answers of Boolean conjunctive queries involving a single binary relation, i.e., we consider arbitrary Boolean conjunctive queries on directed graphs. In the presence of a single key constraint, we show that for each such Boolean conjunctive query, either the problem of computing its consistent answers is expressible in first-order logic, or it is polynomial-time solvable, but not expressible in first-order logic.
Submitted 2 March, 2015;
originally announced March 2015.
-
Efficient Lineage for SUM Aggregate Queries
Authors:
Foto N. Afrati,
Dimitris Fotakis,
Angelos Vasilakopoulos
Abstract:
AI systems typically make decisions and find patterns in data based on the computation of aggregate and specifically sum functions, expressed as queries, on the data's attributes. This computation can become costly or even inefficient when these queries concern the whole or big parts of the data and especially when we are dealing with big data. New types of intelligent analytics also require the explanation of why something happened. In this paper we present a randomised algorithm that constructs a small summary of the data, called Aggregate Lineage, which can approximate well and explain all sums with large values in time that depends only on its size. The size of Aggregate Lineage is practically independent of the size of the original data. Our algorithm does not assume any knowledge of the set of sum queries to be approximated.
Submitted 9 June, 2014; v1 submitted 10 December, 2013;
originally announced December 2013.
-
Enumerating Subgraph Instances Using Map-Reduce
Authors:
Foto N. Afrati,
Dimitris Fotakis,
Jeffrey D. Ullman
Abstract:
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.
Submitted 21 November, 2012; v1 submitted 2 August, 2012;
originally announced August 2012.
-
Upper and Lower Bounds on the Cost of a Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance $d$, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding strings of Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round map-reduce algorithms. We are also able to explore two-round map-reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.
Submitted 19 June, 2012;
originally announced June 2012.
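To make the Hamming-distance-1 problem from the abstract above concrete, here is a minimal Python sketch that simulates one map-reduce round in memory. It uses a simple "splitting" idea as an assumption for illustration: each string is sent to two reducers, one keyed by its left half and one by its right half, since two strings at distance 1 must agree on one of the halves. It is not necessarily the algorithm analysed in the paper, and the function name `hamming1_pairs_by_splitting` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming1_pairs_by_splitting(strings):
    """Simulate one map-reduce round that finds all pairs at Hamming distance 1.

    Assumes all input strings have the same length. Illustrative only.
    """
    reducers = defaultdict(list)
    # Map phase: send each distinct string to two reducers, keyed by its halves.
    for s in set(strings):
        half = len(s) // 2
        reducers[("left", s[:half])].append(s)
        reducers[("right", s[half:])].append(s)
    # Reduce phase: each reducer compares the strings it received pairwise.
    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if sum(a != b for a, b in zip(u, v)) == 1:
                pairs.add((min(u, v), max(u, v)))
    return pairs


# Hypothetical input: five bit strings of length 4.
sample = ["0000", "0001", "0011", "1111", "1110"]
print(sorted(hamming1_pairs_by_splitting(sample)))
```

In this sketch every input is replicated to exactly two reducers, and the size of a reducer is bounded by the number of strings sharing one half. This kind of trade-off between replication and reducer size is what the abstract refers to.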
-
Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, some other problems do not lend themselves to a natural distribution, and have provable lower bounds. Clearly, the ease of "map-reducability" is closely related to whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms?
This is a vision paper that attempts to answer the questions described above.
Submitted 8 April, 2012;
originally announced April 2012.