-
Measuring Human-Robot Trust with the MDMT (Multi-Dimensional Measure of Trust)
Authors:
Bertram F. Malle,
Daniel Ullman
Abstract:
We describe the steps of developing the MDMT (Multi-Dimensional Measure of Trust), an intuitive self-report measure of perceived trustworthiness of various agents (human, robot, animal). We summarize the evidence that led to the original four-dimensional form (v1) and to the most recent five-dimensional form (v2). We examine the measure's strengths and limitations and point to further necessary validations.
Submitted 24 November, 2023;
originally announced November 2023.
-
In-Context Learning Dynamics with Random Binary Sequences
Authors:
Eric J. Bigelow,
Ekdeep Singh Lubana,
Robert P. Dick,
Hidenori Tanaka,
Tomer D. Ullman
Abstract:
Large language models (LLMs) trained on huge corpora of text datasets demonstrate intriguing capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a framework that enables us to analyze in-context learning dynamics to understand latent concepts underlying LLMs' behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate seemingly random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from seemingly random behaviors to deterministic repetition.
Submitted 15 April, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Matrix Multiplication Using Only Addition
Authors:
Daniel Cussen,
Jeffrey D. Ullman
Abstract:
Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only a single addition and a single on-chip copy operation are needed to replace a multiplication. It thus becomes possible to design a matrix-multiplier chip that, because it does not need time-, space-, and energy-consuming multiplier circuits, can hold many more processors, and thus provide a net speedup.
Submitted 3 July, 2023;
originally announced July 2023.
-
Temporal and Object Quantification Networks
Authors:
Jiayuan Mao,
Zhezheng Luo,
Chuang Gan,
Joshua B. Tenenbaum,
Jiajun Wu,
Leslie Pack Kaelbling,
Tomer D. Ullman
Abstract:
We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. This is done by including reasoning layers that implement finite-domain quantification over objects and time. The structure allows them to generalize directly to input instances with varying numbers of objects in temporal sequences of varying lengths. We evaluate TOQ-Nets on input domains that require recognizing event-types in terms of complex temporal relational patterns. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.
Submitted 10 June, 2021;
originally announced June 2021.
-
AGENT: A Benchmark for Core Psychological Reasoning
Authors:
Tianmin Shu,
Abhishek Bhandwaldar,
Chuang Gan,
Kevin A. Smith,
Shari Liu,
Dan Gutfreund,
Elizabeth Spelke,
Joshua B. Tenenbaum,
Tomer D. Ullman
Abstract:
For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
Submitted 25 July, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
-
Panda: Partitioned Data Security on Outsourced Sensitive and Non-sensitive Data
Authors:
Sharad Mehrotra,
Shantanu Sharma,
Jeffrey D. Ullman,
Dhrubajyoti Ghosh,
Peeyush Gupta
Abstract:
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existing encryption-based approaches. We first provide a new security definition, entitled partitioned data security, for guaranteeing that the joint processing of non-sensitive data (in cleartext) and sensitive data (in encrypted form) does not lead to any leakage. Then, this paper proposes a new secure approach, entitled query binning (QB), that allows secure execution of queries over non-sensitive and sensitive parts of the data. QB maps a query to a set of queries over the sensitive and non-sensitive data in a way that no leakage will occur due to the joint processing over sensitive and non-sensitive data. In particular, we propose secure algorithms for selection, range, and join queries to be executed over encrypted sensitive and cleartext non-sensitive datasets. Interestingly, in addition to improving performance, we show that QB actually strengthens the security of the underlying cryptographic technique by preventing size, frequency-count, and workload-skew attacks.
Submitted 13 May, 2020;
originally announced May 2020.
-
Efficient Multiway Hash Join on Reconfigurable Hardware
Authors:
Kunle Olukotun,
Raghu Prabhakar,
Rekha Singhal,
Jeffrey D. Ullman,
Yaqi Zhang
Abstract:
We propose algorithms for performing multiway joins using a new type of coarse-grain reconfigurable hardware accelerator, "Plasticine", that, compared with other accelerators, emphasizes high compute capability and high on-chip communication bandwidth. Joining three or more relations in a single step, i.e., a multiway join, is efficient when the join of any two relations yields too large an intermediate relation. We show at least a 200X speedup for executing a sequence of binary hash joins on Plasticine over a CPU. We further show that in some realistic cases, a Plasticine-like accelerator can make 3-way joins more efficient than a cascade of binary hash joins on the same hardware, by a factor of up to 45X.
Submitted 30 May, 2019;
originally announced May 2019.
-
Partitioned Data Security on Outsourced Sensitive and Non-sensitive Data
Authors:
Sharad Mehrotra,
Shantanu Sharma,
Jeffrey D. Ullman,
Anurag Mishra
Abstract:
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existing encryption-based approaches. We propose a new secure approach, entitled query binning (QB), that allows non-sensitive parts of the data to be outsourced in clear-text while guaranteeing that no information is leaked by the joint processing of non-sensitive data (in clear-text) and sensitive data (in encrypted form). QB maps a query to a set of queries over the sensitive and non-sensitive data in a way that no leakage will occur due to the joint processing over sensitive and non-sensitive data. Interestingly, in addition to improving performance, we show that QB actually strengthens the security of the underlying cryptographic technique by preventing size, frequency-count, and workload-skew attacks.
Submitted 19 December, 2018;
originally announced December 2018.
-
Efficient and Private Approximations of Distributed Databases Calculations
Authors:
Philip Derbeko,
Shlomi Dolev,
Ehud Gudes,
Jeffrey D. Ullman
Abstract:
In recent years, an increasing amount of data has been collected in different, and often non-cooperating, databases. The problem of privacy-preserving, distributed calculations over separate databases and the related issue of private data release have been intensively investigated. However, despite considerable progress, computational complexity, due to the increasing size of data, remains a limiting factor in real-world deployments, especially in the case of privacy-preserving computations.
In this paper, we present a general method for trading off performance against accuracy in distributed calculations by performing data sampling. Sampling has been a topic of extensive research that has recently received a boost of interest. We provide a sampling method targeted at separate, non-collaborating, vertically partitioned datasets. The method is exemplified and tested on the approximation of the intersection set, both without and with a privacy-preserving mechanism. An analysis of the bound on error as a function of the sample size is discussed, and a heuristic algorithm is suggested to further improve the performance. The algorithms were implemented, and experimental results confirm the validity of the approach.
Submitted 19 May, 2016;
originally announced May 2016.
-
Building Machines That Learn and Think Like People
Authors:
Brenden M. Lake,
Tomer D. Ullman,
Joshua B. Tenenbaum,
Samuel J. Gershman
Abstract:
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
Submitted 2 November, 2016; v1 submitted 1 April, 2016;
originally announced April 2016.
-
Some Pairs Problems
Authors:
Jeffrey D. Ullman,
Jonathan Ullman
Abstract:
A common form of MapReduce application involves discovering relationships between certain pairs of inputs. Similarity joins serve as a good example of this type of problem, which we call a "some-pairs" problem. In the framework of Afrati et al. (VLDB 2013), algorithms are measured by the tradeoff between reducer size (maximum number of inputs a reducer can handle) and the replication rate (average number of reducers to which an input must be sent). There are two obvious approaches to solving some-pairs problems in general. We show that no general-purpose MapReduce algorithm can beat both of these two algorithms in the worst case. We then explore a recursive algorithm for solving some-pairs problems and heuristics for beating the lower bound on common instances of the some-pairs class of problems.
Submitted 3 February, 2016;
originally announced February 2016.
-
SharesSkew: An Algorithm to Handle Skew for Joins in MapReduce
Authors:
Foto Afrati,
Nikos Stasinopoulos,
Jeffrey D. Ullman,
Angelos Vassilakopoulos
Abstract:
In this paper, we investigate the problem of computing a multiway join in one round of MapReduce when the data may be skewed. We optimize on communication cost, i.e., the amount of data that is transferred from the mappers to the reducers. We identify join-attribute values that appear very frequently, called Heavy Hitters (HH). We distribute HH-valued records to reducers, avoiding skew, by using an adaptation of the Shares algorithm [AfUl] to achieve minimum communication cost. Our algorithm is implemented for experimentation and is offered as open source software. Furthermore, we investigate a class of multiway joins for which a simpler variant of the algorithm can handle skew. We offer closed forms for computing the parameters of the algorithm for chain and symmetric joins.
Submitted 12 December, 2015;
originally announced December 2015.
-
Computing Marginals Using MapReduce
Authors:
Foto Afrati,
Shantanu Sharma,
Jeffrey D. Ullman,
Jonathan R. Ullman
Abstract:
We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent). We show that the replication rate is minimized when the reducers receive all the inputs necessary to compute one marginal of higher order. That observation lets us view the problem as one of covering sets of $k$ dimensions with sets of a larger size $m$, a problem that has been studied under the name "covering numbers." We offer a number of constructions that, for different values of $k$ and $m$, meet or come close to yielding the minimum possible replication rate for a given reducer size.
Submitted 29 September, 2015;
originally announced September 2015.
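The covering view described in the abstract above can be illustrated with a small sketch. It assumes only what the abstract states: every set of $k$ dimensions must be contained in at least one chosen set of $m$ dimensions. The greedy selection below is a generic stand-in, not one of the paper's constructions, and the function names `covers` and `greedy_cover` are hypothetical.

```python
from itertools import combinations

def covers(n, k, handles):
    """Check that every k-subset of n dimensions lies inside some chosen set."""
    return all(
        any(set(ks) <= h for h in handles)
        for ks in combinations(range(n), k)
    )

def greedy_cover(n, k, m):
    """Pick m-subsets greedily until every k-subset is covered.

    Assumption: greedy choice is only an illustration; it is not
    guaranteed to reach the minimum covering number.
    """
    uncovered = {frozenset(c) for c in combinations(range(n), k)}
    candidates = [frozenset(c) for c in combinations(range(n), m)]
    chosen = []
    while uncovered:
        # Take the m-subset that contains the most still-uncovered k-subsets.
        best = max(candidates, key=lambda h: sum(1 for u in uncovered if u <= h))
        chosen.append(best)
        uncovered = {u for u in uncovered if not u <= best}
    return chosen
```

Calling `greedy_cover(5, 2, 3)` returns a list of 3-element sets, and `covers(5, 2, ...)` on that list confirms that each pair of dimensions is handled by at least one of them. Fewer chosen sets corresponds to less replication in this simplified picture.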
-
Meta-MapReduce: A Technique for Reducing Communication in MapReduce Computations
Authors:
Foto Afrati,
Shlomi Dolev,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
MapReduce has proven to be one of the most useful paradigms in the revolution of distributed computing, where cloud services and cluster computing become the standard venue for computing. The federation of cloud and big data activities is the next challenge where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research, where only the very essential data for obtaining the result is transmitted, reducing communication, processing and preserving data privacy as much as possible. In this work, we propose an algorithmic technique for MapReduce algorithms, called Meta-MapReduce, that decreases the communication cost by allowing us to process and move metadata to clouds and from the map phase to reduce phase. In Meta-MapReduce, the reduce phase fetches only the required data at required iterations, which in turn, assists in preserving the data privacy.
Submitted 28 July, 2016; v1 submitted 5 August, 2015;
originally announced August 2015.
-
Assignment Problems of Different-Sized Inputs in MapReduce
Authors:
Foto Afrati,
Shlomi Dolev,
Ephraim Korach,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for the first time, mapping schemas where input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is communication cost between the map and reduce phases. The communication cost can be optimized by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of reducers of constrained capacity that are used to accommodate appropriately the inputs, so that the requirement of how the inputs must meet in a reducer is satisfied. In this work, we consider a family of problems where it is required that each input meets with each other input in at least one reducer. We also consider a slightly different family of problems in which each input of a list, X, is required to meet each input of another list, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and present a bin-packing-based approximation algorithm for finding a near optimal mapping schema.
Submitted 20 October, 2016; v1 submitted 16 July, 2015;
originally announced July 2015.
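The bin-packing idea mentioned in the abstract above can be illustrated with a minimal sketch for the case where every input must meet every other input. It assumes details the listing does not state: inputs are packed into bins of capacity q/2 using first-fit decreasing, and each pair of bins is then assigned to one reducer of capacity q. The function names are hypothetical, and this is not claimed to be the authors' exact algorithm.

```python
from itertools import combinations

def first_fit_decreasing(sizes, bin_capacity):
    """Pack input indices into bins of the given capacity (first-fit decreasing)."""
    bins = []  # each entry: [remaining capacity, list of input indices]
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        if sizes[idx] > bin_capacity:
            raise ValueError("input does not fit into half a reducer")
        for b in bins:
            if b[0] >= sizes[idx]:
                b[0] -= sizes[idx]
                b[1].append(idx)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([bin_capacity - sizes[idx], [idx]])
    return [members for _, members in bins]

def all_pairs_schema(sizes, q):
    """Return reducers (lists of input indices) so that every two inputs meet.

    Assumption: two half-capacity bins always fit together in one reducer
    of capacity q, so assigning every pair of bins to a reducer suffices.
    """
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [bins[0]]
    return [a + b for a, b in combinations(bins, 2)]
```

With sizes `[3, 3, 2, 2, 1, 1]` and `q = 8`, the inputs fall into three bins and the schema uses one reducer per pair of bins. Packing into fewer bins gives fewer reducers and therefore fewer copies of each input.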
-
Handling Skew in Multiway Joins in Parallel Processing
Authors:
Foto N. Afrati,
Jeffrey D. Ullman,
Angelos Vasilakopoulos
Abstract:
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In this paper we will introduce a novel technique for handling skew when we want to compute a multiway join in one MapReduce round with minimum communication cost. This technique is actually an adaptation of the Shares algorithm [Afrati et al., TKDE 2011].
Submitted 13 April, 2015;
originally announced April 2015.
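For readers unfamiliar with the Shares approach referenced in the abstract above, the following is a minimal simulation of a one-round chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) on a grid of reducers. It is a sketch under stated assumptions: the share sizes `b_share` and `c_share` are taken as given rather than optimized for communication cost, Python's built-in `hash` stands in for the hash functions, and no skew handling is included. The function name is hypothetical.

```python
from collections import defaultdict

def shares_chain_join(R, S, T, b_share, c_share):
    """Simulate a one-round join on a b_share x c_share grid of reducers.

    Returns the join result and the number of tuple copies sent
    (the communication cost of this simulated round).
    """
    def hb(value):
        return hash(value) % b_share

    def hc(value):
        return hash(value) % c_share

    reducers = defaultdict(lambda: ([], [], []))
    sent = 0
    for a, b in R:                      # R knows B only: replicate across the C axis
        for j in range(c_share):
            reducers[(hb(b), j)][0].append((a, b))
            sent += 1
    for b, c in S:                      # S knows B and C: exactly one reducer
        reducers[(hb(b), hc(c))][1].append((b, c))
        sent += 1
    for c, d in T:                      # T knows C only: replicate across the B axis
        for i in range(b_share):
            reducers[(i, hc(c))][2].append((c, d))
            sent += 1

    result = set()
    for r_part, s_part, t_part in reducers.values():
        for a, b in r_part:
            for b2, c in s_part:
                if b != b2:
                    continue
                for c2, d in t_part:
                    if c == c2:
                        result.add((a, b, c, d))
    return result, sent
```

In this sketch the communication cost is `len(R) * c_share + len(S) + len(T) * b_share`, which shows why the choice of shares matters: a value of B that occurs very frequently sends all of its tuples to a single row of the grid, which is the kind of imbalance a skew-aware adaptation has to address.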
-
Assignment of Different-Sized Inputs in MapReduce
Authors:
Foto Afrati,
Shlomi Dolev,
Ephraim Korach,
Shantanu Sharma,
Jeffrey D. Ullman
Abstract:
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inputs may vary in terms of size. We consider, for the first time, mapping schemas where input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is communication cost between the map and reduce phases. The communication cost can be optimized by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of reducers of constrained capacity that are used to accommodate appropriately the inputs, so that the requirement of how the inputs must meet in a reducer is satisfied. In this work, we consider a family of problems where it is required that each input meets with each other input in at least one reducer. We also consider a slightly different family of problems in which each input of a set, X, is required to meet each input of another set, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and present several approximation algorithms for finding a near optimal mapping schema.
Submitted 27 January, 2015;
originally announced January 2015.
-
GYM: A Multiround Join Algorithm In MapReduce
Authors:
Foto Afrati,
Manas Joglekar,
Christopher Ré,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
Multiround algorithms are now commonly used in distributed data processing systems, yet the extent to which algorithms can benefit from running more rounds is not well understood. This paper answers this question for a spectrum of rounds for the problem of computing the equijoin of $n$ relations. Specifically, given any query $Q$ with width $w$, intersection width $\mathrm{iw}$, input size $\mathrm{IN}$, output size $\mathrm{OUT}$, and a cluster of machines with $M$ memory available per machine, we show that:
(1) $Q$ can be computed in $O(n)$ rounds with $O(n\frac{(\mathrm{IN}^{w} + \mathrm{OUT})^2}{M})$ communication cost.
(2) $Q$ can be computed in $O(\log(n))$ rounds with $O(n\frac{(\mathrm{IN}^{\max(w, 3\mathrm{iw})} + \mathrm{OUT})^2}{M})$ communication cost.
Intersection width is a new notion of queries and generalized hypertree decompositions (GHDs) of queries we introduce to capture how connected the adjacent cyclic components of the GHDs are.
We achieve our first result by introducing a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any GHD of $Q$ with width $w$ and depth $d$, and computes $Q$ in $O(d + \log(n))$ rounds and $O(n\frac{(\mathrm{IN}^{w} + \mathrm{OUT})^2}{M})$ communication cost. We achieve our second result by showing how to construct GHDs of $Q$ with width $\max(w, 3\mathrm{iw})$ and depth $O(\log(n))$. We describe another technique to construct GHDs with longer widths and shorter depths, demonstrating a spectrum of tradeoffs one can make between communication and the number of rounds.
Submitted 25 January, 2017; v1 submitted 15 October, 2014;
originally announced October 2014.
-
Enumerating Subgraph Instances Using Map-Reduce
Authors:
Foto N. Afrati,
Dimitris Fotakis,
Jeffrey D. Ullman
Abstract:
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.
Submitted 21 November, 2012; v1 submitted 2 August, 2012;
originally announced August 2012.
-
Upper and Lower Bounds on the Cost of a Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance $d$, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding strings of Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round map-reduce algorithms. We are also able to explore two-round map-reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.
Submitted 19 June, 2012;
originally announced June 2012.
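The tradeoff between reducer size and replication for the Hamming-distance-1 problem in the abstract above can be made concrete with a small simulation. The sketch uses a simple "splitting" strategy, assumed here for illustration and not necessarily the construction behind the paper's bounds: each bit string of even length is sent to one reducer keyed by its first half and one keyed by its second half, since two strings that differ in exactly one position must agree on one of the halves. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

def hamming1_pairs_by_splitting(strings):
    """Simulate one map-reduce round over equal-length strings of even length.

    Returns (pairs at Hamming distance 1, replication rate, largest reducer size).
    """
    reducers = defaultdict(list)
    for s in strings:
        half = len(s) // 2
        # Map phase: every string goes to exactly two reducers.
        reducers[("first", s[:half])].append(s)
        reducers[("second", s[half:])].append(s)

    pairs = set()
    for group in reducers.values():
        # Reduce phase: compare only strings that share a half.
        for a, b in combinations(group, 2):
            if sum(x != y for x, y in zip(a, b)) == 1:
                pairs.add((min(a, b), max(a, b)))

    replication_rate = sum(len(g) for g in reducers.values()) / len(strings)
    max_reducer_size = max(len(g) for g in reducers.values())
    return pairs, replication_rate, max_reducer_size
```

Running it on all bit strings of length 4, generated with `product("01", repeat=4)`, sends each string to two reducers, and no reducer holds more than four strings. Allowing larger reducers would permit fewer copies per input, which is the direction of the tradeoff the abstract describes.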
-
Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation
Authors:
Foto N. Afrati,
Anish Das Sarma,
Semih Salihoglu,
Jeffrey D. Ullman
Abstract:
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.g., performing fuzzy joins in map-reduce, clustering, graph analyses, and so on. While some problems are amenable to very efficient map-reduce algorithms, some other problems do not lend themselves to a natural distribution, and have provable lower bounds. Clearly, the ease of "map-reducability" is closely related to whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers. What makes a problem distributable? Can we characterize general properties of problems that determine how easy or hard it is to find efficient map-reduce algorithms?
This is a vision paper that attempts to answer the questions described above.
Submitted 8 April, 2012;
originally announced April 2012.