-
Optimizing Big Active Data Management Systems
Authors:
Shahrzad Haji Amin Shirazi,
Xikui Wang,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
Within the dynamic world of Big Data, traditional systems typically operate in a passive mode, processing and responding to user queries by returning the requested data. However, this methodology falls short of meeting the evolving demands of users who not only wish to analyze data but also to receive proactive updates on topics of interest. To bridge this gap, Big Active Data (BAD) frameworks hav…
▽ More
Within the dynamic world of Big Data, traditional systems typically operate in a passive mode, processing and responding to user queries by returning the requested data. However, this methodology falls short of meeting the evolving demands of users who not only wish to analyze data but also to receive proactive updates on topics of interest. To bridge this gap, Big Active Data (BAD) frameworks have been proposed to support extensive data subscriptions and analytics for millions of subscribers. As data volumes and the number of interested users continue to increase, the imperative to optimize BAD systems for enhanced scalability, performance, and efficiency becomes paramount. To this end, this paper introduces three main optimizations, namely: strategic aggregation, intelligent modifications to the query plan, and early result filtering, all aimed at reinforcing a BAD platform's capability to actively manage and efficiently process soaring rates of incoming data and distribute notifications to larger numbers of subscribers.
△ Less
Submitted 20 December, 2024; v1 submitted 18 December, 2024;
originally announced December 2024.
-
EBV: Electronic Bee-Veterinarian for Principled Mining and Forecasting of Honeybee Time Series
Authors:
Mst. Shamima Hossain,
Christos Faloutsos,
Boris Baer,
Hyoseung Kim,
Vassilis J. Tsotras
Abstract:
Honeybees are vital for pollination and food production. Among many factors, extreme temperature (e.g., due to climate change) is particularly dangerous for bee health. Anticipating such extremities would allow beekeepers to take early preventive action. Thus, given sensor (temperature) time series data from beehives, how can we find patterns and do forecasting? Forecasting is crucial as it helps…
▽ More
Honeybees are vital for pollination and food production. Among many factors, extreme temperature (e.g., due to climate change) is particularly dangerous for bee health. Anticipating such extremities would allow beekeepers to take early preventive action. Thus, given sensor (temperature) time series data from beehives, how can we find patterns and do forecasting? Forecasting is crucial as it helps spot unexpected behavior and thus issue warnings to the beekeepers. In that case, what are the right models for forecasting? ARIMA, RNNs, or something else?
We propose the EBV (Electronic Bee-Veterinarian) method, which has the following desirable properties: (i) principled: it is based on a) diffusion equations from physics and b) control theory for feedback-loop controllers; (ii) effective: it works well on multiple, real-world time sequences, (iii) explainable: it needs only a handful of parameters (e.g., bee strength) that beekeepers can easily understand and trust, and (iv) scalable: it performs linearly in time. We applied our method to multiple real-world time sequences, and found that it yields accurate forecasting (up to 49% improvement in RMSE compared to baselines), and segmentation. Specifically, discontinuities detected by EBV mostly coincide with domain expert's opinions, showcasing our approach's potential and practical feasibility. Moreover, EBV is scalable and fast, taking about 20 minutes on a stock laptop for reconstructing two months of sensor data.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Reachability and Top-k Reachability Queries with Transfer Decay
Authors:
Elena V. Strzheletska,
Vassilis J. Tsotras
Abstract:
The prevalence of location tracking systems has resulted in large volumes of spatiotemporal data generated every day. Addressing reachability queries on such datasets is important for a wide range of applications (surveillance, public health, social networks, etc.) A spatiotemporal reachability query identifies whether a physical item (or information etc.) could have been transferred from the sour…
▽ More
The prevalence of location tracking systems has resulted in large volumes of spatiotemporal data generated every day. Addressing reachability queries on such datasets is important for a wide range of applications (surveillance, public health, social networks, etc.) A spatiotemporal reachability query identifies whether a physical item (or information etc.) could have been transferred from the source object $O_S$ to the target object $O_T$ during a time interval $I$ (either directly, or through a chain of intermediate transfers). In previous research on spatiotemporal reachability queries, the number of such transfers is not limited, and the weight of a piece of transferred information remains the same. This paper introduces novel reachability queries, which assume a scenario of information decay. Such queries arise when the value of information that travels through the chain of intermediate objects decreases with each transfer. To address such queries efficiently over large spatiotemporal datasets, we introduce the RICCdecay algorithm. Further, the decay scenario leads to an important extension: if there are many different sources of information, the aggregate value of information an object can obtain varies. As a result, we introduce a top-k reachability problem, identifying the k objects with the highest accumulated information. We also present the RICCtopK algorithm that can efficiently compute top-k reachability with transfer decay queries. An experimental evaluation shows the efficiency of the proposed algorithms over previous approaches.
△ Less
Submitted 18 May, 2021;
originally announced May 2021.
-
Bridging BAD Islands: Declarative Data Sharing at Scale
Authors:
Xikui Wang,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In…
▽ More
In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Based on that, we introduce a new mechanism for enabling the sharing of Big Data at scale declaratively so that developers can easily create and provide data sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.
△ Less
Submitted 5 January, 2021;
originally announced January 2021.
-
Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems
Authors:
Christina Pavlopoulou,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
Query Optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate because of filtering conditions caused either from undetected correlations between multiple predicates local to a single datas…
▽ More
Query Optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate because of filtering conditions caused either from undetected correlations between multiple predicates local to a single dataset, predicates with query parameters, or predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate those settings, thus leading to suboptimal execution plans. Given the volume of today's data, a suboptimal plan can quickly become very inefficient.
In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created from each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimations, thus leading to much better execution plans. While it introduces the overhead for materializing these intermediate results, our experiments show that this overhead is relatively small and it is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans as compared to the current default AsterixDB plans as well as to plans produced by static cost-based optimization (i.e. based on the initial dataset statistics) and other state-of-the-art approaches.
△ Less
Submitted 5 October, 2020; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Subscribing to Big Data at Scale
Authors:
Xikui Wang,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisf…
▽ More
Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, users need either to heavily customize an existing passive Big Data system or to glue multiple systems together. Either choice would require significant effort from users and incur additional overhead. In this paper, we present the BAD (Big Active Data) system, which is designed to preserve the merits of passive Big Data systems and introduce new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system's performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a "glued" system.
△ Less
Submitted 9 September, 2020;
originally announced September 2020.
-
BAD to the Bone: Big Active Data at its Core
Authors:
Steven Jacobs,
Xikui Wang,
Michael J. Carey,
Vassilis J. Tsotras,
Md Yusuf Sarwar Uddin
Abstract:
Virtually all of today's Big Data systems are passive in nature, responding to queries posted by their users. Instead, we are working to shift Big Data platforms from passive to active. In our view, a Big Active Data (BAD) system should continuously and reliably capture Big Data while enabling timely and automatic delivery of relevant information to a large pool of interested users, as well as sup…
▽ More
Virtually all of today's Big Data systems are passive in nature, responding to queries posted by their users. Instead, we are working to shift Big Data platforms from passive to active. In our view, a Big Active Data (BAD) system should continuously and reliably capture Big Data while enabling timely and automatic delivery of relevant information to a large pool of interested users, as well as supporting retrospective analyses of historical information. While various scalable streaming query engines have been created, their active behavior is limited to a (relatively) small window of the incoming data. To this end we have created a BAD platform that combines ideas and capabilities from both Big Data and Active Data (e.g., Publish/Subscribe, Streaming Engines). It supports complex subscriptions that consider not only newly arrived items but also their relationships to past, stored data. Further, it can provide actionable notifications by enriching the subscription results with other useful data. Our platform extends an existing open-source Big Data Management System, Apache AsterixDB, with an active toolkit. The toolkit contains features to rapidly ingest semistructured data, share execution pipelines among users, manage scaled user data subscriptions, and actively monitor the state of the data to produce individualized information for each user. This paper describes the features and design of our current BAD data platform and demonstrates its ability to scale without sacrificing query capabilities or result individualization.
△ Less
Submitted 23 May, 2020; v1 submitted 22 February, 2020;
originally announced February 2020.
-
Apache VXQuery: A Scalable XQuery Implementation
Authors:
E. Preston Carman Jr.,
Till Westmann,
Vinayak R. Borkar,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache VXQuery, an open-source scalable XQuery processor. The system builds upon two other open-source frameworks -- Hyracks, a parallel execution engine, and Algebricks, a…
▽ More
The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache VXQuery, an open-source scalable XQuery processor. The system builds upon two other open-source frameworks -- Hyracks, a parallel execution engine, and Algebricks, a language agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specifics (data model, data-model dependent functions and optimizations, and a parser). We describe the architecture of Apache VXQuery, its integration with Hyracks and Algebricks, and the XQuery optimization rules applied to the query plan to improve path expression efficiency and to enable query parallelism. An experimental evaluation using a real 500GB dataset with various selection, aggregation and join XML queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up. Our experiments show that it is about 3x faster than Saxon (an open-source and commercial XQuery processor) on a 4-core, single node implementation, and around 2.5x faster than Apache MRQL (a MapReduce-based parallel query processor) on an eight (4-core) node cluster.
△ Less
Submitted 1 April, 2015;
originally announced April 2015.
-
Revisiting Aggregation for Data Intensive Applications: A Performance Study
Authors:
Jian Wen,
Vinayak R. Borkar,
Michael J. Carey,
Vassilis J. Tsotras
Abstract:
Aggregation has been an important operation since the early days of relational databases. Today's Big Data applications bring further challenges when processing aggregation queries, demanding adaptive aggregation algorithms that can process large volumes of data relative to a potentially limited memory budget (especially in multiuser settings). Despite its importance, the design and evaluation of…
▽ More
Aggregation has been an important operation since the early days of relational databases. Today's Big Data applications bring further challenges when processing aggregation queries, demanding adaptive aggregation algorithms that can process large volumes of data relative to a potentially limited memory budget (especially in multiuser settings). Despite its importance, the design and evaluation of aggregation algorithms has not received the same attention that other basic operators, such as joins, have received in the literature. As a result, when considering which aggregation algorithm(s) to implement in a new parallel Big Data processing platform (AsterixDB), we faced a lack of "off the shelf" answers that we could simply read about and then implement based on prior performance studies.
In this paper we revisit the engineering of efficient local aggregation algorithms for use in Big Data platforms. We discuss the salient implementation details of several candidate algorithms and present an in-depth experimental performance study to guide future Big Data engine developers. We show that the efficient implementation of the aggregation operator for a Big Data platform is non-trivial and that many factors, including memory usage, spilling strategy, and I/O and CPU cost, should be considered. Further, we introduce precise cost models that can help in choosing an appropriate algorithm based on input parameters including memory budget, grouping key cardinality, and data skew.
△ Less
Submitted 31 October, 2013;
originally announced November 2013.
-
On The Spatiotemporal Burstiness of Terms
Authors:
Theodoros Lappas,
Marcos R. Vieira,
Dimitrios Gunopulos,
Vassilis J. Tsotras
Abstract:
Thousands of documents are made available to the users via the web on a daily basis. One of the most extensively studied problems in the context of such document streams is burst identification. Given a term t, a burst is generally exhibited when an unusually high frequency is observed for t. While spatial and temporal burstiness have been studied individually in the past, our work is the first to…
▽ More
Thousands of documents are made available to the users via the web on a daily basis. One of the most extensively studied problems in the context of such document streams is burst identification. Given a term t, a burst is generally exhibited when an unusually high frequency is observed for t. While spatial and temporal burstiness have been studied individually in the past, our work is the first to simultaneously track and measure spatiotemporal term burstiness. In addition, we use the mined burstiness information toward an efficient document-search engine: given a user's query of terms, our engine returns a ranked list of documents discussing influential events with a strong spatiotemporal impact. We demonstrate the efficiency of our methods with an extensive experimental evaluation on real and synthetic datasets.
△ Less
Submitted 30 May, 2012;
originally announced May 2012.