-
Transactional Cloud Applications: Status Quo, Challenges, and Opportunities
Authors:
Rodrigo Laigner,
George Christodoulou,
Kyriakos Psarakis,
Asterios Katsifodimos,
Yongluan Zhou
Abstract:
Transactional cloud applications such as payment, booking, reservation systems, and complex business workflows are currently being rewritten for deployment in the cloud. This migration to the cloud is happening mainly for reasons of cost and scalability. Over the years, application developers have used different migration approaches, such as microservice frameworks, actors, and stateful dataflow s…
▽ More
Transactional cloud applications such as payment, booking, reservation systems, and complex business workflows are currently being rewritten for deployment in the cloud. This migration to the cloud is happening mainly for reasons of cost and scalability. Over the years, application developers have used different migration approaches, such as microservice frameworks, actors, and stateful dataflow systems.
The migration to the cloud has brought back data management challenges traditionally handled by database management systems. Those challenges include ensuring state consistency, maintaining durability, and managing the application lifecycle. At the same time, the shift to a distributed computing infrastructure introduced new issues, such as message delivery, task scheduling, containerization, and (auto)scaling.
Although the data management community has made progress in developing analytical and transactional database systems, transactional cloud applications have received little attention in database research. This tutorial aims to highlight recent trends in the area and discusses open research challenges for the data management community.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows
Authors:
George Siachamis,
Kyriakos Psarakis,
Marios Fragkoulis,
Arie van Deursen,
Paris Carbone,
Asterios Katsifodimos
Abstract:
Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated chec…
▽ More
Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints - an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind this prevalence of the coordinated approach remain anecdotal, as reported by practitioners of the stream processing community. At the same time, common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored.
This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) by investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted for the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive to the coordinated one in uniformly distributed workloads, but it also outperforms the coordinated approach in skewed workloads. We conclude that rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Styx: Transactional Stateful Functions on Streaming Dataflows
Authors:
Kyriakos Psarakis,
George Christodoulou,
George Siachamis,
Marios Fragkoulis,
Asterios Katsifodimos
Abstract:
Developing stateful cloud applications, such as low-latency workflows and microservices with strict consistency requirements, remains arduous for programmers. The Stateful Functions-as-a-Service (SFaaS) paradigm aims to serve these use cases. However, existing approaches provide weak transactional guarantees or perform expensive external state accesses requiring inefficient transactional protocols…
▽ More
Developing stateful cloud applications, such as low-latency workflows and microservices with strict consistency requirements, remains arduous for programmers. The Stateful Functions-as-a-Service (SFaaS) paradigm aims to serve these use cases. However, existing approaches provide weak transactional guarantees or perform expensive external state accesses requiring inefficient transactional protocols that increase execution latency.
In this paper, we present Styx, a novel dataflow-based SFaaS runtime that executes serializable transactions consisting of stateful functions that form arbitrary call-graphs with exactly-once guarantees. Styx extends a deterministic transactional protocol by contributing: i) a function acknowledgment scheme to determine transaction boundaries required in SFaaS workloads, ii) a function-execution caching mechanism, and iii) an early-commit reply mechanism that substantially reduces transaction execution latency. Experiments with the YCSB, TPC-C, and Deathstar benchmarks show that Styx outperforms state-of-the-art approaches by achieving at least one order of magnitude higher throughput while exhibiting near-linear scalability and low latency.
△ Less
Submitted 6 February, 2025; v1 submitted 11 December, 2023;
originally announced December 2023.
-
SiMa: Effective and Efficient Matching Across Data Silos Using Graph Neural Networks
Authors:
Christos Koutras,
Rihan Hai,
Kyriakos Psarakis,
Marios Fragkoulis,
Asterios Katsifodimos
Abstract:
How can we leverage existing column relationships within silos, to predict similar ones across silos? Can we do this efficiently and effectively? Existing matching approaches do not exploit prior knowledge, relying on prohibitively expensive similarity computations. In this paper we present the first technique for matching columns across data silos, called SiMa, which leverages Graph Neural Networ…
▽ More
How can we leverage existing column relationships within silos, to predict similar ones across silos? Can we do this efficiently and effectively? Existing matching approaches do not exploit prior knowledge, relying on prohibitively expensive similarity computations. In this paper we present the first technique for matching columns across data silos, called SiMa, which leverages Graph Neural Networks (GNNs) to learn from existing column relationships within data silos, and dataset-specific profiles. The main novelty of SiMa is its ability to be trained incrementally on column relationships within each silo individually, without requiring the consolidation of all datasets in a single place. Our experiments show that SiMa is more effective than the - otherwise inapplicable to the setting of silos - state-of-the-art matching methods, while requiring orders of magnitude less computational resources. Moreover, we demonstrate that SiMa considerably outperforms other state-of-the-art column representation learning methods.
△ Less
Submitted 3 March, 2024; v1 submitted 25 June, 2022;
originally announced June 2022.
-
Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows
Authors:
Kyriakos Psarakis,
Wouter Zorgdrager,
Marios Fragkoulis,
Guido Salvaneschi,
Asterios Katsifodimos
Abstract:
Although the cloud has reached a state of robustness, the burden of using its resources falls on the shoulders of programmers who struggle to keep up with ever-growing cloud infrastructure services and abstractions. As a result, state management, scaling, operation, and failure management of scalable cloud applications, require disproportionately more effort than developing the applications' actua…
▽ More
Although the cloud has reached a state of robustness, the burden of using its resources falls on the shoulders of programmers who struggle to keep up with ever-growing cloud infrastructure services and abstractions. As a result, state management, scaling, operation, and failure management of scalable cloud applications, require disproportionately more effort than developing the applications' actual business logic.
Our vision aims to raise the abstraction level for programming scalable cloud applications by compiling stateful entities -- a programming model enabling imperative transactional programs authored in Python -- into stateful streaming dataflows. We propose a compiler pipeline that analyzes the abstract syntax tree of stateful entities and transforms them into an intermediate representation based on stateful dataflow graphs. It then compiles that intermediate representation into different dataflow engines, leveraging their exactly-once message processing guarantees to prevent state or failure management primitives from "leaking" into the level of the programming model. Preliminary experiments with a proof of concept implementation show that despite program transformation and translation to dataflows, stateful entities can perform at sub-100ms latency even for transactional workloads.
△ Less
Submitted 3 September, 2023; v1 submitted 17 November, 2021;
originally announced December 2021.
-
Valentine: Evaluating Matching Techniques for Dataset Discovery
Authors:
Christos Koutras,
George Siachamis,
Andra Ionescu,
Kyriakos Psarakis,
Jerry Brons,
Marios Fragkoulis,
Christoph Lofi,
Angela Bonifati,
Asterios Katsifodimos
Abstract:
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of sche…
▽ More
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.
△ Less
Submitted 13 February, 2021; v1 submitted 14 October, 2020;
originally announced October 2020.
-
Multi-label Classification for Automatic Tag Prediction in the Context of Programming Challenges
Authors:
Bianca Iancu,
Gabriele Mazzola,
Kyriakos Psarakis,
Panagiotis Soilis
Abstract:
One of the best ways for developers to test and improve their skills in a fun and challenging way are programming challenges, offered by a plethora of websites. For the inexperienced ones, some of the problems might appear too challenging, requiring some suggestions to implement a solution. On the other hand, tagging problems can be a tedious task for problem creators. In this paper, we focus on a…
▽ More
One of the best ways for developers to test and improve their skills in a fun and challenging way are programming challenges, offered by a plethora of websites. For the inexperienced ones, some of the problems might appear too challenging, requiring some suggestions to implement a solution. On the other hand, tagging problems can be a tedious task for problem creators. In this paper, we focus on automating the task of tagging a programming challenge description using machine and deep learning methods. We observe that the deep learning methods implemented outperform well-known IR approaches such as tf-idf, thus providing a starting point for further research on the task.
△ Less
Submitted 27 November, 2019;
originally announced November 2019.