An Evaluation of N-Gram Selection Strategies for Regular Expression Indexing in Contemporary Text Analysis Tasks

Authors: Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam

Abstract: Efficient evaluation of regular expressions (regex, for short) is crucial for text analysis, and n-gram indexes are fundamental to achieving fast regex evaluation performance. However, these indexes face scalability challenges because of the exponential number of possible n-grams that must be indexed. Many existing selection strategies, developed decades ago, have not been rigorously evaluated on contemporary large-scale workloads and lack comprehensive performance comparisons. Therefore, a unified and comprehensive evaluation framework is necessary to compare these methods under the same experimental settings. This paper presents the first systematic evaluation of three representative n-gram selection strategies across five workloads, including real-time production logs and genomic sequence analysis. We examine their trade-offs in terms of index construction time, storage overhead, false positive rates, and end-to-end query performance. Through empirical results, this study provides a modern perspective on existing n-gram based regular expression evaluation methods, extensive observations, valuable discoveries, and an adaptable testing framework to guide future research in this domain. We make our implementations of these methods and our test framework available as open-source at https://github.com/mush-zhang/RegexIndexComparison.

Submitted 16 April, 2025; originally announced April 2025.

arXiv:2406.07847 [pdf, other]

Output-sensitive Conjunctive Query Evaluation

Authors: Shaleen Deep, Hangdong Zhao, Austen Z. Fan, Paraschos Koutris

Abstract: Join evaluation is one of the most fundamental operations performed by database systems and arguably the most well-studied problem in the Database community. A staggering number of join algorithms have been developed, and commercial database engines use finely tuned join heuristics that take into account many factors including the selectivity of predicates, memory, IO, etc. However, most of the results have catered to either full join queries or non-full join queries but with degree constraints (such as PK-FK relationships) that make joins easier to evaluate. Further, most of the algorithms are also not output-sensitive. In this paper, we present a novel, output-sensitive algorithm for the evaluation of acyclic Conjunctive Queries (CQs) that contain arbitrary free variables. Our result is based on a novel generalization of the Yannakakis algorithm and shows that it is possible to improve the running time guarantee of the Yannakakis algorithm by a polynomial factor. Importantly, our algorithmic improvement does not depend on the use of fast matrix multiplication, as a recently proposed algorithm does. The upper bound is complemented with matching lower bounds conditioned on two variants of the $k$-clique conjecture. The application of our algorithm recovers known prior results and improves on known state-of-the-art results for common queries such as paths and stars.

Submitted 23 October, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: 24 pages, accepted to PODS'2025

arXiv:2403.12436 [pdf, ps, other]

Evaluating Datalog over Semirings: A Grounding-based Approach

Authors: Hangdong Zhao, Shaleen Deep, Paraschos Koutris, Sudeepa Roy, Val Tannen

Abstract: Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $σ$, what is the tightest possible runtime? To this end, our main contribution is a general two-phase framework for analyzing the data complexity of Datalog over $σ$: first ground the program into an equivalent system of polynomial equations (i.e. grounding) and then find the least fixpoint of the grounding over $σ$. We present algorithms that use structure-aware query evaluation techniques to obtain the smallest possible groundings. Next, efficient algorithms for fixpoint evaluation are introduced over two classes of semirings: (1) finite-rank semirings and (2) absorptive semirings of total order. Combining both phases, we obtain state-of-the-art and new algorithmic results. Finally, we complement our results with a matching fine-grained lower bound.

Submitted 19 March, 2024; originally announced March 2024.

Comments: To appear at PODS 2024

arXiv:2311.04824 [pdf, other]

Multi-Relational Algebra and Its Applications to Data Insights

Authors: Xi Wu, Zichen Zhu, Xiangyao Yu, Shaleen Deep, Stratis Viglas, John Cieslewicz, Somesh Jha, Jeffrey F. Naughton

Abstract: A range of data insight analytical tasks involves analyzing a large set of tables of different schemas, possibly induced by various groupings, to find salient patterns. This paper presents Multi-Relational Algebra, an extension of the classic Relational Algebra, to facilitate such transformations and their compositions. Multi-Relational Algebra has two main characteristics: (1) Information Unit. The information unit is a slice $(r, X)$, where $r$ is a (region) tuple, and $X$ is a (feature) table. Specifically, a slice can encompass multiple columns, which surpasses the information unit of "a single tuple" or "a group of tuples of one column" in the classic relational algebra. (2) Schema Flexibility. Slices can have varying schemas, not constrained to a single schema. This flexibility further expands the expressive power of the algebra. Through various examples, we show that multi-relational algebra can effortlessly express many complex analytic problems, some of which are beyond the scope of traditional relational analytics. We have implemented and deployed a service for multi-relational analytics. Due to a unified logical design, we are able to conduct systematic optimization for a variety of seemingly different tasks. Our service has garnered interest from numerous internal teams who have developed data-insight applications using it, and serves millions of operators daily.

Submitted 29 September, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

arXiv:2310.00815 [pdf]

ReAcTable: Enhancing ReAct for Table Question Answering

Authors: Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, Jignesh M. Patel

Abstract: Table Question Answering (TQA) presents a substantial challenge at the intersection of natural language processing and data analytics. This task involves answering natural language (NL) questions on top of tabular data, demanding proficiency in logical reasoning, understanding of data semantics, and fundamental analytical capabilities. Due to its significance, a substantial volume of research has been dedicated to exploring a wide range of strategies aimed at tackling this challenge, including approaches that leverage Large Language Models (LLMs) through in-context learning or Chain-of-Thought (CoT) prompting as well as approaches that train and fine-tune custom models. Nonetheless, a conspicuous gap exists in the research landscape, where there is limited exploration of how innovative foundational research, which integrates incremental reasoning with external tools in the context of LLMs, as exemplified by the ReAct paradigm, could potentially bring advantages to the TQA task. In this paper, we aim to fill this gap by introducing ReAcTable (ReAct for Table Question Answering tasks), a framework inspired by the ReAct paradigm that is carefully enhanced to address the challenges uniquely appearing in TQA tasks, such as interpreting complex data semantics, dealing with errors generated by inconsistent data, and generating intricate data transformations. ReAcTable relies on external tools, such as SQL and Python code executors, to progressively enhance the data by generating intermediate data representations, ultimately transforming it into a more accessible format for answering the questions with greater ease. We demonstrate that ReAcTable achieves remarkable performance even when compared to fine-tuned approaches. In particular, it outperforms the best prior result on the WikiTQ benchmark, achieving an accuracy of 68.0% without requiring training a new model or fine-tuning.

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2309.12436 [pdf, other]

Rapidash: Efficient Constraint Discovery via Rapid Verification

Authors: Zifan Liu, Shaleen Deep, Anna Fariha, Fotis Psallidas, Ashish Tiwari, Avrilia Floratou

Abstract: Denial Constraint (DC) is a well-established formalism that captures a wide range of integrity constraints commonly encountered, including candidate keys, functional dependencies, and ordering constraints, among others. Given their significance, there has been considerable research interest in achieving fast verification and discovery of exact DCs within the database community. Despite the significant advancements in the field, prior work exhibits notable limitations when confronted with large-scale datasets. The current state-of-the-art exact DC verification algorithm demonstrates a quadratic (worst-case) time complexity relative to the dataset's number of rows. In the context of DC discovery, existing methodologies rely on a two-step algorithm that commences with an expensive data structure-building phase, often requiring hours to complete even for datasets containing only a few million rows. Consequently, users are left without any insights into the DCs that hold on their dataset until this lengthy building phase concludes. In this paper, we introduce Rapidash, a comprehensive framework for DC verification and discovery. Our work makes a dual contribution. First, we establish a connection between orthogonal range search and DC verification. We introduce a novel exact DC verification algorithm that demonstrates near-linear time complexity, representing a theoretical improvement over prior work. Second, we propose an anytime DC discovery algorithm that leverages our novel verification algorithm to gradually provide DCs to users, eliminating the need for the time-intensive building phase observed in prior work. To validate the effectiveness of our algorithms, we conduct extensive evaluations on four large-scale production datasets. Our results reveal that our DC verification algorithm achieves up to 40 times faster performance compared to state-of-the-art approaches.

Submitted 21 September, 2023; originally announced September 2023.

Comments: comments and suggestions are welcome!

arXiv:2308.09284 [pdf, ps, other]

The Fine-Grained Complexity of CFL Reachability

Authors: Paraschos Koutris, Shaleen Deep

Abstract: Many problems in static program analysis can be modeled as the context-free language (CFL) reachability problem on directed labeled graphs. The CFL reachability problem can be generally solved in time $O(n^3)$, where $n$ is the number of vertices in the graph, with some specific cases that can be solved faster. In this work, we ask the following question: given a specific CFL, what is the exact exponent in the monomial of the running time? In other words, for which cases do we have linear, quadratic or cubic algorithms, and are there problems with intermediate runtimes? This question is inspired by recent efforts to classify classic problems in terms of their exact polynomial complexity, known as fine-grained complexity. Although recent efforts have shown some conditional lower bounds (mostly for the class of combinatorial algorithms), a general picture of the fine-grained complexity landscape for CFL reachability is missing. Our main contribution is lower bound results that pinpoint the exact running time of several classes of CFLs or specific CFLs under widely believed lower bound conjectures (Boolean Matrix Multiplication and $k$-Clique). We particularly focus on the family of Dyck-$k$ languages (which are strings with well-matched parentheses), a fundamental class of CFL reachability problems. We present new lower bounds for the case of sparse input graphs where the number of edges $m$ is the input parameter, a common setting in the database literature. For this setting, we show a cubic lower bound for Andersen's Pointer Analysis which significantly strengthens prior known results.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Appeared in POPL 2023. Please note the erratum on the first page

arXiv:2305.01598 [pdf, other]

From Words to Code: Harnessing Data for Program Synthesis from Natural Language

Authors: Anirudh Khatry, Joyce Cahoon, Jordan Henkel, Shaleen Deep, Venkatesh Emani, Avrilia Floratou, Sumit Gulwani, Vu Le, Mohammad Raza, Sherry Shi, Mukul Singh, Ashish Tiwari

Abstract: Creating programs to correctly manipulate data is a difficult task, as the underlying programming languages and APIs can be challenging to learn for many users who are not skilled programmers. Large language models (LLMs) demonstrate remarkable potential for generating code from natural language, but in the data manipulation domain, apart from the natural language (NL) description of the intended task, we also have the dataset on which the task is to be performed, or the "data context". Existing approaches have utilized data context in a limited way by simply adding relevant information from the input data into the prompts sent to the LLM. In this work, we utilize the available input data to execute the candidate programs generated by the LLMs and gather their outputs. We introduce semantic reranking, a technique to rerank the programs generated by LLMs based on three signals coming from the program outputs: (a) semantic filtering and well-formedness based score tuning: do programs even generate well-formed outputs, (b) semantic interleaving: how do the outputs from different candidates compare to each other, and (c) output-based score tuning: how do the outputs compare to outputs predicted for the same task. We provide theoretical justification for semantic interleaving. We also introduce temperature mixing, where we combine samples generated by LLMs using both high and low temperatures. We extensively evaluate our approach in three domains, namely databases (SQL), data science (Pandas) and business intelligence (Excel's Power Query M) on a variety of new and existing benchmarks. We observe substantial gains across domains, with improvements of up to 45% in top-1 accuracy and 34% in top-3 accuracy.

Submitted 3 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: 14 pages

arXiv:2304.06221 [pdf, ps, other]

doi 10.1145/3584372.3588675

Space-Time Tradeoffs for Conjunctive Queries with Access Patterns

Authors: Hangdong Zhao, Shaleen Deep, Paraschos Koutris

Abstract: In this paper, we investigate space-time tradeoffs for answering conjunctive queries with access patterns (CQAPs). The goal is to create a space-efficient data structure in an initial preprocessing phase and use it for answering (multiple) queries in an online phase. Previous work has developed data structures that trade off space usage for answering time for queries of practical interest, such as the path and triangle query. However, these approaches lack a comprehensive framework and are not generalizable. Our main contribution is a general algorithmic framework for obtaining space-time tradeoffs for any CQAP. Our framework builds upon the PANDA algorithm and tree decomposition techniques. We demonstrate that our framework captures all state-of-the-art tradeoffs that were independently produced for various queries. Further, we show surprising improvements over the state-of-the-art tradeoffs known in the existing literature for reachability queries.

Submitted 2 May, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

arXiv:2302.00120 [pdf, other]

Holistic Cube Analysis: A Query Framework for Data Insights

Authors: Xi Wu, Shaleen Deep, Joe Benassi, Fengan Li, Yaqi Zhang, Uyeong Jang, James Foster, Stella Kim, Yujing Sun, Long Nguyen, Stratis Viglas, Somesh Jha, John Cieslewicz, Jeffrey F. Naughton

Abstract: Many data insight questions can be viewed as searching in a large space of tables and finding important ones, where the notion of importance is defined in some ad hoc, user-defined manner. This paper presents Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for such problems. HoCA first augments the relational data model and introduces a new data type AbstractCube, defined as a function which maps a region-features pair to a relational table (a region is a tuple which specifies values of a set of dimensions). AbstractCube provides a logical form of data, and HoCA operators are cube-to-cube transformations. We describe two basic but fundamental HoCA operators, cube crawling and cube join (with many possible extensions). Cube crawling explores a region space, and outputs a cube that maps regions to signal vectors. Cube join, in turn, is critical for composition, allowing one to join information from different cubes for deeper analysis. Cube crawling introduces two novel programming features, (programmable) Region Analysis Models (RAMs) and Multi-Model Crawling. Crucially, RAM has a notion of population features, which allows one to go beyond only analyzing local features at a region, and program region-population analysis that compares region and population features, capturing a large class of importance notions. HoCA has a rich algorithmic space, such as optimizing crawling and join performance, and physical design of cubes. We have implemented and deployed HoCA at Google. Our early HoCA offering has attracted more than 30 teams building applications with it, across a diverse spectrum of fields including system monitoring, experimentation analysis, and business intelligence. For many applications, HoCA empowers novel and powerful analyses, such as instances of recurrent crawling, which are challenging to achieve otherwise.

Submitted 1 July, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

Comments: Establishing initial concepts of HoCA

arXiv:2201.05566 [pdf, other]

Ranked Enumeration of Join Queries with Projections

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: Join query evaluation with ordering is a fundamental data processing task in relational database management systems. SQL and custom graph query languages such as Cypher offer this functionality by allowing users to specify the order via the ORDER BY clause. In many scenarios, the users also want to see the first $k$ results quickly (expressed by the LIMIT clause), but the value of $k$ is not predetermined as user queries are arriving in an online fashion. Recent work has made considerable progress in identifying optimal algorithms for ranked enumeration of join queries that do not contain any projections. In this paper, we initiate the study of the problem of enumerating results in ranked order for queries with projections. Our main result shows that for any acyclic query, it is possible to obtain a near-linear (in the size of the database) delay algorithm after only a linear time preprocessing step for two important ranking functions: sum and lexicographic ordering. For a practical subset of acyclic queries known as star queries, we show an even stronger result that allows a user to obtain a smooth tradeoff between faster answering time guarantees using more preprocessing time. Our results are also extensible to queries containing cycles and unions. We also perform a comprehensive experimental evaluation to demonstrate that our algorithms, which are simple to implement, improve up to three orders of magnitude in the running time over state-of-the-art algorithms implemented within open-source RDBMS and specialized graph databases.

Submitted 22 January, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

Comments: Accepted at VLDB 2022. Comments and suggestions are always welcome

arXiv:2109.10889 [pdf, ps, other]

General Space-Time Tradeoffs via Relational Queries

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: In this paper, we investigate space-time tradeoffs for answering Boolean conjunctive queries. The goal is to create a data structure in an initial preprocessing phase and use it for answering (multiple) queries. Previous work has developed data structures that trade off space usage for answering time and has proved conditional space lower bounds for queries of practical interest such as the path and triangle query. However, most of these results cater to only those queries, lack a comprehensive framework, and are not generalizable. The isolated treatment of these queries also fails to utilize the connections with extensive research on related problems within the database community. The key insight in this work is to exploit the formalism of relational algebra by casting the problems as answering join queries over a relational database. Using the notion of boolean adorned queries and access patterns, we propose a unified framework that captures several widely studied algorithmic problems. Our main contribution is three-fold. First, we present an algorithm that recovers existing space-time tradeoffs for several problems. The algorithm is based on an application of the join size bound to capture the space usage of our data structure. We combine our data structure with query decomposition techniques to further improve the tradeoffs and show that it is readily extensible to queries with negation. Second, we falsify two proposed conjectures in the existing literature related to the space-time lower bound for path queries and triangle detection for which we show unexpectedly better algorithms. This result opens a new avenue for improving several algorithmic results that have so far been assumed to be (conditionally) optimal. Finally, we prove new conditional space-time lower bounds for star and path queries.

Submitted 13 August, 2023; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: Appeared in WADS 2023. Comments and suggestions are always welcome

arXiv:2101.03712 [pdf, other]

Enumeration Algorithms for Conjunctive Queries with Projection

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, for star queries, we design combinatorial algorithms that provide instance-specific delay guarantees in linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving tradeoff between preprocessing time and delay guarantees for enumeration of path queries that contain projections. Boolean matrix multiplication is an important query that can be expressed as a CQ with projection where the join attribute is projected away. Our results can therefore also be interpreted as sparse, output-sensitive matrix multiplication with delay guarantees.

Submitted 26 May, 2025; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: Accepted journal version for LMCS

arXiv:2011.05549 [pdf, other]

Comprehensive and Efficient Workload Compression

Authors: Shaleen Deep, Anja Gruenheid, Paraschos Koutris, Jeffrey Naughton, Stratis Viglas

Abstract: This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e., a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload representativity and coverage. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques.

Submitted 3 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

arXiv:2002.12459 [pdf, other]

Fast Join Project Query Evaluation using Matrix Multiplication

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: In the last few years, much effort has been devoted to developing join algorithms in order to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in developing succinct algorithms that achieve worst-case optimal runtime for full join queries, i.e., the join is over all variables present in the input database. However, not much is known about join evaluation with projections beyond some simple techniques of pushing down the projection operator in the query execution plan. Such queries have a large number of applications in entity matching, graph analytics and searching over compressed graphs. In this paper, we study how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication. Crucially, our algorithms are parameterized by the output size of the final result, allowing for choice of the best execution strategy. We implement our algorithms as a subroutine and compare the performance with state-of-the-art techniques to show they can be improved upon by as much as 50x. More importantly, our experiments indicate that matrix multiplication is a useful operation that can help speed up join processing owing to highly optimized open source libraries that are also highly parallelizable.

Submitted 27 February, 2020; originally announced February 2020.
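
The connection between join-project queries and matrix multiplication mentioned in the abstract above can be illustrated on the simplest case: joining R(x, y) with S(y, z) on y and projecting onto (x, z). The sketch below is only an illustration under stated assumptions, not the authors' algorithm (which, per the abstract, combines worst-case optimal joins with matrix multiplication and picks a strategy based on output size). The function names are hypothetical, and plain Python lists stand in for the optimized matrix libraries the abstract refers to.

```python
from collections import defaultdict


def join_project_hash(R, S):
    """Baseline: hash join of R(x, y) and S(y, z) on y, projected onto (x, z)."""
    s_by_y = defaultdict(set)
    for y, z in S:
        s_by_y[y].add(z)
    # The set comprehension removes duplicate (x, z) pairs produced by different y values.
    return {(x, z) for x, y in R for z in s_by_y.get(y, ())}


def join_project_matmul(R, S):
    """Same query, computed as a matrix product over the shared attribute y."""
    xs = sorted({x for x, _ in R})
    ys = sorted({y for _, y in R} | {y for y, _ in S})
    zs = sorted({z for _, z in S})
    x_pos = {x: i for i, x in enumerate(xs)}
    y_pos = {y: j for j, y in enumerate(ys)}
    z_pos = {z: k for k, z in enumerate(zs)}

    # A[i][j] = 1 iff (xs[i], ys[j]) is in R; B[j][k] = 1 iff (ys[j], zs[k]) is in S.
    A = [[0] * len(ys) for _ in xs]
    for x, y in R:
        A[x_pos[x]][y_pos[y]] = 1
    B = [[0] * len(zs) for _ in ys]
    for y, z in S:
        B[y_pos[y]][z_pos[z]] = 1

    # C[i][k] counts the y values connecting xs[i] to zs[k]; any positive count
    # means the pair (xs[i], zs[k]) belongs to the projected join result.
    C = [
        [sum(A[i][j] * B[j][k] for j in range(len(ys))) for k in range(len(zs))]
        for i in range(len(xs))
    ]
    return {
        (xs[i], zs[k])
        for i in range(len(xs))
        for k in range(len(zs))
        if C[i][k] > 0
    }
```

In this toy form the triple loop is no faster than the hash join; the potential benefit described in the abstract comes from replacing that product with a highly optimized, parallel matrix multiplication routine.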

arXiv:2002.02154 [pdf, other]

Related Tasks can Share! A Multi-task Framework for Affective language

Authors: Kumar Shikhar Deep, Md Shad Akhtar, Asif Ekbal, Pushpak Bhattacharyya

Abstract: Expressing the polarity of sentiment as 'positive' and 'negative' usually has limited scope compared with the intensity/degree of polarity. These two tasks (i.e. sentiment classification and sentiment intensity prediction) are closely related and may offer assistance to each other during the learning process. In this paper, we propose to leverage the relatedness of multiple tasks in a multi-task learning framework. Our multi-task model is based on a convolutional-Gated Recurrent Unit (GRU) framework, which is further assisted by a diverse hand-crafted feature set. Evaluation and analysis suggest that joint-learning of the related tasks in a multi-task framework can outperform each of the individual tasks in the single-task frameworks.

Submitted 6 February, 2020; originally announced February 2020.

Comments: 12 pages, 3 figures and 3 tables. Accepted in 20th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2019. To be published in Springer LNCS volume

ACM Class: I.2.7

arXiv:2001.10169 [pdf, other]

doi 10.1007/978-3-030-36718-3_34

A Deep Neural Framework for Contextual Affect Detection

Authors: Kumar Shikhar Deep, Asif Ekbal, Pushpak Bhattacharyya

Abstract: A short and simple text carrying no emotion can represent some strong emotions when read along with its context, i.e., the same sentence can express extreme anger as well as happiness depending on its context. In this paper, we propose a Contextual Affect Detection (CAD) framework which learns the inter-dependence of words in a sentence, and at the same time the inter-dependence of sentences in a dialogue. Our proposed CAD framework is based on a Gated Recurrent Unit (GRU), which is further assisted by contextual word embeddings and other diverse hand-crafted feature sets. Evaluation and analysis suggest that our model outperforms the state-of-the-art methods by 5.49% and 9.14% on the Friends and EmotionPush datasets, respectively.

Submitted 28 January, 2020; originally announced January 2020.

Comments: 12 pages, 5 tables and 3 figures. Accepted in ICONIP 2019 (International Conference on Neural Information Processing) Published in Lecture Notes in Computer Science, vol 11955. Springer, Cham https://link.springer.com/chapter/10.1007/978-3-030-36718-3_34

ACM Class: I.2.7

Journal ref: LNCS 11955 (2019) 398-409

arXiv:1909.00845 [pdf, other]

Revenue Maximization for Query Pricing

Authors: Shuchi Chawla, Shaleen Deep, Paraschos Koutris, Yifeng Teng

Abstract: Buying and selling of data online has increased substantially over the last few years. Several frameworks have already been proposed that study query pricing in theory and practice. The key guiding principle in these works is the notion of arbitrage-freeness where the broker can set different prices for different queries made to the dataset, but must ensure that the pricing function does not provide the buyers with opportunities for arbitrage. However, little is known about the revenue maximization aspect of query pricing. In this paper, we study the problem faced by a broker selling access to data with the goal of maximizing her revenue. We show that this problem can be formulated as a revenue maximization problem with single-minded buyers and unlimited supply, for which several approximation algorithms are known. We perform an extensive empirical evaluation of the performance of several pricing algorithms for the query pricing problem on real-world instances. In addition to previously known approximation algorithms, we propose several new heuristics and analyze them both theoretically and experimentally. Our experiments show that algorithms with the best theoretical bounds are not necessarily the best empirically. We identify algorithms and heuristics that are both fast and also provide consistently good performance when valuations are drawn from a variety of distributions.

Submitted 9 September, 2019; v1 submitted 2 September, 2019; originally announced September 2019.

Comments: To appear in PVLDB; version 2 with some cosmetic changes

arXiv:1903.00846 [pdf, other]

A survey of security and privacy issues in the Internet of Things from the layered context

Authors: Samundra Deep, Xi Zheng, Alireza Jolfaei, Dongjin Yu, Pouya Ostovari, Ali Kashif Bashir

Abstract: The Internet of Things (IoT) is a novel paradigm, which not only facilitates a large number of devices to be ubiquitously connected over the Internet but also provides a mechanism to remotely control these devices. The IoT is pervasive and is almost an integral part of our daily life. As devices are becoming increasingly connected, privacy and security issues become more and more critical and these need to be addressed on an urgent basis. IoT implementations and devices are eminently prone to threats that could compromise the security and privacy of the consumers, which, in turn, could influence its practical deployment. In the recent past, some research has been carried out to secure IoT devices with an intention to alleviate the security concerns of users. The purpose of this paper is to highlight the security and privacy issues in IoT systems. To this effect, the paper examines the security issues at each layer in the IoT protocol stack, identifies the underlying challenges and key security requirements and provides a brief overview of existing security solutions to safeguard the IoT from the layered context.

Submitted 24 February, 2020; v1 submitted 3 March, 2019; originally announced March 2019.

arXiv:1902.02698 [pdf, other]

doi 10.46298/lmcs-21(2:14)2025

Ranked Enumeration of Conjunctive Query Results

Authors: Shaleen Deep, Paraschos Koutris

Abstract: We study the problem of enumerating answers of Conjunctive Queries ranked according to a given ranking function. Our main contribution is a novel algorithm with small preprocessing time, logarithmic delay, and non-trivial space usage during execution. To allow for efficient enumeration, we exploit certain properties of ranking functions that frequently occur in practice. To this end, we introduce the notions of decomposable and compatible (w.r.t. a query decomposition) ranking functions, which allow for partial aggregation of tuple scores in order to efficiently enumerate the output. We complement the algorithmic results with lower bounds that justify why restrictions on the structure of ranking functions are necessary. Our results extend and improve upon a long line of work that has studied ranked enumeration from both a theoretical and practical perspective.

Submitted 15 May, 2025; v1 submitted 7 February, 2019; originally announced February 2019.

Journal ref: Logical Methods in Computer Science, Volume 21, Issue 2 (May 16, 2025) lmcs:8638

arXiv:1709.06186 [pdf, ps, other]

Compressed Representations of Conjunctive Query Results

Authors: Shaleen Deep, Paraschos Koutris

Abstract: Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. Motivated by this problem, we study the construction of space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting the efficient access of the intermediate compressed result for a given access pattern. In particular, we initiate the study of an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay for an access request over the result. Our main contribution is a novel parameterized data structure, which can be tuned to trade off space for answer time. The tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and the access pattern. We show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries.

Submitted 27 March, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: To appear in PODS'18; 35 pages; comments welcome

arXiv:1606.09376 [pdf, ps, other]

The Design of Arbitrage-Free Data Pricing Schemes

Authors: Shaleen Deep, Paraschos Koutris

Abstract: Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic construction. In this work, we present a formal framework for pricing queries over data that allows the construction of general families of pricing functions, with the main goal of avoiding arbitrage. We consider two types of pricing schemes: instance-independent schemes, where the price depends only on the structure of the query, and answer-dependent schemes, where the price also depends on the query output. Our main result is a complete characterization of the structure of pricing functions in both settings, by relating it to properties of a function over a lattice. We use our characterization, together with information-theoretic methods, to construct a variety of arbitrage-free pricing functions. Finally, we discuss various tradeoffs in the design space and present techniques for efficient computation of the proposed pricing functions.

Submitted 30 June, 2016; originally announced June 2016.

Comments: full paper

Showing 1–22 of 22 results for author: Deep, S