Search | arXiv e-print repository

Introducing Schema Inference as a Scalable SQL Function [Extended Version]

Authors: Calvin Dani, Shiva Jahangiri, Thomas Hütter

Abstract: This paper introduces a novel approach to schema inference as an on-demand function integrated directly within a DBMS, targeting NoSQL databases where schema flexibility can create challenges. Unlike previous methods relying on external frameworks like Apache Spark, our solution enables schema inference as a SQL function, allowing users to infer schemas natively within the DBMS. Implemented in Apache AsterixDB, it performs schema discovery in two phases, local inference and global schema merging, leveraging internal resources for improved performance. Experiments with real-world datasets show a performance improvement of up to two orders of magnitude over external methods, enhancing usability and scalability.

Submitted 20 November, 2024; originally announced November 2024.

Comments: Extended version of EDBT 2025 submission
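
The two phases named in the abstract above (local inference, then global schema merging) can be illustrated with a minimal Python sketch. This is not the AsterixDB implementation: the schema representation (a dict with `types`, `fields`, and `items`) and the function names `infer`, `merge`, and `infer_schema` are assumptions made for illustration, and the sketch does not track optional fields or run in parallel.

```python
from functools import reduce

def infer(value):
    """Infer a schema for one JSON-like value (hypothetical representation)."""
    if isinstance(value, dict):
        fields = {k: infer(v) for k, v in value.items()}
        return {"types": {"object"}, "fields": fields, "items": None}
    if isinstance(value, list):
        # Merge the schemas of all elements into one item schema.
        items = reduce(merge, (infer(v) for v in value), None)
        return {"types": {"array"}, "fields": {}, "items": items}
    if value is None:
        name = "null"
    elif isinstance(value, bool):  # check bool before int: bool subclasses int
        name = "boolean"
    elif isinstance(value, (int, float)):
        name = "number"
    elif isinstance(value, str):
        name = "string"
    else:
        raise TypeError(f"unsupported value: {value!r}")
    return {"types": {name}, "fields": {}, "items": None}

def merge(a, b):
    """Merge two schemas; None stands for 'no schema seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    names = a["fields"].keys() | b["fields"].keys()
    fields = {k: merge(a["fields"].get(k), b["fields"].get(k)) for k in names}
    return {
        "types": a["types"] | b["types"],
        "fields": fields,
        "items": merge(a["items"], b["items"]),
    }

def infer_schema(partitions):
    """Phase 1: one local schema per partition. Phase 2: merge them globally."""
    local_schemas = [
        reduce(merge, (infer(record) for record in part), None)
        for part in partitions
    ]
    return reduce(merge, local_schemas, None)

# Two partitions whose records disagree on the type of "id".
partitions = [
    [{"id": 1, "name": "a"}],
    [{"id": "x", "tags": ["t"]}],
]
schema = infer_schema(partitions)
```

Because `merge` is associative in this sketch, the local schemas could be computed independently per partition before the global step, which is the property a two-phase design relies on.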

arXiv:2304.04817 [pdf]

FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)*

Authors: Konstantin Emil Thiel, Daniel Kocher, Nikolaus Augsten, Thomas Hütter, Willi Mann, Daniel Ulrich Schmitt

Abstract: Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include, e.g., process mining and anomaly detection. It comes with two user parameters (ε, MinPts) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfactory clusterings are found. However, existing solutions suffer from the following limitations: (a) Ineffective pruning of expensive neighborhood computations. (b) Approximate clustering, where objects are falsely labeled noise. (c) Restricted parameter tuning that is limited to ε whereas MinPts is constant, which reduces the explorable clusterings. (d) Inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexities of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.

Submitted 10 April, 2023; originally announced April 2023.
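
To make the roles of the two parameters (ε, MinPts) in the abstract above concrete, here is a minimal sketch of plain density-based clustering in the style of DBSCAN with brute-force neighborhood queries. It is not FINEX and uses no index; the function names `region_query` and `dbscan` and the sample points are assumptions chosen for illustration.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand from it
                queue.extend(j_neighbors)
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels_loose = dbscan(points, eps=1.5, min_pts=3)   # two clusters, one noise point
labels_strict = dbscan(points, eps=1.5, min_pts=4)  # every point becomes noise
```

Rerunning the function for each new setting repeats every neighborhood computation, which is the cost of interactive parameter exploration that the abstract refers to.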

arXiv:2201.08099 [pdf]

JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

Authors: Thomas Hütter, Nikolaus Augsten, Christoph M. Kirsch, Michael J. Carey, Chen Li

Abstract: The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold $τ$, retrieve all JSON documents that are within $τ$ from the query document. Due to its recursive definition, JSON data are naturally represented as trees. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document. This feature poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON documents. We develop an algorithm, called QuickJEDI, for computing JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and a highly effective upper bound based on tree sorting. Our algorithm for the upper bound runs in $O(n τ)$ time and $O(n + τ\log n)$ space, which substantially improves the previous best bound of $O(n^2)$ time and $O(n \log n)$ space (where $n$ is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.

Submitted 21 January, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: This is an extended version of an upcoming publication at ACM SIGMOD 2022. Please cite the original SIGMOD version

arXiv:physics/0402067 [pdf]

Ion energy balance during fast wave heating in TORE SUPRA

Authors: Thierry Hutter, Alain Becoulet, Jean-Pierre Coulon, Vincent Saoutic, Vincent Basiuk, G. T. Hoang

Abstract: Direct coupling of the fast magnetosonic wave to the electrons has been studied on the tokamak TORE SUPRA. Preliminary experiments were dedicated to optimising the scenario for Fast Wave Electron Heating (FWEH) and Current Drive (FWCD). In the first part, thermal kinetic and diamagnetic energy are compared when the fast wave is applied to the plasma in two different regimes: (1) the minority hydrogen heating scenario (ICRH), and (2) direct electron damping. The effects of ion resonant layers, marginally present in the plasma in the latter regime (FWEH), are then presented and discussed.

Submitted 13 February, 2004; originally announced February 2004.

Comments: 21st European Conference on Controlled Fusion and Plasma Physics, 1994

Showing 1–4 of 4 results for author: Hutter, T