Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks

Authors: Allaa Boutaleb, Bernd Amann, Hubert Naacke, Rafael Angarita

Abstract: Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: Accepted @ ACL 2025's Table Representation Learning Workshop (TRL)

arXiv:2306.02221 [pdf, other]

ATEM: A Topic Evolution Model for the Detection of Emerging Topics in Scientific Archives

Authors: Hamed Rahimi, Hubert Naacke, Camelia Constantin, Bernd Amann

Abstract: This paper presents ATEM, a novel framework for studying topic evolution in scientific archives. ATEM is based on dynamic topic modeling and dynamic graph embedding techniques that explore the dynamics of content and citations of documents within a scientific corpus. ATEM explores a new notion of contextual emergence for the discovery of emerging interdisciplinary research topics based on the dynamics of citation links in topic clusters. Our experiments show that ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP archive of over five million computer science articles.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.14587 [pdf, other]

Contextualized Topic Coherence Metrics

Authors: Hamed Rahimi, Jacob Louis Hoover, David Mimno, Hubert Naacke, Camelia Constantin, Bernd Amann

Abstract: The recent explosion in work on neural topic modeling has been criticized for optimizing automated topic evaluation metrics at the expense of actual meaningful topic identification. But human annotation remains expensive and time-consuming. We propose LLM-based methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC). We evaluate both a fully automated version as well as a semi-automated CTC that allows human-centered evaluation of coherence while maintaining the efficiency of automated methods. We evaluate CTC relative to five other metrics on six topic models and find that it outperforms automated topic coherence methods, works well on short documents, and is not susceptible to meaningless but high-scoring topics.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2302.01501 [pdf, other]

ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics

Authors: Hamed Rahimi, Hubert Naacke, Camelia Constantin, Bernd Amann

Abstract: This paper presents an algorithmic family of dynamic topic models called Aligned Neural Topic Models (ANTM), which combine novel data mining algorithms to provide a modular framework for discovering evolving topics. ANTM maintains the temporal continuity of evolving topics by extracting time-aware features from documents using advanced pre-trained Large Language Models (LLMs) and employing an overlapping sliding window algorithm for sequential document clustering. This overlapping sliding window algorithm identifies a different number of topics within each time frame and aligns semantically similar document clusters across time periods. This process captures emerging and fading trends across different periods and allows for a more interpretable representation of evolving topics. Experiments on four distinct datasets show that ANTM outperforms probabilistic dynamic topic models in terms of topic coherence and diversity metrics. Moreover, it improves the scalability and flexibility of dynamic topic models by being accessible and adaptable to different types of algorithms. Additionally, a Python package is developed for researchers and scientists who wish to study the trends and evolving patterns of topics in large-scale textual data.

Submitted 4 June, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

arXiv:2111.13927 [pdf, other]

Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries

Authors: Eric Simon, Bernd Amann, Rutian Liu, Stéphane Gançarski

Abstract: We present a comprehensive set of conditions and rules to control the correctness of aggregation queries within an interactive data analysis session. The goal is to extend self-service data preparation and BI tools to automatically detect semantically incorrect aggregate queries on analytic tables and views built by using the common analytic operations including filter, project, join, aggregate, union, difference, and pivot. We introduce aggregable properties to describe, for any attribute of an analytic table, which aggregation functions correctly aggregate the attribute along which sets of dimension attributes. These properties can also be used to formally identify attributes which are summarizable with respect to some aggregation function along a given set of dimension attributes. This is particularly helpful to detect incorrect aggregations of measures obtained through the use of non-distributive aggregation functions like average and count. We extend the notion of summarizability by introducing a new generalized summarizability condition to control the aggregation of attributes after any analytic operation. Finally, we define propagation rules which transform aggregable properties of the query input tables into new aggregable properties for the result tables, preserving summarizability and generalized summarizability.

Submitted 6 December, 2021; v1 submitted 27 November, 2021; originally announced November 2021.

Comments: 58 pages, 23 figures

arXiv:1907.00050 [pdf, other]

State-of-the-Art on Query & Transaction Processing Acceleration

Authors: Bernd Amann, Youry Khmelevsky, Gaetan Hains

Abstract: The vast amount of processing power and memory bandwidth provided by modern Graphics Processing Units (GPUs) makes them a platform for data-intensive applications. The database community identified GPUs as effective co-processors for data processing. In the past years, there were many approaches to make use of GPUs at different levels of a database system. In this Internal Technical Report, based on the [1] and some other research papers, we identify possible research areas at LIP6 for GPU-accelerated database management systems. We describe some key properties, typical challenges of GPU-aware database architectures, and identify major open challenges.

Submitted 26 June, 2019; originally announced July 2019.

Comments: 7 pages, 4 tables

arXiv:1610.06500 [pdf, other]

Continuous Top-k Queries over Real-Time Web Streams

Authors: Nelly Vouzoukidou, Bernd Amann, Vassilis Christophides

Abstract: The Web has become a large-scale real-time information system forcing us to revise both how to effectively assess relevance of information for a user and how to efficiently implement information retrieval and dissemination functionality. To increase information relevance, Real-time Web applications such as Twitter and Facebook extend content and social-graph relevance scores with "real-time" user generated events (e.g. re-tweets, replies, likes). To accommodate high arrival rates of information items and user events we explore a publish/subscribe paradigm in which we index queries and update their results on the fly each time a new item and relevant events arrive. In this setting, we need to process continuous top-k text queries combining both static and dynamic scores. To the best of our knowledge, this is the first work addressing how non-predictable, dynamic scores can be handled in a continuous top-k query setting.

Submitted 20 October, 2016; originally announced October 2016.

arXiv:1604.08903 [pdf, other]

SPARQL query processing with Apache Spark

Authors: Hubert Naacke, Olivier Curé, Bernd Amann

Abstract: The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are more and more confronted with various "big data" problems. Query processing is one of them and needs to be efficiently addressed with executions over scalable, highly available and fault tolerant frameworks. Data management systems requiring these properties are rarely built from scratch but are rather designed on top of an existing cluster computing engine. In this work, we consider the processing of SPARQL queries with Apache Spark. We propose and compare five different query processing approaches based on different join execution models and Spark components. A detailed experimentation, on real-world and synthetic data sets, emphasizes that two approaches tailored for the RDF data model outperform the other ones on all major query shapes, i.e., star, snowflake, chain and hybrid.

Submitted 3 November, 2016; v1 submitted 29 April, 2016; originally announced April 2016.

Comments: 13 pages (ACM 2 columns format), 10 figures

arXiv:1510.03409 [pdf, other]

LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs

Authors: Olivier Curé, Hubert Naacke, Tendry Randriamalala, Bernd Amann

Abstract: The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are more and more confronted with various "big data" problems. Query processing in the presence of inferences is one of them. For instance, to complete the answer set of SPARQL queries, RDF database systems evaluate semantic RDFS relationships (subPropertyOf, subClassOf) through time-consuming query rewriting algorithms or space-consuming data materialization solutions. To reduce the memory footprint and ease the exchange of large datasets, these systems generally apply a dictionary approach for compressing triple data sizes by replacing resource identifiers (IRIs), blank nodes and literals with integer values. In this article, we present a structured resource identification scheme using a clever encoding of concepts and property hierarchies for efficiently evaluating the main common RDFS entailment rules while minimizing triple materialization and query rewriting. We will show how this encoding can be computed by a scalable parallel algorithm and directly be implemented over the Apache Spark framework. The efficiency of our encoding scheme is emphasized by an evaluation conducted over both synthetic and real world datasets.

Submitted 12 October, 2015; originally announced October 2015.

Comments: 8 pages, 1 figure

arXiv:1507.02321 [pdf, other]

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Authors: Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Abstract: Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative and complementary distribution approaches. For achieving fair experimental results, we are using Apache Spark as a common parallel computing framework by rewriting the concerned algorithms using the Spark API. Spark provides guarantees in terms of fault tolerance, high availability and scalability which are essential in such systems. Our different implementations aim to highlight the fundamental implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and to some extent to query answering cost and performance. The presented measures are obtained by testing each system on one synthetic and one real-world data set over query workloads with differing characteristics and different partitioning constraints.

Submitted 8 July, 2015; originally announced July 2015.

Comments: 16 pages, 3 figures

Showing 1–10 of 10 results for author: Amann, B