A Plaque Test for Redundancies in Relational Data [Extended Version]
Authors:
Christoph Köhnen,
Stefan Klessinger,
Jens Zumbrägel,
Stefanie Scherzinger
Abstract:
Inspired by the visualization of dental plaque at the dentist's office, this article proposes a novel visualization of redundancies in relational data. Our approach is based on a well-principled information-theoretic framework that has so far seen limited practical application in systems and tools. In this framework, we quantify the information content (or entropy) of each cell in a relation instance given a set of functional dependencies. The entropy value signifies the likelihood of recovering the cell value based on the dependencies and the remaining tuples. By highlighting cells with lower entropy, we effectively visualize redundancies in the data. We present an initial prototype implementation and demonstrate that a straightforward approach is insufficient to handle practical problem sizes. To address this limitation, we propose several optimizations which we prove to be correct. In addition, we present a Monte Carlo approximation with a known error, enabling a computationally tractable analysis. By applying our visualization technique to real-world datasets, we showcase its potential. Our vision is to empower data analysts by directing their focus in data profiling toward pertinent redundancies, analogous to the diagnostic role of a plaque test at the dentist's office.
Submitted 2 September, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.