-
HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
Authors:
Mohammad S. Ahmad,
Zan A. Naeem,
Michaël Aupetit,
Ahmed Elmagarmid,
Mohamed Eltabakh,
Xiasong Ma,
Mourad Ouzzani,
Chaoyi Ruan
Abstract:
Tabular data embedded within PDF files, web pages, and other document formats are prevalent across numerous sectors such as government, engineering, science, and business. These human-centric tables (HCTs) possess a unique combination of high business value, intricate layouts, limited operational power at scale, and sometimes serve as the only data source for critical insights. However, their complexity poses significant challenges to traditional data extraction, processing, and querying methods. While current solutions focus on transforming these tables into relational formats for SQL queries, they fall short in handling the diverse and complex layouts of HCTs and hence in making them amenable to querying. This paper describes HCT-QA, an extensive benchmark of HCTs, natural language queries, and related answers on thousands of tables. Our dataset includes 2,188 real-world HCTs with 9,835 QA pairs and 4,679 synthetic tables with 67.5K QA pairs. While HCTs can potentially be processed by different types of query engines, in this paper, we focus on Large Language Models as potential engines and assess their ability in processing and querying such tables.
Submitted 9 March, 2025;
originally announced April 2025.
-
Measuring the Validity of Clustering Validation Datasets
Authors:
Hyeon Jeon,
Michaël Aupetit,
DongHwa Shin,
Aeri Cho,
Seokhyeon Park,
Jinwook Seo
Abstract:
Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM over different labeling of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation.
Submitted 2 March, 2025;
originally announced March 2025.
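The abstract above notes that internal validation measures such as Silhouette can compare cluster-label matching across different labelings of the same dataset. As a minimal sketch of that baseline only (the standard Silhouette coefficient, not the adjusted IVMs the paper proposes), the following pure-Python function scores a labeling; the function name and the toy data are illustrative assumptions.

```python
import math

def silhouette(points, labels):
    """Mean Silhouette coefficient of a labeling (standard definition)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for l, m in clusters.items() if l != lab)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
matching_labels = [0, 0, 0, 1, 1, 1]     # labels follow the two groups
mismatched_labels = [0, 1, 0, 1, 0, 1]   # labels cut across the groups
```

On this toy data the labeling that follows the two groups scores higher than the one that cuts across them, which is the within-dataset comparison the abstract describes; comparing scores across different datasets is what the plain measure is not designed for.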
-
Classes are not Clusters: Improving Label-based Evaluation of Dimensionality Reduction
Authors:
Hyeon Jeon,
Yun-Hsin Kuo,
Michaël Aupetit,
Kwan-Liu Ma,
Jinwook Seo
Abstract:
A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures, Label-Trustworthiness and Label-Continuity (Label-T&C), advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters.
Submitted 11 August, 2023; v1 submitted 1 August, 2023;
originally announced August 2023.
-
Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures
Authors:
Hyeon Jeon,
Michael Aupetit,
DongHwa Shin,
Aeri Cho,
Seokhyeon Park,
Jinwook Seo
Abstract:
We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, as this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can be used to quantify CLM within the same dataset to evaluate its different clusterings but are not designed to compare clusterings of different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes to generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating CLM of benchmark datasets before conducting external validation.
Submitted 20 September, 2022;
originally announced September 2022.
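For readers unfamiliar with the index that this abstract extends, here is a minimal sketch of the standard (within-dataset) Calinski-Harabasz index in pure Python. It is not the between-dataset generalization proposed in the paper; the function name and the toy data are illustrative assumptions.

```python
def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index: between-cluster dispersion over
    within-cluster dispersion, each normalized by its degrees of freedom."""
    n, dim = len(points), len(points[0])

    def centroid(indices):
        return [sum(points[i][d] for i in indices) / len(indices) for d in range(dim)]

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")

    overall = centroid(range(n))
    between = sum(len(idx) * sqdist(centroid(idx), overall) for idx in clusters.values())
    within = sum(sqdist(points[i], centroid(idx)) for idx in clusters.values() for i in idx)
    # Note: within == 0 (every point sits on its centroid) would divide by zero.
    return (between / (k - 1)) / (within / (n - k))

# Toy data (hypothetical): two compact groups, labeled consistently with them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
score = calinski_harabasz(points, labels)
```

Because the raw value depends on quantities such as the number of points and clusters, scores from different datasets are not directly comparable, which is the gap the abstract's between-dataset axioms address.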
-
ClassSPLOM -- A Scatterplot Matrix to Visualize Separation of Multiclass Multidimensional Data
Authors:
Michael Aupetit,
Ahmed Ali
Abstract:
In multiclass classification of multidimensional data, the user wants to build a model of the classes to predict the label of unseen data. The model is trained on the data and tested on unseen data with known labels to evaluate its quality. The results are visualized as a confusion matrix which shows how many data labels have been predicted correctly or confused with other classes. The multidimensional nature of the data prevents the direct visualization of the classes, so we design ClassSPLOM to give more perceptual insights about the classification results. It uses the Scatterplot Matrix (SPLOM) metaphor to visualize a Linear Discriminant Analysis projection of the data for each pair of classes and a set of Receiver Operating Characteristic curves to evaluate their trustworthiness. We illustrate ClassSPLOM on a use case in Arabic dialect identification.
Submitted 30 January, 2022;
originally announced January 2022.
-
Distortion-Aware Brushing for Interactive Cluster Analysis in Multidimensional Projections
Authors:
Hyeon Jeon,
Michael Aupetit,
Soohyun Lee,
Hyung-Kwon Ko,
Youngtaek Kim,
Jinwook Seo
Abstract:
Brushing is an everyday interaction in 2D scatterplots, which allows users to select and filter data points within a continuous, enclosed region and conduct further analysis on the points. However, such conventional brushing cannot be directly applied to Multidimensional Projections (MDP), as they hardly escape from False and Missing Neighbors distortions that make the relative positions of the points unreliable. To alleviate this problem, we introduce Distortion-aware brushing, a novel brushing technique for MDP. While users perform brushing, Distortion-aware brushing resolves distortions around currently brushed points by dynamically relocating points in the projection; the points whose data are close to the brushed data in the multidimensional (MD) space go near the corresponding brushed points in the projection, and the opposites move away. Hence, users can overcome distortions and readily extract clustered data in the MD space using the technique. We demonstrate the effectiveness and applicability of Distortion-aware brushing through usage scenarios with two datasets. Finally, by conducting user studies with 30 participants, we verified that Distortion-aware brushing significantly outperforms previous brushing techniques in precisely separating clusters in the MD space, and works robustly regardless of the types or the amount of distortions in MDP.
Submitted 17 January, 2022;
originally announced January 2022.
-
ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings
Authors:
Mostafa M. Abbas,
Ehsan Ullah,
Abdelkader Baggag,
Halima Bensmail,
Michael Sedlmair,
Michaël Aupetit
Abstract:
Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The numbers of initial mixture components and final combined groups. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.
Submitted 1 May, 2024; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Aquanims: Area-Preserving Animated Transitions in Statistical Data Graphics based on a Hydraulic Metaphor
Authors:
Michael Aupetit
Abstract:
We propose "aquanims" as new design metaphors for animated transitions that preserve displayed areas during the transformation. Animated transitions are used to facilitate understanding of graphical transformations between different visualizations. Area is key information to preserve during filtering or ordering transitions of area-based charts like bar charts, histograms, treemaps, or mosaic plots. As liquids are incompressible fluids, we use a hydraulic metaphor to convey the sense of area preservation during animated transitions: in aquanims, graphical objects can change shape, position, color, and even connectedness but not displayed area, as for a liquid contained in a transparent vessel or transferred between such vessels communicating through hidden pipes. We present various aquanims for product plots like bar charts and histograms to accommodate changes in data, in the ordering of bars or in a number of bins, and to provide animated tips. We also consider confusion matrices visualized as fluctuation diagrams and mosaic plots, and show how aquanims can be used to ease the understanding of different classification errors of real data.
Submitted 29 January, 2021;
originally announced January 2021.
-
An Enhanced MA Plot with R-Shiny to Ease Exploratory Analysis of Transcriptomic Data
Authors:
Ali Sheharyar,
Talar Boghos Yacoubian,
Dina Aljogol,
Borbala Mifsud,
Dena Al Thani,
Michael Aupetit
Abstract:
MA plots are used to analyze the genome-wide differences in gene expression between two distinct biological conditions. An MA plot is usually rendered as a static scatter plot. Our interview with 3 experts in genomics showed that we could improve the usability of this plot by adding interactive analytic features. In this work we present the design study of the enhanced MA plot.
Submitted 8 December, 2020;
originally announced December 2020.
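As background for the abstract above, an MA plot conventionally places the log ratio M on the vertical axis and the mean log intensity A on the horizontal axis. The sketch below computes these coordinates from two expression vectors using the usual definitions; it is not taken from the paper's R-Shiny tool, and the function name and the `pseudocount` parameter are illustrative assumptions.

```python
import math

def ma_values(condition_a, condition_b, pseudocount=1.0):
    """Return (M, A) per gene, with M = log2(a) - log2(b) and
    A = (log2(a) + log2(b)) / 2.

    `pseudocount` is a hypothetical convenience added to both values so that
    zero counts do not break the logarithm; set it to 0.0 for strictly
    positive inputs.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("both conditions must cover the same genes")
    result = []
    for a, b in zip(condition_a, condition_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        result.append((log_a - log_b, (log_a + log_b) / 2))
    return result

# A gene expressed at 8 in one condition and 2 in the other:
# M = 3 - 1 = 2 (a four-fold difference), A = (3 + 1) / 2 = 2.
example = ma_values([8.0], [2.0], pseudocount=0.0)
```

Plotting M against A as a scatter plot gives the static view that the abstract describes as the usual rendering.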
-
Aquanims -- Area-Preserving Animated Transitions based on a Hydraulic Metaphor
Authors:
Michael Aupetit
Abstract:
We propose "Aquanims" as new design metaphors for animated transitions that preserve displayed areas during the transformation. As liquids are incompressible fluids, we use a hydraulic metaphor to convey the sense of area preservation during animated transitions. We study the design space of Aquanims for rectangle-based charts.
Submitted 15 November, 2020;
originally announced November 2020.
-
Unsupervised User Stance Detection on Twitter
Authors:
Kareem Darwish,
Peter Stefanov,
Michaël Aupetit,
Preslav Nakov
Abstract:
We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our framework has three major advantages over pre-existing methods, which are based on supervised or semi-supervised classification. First, we do not require any prior labeling of users: instead, we create clusters, which are much easier to label manually afterwards, e.g., in a matter of seconds or minutes instead of hours. Second, there is no need for domain- or topic-level knowledge either to specify the relevant stances (labels) or to conduct the actual labeling. Third, our framework is robust in the face of data skewness, e.g., when some users or some stances have greater representation in the data. We experiment with different combinations of user similarity features, dataset sizes, dimensionality reduction methods, and clustering algorithms to ascertain the most effective and most computationally efficient combinations across three different datasets (in English and Turkish). We further verified our results on additional tweet sets covering six different controversial topics. Our best combination in terms of effectiveness and efficiency uses retweeted accounts as features, UMAP for dimensionality reduction, and Mean Shift for clustering, and yields a small number of high-quality user clusters, typically just 2-3, with more than 98% purity. The resulting user clusters can be used to train downstream classifiers. Moreover, our framework is robust to variations in the hyper-parameter values and also with respect to random initialization.
Submitted 21 May, 2020; v1 submitted 3 April, 2019;
originally announced April 2019.
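The abstract reports cluster quality as purity. As a minimal sketch, assuming the common definition of purity (the fraction of items that carry the majority label of their cluster), it can be computed as follows; the function name and the toy stance labels are illustrative and do not come from the paper's data.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, true_labels):
    """Fraction of items whose label is the majority label of their cluster."""
    if not cluster_ids or len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)

# Hypothetical example: six users in two clusters, with manually checked stances.
clusters = [0, 0, 0, 1, 1, 1]
stances = ["pro", "pro", "anti", "anti", "anti", "anti"]
score = purity(clusters, stances)  # (2 + 3) / 6
```

Purity rewards clusters dominated by one stance, but it also rises trivially as the number of clusters grows, so it is most informative when the number of clusters is small, as in the 2-3 clusters the abstract reports.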
-
A Twitter Tale of Three Hurricanes: Harvey, Irma, and Maria
Authors:
Firoj Alam,
Ferda Ofli,
Muhammad Imran,
Michael Aupetit
Abstract:
People increasingly use microblogging platforms such as Twitter during natural disasters and emergencies. Research studies have revealed the usefulness of the data available on Twitter for several disaster response tasks. However, making sense of social media data is a challenging task due to several reasons such as limitations of available tools to analyze high-volume and high-velocity data streams. This work presents an extensive multidimensional analysis of textual and multimedia content from millions of tweets shared on Twitter during the three disaster events. Specifically, we employ various Artificial Intelligence techniques from Natural Language Processing and Computer Vision fields, which exploit different machine learning algorithms to process the data generated during the disaster events. Our study reveals the distributions of various types of useful information that can inform crisis managers and responders as well as facilitate the development of future automated systems for disaster management.
Submitted 15 May, 2018; v1 submitted 14 May, 2018;
originally announced May 2018.
-
Visualizing Dimensionality Reduction Artifacts: An Evaluation
Authors:
Nicolas Heulot,
Jean-Daniel Fekete,
Michael Aupetit
Abstract:
Multidimensional scaling allows visualizing high-dimensional data as 2D maps with the premise that insights in 2D reveal valid information in high dimensions. However, the resulting projections suffer from artifacts such as poor local neighborhood preservation and cluster tearing. Interactively coloring the projection according to the discrepancy between original proximities relative to a reference item reveals these artifacts, but it is not clear if conveying these proximities using color and displaying only local information really helps the visual analysis of projections. We conducted a controlled experiment to investigate the relevance of this interactive technique to help the visual analysis of any projection regardless of its quality. We compared the bare projection to the interactive coloring of the original proximities on different visual analysis tasks involving outliers and clusters. Results indicate that the interactive coloring is worthwhile for local tasks, as it is significantly robust to projection artifacts whereas the projection is not. However, this interactive technique does not help significantly for visual clustering tasks, for which projections already give a suitable overview.
Submitted 15 May, 2017;
originally announced May 2017.
-
Visualization of Wearable Data and Biometrics for Analysis and Recommendations in Childhood Obesity
Authors:
Michael Aupetit,
Luis Fernandez-Luque,
Meghna Singh,
Jaideep Srivastava
Abstract:
Obesity is one of the major health risk factors behind the rise of non-communicable conditions. Understanding the factors influencing obesity is very complex since there are many variables that can affect the health behaviors leading to it. Nowadays, multiple data sources can be used to study health behaviors, such as wearable sensors for physical activity and sleep, social media, mobile and health data. In this paper we describe the design of a dashboard for the visualization of actigraphy and biometric data from a childhood obesity camp in Qatar. This dashboard allows quantitative discoveries that can be used to guide patient behavior and orient qualitative research.
Submitted 10 May, 2017;
originally announced May 2017.
-
A new supervised non-linear mapping
Authors:
Sylvain Lespinats,
Anke Meyer-Baese,
Michael Aupetit
Abstract:
Supervised mapping methods project multi-dimensional labeled data onto a 2-dimensional space attempting to preserve both data similarities and topology of classes. Supervised mappings are expected to help the user to understand the underlying original class structure and to classify new data visually. Several methods have been designed to achieve supervised mapping, but many of them modify original distances prior to the mapping so that original data similarities are corrupted and even overlapping classes tend to be separated on the map, ignoring their original topology. We propose ClassiMap, an alternative method for supervised mapping. Mappings come with distortions which can be split between tears (close points mapped far apart) and false neighborhoods (points far apart mapped as neighbors). Some mapping methods favor the former while others favor the latter. ClassiMap switches between such mapping methods so that tears tend to appear between classes and false neighborhoods within classes, better preserving classes' topology. We also propose two new objective criteria instead of the usual subjective visual inspection to perform fair comparisons of supervised mapping methods. ClassiMap appears to be the best supervised mapping method according to these criteria in our experiments on synthetic and real datasets.
Submitted 9 March, 2012;
originally announced March 2012.
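The two distortion types named in this abstract, tears (close points mapped far apart) and false neighborhoods (far points mapped as neighbors), can be illustrated with a simple k-nearest-neighbor comparison. This is a rough proxy written for illustration, not ClassiMap's own formulation or its evaluation criteria; the function names, the choice of k, and the toy data are assumptions.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (Euclidean distance)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return set(order[:k])

def neighborhood_distortions(original, mapped, k):
    """Return (tears, false_neighbors) as sets of directed pairs (i, j).

    (i, j) is a tear when j is a k-neighbor of i in the original space but
    not on the map; it is a false neighbor when the reverse holds.
    """
    tears, false_neighbors = set(), set()
    for i in range(len(original)):
        high = knn(original, i, k)
        low = knn(mapped, i, k)
        tears |= {(i, j) for j in high - low}
        false_neighbors |= {(i, j) for j in low - high}
    return tears, false_neighbors

# Hypothetical 1D example: the map swaps the positions of the last two points.
original = [(0,), (1,), (3,), (10,)]
mapped = [(0,), (1,), (10,), (3,)]
tears, false_neighbors = neighborhood_distortions(original, mapped, k=1)
```

In this toy case points 2 and 3 lose their original nearest neighbors (tears) and gain new ones on the map (false neighbors), while points 0 and 1 are unaffected.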
-
Concerning the differentiability of the energy function in vector quantization algorithms
Authors:
Dominique Lepetz,
Max Nemoz-Gaillard,
Michael Aupetit
Abstract:
The adaptation rule for Vector Quantization algorithms, and consequently the convergence of the generated sequence, depends on the existence and properties of a function called the energy function, defined on a topological manifold. Our aim is to investigate the conditions of existence of such a function for a class of algorithms exemplified by the initial "K-means" and Kohonen algorithms. The results presented here supplement previous studies and show that the energy function is not always a potential but at least the uniform limit of a series of potential functions which we call a pseudo-potential. Our work also shows that a large number of existing vector quantization algorithms developed by the Artificial Neural Networks community fall into this category. The framework we define opens the way to study the convergence of all the corresponding adaptation rules at once, and a theorem gives promising insights in that direction. We also demonstrate that the "K-means" energy function is a pseudo-potential but not a potential in general. Consequently, the energy function associated with the "Neural-Gas" is not a potential in general.
Submitted 11 April, 2006;
originally announced April 2006.
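To make the notion of an energy function concrete, the sketch below evaluates the distortion commonly associated with K-means: half the sum, over data points, of the squared distance to the nearest prototype. The abstract does not state this formula, so the exact form (including the factor 1/2 and the absence of normalization) is an assumption, as are the function name and the toy data.

```python
def vq_energy(data, prototypes):
    """K-means style distortion: 0.5 * sum over data of the squared distance
    to the nearest prototype (assumed form, see the note above)."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return 0.5 * sum(min(sqdist(x, w) for w in prototypes) for x in data)

# Hypothetical 1D example: two prototypes quantizing three data points.
data = [(0.0,), (2.0,), (10.0,)]
prototypes = [(1.0,), (10.0,)]
energy = vq_energy(data, prototypes)  # 0.5 * (1 + 1 + 0)
```

The inner `min` makes the function piecewise in the prototype positions: which prototype is nearest to a data point changes as prototypes move, and questions of differentiability, such as those the title refers to, arise at those switches.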