-
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Authors:
Junjie Xing,
Yeye He,
Mengyu Zhou,
Haoyu Dong,
Shi Han,
Lingjiao Chen,
Dongmei Zhang,
Surajit Chaudhuri,
H. V. Jagadish
Abstract:
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area.
In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
Submitted 5 June, 2025;
originally announced June 2025.
-
On the eternal non-Markovianity of qubit maps
Authors:
Vinayak Jagadish,
R. Srikanth
Abstract:
As is well known, unital Pauli maps can be eternally non-CP-divisible. In contrast, here we show that in the case of non-unital maps, eternal non-Markovianity in the non-unital part is ruled out. In the unital case, the eternal non-Markovianity can be obtained by a convex combination of two dephasing semigroups, but not all three of them. We study these results and the ramifications arising from them.
Submitted 12 January, 2025;
originally announced January 2025.
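The two-semigroup mixing claim in the abstract above can be checked numerically. The sketch below is not taken from the paper; it assumes the standard Pauli master equation dρ/dt = Σ_k γ_k(t)(σ_k ρ σ_k − ρ), and it assumes that the dephasing semigroup Φ_x leaves σ_x invariant and damps σ_y and σ_z by exp(−ct) (likewise Φ_y with x and y swapped). The function name and the parameters `a` (mixing weight) and `c` (dephasing constant) are illustrative choices.

```python
import math

def mixed_dephasing_rates(t, a=0.5, c=1.0):
    """Decay rates (gamma_x, gamma_y, gamma_z) of a*Phi_x + (1 - a)*Phi_y at time t.

    Assumed convention: Phi_x keeps sigma_x fixed and damps sigma_y, sigma_z
    by exp(-c*t); Phi_y does the same with the roles of x and y swapped.
    """
    e = math.exp(-c * t)
    # Eigenvalues of the mixed map on the Pauli operators.
    lam_x = a + (1 - a) * e
    lam_y = a * e + (1 - a)
    # (lam_z = e; its log-derivative is the constant used for s_z below.)
    # For the Pauli master equation, -(1/2) d/dt ln(lam_i) = gamma_j + gamma_k
    # with i, j, k all different.
    s_x = 0.5 * c * (1 - a) * e / lam_x   # gamma_y + gamma_z
    s_y = 0.5 * c * a * e / lam_y         # gamma_x + gamma_z
    s_z = 0.5 * c                         # gamma_x + gamma_y
    gamma_x = 0.5 * (s_y + s_z - s_x)
    gamma_y = 0.5 * (s_x + s_z - s_y)
    gamma_z = 0.5 * (s_x + s_y - s_z)
    return gamma_x, gamma_y, gamma_z

gx, gy, gz = mixed_dephasing_rates(1.0)
```

Under these assumptions, equal mixing gives constant positive rates γ_x = γ_y = c/4 and γ_z = −(c/4)·tanh(ct/2), which is negative for every t > 0. That is the sense in which the mixture is eternally non-CP-divisible.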
-
OpenForge: Probabilistic Metadata Integration
Authors:
Tianji Cong,
Fatemeh Nargesian,
Junjie Xing,
H. V. Jagadish
Abstract:
Modern data stores increasingly rely on metadata for enabling diverse activities such as data cataloging and search. However, metadata curation remains a labor-intensive task, and the broader challenge of metadata maintenance -- ensuring its consistency, usefulness, and freshness -- has been largely overlooked. In this work, we tackle the problem of resolving relationships among metadata concepts from disparate sources. These relationships are critical for creating clean, consistent, and up-to-date metadata repositories, and a central challenge for metadata integration.
We propose OpenForge, a two-stage prior-posterior framework for metadata integration. In the first stage, OpenForge exploits multiple methods, including fine-tuned large language models, to obtain prior beliefs about concept relationships. In the second stage, OpenForge refines these predictions by leveraging a Markov Random Field (MRF), a probabilistic graphical model. We formalize metadata integration as an optimization problem, where the objective is to identify the relationship assignments that maximize the joint probability of assignments. The MRF formulation allows OpenForge to capture prior beliefs while encoding critical relationship properties, such as transitivity, in probabilistic inference. Experiments on real-world datasets demonstrate the effectiveness and efficiency of OpenForge. On a use case of matching two metadata vocabularies, OpenForge outperforms GPT-4, the second-best method, by 25 F1-score points.
Submitted 12 December, 2024;
originally announced December 2024.
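A toy illustration of the prior-posterior idea in the abstract above: prior beliefs for three pairwise "equivalent" relationships are combined with a transitivity constraint, and the most probable consistent assignment is chosen. This is only a sketch under strong assumptions. It uses a hard constraint and brute-force enumeration over one triangle of concepts; the abstract does not specify OpenForge's actual potentials or inference algorithm, and the function name and keys are hypothetical.

```python
from itertools import product

def map_assignment(priors):
    """Most probable transitive assignment for one concept triangle.

    priors: dict with keys 'ab', 'bc', 'ac' giving the prior probability that
    each pair of concepts is equivalent (hypothetical stage-one outputs).
    """
    keys = ("ab", "bc", "ac")
    best, best_score = None, -1.0
    for bits in product((0, 1), repeat=3):
        # Transitivity: exactly two "equivalent" edges in a triangle is
        # inconsistent, because the third pair would have to be equivalent too.
        if sum(bits) == 2:
            continue
        score = 1.0
        for key, bit in zip(keys, bits):
            score *= priors[key] if bit else 1.0 - priors[key]
        if score > best_score:
            best, best_score = dict(zip(keys, bits)), score
    return best

result = map_assignment({"ab": 0.9, "bc": 0.8, "ac": 0.4})
```

Thresholding each prior independently at 0.5 would accept `ab` and `bc` but reject `ac`, which violates transitivity. The constrained search instead flips `ac` to equivalent, since that consistent assignment has the highest joint score.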
-
VecAug: Unveiling Camouflaged Frauds with Cohort Augmentation for Enhanced Detection
Authors:
Fei Xiao,
Shaofeng Cai,
Gang Chen,
H. V. Jagadish,
Beng Chin Ooi,
Meihui Zhang
Abstract:
Fraud detection presents a challenging task characterized by ever-evolving fraud patterns and scarce labeled data. Existing methods predominantly rely on graph-based or sequence-based approaches. While graph-based approaches connect users through shared entities to capture structural information, they remain vulnerable to fraudsters who can disrupt or manipulate these connections. In contrast, sequence-based approaches analyze users' behavioral patterns, offering robustness against tampering but overlooking the interactions between similar users. Inspired by cohort analysis in retention and healthcare, this paper introduces VecAug, a novel cohort-augmented learning framework that addresses these challenges by enhancing the representation learning of target users with personalized cohort information. To this end, we first propose a vector burn-in technique for automatic cohort identification, which retrieves a task-specific cohort for each target user. Then, to fully exploit the cohort information, we introduce an attentive cohort aggregation technique for augmenting target user representations. To improve the robustness of such cohort augmentation, we also propose a novel label-aware cohort neighbor separation mechanism to distance negative cohort neighbors and calibrate the aggregated cohort information. By integrating this cohort information with target user representations, VecAug enhances the modeling capacity and generalization capabilities of the model to be augmented. Our framework is flexible and can be seamlessly integrated with existing fraud detection models. We deploy our framework on e-commerce platforms and evaluate it on three fraud detection datasets, and results show that VecAug improves the detection performance of base models by up to 2.48\% in AUC and 22.5\% in R@P$_{0.9}$, outperforming state-of-the-art methods significantly.
Submitted 1 August, 2024;
originally announced August 2024.
-
CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics
Authors:
Qingpeng Cai,
Kaiping Zheng,
H. V. Jagadish,
Beng Chin Ooi,
James Yip
Abstract:
Cohort studies are of significant importance in the field of healthcare analysis. However, existing methods typically involve manual, labor-intensive, and expert-driven pattern definitions or rely on simplistic clustering techniques that lack medical relevance. Automating cohort studies with interpretable patterns has great potential to facilitate healthcare analysis but remains an unmet need in prior research efforts. In this paper, we propose a cohort auto-discovery model, CohortNet, for interpretable healthcare analysis, focusing on the effective identification, representation, and exploitation of cohorts characterized by medically meaningful patterns. CohortNet initially learns fine-grained patient representations by separately processing each feature, considering both individual feature trends and feature interactions at each time step. Subsequently, it classifies each feature into distinct states and employs a heuristic cohort exploration strategy to effectively discover substantial cohorts with concrete patterns. For each identified cohort, it learns comprehensive cohort representations with credible evidence through associated patient retrieval. Ultimately, given a new patient, CohortNet can leverage relevant cohorts with distinguished importance, which can provide a more holistic understanding of the patient's conditions. Extensive experiments on three real-world datasets demonstrate that it consistently outperforms state-of-the-art approaches and offers interpretable insights from diverse perspectives in a top-down fashion.
Submitted 20 June, 2024;
originally announced June 2024.
-
Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression
Authors:
Farima Fatahi Bayat,
Xin Liu,
H. V. Jagadish,
Lu Wang
Abstract:
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward the "truthful directions" previously learned for truth elicitation. However, applying these truthful directions with the same intensity fails to generalize across different query contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to each specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the limitations of one-size-fits-all intervention methods, maximizing truthfulness by reflecting the model's internal knowledge only when it is confident. Our code is available at https://github.com/launchnlp/LITO.
Submitted 6 June, 2024; v1 submitted 30 April, 2024;
originally announced May 2024.
-
Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities
Authors:
Mahdi Erfanian,
H. V. Jagadish,
Abolfazl Asudeh
Abstract:
The potential harms of the under-representation of minorities in training data, particularly in multi-modal settings, are a well-recognized concern. While there has been extensive effort in detecting such under-representation, resolution has remained a challenge. With recent advancements in generative AI, large language models and foundation models have emerged as versatile tools across various domains. In this paper, we propose Chameleon, a system that efficiently utilizes these tools to augment a data set with a minimal addition of synthetically generated tuples, in order to enhance the coverage of the under-represented groups. Our system follows a rejection sampling approach to ensure the generated tuples are of high quality and follow the underlying distribution. In order to minimize the rejection chance of the generated tuples, we propose multiple strategies for providing a guide for the foundation model. Our experimental results, in addition to confirming the efficiency of our proposed algorithms, illustrate the effectiveness of our approach, as the unfairness of the model in a downstream task dropped significantly after data repair using Chameleon.
Submitted 1 February, 2024;
originally announced February 2024.
-
Observatory: Characterizing Embeddings of Relational Tables
Authors:
Tianji Cong,
Madelon Hulsebos,
Zhenjie Sun,
Paul Groth,
H. V. Jagadish
Abstract:
Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
Submitted 27 January, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions
Authors:
Yin Lin,
Bolin Ding,
H. V. Jagadish,
Jingren Zhou
Abstract:
Before applying data analytics or machine learning to a data set, a vital step is usually the construction of an informative set of features from the data. In this paper, we present SMARTFEAT, an efficient automated feature engineering tool to assist data users, even non-experts, in constructing useful features. Leveraging the power of Foundation Models (FMs), our approach enables the creation of new features from the data, based on contextual information and open-world knowledge. Our method incorporates an intelligent operator selector that discerns a subset of operators, effectively avoiding exhaustive combinations of original features, as is typically observed in traditional automated feature engineering tools. Moreover, we address the limitations of performing data tasks through row-level interactions with FMs, which could lead to significant delays and costs due to excessive API calls. We introduce a function generator that facilitates the acquisition of efficient data transformations, such as dataframe built-in methods or lambda functions, ensuring the applicability of SMARTFEAT to generate new features for large datasets. Code repo with prompt details and datasets: (https://github.com/niceIrene/SMARTFEAT).
Submitted 13 December, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Experimental realization of quantum non-Markovianity through the convex mixing of Pauli semigroups on an NMR quantum processor
Authors:
Vaishali Gulati,
Vinayak Jagadish,
R. Srikanth,
Kavita Dorai
Abstract:
This experimental study aims to investigate the convex combinations of Pauli semigroups with arbitrary mixing parameters to determine whether the resulting dynamical map exhibits Markovian or non-Markovian behavior. Specifically, we consider the cases of equal as well as unequal mixing of two Pauli semigroups, and demonstrate that the resulting map is always non-Markovian. Additionally, we study three cases of three-way mixing of the three Pauli semigroups and determine the Markovianity or non-Markovianity of the resulting maps by experimentally determining the decay rates. To simulate the non-unitary dynamics of a single qubit system with different mixing combinations of Pauli semigroups on an NMR quantum processor, we use an algorithm involving two ancillary qubits. The experimental results align with the theoretical predictions.
Submitted 26 April, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Noninvertibility and non-Markovianity of quantum dynamical maps
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
We identify two broad types of noninvertibilities in quantum dynamical maps, one necessarily associated with CP indivisibility and one not so. We study the production of (non-)Markovian, invertible maps by the process of mixing noninvertible Pauli maps, and quantify the fraction of the same. The memory kernel perspective appears to be less transparent on the issue of invertibility than the approaches based on maps or master equations. Here we consider a related and potentially helpful issue: the identification of criteria of parameterized families of maps leading to the existence of a well-defined semigroup limit.
Submitted 14 October, 2023; v1 submitted 22 June, 2023;
originally announced June 2023.
-
Petz recovery maps for qudit quantum channels
Authors:
Lea Lautenbacher,
Vinayak Jagadish,
Francesco Petruccione,
Nadja K. Bernardes
Abstract:
This study delves into the efficacy of the Petz recovery map within the context of two paradigmatic quantum channels: dephasing and amplitude-damping. While prior investigations have predominantly focused on qubits, our research extends this inquiry to higher-dimensional systems. We introduce a novel, state-independent framework based on the Choi-Jamiołkowski isomorphism to evaluate the performance of the Petz map. By analyzing different channels and the (non-)unital nature of these processes, we emphasize the pivotal role of the reference state selection in determining the map's effectiveness. Furthermore, our analysis underscores the considerable impact of suboptimal choices on performance, prompting a broader consideration of factors such as system dimensionality.
Submitted 22 May, 2024; v1 submitted 19 May, 2023;
originally announced May 2023.
-
Pylon: Semantic Table Union Search in Data Lakes
Authors:
Tianji Cong,
Fatemeh Nargesian,
H. V. Jagadish
Abstract:
The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually a lack of a unified data model to capture the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables.
The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even if they are represented differently. In this paper, we propose a data-driven learning approach: specifically, an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning to learn an embedding model that takes into account the indexing/search data structure and produces embeddings close by for columns with semantically similar values while pushing apart columns with semantically dissimilar values. We then find union-able tables based on similarities between their constituent columns in embedding space. On a real-world data lake, we demonstrate that our best-performing model achieves significant improvements in precision ($16\% \uparrow$), recall ($17\% \uparrow $), and query response time (7x faster) compared to the state-of-the-art.
Submitted 13 January, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Detection of Groups with Biased Representation in Ranking
Authors:
Jinyang Li,
Yuval Moskovitch,
H. V. Jagadish
Abstract:
Real-life tools for decision-making in many critical domains are based on ranking results. With the increasing awareness of algorithmic fairness, recent works have presented measures for fairness in ranking. Many of those definitions consider the representation of different "protected groups" in the top-$k$ ranked items, for any reasonable $k$. Given the protected groups, confirming algorithmic fairness is a simple task. However, the groups' definitions may be unknown in advance. In this paper, we study the problem of detecting groups with biased representation in the top-$k$ ranked items, eliminating the need to pre-define protected groups. The number of possible groups can be exponential, making the problem hard. We propose efficient search algorithms for two different fairness measures: global representation bounds and proportional representation. Then we propose a method to explain the bias in the representations of groups utilizing the notion of Shapley values. We conclude with an experimental study, showing the scalability of our approach and demonstrating the usefulness of the proposed algorithms.
Submitted 6 July, 2023; v1 submitted 30 December, 2022;
originally announced January 2023.
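To make the proportional-representation measure in the abstract above concrete, the sketch below flags groups whose share of the top-k falls below a fraction `alpha` of their share of the whole ranking. The abstract does not give the paper's exact bound, so the threshold form `alpha * k * (group size / n)` is an assumption. The brute-force scan over single attribute values is also only illustrative; the paper's search algorithms target the exponential space of attribute combinations.

```python
def underrepresented_groups(ranked, attrs, k, alpha=0.8):
    """Return (attr, value, count_in_top_k, count_overall) for flagged groups.

    ranked: list of dict rows, best-ranked first.
    A group is flagged when its top-k count is below alpha * k * (overall / n),
    an assumed form of the proportional-representation bound.
    """
    n = len(ranked)
    top = ranked[:k]
    flagged = []
    for attr in attrs:
        for value in sorted({row[attr] for row in ranked}):
            overall = sum(1 for row in ranked if row[attr] == value)
            in_top = sum(1 for row in top if row[attr] == value)
            if in_top < alpha * k * overall / n:
                flagged.append((attr, value, in_top, overall))
    return flagged

# Toy ranking: 4 "M" rows and 6 "F" rows, with "M" concentrated at the top.
genders = ["M", "M", "M", "F", "F", "F", "F", "M", "F", "F"]
ranked = [{"gender": g} for g in genders]
flagged = underrepresented_groups(ranked, ["gender"], k=4)
```

In this toy ranking the "F" group makes up 60% of the rows but only one of the top four, which is below the assumed bound of 0.8 × 4 × 0.6 = 1.92, so it is the only group flagged.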
-
WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses
Authors:
Tianji Cong,
James Gale,
Jason Frantz,
H. V. Jagadish,
Çağatay Demiralp
Abstract:
Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals or even to navigate through data across data sources, each of which may easily contain thousands of tables. One common user need is to discover tables joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding "semantically" joinable tables: with columns that can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample efficient and thus scalable to very large tables of millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
Submitted 2 January, 2023; v1 submitted 28 December, 2022;
originally announced December 2022.
-
Reinforcement Learning Enhanced Weighted Sampling for Accurate Subgraph Counting on Fully Dynamic Graph Streams
Authors:
Kaixin Wang,
Cheng Long,
Da Yan,
Jie Zhang,
H. V. Jagadish
Abstract:
As the popularity of graph data increases, there is a growing need to count the occurrences of subgraph patterns of interest, for a variety of applications. Many graphs are massive in scale and also fully dynamic (with insertions and deletions of edges), rendering exact computation of these counts infeasible. Common practice is, instead, to use a small set of edges as a sample to estimate the counts. Existing sampling algorithms for fully dynamic graphs sample the edges with uniform probability. In this paper, we show that we can do much better if we sample edges based on their individual properties. Specifically, we propose a weighted sampling algorithm called WSD for estimating the subgraph count in a fully dynamic graph stream, which samples the edges based on their weights that indicate their importance and reflect their properties. We determine the weights of edges in a data-driven fashion, using a novel method based on reinforcement learning. We conduct extensive experiments to verify that our technique can produce estimates with smaller errors while often running faster compared with existing algorithms.
Submitted 12 November, 2022;
originally announced November 2022.
-
Principles of Query Visualization
Authors:
Wolfgang Gatterbauer,
Cody Dunne,
H. V. Jagadish,
Mirek Riedewald
Abstract:
Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data.
Submitted 2 August, 2022;
originally announced August 2022.
-
CompactIE: Compact Facts in Open Information Extraction
Authors:
Farima Fatahi Bayat,
Nikita Bhutani,
H. V. Jagadish
Abstract:
A major drawback of modern neural OpenIE systems and benchmarks is that they prioritize high coverage of information in extractions over compactness of their constituents. This severely limits the usefulness of OpenIE extractions in many downstream tasks. The utility of extractions can be improved if extractions are compact and share constituents. To this end, we study the problem of identifying compact extractions with neural-based methods. We propose CompactIE, an OpenIE system that uses a novel pipelined approach to produce compact extractions with overlapping constituents. It first detects constituents of the extractions and then links them to build extractions. We train our system on compact extractions obtained by processing existing benchmarks. Our experiments on CaRB and Wire57 datasets indicate that CompactIE finds 1.5x-2x more compact extractions than previous systems, with high precision, establishing a new state-of-the-art performance in OpenIE.
Submitted 9 June, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Representation Bias in Data: A Survey on Identification and Resolution Techniques
Authors:
Nima Shahbazi,
Yin Lin,
Abolfazl Asudeh,
H. V. Jagadish
Abstract:
Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately. Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods. Given that "bias in, bias out", one cannot expect AI-based solutions to have equitable outcomes for societal applications, without addressing issues such as representation bias. While there has been extensive study of fairness in machine learning models, including several review papers, bias in the data has been less studied. This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how it is consumed later. The scope of this survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies to categorize the studied techniques based on multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go to fully address representation bias issues in data. The authors hope that this survey motivates researchers to approach these challenges in the future by observing existing work within their respective domains.
Submitted 18 March, 2023; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Measure of invertible dynamical maps under convex combinations of noninvertible dynamical maps
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
We study the convex combinations of the $(d+1)$ generalized Pauli dynamical maps in a Hilbert space of dimension $d$. For certain choices of the decoherence function, the maps are noninvertible, and they remain so under convex combinations as well. For the case of dynamical maps characterized by the decoherence function $(1-e^{-ct})/n$ with the decoherence parameter $n$ and decay factor $c$, we evaluate the fraction of invertible maps obtained upon mixing, which is found to increase superexponentially with dimension $d$.
Submitted 28 July, 2022; v1 submitted 10 January, 2022;
originally announced January 2022.
-
Noninvertibility as a requirement for creating a semigroup under convex combinations of channels
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
We study the conditions under which a semigroup is obtained upon convex combinations of channels. In particular, we study the set of Pauli and generalized Pauli channels. We find that mixing only semigroups can never produce a semigroup. Counter-intuitively, we find that for a convex combination to yield a semigroup, most of the input channels have to be noninvertible.
Submitted 10 March, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
ARM-Net: Adaptive Relation Modeling Network for Structured Data
Authors:
Shaofeng Cai,
Kaiping Zheng,
Gang Chen,
H. V. Jagadish,
Beng Chin Ooi,
Meihui Zhang
Abstract:
Relational databases are the de facto standard for storing and querying structured data, and extracting insights from structured data requires advanced analytics. Deep neural networks (DNNs) have achieved super-human prediction performance in particular data types, e.g., images. However, existing DNNs may not produce meaningful results when applied to structured data. The reason is that there are correlations and dependencies across combinations of attribute values in a table, and these do not follow simple additive patterns that can be easily mimicked by a DNN. The number of possible such cross features is combinatorial, making them computationally prohibitive to model. Furthermore, the deployment of learning models in real-world applications has also highlighted the need for interpretability, especially for high-stakes applications, which remains another issue of concern to DNNs.
In this paper, we present ARM-Net, an adaptive relation modeling network tailored for structured data, and a lightweight framework ARMOR based on ARM-Net for relational data analytics. The key idea is to model feature interactions with cross features selectively and dynamically, by first transforming the input features into exponential space, and then determining the interaction order and interaction weights adaptively for each cross feature. We propose a novel sparse attention mechanism to dynamically generate the interaction weights given the input tuple, so that we can explicitly model cross features of arbitrary orders with noisy features filtered selectively. Then during model inference, ARM-Net can specify the cross features being used for each prediction for higher accuracy and better interpretability. Our extensive experiments on real-world datasets demonstrate that ARM-Net consistently outperforms existing models and provides more interpretable predictions for data-driven decision making.
Submitted 5 July, 2021;
originally announced July 2021.
-
Initial entanglement, entangling unitaries, and completely positive maps
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
The problem of conditions on the initial correlations between the system and the environment that lead to completely positive (CP) or not-completely positive (NCP) maps has been studied by various authors. Two lines of study may be discerned: one concerned with families of initial correlations that induce CP dynamics under the application of an arbitrary joint unitary on the system and environment; the other concerned with specific initial states that may be highly entangled. Here we study the latter problem, and highlight the interplay between the initial correlations and the unitary applied. In particular, for almost any initial entangled state, one can furnish infinitely many joint unitaries that generate CP dynamics on the system. Restricting to the case of initial, pure entangled states, we obtain the scaling of the dimension of the set of these unitaries and show that it is of zero measure in the set of all possible interaction unitaries.
Submitted 15 April, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
-
Patterns Count-Based Labels for Datasets
Authors:
Yuval Moskovitch,
H. V. Jagadish
Abstract:
Counts of attribute-value combinations are central to the profiling of a dataset, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some dataset profiles, there are too many combinations of attributes for it to be practical to store counts for each combination. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates for these counts. A label, in this paper, contains information regarding the count of selected patterns (attribute-value combinations) in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of count estimates derived from the resulting labels and the efficiency of our algorithm.
Submitted 7 November, 2020; v1 submitted 30 October, 2020;
originally announced October 2020.
-
MithraDetective: A System for Cherry-picked Trendlines Detection
Authors:
Yoko Nagafuchi,
Yin Lin,
Kaushal Mamgain,
Abolfazl Asudeh,
H. V. Jagadish,
You Wu,
Cong Yu
Abstract:
Given a data set, misleading conclusions can be drawn from it by cherry-picking selected samples. One important class of conclusions is a trend derived from a data set of values over time. Our goal is to evaluate whether the 'trends' described by the extracted samples are representative of the true situation represented in the data. We demonstrate MithraDetective, a system to compute a support score to indicate how cherry-picked a statement is; that is, whether the reported trend is well-supported by the data. The system can also be used to discover more supported alternatives. MithraDetective provides an interactive visual interface for both tasks.
Submitted 17 October, 2020;
originally announced October 2020.
-
Compressed Sensing Tomography for qudits in Hilbert spaces of non-power-of-two dimensions
Authors:
Revanth Badveli,
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
The techniques of low-rank matrix recovery were adapted for Quantum State Tomography (QST) previously by D. Gross et al. [Phys. Rev. Lett. 105, 150401 (2010)], where they consider the tomography of $n$ spin-$1/2$ systems. For the density matrix of dimension $d = 2^n$ and rank $r$ with $r \ll 2^n$, it was shown that randomly chosen Pauli measurements of the order $O(dr \log(d)^2)$ are enough to fully reconstruct the density matrix by running a specific convex optimization algorithm. The result utilized the low operator-norm of the Pauli operator basis, which makes it `incoherent' to low-rank matrices. For quantum systems of dimension $d$ not a power of two, Pauli measurements are not available, and one may consider using SU($d$) measurements. Here, we point out that the SU($d$) operators, owing to their high operator norm, do not provide a significant savings in the number of measurement settings required for successful recovery of all rank-$r$ states. We propose an alternative strategy, in which the quantum information is swapped into the subspace of a power-two system using only $\textrm{poly}(\log(d)^2)$ gates at most, with QST being implemented subsequently by performing $O(dr \log(d)^2)$ Pauli measurements. We show that, despite the increased dimensionality, this method is more efficient than the one using SU($d$) measurements.
Submitted 6 July, 2020; v1 submitted 2 June, 2020;
originally announced June 2020.
-
QueryVis: Logic-based diagrams help users understand complicated SQL queries faster
Authors:
Aristotelis Leventidis,
Jiahui Zhang,
Cody Dunne,
Wolfgang Gatterbauer,
H. V. Jagadish,
Mirek Riedewald
Abstract:
Understanding the meaning of existing SQL queries is critical for code maintenance and reuse. Yet SQL can be hard to read, even for expert users or the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in \emph{automatically-generated visual diagrams} that can help users understand the meaning of queries faster and more accurately than SQL text alone. We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture the meaning of deeply nested queries. Our diagrams build upon a rich history of diagrammatic reasoning systems in logic and were designed using a large body of human-computer interaction best practices: they are \emph{minimal} in that no visual element is superfluous; they are \emph{unambiguous} in that no two queries with different semantics map to the same visualization; and they \emph{extend} previously existing visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation involving 42 users on Amazon Mechanical Turk shows that with only a 2--3 minute static tutorial, participants could interpret queries meaningfully faster with our diagrams than when reading SQL alone. Moreover, we have evidence that our visual diagrams result in participants making fewer errors than with SQL. We believe that more regular exposure to diagrammatic representations of SQL can give rise to a \emph{pattern-based} and thus more intuitive use and re-use of SQL. All details on the experimental study, the evaluation stimuli, raw data, and analyses, and source code are available at https://osf.io/mycr2
Submitted 23 April, 2020;
originally announced April 2020.
-
Duoquest: A Dual-Specification System for Expressive SQL Queries
Authors:
Christopher Baik,
Zhongjun Jin,
Michael Cafarella,
H. V. Jagadish
Abstract:
Querying a relational database is difficult because it requires users both to know the SQL language and to be familiar with the schema. On the other hand, many users possess enough domain familiarity or expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of these alternatives face certain pitfalls: natural language queries (NLQs) are often ambiguous, even for human interpreters, while current PBE approaches require either low-complexity queries, user schema knowledge, exact example tuples from the user, or a closed-world assumption to be tractable. Consequently, we propose dual-specification query synthesis, which consumes both an NLQ and an optional PBE-like table sketch query that enables users to express varied levels of domain-specific knowledge. We introduce the novel dual-specification Duoquest system, which leverages guided partial query enumeration to efficiently explore the space of possible queries. We present results from user studies in which Duoquest demonstrates a 62.5% absolute increase in query construction accuracy over a state-of-the-art NLI and comparable accuracy to a PBE system on a more limited workload supported by the PBE system. In a simulation study on the prominent Spider benchmark, Duoquest demonstrates a >2x increase in top-1 accuracy over both NLI and PBE.
Submitted 16 March, 2020;
originally announced March 2020.
-
Dynamics of quantum correlations in a Qubit-Oscillator system interacting via a dissipative bath
Authors:
Revanth Badveli,
Vinayak Jagadish,
S. Akshaya,
R. Srikanth,
Francesco Petruccione
Abstract:
The entanglement dynamics in a bipartite system consisting of a qubit and a harmonic oscillator interacting only through their coupling with the same bath is studied. The considered model assumes that the qubit is coupled to the bath via the Jaynes-Cummings interaction, whilst the position of the oscillator is coupled to the position of the bath via a dipole interaction. We give a microscopic derivation of the Gorini-Kossakowski-Sudarshan-Lindblad equation for the considered model. Based on the Kossakowski Matrix, we show that non-classical correlations including entanglement can be generated by the considered dynamics. We then analytically identify specific initial states for which entanglement is generated. This result is also supported by our numerical simulations.
Submitted 10 February, 2020;
originally announced February 2020.
-
Convex combinations of CP-divisible Pauli channels that are not semigroups
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
We study the memory property of the channels obtained by convex combinations of Markovian channels that are not necessarily quantum dynamical semigroups (QDSs). In particular, we characterize the geometry of the region of (non-)Markovian channels obtained by the convex combination of the three Pauli channels, as a function of deviation from the semigroup form in a family of channels. The regions are highly convex, and interestingly, the measure of the non-Markovian region shrinks with greater deviation from the QDS structure for the considered family, underscoring the counterintuitive nature of (non-)Markovianity under channel mixing.
Submitted 29 September, 2020; v1 submitted 16 December, 2019;
originally announced December 2019.
-
Responsible Scoring Mechanisms Through Function Sampling
Authors:
Abolfazl Asudeh,
H. V. Jagadish
Abstract:
Human decision-makers often receive assistance from data-driven algorithmic systems that provide a score for evaluating objects, including individuals. The scores are generated by a function (mechanism) that takes a set of features as input and generates a score. The scoring functions are either machine-learned or human-designed and can be used for different decision purposes such as ranking or classification.
Given the potential impact of these scoring mechanisms on individuals' lives and on society, it is important to make sure these scores are computed responsibly. Hence we need tools for responsible scoring mechanism design. In this paper, focusing on linear scoring functions, we highlight the importance of unbiased function sampling and perturbation in the function space for devising such tools. We provide unbiased samplers for the entire function space, as well as a $θ$-vicinity around a given function.
We then illustrate the value of these samplers for designing effective algorithms in three diverse problem scenarios in the context of ranking. Finally, as a fundamental method for designing responsible scoring mechanisms, we propose a novel approach for approximating the construction of the arrangement of hyperplanes. Despite the exponential complexity of an arrangement in the number of dimensions, using function sampling, our algorithm is linear in the number of samples and hyperplanes, and independent of the number of dimensions.
Submitted 22 November, 2019;
originally announced November 2019.
-
Convex Combinations of Pauli Semigroups: Geometry, Measure and an Application
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
Finite-time Markovian channels, unlike their infinitesimal counterparts, do not form a convex set. As a particular instance of this observation, we consider the problem of mixing the three Pauli channels, conservatively assumed to be quantum dynamical semigroups, and fully characterize the resulting ``Pauli simplex.'' We show that neither the set of non-Markovian (completely positive indivisible) nor Markovian channels is convex in the Pauli simplex, and that the measure of non-Markovian channels is about 0.87. All channels in the Pauli simplex are P divisible. A potential application in the context of quantum resource theory is also discussed.
Submitted 1 June, 2020; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Measure of not-completely-positive qubit maps: the general case
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
We show that the set of not-completely-positive (NCP) maps is unbounded, unless further assumptions are made. This is done by first proposing a reasonable definition of a valid NCP map, which is nontrivial because NCP maps may lack a full positivity domain. The definition is motivated by specific examples. We prove that for valid NCP maps, the eigenvalue spectrum of the corresponding dynamical matrix is not bounded. Based on this, we argue that in general the volume measure of qubit maps, including NCP maps, is not well defined.
Submitted 23 July, 2019;
originally announced July 2019.
-
Database Meets Deep Learning: Challenges and Opportunities
Authors:
Wei Wang,
Meihui Zhang,
Gang Chen,
H. V. Jagadish,
Beng Chin Ooi,
Kian-Lee Tan
Abstract:
Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases and deep learning are different in terms of both techniques and applications. In this paper, we discuss research problems at the intersection of the two fields. In particular, we discuss possible improvements for deep learning systems from a database perspective, and analyze database applications that may benefit from deep learning techniques.
Submitted 18 January, 2020; v1 submitted 21 June, 2019;
originally announced June 2019.
-
Open Information Extraction from Question-Answer Pairs
Authors:
Nikita Bhutani,
Yoshihiko Suhara,
Wang-Chiew Tan,
Alon Halevy,
H. V. Jagadish
Abstract:
Open Information Extraction (OpenIE) extracts meaningful structured tuples from free-form text. Most previous work on OpenIE considers extracting data from one sentence at a time. We describe NeurON, a system for extracting tuples from question-answer pairs. Since real questions and answers often contain precisely the information that users care about, such information is particularly desirable to extend a knowledge base with.
NeurON addresses several challenges. First, an answer text is often hard to understand without knowing the question, and second, relevant information can span multiple sentences. To address these, NeurON formulates extraction as a multi-source sequence-to-sequence learning task, wherein it combines distributed representations of a question and an answer to generate knowledge facts. We describe experiments on two real-world datasets that demonstrate that NeurON can find a significant number of new and interesting facts to extend a knowledge base compared to state-of-the-art OpenIE methods.
Submitted 6 April, 2019; v1 submitted 1 March, 2019;
originally announced March 2019.
-
An Invitation to Quantum Channels
Authors:
Vinayak Jagadish,
Francesco Petruccione
Abstract:
Open quantum systems have become an active area of research, owing to their potential applications in many different fields ranging from computation to biology. Here, we review the formalism of dynamical maps used to represent the time evolution of open quantum systems and discuss the various representations and properties of the same, with many examples.
Submitted 3 February, 2019;
originally announced February 2019.
-
Measure of positive and not completely positive single-qubit Pauli maps
Authors:
Vinayak Jagadish,
R. Srikanth,
Francesco Petruccione
Abstract:
The time evolution of an initially uncorrelated system is governed by a completely positive (CP) map. More generally, the system may contain initial (quantum) correlations with an environment, in which case the system evolves according to a not-completely positive (NCP) map. It is an interesting question what the relative measure is for these two types of maps within the set of positive maps. After indicating the scope of the full problem of computing the true volume for generic maps acting on a qubit, we study the case of Pauli channels in an abstract space whose elements represent an equivalence class of maps that are identical up to a non-Pauli unitary. In this space, we show that the volume of NCP maps is twice that of CP maps.
Submitted 3 February, 2019;
originally announced February 2019.
-
Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases
Authors:
Christopher Baik,
H. V. Jagadish,
Yunyao Li
Abstract:
A critical challenge in constructing a natural language interface to database (NLIDB) is bridging the semantic gap between a natural language query (NLQ) and the underlying data. Two specific ways this challenge exhibits itself is through keyword mapping and join path inference. Keyword mapping is the task of mapping individual keywords in the original NLQ to database elements (such as relations, attributes or values). It is challenging due to the ambiguity in mapping the user's mental model and diction to the schema definition and contents of the underlying database. Join path inference is the process of selecting the relations and join conditions in the FROM clause of the final SQL query, and is difficult because NLIDB users lack the knowledge of the database schema or SQL and therefore cannot explicitly specify the intermediate tables and joins needed to construct a final SQL query. In this paper, we propose leveraging information from the SQL query log of a database to enhance the performance of existing NLIDBs with respect to these challenges. We present a system Templar that can be used to augment existing NLIDBs. Our extensive experimental evaluation demonstrates the effectiveness of our approach, leading up to 138% improvement in top-1 accuracy in existing NLIDBs by leveraging SQL query log information.
Submitted 31 January, 2019;
originally announced February 2019.
-
Demonstration of a Multiresolution Schema Mapping System
Authors:
Zhongjun Jin,
Christopher Baik,
Michael Cafarella,
H. V. Jagadish,
Yuze Lou
Abstract:
Enterprise databases usually contain large and complex schemas. Authoring complete schema mapping queries in this case requires deep knowledge about the source and target schemas and is thereby very challenging to programmers. Sample-driven schema mapping allows the user to describe the schema mapping using data records. However, real data records are still harder to specify than other useful insights about the desired schema mapping the user might have. In this project, we develop a schema mapping system, PRISM, that enables multiresolution schema mapping. The end user is not limited to providing high-resolution constraints like exact data records but may also provide constraints of various resolutions, like incomplete data records, value ranges, and data types. This new interaction paradigm gives the user more flexibility in describing the desired schema mapping. This demonstration showcases how to use PRISM for schema mapping in a real database.
Submitted 18 December, 2018;
originally announced December 2018.
-
Assessing and Remedying Coverage for a Given Dataset
Authors:
Abolfazl Asudeh,
Zhongjun Jin,
H. V. Jagadish
Abstract:
Data analysis impacts virtually every aspect of our society today. Often, this analysis is performed on an existing dataset, possibly collected through a process that the data scientists had limited control over. The existing data analyzed may not include the complete universe, but it is expected to cover the diversity of items in the universe. Lack of adequate coverage in the dataset can result in undesirable outcomes such as biased decisions and algorithmic racism, as well as creating vulnerabilities such as opening up room for adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple categorical attributes. We first provide efficient techniques for traversing the combinatorial explosion of value combinations to identify any regions of attribute space not adequately covered by the data. Then, we determine the least amount of additional data that must be obtained to resolve this lack of adequate coverage. We confirm the value of our proposal through both theoretical analyses and comprehensive experiments on real data.
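As a rough illustration of what "regions of attribute space not adequately covered" can mean, the sketch below enumerates every value combination of the chosen categorical attributes and reports those with fewer than a threshold number of rows. This is a brute-force illustration only, not the efficient traversal the paper proposes; the function name `uncovered_combinations`, the `threshold` parameter, and the sample rows are all hypothetical.

```python
from collections import Counter
from itertools import product


def uncovered_combinations(rows, attributes, threshold):
    """Return value combinations of `attributes` that occur in fewer than
    `threshold` rows. Brute force: enumerates the full cross product of the
    observed attribute domains, so it is only practical for small inputs."""
    # Observed domain of each attribute, sorted for a deterministic order.
    domains = [sorted({row[a] for row in rows}) for a in attributes]
    # How many rows carry each concrete combination of values.
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    return [combo for combo in product(*domains) if counts[combo] < threshold]


# Hypothetical toy dataset with two categorical attributes.
rows = [
    {"gender": "F", "age": "young"},
    {"gender": "F", "age": "old"},
    {"gender": "M", "age": "young"},
    {"gender": "M", "age": "young"},
]
gaps = uncovered_combinations(rows, ["gender", "age"], threshold=1)
print(gaps)
```

The cross product grows exponentially with the number of attributes, which is the combinatorial explosion the abstract refers to.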
Submitted 23 February, 2019; v1 submitted 15 October, 2018;
originally announced October 2018.
-
Reducing Uncertainty of Schema Matching via Crowdsourcing with Accuracy Rates
Authors:
Chen Jason Zhang,
Lei Chen,
H. V. Jagadish,
Mengchen Zhang,
Yongxin Tong
Abstract:
Schema matching is a central challenge for data integration systems. Inspired by the popularity and the success of crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since crowdsourcing platforms are most effective for simple questions, we assume that each Correspondence Correctness Question (CCQ) asks the crowd to decide whether a given correspondence should exist in the correct matching. Furthermore, members of a crowd may sometimes return incorrect answers with different probabilities. Accuracy rates of individual crowd workers are probabilities of returning correct answers, which can be attributes of CCQs as well as evaluations of individual workers. We prove that uncertainty reduction equals the entropy of answers minus the entropy of crowds and show how to obtain lower and upper bounds for it. We propose frameworks and efficient algorithms to dynamically manage the CCQs to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely 'Single CCQ' and 'Multiple CCQ', which adaptively select, publish and manage questions. We verify the value of our solutions with simulation and real implementation.
Submitted 11 September, 2018;
originally announced September 2018.
-
GOTO Rankings Considered Helpful
Authors:
Emery Berger,
Stephen M. Blackburn,
Carla Brodley,
H. V. Jagadish,
Kathryn S. McKinley,
Mario A. Nascimento,
Minjeong Shin,
Lexing Xie
Abstract:
Rankings are a fact of life. Whether or not one likes them, they exist and are influential. Within academia, and in computer science in particular, rankings not only capture our attention but also widely influence people who have a limited understanding of computing science research, including prospective students, university administrators, and policy-makers. In short, rankings matter. This position paper advocates for the adoption of "GOTO rankings": rankings that use Good data, are Open, Transparent, and Objective, and the rejection of rankings that do not meet these criteria.
Submitted 24 April, 2019; v1 submitted 29 June, 2018;
originally announced July 2018.
-
On Obtaining Stable Rankings
Authors:
Abolfazl Asudeh,
H. V. Jagadish,
Gerome Miklau,
Julia Stoyanovich
Abstract:
Decision making is challenging when there is more than one criterion to consider. In such cases, it is common to assign a goodness score to each item as a weighted sum of its attribute values and rank them accordingly. Clearly, the ranking obtained depends on the weights used for this summation. Ideally, one would want the ranked order not to change if the weights are changed slightly. We call this property the stability of the ranking. A consumer of a ranked list may trust the ranking more if it has high stability. A producer of a ranked list prefers to choose weights that result in a stable ranking, both to earn the trust of potential consumers and because a stable ranking is intrinsically likely to be more meaningful. In this paper, we develop a framework that can be used to assess the stability of a provided ranking and to obtain a stable ranking within an "acceptable" range of weight values (called "the region of interest"). We address the case where the user cares about the rank order of the entire set of items, and also the case where the user cares only about the top-$k$ items. Using a geometric interpretation, we propose algorithms that produce stable rankings. In addition to theoretical analyses, we conduct extensive experiments on real datasets that validate our proposal.
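To make the notion of stability concrete, the following sketch scores items by a weighted sum and estimates, by random sampling, how often a slightly perturbed weight vector reproduces the same ranking. This Monte Carlo estimate is an assumption made for illustration; it is not the geometric algorithm described in the paper, and the names `rank`, `estimate_stability`, and `radius` are hypothetical.

```python
import random


def rank(items, weights):
    """Rank item indices by descending weighted-sum score (ties keep input order)."""
    scores = [sum(w * x for w, x in zip(weights, item)) for item in items]
    return tuple(sorted(range(len(items)), key=lambda i: -scores[i]))


def estimate_stability(items, weights, radius, samples=1000, seed=0):
    """Fraction of sampled weight vectors within +/- `radius` of `weights`
    (per coordinate, clipped at zero) that yield the same ranking."""
    rng = random.Random(seed)
    base = rank(items, weights)
    same = 0
    for _ in range(samples):
        perturbed = [max(0.0, w + rng.uniform(-radius, radius)) for w in weights]
        if rank(items, perturbed) == base:
            same += 1
    return same / samples


# Hypothetical items with two numeric attributes each.
items = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]
print(rank(items, (0.9, 0.1)))
print(estimate_stability(items, (0.9, 0.1), radius=0.05))
```

A value near 1 suggests the ranking survives small changes to the weights; a low value suggests that nearby weights would reorder the items.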
Submitted 18 December, 2018; v1 submitted 29 April, 2018;
originally announced April 2018.
-
PANDA: Facilitating Usable AI Development
Authors:
Jinyang Gao,
Wei Wang,
Meihui Zhang,
Gang Chen,
H. V. Jagadish,
Guoliang Li,
Teck Khim Ng,
Beng Chin Ooi,
Sheng Wang,
Jingren Zhou
Abstract:
Recent advances in artificial intelligence (AI) and machine learning have created a general perception that AI could be used to solve complex problems, and in some situations AI has been over-hyped as a tool that can be used with ease. Unfortunately, the barrier to mass adoption of AI in various business domains is too high because most domain experts have no background in AI. Developing AI applications involves multiple phases, namely data preparation, application modeling, and product deployment. The effort of AI research has been spent mostly on new AI models (in the model training stage) to improve the performance of benchmark tasks such as image recognition. Many other factors such as usability, efficiency and security of AI have not been well addressed, and therefore form a barrier to democratizing AI. Further, for many real-world applications such as healthcare and autonomous driving, learning via huge amounts of possibility exploration is not feasible since humans are involved. In many complex applications such as healthcare, subject matter experts (e.g., clinicians) are the ones who appreciate the importance of features that affect health, and their knowledge together with existing knowledge bases is critical to the end results. In this paper, we take a new perspective on developing AI solutions, and present a solution for making AI usable. We hope that this solution will enable all subject matter experts (e.g., clinicians) to exploit AI like data scientists.
Submitted 26 April, 2018;
originally announced April 2018.
-
CLX: Towards verifiable PBE data transformation
Authors:
Zhongjun Jin,
Michael Cafarella,
H. V. Jagadish,
Sean Kandel,
Michael Minar,
Joseph M. Hellerstein
Abstract:
Effective data analytics on data collected from the real world usually begins with a notoriously expensive pre-processing step of data transformation and wrangling. Programming By Example (PBE) systems have been proposed to automatically infer transformations using simple examples that users provide as hints. However, an important usability issue - verification - limits the effective use of such PBE data transformation systems, since the verification process is often effort-consuming and unreliable. We propose a data transformation paradigm, CLX (pronounced "clicks"), with a focus on facilitating verification for end users in a PBE-like data transformation. CLX performs pattern clustering in both input and output data, which allows the user to verify at the pattern level, rather than the data instance level, without having to write any regular expressions, thereby significantly reducing user verification effort. Thereafter, CLX automatically generates transformation programs as regular-expression replace operations that are easy for average users to verify. We experimentally compared the CLX prototype with both FlashFill, a state-of-the-art PBE data transformation tool, and Trifacta, an influential system supporting interactive data transformation. The results show improvements over the state-of-the-art tools in saving user verification effort, without loss of efficiency or expressive power. In a user effort study on data sets of various sizes, when the data size grew by a factor of 30, the user verification time required by the CLX prototype grew by 1.3x whereas that required by FlashFill grew by 11.4x. In another test assessing the users' understanding of the transformation logic - a key ingredient in effective verification - CLX users achieved a success rate about twice that of FlashFill users.
Submitted 12 August, 2019; v1 submitted 1 March, 2018;
originally announced March 2018.
-
RRR: Rank-Regret Representative
Authors:
Abolfazl Asudeh,
Azade Nazi,
Nan Zhang,
Gautam Das,
H. V. Jagadish
Abstract:
Selecting the best items in a dataset is a common task in data exploration. However, the concept of "best" lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove "dominated" items and create a "representative" subset of the data set, comprising the "best items" in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be almost as big as the full data. A smaller representative can be found if we relax the requirement to include the best item for every possible user, and instead just limit the users' "regret". Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full data set, for any chosen ranking function.
However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. Therefore, alternatively, we consider the position of the items in the ranked list for defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-$k$ of any possible ranking function. This problem is NP-complete. We use the geometric interpretation of items to bound their ranks on ranges of functions and to utilize combinatorial geometry notions for developing effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
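The rank-based notion of regret can be illustrated with a small sketch. For each ranking function, the regret of a subset is taken to be the best rank (in the full dataset) achieved by any item of the subset, and the overall rank-regret is the worst such value across functions. The sketch assumes linear ranking functions and evaluates only a finite, hand-picked set of weight vectors, so it yields a lower bound on the true rank-regret over all functions; the name `rank_regret` and the sample data are hypothetical, and this is not the approximation algorithm from the paper.

```python
def rank_regret(items, subset, weight_vectors):
    """Worst-case (over the given weight vectors) best rank attained by any
    item in `subset`, where ranks are 1-based positions in the full dataset
    ordered by descending weighted-sum score."""
    worst = 0
    for weights in weight_vectors:
        scores = [sum(w * x for w, x in zip(weights, item)) for item in items]
        order = sorted(range(len(items)), key=lambda i: -scores[i])
        position = {idx: r + 1 for r, idx in enumerate(order)}
        best_in_subset = min(position[i] for i in subset)
        worst = max(worst, best_in_subset)
    return worst


# Hypothetical dataset: two extreme items and one balanced item.
items = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
functions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(rank_regret(items, [2], functions))
print(rank_regret(items, [0, 1, 2], functions))
```

In this toy data the single balanced item is never worse than second under any of the three functions, so it serves as a one-item representative with rank-regret 2 over the sampled functions.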
Submitted 3 March, 2018; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Designing Fair Ranking Schemes
Authors:
Abolfazl Asudeh,
H. V. Jagadish,
Julia Stoyanovich,
Gautam Das
Abstract:
Items from a database are often ranked based on a combination of multiple criteria. A user may have the flexibility to accept combinations that weigh these criteria differently, within limits. On the other hand, this choice of weights can greatly affect the fairness of the produced ranking. In this paper, we develop a system that helps users choose criterion weights that lead to greater fairness.
We consider ranking functions that compute the score of each item as a weighted sum of (numeric) attribute values, and then sort items on their score. Each ranking function can be expressed as a vector of weights, or as a point in a multi-dimensional space. For a broad range of fairness criteria, we show how to efficiently identify regions in this space that satisfy these criteria. Using this identification method, our system is able to tell users whether their proposed ranking function satisfies the desired fairness criteria and, if it does not, to suggest the smallest modification that does. We develop user-controllable approximation and indexing techniques that are applied during preprocessing and support sub-second response times during the online phase. Our extensive experiments on real datasets demonstrate that our methods are able to find solutions that satisfy fairness criteria effectively and efficiently.
Submitted 4 January, 2018; v1 submitted 27 December, 2017;
originally announced December 2017.
-
Non-Markovian evolution: a quantum walk perspective
Authors:
Pradeep Kumar,
Subhashish Banerjee,
R. Srikanth,
Vinayak Jagadish,
Francesco Petruccione
Abstract:
Quantum non-Markovianity of a quantum noisy channel manifests typically as information backflow, characterized by the departure of the intermediate map from complete positivity, though we indicate certain noisy channels that don't exhibit this behavior. In complex systems, non-Markovianity becomes more involved on account of subsystem dynamics. Here we study various facets of non-Markovian evolution, in the context of coined quantum walks, with particular stress on disambiguating the internal vs. environmental contributions to non-Markovian backflow. For the above problem of disambiguation, we present a general power-spectral technique based on a distinguishability measure such as trace-distance or correlation measure such as mutual information. We also study various facets of quantum correlations in the transition from quantum to classical random walks, under the considered non-Markovian noise models. The potential for the application of this analysis to the quantum statistical dynamics of complex systems is indicated.
Submitted 8 January, 2019; v1 submitted 9 November, 2017;
originally announced November 2017.
-
Non-Markovian Dynamics of Discrete-Time Quantum Walks
Authors:
Subhashish Banerjee,
N. Pradeep Kumar,
R. Srikanth,
Vinayak Jagadish,
Francesco Petruccione
Abstract:
In the case of the discrete-time coined quantum walk, the reduced dynamics of the coin shows non-Markovian recurrence features due to information back-flow from the position degree of freedom. Here we study how this non-Markovian behavior is modified in the presence of open system dynamics. In the process, we obtain useful insights into the nature of non-Markovian physics. In particular, we show that in the case of (non-Markovian) random telegraph noise (RTN), a further discernible recurrence feature is present in the dynamics. Moreover, this feature is correlated with the localization of the walker. On the other hand, no additional recurrence feature appears for other non-Markovian types of noise (Ornstein-Uhlenbeck and Power Law noise). We propose a power spectral method for comparing the relative strengths of the non-Markovian component due to the external noise and that due to the internal position degree of freedom.
Submitted 23 March, 2017;
originally announced March 2017.
-
Bsmooth: Learning from user feedback to disambiguate query terms in interactive data retrieval
Authors:
Bernardo Gonçalves,
H. V. Jagadish
Abstract:
There is great interest in supporting imprecise queries (e.g., keyword search or natural language queries) over databases today. To support such queries, the database system is typically required to disambiguate parts of the user-specified query against the database, using whatever resources are intrinsically available to it (the database schema, data value distributions, natural language models, etc.). Often, systems will also have a user-interaction log available, which can serve as an extrinsic resource to supplement their model based on their own intrinsic resources. This leads to the problem of how best to combine the system's prior ranking with insight derived from the user-interaction log. Statistical inference techniques such as maximum likelihood or Bayesian updates from a subjective prior turn out not to apply in a straightforward way due to possible noise from user search behavior and to encoding biases endemic to the system's models. In this paper, we address this learning problem in interactive data retrieval, with specific focus on type classification for user-specified query terms. We develop a novel Bayesian smoothing algorithm, Bsmooth, which is simple, fast, flexible and accurate. We analytically establish some desirable properties and show, through experiments against an independent benchmark, that the addition of such a learning layer performs much better than standard methods.
Submitted 26 April, 2017; v1 submitted 15 October, 2016;
originally announced October 2016.