Corporate Greenwashing Detection in Text -- a Survey

Authors: Tom Calamai, Oana Balalau, Théo Le Guenedal, Fabian M. Suchanek

Abstract: Greenwashing is an effort to mislead the public about the environmental impact of an entity, such as a state or company. We provide a comprehensive survey of the scientific literature addressing natural language processing methods to identify potentially misleading climate-related corporate communications, indicative of greenwashing. We break the detection of greenwashing into intermediate tasks, and review the state-of-the-art approaches for each of them. We discuss datasets, methods, and results, as well as limitations and open challenges. We also provide an overview of how far the field has come as a whole, and point out future research directions.

Submitted 11 February, 2025; originally announced February 2025.

Comments: 35 pages, 1 figure, 21 pages (appendix), working paper

arXiv:2409.11798 [pdf, other]

The Factuality of Large Language Models in the Legal Domain

Authors: Rajaa El Hamdani, Thomas Bonald, Fragkiskos Malliaros, Nils Holzenberger, Fabian Suchanek

Abstract: This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.

Submitted 18 September, 2024; originally announced September 2024.

Comments: CIKM 2024, short paper

arXiv:2405.13769 [pdf, other]

Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation

Authors: Cyril Chhun, Fabian M. Suchanek, Chloé Clavel

Abstract: Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.

Submitted 22 May, 2024; originally announced May 2024.

Comments: TACL, pre-MIT Press publication version

arXiv:2402.04957 [pdf, other]

Reconfidencing LLMs from the Grouping Loss Perspective

Authors: Lihu Chen, Alexandre Perez-Lebel, Fabian M. Suchanek, Gaël Varoquaux

Abstract: Large Language Models (LLMs), including ChatGPT and LLaMA, are susceptible to generating hallucinated answers in a confident tone. While efforts to elicit and calibrate confidence scores have proven useful, recent findings show that controlling uncertainty must go beyond calibration: predicted scores may deviate significantly from the actual posterior probabilities due to the impact of grouping loss. In this work, we construct a new evaluation dataset derived from a knowledge base to assess confidence scores given to answers of Mistral and LLaMA. Experiments show that they tend to be overconfident. Further, we show that they are more overconfident on some answers than others, e.g., depending on the nationality of the person in the query. In uncertainty-quantification theory, this is grouping loss. To address this, we propose a solution to reconfidence LLMs, canceling not only calibration but also grouping loss. The LLMs, after the reconfidencing process, indicate improved confidence alignment with the accuracy of their responses.

Submitted 23 October, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: EMNLP 2024 Findings

arXiv:2401.10407 [pdf, other]

Learning High-Quality and General-Purpose Phrase Representations

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Abstract: Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and require to be pre-trained on a corpus with context sentences. Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible. We propose an improved framework to learn phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design three granularities of data augmentation to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. [PEARL-small]: https://huggingface.co/Lihuchen/pearl_small; [PEARL-base]: https://huggingface.co/Lihuchen/pearl_base; [Code and Dataset]: https://github.com/tigerchen52/PEARL

Submitted 22 February, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Findings of EACL 2024

arXiv:2311.09761 [pdf, other]

MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification

Authors: Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel, Fabian Suchanek

Abstract: We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.

Submitted 9 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2310.12864 [pdf, other]

The Locality and Symmetry of Positional Encodings

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Abstract: Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that various positional encodings are insensitive to word order. In this work, we conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style), which complements existing work in three aspects: (1) We uncover the core function of PEs by identifying two common properties, Locality and Symmetry; (2) We show that the two properties are closely correlated with the performances of downstream tasks; (3) We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly. We believe that these results are the basis for developing better PEs for transformer-based language models. The code is available at https://github.com/tigerchen52/locality_symmetry

Submitted 19 October, 2023; originally announced October 2023.

Comments: Long Paper in Findings of EMNLP23

arXiv:2308.11884 [pdf, ps, other]

YAGO 4.5: A Large and Clean Knowledge Base with a Rich Taxonomy

Authors: Fabian Suchanek, Mehwish Alam, Thomas Bonald, Lihu Chen, Pierre-Henri Paris, Jules Soria

Abstract: Knowledge Bases (KBs) find applications in many knowledge-intensive tasks and, most notably, in information retrieval. Wikidata is one of the largest public general-purpose KBs. Yet, its collaborative nature has led to a convoluted schema and taxonomy. The YAGO 4 KB cleaned up the taxonomy by incorporating the ontology of Schema.org, resulting in a cleaner structure amenable to automated reasoning. However, it also cut away large parts of the Wikidata taxonomy, which is essential for information retrieval. In this paper, we extend YAGO 4 with a large part of the Wikidata taxonomy - while respecting logical constraints and the distinction between classes and instances. This yields YAGO 4.5, a new, logically consistent version of YAGO that adds a rich layer of informative classes. An intrinsic and an extrinsic evaluation show the value of the new resource.

Submitted 10 April, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: Published at SIGIR 2024, cite that paper in scientific articles

arXiv:2305.11311 [pdf, other]

BELLA: Black box model Explanations by Local Linear Approximations

Authors: Nedeljko Radulovic, Albert Bifet, Fabian Suchanek

Abstract: Understanding the decision-making process of black-box models has become not just a legal requirement, but also an additional way to assess their performance. However, the state of the art post-hoc explanation approaches for regression models rely on synthetic data generation, which introduces uncertainty and can hurt the reliability of the explanations. Furthermore, they tend to produce explanations that apply to only very few data points. In this paper, we present BELLA, a deterministic model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models. BELLA provides explanations in the form of a linear model trained in the feature space. BELLA maximizes the size of the neighborhood to which the linear model applies so that the explanations are accurate, simple, general, and robust. BELLA can produce both factual and counterfactual explanations.

Submitted 20 March, 2025; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: 19 pages, 3 figures, submitted to TMLR journal

arXiv:2305.05403 [pdf, other]

Completeness, Recall, and Negation in Open-World Knowledge Bases: A Survey

Authors: Simon Razniewski, Hiba Arnaout, Shrestha Ghosh, Fabian Suchanek

Abstract: General-purpose knowledge bases (KBs) are a cornerstone of knowledge-centric AI. Many of them are constructed pragmatically from Web sources, and are thus far from complete. This poses challenges for the consumption as well as the curation of their content. While several surveys target the problem of completing incomplete KBs, the first problem is arguably to know whether and where the KB is incomplete in the first place, and to which degree. In this survey we discuss how knowledge about completeness, recall, and negation in KBs can be expressed, extracted, and inferred. We cover (i) the logical foundations of knowledge representation and querying under partial closed-world semantics; (ii) the estimation of this information via statistical patterns; (iii) the extraction of information about recall from KBs and text; (iv) the identification of interesting negative statements; and (v) relaxed notions of relative recall. This survey is targeted at two types of audiences: (1) practitioners who are interested in tracking KB quality, focusing extraction efforts, and building quality-aware downstream applications; and (2) data management, knowledge base and semantic web researchers who wish to understand the state of the art of knowledge bases beyond the open-world assumption. Consequently, our survey presents both fundamental methodologies and their working, and gives practice-oriented recommendations on how to choose between different approaches for a problem at hand.

Submitted 6 December, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 42 pages, 8 figures, 5 tables

Journal ref: Under review, 2022

arXiv:2302.01860 [pdf, other]

GLADIS: A General and Large Acronym Disambiguation Benchmark

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Abstract: Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.

Submitted 13 March, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: Long paper at EACL 23

arXiv:2208.11646 [pdf, other]

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

Authors: Cyril Chhun, Pierre Colombo, Chloé Clavel, Fabian M. Suchanek

Abstract: Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.

Submitted 15 September, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: 43 pages, 38 figures. Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022)

arXiv:2203.07860 [pdf, other]

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Abstract: State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words. We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT), and makes it robust to OOV with few additional parameters. Extensive evaluations demonstrate that our lightweight model achieves similar or even better performances than prior competitors, both on original datasets and on corrupted variants. Moreover, it can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness.

Submitted 21 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Long paper accepted by ACL main conference. 17 pages

arXiv:2012.08844 [pdf, other]

A Lightweight Neural Model for Biomedical Entity Linking

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Abstract: Biomedical entity linking aims to map biomedical mentions, such as diseases and drugs, to standard entities in a given knowledge base. The specific challenge in this context is that the same biomedical entity can have a wide range of names, including synonyms, morphological variations, and names with different word orderings. Recently, BERT-based methods have advanced the state-of-the-art by allowing for rich representations of word sequences. However, they often have hundreds of millions of parameters and require heavy computing resources, which limits their applications in resource-limited scenarios. Here, we propose a lightweight neural method for biomedical entity linking, which needs just a fraction of the parameters of a BERT model and much less computing resources. Our method uses a simple alignment layer with attention mechanisms to capture the variations between mention and entity names. Yet, we show that our model is competitive with previous work on standard evaluation benchmarks.

Submitted 21 May, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

arXiv:2010.03527 [pdf, other]

Query Rewriting On Path Views Without Integrity Constraints

Authors: Julien Romero, Nicoleta Preda, Fabian Suchanek

Abstract: A view with a binding pattern is a parameterised query on a database. Such views are used, e.g., to model Web services. To answer a query on such views, one has to orchestrate the views together in execution plans. The goal is usually to find equivalent rewritings, which deliver precisely the same results as the query on all databases. However, such rewritings are usually possible only in the presence of integrity constraints - and not all databases have such constraints. In this paper, we describe a class of plans that give practical guarantees about their result even if there are no integrity constraints. We provide a characterisation of such plans and a complete and correct algorithm to enumerate them. Finally, we show that our method can find plans on real-world Web Services.

Submitted 7 October, 2020; originally announced October 2020.

Comments: This is the full version of the Datamod'2020 article, which integrates all reviewer feedback, with the same text as the publisher version except minor changes

arXiv:2009.11564 [pdf, other]

Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases

Authors: Gerhard Weikum, Luna Dong, Simon Razniewski, Fabian Suchanek

Abstract: Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics. This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.

Submitted 22 March, 2021; v1 submitted 24 September, 2020; originally announced September 2020.

Comments: Submitted to Foundations and Trends in Databases

Journal ref: Foundations and Trends in Databases, 2021

arXiv:2003.07316 [pdf, other]

Equivalent Rewritings on Path Views with Binding Patterns

Authors: Julien Romero, Nicoleta Preda, Antoine Amarilli, Fabian Suchanek

Abstract: A view with a binding pattern is a parameterized query on a database. Such views are used, e.g., to model Web services. To answer a query on such views, the views have to be orchestrated together in execution plans. We show how queries can be rewritten into equivalent execution plans, which are guaranteed to deliver the same results as the query on all databases. We provide a correct and complete algorithm to find these plans for path views and atomic queries. Finally, we show that our method can be used to answer queries on real-world Web services.

Submitted 19 March, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: 33 pages including 16 pages of main text. This is the full version of the ESWC'2020 article, which integrates all reviewer feedback, with the same text as the publisher version except minor changes. Several corrections relative to the first version

arXiv:1806.01139 [pdf, other]

Text to brain: predicting the spatial distribution of neuroimaging observations from text reports

Authors: Jérôme Dockès, Demian Wassermann, Russell Poldrack, Fabian Suchanek, Bertrand Thirion, Gaël Varoquaux

Abstract: Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of information untapped in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is formulated in terms of minimization of a risk on distributions which leads to a least-deviation cost function. An efficient algorithm in the dual then learns the mapping from documents to brain structures. Empirical results using coordinates extracted from the brain-imaging literature show that i) models must adapt to semantic variation in the terms used to describe a given anatomical structure, ii) voxel-wise parameterization leads to higher likelihood of locations reported in unseen documents, iii) least-deviation cost outperforms least-square. As a proof of concept for our method, we use our model of spatial distributions to predict the distribution of specific neurological conditions from text-only reports.

Submitted 28 June, 2018; v1 submitted 4 June, 2018; originally announced June 2018.

Journal ref: MICCAI 2018 - 21st International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2018, Granada, Spain. pp.1-18, 2018

arXiv:1612.05786 [pdf, other]

doi 10.1145/3018661.3018739

Predicting Completeness in Knowledge Bases

Authors: Luis Galárraga, Simon Razniewski, Antoine Amarilli, Fabian M. Suchanek

Abstract: Knowledge bases such as Wikidata, DBpedia, or YAGO contain millions of entities and facts. In some knowledge bases, the correctness of these facts has been evaluated. However, much less is known about their completeness, i.e., the proportion of real facts that the knowledge bases cover. In this work, we investigate different signals to identify the areas where a knowledge base is complete. We show that we can combine these signals in a rule mining approach, which allows us to predict where facts may be missing. We also show that completeness predictions can help other applications such as fact prediction.

Submitted 17 December, 2016; originally announced December 2016.

Comments: 21 pages, 19 references, 1 figure, 5 tables. Complete version of the article accepted at WSDM'17

arXiv:1505.00841 [pdf, other]

doi 10.1145/2767109.2767116

Harvesting Entities from the Web Using Unique Identifiers -- IBEX

Authors: Aliaksandr Talaika, Joanna Biega, Antoine Amarilli, Fabian M. Suchanek

Abstract: In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.

Submitted 4 May, 2015; originally announced May 2015.

Comments: 30 pages, 5 figures, 9 tables. Complete technical report for A. Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting Entities from the Web Using Unique Identifiers. WebDB workshop, 2015

arXiv:1111.7164 [pdf, other]

PARIS: Probabilistic Alignment of Relations, Instances, and Schema

Authors: Fabian M. Suchanek, Serge Abiteboul, Pierre Senellart

Abstract: One of the main challenges that the Semantic Web faces is the integration of a growing number of independently designed ontologies. In this work, we present PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance level cross-fertilize with alignments at the schema level. Thereby, our system provides a truly holistic solution to the problem of ontology alignment. The heart of the approach is probabilistic, i.e., we measure degrees of matchings based on probability estimates. This allows PARIS to run without any parameter tuning. We demonstrate the efficiency of the algorithm and its precision through extensive experiments. In particular, we obtain a precision of around 90% in experiments with some of the world's largest ontologies.

Submitted 30 November, 2011; originally announced November 2011.

Comments: VLDB2012. arXiv admin note: substantial text overlap with arXiv:1105.5516

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 3, pp. 157-168 (2011)

arXiv:1105.5516 [pdf, ps, other]

Ontology Alignment at the Instance and Schema Level

Authors: Fabian Suchanek, Serge Abiteboul, Pierre Senellart

Abstract: We present PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance-level cross-fertilize with alignments at the schema-level. Thereby, our system provides a truly holistic solution to the problem of ontology alignment. The heart of the approach is probabilistic. This allows PARIS to run without any parameter tuning. We demonstrate the efficiency of the algorithm and its precision through extensive experiments. In particular, we obtain a precision of around 90% in experiments with two of the world's largest ontologies.

Submitted 18 August, 2011; v1 submitted 27 May, 2011; originally announced May 2011.

Comments: Technical Report at INRIA RT-0408

Report number: RT-0408

Journal ref: N° RT-0408 (2011)

arXiv:1105.1930 [pdf]

Emerging multidisciplinary research across database management systems

Authors: Anisoara Nica, Fabian Suchanek, Aparna Varde

Abstract: The database community is exploring more and more multidisciplinary avenues: Data semantics overlaps with ontology management; reasoning tasks venture into the domain of artificial intelligence; and data stream management and information retrieval shake hands, e.g., when processing Web click-streams. These new research avenues become evident, for example, in the topics that doctoral students choose for their dissertations. This paper surveys the emerging multidisciplinary research by doctoral students in database systems and related areas. It is based on the PIKM 2010, which is the 3rd Ph.D. workshop at the International Conference on Information and Knowledge Management (CIKM). The topics addressed include ontology development, data streams, natural language processing, medical databases, green energy, cloud computing, and exploratory search. In addition to core ideas from the workshop, we list some open research questions in these multidisciplinary areas.

Submitted 10 May, 2011; originally announced May 2011.

Journal ref: SIGMOD Record (2011)

arXiv:1105.1929 [pdf]

The Hidden Web, XML and Semantic Web: A Scientific Data Management Perspective

Authors: Fabian Suchanek, Aparna Varde, Richi Nayak, Pierre Senellart

Abstract: The World Wide Web no longer consists just of HTML pages. Our work sheds light on a number of trends on the Internet that go beyond simple Web pages. The hidden Web provides a wealth of data in semi-structured form, accessible through Web forms and Web services. These services, as well as numerous other applications on the Web, commonly use XML, the eXtensible Markup Language. XML has become the lingua franca of the Internet that allows customized markups to be defined for specific domains. On top of XML, the Semantic Web grows as a common structured data source. In this work, we first explain each of these developments in detail. Using real-world examples from scientific domains of great interest today, we then demonstrate how these new developments can assist the managing, harvesting, and organization of data on the Web. On the way, we also illustrate the current research avenues in these domains. We believe that this effort would help bridge multiple database tracks, thereby attracting researchers with a view to extend database technology.

Submitted 10 May, 2011; originally announced May 2011.

Comments: EDBT - Tutorial (2011)

Showing 1–24 of 24 results for author: Suchanek, F