-
Dragoman: Efficiently Evaluating Declarative Mapping Languages over Frameworks for Knowledge Graph Creation
Authors:
Samaneh Jozashoori,
Enrique Iglesias,
Maria-Esther Vidal
Abstract:
In recent years, there have been valuable efforts and contributions to make the process of RDF knowledge graph creation traceable and transparent; extending and applying declarative mapping languages is an example. One challenging step is the traceability of procedures that aim to overcome interoperability issues, a.k.a. data-level integration. In most pipelines, data integration is performed by a…
▽ More
In recent years, there have been valuable efforts and contributions to make the process of RDF knowledge graph creation traceable and transparent; extending and applying declarative mapping languages is an example. One challenging step is the traceability of procedures that aim to overcome interoperability issues, a.k.a. data-level integration. In most pipelines, data integration is performed by ad-hoc programs, preventing traceability and reusability. However, formal frameworks provided by function-based declarative mapping languages such as FunUL and RML+FnO empower expressiveness. Data-level integration can be defined as functions and integrated as part of the mappings performing schema-level integration. However, combining functions with the mappings introduces a new source of complexity that can considerably impact the required number of resources and execution time. We tackle the problem of efficiently executing mappings with functions and formalize the transformation of them into function-free mappings. These transformations are the basis of an optimization process that aims to perform an eager evaluation of function-based mapping rules. These techniques are implemented in a framework named Dragoman. We demonstrate the correctness of the transformations while ensuring that the function-free data integration processes are equivalent to the original one. The effectiveness of Dragoman is empirically evaluated in 230 testbeds composed of various types of functions integrated with mapping rules of different complexity. The outcomes suggest that evaluating function-free mapping rules reduces execution time in complex knowledge graph creation pipelines composed of large data sources and multiple types of mapping rules. The savings can be up to 75%, suggesting that eagerly executing functions in mapping rules enable making these pipelines applicable and scalable in real-world settings.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
Knowledge4COVID-19: A Semantic-based Approach for Constructing a COVID-19 related Knowledge Graph from Various Sources and Analysing Treatments' Toxicities
Authors:
Ahmad Sakor,
Samaneh Jozashoori,
Emetis Niazmand,
Ariam Rivas,
Kostantinos Bougiatiotis,
Fotis Aisopos,
Enrique Iglesias,
Philipp D. Rohde,
Trupti Padiya,
Anastasia Krithara,
Georgios Paliouras,
Maria-Esther Vidal
Abstract:
In this paper, we present Knowledge4COVID-19, a framework that aims to showcase the power of integrating disparate sources of knowledge to discover adverse drug effects caused by drug-drug interactions among COVID-19 treatments and pre-existing condition drugs. Initially, we focus on constructing the Knowledge4COVID-19 knowledge graph (KG) from the declarative definition of mapping rules using the…
▽ More
In this paper, we present Knowledge4COVID-19, a framework that aims to showcase the power of integrating disparate sources of knowledge to discover adverse drug effects caused by drug-drug interactions among COVID-19 treatments and pre-existing condition drugs. Initially, we focus on constructing the Knowledge4COVID-19 knowledge graph (KG) from the declarative definition of mapping rules using the RDF Mapping Language. Since valuable information about drug treatments, drug-drug interactions, and side effects is present in textual descriptions in scientific databases (e.g., DrugBank) or in scientific literature (e.g., the CORD-19, the Covid-19 Open Research Dataset), the Knowledge4COVID-19 framework implements Natural Language Processing. The Knowledge4COVID-19 framework extracts relevant entities and predicates that enable the fine-grained description of COVID-19 treatments and the potential adverse events that may occur when these treatments are combined with treatments of common comorbidities, e.g., hypertension, diabetes, or asthma. Moreover, on top of the KG, several techniques for the discovery and prediction of interactions and potential adverse effects of drugs have been developed with the aim of suggesting more accurate treatments for treating the virus. We provide services to traverse the KG and visualize the effects that a group of drugs may have on a treatment outcome. Knowledge4COVID-19 was part of the Pan-European hackathon#EUvsVirus in April 2020 and is publicly available as a resource through a GitHub repository (https://github.com/SDM-TIB/Knowledge4COVID-19) and a DOI (https://zenodo.org/record/4701817#.YH336-8zbol).
△ Less
Submitted 7 October, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Scaling Up Knowledge Graph Creation to Large and Heterogeneous Data Sources
Authors:
Enrique Iglesias,
Samaneh Jozashoori,
Maria-Esther Vidal
Abstract:
RDF knowledge graphs (KG) are powerful data structures to represent factual statements created from heterogeneous data sources. KG creation is laborious and demands data management techniques to be executed efficiently. This paper tackles the problem of the automatic generation of KG creation processes declaratively specified; it proposes techniques for planning and transforming heterogeneous data…
▽ More
RDF knowledge graphs (KG) are powerful data structures to represent factual statements created from heterogeneous data sources. KG creation is laborious and demands data management techniques to be executed efficiently. This paper tackles the problem of the automatic generation of KG creation processes declaratively specified; it proposes techniques for planning and transforming heterogeneous data into RDF triples following mapping assertions specified in the RDF Mapping Language (RML). Given a set of mapping assertions, the planner provides an optimized execution plan by partitioning and scheduling the execution of the assertions. First, the planner assesses an optimized number of partitions considering the number of data sources, type of mapping assertions, and the associations between different assertions. After providing a list of partitions and assertions that belong to each partition, the planner determines their execution order. A greedy algorithm is implemented to generate the partitions' bushy tree execution plan. Bushy tree plans are translated into operating system commands that guide the execution of the partitions of the mapping assertions in the order indicated by the bushy tree. The proposed optimization approach is evaluated over state-of-the-art RML-compliant engines, and existing benchmarks of data sources and RML triples maps. Our experimental results suggest that the performance of the studied engines can be considerably improved, particularly in a complex setting with numerous triples maps and large data sources. As a result, engines that time out in complex cases are enabled to produce at least a portion of the KG applying the planner.
△ Less
Submitted 26 October, 2022; v1 submitted 24 January, 2022;
originally announced January 2022.
-
EABlock: A Declarative Entity Alignment Block for Knowledge Graph Creation Pipelines
Authors:
Samaneh Jozashoori,
Ahmad Sakor,
Enrique Iglesias,
Maria-Esther Vidal
Abstract:
Despite encoding enormous amount of rich and valuable data, existing data sources are mostly created independently, being a significant challenge to their integration. Mapping languages, e.g., RML and R2RML, facilitate declarative specification of the process of applying meta-data and integrating data into a knowledge graph. Mapping rules can also include knowledge extraction functions in addition…
▽ More
Despite encoding enormous amount of rich and valuable data, existing data sources are mostly created independently, being a significant challenge to their integration. Mapping languages, e.g., RML and R2RML, facilitate declarative specification of the process of applying meta-data and integrating data into a knowledge graph. Mapping rules can also include knowledge extraction functions in addition to expressing correspondences among data sources and a unified schema. Combining mapping rules and functions represents a powerful formalism to specify pipelines for integrating data into a knowledge graph transparently. Surprisingly, these formalisms are not fully adapted, and many knowledge graphs are created by executing ad-hoc programs to pre-process and integrate data. In this paper, we present EABlock, an approach integrating Entity Alignment (EA) as part of RML mapping rules. EABlock includes a block of functions performing entity recognition from textual attributes and link the recognized entities to the corresponding resources in Wikidata, DBpedia, and domain specific thesaurus, e.g., UMLS. EABlock provides agnostic and efficient techniques to evaluate the functions and transfer the mappings to facilitate its application in any RML-compliant engine. We have empirically evaluated EABlock performance, and results indicate that EABlock speeds up knowledge graph creation pipelines that require entity recognition and linking in state-of-the-art RML-compliant engines. EABlock is also publicly available as a tool through a GitHub repository(https://github.com/SDM-TIB/EABlock) and a DOI(https://doi.org/10.5281/zenodo.5779773).
△ Less
Submitted 15 December, 2021; v1 submitted 14 December, 2021;
originally announced December 2021.
-
FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation
Authors:
Samaneh Jozashoori,
David Chaves-Fraga,
Enrique Iglesias,
Maria-Esther Vidal,
Oscar Corcho
Abstract:
Data has exponentially grown in the last years, and knowledge graphs constitute powerful formalisms to integrate a myriad of existing data sources. Transformation functions -- specified with function-based mapping languages like FunUL and RML+FnO -- can be applied to overcome interoperability issues across heterogeneous data sources. However, the absence of engines to efficiently execute these map…
▽ More
Data has exponentially grown in the last years, and knowledge graphs constitute powerful formalisms to integrate a myriad of existing data sources. Transformation functions -- specified with function-based mapping languages like FunUL and RML+FnO -- can be applied to overcome interoperability issues across heterogeneous data sources. However, the absence of engines to efficiently execute these mapping languages hinders their global adoption. We propose FunMap, an interpreter of function-based mapping languages; it relies on a set of lossless rewriting rules to push down and materialize the execution of functions in initial steps of knowledge graph creation. Although applicable to any function-based mapping language that supports joins between mapping rules, FunMap feasibility is shown on RML+FnO. FunMap reduces data redundancy, e.g., duplicates and unused attributes, and converts RML+FnO mappings into a set of equivalent rules executable on RML-compliant engines. We evaluate FunMap performance over real-world testbeds from the biomedical domain. The results indicate that FunMap reduces the execution time of RML-compliant engines by up to a factor of 18, furnishing, thus, a scalable solution for knowledge graph creation.
△ Less
Submitted 5 October, 2020; v1 submitted 31 August, 2020;
originally announced August 2020.
-
SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs
Authors:
Enrique Iglesias,
Samaneh Jozashoori,
David Chaves-Fraga,
Diego Collarana,
Maria-Esther Vidal
Abstract:
In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, data complexity issues like large volume, high-duplicate rate, and heterogeneity usually characterize these data sources, being required data management tools able to address the impact negatively…
▽ More
In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, data complexity issues like large volume, high-duplicate rate, and heterogeneity usually characterize these data sources, being required data management tools able to address the impact negatively of these issues on the knowledge graph creation process. In this paper, we propose the SDM-RDFizer, an interpreter of the RDF Mapping Language (RML), to transform raw data in various formats into an RDF knowledge graph. SDM-RDFizer implements novel algorithms to execute the logical operators between mappings in RML, allowing thus to scale up to complex scenarios where data is not only broad but has a high-duplication rate. We empirically evaluate the SDM-RDFizer performance against diverse testbeds with diverse configurations of data volume, duplicates, and heterogeneity. The observed results indicate that SDM-RDFizer is two orders of magnitude faster than state of the art, thus, meaning that SDM-RDFizer an interoperable and scalable solution for knowledge graph creation. SDM-RDFizer is publicly available as a resource through a Github repository and a DOI.
△ Less
Submitted 17 August, 2020;
originally announced August 2020.
-
MapSDI: A Scaled-up Semantic Data Integration Framework for Knowledge Graph Creation
Authors:
Samaneh Jozashoori,
Maria-Esther Vidal
Abstract:
Semantic web technologies have significantly contributed with effective solutions for the problems of data integration and knowledge graph creation. However, with the rapid growth of big data in diverse domains, different interoperability issues still demand to be addressed, being scalability one of the main challenges. In this paper, we address the problem of knowledge graph creation at scale and…
▽ More
Semantic web technologies have significantly contributed with effective solutions for the problems of data integration and knowledge graph creation. However, with the rapid growth of big data in diverse domains, different interoperability issues still demand to be addressed, being scalability one of the main challenges. In this paper, we address the problem of knowledge graph creation at scale and provide MapSDI, a mapping rule-based framework for optimizing semantic data integration into knowledge graphs. MapSDI allows for the semantic enrichment of large-sized, heterogeneous, and potentially low-quality data efficiently. The input of MapSDI is a set of data sources and mapping rules being generated by a mapping language such as RML. First, MapSDI pre-processes the sources based on semantic information extracted from mapping rules, by performing basic database operators; it projects out required attributes, eliminates duplicates, and selects relevant entries. All these operators are defined based on the knowledge encoded by the mapping rules which will be then used by the semantification engine (or RDFizer) to produce a knowledge graph. We have empirically studied the impact of MapSDI on existing RDFizers, and observed that knowledge graph creation time can be reduced on average in one order of magnitude. It is also shown, theoretically, that the sources and rules transformations provided by MapSDI are data-lossless.
△ Less
Submitted 3 September, 2019;
originally announced September 2019.
-
Linked Open Data Validity -- A Technical Report from ISWS 2018
Authors:
Tayeb Abderrahmani Ghor,
Esha Agrawal,
Mehwish Alam,
Omar Alqawasmeh,
Claudia D'amato,
Amina Annane,
Amr Azzam,
Andrew Berezovskyi,
Russa Biswas,
Mathias Bonduel,
Quentin Brabant,
Cristina-iulia Bucur,
Elena Camossi,
Valentina Anita Carriero,
Shruthi Chari,
David Chaves Fraga,
Fiorela Ciroku,
Michael Cochez,
Hubert Curien,
Vincenzo Cutrona,
Rahma Dandan,
Danilo Dess,
Valerio Di Carlo,
Ahmed El Amine Djebri,
Marieke Van Erp
, et al. (46 additional authors not shown)
Abstract:
Linked Open Data (LOD) is the publicly available RDF data in the Web. Each LOD entity is identfied by a URI and accessible via HTTP. LOD encodes globalscale knowledge potentially available to any human as well as artificial intelligence that may want to benefit from it as background knowledge for supporting their tasks. LOD has emerged as the backbone of applications in diverse fields such as Natu…
▽ More
Linked Open Data (LOD) is the publicly available RDF data in the Web. Each LOD entity is identfied by a URI and accessible via HTTP. LOD encodes globalscale knowledge potentially available to any human as well as artificial intelligence that may want to benefit from it as background knowledge for supporting their tasks. LOD has emerged as the backbone of applications in diverse fields such as Natural Language Processing, Information Retrieval, Computer Vision, Speech Recognition, and many more. Nevertheless, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g. semantic heterogeneity, provenance, and data quality. As aptly stated by Heath et al. Linked Data might be outdated, imprecise, or simply wrong": there arouses a necessity to investigate the problem of linked data validity. This work reports a collaborative effort performed by nine teams of students, guided by an equal number of senior researchers, attending the International Semantic Web Research School (ISWS 2018) towards addressing such investigation from different perspectives coupled with different approaches to tackle the issue.
△ Less
Submitted 26 March, 2019;
originally announced March 2019.
-
Data Integration for Supporting Biomedical Knowledge Graph Creation at Large-Scale
Authors:
Samaneh Jozashoori,
Tatiana Novikova,
Maria-Esther Vidal
Abstract:
In recent years, following FAIR and open data principles, the number of available big data including biomedical data has been increased exponentially. In order to extract knowledge, these data should be curated, integrated, and semantically described. Accordingly, several semantic integration techniques have been developed; albeit effective, they may suffer from scalability in terms of different p…
▽ More
In recent years, following FAIR and open data principles, the number of available big data including biomedical data has been increased exponentially. In order to extract knowledge, these data should be curated, integrated, and semantically described. Accordingly, several semantic integration techniques have been developed; albeit effective, they may suffer from scalability in terms of different properties of big data. Even scaled-up approaches may be highly costly because tasks of semantification, curation and integration are performed independently. In order to overcome these issues, we devise ConMap, a semantic integration approach which exploits knowledge encoded in ontology in order to describe mapping rules to perform these tasks at the same time. Experimental results performed on different data sets suggest that ConMap can significantly reduce the time required for knowledge graph creation by up to 70\% of the time that is consumed following a traditional approach.
△ Less
Submitted 5 November, 2018;
originally announced November 2018.