-
CHAD-KG: A Knowledge Graph for Representing Cultural Heritage Objects and Digitisation Paradata
Authors:
Sebastian Barzaghi,
Arianna Moretti,
Ivan Heibi,
Silvio Peroni
Abstract:
This paper presents CHAD-KG, a knowledge graph designed to describe bibliographic metadata and digitisation paradata of cultural heritage objects in exhibitions, museums, and collections. It also documents the related data model and materialisation engine. Originally based on two tabular datasets, the data was converted into RDF according to CHAD-AP, an OWL application profile built on standards like CIDOC-CRM, LRMoo, CRMdig, and Getty AAT. A reproducible pipeline, developed with a Morph-KGC extension, was used to generate the graph. CHAD-KG now serves as the main metadata source for the Digital Twin of the temporary exhibition titled "The Other Renaissance - Ulisse Aldrovandi and The Wonders Of The World", and for other collections related to the digitisation work under development in a nationally funded project, i.e. Project CHANGES (https://fondazionechanges.org). To ensure accessibility and reuse, it offers a SPARQL endpoint, a user interface, and open documentation, and is published on Zenodo under a CC0 license. The project improves the semantic interoperability of cultural heritage data, with future work aiming to extend the data model and materialisation pipeline to better capture the complexities of acquisition and digitisation, further enrich the dataset and broaden its relevance to similar initiatives.
Submitted 19 May, 2025;
originally announced May 2025.
-
Validating and monitoring bibliographic and citation data in OpenCitations collections
Authors:
Ivan Heibi,
Silvio Peroni,
Elia Rizzetto
Abstract:
Purpose. The increasing emphasis on data quantity in research infrastructures has highlighted the need for equally robust mechanisms ensuring data quality, particularly in bibliographic and citation datasets. This paper addresses the challenge of maintaining high-quality open research information within OpenCitations, a community-guided Open Science Infrastructure, by introducing tools for validating and monitoring bibliographic metadata and citation data.
Methods. We developed a custom validation tool tailored to the OpenCitations Data Model (OCDM), designed to detect and explain ingestion errors from heterogeneous sources, whether due to upstream data inconsistencies or internal software bugs. Additionally, a quality monitoring tool was created to track known data issues post-publication. These tools were applied in two scenarios: (1) validating metadata and citations from Matilda, a potential future source, and (2) monitoring data quality in the existing OpenCitations Meta dataset.
Results. The validation tool successfully identified a variety of structural and semantic issues in the Matilda dataset, demonstrating its precision. The monitoring tool enabled the detection of recurring problems in the OpenCitations Meta collection, as well as their quantification. Together, these tools proved effective in enhancing the reliability of OpenCitations' published data.
Conclusion. The presented validation and monitoring tools represent a step toward ensuring high-quality bibliographic data in open research infrastructures, though they are limited to the data model adopted by OpenCitations. Future developments are aimed at expanding to additional data sources, with particular regard to crowdsourced data.
Submitted 16 April, 2025;
originally announced April 2025.
-
Mapping Research Data at the University of Bologna
Authors:
C. Basalti,
G. Caldoni,
S. Coppini,
B. Gualandi,
M. Marino,
F. Masini,
S. Peroni
Abstract:
Research data management (RDM) strategies and practices play a pivotal role in adhering to the paradigms of reproducibility and transparency by enabling research sharing in accordance with the principles of Open Science. Discipline-specificity is an essential factor when understanding RDM declinations, to tailor a comprehensive support service and to enhance interdisciplinarity.
In this paper we present the results of a mapping carried out to gather information on research data generated and managed within the University of Bologna (UniBO). The aim is to identify differences and commonalities between disciplines and potential challenges for institutional support.
We analyzed the data management plans (DMPs) of European competitive projects drafted by researchers affiliated with UniBO. We applied descriptive statistics to the collected variables to answer three main questions: How diverse is the range of data managed within the University of Bologna? Which trends of problems and patterns in terms of data management can influence/improve data stewardship service? Is there an interdisciplinary approach to data production within the University?
The research work evidenced many points of contact between different disciplines in terms of data produced, formats used and modest predilection for data reuse. Hot topics such as data confidentiality, needed either on privacy or intellectual property rights (IPR) premises, and long-term preservation pose challenges to all researchers.
These results show an increasing attention to RDM while highlighting the relevance of training and support to face the relatively new challenges posed by this approach.
Submitted 26 February, 2025;
originally announced March 2025.
-
Recent Developments in Deep Learning-based Author Name Disambiguation
Authors:
Francesca Cappelli,
Giovanni Colavizza,
Silvio Peroni
Abstract:
Author Name Disambiguation (AND) is a critical task for digital libraries aiming to link existing authors with their respective publications. Due to the lack of persistent identifiers used by researchers and the presence of intrinsic linguistic challenges, such as homonymy, the development of Deep Learning algorithms to address this issue has become widespread. Many AND deep learning methods have been developed, and surveys exist comparing the approaches in terms of techniques, complexity, and performance. However, none explicitly addresses AND methods in the context of deep learning in recent years (i.e. the 2016-2024 timeframe). In this paper, we provide a systematic review of state-of-the-art AND techniques based on deep learning, highlighting recent improvements, challenges, and open issues in the field. We find that DL methods have significantly impacted AND by enabling the integration of structured and unstructured data, and that hybrid approaches effectively balance supervised and unsupervised learning.
Submitted 23 December, 2024;
originally announced March 2025.
-
HERITRACE: A User-Friendly Semantic Data Editor with Change Tracking and Provenance Management for Cultural Heritage Institutions
Authors:
Arcangelo Massari,
Silvio Peroni
Abstract:
HERITRACE is a data editor designed for galleries, libraries, archives and museums, aimed at simplifying data curation while enabling non-technical domain experts to manage data intuitively without losing its semantic integrity. While the semantic nature of RDF can pose a barrier to data curation due to its complexity, HERITRACE conceals this intricacy while preserving the advantages of semantic representation. The system natively supports provenance management and change tracking, ensuring transparency and accountability throughout the curation process. Although HERITRACE functions effectively out of the box, it offers a straightforward customization interface for technical staff, enabling adaptation to the specific data model required by a given collection. Current applications include the ParaText project, and its adoption is already planned for OpenCitations. Future developments will focus on integrating the RDF Mapping Language (RML) to enhance compatibility with non-RDF data formats, further expanding its applicability in digital heritage management.
Submitted 27 January, 2025;
originally announced January 2025.
-
Analysing the coverage of the University of Bologna's publication metadata in an existing source of open research information
Authors:
Erica Andreose,
Salvatore Di Marzo,
Ivan Heibi,
Silvio Peroni,
Leonardo Zilli
Abstract:
This study focuses on analysing the coverage of publications' metadata available in the Current Research Information System (CRIS) infrastructure of the University of Bologna (UNIBO), implemented by the IRIS platform, within an authoritative source of open research information, i.e. OpenCitations. The analysis considers data regarding the publication entities alongside the citation links. We precisely quantify the proportion of UNIBO IRIS publications included in OpenCitations, examine their types, and evaluate the number of citations in OpenCitations that involve IRIS publications. Our methodology filters and transforms data dumps of IRIS and OpenCitations, creating novel datasets used for the analysis. Our findings reveal that only 37.7% of IRIS is covered in OpenCitations, with journal articles exhibiting the highest coverage. We identified 4,290,096 citation links pointing to UNIBO IRIS publications. From a purely quantitative perspective, comparing our results with broader proprietary services like Scopus and Web of Science reveals a small gap in the average number of citations per bibliographic resource. However, further analysis with updated data is required to support this speculation.
Submitted 10 January, 2025;
originally announced January 2025.
-
Leveraging virtual technologies to enhance museums and art collections: insights from project CHANGES
Authors:
Gianluca Genovese,
Ivan Heibi,
Silvio Peroni,
Sofia Pescarin
Abstract:
We investigated the use of virtual technologies to digitise and enhance cultural heritage (CH), aligning with Open Science and FAIR principles. Through case studies in museums, we developed reproducible workflows, 3D models, and tools fostering accessibility, inclusivity, and sustainability of CH. Applications include interdisciplinary research, educational innovation, and CH preservation.
Submitted 8 December, 2024;
originally announced December 2024.
-
The OpenCitations Index
Authors:
Ivan Heibi,
Arianna Moretti,
Silvio Peroni,
Marta Soricetti
Abstract:
This article presents the OpenCitations Index, a collection of open citation data maintained by OpenCitations, an independent, not-for-profit infrastructure organisation for open scholarship dedicated to publishing open bibliographic and citation data using Semantic Web and Linked Open Data technologies. The collection involves citation data harvested from multiple sources. To address the possibility of different sources providing citation data for bibliographic entities represented with different identifiers, therefore potentially representing the same citation, a deduplication mechanism has been implemented. This ensures that citations integrated into the OpenCitations Index are accurately and uniquely identified, even when different identifiers are used. This mechanism follows a specific workflow, which encompasses a preprocessing of the original source data, a management of the provided bibliographic metadata, and the generation of new citation data to be integrated into the OpenCitations Index. The process relies on another data collection, OpenCitations Meta, and on the use of a new globally persistent identifier, namely OMID (OpenCitations Meta Identifier). As of July 2024, the OpenCitations Index stores over 2 billion unique citation links, harvested from Crossref, the National Institute of Health Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). The OpenCitations Index can be systematically accessed and queried through several services, including a SPARQL endpoint, REST APIs, and web interfaces. Additionally, dataset dumps are available for free download and reuse (under CC0 waiver) in various formats (CSV, N-Triples, and Scholix), including provenance and change tracking information.
Submitted 5 August, 2024;
originally announced August 2024.
-
CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses
Authors:
Lorenzo Paolini,
Sahar Vahdati,
Angelo Di Iorio,
Robert Wardenga,
Ivan Heibi,
Silvio Peroni
Abstract:
Understanding the motivations underlying scholarly citations is essential to evaluate research impact and promote transparent scholarly communication. This study introduces CiteFusion, an ensemble framework designed to address the multi-class Citation Intent Classification task on two benchmark datasets: SciCite and ACL-ARC. The framework employs a one-vs-all decomposition of the multi-class task into class-specific binary sub-tasks, leveraging complementary pairs of SciBERT and XLNet models, independently tuned, for each citation intent. The outputs of these base models are aggregated through a feedforward neural network meta-classifier to reconstruct the original classification task. To enhance interpretability, SHAP (SHapley Additive exPlanations) is employed to analyze token-level contributions and interactions among base models, providing transparency into the classification dynamics of CiteFusion, and insights about the kind of misclassifications of the ensemble. In addition, this work investigates the semantic role of structural context by incorporating section titles, as framing devices, into input sentences, assessing their positive impact on classification accuracy. CiteFusion ultimately demonstrates robust performance in imbalanced and data-scarce scenarios: experimental results show that CiteFusion achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite, and 76.24% on ACL-ARC. Furthermore, to ensure interoperability and reusability, citation intents from both datasets' schemas are mapped to Citation Typing Ontology (CiTO) object properties, highlighting some overlaps. Finally, we describe and release a web-based application that classifies citation intents leveraging the CiteFusion models developed on SciCite.
Submitted 11 June, 2025; v1 submitted 18 July, 2024;
originally announced July 2024.
-
A Proposal for a FAIR Management of 3D Data in Cultural Heritage: The Aldrovandi Digital Twin Case
Authors:
Sebastian Barzaghi,
Alice Bordignon,
Bianca Gualandi,
Ivan Heibi,
Arcangelo Massari,
Arianna Moretti,
Silvio Peroni,
Giulia Renda
Abstract:
In this article we analyse 3D models of cultural heritage with the aim of answering three main questions: what processes can be put in place to create a FAIR-by-design digital twin of a temporary exhibition? What are the main challenges in applying FAIR principles to 3D data in cultural heritage studies and how are they different from other types of data (e.g. images) from a data management perspective? We begin with a comprehensive literature review touching on: FAIR principles applied to cultural heritage data; representation models; both Object Provenance Information (OPI) and Metadata Record Provenance Information (MRPI), respectively meant as, on the one hand, the detailed history and origin of an object, and - on the other hand - the detailed history and origin of the metadata itself, which describes the primary object (whether physical or digital); 3D models as cultural heritage research data and their creation, selection, publication, archival and preservation. We then describe the process of creating the Aldrovandi Digital Twin, by collecting, storing and modelling data about cultural heritage objects and processes. We detail the many steps from the acquisition of the Digital Cultural Heritage Objects (DCHO), through to the upload of the optimised DCHO onto a web-based framework (ATON), with a focus on open technologies and standards for interoperability and preservation. Using the FAIR Principles for Heritage Library, Archive and Museum Collections [1] as a framework, we look in detail at how the Digital Twin implements FAIR principles at the object and metadata level. We then describe the main challenges we encountered and we summarise what seem to be the peculiarities of 3D cultural heritage data and the possible directions for further research in this field.
Submitted 22 January, 2025; v1 submitted 2 July, 2024;
originally announced July 2024.
-
A Workflow for GLAM Metadata Crosswalk
Authors:
Arianna Moretti,
Ivan Heibi,
Silvio Peroni
Abstract:
The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, granted that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition "The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF format, to enhance Linked Open Data. The methodology adopts the RDF Mapping Language (RML) technology for converting data to RDF with a human contribution involvement. This last aspect entails an interaction between digital humanists and domain experts through surveys leading to the abstraction and reformulation of domain-specific knowledge, to be exploited in the process of formalizing and converting information.
Submitted 3 May, 2024;
originally announced May 2024.
-
Developing Application Profiles for Enhancing Data and Workflows in Cultural Heritage Digitisation Processes
Authors:
Sebastian Barzaghi,
Ivan Heibi,
Arianna Moretti,
Silvio Peroni
Abstract:
As a result of the proliferation of 3D digitisation in the context of cultural heritage projects, digital assets and digitisation processes - being considered as proper research objects - must prioritise adherence to FAIR principles. Existing standards and ontologies, such as CIDOC CRM, play a crucial role in this regard, but they are often over-engineered for the need of a particular application context, thus making their understanding and adoption difficult. Application profiles of a given standard - defined as sets of ontological entities drawn from one or more semantic artefacts for a particular context or application - are usually proposed as tools for promoting interoperability and reuse while being tied entirely to the particular application context they refer to. In this paper, we present an adaptation and application of an ontology development methodology, i.e. SAMOD, to guide the creation of robust, semantically sound application profiles of large standard models. Using an existing pilot study we have developed in a project dedicated to leveraging virtual technologies to preserve and valorise cultural heritage, we introduce an application profile named CHAD-AP, that we have developed following our customised version of SAMOD. We reflect on the use of SAMOD and similar ontology development methodologies for this purpose, highlighting its strengths and current limitations, future developments, and possible adoption in other similar projects.
Submitted 2 August, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Thinking Outside the Black Box: Insights from a Digital Exhibition in the Humanities
Authors:
Sebastian Barzaghi,
Alice Bordignon,
Bianca Gualandi,
Silvio Peroni
Abstract:
One of the main goals of Open Science is to make research more reproducible. There is no consensus, however, on what exactly "reproducibility" is, as opposed for example to "replicability", and how it applies to different research fields. After a short review of the literature on reproducibility/replicability with a focus on the humanities, we describe how the creation of the digital twin of the temporary exhibition "The Other Renaissance" has been documented throughout, with different methods, but with constant attention to research transparency, openness and accountability. A careful documentation of the study design, data collection and analysis techniques helps reflect and make all possible influencing factors explicit, and is a fundamental tool for reliability and rigour and for opening the "black box" of research.
Submitted 10 April, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
HERITRACE: Tracing Evolution and Bridging Data for Streamlined Curatorial Work in the GLAM Domain
Authors:
Arcangelo Massari,
Silvio Peroni
Abstract:
HERITRACE is a semantic data management system tailored for the GLAM sector. It is engineered to streamline data curation for non-technical users while also offering an efficient administrative interface for technical staff. The paper compares HERITRACE with other established platforms such as OmekaS, Semantic MediaWiki, Research Space, and CLEF, emphasizing its advantages in user friendliness, provenance management, change tracking, customization capabilities, and data integration. The system leverages SHACL for data modeling and employs the OpenCitations Data Model (OCDM) for provenance and change tracking, ensuring a harmonious blend of advanced technical features and user accessibility. Future developments include the integration of a robust authentication system and the expansion of data compatibility via the RDF Mapping Language (RML), enhancing HERITRACE's utility in digital heritage management.
Submitted 24 April, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex
Authors:
Elia Rizzetto,
Silvio Peroni
Abstract:
This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.
Submitted 27 December, 2023;
originally announced December 2023.
-
Saving temporary exhibitions in virtual environments: the Digital Renaissance of Ulisse Aldrovandi -- acquisition and digitisation of cultural heritage objects
Authors:
Roberto Balzani,
Sebastian Barzaghi,
Gabriele Bitelli,
Federica Bonifazi,
Alice Bordignon,
Luca Cipriani,
Simona Colitti,
Federica Collina,
Marilena Daquino,
Francesca Fabbri,
Bruno Fanini,
Filippo Fantini,
Daniele Ferdani,
Giulia Fiorini,
Elena Formia,
Anna Forte,
Federica Giacomini,
Valentina Alena Girelli,
Bianca Gualandi,
Ivan Heibi,
Alessandro Iannucci,
Rachele Manganelli Del Fà,
Arcangelo Massari,
Arianna Moretti,
Silvio Peroni
, et al. (8 additional authors not shown)
Abstract:
As per the objectives of Project CHANGES, particularly its thematic sub-project on the use of virtual technologies for museums and art collections, our goal was to obtain a digital twin of the temporary exhibition on Ulisse Aldrovandi called "The Other Renaissance", and make it accessible to users online. After a preliminary study of the exhibition, focussing on acquisition constraints and related solutions, we proceeded with the digital twin creation by acquiring, processing, modelling, optimising, exporting, and metadating the exhibition. We made hybrid use of two acquisition techniques to create new digital cultural heritage objects and environments, and we used open technologies, formats, and protocols to make available the final digital product. Here, we describe the process of collecting and curating bibliographical exhibition (meta)data and the beginning of the digital twin creation to foster its findability, accessibility, interoperability, and reusability. The creation of the digital twin is currently ongoing.
Submitted 27 December, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Retractions in Arts and Humanities: an Analysis of the Retraction Notices
Authors:
Ivan Heibi,
Silvio Peroni
Abstract:
The aim of this work is to understand the retraction phenomenon in the arts and humanities domain through an analysis of the retraction notices: formal documents stating and describing the retraction of a particular publication. The retractions and the corresponding notices are identified using the data provided by Retraction Watch. Our methodology for the analysis combines a metadata analysis and a content analysis (mainly performed using a topic modeling process) of the retraction notices. Considering 343 cases of retraction, we found that many retraction notices are neither identifiable nor findable. In addition, these were not always separated from the original papers, introducing ambiguity in understanding how these notices were perceived by the community (i.e., cited). Also, we noticed that there is no systematic way to write a retraction notice. Indeed, some retraction notices presented a complete discussion of the reasons for retraction, while others tended to be more direct and succinct. We have also reported many notices having similar text while addressing different retractions. We think a further study with a larger collection should be done using the same methodology to confirm and investigate our findings further.
Submitted 25 August, 2023;
originally announced August 2023.
-
A Prototype for a Controlled and Valid RDF Data Production Using SHACL
Authors:
Elia Rizzetto,
Arcangelo Massari,
Ivan Heibi,
Silvio Peroni
Abstract:
The paper introduces a tool prototype that combines SHACL's capabilities with ad-hoc validation functions to create a controlled and user-friendly form interface for producing valid RDF data. The proposed tool is developed within the context of the OpenCitations Data Model (OCDM) use case. The paper discusses the current status of the tool, outlines the future steps required for achieving full functionality, and explores the potential applications and benefits of the tool.
Submitted 4 July, 2023;
originally announced July 2023.
-
OpenCitations Meta
Authors:
Arcangelo Massari,
Fabio Mariani,
Ivan Heibi,
Silvio Peroni,
David Shotton
Abstract:
OpenCitations Meta is a new database that contains bibliographic metadata of scholarly publications involved in citations indexed by the OpenCitations infrastructure. It adheres to Open Science principles and provides data under a CC0 license for maximum reuse. The data can be accessed through a SPARQL endpoint, REST APIs, and dumps. OpenCitations Meta serves three important purposes. Firstly, it enables disambiguation of citations between publications described using different identifiers from various sources. For example, it can link publications identified by DOIs in Crossref and PMIDs in PubMed. Secondly, it assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to bibliographic resources without existing external persistent identifiers like DOIs. Lastly, by hosting the bibliographic metadata internally, OpenCitations Meta improves the speed of metadata retrieval for citing and cited documents. The database is populated through automated data curation, including deduplication, error correction, and metadata enrichment. The data is stored in RDF format following the OpenCitations Data Model, and changes and provenance information are tracked. OpenCitations Meta currently incorporates data from Crossref, DataCite, and the NIH Open Citation Collection. In terms of semantic publishing datasets, it is currently the first in data volume.
Submitted 28 June, 2023;
originally announced June 2023.
-
Representing provenance and track changes of cultural heritage metadata in RDF: a survey of existing approaches
Authors:
Arcangelo Massari,
Silvio Peroni,
Francesca Tomasi,
Ivan Heibi
Abstract:
In the realm of Digital Humanities, the management of cultural heritage metadata is pivotal for ensuring data trustworthiness. Provenance information - contextual metadata detailing the origin and history of data - plays a crucial role in this process. However, tracking provenance and changes in metadata using the Resource Description Framework (RDF) presents significant challenges due to the limitations of foundational Semantic Web technologies. This article offers a comprehensive review of existing models and approaches for representing provenance and tracking changes in RDF, with a specific focus on cultural heritage metadata. It examines W3C standard proposals such as RDF Reification and n-ary relations, along with various alternative systems. Through an in-depth analysis, the study identifies Named Graphs, RDF*, the Provenance Ontology (PROV-O), Dublin Core (DC), Conjectural Graphs, and the OpenCitations Data Model (OCDM) as the most effective solutions. These models are evaluated based on their compliance with RDF standards, scalability, and applicability across different domains. The findings underscore the importance of selecting the appropriate model to ensure robust and reliable management of provenance in RDF datasets, thereby contributing to the ongoing discourse on provenance representation in the Digital Humanities.
Submitted 22 September, 2024; v1 submitted 15 May, 2023;
originally announced May 2023.
-
A maturity model for catalogues of semantic artefacts
Authors:
Oscar Corcho,
Fajar J. Ekaputra,
Ivan Heibi,
Clement Jonquet,
Andras Micsik,
Silvio Peroni,
Emanuele Storti
Abstract:
This work presents a maturity model for assessing catalogues of semantic artefacts, one of the keystones that permit semantic interoperability of systems. We defined the dimensions and related features to include in the maturity model by analysing the current literature and existing catalogues of semantic artefacts provided by experts. In addition, we assessed 26 different catalogues to demonstrate the effectiveness of the maturity model, which includes 12 different dimensions (Metadata, Openness, Quality, Availability, Statistics, PID, Governance, Community, Sustainability, Technology, Transparency, and Assessment) and 43 related features (or sub-criteria) associated with these dimensions. Such a maturity model is one of the first attempts to provide recommendations for governance and processes for preserving and maintaining semantic artefacts and helps assess/address interoperability challenges.
Submitted 24 March, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Performing live time-traversal queries via SPARQL on RDF datasets
Authors:
Arcangelo Massari,
Silvio Peroni
Abstract:
This article introduces a methodology to perform live time-traversal SPARQL queries on RDF datasets and software based on this methodology that offers a solution to manage the provenance and change-tracking of entities described using RDF. These are crucial factors in ensuring verifiability and trust. Nevertheless, some of the most prominent knowledge bases - including DBpedia, Wikidata, Yago, and the Dynamic Linked Data Observatory - do not support time-agnostic queries, i.e., queries across different snapshots together with provenance information. The OpenCitations Data Model (OCDM) describes one possible way to track provenance and entities' changes in RDF datasets, and it allows restoring an entity to a specific status in time (i.e., a snapshot) by applying SPARQL update queries. The methodology and library presented in this article are based on the rationale introduced in the OCDM. We also developed benchmarks proving that such a procedure is efficient for specific queries and less efficient for others. To the best of our knowledge, our library is the only one to support all the time-related retrieval functionalities live, i.e., enabling real-time searches and updates. Moreover, since OCDM complies with standard RDF, queries are expressed via standard SPARQL.
Submitted 12 October, 2022; v1 submitted 5 October, 2022;
originally announced October 2022.
-
Approaching Digital Humanities at the University: a Cultural Challenge
Authors:
Silvio Peroni,
Francesca Tomasi
Abstract:
The University of Bologna has a long tradition in Digital Humanities, both at the level of research and teaching. In this article, we want to introduce some experiences in developing new educational models based on the idea of transversal learning, collaborative approaches and projects-oriented outputs, together with the definition of research fields within this vast domain, accompanied by practical examples. The creation of an international master's degree (DHDK), a PhD (CHeDE) and a research centre (/DH.arc) are the results of refining our notion of Digital Humanities in a new bidirectional way: to reflect on computational methodologies and models in the cultural sphere and to suggest a cultural approach to Informatics.
Submitted 27 November, 2022; v1 submitted 13 September, 2022;
originally announced September 2022.
-
OpenCitations, an open e-infrastructure to foster maximum reuse of citation data
Authors:
Chiara Di Giambattista,
Ivan Heibi,
Silvio Peroni,
David Shotton
Abstract:
OpenCitations is an independent not-for-profit infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. OpenCitations collaborates with projects that are part of the Open Science ecosystem and complies with the UNESCO founding principles of Open Science, the I4OC recommendations, and the FAIR data principles that data should be Findable, Accessible, Interoperable and Reusable. Since its data satisfies all the Reuse guidelines provided by FAIR in terms of richness, provenance, usage licenses and domain-relevant community standards, OpenCitations provides an example of a successful open e-infrastructure in which the reusability of data is integral to its mission.
Submitted 15 June, 2022;
originally announced June 2022.
-
Enabling Portability and Reusability of Open Science Infrastructures
Authors:
Giuseppe Grieco,
Ivan Heibi,
Arcangelo Massari,
Arianna Moretti,
Silvio Peroni
Abstract:
This paper presents a methodology for designing a containerized and distributed open science infrastructure to simplify its reusability, replicability, and portability in different environments. The methodology is depicted in a step-by-step schema based on four main phases: (1) Analysis, (2) Design, (3) Definition, and (4) Managing and provisioning. We accompany the description of each step with existing technologies and concrete examples of application.
Submitted 28 July, 2022; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Structured references from PDF articles: assessing the tools for bibliographic reference extraction and parsing
Authors:
Alessia Cioffi,
Silvio Peroni
Abstract:
Many solutions have been provided to extract bibliographic references from PDF papers. Machine learning, rule-based and regular-expression approaches were among the most used methods adopted in tools for addressing this task. This work aims to identify and evaluate all and only the tools which, given a full-text paper in PDF format, can recognise, extract and parse bibliographic references. We identified seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Anystyle obtained the best overall score, followed by Cermine. However, in some subject areas, other tools had better results for specific tasks.
Submitted 6 September, 2022; v1 submitted 29 May, 2022;
originally announced May 2022.
-
The way we cite: common metadata used across disciplines for defining bibliographic references
Authors:
Erika Alves dos Santos,
Silvio Peroni,
Marcos Luiz Mucheroni
Abstract:
Current citation practices observed in articles are very noisy, confusing, and not standardised, making identifying the cited works problematic for humans and any reference extraction software. In this work, we want to investigate such citation practices for referencing different types of entities and, in particular, to understand the most used metadata in bibliographic references. We identified 36 types of cited entities (the most cited ones were articles, books, and proceeding papers) within the 34,140 bibliographic references extracted from a vast set of journal articles on 27 different subject areas. The analysis of such bibliographic references, grouped by the particular type of cited entities, enabled us to highlight the most used metadata for defining bibliographic references across the subject areas. However, we also noticed that, in some cases, bibliographic references did not provide the essential elements to identify the work they refer to easily.
Submitted 21 July, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
What do we mean by "data"? A proposed classification of data types in the arts and humanities
Authors:
Bianca Gualandi,
Luca Pareschi,
Silvio Peroni
Abstract:
Purpose: This article describes the interviews we conducted in late 2021 with 19 researchers at the Department of Classical Philology and Italian Studies at the University of Bologna. The main purpose was to shed light on the definition of the word "data" in the humanities domain, as far as FAIR data management practices are concerned, and on what researchers think of the term. Methodology: We invited one researcher for each of the official disciplinary areas represented within the department and all 19 agreed to participate in the study. Participants were then divided into 5 main research areas: philology and literary criticism, language and linguistics, history of art, computer science, archival studies. The interviews were transcribed and analysed using a grounded theory approach. Findings: A list of 13 research data types has been compiled thanks to the information collected from participants. The term "data" does not emerge as especially problematic, although a good deal of confusion remains. Looking at current research management practices, methodologies and teamwork appear more central than previously reported. Originality: Our findings confirm that "data" within the FAIR framework should include all types of inputs and outputs that humanities researchers work with, including publications. Also, the participants in this study appear ready for a discussion around making their research data FAIR: they do not find the terminology particularly problematic, while they rely on precise and recognised methodologies, as well as on sharing and collaboration with colleagues.
Submitted 8 November, 2022; v1 submitted 13 May, 2022;
originally announced May 2022.
-
An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic referencing and citing practices
Authors:
Erika Alves dos Santos,
Silvio Peroni,
Marcos Luiz Mucheroni
Abstract:
Purpose. In this study, we want to identify current possible causes for citing and referencing errors in scholarly literature, to assess whether anything has changed since the snapshot provided by Sweetland in his 1989 paper. Design/methodology/approach. We analysed reference elements, i.e. bibliographic references, mentions, quotations, and respective in-text reference pointers, from 729 articles published in 147 journals across 27 subject areas. Findings. The outcomes of our analysis pointed out that bibliographic errors have been perpetuated for decades and that their possible causes have increased, despite the encouraged use of technological facilities, i.e., reference managers. Originality. As far as we know, our study is the best recent available analysis of errors in referencing and citing practices in the literature since Sweetland (1989).
Submitted 10 June, 2023; v1 submitted 17 February, 2022;
originally announced February 2022.
-
A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals
Authors:
Cristian Santini,
Genet Asefa Gesese,
Silvio Peroni,
Aldo Gangemi,
Harald Sack,
Mehwish Alam
Abstract:
Scholarly data is growing continuously, containing information about articles from a plethora of venues including conferences, journals, etc. Many initiatives have been taken to make scholarly data available as Knowledge Graphs (KGs). These efforts to standardize these data and make them accessible have also led to many challenges such as exploration of scholarly articles, ambiguous authors, etc. This study more specifically targets the problem of Author Name Disambiguation (AND) on Scholarly KGs and presents a novel framework, Literally Author Name Disambiguation (LAND), which utilizes Knowledge Graph Embeddings (KGEs) using multimodal literal information generated from these KGs. This framework is based on three components: 1) Multimodal KGEs, 2) A blocking procedure, and finally, 3) Hierarchical Agglomerative Clustering. Extensive experiments have been conducted on two newly created KGs: (i) a KG containing information from Scientometrics Journal from 1978 onwards (OC-782K), and (ii) a KG extracted from a well-known benchmark for AND provided by AMiner (AMiner-534K). The results show that our proposed architecture outperforms our baselines by 8-14% in terms of the F1 score and shows competitive performance on a challenging benchmark such as AMiner. The code and the datasets are publicly available through GitHub: https://github.com/sntcristian/and-kge and Zenodo: https://doi.org/10.5281/zenodo.6309855 respectively.
Submitted 1 June, 2022; v1 submitted 24 January, 2022;
originally announced January 2022.
-
Identifying and correcting invalid citations due to DOI errors in Crossref data
Authors:
Alessia Cioffi,
Sara Coppini,
Arcangelo Massari,
Arianna Moretti,
Silvio Peroni,
Cristian Santini,
Nooshin Shahidzadeh Asadi
Abstract:
This work aims to identify classes of DOI mistakes by analysing the open bibliographic metadata available in Crossref, highlighting which publishers were responsible for such mistakes and how many of these incorrect DOIs could be corrected through automatic processes. By using a list of invalid cited DOIs gathered by OpenCitations while processing the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI) in the past two years, we retrieved the citations in the January 2021 Crossref dump to such invalid DOIs. We processed these citations by keeping track of their validity and the publishers responsible for uploading the related citation data in Crossref. Finally, we identified patterns of factual errors in the invalid DOIs and the regular expressions needed to catch and correct them. The outcomes of this research show that only a few publishers were responsible for and/or affected by the majority of invalid citations. We extended the taxonomy of DOI name errors proposed in past studies and defined more elaborated regular expressions that can clean a higher number of mistakes in invalid DOIs than prior approaches. The data gathered in our study can enable investigating possible reasons for DOI mistakes from a qualitative point of view, helping publishers identify the problems underlying their production of invalid citation data. Also, the DOI cleaning mechanism we present could be integrated into the existing process (e.g. in COCI) to add citations by automatically correcting a wrong DOI. This study was run strictly following Open Science principles, and, as such, our research outcomes are fully reproducible.
Submitted 7 March, 2022; v1 submitted 22 November, 2021;
originally announced November 2021.
-
A quantitative and qualitative open citation analysis of retracted articles in the humanities
Authors:
Ivan Heibi,
Silvio Peroni
Abstract:
In this article, we show and discuss the results of a quantitative and qualitative analysis of open citations to retracted publications in the humanities domain. Our study was conducted by selecting retracted papers in the humanities domain and marking their main characteristics (e.g., retraction reason). Then, we gathered the citing entities and annotated their basic metadata (e.g., title, venue, subject, etc.) and the characteristics of their in-text citations (e.g., intent, sentiment, etc.). Using these data, we performed a quantitative and qualitative study of retractions in the humanities, presenting descriptive statistics and a topic modeling analysis of the citing entities' abstracts and the in-text citation contexts. As part of our main findings, we noticed that there was no drop in the overall number of citations after the year of retraction, with few entities that either mentioned the retraction or expressed a negative sentiment toward the cited publication. In addition, on several occasions, we noticed a higher concern/awareness about citing a retracted publication among the citing entities belonging to the health sciences domain, compared to the humanities and the social science domains. Philosophy, arts, and history are the humanities areas that showed the highest concern toward the retraction.
Submitted 10 October, 2022; v1 submitted 9 November, 2021;
originally announced November 2021.
-
Open bibliographic data and the Italian National Scientific Qualification: measuring coverage of academic fields
Authors:
Federica Bologna,
Angelo Di Iorio,
Silvio Peroni,
Francesco Poggi
Abstract:
The importance of open bibliographic repositories is widely accepted by the scientific community. For evaluation processes, however, there is still some skepticism: even if large repositories of open access articles and free publication indexes exist and are continuously growing, assessment procedures still rely on proprietary databases, mainly due to the richness of the data available in these proprietary databases and the services provided by the companies they are offered by. This paper investigates the status of open bibliographic data of three of the most used open resources, namely Microsoft Academic Graph, Crossref and OpenAIRE, evaluating their potentialities as substitutes of proprietary databases for academic evaluation processes. We focused on the Italian National Scientific Qualification (NSQ), the Italian process for University Professor qualification, which uses data from commercial indexes, and investigated similarities and differences between research areas, disciplines and application roles. The main conclusion is that open datasets are ready to be used for some disciplines, among which mathematics, natural sciences, economics and statistics, even if there is still room for improvement; but there is still a large gap to fill in others - like history, philosophy, pedagogy and psychology - and a stronger effort is required from researchers and institutions.
Submitted 13 May, 2022; v1 submitted 5 October, 2021;
originally announced October 2021.
-
The case for the Humanities Citation Index (HuCI): a citation index by the humanities, for the humanities
Authors:
Giovanni Colavizza,
Silvio Peroni,
Matteo Romanello
Abstract:
Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no comprehensive citation index is published by a public organization. We argue that an open citation index for the humanities is desirable, for four reasons: it would greatly improve and accelerate the retrieval of sources, it would offer a way to interlink collections across repositories (such as archives and libraries), it would foster the adoption of metadata standards and best practices by all stakeholders (including publishers) and it would contribute research data to fields such as bibliometrics and science studies. We also suggest that the citation index should be informed by a set of requirements relevant to the humanities. We discuss four such requirements: source coverage must be comprehensive, including books and citations to primary sources; there needs to be chronological depth, as scholarship in the humanities remains relevant over time; the index should be collection-driven, leveraging the accumulated thematic collections of specialized research libraries; and it should be rich in context in order to allow for the qualification of each citation, for example by providing citation excerpts. We detail the fit-for-purpose research infrastructure which can make the Humanities Citation Index a reality. Ultimately, we argue that a citation index for the humanities can be created by humanists, via a collaborative, distributed and open effort.
Submitted 14 May, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
-
A map of Digital Humanities research across bibliographic data sources
Authors:
Gianmarco Spinaci,
Giovanni Colavizza,
Silvio Peroni
Abstract:
Purpose. This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, by further highlighting the relations among DH and other disciplines. Methodology. We created a list of DH journals based on manual curation and bibliometric data. We used that list to identify DH publications in the bibliographic data sources under consideration. We used the ERIH-PLUS list of journals to identify Social Sciences and Humanities (SSH) publications. We analysed the citation links they included to understand the relationship between DH publications and SSH and non-SSH fields. Findings. Crossref emerges as the database containing the highest number of DH publications. Citations from and to DH publications show strong connections between DH and research in Computer Science, Linguistics, Psychology, and Pedagogical & Educational Research. Computer Science is responsible for a large part of incoming and outgoing citations to and from DH research, which suggests a reciprocal interest between the two disciplines. Value. This is the first bibliometric study of DH research involving several bibliographic data sources, including open and proprietary databases. Research limitations. The list of DH journals we created might be only partially representative of broader DH research. In addition, some DH publications could have been cut off from the study since we did not consider books and other publications published in proceedings of DH conferences and workshops. Finally, we used a specific time coverage (2000-2018) that could have prevented the inclusion of additional DH publications.
Submitted 1 March, 2022; v1 submitted 27 August, 2021;
originally announced August 2021.
-
BiblioDAP: The 1st Workshop on Bibliographic Data Analysis and Processing
Authors:
Zeyd Boukhers,
Philipp Mayr,
Silvio Peroni
Abstract:
Automatic processing of bibliographic data has become very important in digital libraries, data science and machine learning, both because it is needed to keep pace with the significant yearly increase in published papers and because of the inherent challenges it poses. This processing has several aspects including but not limited to I) Automatic extraction of references from PDF documents, II) Building an accurate citation graph, III) Author name disambiguation, etc. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graph) and unstructured (e.g. publications) formats. Therefore, it requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.
Submitted 23 June, 2021;
originally announced June 2021.
-
Academics evaluating academics: a methodology to inform the review process on top of open citations
Authors:
Federica Bologna,
Angelo Di Iorio,
Silvio Peroni,
Francesco Poggi
Abstract:
In the past, several works have investigated ways for combining quantitative and qualitative methods in research assessment exercises. In this work, we aim at introducing a methodology to explore whether citation-based metrics, calculated only considering open bibliographic and citation data, can yield insights on how human peer-review of research assessment exercises is conducted. To understand if and what metrics provide relevant information, we propose to use a series of machine learning models to replicate the decisions of the committees of the research assessment exercises.
Submitted 10 June, 2021;
originally announced June 2021.
-
A protocol to gather, characterize and analyze incoming citations of retracted articles
Authors:
Ivan Heibi,
Silvio Peroni
Abstract:
In this article, we present a methodology which takes as input a collection of retracted articles, gathers the entities citing them, characterizes such entities according to multiple dimensions (disciplines, year of publication, sentiment, etc.), and applies a quantitative and qualitative analysis on the collected values. The methodology is composed of four phases: (1) identifying, retrieving, and extracting basic metadata of the entities which have cited a retracted article, (2) extracting and labeling additional features based on the textual content of the citing entities, (3) building a descriptive statistical summary based on the collected data, and finally (4) running a topic modeling analysis. The goal of the methodology is to generate data and visualizations that help understanding possible behaviors related to retraction cases. We present the methodology in a structured step-by-step form following its four phases, discuss its limits and possible workarounds, and list the planned future improvements.
Submitted 3 June, 2021;
originally announced June 2021.
-
Can we assess research using open scientific knowledge graphs? A case study within the Italian National Scientific Qualification
Authors:
Federica Bologna,
Angelo Di Iorio,
Silvio Peroni,
Francesco Poggi
Abstract:
The need for open scientific knowledge graphs is ever increasing. While there are large repositories of open access articles and free publication indexes, there are still few free knowledge graphs exposing citation networks, and often their coverage is partial. Consequently, most evaluation processes based on citation counts rely on commercial citation databases. Things are changing thanks to the Initiative for Open Citations (I4OC, https://i4oc.org) and the Initiative for Open Abstracts (I4OA, https://i4oa.org), whose goal is to campaign for scholarly publishers to open the reference lists and the other metadata of their articles. This paper investigates the growth of the open bibliographic metadata and open citations in two scientific knowledge graphs, OpenCitations' COCI and Crossref, with an experiment on the Italian National Scientific Qualification (NSQ), the National process for University Professor qualification which uses data from commercial indexes. We simulated the procedure by only using such open data and explored similarities and differences with the official results. The outcomes of the experiment show that the amount of open bibliographic metadata and open citation data currently available in the two scientific knowledge graphs adopted is not yet enough for obtaining results similar to those provided using commercial databases.
Submitted 18 May, 2021;
originally announced May 2021.
-
Do open citations give insights on the qualitative peer-review evaluation in research assessments? An analysis of the Italian National Scientific Qualification
Authors:
Federica Bologna,
Angelo Di Iorio,
Silvio Peroni,
Francesco Poggi
Abstract:
In the past, several works have investigated ways for combining quantitative and qualitative methods in research assessment exercises. Indeed, the Italian National Scientific Qualification (NSQ), i.e. the national assessment exercise which aims at deciding whether a scholar can apply to professorial academic positions as Associate Professor and Full Professor, adopts a quantitative and qualitative evaluation process: it makes use of bibliometrics followed by a peer-review process of candidates' CVs. The NSQ divides academic disciplines into two categories, i.e. citation-based disciplines (CDs) and non-citation-based disciplines (NDs), a division that affects the metrics used for assessing the candidates of that discipline in the first part of the process, which is based on bibliometrics. In this work, we aim at exploring whether citation-based metrics, calculated only considering open bibliographic and citation data, can support the human peer-review of NDs and yield insights on how it is conducted. To understand if and what citation-based (and, possibly, other) metrics provide relevant information, we created a series of machine learning models to replicate the decisions of the NSQ committees. As one of the main outcomes of our study, we noticed that the strength of the citational relationship between the candidate and the commission in charge of assessing his/her CV seems to play a role in the peer-review phase of the NSQ of NDs.
Submitted 23 October, 2022; v1 submitted 14 March, 2021;
originally announced March 2021.
-
A qualitative and quantitative analysis of open citations to retracted articles: the Wakefield et al.'s case
Authors:
Ivan Heibi,
Silvio Peroni
Abstract:
In this article, we show the results of a quantitative and qualitative analysis of open citations to a popular and highly cited retracted paper: "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" by Wakefield et al., published in 1998. The main purpose of our study is to understand the behavior of the publications citing retracted articles and the characteristics of the citations the retracted articles accumulated over time. Our analysis is based on a methodology which illustrates how we gathered the data, extracted the topics of the citing articles, and visualized the results. The data and services used are all open and free to foster the reproducibility of the analysis. The outcomes concern the analysis of the entities citing Wakefield et al.'s article and their related in-text citations. We observed a constantly increasing number of citations in the last 20 years, accompanied by a constant increase in the percentage of those acknowledging its retraction. Citing articles started either discussing or dealing with the retraction of Wakefield et al.'s article even before its full retraction, which happened in 2010. Articles in the social sciences domain citing Wakefield et al.'s article were among those that discussed its retraction the most. In addition, when observing the in-text citations, we noticed that a large part of the citations received by Wakefield et al.'s article focused on general discussions without recalling strictly medical details, especially after the full retraction. Medical studies did not hesitate to acknowledge the retraction and often provided strong negative statements on it.
Submitted 24 May, 2021; v1 submitted 21 December, 2020;
originally announced December 2020.
-
MITAO: a tool for enabling scholars in the Humanities to use Topic Modelling in their studies
Authors:
Ivan Heibi,
Silvio Peroni,
Luca Pareschi,
Paolo Ferri
Abstract:
Automatic text analysis methods, such as Topic Modelling, are gaining much attention in the Humanities. However, scholars need to have extensive coding skills to use such methods appropriately. The need for this technical expertise prevents the broad adoption of these methods in Humanities research. In this paper, to help scholars in the Humanities use Topic Modelling with no or limited coding skills, we introduce MITAO, a web-based tool that allows the definition of a visual workflow which embeds various automatic text analysis operations and allows one to store and share both the workflow and the results of its execution with other researchers, which enables the reproducibility of the analysis. We present an example application of Topic Modelling with MITAO using a collection of English abstracts of the articles published in "Umanistica Digitale". The results returned by MITAO are shown with dynamic web-based visualizations, which allowed us to gain preliminary insights into the evolution over time of the topics treated in the articles published in "Umanistica Digitale". All the results, along with the defined workflows, are published and accessible for further studies.
Submitted 27 November, 2020;
originally announced November 2020.
-
The Landscape of Ontology Reuse Approaches
Authors:
Valentina Anita Carriero,
Marilena Daquino,
Aldo Gangemi,
Andrea Giovanni Nuzzolese,
Silvio Peroni,
Valentina Presutti,
Francesca Tomasi
Abstract:
Ontology reuse aims to foster interoperability and facilitate knowledge reuse. Several approaches are typically evaluated by ontology engineers when bootstrapping a new project. However, current practices are often motivated by subjective, case-by-case decisions, which hamper the definition of a recommended behaviour. In this chapter we argue that to date there are no effective solutions for supporting developers' decision-making process when deciding on an ontology reuse strategy. The objective is twofold: (i) to survey current approaches to ontology reuse, presenting motivations, strategies, benefits and limits, and (ii) to analyse two representative approaches and discuss their merits.
Submitted 25 November, 2020;
originally announced November 2020.
-
Citing and referencing habits in Medicine and Social Sciences journals in 2019
Authors:
Erika Alves dos Santos,
Silvio Peroni,
Marcos Luiz Mucheroni
Abstract:
This article explores citing and referencing systems in Social Sciences and Medicine articles from different theoretical and practical perspectives, considering bibliographic references as a facet of descriptive representation. The analysis of citing and referencing elements (i.e. bibliographic references, mentions, quotations, and respective in-text reference pointers) identified citing and referencing habits within the disciplines under consideration, as well as errors that have persisted over the long term, expanding on what previous studies had reported. Expected future trends in information retrieval from bibliographic metadata were gathered by approaching these referencing elements from the perspective of the FRBR Entities concepts. Reference styles do not fully accomplish their role of guiding authors and publishers in providing concise and well-structured bibliographic metadata within bibliographic references. Trends in the revision of descriptive representation suggest a growing distance between the ways information is approached by bibliographic references and by bibliographic catalogs adopting FRBR concepts, including the description levels adopted by each of them under the perspective of the FRBR Entities concept. This study was based on a subset of Medicine and Social Sciences articles published in 2019 and, therefore, it may not be taken as final or as providing broad coverage. Future studies expanding these approaches to other disciplines and chronological periods are encouraged. By approaching citing and referencing issues as facets of descriptive representation, the findings of this study may encourage further studies that will support Information Science and Computer Science in providing tools to make bibliographic metadata description simpler, better structured and more efficient in light of the revision of descriptive representation currently in progress.
Submitted 20 January, 2021; v1 submitted 11 September, 2020;
originally announced September 2020.
-
Creating RESTful APIs over SPARQL endpoints using RAMOSE
Authors:
Marilena Daquino,
Ivan Heibi,
Silvio Peroni,
David Shotton
Abstract:
Semantic Web technologies are widely used for storing RDF data and making them available on the Web through SPARQL endpoints, queryable using the SPARQL query language. While the use of SPARQL endpoints is strongly supported by Semantic Web experts, it hinders broader use of RDF data by common Web users, engineers and developers unfamiliar with Semantic Web technologies, who normally rely on Web RESTful APIs for querying Web-available data and creating applications over them. To solve this problem, we have developed RAMOSE, a generic tool developed in Python to create REST APIs over SPARQL endpoints. Through the creation of source-specific textual configuration files, RAMOSE enables the querying of SPARQL endpoints via simple Web RESTful API calls that return either JSON or CSV-formatted data, thus hiding all the intrinsic complexities of SPARQL and RDF from common Web users. We provide evidence that the use of RAMOSE to provide REST API access to RDF data within OpenCitations triplestores is beneficial in terms of the number of queries made by external users to such RDF data using the RAMOSE API compared with the direct access via the SPARQL endpoint. Our findings show the importance for suppliers of RDF data of having an alternative API access service, which enables its use by those with no (or little) experience in Semantic Web technologies and the SPARQL query language. RAMOSE can be used both to query any SPARQL endpoint and to query any other Web API, and thus it represents an easy generic technical solution for service providers who wish to create an API service to access Linked Data stored as RDF in a conventional triplestore.
Submitted 30 May, 2021; v1 submitted 31 July, 2020;
originally announced July 2020.
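As a rough illustration of the kind of translation the abstract describes (SPARQL results turned into CSV for users who do not know SPARQL), the sketch below flattens a response in the standard W3C SPARQL 1.1 Query Results JSON format into CSV text. It is not RAMOSE's implementation or configuration format; the function name `sparql_json_to_csv` and the sample DOIs are hypothetical, chosen only for this example.

```python
import csv
import io


def sparql_json_to_csv(result):
    """Flatten a SPARQL 1.1 JSON result set into CSV text.

    `result` follows the W3C layout: {"head": {"vars": [...]},
    "results": {"bindings": [{var: {"type": ..., "value": ...}}]}}.
    Variables left unbound in a row become empty cells.
    """
    variables = result["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)  # header row: one column per variable
    for binding in result["results"]["bindings"]:
        writer.writerow(
            [binding.get(var, {}).get("value", "") for var in variables]
        )
    return buffer.getvalue()


# Hypothetical response, as a SPARQL endpoint might return it.
sample = {
    "head": {"vars": ["citing", "cited"]},
    "results": {
        "bindings": [
            {
                "citing": {"type": "literal", "value": "10.1000/a"},
                "cited": {"type": "literal", "value": "10.1000/b"},
            },
            # "cited" is unbound in this row.
            {"citing": {"type": "literal", "value": "10.1000/c"}},
        ]
    },
}

csv_text = sparql_json_to_csv(sample)
```

A REST layer built on this idea would run a stored query with the caller's parameters and return `csv_text` (or the equivalent JSON rows), so the caller never sees the SPARQL or RDF details.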
-
The OpenCitations Data Model
Authors:
Marilena Daquino,
Silvio Peroni,
David Shotton,
Giovanni Colavizza,
Behnam Ghavimi,
Anne Lauscher,
Philipp Mayr,
Matteo Romanello,
Philipp Zumstein
Abstract:
A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data supplier or context application. In this paper we present the OpenCitations Data Model (OCDM), a generic data model for describing bibliographic entities and citations, developed using Semantic Web technologies. We also evaluate the effective reusability of OCDM according to ontology evaluation practices, mention existing users of OCDM, and discuss the use and impact of OCDM in the wider open science community.
Submitted 24 August, 2020; v1 submitted 25 May, 2020;
originally announced May 2020.
-
OpenCitations, an infrastructure organization for open scholarship
Authors:
Silvio Peroni,
David Shotton
Abstract:
OpenCitations is an infrastructure organization for open scholarship dedicated to the publication of open citation data as Linked Open Data using Semantic Web technologies, thereby providing a disruptive alternative to traditional proprietary citation indexes. Open citation data are valuable for bibliometric analysis, increasing the reproducibility of large-scale analyses by enabling publication of the source data. Following brief introductions to the development and benefits of open scholarship and to Semantic Web technologies, this paper describes OpenCitations and its datasets, tools, services and activities. These include the OpenCitations Data Model; the SPAR (Semantic Publishing and Referencing) Ontologies; OpenCitations' open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores; Open Citation Identifiers (OCIs) and the OpenCitations OCI Resolution Service; the OpenCitations Corpus (OCC), a database of open downloadable bibliographic and citation data made available in RDF under a Creative Commons public domain dedication; and the OpenCitations Indexes of open citation data, of which the first and largest is COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, which currently contains over 445 million bibliographic citations and is receiving considerable usage by the scholarly community.
Submitted 9 December, 2019; v1 submitted 27 June, 2019;
originally announced June 2019.
-
Nine Million Book Items and Eleven Million Citations: A Study of Book-Based Scholarly Communication Using OpenCitations
Authors:
Yongjun Zhu,
Erjia Yan,
Silvio Peroni,
Chao Che
Abstract:
Books have been widely used to share information and contribute to human knowledge. However, the quantitative use of books as a method of scholarly communication is relatively unexamined compared to journal articles and conference papers. This study uses the COCI dataset (a comprehensive open citation dataset provided by OpenCitations) to explore books' roles in scholarly communication. The COCI data we analyzed includes 445,826,118 citations from 46,534,705 bibliographic entities. By analyzing such a large amount of data, we provide a thorough, multifaceted understanding of books. Among the investigated factors are 1) temporal changes to book citations; 2) book citation distributions; 3) years to citation peak; 4) citation half-life; and 5) characteristics of the most-cited books. Results show that books have received less than 4% of total citations, and have been cited mainly by journal articles. Moreover, 97.96% of books have been cited fewer than ten times. Books take longer than other bibliographic materials to reach peak citation levels, yet are cited for the same duration as journal articles. Most-cited books tend to cover general (yet essential) topics, theories, and technological concepts in mathematics and statistics.
Submitted 6 December, 2019; v1 submitted 14 June, 2019;
originally announced June 2019.
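Two of the factors listed in the abstract, years to citation peak and citation half-life, can be sketched from per-year citation counts. The abstract does not give the paper's exact definitions, so the ones below are assumptions: the peak is the earliest year with the highest count, and the half-life is the number of years after publication at which cumulative citations first reach half of the total. The function names and the sample counts are hypothetical.

```python
def years_to_citation_peak(publication_year, citations_per_year):
    """Years from publication to the (earliest) year with the most citations.

    `citations_per_year` maps a calendar year to a citation count.
    Returns None when there are no counts at all.
    """
    if not citations_per_year:
        return None
    # Iterating over sorted years makes ties resolve to the earliest year.
    peak_year = max(sorted(citations_per_year), key=lambda y: citations_per_year[y])
    return peak_year - publication_year


def citation_half_life(publication_year, citations_per_year):
    """Years from publication until half of all citations have been received.

    Assumed definition for this sketch; returns None for uncited items.
    """
    total = sum(citations_per_year.values())
    if total == 0:
        return None
    running = 0
    for year in sorted(citations_per_year):
        running += citations_per_year[year]
        if running * 2 >= total:
            return year - publication_year


# Hypothetical book published in 2000.
counts = {2001: 5, 2002: 4, 2003: 6, 2004: 1}
peak = years_to_citation_peak(2000, counts)
half_life = citation_half_life(2000, counts)
```

With these sample counts the peak falls in 2003 (three years after publication), while half of the 16 citations have already arrived by 2002 (two years after publication).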
-
COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations
Authors:
Ivan Heibi,
Silvio Peroni,
David Shotton
Abstract:
In this paper, we present COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (http://opencitations.net/index/coci). COCI is the first open citation index created by OpenCitations, in which we have applied the concept of citations as first-class data entities, and it contains more than 445 million DOI-to-DOI citation links derived from the data available in Crossref. These citations are described in RDF by means of the newly extended version of the OpenCitations Data Model (OCDM). We introduce the workflow we have developed for creating these data, and also show the additional services that facilitate the access to and querying of these data via different access points: a SPARQL endpoint, a REST API, bulk downloads, Web interfaces, and direct access to the citations via HTTP content negotiation. Finally, we present statistics regarding the use of COCI citation data, and we introduce several projects that have already started to use COCI data for different purposes.
Submitted 26 July, 2019; v1 submitted 12 April, 2019;
originally announced April 2019.
-
The practice of self-citations: a longitudinal study
Authors:
Silvio Peroni,
Paolo Ciancarini,
Aldo Gangemi,
Andrea Giovanni Nuzzolese,
Francesco Poggi,
Valentina Presutti
Abstract:
In this article, we discuss the outcomes of an experiment where we analysed whether and to what extent the introduction, in 2012, of the new research assessment exercise in Italy (a.k.a. Italian Scientific Habilitation) affected self-citation behaviours in the Italian research community. The Italian Scientific Habilitation attests to the scientific maturity of researchers and in Italy, as in many other countries, is a requirement for accessing a professorship. To this end, we obtained from ScienceDirect 35,673 articles published between 1957 and 2016 by the participants in the 2012 Italian Scientific Habilitation, which resulted in the extraction of 1,379,050 citations retrieved through Semantic Publishing technologies. Our analysis showed an overall increase in author self-citations (i.e. where the citing article and the cited article share at least one author) in several of the 24 academic disciplines considered. However, we observed a stronger causal relation between this increase and the rules introduced by the 2012 Italian Scientific Habilitation in 10 out of the 24 disciplines analysed.
Submitted 19 February, 2020; v1 submitted 14 March, 2019;
originally announced March 2019.
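The abstract defines an author self-citation as a citation where the citing and cited articles share at least one author. A minimal sketch of that test follows, assuming author identifiers are already disambiguated (in practice, name matching is a hard problem of its own). The function names, article identifiers, and author names are hypothetical and are not taken from the study.

```python
def is_author_self_citation(citing_authors, cited_authors):
    """True when the two author collections share at least one author."""
    return not set(citing_authors).isdisjoint(cited_authors)


def count_self_citations(citations, authors_by_article):
    """Return (self_citations, total_citations) for a list of citation pairs.

    `citations` is an iterable of (citing_id, cited_id) pairs;
    `authors_by_article` maps an article id to its set of author ids.
    Articles missing from the mapping are treated as having no known authors.
    """
    self_count = 0
    total = 0
    for citing_id, cited_id in citations:
        total += 1
        if is_author_self_citation(
            authors_by_article.get(citing_id, set()),
            authors_by_article.get(cited_id, set()),
        ):
            self_count += 1
    return self_count, total


# Hypothetical data: only a1 -> a2 shares an author ("rossi").
authors = {
    "a1": {"rossi", "bianchi"},
    "a2": {"rossi"},
    "a3": {"verdi"},
}
pairs = [("a1", "a2"), ("a1", "a3"), ("a3", "a2")]
self_count, total = count_self_citations(pairs, authors)
```

Computing these counts per discipline and per year would give the kind of time series on which a before/after comparison around 2012 could be made.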