-
Hierarchical Multi-Positive Contrastive Learning for Patent Image Retrieval
Authors:
Kshitij Kavimandan,
Angelos Nalmpantis,
Emma Beauxis-Aussalet,
Robert-Jan Sips
Abstract:
Patent images are technical drawings that convey information about a patent's innovation. Patent image retrieval systems aim to search in vast collections and retrieve the most relevant images. Despite recent advances in information retrieval, patent images still pose significant challenges due to their technical intricacies and complex semantic information, requiring efficient fine-tuning for dom…
▽ More
Patent images are technical drawings that convey information about a patent's innovation. Patent image retrieval systems aim to search in vast collections and retrieve the most relevant images. Despite recent advances in information retrieval, patent images still pose significant challenges due to their technical intricacies and complex semantic information, requiring efficient fine-tuning for domain adaptation. Current methods neglect patents' hierarchical relationships, such as those defined by the Locarno International Classification (LIC) system, which groups broad categories (e.g., "furnishing") into subclasses (e.g., "seats" and "beds") and further into specific patent designs. In this work, we introduce a hierarchical multi-positive contrastive loss that leverages the LIC's taxonomy to induce such relations in the retrieval process. Our approach assigns multiple positive pairs to each patent image within a batch, with varying similarity scores based on the hierarchical taxonomy. Our experimental analysis with various vision and multimodal models on the DeepPatent2 dataset shows that the proposed method enhances the retrieval results. Notably, our method is effective with low-parameter models, which require fewer computational resources and can be deployed on environments with limited hardware.
△ Less
Submitted 17 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
Access to care: analysis of the geographical distribution of healthcare using Linked Open Data
Authors:
Selene Baez Santamaria,
Emmanouil Manousogiannis,
Guusje Boomgaard,
Linh P. Tran,
Zoltan Szlavik,
Robert-Jan Sips
Abstract:
Background: Access to medical care is strongly dependent on resource allocation, such as the geographical distribution of medical facilities. Nevertheless, this data is usually restricted to country official documentation, not available to the public. While some medical facilities' data is accessible as semantic resources on the Web, it is not consistent in its modeling and has yet to be integrate…
▽ More
Background: Access to medical care is strongly dependent on resource allocation, such as the geographical distribution of medical facilities. Nevertheless, this data is usually restricted to country official documentation, not available to the public. While some medical facilities' data is accessible as semantic resources on the Web, it is not consistent in its modeling and has yet to be integrated into a complete, open, and specialized repository. This work focuses on generating a comprehensive semantic dataset of medical facilities worldwide containing extensive information about such facilities' geo-location.
Results: For this purpose, we collect, align, and link various open-source databases where medical facilities' information may be present. This work allows us to evaluate each data source along various dimensions, such as completeness, correctness, and interlinking with other sources, all critical aspects of current knowledge representation technologies.
Conclusions: Our contributions directly benefit stakeholders in the biomedical and health domain (patients, healthcare professionals, companies, regulatory authorities, and researchers), who will now have a better overview of the access to and distribution of medical facilities.
△ Less
Submitted 26 September, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Empirical Methodology for Crowdsourcing Ground Truth
Authors:
Anca Dumitrache,
Oana Inel,
Benjamin Timmermans,
Carlos Ortiz,
Robert-Jan Sips,
Lora Aroyo,
Chris Welty
Abstract:
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in…
▽ More
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives of the information examples. We present an empirically derived methodology for efficiently gathering of ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
△ Less
Submitted 24 September, 2018;
originally announced September 2018.