-
Enhancing Portuguese Variety Identification with Cross-Domain Approaches
Authors:
Hugo Sousa,
Rúben Almeida,
Purificação Silvano,
Inês Cantante,
Ricardo Campos,
Alípio Jorge
Abstract:
Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers in cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open source the code, corpus, and models to foster further research in this task.
Submitted 20 February, 2025;
originally announced February 2025.
-
Tradutor: Building a Variety Specific Translation Model
Authors:
Hugo Sousa,
Satya Almasian,
Ricardo Campos,
Alípio Jorge
Abstract:
Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.
Submitted 20 February, 2025;
originally announced February 2025.
-
Event Extraction for Portuguese: A QA-driven Approach using ACE-2005
Authors:
Luís Filipe Cunha,
Ricardo Campos,
Alípio Jorge
Abstract:
Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event's arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separate BERT-based models are fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. First, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event-annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 scores of 64.4 for trigger classification and 46.7 for argument classification, setting a new state-of-the-art reference for these tasks in Portuguese.
Submitted 29 August, 2024;
originally announced August 2024.
-
ACE-2005-PT: Corpus for Event Extraction in Portuguese
Authors:
Luís Filipe Cunha,
Purificação Silvano,
Ricardo Campos,
Alípio Jorge
Abstract:
Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations, and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. This subset was then compared against our pipeline results, which achieved exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
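As an illustration of the fuzzy matching technique named in the abstract above, the following is a minimal sketch, not the authors' pipeline. It assumes the annotation has already been translated and looks for the most similar token window in the translated sentence using Python's standard difflib. The function name fuzzy_align, the window sizes tried, and the min_ratio threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def fuzzy_align(annotation, sentence_tokens, min_ratio=0.8):
    """Return (start, end, ratio) of the token window most similar to
    `annotation`, or None if no window reaches `min_ratio`.

    Hypothetical helper: the window sizes and the threshold are
    assumptions for illustration only.
    """
    target = annotation.lower()
    n = len(annotation.split())
    best = None
    # Try windows one token shorter, equal to, and one token longer
    # than the annotation, since translation can change token counts.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for start in range(len(sentence_tokens) - size + 1):
            window = " ".join(sentence_tokens[start:start + size]).lower()
            ratio = SequenceMatcher(None, target, window).ratio()
            if best is None or ratio > best[2]:
                best = (start, start + size, ratio)
    if best is not None and best[2] >= min_ratio:
        return best
    return None


tokens = "O presidente dos Estados Unidos visitou Lisboa".split()
match = fuzzy_align("presidente dos Estado Unidos", tokens)
if match is not None:
    start, end, ratio = match
    print(tokens[start:end], round(ratio, 3))
```

A character-level similarity ratio tolerates small inflection differences between the translated annotation and the sentence, which is one reason fuzzy matching can complement exact matching in such a pipeline.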
Submitted 29 August, 2024;
originally announced August 2024.
-
TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification
Authors:
Hugo Sousa,
Ricardo Campos,
Alípio Jorge
Abstract:
Temporal expression identification is crucial for understanding texts written in natural language. Although highly effective systems such as HeidelTime exist, their limited runtime performance hampers adoption in large-scale applications and production environments. In this paper, we introduce the TEI2GO models, which match HeidelTime's effectiveness with significantly improved runtime, support six languages, and achieve state-of-the-art results in four of them. To train the TEI2GO models, we used a combination of manually annotated reference corpora and "Professor HeidelTime", a comprehensive weakly labeled corpus of news texts annotated with HeidelTime that we developed. This corpus comprises a total of 138,069 documents (over six languages) with 1,050,921 temporal expressions, making it the largest open-source annotated dataset for temporal expression identification to date. By describing how the models were produced, we aim to encourage the research community to further explore, refine, and extend the set of models to additional languages and domains. Code, annotations, and models are openly available for community exploration and use. The models are available on HuggingFace for seamless integration and application.
Submitted 25 March, 2024;
originally announced March 2024.
-
Indexing Portuguese NLP Resources with PT-Pump-Up
Authors:
Rúben Almeida,
Ricardo Campos,
Alípio Jorge,
Sérgio Nunes
Abstract:
The recent advances in natural language processing (NLP) are linked to training processes that require vast amounts of corpora. Access to this data is commonly not a trivial process due to resource dispersion and the need to maintain these infrastructures online and up-to-date. New developments in NLP are often compromised by the scarcity of data or the lack of a shared repository that works as an entry point to the community. This is especially true in low- and mid-resource languages, such as Portuguese, which lack data and proper resource management infrastructures. In this work, we propose PT-Pump-Up, a set of tools that aim to reduce resource dispersion and improve the accessibility of Portuguese NLP resources. Our proposal is divided into four software components: a) a web platform to list the available resources; b) a client-side Python package to simplify the loading of Portuguese NLP resources; c) an administrative Python package to manage the platform; and d) a public GitHub repository to foster future collaboration and contributions. All four components are accessible at: https://linktr.ee/pt_pump_up
Submitted 27 January, 2024;
originally announced January 2024.
-
Physio: An LLM-Based Physiotherapy Advisor
Authors:
Rúben Almeida,
Hugo Sousa,
Luís F. Cunha,
Nuno Guimarães,
Ricardo Campos,
Alípio Jorge
Abstract:
The capabilities of the most recent language models have increased the interest in integrating them into real-world applications. However, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. Healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. In this paper, we present Physio, a chat-based application for physical rehabilitation. Physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. Furthermore, drawing upon external knowledge databases, Physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. By combining these features, Physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. A live demo of Physio is available at https://physio.inesctec.pt.
Submitted 3 January, 2024;
originally announced January 2024.
-
Polygon Detection from a Set of Lines
Authors:
Alfredo Ferreira Jr.,
Manuel J. Fonseca,
Joaquim A. Jorge
Abstract:
Detecting polygons defined by a set of line segments in a plane is an important step in analyzing vector drawings. This paper presents an approach combining several algorithms to detect basic polygons from arbitrary line segments. The resulting algorithm runs in polynomial time and space, with complexities of $O\bigl((N + M)^4\bigr)$ and $O\bigl((N + M)^2\bigr)$, where $N$ is the number of line segments and $M$ is the number of intersections between line segments. Our choice of algorithms was made to strike a good compromise between efficiency and ease of implementation. The result is a simple and efficient solution to detect polygons from lines.
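To make the quantities $N$ and $M$ in the abstract above concrete, here is a minimal sketch that finds the pairwise intersection points of a set of line segments by brute force. It is not the authors' algorithm: the abstract does not say which intersection method is used, and the function names segment_intersection and all_intersections, the eps tolerance, and the decision to ignore parallel or collinear segments are assumptions made for this example.

```python
from itertools import combinations


def segment_intersection(s1, s2, eps=1e-12):
    """Return the intersection point of two segments, or None.

    Each segment is ((x1, y1), (x2, y2)). Parallel and collinear
    segments are treated as non-intersecting in this simplified sketch.
    """
    (x1, y1), (x2, y2) = s1
    (x3, y3), (x4, y4) = s2
    # Cross product of the two direction vectors.
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if abs(denom) < eps:
        return None
    # Parameters along each segment at which the supporting lines meet.
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None


def all_intersections(segments):
    """Check every pair of segments: O(N^2) pair tests for N segments."""
    points = []
    for s1, s2 in combinations(segments, 2):
        p = segment_intersection(s1, s2)
        if p is not None:
            points.append(p)
    return points


segments = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((5, 5), (6, 5))]
print(len(segments), all_intersections(segments))
```

Note that this sketch counts shared endpoints as intersections; whether the paper's $M$ does the same is not stated in the abstract.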
Submitted 26 December, 2023;
originally announced December 2023.
-
GPT Struct Me: Probing GPT Models on Narrative Entity Extraction
Authors:
Hugo Sousa,
Nuno Guimarães,
Alípio Jorge,
Ricardo Campos
Abstract:
The importance of systems that can extract structured information from textual data becomes increasingly pronounced given the ever-increasing volume of text produced on a daily basis. Having a system that can effectively extract such information in an interoperable manner would be an asset for several domains, be it finance, health, or legal. Recent developments in natural language processing led to the production of powerful language models that can, to some degree, mimic human intelligence. Such effectiveness raises a pertinent question: Can these models be leveraged for the extraction of structured information? In this work, we address this question by evaluating the capabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5, commonly known as ChatGPT -- in the extraction of narrative entities, namely events, participants, and temporal expressions. This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles whose annotation framework includes a set of entity structures along with several tags and attribute values. We first select the best prompt template through an ablation study over prompt components that provide varying degrees of information on a subset of documents of the dataset. Subsequently, we use the best templates to evaluate the effectiveness of the models on the remaining documents. The results obtained indicate that GPT models are competitive with out-of-the-box baseline systems, presenting an all-in-one alternative for practitioners with limited resources. By studying the strengths and limitations of these models in the context of information extraction, we offer insights that can guide future improvements and avenues to explore in this field.
Submitted 24 November, 2023;
originally announced November 2023.
-
A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
Authors:
Hugo Sousa,
Arian Pasquali,
Alípio Jorge,
Catarina Sousa Santos,
Mário Amorim Lopes
Abstract:
Textual health records of cancer patients are usually protracted and highly unstructured, making it very time-consuming for health professionals to get a complete overview of the patient's therapeutic course. As such limitations can lead to suboptimal and/or inefficient treatment procedures, healthcare providers would greatly benefit from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology which, besides holding over $10$ years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Since there is no annotated corpus for biomedical entity extraction in Portuguese, we also present the strategy we followed in annotating the corpus for the development of the models. The final models, which combined a neural architecture with entity linking, achieved $F_1$ scores of $88.6$, $95.0$, and $55.8$ per cent in the mention extraction of procedures, drugs, and diseases, respectively.
Submitted 18 April, 2023;
originally announced April 2023.
-
tieval: An Evaluation Framework for Temporal Information Extraction Systems
Authors:
Hugo Sousa,
Alípio Jorge,
Ricardo Campos
Abstract:
Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, having access to a large volume of corpora makes benchmarking TIE systems difficult. On the one hand, different datasets have different annotation schemes, thus hindering the comparison between competitors across different corpora. On the other hand, the fact that each corpus is commonly disseminated in a different format requires a considerable engineering effort for a researcher/practitioner to develop parsers for all of them. This constraint forces researchers to select a limited number of datasets to evaluate their systems, which consequently limits the comparability of the systems. Yet another obstacle that hinders the comparability of TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and $F_1$, a few others prefer temporal awareness, a metric tailored to be more comprehensive in the evaluation of temporal systems. Although the reason for the absence of temporal awareness in the evaluation of most systems is not clear, one factor that certainly weighs on this decision is the need to implement the temporal closure algorithm in order to compute temporal awareness, which is neither straightforward to implement nor readily available. All in all, these problems have limited the fair comparison between approaches and, consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
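The temporal closure step mentioned in the abstract above can be illustrated with a deliberately simplified sketch. This is not tieval's implementation or API: it assumes a single relation type (BEFORE) and computes its transitive closure by repeated composition, whereas a full temporal closure has to handle a richer set of relations. The function name temporal_closure and the event identifiers are hypothetical.

```python
from itertools import product


def temporal_closure(before_pairs):
    """Return the transitive closure of (a, b) pairs meaning 'a BEFORE b'.

    Simplified assumption: BEFORE is the only relation type, so closure
    reduces to adding (a, d) whenever (a, b) and (b, d) are both present.
    """
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


gold = {("e1", "e2"), ("e2", "e3")}
system = {("e1", "e3")}

# Without closure the system relation looks wrong; with closure it is
# recognised as implied by the gold annotation.
print(system <= gold)
print(system <= temporal_closure(gold))
```

The example shows why closure matters for evaluation: a system relation that is implied by, but not literally present in, the reference annotation would otherwise be counted as an error.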
Submitted 24 November, 2023; v1 submitted 11 January, 2023;
originally announced January 2023.
-
Probing Commonsense Knowledge in Pre-trained Language Models with Sense-level Precision and Expanded Vocabulary
Authors:
Daniel Loureiro,
Alípio Mário Jorge
Abstract:
Progress on commonsense reasoning is usually measured from performance improvements on Question Answering tasks designed to require commonsense knowledge. However, fine-tuning large Language Models (LMs) on these specific tasks does not directly evaluate commonsense learned during pre-training. The most direct assessments of commonsense knowledge in pre-trained LMs are arguably cloze-style tasks targeting commonsense assertions (e.g., A pen is used for [MASK].). However, this approach is restricted by the LM's vocabulary available for masked predictions, and its precision is subject to the context provided by the assertion. In this work, we present a method for enriching LMs with a grounded sense inventory (i.e., WordNet) available at the vocabulary level, without further training. This modification augments the prediction space of cloze-style prompts to the size of a large ontology while enabling finer-grained (sense-level) queries and predictions. In order to evaluate LMs with higher precision, we propose SenseLAMA, a cloze-style task featuring verbalized relations from disambiguated triples sourced from WordNet, WikiData, and ConceptNet. Applying our method to BERT, producing a WordNet-enriched version named SynBERT, we find that LMs can learn non-trivial commonsense knowledge from self-supervision, covering numerous relations, and more effectively than comparable similarity-based approaches.
Submitted 12 October, 2022;
originally announced October 2022.
-
Precipitation event-based networks: an analysis of the relations between network metrics and meteorological properties
Authors:
Aurelienne A. S. Jorge,
Douglas Uba,
Alex A. Fernandes,
Izabelly C. Costa,
Leonardo B. L. Santos
Abstract:
The study of complex systems in nature is essential to understand the interactions between different elements and how they influence one another. Complex network theory is a powerful tool that helps us to analyze these interactions and gain insights into the behavior of such systems. Surprisingly, this theory has been underutilized in the field of weather science, which focuses on the immediate state of the atmosphere. Our research aims to fill this gap by exploring the use of complex network theory in weather science. Specifically, we employ weather radar data to construct event-based geographical networks. By analyzing the relations between meteorological properties and network metrics in these event-based networks, we can gain a better understanding of the behavior of precipitation events. Our findings reveal significant correlations between various meteorological properties and network metrics, shedding light on the underlying mechanisms that govern precipitation events. Through our work, we hope to demonstrate the potential of complex network theory in weather science and inspire further research in this field.
Submitted 8 May, 2023; v1 submitted 22 June, 2022;
originally announced June 2022.
-
Global-threshold and backbone high-resolution weather radar networks are significantly complementary in a watershed
Authors:
Aurelienne A. S. Jorge,
Iuri da Silva Diniz,
Vander L. S. Freitas,
Izabelly C. Costa,
Leonardo B. L. Santos
Abstract:
There are several criteria for building up networks from time series related to different points in geographical space. The most used criterion is the Global-Threshold (GT). Using a weather radar dataset, this paper shows that the Backbone (BB), a local-threshold criterion, generates networks whose geographical configuration is complementary to the GT networks. We compare the results for two well-known similarity measures: the Pearson Correlation (PC) coefficient and the Mutual Information (MI). The extracted backbone network (miBB), whose number of links is the same as the global MI (miGT), has the lowest average shortest path and presents a small-world effect. Regarding the global PC (pcGT) and its corresponding BB network (pcBB), there is a significant linear relationship: $R^2 = 0.77$ with a slope of $1.15$ (p-value $< 10^{-7}$) for the pcGT network, and $R^2 = 0.68$ with a slope of $0.76$ (p-value $< 10^{-7}$) for the pcBB network. For the MI networks, only the miGT presents a high $R^2$ ($0.79$, with a slope of $1.95$), whereas the miBB has an $R^2$ of only $0.20$ (slope of $0.24$). On the one hand, the GT networks present a sizeable connected component in the central area, close to the main rivers. On the other hand, the BB networks present a few meaningful connected components surrounding the watershed and dominating cells close to the outlet, with statistically significant differences in the altimetry distribution.
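The Global-Threshold criterion described in the abstract above can be sketched as follows. This is an illustrative assumption about the general idea (link two locations when the similarity of their time series reaches a single global cutoff), not the authors' code; the function names pearson and global_threshold_network, the toy series, and the threshold value are all hypothetical, and the paper may treat negative correlations or the cutoff differently.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant series: define the correlation as zero
    return cov / (sx * sy)


def global_threshold_network(series, threshold):
    """Return the set of edges (i, j) whose correlation >= threshold.

    `series` maps a node id (e.g. a radar grid cell) to its time series.
    One cutoff is applied to every pair, hence "global" threshold.
    """
    edges = set()
    for (i, xi), (j, xj) in combinations(sorted(series.items()), 2):
        if pearson(xi, xj) >= threshold:
            edges.add((i, j))
    return edges


series = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
}
print(global_threshold_network(series, threshold=0.9))
```

A local-threshold criterion such as the Backbone would instead decide which links to keep per node, which is why the two constructions can yield complementary geographical configurations.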
Submitted 13 January, 2022;
originally announced January 2022.
-
Proceedings of the 4th Workshop on Online Recommender Systems and User Modeling -- ORSUM 2021
Authors:
João Vinagre,
Alípio Mário Jorge,
Marie Al-Ghossein,
Albert Bifet
Abstract:
Modern online services continuously generate data at very fast rates. This continuous flow of data encompasses content (e.g., posts, news, products, comments), user feedback (e.g., ratings, views, reads, clicks), and context data (user device, spatial or temporal data, user task or activity, weather). This can be overwhelming for systems and algorithms designed to train in batches, given the continuous and potentially fast change of content, context and user preferences or intents. Therefore, it is important to investigate online methods able to transparently adapt to the inherent dynamics of online services. Incremental models that learn from data streams are gaining attention in the recommender systems community, given their natural ability to deal with the continuous flows of data generated in dynamic, complex environments. User modeling and personalization can particularly benefit from algorithms capable of maintaining models incrementally and online.
The objective of this workshop is to foster contributions and bring together a growing community of researchers and practitioners interested in online, adaptive approaches to user modeling, recommendation and personalization, and their implications regarding multiple dimensions, such as evaluation, reproducibility, privacy and explainability.
Submitted 17 January, 2022; v1 submitted 12 January, 2022;
originally announced January 2022.
-
The CirCor DigiScope Dataset: From Murmur Detection to Murmur Classification
Authors:
Jorge Oliveira,
Francesco Renna,
Paulo Dias Costa,
Marcelo Nogueira,
Cristina Oliveira,
Carlos Ferreira,
Alipio Jorge,
Sandra Mattos,
Thamine Hatem,
Thiago Tavares,
Andoni Elola,
Ali Bahrami Rad,
Reza Sameni,
Gari D Clifford,
Miguel T. Coimbra
Abstract:
Cardiac auscultation is one of the most cost-effective techniques used to detect and identify many heart conditions. Computer-assisted decision systems based on auscultation can support physicians in their decisions. Unfortunately, the application of such systems in clinical trials is still minimal since most of them only aim to detect the presence of extra or abnormal waves in the phonocardiogram signal, i.e., only a binary ground truth variable (normal vs abnormal) is provided. This is mainly due to the lack of large publicly available datasets, where a more detailed description of such abnormal waves (e.g., cardiac murmurs) exists.
To pave the way to more effective research on healthcare recommendation systems based on auscultation, our team has prepared what is currently the largest pediatric heart sound dataset. A total of 5282 recordings have been collected from the four main auscultation locations of 1568 patients; in the process, 215780 heart sounds have been manually annotated. Furthermore, and for the first time, each cardiac murmur has been manually annotated by an expert annotator according to its timing, shape, pitch, grading, and quality. In addition, the auscultation locations where the murmur is present were identified, as well as the auscultation location where the murmur is detected most intensively. Such a detailed description for a relatively large number of heart sounds may pave the way for new machine learning algorithms with a real-world application for the detection and analysis of murmur waves for diagnostic purposes.
Submitted 24 December, 2021; v1 submitted 2 August, 2021;
originally announced August 2021.
-
LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond
Authors:
Daniel Loureiro,
Alípio Mário Jorge,
Jose Camacho-Collados
Abstract:
Distributional semantics based on neural approaches is a cornerstone of Natural Language Processing, with surprising connections to human meaning representation as well. Recent Transformer-based Language Models have proven capable of producing contextual word representations that reliably convey sense-specific information, simply as a product of self-supervision. Prior work has shown that these contextual representations can be used to accurately represent large sense inventories as sense embeddings, to the extent that a distance-based solution to Word Sense Disambiguation (WSD) tasks outperforms models trained specifically for the task. Still, there remains much to understand on how to use these Neural Language Models (NLMs) to produce sense embeddings that can better harness each NLM's meaning representation abilities. In this work we introduce a more principled approach to leverage information from all layers of NLMs, informed by a probing analysis on 14 NLM variants. We also emphasize the versatility of these sense embeddings in contrast to task-specific models, applying them on several sense-related tasks, besides WSD, while demonstrating improved performance using our proposed approach over prior work focused on sense embeddings. Finally, we discuss unexpected findings regarding layer and model performance variations, and potential applications for downstream tasks.
Submitted 1 April, 2022; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Towards augmented reality for corporate training
Authors:
Bruno R. Martins,
Joaquim A. Jorge,
Ezequiel R. Zorzal
Abstract:
Corporate training relates to employees acquiring the essential skills to operate equipment or to perform required tasks competently and safely. Unlike formal education, training can be incorporated into the task workflow and performed during working hours. Increasingly, organizations adopt different technologies both to develop individual skills and to improve their organization. Studies indicate that Augmented Reality (AR) is quickly becoming an effective technology for training programs. This systematic literature review (SLR) aims to screen works published on AR for corporate training. We describe AR training applications and discuss current challenges, literature gaps, opportunities, and tendencies of corporate AR solutions. We structured a protocol to define keywords, the semantics of research, and databases used as sources of this SLR. From a primary analysis, we considered 1952 articles in the review for qualitative synthesis, and we selected 60 of them for this study. The survey shows that a large share of applications (41.7%) focus on automotive and medical training. Additionally, 20% of selected publications use a camera-display with a tablet device, while 40% refer to head-mounted displays, and many surveyed approaches (45%) adopt marker-based tracking. Results indicate that publications on AR for corporate training increased significantly in recent years. AR has been used in many areas, exhibits high quality, and provides viable approaches to on-the-job training. Finally, we discuss future research issues related to the increasing relevance of AR for corporate training.
Submitted 18 February, 2021;
originally announced February 2021.
-
A Review on Deep Learning in UAV Remote Sensing
Authors:
Lucas Prado Osco,
José Marcato Junior,
Ana Paula Marques Ramos,
Lúcio André de Castro Jorge,
Sarah Narges Fatholahi,
Jonathan de Andrade Silva,
Edson Takashi Matsubara,
Hemerson Pistori,
Wesley Nunes Gonçalves,
Jonathan Li
Abstract:
Deep Neural Networks (DNNs) learn representations from data with impressive capability and have brought important breakthroughs in processing images, time series, natural language, audio, video, and many other data types. In the remote sensing field, surveys and literature reviews specifically involving applications of DNN algorithms have been conducted in an attempt to summarize the amount of information produced in its subfields. Recently, Unmanned Aerial Vehicle (UAV) based applications have dominated aerial sensing research. However, a literature review that combines both the "deep learning" and "UAV remote sensing" themes has not yet been conducted. The motivation for our work was to present a comprehensive review of the fundamentals of Deep Learning (DL) applied to UAV-based imagery. We focused mainly on describing classification and regression techniques used in recent applications with UAV-acquired data. For that, a total of 232 papers published in international scientific journal databases was examined. We gathered the published material and evaluated its characteristics regarding application, sensor, and technique used. We relate how DL presents promising results and has the potential for processing tasks associated with UAV-based image data. Lastly, we project future perspectives, commenting on prominent DL paths to be explored in the UAV remote sensing field. Our review offers a friendly approach to introducing, commenting on, and summarizing the state of the art in UAV-based image applications with DNN algorithms in diverse subfields of remote sensing, grouping it into environmental, urban, and agricultural contexts.
Submitted 20 August, 2023; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Improving Portuguese Semantic Role Labeling with Transformers and Transfer Learning
Authors:
Sofia Oliveira,
Daniel Loureiro,
Alípio Jorge
Abstract:
The Natural Language Processing task of determining "Who did what to whom" is called Semantic Role Labeling. For English, recent methods based on Transformer models have allowed for major improvements in this task over the previous state of the art. However, for low resource languages, like Portuguese, currently available semantic role labeling models are hindered by scarce training data. In this paper, we explore a model architecture with only a pre-trained Transformer-based model, a linear layer, softmax and Viterbi decoding. We substantially improve the state-of-the-art performance in Portuguese by over 15 F1. Additionally, we improve semantic role labeling results in Portuguese corpora by exploiting cross-lingual transfer learning using multilingual pre-trained models, and transfer learning from dependency parsing in Portuguese, evaluating the various proposed approaches empirically.
Submitted 30 October, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
A CNN Approach to Simultaneously Count Plants and Detect Plantation-Rows from UAV Imagery
Authors:
Lucas Prado Osco,
Mauro dos Santos de Arruda,
Diogo Nunes Gonçalves,
Alexandre Dias,
Juliana Batistoti,
Mauricio de Souza,
Felipe David Georges Gomes,
Ana Paula Marques Ramos,
Lúcio André de Castro Jorge,
Veraldo Liesenberg,
Jonathan Li,
Lingfei Ma,
José Marcato Junior,
Wesley Nunes Gonçalves
Abstract:
In this paper, we propose a novel deep learning method based on a Convolutional Neural Network (CNN) that simultaneously detects and geolocates plantation-rows while counting their plants in highly dense plantation configurations. The experimental setup was evaluated in a cornfield with different growth stages and in a citrus orchard. Both datasets characterize different plant density scenarios, locations, types of crops, sensors, and dates. A two-branch architecture was implemented in our CNN method, where the information obtained within the plantation-row is passed to the plant detection branch and fed back to the row branch; both are then refined by a Multi-Stage Refinement method. In the corn plantation datasets (with both growth phases, young and mature), our approach returned a mean absolute error (MAE) of 6.224 plants per image patch, a mean relative error (MRE) of 0.1038, precision and recall values of 0.856 and 0.905, respectively, and an F-measure equal to 0.876. These results were superior to the results from other deep networks (HRNet, Faster R-CNN, and RetinaNet) evaluated with the same task and dataset. For the plantation-row detection, our approach returned precision, recall, and F-measure scores of 0.913, 0.941, and 0.925, respectively. To test the robustness of our model with a different type of agriculture, we performed the same task in the citrus orchard dataset. It returned an MAE equal to 1.409 citrus-trees per patch, MRE of 0.0615, precision of 0.922, recall of 0.911, and F-measure of 0.965. For citrus plantation-row detection, our approach resulted in precision, recall, and F-measure scores equal to 0.965, 0.970, and 0.964, respectively. The proposed method achieved state-of-the-art performance for counting and geolocating plants and plant-rows in UAV images from different types of crops.
Submitted 14 February, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
ECIR 2020 Workshops: Assessing the Impact of Going Online
Authors:
Sérgio Nunes,
Suzanne Little,
Sumit Bhatia,
Ludovico Boratto,
Guillaume Cabanac,
Ricardo Campos,
Francisco M. Couto,
Stefano Faralli,
Ingo Frommholz,
Adam Jatowt,
Alípio Jorge,
Mirko Marras,
Philipp Mayr,
Giovanni Stilo
Abstract:
ECIR 2020 https://ecir2020.org/ was one of the many conferences affected by the COVID-19 pandemic. The Conference Chairs decided to keep the initially planned dates (April 14-17, 2020) and move to a fully online event. In this report, we describe the experience of organizing the ECIR 2020 Workshops in this scenario from two perspectives: the workshop organizers and the workshop participants. We provide a report on the organizational aspect of these events and the consequences for participants. Covering the scientific dimension of each workshop is outside the scope of this article.
Submitted 14 May, 2020;
originally announced May 2020.
-
A superpixel-driven deep learning approach for the analysis of dermatological wounds
Authors:
Gustavo Blanco,
Agma J. M. Traina,
Caetano Traina Jr.,
Paulo M. Azevedo-Marques,
Ana E. S. Jorge,
Daniel de Oliveira,
Marcos V. N. Bedo
Abstract:
Background. The image-based identification of distinct tissues within dermatological wounds enhances patients' care since it requires no intrusive evaluations. This manuscript presents an approach, named QTDU, that combines deep learning models with superpixel-driven segmentation methods for assessing the quality of tissues from dermatological ulcers.
Method. QTDU consists of a three-stage pipeline for ulcer segmentation, tissue labeling, and wounded-area quantification. We set up our approach by using a real and annotated set of dermatological ulcers to train several deep learning models to identify ulcered superpixels.
Results. Empirical evaluations on 179,572 superpixels divided into four classes showed that QTDU accurately spots wounded tissues (AUC = 0.986, sensitivity = 0.97, and specificity = 0.974) and outperformed machine-learning approaches by up to 8.2% regarding F1-Score through fine-tuning of a ResNet-based model. Last, but not least, experimental evaluations also showed that QTDU correctly quantified wounded tissue areas within a 0.089 Mean Absolute Error ratio.
Conclusions. Results indicate the effectiveness of QTDU for both tissue segmentation and wounded-area quantification tasks. When compared to existing machine-learning approaches, the combination of superpixels and deep learning models outperformed the competitors at strong significance levels.
Submitted 20 September, 2019; v1 submitted 13 September, 2019;
originally announced September 2019.
-
Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation
Authors:
Daniel Loureiro,
Alipio Jorge
Abstract:
Contextual embeddings represent a new generation of semantic representations learned from Neural Language Modelling (NLM) that addresses the issue of meaning conflation hampering traditional word embeddings. In this work, we show that contextual embeddings can be used to achieve unprecedented gains in Word Sense Disambiguation (WSD) tasks. Our approach focuses on creating sense-level embeddings with full-coverage of WordNet, and without recourse to explicit knowledge of sense distributions or task-specific modelling. As a result, a simple Nearest Neighbors (k-NN) method using our representations is able to consistently surpass the performance of previous systems using powerful neural sequencing models. We also analyse the robustness of our approach when ignoring part-of-speech and lemma features, requiring disambiguation against the full sense inventory, and revealing shortcomings to be improved. Finally, we explore applications of our sense embeddings for concept-level analyses of contextual embeddings and their respective NLMs.
Submitted 24 June, 2019;
originally announced June 2019.
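As a rough illustration of the nearest-neighbour idea described in the abstract above, the following Python sketch picks the sense whose embedding has the highest cosine similarity to a context embedding. The sense keys, the three-dimensional vectors, and the function names are hypothetical; the real system derives its embeddings from a neural language model and WordNet, which is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_vec, sense_embeddings):
    """1-NN: return the sense key whose embedding is closest to the context."""
    return max(sense_embeddings,
               key=lambda sense: cosine(context_vec, sense_embeddings[sense]))


# Toy sense inventory (hypothetical keys and values, for illustration only).
senses = {
    "bank%finance": [0.9, 0.1, 0.0],
    "bank%river": [0.1, 0.9, 0.2],
}
predicted = disambiguate([0.8, 0.2, 0.1], senses)
print(predicted)
```

Because the decision is a plain similarity lookup, no task-specific classifier has to be trained in this sketch; all the modelling effort sits in how the sense embeddings are built.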
-
LIAAD at SemDeep-5 Challenge: Word-in-Context (WiC)
Authors:
Daniel Loureiro,
Alipio Jorge
Abstract:
This paper describes the LIAAD system that was ranked second place in the Word-in-Context challenge (WiC) featured in SemDeep-5. Our solution is based on a novel system for Word Sense Disambiguation (WSD) using contextual embeddings and full-inventory sense embeddings. We adapt this WSD system, in a straightforward manner, for the present task of detecting whether the same sense occurs in a pair of sentences. Additionally, we show that our solution is able to achieve competitive performance even without using the provided training or development sets, mitigating potential concerns related to task overfitting.
Submitted 24 June, 2019;
originally announced June 2019.
-
Preference rules for label ranking: Mining patterns in multi-target relations
Authors:
Cláudio Rebelo de Sá,
Paulo Azevedo,
Carlos Soares,
Alípio Mário Jorge,
Arno Knobbe
Abstract:
In this paper we investigate two variants of association rules for preference data, Label Ranking Association Rules and Pairwise Association Rules. Label Ranking Association Rules (LRAR) are the equivalent of Class Association Rules (CAR) for the Label Ranking task. In CAR, the consequent is a single class, to which the example is expected to belong. In LRAR, the consequent is a ranking of the labels. The generation of LRAR requires special support and confidence measures to assess the similarity of rankings. In this work, we carry out a sensitivity analysis of these similarity-based measures. We want to understand which datasets benefit more from such measures and which parameters have more influence on the accuracy of the model. Furthermore, we propose an alternative type of rules, the Pairwise Association Rules (PAR), which are defined as association rules with a set of pairwise preferences in the consequent. While PAR can be used both as descriptive and predictive models, they are essentially descriptive models. Experimental results show the potential of both approaches.
Submitted 20 March, 2019;
originally announced March 2019.
-
Affordance Extraction and Inference based on Semantic Role Labeling
Authors:
Daniel Loureiro,
Alípio Mário Jorge
Abstract:
Common-sense reasoning is becoming increasingly important for the advancement of Natural Language Processing. While word embeddings have been very successful, they cannot explain which aspects of 'coffee' and 'tea' make them similar, or how they could be related to 'shop'. In this paper, we propose an explicit word representation that builds upon the Distributional Hypothesis to represent meaning from semantic roles, and allow inference of relations from their meshing, as supported by the affordance-based Indexical Hypothesis. We find that our model improves the state-of-the-art on unsupervised word similarity tasks while allowing for direct inference of new relations from the same vector space.
Submitted 3 September, 2018;
originally announced September 2018.
-
(geo)graphs - Complex Networks as a shapefile of nodes and a shapefile of edges for different applications
Authors:
Leonardo B L Santos,
Aurelienne A S Jorge,
Marcio Rossato,
Jessica D Santos,
Onofre A Candido,
Wilson Seron,
Charles N de Santana
Abstract:
Spatial dependency and spatial embedding are basic physical properties of many phenomena modeled by networks. The most suitable computational environments for dealing with spatial information are Georeferenced Information Systems (GIS) and Geographical Database Management Systems (GDBMS). Several models have been proposed in this direction; however, there is a gap in the literature regarding generic frameworks for working with Complex Networks in GIS/GDBMS environments. Here we introduce the concept of (geo)graphs: graphs in which the nodes have a known geographical location and the edges have spatial dependence. We present case studies and two open-source software tools (GIS4GRAPH and GeoCNet) that indicate how to retrieve networks from GIS data and how to represent networks over GIS data by using (geo)graphs.
Submitted 15 November, 2017;
originally announced November 2017.
-
An Overview of Data Mining Applications in Oil and Gas Exploration: Structural Geology and Reservoir Property-Issues
Authors:
Hamed Nikhalat Jahromi,
Alípio M. Jorge
Abstract:
Low oil prices have motivated energy executives to look into cost reduction in their supply chains more seriously. To this end, a new technology that is being considered experimentally in hydrocarbon exploration is data mining. There are two major categories of geoscientific problems in which data mining is applied: structural geology and reservoir property-issues. This research overviews these categories by considering a variety of interesting works in each of them. The result is an understanding of the specific geoscientific problems studied in the literature, along with the corresponding data mining methods. In this way, this work tries to lay the ground for a mutual understanding on oil and gas exploration between data miners and geoscientists.
Submitted 12 May, 2017;
originally announced May 2017.
-
Mind the Gap: A Well Log Data Analysis
Authors:
Rui L. Lopes,
Alípio Jorge
Abstract:
The main task in oil and gas exploration is to gain an understanding of the distribution and nature of rocks and fluids in the subsurface. Well logs are records of petro-physical data acquired along a borehole, providing direct information about what is in the subsurface. The data collected by logging wells can have significant economic consequences, due to the costs inherent to drilling wells, and the potential return of oil deposits. In this paper, we describe preliminary work aimed at building a general framework for well log prediction.
First, we perform a descriptive and exploratory analysis of the gaps in the neutron porosity logs of more than a thousand wells in the North Sea. Then, we generate artificial gaps in the neutron logs that reflect the statistics collected before. Finally, we compare Artificial Neural Networks, Random Forests, and three algorithms of Linear Regression in the prediction of missing gaps on a well-by-well basis.
Submitted 10 May, 2017;
originally announced May 2017.
-
Proceedings of the Workshop on Data Mining for Oil and Gas
Authors:
Alipio Jorge,
German Larrazabal,
Pablo Guillen,
Rui L. Lopes
Abstract:
The process of exploring and exploiting Oil and Gas (O&G) generates a lot of data that can bring more efficiency to the industry. The opportunities for using data mining techniques in the "digital oil-field" remain largely unexplored or uncharted. With the high rate of data expansion, companies are scrambling to find ways to develop near-real-time predictive analytics, data mining and machine learning capabilities, and are expanding their data storage infrastructure and resources. With these new goals come the challenges of managing data growth, integrating intelligence tools, and analyzing the data to glean useful insights. Oil and Gas companies need data solutions to economically extract value from very large volumes of a wide variety of data generated from exploration, well drilling and production devices and sensors.
Data mining for the oil and gas industry throughout the lifecycle of the reservoir includes the following roles: locating hydrocarbons, managing geological data, drilling and formation evaluation, well construction, well completion, and optimizing production through the life of the oil field. In each of these phases of the oil field lifecycle, data mining plays a significant role. Depending on the phase in question, knowledge creation through scientific models, data analytics, and machine learning, together with effective, productive, and on-demand data insight, is critical for decision making within the organization.
The significant challenges posed by this complex and economically vital field justify a meeting of data scientists who are willing to share their experience and knowledge. Thus, the Workshop on Data Mining for Oil and Gas (DM4OG) aims to provide a quality forum for researchers who work on the significant challenges arising from the synergy between data science, machine learning, and the modeling and optimization problems in the O&G industry.
Submitted 26 May, 2017; v1 submitted 9 May, 2017;
originally announced May 2017.
-
PAMPO: using pattern matching and pos-tagging for effective Named Entities recognition in Portuguese
Authors:
Conceição Rocha,
Alípio Jorge,
Roberta Sionara,
Paula Brito,
Carlos Pimenta,
Solange Rezende
Abstract:
This paper deals with the entity extraction task (named entity recognition) of a text mining process that aims at unveiling non-trivial semantic structures, such as relationships and interactions between entities or communities. In this paper we present a simple and efficient named entity extraction algorithm. The method, named PAMPO (PAttern Matching and POs tagging based algorithm for NER), relies on flexible pattern matching, part-of-speech tagging and lexical-based rules. It was developed to process texts written in Portuguese; however, it is potentially applicable to other languages as well.
We compare our approach with current alternatives that support Named Entity Recognition (NER) for content written in Portuguese. These are Alchemy, Zemanta and Rembrandt. Evaluation of the efficacy of the entity extraction method on several texts written in Portuguese indicates a considerable improvement on $recall$ and $F_1$ measures.
Submitted 30 December, 2016;
originally announced December 2016.
-
Improving incremental recommenders with online bagging
Authors:
João Vinagre,
Alípio Mário Jorge,
João Gama
Abstract:
Online recommender systems often deal with continuous, potentially fast and unbounded flows of data. Ensemble methods for recommender systems have been used in the past in batch algorithms; however, they have never been studied with incremental algorithms that learn from data streams. We evaluate online bagging with an incremental matrix factorization algorithm for top-N recommendation with positive-only -- binary -- ratings. Our results show that online bagging is able to improve accuracy by up to 35% over the baseline, with small computational overhead.
Submitted 26 March, 2018; v1 submitted 2 November, 2016;
originally announced November 2016.
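The abstract above does not spell out the bagging procedure. As a minimal sketch, assuming the common Poisson(1) formulation of online bagging attributed to Oza and Russell, each incoming example is shown to every base learner k times, with k drawn independently from Poisson(1). The `CountingLearner` class is a hypothetical stand-in for an incremental matrix factorization model, and the event tuples are made up for illustration.

```python
import math
import random


def poisson1(rng):
    """Draw k ~ Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class CountingLearner:
    """Hypothetical base learner that only counts its incremental updates."""

    def __init__(self):
        self.updates = 0

    def update(self, example):
        self.updates += 1


def online_bagging_update(learners, example, rng):
    """Present one stream example to each learner k ~ Poisson(1) times."""
    for learner in learners:
        for _ in range(poisson1(rng)):
            learner.update(example)


rng = random.Random(0)
learners = [CountingLearner() for _ in range(4)]
for t in range(1000):
    online_bagging_update(learners, ("user", "item", t), rng)
print([learner.updates for learner in learners])
```

Since Poisson(1) has mean 1, each learner performs roughly one update per example on average, so the extra cost grows with the number of base learners rather than with repeated passes over the data.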
-
Accelerating Recommender Systems using GPUs
Authors:
André Valente Rodrigues,
Alípio Jorge,
Inês Dutra
Abstract:
We describe GPU implementations of the matrix recommender algorithms CCD++ and ALS. We compare the processing time and predictive ability of the GPU implementations with existing multi-core versions of the same algorithms. Results on the GPU are better than the results of the multi-core versions (maximum speedup of 14.8).
Submitted 7 November, 2015;
originally announced November 2015.
-
Evaluation of recommender systems in streaming environments
Authors:
João Vinagre,
Alípio Mário Jorge,
João Gama
Abstract:
Evaluation of recommender systems is typically done with finite datasets. This means that conventional evaluation methodologies are only applicable in offline experiments, where data and models are stationary. However, in real world systems, user feedback is continuously generated, at unpredictable rates. Given this setting, one important issue is how to evaluate algorithms in such a streaming data environment. In this paper we propose a prequential evaluation protocol for recommender systems, suitable for streaming data environments, but also applicable in stationary settings. Using this protocol we are able to monitor the evolution of algorithms' accuracy over time. Furthermore, we are able to perform reliable comparative assessments of algorithms by computing significance tests over a sliding window. We argue that besides being suitable for streaming data, prequential evaluation allows the detection of phenomena that would otherwise remain unnoticed in the evaluation of both offline and online recommender systems.
Submitted 30 April, 2015;
originally announced April 2015.
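The prequential idea in the abstract above can be read as a test-then-train loop: each incoming event is first used to score the current model and only then used to update it. The sketch below is a simplified illustration under that reading, not the authors' protocol; the popularity-based recommender, the event format, and the function name are assumptions made for the example.

```python
from collections import Counter


def prequential_hits(events, n=1):
    """Return a 0/1 hit per event, testing before training on each one."""
    popularity = Counter()  # toy incremental model: global item popularity
    hits = []
    for user, item in events:
        # 1) Test: recommend the n most popular items seen so far.
        top_n = [i for i, _ in popularity.most_common(n)]
        hits.append(1 if item in top_n else 0)
        # 2) Train: update the model with the event just observed.
        popularity[item] += 1
    return hits


events = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "a")]
hits = prequential_hits(events, n=1)
print(hits)
```

Averaging the resulting hit sequence over a sliding window gives an accuracy curve over time, which is the kind of monitoring the abstract refers to.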
-
Using Contextual Information as Virtual Items on Top-N Recommender Systems
Authors:
Marcos A. Domingues,
Alipio Mario Jorge,
Carlos Soares
Abstract:
Traditionally, recommender systems for the Web deal with applications that have two dimensions, users and items. Based on access logs that relate these dimensions, a recommendation model can be built and used to identify a set of N items that will be of interest to a certain user. In this paper we propose a method to complement the information in the access logs with contextual information without changing the recommendation algorithm. The method consists in representing context as virtual items. We empirically test this method with two top-N recommender systems, an item-based collaborative filtering technique and association rules, on three data sets. The results show that our method is able to take advantage of the context (new dimensions) when it is informative.
Submitted 15 November, 2011; v1 submitted 12 November, 2011;
originally announced November 2011.
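One way to picture "context as virtual items", as described in the abstract above, is to append each context value to the set of items a user accessed, so that an unchanged item-based recommender or association-rule miner sees context as ordinary items. The sketch below assumes a simple (user, item, context) log format and a `dimension=value` naming scheme; both are hypothetical and not taken from the paper.

```python
def add_virtual_items(accesses):
    """Build user -> set of items, adding each context value as a virtual item."""
    baskets = {}
    for user, item, context in accesses:
        basket = baskets.setdefault(user, set())
        basket.add(item)
        for dimension, value in context.items():
            # A context value such as day=mon is treated like any other item.
            basket.add(f"{dimension}={value}")
    return baskets


log = [
    ("u1", "page_a", {"day": "mon"}),
    ("u1", "page_b", {"day": "tue"}),
    ("u2", "page_a", {"day": "mon"}),
]
baskets = add_virtual_items(log)
print(baskets)
```

The recommendation algorithm itself is untouched in this sketch; only its input data are enriched, which matches the abstract's claim that the method works without changing the algorithm.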