-
Studying and Improving Graph Neural Network-based Motif Estimation
Authors:
Pedro C. Vieira,
Miguel E. P. Silva,
Pedro Manuel Pinto Ribeiro
Abstract:
Graph Neural Networks (GNNs) are a predominant method for graph representation learning. However, beyond subgraph frequency estimation, their application to network motif significance-profile (SP) prediction remains under-explored, with no established benchmarks in the literature. We propose to address this problem, framing SP estimation as a task independent of subgraph frequency estimation. Our approach shifts from frequency counting to direct SP estimation and modulates the problem as multitarget regression. The reformulation is optimised for interpretability, stability and scalability on large graphs. We validate our method using a large synthetic dataset and further test it on real-world graphs. Our experiments reveal that 1-WL limited models struggle to make precise estimations of SPs. However, they can generalise to approximate the graph generation processes of networks by comparing their predicted SP with the ones originating from synthetic generators. This first study on GNN-based motif estimation also hints at how using direct SP estimation can help go past the theoretical limitations that motif estimation faces when performed through subgraph counting.
Submitted 30 May, 2025;
originally announced June 2025.
-
Imbalanced malware classification: an approach based on dynamic classifier selection
Authors:
J. V. S. Souza,
C. B. Vieira,
G. D. C. Cavalcanti,
R. M. O. Cruz
Abstract:
In recent years, the rise of cyber threats has emphasized the need for robust malware detection systems, especially on mobile devices. Malware, which targets vulnerabilities in devices and user data, represents a substantial security risk. A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications. We assess monolithic classifiers and ensemble methods, focusing on dynamic selection algorithms, which have shown superior performance compared to traditional approaches. In contrast to balancing strategies performed on the whole dataset, we propose a balancing procedure that works individually for each classifier in the pool. Our empirical analysis demonstrates that the KNOP algorithm obtained the best results using a pool of Random Forest. Additionally, an instance hardness assessment revealed that balancing reduces the difficulty of the minority class and enhances the detection of the minority class (malware). The code used for the experiments is available at https://github.com/jvss2/Machine-Learning-Empirical-Evaluation.
Submitted 5 April, 2025; v1 submitted 30 March, 2025;
originally announced April 2025.
-
An analysis of data variation and bias in image-based dermatological datasets for machine learning classification
Authors:
Francisco Filho,
Emanoel Santos,
Rodrigo Mota,
Kelvin Cunha,
Fabio Papais,
Amanda Arruda,
Mateus Baltazar,
Camila Vieira,
José Gabriel Tavares,
Rafael Barros,
Othon Souza,
Thales Bezerra,
Natalia Lopes,
Érico Moutinho,
Jéssica Guido,
Shirley Cruz,
Paulo Borba,
Tsang Ing Ren
Abstract:
AI algorithms have become valuable in aiding professionals in healthcare. The increasing confidence obtained by these models is helpful in critical decision demands. In clinical dermatology, classification models can detect malignant lesions on patients' skin using only RGB images as input. However, most learning-based methods are trained on data acquired from dermoscopic datasets, which are large and validated by a gold standard. Clinical models aim to deal with classification on images from users' smartphone cameras, which do not offer the resolution provided by dermoscopy. Clinical applications also bring new challenges: they can contain captures from uncontrolled environments, skin tone variations, viewpoint changes, noise in data and labels, and unbalanced classes. A possible alternative would be to use transfer learning to deal with the clinical images. However, as the number of samples is low, it can degrade the model's performance; the source distribution used in training differs from the test set. This work aims to evaluate the gap between dermoscopic and clinical samples and understand how the dataset variations impact training. It assesses the main differences between distributions that disturb the model's prediction. Finally, from experiments on different architectures, we argue how to combine the data from divergent distributions, decreasing the impact on the model's final accuracy.
Submitted 11 February, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Semi-supervised classification of dental conditions in panoramic radiographs using large language model and instance segmentation: A real-world dataset evaluation
Authors:
Bernardo Silva,
Jefferson Fontinele,
Carolina Letícia Zilli Vieira,
João Manuel R. S. Tavares,
Patricia Ramos Cury,
Luciano Oliveira
Abstract:
Dental panoramic radiographs offer vast diagnostic opportunities, but training supervised deep learning networks for automatic analysis of those radiology images is hampered by a shortage of labeled data. Here, a different perspective on this problem is introduced. A semi-supervised learning framework is proposed to classify thirteen dental conditions on panoramic radiographs, with a particular emphasis on teeth. Large language models were explored to annotate the most common dental conditions based on dental reports. Additionally, a masked autoencoder was employed to pre-train the classification neural network, and a Vision Transformer was used to leverage the unlabeled data. The analyses were validated using two of the most extensive datasets in the literature, comprising 8,795 panoramic radiographs and 8,029 paired reports and images. Encouragingly, the results consistently met or surpassed the baseline metrics for the Matthews correlation coefficient. A comparison of the proposed solution with human practitioners, supported by statistical analysis, highlighted its effectiveness and performance limitations; based on the degree of agreement among specialists, the solution demonstrated an accuracy level comparable to that of a junior specialist.
Submitted 25 June, 2024;
originally announced June 2024.
-
S+t-SNE -- Bringing Dimensionality Reduction to Data Streams
Authors:
Pedro C. Vieira,
João P. Montrezol,
João T. Vieira,
João Gama
Abstract:
We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle infinite data streams. The core idea behind S+t-SNE is to update the t-SNE embedding incrementally as new data arrives, ensuring scalability and adaptability to handle streaming scenarios. By selecting the most important points at each step, the algorithm ensures scalability while keeping informative visualisations. By employing a blind method for drift management, the algorithm adjusts the embedding space, which facilitates the visualisation of evolving data dynamics. Our experimental evaluations demonstrate the effectiveness and efficiency of S+t-SNE, whilst highlighting its ability to capture patterns in a streaming scenario. We hope our approach offers researchers and practitioners a real-time tool for understanding and interpreting high-dimensional data.
Submitted 21 January, 2025; v1 submitted 26 March, 2024;
originally announced March 2024.
-
DeepVox and SAVE-CT: a contrast- and dose-independent 3D deep learning approach for thoracic aorta segmentation and aneurysm prediction using computed tomography scans
Authors:
Matheus del-Valle,
Lariza Laura de Oliveira,
Henrique Cursino Vieira,
Henrique Min Ho Lee,
Lucas Lembrança Pinheiro,
Maria Fernanda Portugal,
Newton Shydeo Brandão Miyoshi,
Nelson Wolosker
Abstract:
Thoracic aortic aneurysm (TAA) is a fatal disease which potentially leads to dissection or rupture through progressive enlargement of the aorta. It is usually asymptomatic and screening recommendations are limited. The gold-standard evaluation is performed by computed tomography angiography (CTA) and radiologists' time-consuming assessment. Scans for other indications could help with this screening; however, if acquired without contrast enhancement or with a low-dose protocol, they can make the clinical evaluation difficult, besides increasing the number of scans for the radiologists. In this study, 587 unique CT scans were selected, including control and TAA patients, acquired with low and standard dose protocols, with or without contrast enhancement. A novel segmentation model, DeepVox, exhibited dice score coefficients of 0.932 and 0.897 for development and test sets, respectively, with faster training speed in comparison to models reported in the literature. The novel TAA classification model, SAVE-CT, presented accuracies of 0.930 and 0.922 for development and test sets, respectively, using only the binary segmentation mask from DeepVox as input, without hand-engineered features. These two models together are a potential approach for TAA screening, as they can handle a variable number of slices as input, handling thoracic and thoracoabdominal sequences, in a fully automated contrast- and dose-independent evaluation. This may help decrease TAA mortality and prioritize the evaluation queue of patients for radiologists.
Submitted 23 October, 2023;
originally announced October 2023.
-
Desaparecidxs: characterizing the population of missing children using Twitter
Authors:
Carolina Coimbra Vieira,
Diego Alburez-Gutierrez,
Marília R. Nepomuceno,
Tom Theile
Abstract:
Missing children, i.e., children reported to a relevant authority as having "disappeared," constitute an important but often overlooked population. From a research perspective, missing children constitute a hard-to-reach population about which little is known. This is a particular problem in regions of the Global South that lack robust or centralized data collection systems. In this study, we analyze the composition of the population of missing children in Guatemala, a country with high levels of violence. We contrast the official aggregated-level data from the Guatemalan National Police during the 2018-2020 period with real-time individual-level data on missing children from the official Twitter account of the Alerta Alba-Keneth, a governmental warning system tasked with disseminating information about missing children. Using the Twitter data, we characterize the population of missing children in Guatemala by single-year age, sex, and place of disappearance. Our results show that women are more likely to be reported as missing, particularly those aged 13-17. We discuss the findings in light of the known links between missing people, violence, and human trafficking. Finally, the study highlights the potential of web data to contribute to society by improving our understanding of this and similar hard-to-reach populations.
Submitted 6 May, 2022;
originally announced May 2022.
-
Convolutional Neural Network to Restore Low-Dose Digital Breast Tomosynthesis Projections in a Variance Stabilization Domain
Authors:
Rodrigo de Barros Vimieiro,
Chuang Niu,
Hongming Shan,
Lucas Rodrigues Borges,
Ge Wang,
Marcelo Andrade da Costa Vieira
Abstract:
Digital breast tomosynthesis (DBT) exams should utilize the lowest possible radiation dose while maintaining sufficiently good image quality for accurate medical diagnosis. In this work, we propose a convolutional neural network (CNN) to restore low-dose (LD) DBT projections to achieve an image quality equivalent to a standard full-dose (FD) acquisition. The proposed network architecture benefits from priors in terms of layers that were inspired by traditional model-based (MB) restoration methods, considering a model-based deep learning approach, where the network is trained to operate in the variance stabilization transformation (VST) domain. To accurately control the network operation point, in terms of noise and blur of the restored image, we propose a loss function that minimizes the bias and matches residual noise between the input and the output. The training dataset was composed of clinical data acquired at the standard FD and low-dose pairs obtained by the injection of quantum noise. The network was tested using real DBT projections acquired with a physical anthropomorphic breast phantom. The proposed network achieved superior results in terms of the mean normalized squared error (MNSE), training time and noise spatial correlation compared with networks trained with traditional data-driven methods. The proposed approach can be extended to other medical imaging applications that require LD acquisitions.
Submitted 22 March, 2022;
originally announced March 2022.
-
Impact of loss functions on the performance of a deep neural network designed to restore low-dose digital mammography
Authors:
Hongming Shan,
Rodrigo de Barros Vimieiro,
Lucas Rodrigues Borges,
Marcelo Andrade da Costa Vieira,
Ge Wang
Abstract:
Digital mammography is still the most common imaging tool for breast cancer screening. Although the benefits of using digital mammography for cancer screening outweigh the risks associated with the x-ray exposure, the radiation dose must be kept as low as possible while maintaining the diagnostic utility of the generated images, thus minimizing patient risks. Many studies investigated the feasibility of dose reduction by restoring low-dose images using deep neural networks. In these cases, choosing the appropriate training database and loss function is crucial and impacts the quality of the results. In this work, a modification of the ResNet architecture, with hierarchical skip connections, is proposed to restore low-dose digital mammography. We compared the restored images to the standard full-dose images. Moreover, we evaluated the performance of several loss functions for this task. For training purposes, we extracted 256,000 image patches from a dataset of 400 images of retrospective clinical mammography exams, where different dose levels were simulated to generate low and standard-dose pairs. To validate the network in a real scenario, a physical anthropomorphic breast phantom was used to acquire real low-dose and standard full-dose images in a commercially available mammography system, which were then processed through our trained model. An analytical restoration model for low-dose digital mammography, previously presented, was used as a benchmark in this work. Objective assessment was performed through the signal-to-noise ratio (SNR) and mean normalized squared error (MNSE), decomposed into residual noise and bias. Results showed that the perceptual loss function (PL4) is able to achieve virtually the same noise levels as a full-dose acquisition, while resulting in smaller signal bias compared to other loss functions.
Submitted 12 November, 2021;
originally announced November 2021.
-
Learning from pandemics: using extraordinary events can improve disease now-casting models
Authors:
Sara Mesquita,
Cláudio Haupt Vieira,
Lília Perfeito,
Joana Gonçalves-Sá
Abstract:
Online searches have been used to study different health-related behaviours, including monitoring disease outbreaks. An obvious caveat is that several reasons can motivate individuals to seek online information, and models that are blind to people's motivations are of limited use and can even mislead. This is particularly true during extraordinary public health crises, such as the ongoing pandemic, when fear, curiosity and many other reasons can lead individuals to search for health-related information, masking the disease-driven searches. However, health crises can also offer an opportunity to disentangle different drivers and learn about human behavior. Here, we focus on the two pandemics of the 21st century (2009-H1N1 flu and Covid-19) and propose a methodology to discriminate between search patterns linked to general information seeking (media driven) and search patterns possibly more associated with actual infection (disease driven). We show that by learning from such pandemic periods, with high anxiety and media hype, it is possible to select online searches and improve model performance both in pandemic and seasonal settings. Moreover, and despite the common claim that more data is always better, our results indicate that a lower volume of the right data can be better than including large volumes of apparently similar data, especially in the long run. Our work provides a general framework that can be applied beyond specific events and diseases, and argues that algorithms can be improved simply by using less (better) data. This has important consequences, for example, to solve the accuracy-explainability trade-off in machine-learning.
Submitted 17 January, 2021;
originally announced January 2021.
-
Can WhatsApp Counter Misinformation by Limiting Message Forwarding?
Authors:
Philipe de Freitas Melo,
Carolina Coimbra Vieira,
Kiran Garimella,
Pedro O. S. Vaz de Melo,
Fabrício Benevenuto
Abstract:
WhatsApp is the most popular messaging app in the world. The closed nature of the app, in addition to the ease of transferring multimedia and sharing information to large-scale groups, makes WhatsApp unique among other platforms, where anonymous encrypted messages can become viral, reaching multiple users in a short period of time. The personal feeling and immediacy of messages directly delivered to the user's phone on WhatsApp was extensively abused to spread unfounded rumors and create misinformation campaigns during recent elections in Brazil and India. WhatsApp has been deploying measures to mitigate this problem, such as reducing the limit for forwarding a message to at most five users at once. Despite the welcomed effort to counter the problem, there is no evidence so far on the real effectiveness of such restrictions. In this work, we propose a methodology to evaluate the effectiveness of such measures on the spreading of misinformation circulating on WhatsApp. We use an epidemiological model and real data gathered from WhatsApp in Brazil, India and Indonesia to assess the impact of limiting virality features in this kind of network. Our results suggest that the current efforts deployed by WhatsApp can offer significant delays on the information spread, but they are ineffective in blocking the propagation of misinformation campaigns through public groups when the content has a high viral nature.
Submitted 23 September, 2019; v1 submitted 18 September, 2019;
originally announced September 2019.
-
An Adversarial Risk Analysis Framework for Cybersecurity
Authors:
David Rios Insua,
Aitor Couce Vieira,
Jose Antonio Rubio,
Wolter Pieters,
Katsiaryna Labunets,
Daniel Garcia Rasines
Abstract:
Cyber threats affect all kinds of organisations. Risk analysis is an essential methodology for cybersecurity as it allows organisations to deal with the cyber threats potentially affecting them, prioritise the defence of their assets and decide what security controls should be implemented. Many risk analysis methods are present in cybersecurity models, compliance frameworks and international standards. However, most of them employ risk matrices, which suffer shortcomings that may lead to suboptimal resource allocations. We propose a comprehensive framework for cybersecurity risk analysis, covering the presence of both adversarial and non-intentional threats and the use of insurance as part of the security portfolio. A case study illustrating the proposed framework is presented, serving as template for more complex cases.
Submitted 18 March, 2019;
originally announced March 2019.
-
Data Augmentation for Detection of Architectural Distortion in Digital Mammography using Deep Learning Approach
Authors:
Arthur C. Costa,
Helder C. R. Oliveira,
Juliana H. Catani,
Nestor de Barros,
Carlos F. E. Melo,
Marcelo A. C. Vieira
Abstract:
Early detection of breast cancer can increase treatment efficiency. Architectural Distortion (AD) is a very subtle contraction of the breast tissue and may represent the earliest sign of cancer. Since it is very likely to go unnoticed by radiologists, several approaches have been proposed over the years, but none using deep learning techniques. Training a Convolutional Neural Network (CNN), which is a deep neural architecture, requires a huge amount of data. To overcome this problem, this paper proposes a data augmentation approach applied to a clinical image dataset to properly train a CNN. Results using receiver operating characteristic analysis showed that with a very limited dataset we could train a CNN to detect AD in digital mammography with an area under the curve (AUC) of 0.74.
Submitted 5 July, 2018;
originally announced July 2018.
-
A Graphical Adversarial Risk Analysis Model for Oil and Gas Drilling Cybersecurity
Authors:
Aitor Couce Vieira,
Siv Hilde Houmb,
David Rios Insua
Abstract:
Oil and gas drilling is based, increasingly, on operational technology, whose cybersecurity is complicated by several challenges. We propose a graphical model for cybersecurity risk assessment based on Adversarial Risk Analysis to face those challenges. We also provide an example of the model in the context of an offshore drilling rig. The proposed model provides a more formal and comprehensive analysis of risks, still using the standard business language based on decisions, risks, and value.
Submitted 7 April, 2014;
originally announced April 2014.
-
Free Instrument for Movement Measure
Authors:
Norberto Peña,
Bruno Cecílio Credidio,
Lorena Peixoto Nogueira Rodriguez Martinez Salles Corrêa,
Lucas Gabriel Souza França,
Marcelo do Vale Cunha,
Marcos Cavalcanti de Sousa,
João Paulo Bomfim Cruz Vieira,
José Garcia Vivas Miranda
Abstract:
This paper presents the validation of a computational tool that serves to obtain continuous measurements of moving objects. The software uses techniques of computer vision, pattern recognition and optical flow to enable tracking of objects in videos, generating trajectory, velocity, acceleration and angular movement data. The program was applied to track a ball around a simple pendulum. The validation methodology compares the values measured by the program with the theoretical values expected according to the model of a simple pendulum. The experiment is appropriate to the method because it was built within the limits of the linear harmonic oscillator and energy losses due to friction had been minimized, making it as close to ideal as possible. The results indicate that the tool is sensitive and accurate. Deviations of less than a millimeter along the extent of the trajectory ensure the applicability of the software in physics, whether in research or in teaching.
Submitted 29 June, 2013;
originally announced July 2013.