-
Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning
Authors:
Daniel A. P. Oliveira,
David Martins de Matos
Abstract:
Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%), F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
Submitted 10 July, 2025; v1 submitted 9 July, 2025;
originally announced July 2025.
-
3W Dataset 2.0.0: a realistic and public dataset with rare undesirable real events in oil wells
Authors:
Ricardo Emanuel Vaz Vargas,
Afrânio José de Melo Junior,
Celso José Munaro,
Cláudio Benevenuto de Campos Lima,
Eduardo Toledo de Lima Junior,
Felipe Muntzberg Barrocas,
Flávio Miguel Varejão,
Guilherme Fidelis Peixer,
Igor de Melo Nery Oliveira,
Jader Riso Barbosa Jr.,
Jaime Andrés Lozano Cadena,
Jean Carlos Dias de Araújo,
João Neuenschwander Escosteguy Carneiro,
Lucas Gouveia Omena Lopes,
Lucas Pereira de Gouveia,
Mateus de Araujo Fernandes,
Matheus Lima Scramignon,
Patrick Marques Ciarelli,
Rodrigo Castello Branco,
Rogério Leite Alves Pinto
Abstract:
In the oil industry, undesirable events in oil wells can cause economic losses, environmental accidents, and human casualties. Solutions based on Artificial Intelligence and Machine Learning for Early Detection of such events have proven valuable for diverse applications across industries. In 2019, recognizing the importance and the lack of public datasets related to undesirable events in oil wells, Petrobras developed and publicly released the first version of the 3W Dataset, which is essentially a set of Multivariate Time Series labeled by experts. Since then, the 3W Dataset has been developed collaboratively and has become a foundational reference for numerous works in the field. This data article describes the current publicly available version of the 3W Dataset, which contains structural modifications and additional labeled data. The detailed description provided encourages and supports the 3W community and new 3W users to improve previous published results and to develop new robust methodologies, digital products and services capable of detecting undesirable events in oil wells with enough anticipation to enable corrective or mitigating actions.
Submitted 25 June, 2025;
originally announced July 2025.
-
LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models
Authors:
Danilo de Oliveira,
Julius Richter,
Tal Peer,
Timo Gerkmann
Abstract:
We present LipDiffuser, a conditional diffusion model for lip-to-speech generation synthesizing natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as a denoiser model. To effectively condition the model, we incorporate visual features using magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 and TCD-TIMIT demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition (ASR). These findings are also supported by a formal listening experiment. Extensive ablation studies and cross-dataset evaluation confirm the effectiveness and generalization capabilities of our approach.
Submitted 26 May, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation
Authors:
Daniel A. P. Oliveira,
David Martins de Matos
Abstract:
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
Submitted 15 May, 2025;
originally announced May 2025.
-
Normalize Everything: A Preconditioned Magnitude-Preserving Architecture for Diffusion-Based Speech Enhancement
Authors:
Julius Richter,
Danilo de Oliveira,
Timo Gerkmann
Abstract:
This paper presents a new framework for diffusion-based speech enhancement. Our method employs a Schroedinger bridge to transform the noisy speech distribution into the clean speech distribution. To stabilize and improve training, we employ time-dependent scalings of the inputs and outputs of the network, known as preconditioning. We consider two skip connection configurations, which either include or omit the current process state in the denoiser's output, enabling the network to predict either environmental noise or clean speech. Each approach leads to improved performance on different speech enhancement metrics. To maintain stable magnitude levels and balance during training, we use a magnitude-preserving network architecture that normalizes all activations and network weights to unit length. Additionally, we propose learning the contribution of the noisy input within each network block for effective input conditioning. After training, we apply a method to approximate different exponential moving average (EMA) profiles and investigate their effects on the speech enhancement performance. In contrast to image generation tasks, where longer EMA lengths often enhance mode coverage, we observe that shorter EMA lengths consistently lead to better performance on standard speech enhancement metrics. Code, audio examples, and checkpoints are available online.
Submitted 8 May, 2025;
originally announced May 2025.
-
Performance of Large Language Models in Supporting Medical Diagnosis and Treatment
Authors:
Diogo Sousa,
Guilherme Barbosa,
Catarina Rocha,
Dulce Oliveira
Abstract:
The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.
Submitted 14 April, 2025;
originally announced April 2025.
-
Proprioceptive multistable mechanical metamaterial via soft capacitive sensors
Authors:
Hugo de Souza Oliveira,
Niloofar Saeedzadeh Khaanghah,
Martijn Oetelmans,
Niko Münzenrieder,
Edoardo Milana
Abstract:
The technological transition from soft machines to soft robots necessarily passes through the integration of soft electronics and sensors. This allows for the establishment of feedback control systems while preserving the softness of the robot embodiment. Multistable mechanical metamaterials are excellent building blocks of soft machines, as their nonlinear response can be tuned by design to accomplish several functions. In this work, we present the integration of soft capacitive sensors in a multistable mechanical metamaterial, to enable proprioceptive sensing of state changes. The metamaterial is a periodic arrangement of 4 bistable unit cells. Each unit cell has an integrated capacitive sensor. Both the metastructure and the sensors are made of soft materials (TPU) and are 3D printed. Our preliminary results show that the capacitance variation of the sensors can be linked to state transitions of the metamaterial, by capturing the nonlinear deformation.
Submitted 30 March, 2025;
originally announced March 2025.
-
Meta-Ori: monolithic meta-origami for nonlinear inflatable soft actuators
Authors:
Hugo de Souza Oliveira,
Xin Li,
Johannes Frey,
Edoardo Milana
Abstract:
The nonlinear mechanical response of soft materials and slender structures is purposefully harnessed to program functions by design in soft robotic actuators, such as sequencing, amplified response, and fast energy release. However, typical designs of nonlinear actuators (e.g., balloons, inverted membranes, springs) have a limited design parameter space and complex fabrication processes, hindering the achievement of more elaborate functions. Mechanical metamaterials, on the other hand, have very large design parameter spaces, which allow fine-tuning of nonlinear behaviours. In this work, we present a novel approach to fabricate nonlinear inflatables based on metamaterials and origami (Meta-Ori) as monolithic parts that can be fully 3D printed via Fused Deposition Modeling (FDM) using commercial thermoplastic polyurethane (TPU) filaments. Our design consists of a metamaterial shell with cylindrical topology and nonlinear mechanical response combined with a Kresling origami inflatable acting as a pneumatic transmitter. We develop and release a design tool in the visual programming language Grasshopper to interactively design our Meta-Ori. We characterize the mechanical response of the metashell and the origami, and the nonlinear pressure-volume curve of the Meta-Ori inflatable. Lastly, we demonstrate the actuation sequencing of a bi-segment monolithic Meta-Ori soft actuator.
Submitted 30 March, 2025;
originally announced March 2025.
-
CountPath: Automating Fragment Counting in Digital Pathology
Authors:
Ana Beatriz Vieira,
Maria Valente,
Diana Montezuma,
Tomé Albuquerque,
Liliana Ribeiro,
Domingos Oliveira,
João Monteiro,
Sofia Gonçalves,
Isabel M. Pinto,
Jaime S. Cardoso,
Arlindo L. Oliveira
Abstract:
Quality control of medical images is a critical component of digital pathology, ensuring that diagnostic images meet required standards. A pre-analytical task within this process is the verification of the number of specimen fragments, a process that ensures that the number of fragments on a slide matches the number documented in the macroscopic report. This step is important to ensure that the slides contain the appropriate diagnostic material from the grossing process, thereby guaranteeing the accuracy of subsequent microscopic examination and diagnosis. Traditionally, this assessment is performed manually, requiring significant time and effort while being subject to significant variability due to its subjective nature. To address these challenges, this study explores an automated approach to fragment counting using the YOLOv9 and Vision Transformer models. Our results demonstrate that the automated system achieves a level of performance comparable to expert assessments, offering a reliable and efficient alternative to manual counting. Additionally, we present findings on interobserver variability, showing that the automated approach achieves an accuracy of 86%, which falls within the range of variation observed among experts (82-88%), further supporting its potential for integration into routine pathology workflows.
Submitted 13 March, 2025;
originally announced March 2025.
-
i-WiViG: Interpretable Window Vision GNN
Authors:
Ivica Obadic,
Dmitry Kangin,
Dario Oliveira,
Plamen P Angelov,
Xiao Xiang Zhu
Abstract:
Deep learning models based on graph neural networks have emerged as a popular approach for solving computer vision problems. They encode the image into a graph structure and can be beneficial for efficiently capturing the long-range dependencies typically present in remote sensing imagery. However, an important drawback of these methods is their black-box nature, which may hamper their wider usage in critical applications. In this work, we tackle the self-interpretability of graph-based vision models by proposing our Interpretable Window Vision GNN (i-WiViG) approach, which provides explanations by automatically identifying the relevant subgraphs for the model prediction. This is achieved with window-based image graph processing that constrains the node receptive field to a local image region and by using a self-interpretable graph bottleneck that ranks the importance of the long-range relations between the image regions. We evaluate our approach on remote sensing classification and regression tasks, showing it achieves competitive performance while providing inherent and faithful explanations through the identified relations. Further, the quantitative evaluation reveals that our model reduces the infidelity of post-hoc explanations compared to other Vision GNN models, without sacrificing explanation sparsity.
Submitted 11 March, 2025;
originally announced March 2025.
-
REGRACE: A Robust and Efficient Graph-based Re-localization Algorithm using Consistency Evaluation
Authors:
Débora N. P. Oliveira,
Joshua Knights,
Sebastián Barbas Laina,
Simon Boche,
Wolfram Burgard,
Stefan Leutenegger
Abstract:
Loop closures are essential for correcting odometry drift and creating consistent maps, especially in the context of large-scale navigation. Current methods using dense point clouds for accurate place recognition do not scale well due to computationally expensive scan-to-scan comparisons. Alternative object-centric approaches are more efficient but often struggle with sensitivity to viewpoint variation. In this work, we introduce REGRACE, a novel approach that addresses these challenges of scalability and perspective difference in re-localization by using LiDAR-based submaps. We introduce rotation-invariant features for each labeled object and enhance them with neighborhood context through a graph neural network. To identify potential revisits, we employ a scalable bag-of-words approach, pooling one learned global feature per submap. Additionally, we define a revisit with geometrical consistency cues rather than embedding distance, allowing us to recognize far-away loop closures. Our evaluations demonstrate that REGRACE achieves similar results compared to state-of-the-art place recognition and registration baselines while being twice as fast.
Submitted 5 March, 2025;
originally announced March 2025.
-
GroundCap: A Visually Grounded Image Captioning Dataset
Authors:
Daniel A. P. Oliveira,
Lourenço Teodoro,
David Martins de Matos
Abstract:
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking. We present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and the segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B and Qwen2.5-VL 7B on GroundCap. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.
Submitted 25 June, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Mamute: high-performance computing for geophysical methods
Authors:
João B. Fernandes,
Antônio D. S. Oliveira,
Mateus C. A. T. Silva,
Felipe H. Santos-da-Silva,
Vitor H. M. Rodrigues,
Kleiton A. Schneider,
Calebe P. Bianchini,
João M. de Araujo,
Tiago Barros,
Ítalo A. S. Assis,
Samuel Xavier-de-Souza
Abstract:
Due to their high computational cost, geophysical applications are typically designed to run on large computing systems. Consequently, such applications must implement several high-performance techniques to make better use of computational resources. In this paper, we present Mamute, a software package that delivers wave equation-based geophysical methods. Mamute implements two geophysical methods: seismic modeling and full waveform inversion (FWI). It also supports high-performance strategies such as fault tolerance, automatic parallel loop scheduling, and distributed systems workload balancing. We demonstrate Mamute's operation using both seismic modeling and FWI. Mamute is written in C++ and is readily available under the MIT license.
Submitted 17 February, 2025;
originally announced February 2025.
-
Navigating Gender Disparities in Communication Research Leadership: Academic Recognition, Career Development, and Compensation
Authors:
Diego F. M. Oliveira,
Qian Huang
Abstract:
This study examines gender disparities in communication research through citation metrics, authorship patterns, team composition, and faculty salaries. Using data from 62,359 papers across 121 communication journals, we find that while female authors are increasingly represented, citation gaps persist, with sole-authored papers by women receiving fewer citations than those by men, especially in smaller teams. Team composition analysis reveals a tendency toward gender homophily, with single-gender teams being more common. In top U.S. communication journals, female authors face underrepresentation and citation disparities favoring male authors. Salary analysis from leading U.S. public universities shows that female faculty earn lower salaries at the Assistant Professor level, though disparities lessen at higher ranks. These findings highlight the need for greater efforts to promote gender equity through inclusive collaboration, equitable citation practices, and fair compensation.
Submitted 15 January, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
Finding the Underlying Viscoelastic Constitutive Equation via Universal Differential Equations and Differentiable Physics
Authors:
Elias C. Rodrigues,
Roney L. Thompson,
Dário A. B. Oliveira,
Roberto F. Ausas
Abstract:
This research employs Universal Differential Equations (UDEs) alongside differentiable physics to model viscoelastic fluids, merging conventional differential equations, neural networks and numerical methods to reconstruct missing terms in constitutive models. This study focuses on analyzing four viscoelastic models: Upper Convected Maxwell (UCM), Johnson-Segalman, Giesekus, and Exponential Phan-Thien-Tanner (ePTT), through the use of synthetic datasets. The methodology was tested across different experimental conditions, including oscillatory and startup flows. While the UDE framework effectively predicts shear and normal stresses for most models, it demonstrates some limitations when applied to the ePTT model. The findings underscore the potential of UDEs in fluid mechanics while identifying critical areas for methodological improvement. Also, a model distillation approach was employed to extract simplified models from complex ones, emphasizing the versatility and robustness of UDEs in rheological modeling.
Submitted 23 May, 2025; v1 submitted 31 December, 2024;
originally announced January 2025.
-
Seq2Seq Model-Based Chatbot with LSTM and Attention Mechanism for Enhanced User Interaction
Authors:
Lamya Benaddi,
Charaf Ouaddi,
Adnane Souha,
Abdeslam Jakimi,
Mohamed Rahouti,
Mohammed Aledhari,
Diogo Oliveira,
Brahim Ouchao
Abstract:
A chatbot is an intelligent software application that automates conversations and engages users in natural language through messaging platforms. Leveraging artificial intelligence (AI), chatbots serve various functions, including customer service, information gathering, and casual conversation. Existing virtual assistant chatbots, such as ChatGPT and Gemini, demonstrate the potential of AI in Natural Language Processing (NLP). However, many current solutions rely on predefined APIs, which can result in vendor lock-in and high costs. To address these challenges, this work proposes a chatbot developed using a Sequence-to-Sequence (Seq2Seq) model with an encoder-decoder architecture that incorporates attention mechanisms and Long Short-Term Memory (LSTM) cells. By avoiding predefined APIs, this approach ensures flexibility and cost-effectiveness. The chatbot is trained, validated, and tested on a dataset specifically curated for the tourism sector in Draa-Tafilalet, Morocco. Key evaluation findings indicate that the proposed Seq2Seq model-based chatbot achieved high accuracies: approximately 99.58% in training, 98.03% in validation, and 94.12% in testing. These results demonstrate the chatbot's effectiveness in providing relevant and coherent responses within the tourism domain, highlighting the potential of specialized AI applications to enhance user experience and satisfaction in niche markets.
Submitted 27 December, 2024;
originally announced January 2025.
-
FlowNav: Combining Flow Matching and Depth Priors for Efficient Navigation
Authors:
Samiran Gode,
Abhijeet Nayak,
Débora N. P. Oliveira,
Michael Krawez,
Cordelia Schmid,
Wolfram Burgard
Abstract:
Effective robot navigation in unseen environments is a challenging task that requires precise control actions at high frequencies. Recent advances have framed it as an image-goal-conditioned control problem, where the robot generates navigation actions using frontal RGB images. Current state-of-the-art methods in this area use diffusion policies to generate these control actions. Despite their promising results, these models are computationally expensive and suffer from weak perception. To address these limitations, we present FlowNav, a novel approach that uses a combination of Conditional Flow Matching (CFM) and depth priors from off-the-shelf foundation models to learn action policies for robot navigation. FlowNav is significantly more accurate at navigation and exploration than state-of-the-art methods. We validate our contributions using real robot experiments in multiple unseen environments, demonstrating improved navigation reliability and accuracy. We make the code and trained models publicly available.
Submitted 3 March, 2025; v1 submitted 14 November, 2024;
originally announced November 2024.
-
Understanding Code Understandability Improvements in Code Reviews
Authors:
Delano Oliveira,
Reydne Santos,
Benedito de Oliveira,
Martin Monperrus,
Fernando Castor,
Fernanda Madeiral
Abstract:
Motivation: Code understandability is crucial in software development, as developers spend 58% to 70% of their time reading source code. Improving it can improve productivity and reduce maintenance costs. Problem: Experimental studies often identify factors influencing code understandability in controlled settings but overlook real-world influences like project culture, guidelines, and developers' backgrounds. Ignoring these factors may yield results with limited external validity. Objective: This study investigates how developers enhance code understandability through code review comments, assuming that code reviewers are specialists in code quality. Method and Results: We analyzed 2,401 code review comments from Java open-source projects on GitHub, finding that over 42% focus on improving code understandability. We further examined 385 comments specifically related to this aspect and identified eight categories of concerns, such as inadequate documentation and poor identifiers. Notably, 83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted. We identified various types of patches that enhance understandability, from simple changes like removing unused code to context-dependent improvements such as optimizing method calls. Additionally, we evaluated four well-known linters for their ability to flag these issues, finding they cover less than 30%, although many could be easily added as new rules. Implications: Our findings encourage the development of tools to enhance code understandability, as accepted changes can serve as reliable training data for specialized machine-learning models. Our dataset supports this training and can inform the development of evidence-based code style guides. Data Availability: Our data is publicly available at https://codeupcrc.github.io.
Submitted 12 November, 2024; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech
Authors:
Danilo de Oliveira,
Julius Richter,
Jean-Marie Lemercier,
Simon Welker,
Timo Gerkmann
Abstract:
Diffusion models have found great success in generating high quality, natural samples of speech, but their potential for density estimation for speech has so far remained largely unexplored. In this work, we leverage an unconditional diffusion model trained only on clean speech for the assessment of speech quality. We show that the quality of a speech utterance can be assessed by estimating the likelihood of a corresponding sample in the terminating Gaussian distribution, obtained via a deterministic noising process. The resulting method is purely unsupervised, trained only on clean speech, and therefore does not rely on annotations. Our diffusion-based approach leverages clean speech priors to assess quality based on how the input relates to the learned distribution of clean data. Our proposed log-likelihoods show promising results, correlating well with intrusive speech quality metrics and showing the best correlation with human scores in a listening experiment.
Submitted 13 June, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows
Authors:
Rafael Ferreira da Silva,
Deborah Bard,
Kyle Chard,
Shaun de Witt,
Ian T. Foster,
Tom Gibbs,
Carole Goble,
William Godoy,
Johan Gustafsson,
Utz-Uwe Haus,
Stephen Hudson,
Shantenu Jha,
Laila Los,
Drew Paine,
Frédéric Suter,
Logan Ward,
Sean Wilkinson,
Marcos Amaris,
Yadu Babuji,
Jonathan Bader,
Riccardo Balin,
Daniel Balouek,
Sarah Beecroft,
Khalid Belhajjame,
Rajat Bhattarai
, et al. (86 additional authors not shown)
Abstract:
The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models and complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving computational demands to zettaflop scales, necessitating modular, adaptable systems and cloud-service models to optimize resource utilization and ensure reproducibility. Multi-facility workflows present challenges in data movement, curation, and overcoming institutional silos, while diverse hardware architectures require integrating workflow considerations into early system design and developing standardized resource management tools. The summit emphasized improving user experience in workflow systems and ensuring FAIR workflows to enhance collaboration and accelerate scientific discovery. Key recommendations include developing standardized metrics for time-sensitive workflows, creating frameworks for cloud-HPC integration, implementing distributed-by-design workflow modeling, establishing multi-facility authentication protocols, and accelerating AI integration in HPC workflow management. The summit also called for comprehensive workflow benchmarks, workflow-specific UX principles, and a FAIR workflow maturity model, highlighting the need for continued collaboration in addressing the complex challenges posed by the convergence of AI, HPC, and multi-facility research environments.
Submitted 18 October, 2024;
originally announced October 2024.
-
Urban Computing for Climate and Environmental Justice: Early Perspectives From Two Research Initiatives
Authors:
Carolina Veiga,
Ashish Sharma,
Daniel de Oliveira,
Marcos Lage,
Fabio Miranda
Abstract:
The impacts of climate change are intensifying existing vulnerabilities and disparities within urban communities around the globe, as extreme weather events, including floods and heatwaves, are becoming more frequent and severe, disproportionately affecting low-income and underrepresented groups. Tackling these increasing challenges requires novel approaches that integrate expertise across multiple domains, including computer science, engineering, climate science, and public health. Urban computing can play a pivotal role in these efforts by integrating data from multiple sources to support decision-making and provide actionable insights into weather patterns, infrastructure weaknesses, and population vulnerabilities. However, the capacity to leverage technological advancements varies significantly between the Global South and Global North. In this paper, we present two multiyear, multidisciplinary projects situated in Chicago, USA and Niterói, Brazil, highlighting the opportunities and limitations of urban computing in these diverse contexts. Reflecting on our experiences, we then discuss the essential requirements, as well as existing gaps, for visual analytics tools that facilitate the understanding and mitigation of climate-related risks in urban environments.
Submitted 5 October, 2024;
originally announced October 2024.
-
Investigating Training Objectives for Generative Speech Enhancement
Authors:
Julius Richter,
Danilo de Oliveira,
Timo Gerkmann
Abstract:
Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims to explain the differences between these frameworks by focusing our investigation on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored for the Schrödinger bridge framework, demonstrating enhanced performance and improved perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this domain.
Submitted 18 January, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Computer Vision Model Compression Techniques for Embedded Systems: A Survey
Authors:
Alexandre Lopes,
Fernando Pereira dos Santos,
Diulhio de Oliveira,
Mauricio Schiezaro,
Helio Pedrini
Abstract:
Deep neural networks have consistently represented the state of the art in most computer vision problems. In these scenarios, larger and more complex models have demonstrated superior performance to smaller architectures, especially when trained with plenty of representative data. With the recent adoption of Vision Transformer (ViT) based architectures and advanced Convolutional Neural Networks (CNNs), the total number of parameters of leading backbone architectures increased from 62M parameters in 2012 with AlexNet to 7B parameters in 2024 with AIM-7B. Consequently, deploying such deep architectures faces challenges in environments with processing and runtime constraints, particularly in embedded systems. This paper covers the main model compression techniques applied for computer vision tasks, enabling modern models to be used in embedded systems. We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique and expected variations when analyzing it on various embedded devices. We also share code to assist researchers and new practitioners in overcoming initial implementation challenges for each subarea and present trends for Model Compression. Case studies for compression models are available at https://github.com/venturusbr/cv-model-compression.
Submitted 15 August, 2024;
originally announced August 2024.
-
MyoGestic: EMG Interfacing Framework for Decoding Multiple Spared Degrees of Freedom of the Hand in Individuals with Neural Lesions
Authors:
Raul C. Sîmpetru,
Dominik I. Braun,
Arndt U. Simon,
Michael März,
Vlad Cnejevici,
Daniela Souza de Oliveira,
Nico Weber,
Jonas Walter,
Jörg Franke,
Daniel Höglinger,
Cosima Prahm,
Matthias Ponfick,
Alessandro Del Vecchio
Abstract:
Restoring limb motor function in individuals with spinal cord injury (SCI), stroke, or amputation remains a critical challenge, one which affects millions worldwide. Recent studies show through surface electromyography (EMG) that spared motor neurons can still be voluntarily controlled, even without visible limb movement. These signals can be decoded and used for motor intent estimation; however, current wearable solutions lack the necessary hardware and software for intuitive interfacing of the spared degrees of freedom after neural injuries. To address these limitations, we developed a wireless, high-density EMG bracelet, coupled with a novel software framework, MyoGestic. Our system allows rapid and tailored adaptability of machine learning models to the needs of the users, facilitating real-time decoding of multiple spared distinctive degrees of freedom. In our study, we successfully decoded the motor intent from two participants with SCI, two with spinal stroke, and three amputees in real-time, achieving several controllable degrees of freedom within minutes after wearing the EMG bracelet. We provide a proof-of-concept that these decoded signals can be used to control a digitally rendered hand, a wearable orthosis, a prosthesis, or a 2D cursor. Our framework promotes a participant-centered approach, allowing immediate feedback integration, thus enhancing the iterative development of myocontrol algorithms. The proposed open-source software framework, MyoGestic, allows researchers and patients to focus on the augmentation and training of the spared degrees of freedom after neural lesions, thus potentially bridging the gap between research and clinical application and advancing the development of intuitive EMG interfaces for diverse neural lesions.
Submitted 14 August, 2024;
originally announced August 2024.
-
Curio: A Dataflow-Based Framework for Collaborative Urban Visual Analytics
Authors:
Gustavo Moreira,
Maryam Hosseini,
Carolina Veiga,
Lucas Alexandre,
Nicola Colaninno,
Daniel de Oliveira,
Nivan Ferreira,
Marcos Lage,
Fabio Miranda
Abstract:
Over the past decade, several urban visual analytics systems and tools have been proposed to tackle a host of challenges faced by cities, in areas as diverse as transportation, weather, and real estate. Many of these tools have been designed through collaborations with urban experts, aiming to distill intricate urban analysis workflows into interactive visualizations and interfaces. However, the design, implementation, and practical use of these tools still rely on siloed approaches, resulting in bespoke applications that are difficult to reproduce and extend. At the design level, these tools undervalue rich data workflows from urban experts, typically treating them only as data providers and evaluators. At the implementation level, they lack interoperability with other technical frameworks. At the practical use level, they tend to be narrowly focused on specific fields, inadvertently creating barriers to cross-domain collaboration. To address these gaps, we present Curio, a framework for collaborative urban visual analytics. Curio uses a dataflow model with multiple abstraction levels (code, grammar, GUI elements) to facilitate collaboration across the design and implementation of visual analytics components. The framework allows experts to intertwine data preprocessing, management, and visualization stages while tracking the provenance of code and visualizations. In collaboration with urban experts, we evaluate Curio through a diverse set of usage scenarios targeting urban accessibility, urban microclimate, and sunlight access. These scenarios use different types of data and domain methodologies to illustrate Curio's flexibility in tackling pressing societal challenges. Curio is available at https://urbantk.org/curio.
Submitted 12 August, 2024;
originally announced August 2024.
-
A Survey on Cell Nuclei Instance Segmentation and Classification: Leveraging Context and Attention
Authors:
João D. Nunes,
Diana Montezuma,
Domingos Oliveira,
Tania Pereira,
Jaime S. Cardoso
Abstract:
Manually annotating nuclei from the gigapixel Hematoxylin and Eosin (H&E)-stained Whole Slide Images (WSIs) is a laborious and costly task, meaning automated algorithms for cell nuclei instance segmentation and classification could alleviate the workload of pathologists and clinical researchers and at the same time facilitate the automatic extraction of clinically interpretable features. But due to high intra- and inter-class variability of nuclei morphological and chromatic features, as well as the susceptibility of H&E stains to artefacts, state-of-the-art algorithms cannot correctly detect and classify instances with the necessary performance. In this work, we hypothesise context and attention inductive biases in artificial neural networks (ANNs) could increase the generalization of algorithms for cell nuclei instance segmentation and classification. We conduct a thorough survey on context and attention methods for cell nuclei instance segmentation and classification from H&E-stained microscopy imaging, while providing a comprehensive discussion of the challenges being tackled with context and attention. Besides, we illustrate some limitations of current approaches and present ideas for future research. As a case study, we extend both a general instance segmentation and classification method (Mask-RCNN) and a tailored cell nuclei instance segmentation and classification model (HoVer-Net) with context- and attention-based mechanisms, and do a comparative analysis on a multi-centre colon nuclei identification and counting dataset. Although pathologists rely on context at multiple levels while paying attention to specific Regions of Interest (RoIs) when analysing and annotating WSIs, our findings suggest translating that domain knowledge into algorithm design is no trivial task, but to fully exploit these mechanisms, the scientific understanding of these methods should be addressed.
Submitted 26 July, 2024;
originally announced July 2024.
-
Cryptocurrency Price Forecasting Using XGBoost Regressor and Technical Indicators
Authors:
Abdelatif Hafid,
Maad Ebrahim,
Ali Alfatemi,
Mohamed Rahouti,
Diogo Oliveira
Abstract:
The rapid growth of the stock market has attracted many investors due to its potential for significant profits. However, predicting stock prices accurately is difficult because financial markets are complex and constantly changing. This is especially true for the cryptocurrency market, which is known for its extreme volatility, making it challenging for traders and investors to make wise and profitable decisions. This study introduces a machine learning approach to predict cryptocurrency prices. Specifically, we make use of important technical indicators such as Exponential Moving Average (EMA) and Moving Average Convergence Divergence (MACD) to train and feed the XGBoost regressor model. We demonstrate our approach through an analysis focusing on the closing prices of Bitcoin cryptocurrency. We evaluate the model's performance through various simulations, showing promising results that suggest its usefulness in aiding/guiding cryptocurrency traders and investors in dynamic market conditions.
Submitted 16 July, 2024;
originally announced July 2024.
-
Full Iso-recursive Types
Authors:
Litao Zhou,
Qianyong Wan,
Bruno C. d. S. Oliveira
Abstract:
There are two well-known formulations of recursive types: iso-recursive and equi-recursive types. Abadi and Fiore [1996] have shown that iso- and equi-recursive types have the same expressive power. However, their encoding of equi-recursive types in terms of iso-recursive types requires explicit coercions. These coercions come with significant additional computational overhead, and complicate reasoning about the equivalence of the two formulations of recursive types.
This paper proposes a generalization of iso-recursive types called full iso-recursive types. Full iso-recursive types allow encoding all programs with equi-recursive types without computational overhead. Instead of explicit term coercions, all type transformations are captured by computationally irrelevant casts, which can be erased at runtime without affecting the semantics of the program. Consequently, reasoning about the equivalence between the two approaches can be greatly simplified. We present a calculus called $λ^μ_{Fi}$, which extends the simply typed lambda calculus (STLC) with full iso-recursive types. The $λ^μ_{Fi}$ calculus is proved to be type sound, and shown to have the same expressive power as a calculus with equi-recursive types. We also extend our results to subtyping, and show that equi-recursive subtyping can be expressed in terms of iso-recursive subtyping with cast operators.
Submitted 7 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement
Authors:
Danilo de Oliveira,
Simon Welker,
Julius Richter,
Timo Gerkmann
Abstract:
To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not see. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation. For this, we introduce enhancement models that exploit the widely used PESQ measure. Our "PESQetarian" model achieves 3.82 PESQ on VB-DMD while scoring very poorly in a listening experiment. While the obtained PESQ value of 3.82 would imply "state-of-the-art" PESQ-performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading. Instead, other metrics should be included in the evaluation and the resulting performance predictions should be confirmed by listening.
Submitted 5 June, 2024;
originally announced June 2024.
-
Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges
Authors:
Daniel A. P. Oliveira,
Eugénio Ribeiro,
David Martins de Matos
Abstract:
Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies used in the generation of these narratives, focusing on their principles, strengths, and limitations.
The survey also covers tasks related to automatic story generation, such as image and video captioning, and visual question answering, as well as story generation without visual inputs. These tasks share common challenges with visual story generation and have served as inspiration for the techniques used in the field. We analyze the main datasets and evaluation metrics, providing a critical perspective on their limitations.
Submitted 4 June, 2024;
originally announced June 2024.
-
Contrastive Pretraining for Visual Concept Explanations of Socioeconomic Outcomes
Authors:
Ivica Obadic,
Alex Levering,
Lars Pennig,
Dario Oliveira,
Diego Marcos,
Xiaoxiang Zhu
Abstract:
Predicting socioeconomic indicators from satellite imagery with deep learning has become an increasingly popular research direction. Post-hoc concept-based explanations can be an important step towards broader adoption of these models in policy-making as they enable the interpretation of socioeconomic outcomes based on visual concepts that are intuitive to humans. In this paper, we study the interplay between representation learning using an additional task-specific contrastive loss and post-hoc concept explainability for socioeconomic studies. Our results on two different geographical locations and tasks indicate that the task-specific pretraining imposes a continuous ordering of the latent space embeddings according to the socioeconomic outcomes. This improves the model's interpretability as it enables the latent space of the model to associate concepts encoding typical urban and natural area patterns with continuous intervals of socioeconomic outcomes. Further, we illustrate how analyzing the model's conceptual sensitivity for the intervals of socioeconomic outcomes can shed light on new insights for urban studies.
Submitted 13 June, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
Authors:
Liliane Kunstmann,
Débora Pina,
Daniel de Oliveira,
Marta Mattoso
Abstract:
Many existing scientific workflows require High Performance Computing environments to produce results in a timely manner. These workflows have several software library components and use different environments, making the deployment and execution of the software stack not trivial. This complexity increases if the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring containerization strategies focused on provenance in two distinct HPC environments.
Submitted 25 March, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Ontologia para monitorar a deficiência mental em seus déficts no processamento da informação por declínio cognitivo e evitar agressões psicológicas e físicas em ambientes educacionais com ajuda da I.A*
Authors:
Bruna Araújo de Castro Oliveira
Abstract:
The intention of this article is to propose the use of artificial intelligence to detect through analysis by UFO ontology the emergence of verbal and physical aggression related to psychosocial deficiencies and their provoking agents, in an attempt to prevent catastrophic consequences within school environments.
Submitted 31 January, 2024;
originally announced March 2024.
-
Opening the Black-Box: A Systematic Review on Explainable AI in Remote Sensing
Authors:
Adrian Höhl,
Ivica Obadic,
Miguel Ángel Fernández Torres,
Hiba Najjar,
Dario Oliveira,
Zeynep Akata,
Andreas Dengel,
Xiao Xiang Zhu
Abstract:
In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in remote sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the explainable AI methods used and their objectives, findings, and challenges in remote sensing applications is still missing. In this paper, we address this gap by performing a systematic review to identify the key trends in the field and shed light on novel explainable AI approaches and emerging directions that tackle specific remote sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights, and reflect on the approaches used for the evaluation of explainable AI methods. As such, our review provides a complete summary of the state-of-the-art of explainable AI in remote sensing. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field.
Submitted 6 November, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Are Fact-Checking Tools Helpful? An Exploration of the Usability of Google Fact Check
Authors:
Qiangeng Yang,
Tess Christensen,
Shlok Gilda,
Juliana Fernandes,
Daniela Oliveira,
Ronald Wilson,
Damon Woodard
Abstract:
Fact-checking-specific search tools such as Google Fact Check are a promising way to combat misinformation on social media, especially during events bringing significant social influence, such as the COVID-19 pandemic and the U.S. presidential elections. However, the usability of such an approach has not been thoroughly studied. We evaluated the performance of Google Fact Check by analyzing the retrieved fact-checking results regarding 1,000 COVID-19-related false claims and found it able to retrieve the fact-checking results for 15.8% of the input claims, and the rendered results are relatively reliable. We also found that the false claims receiving different fact-checking verdicts (i.e., "False," "Partly False," "True," and "Unratable") tend to reflect diverse emotional tones, and fact-checking sources tend to check the claims in different lengths and using dictionary words to various extents. Claim variations addressing the same issue yet described differently are likely to retrieve distinct fact-checking results. We suggest that the quantities of the retrieved fact-checking results could be optimized and that slightly adjusting input wording may be the best practice for users to retrieve more useful information. This study aims to contribute to the understanding of state-of-the-art fact-checking tools and information integrity.
Submitted 24 May, 2025; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Evolution of urban areas and land surface temperature
Authors:
Sudipan Saha,
Tushar Verma,
Dario Augusto Borges Oliveira
Abstract:
With the global population on the rise, our cities have been expanding to accommodate the growing number of people. The expansion of cities generally leads to the engulfment of peripheral areas. However, such expansion of urban areas is likely to cause an increase in areas with elevated land surface temperature (LST). By considering each summer as a data point, we form LST multi-year time series and cluster them to obtain spatio-temporal patterns. We observe several interesting phenomena from these patterns, e.g., some clusters show reasonable similarity to the built-up area, whereas the locations with high temporal variation are seen more in the peripheral areas. Furthermore, the LST center of mass shifts over the years for cities with development activities tilted towards a direction. We conduct the above-mentioned studies for three different cities in three different continents.
Submitted 5 January, 2024;
originally announced January 2024.
-
Foundation Models for Generalist Geospatial Artificial Intelligence
Authors:
Johannes Jakubik,
Sujit Roy,
C. E. Phillips,
Paolo Fraccaro,
Denys Godwin,
Bianca Zadrozny,
Daniela Szwarcman,
Carlos Gomes,
Gabby Nyirjesy,
Blair Edwards,
Daiki Kimura,
Naomi Simumba,
Linsong Chu,
S. Karthik Mukkavilli,
Devyani Lambhate,
Kamal Das,
Ranjini Bangalore,
Dario Oliveira,
Michal Muszynski,
Kumar Ankur,
Muthukumaran Ramasubramanian,
Iksha Gurung,
Sam Khallaghi,
Hanxi Li,
et al. (8 additional authors not shown)
Abstract:
Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.
Submitted 8 November, 2023; v1 submitted 28 October, 2023;
originally announced October 2023.
-
MCU-Wide Timing Side Channels and Their Detection
Authors:
Johannes Müller,
Anna Lena Duque Antón,
Lucas Deutschmann,
Dino Mehmedagić,
Cristiano Rodrigues,
Daniel Oliveira,
Keerthikumara Devarajegowda,
Mohammad Rahmani Fadiheh,
Sandro Pinto,
Dominik Stoffel,
Wolfgang Kunz
Abstract:
Microarchitectural timing side channels have been thoroughly investigated as a security threat in hardware designs featuring shared buffers (e.g., caches) or parallelism between attacker and victim task execution. However, contradicting common intuitions, recent activities demonstrate that this threat is real even in microcontroller SoCs without such features. In this paper, we describe SoC-wide timing side channels previously neglected by security analysis and present a new formal method to close this gap. In a case study on the RISC-V Pulpissimo SoC, our method detected a vulnerability to a previously unknown attack variant that allows an attacker to obtain information about a victim's memory access behavior. After implementing a conservative fix, we were able to verify that the SoC is now secure w.r.t. the considered class of timing side channels.
Submitted 18 July, 2024; v1 submitted 22 September, 2023;
originally announced September 2023.
-
Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation
Authors:
Danilo de Oliveira,
Timo Gerkmann
Abstract:
Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to the task of distilling HuBERT. In contrast to methods that focus on distilling internal features, this allows for more freedom in the network architecture of the compressed model. We thus propose to distill HuBERT's Transformer layers into an LSTM-based distilled model that reduces the number of parameters even below DistilHuBERT and at the same time shows improved performance in automatic speech recognition.
Submitted 18 September, 2023;
originally announced September 2023.
-
ProWis: A Visual Approach for Building, Managing, and Analyzing Weather Simulation Ensembles at Runtime
Authors:
Carolina Veiga Ferreira de Souza,
Suzanna Maria Bonnet,
Daniel de Oliveira,
Marcio Cataldi,
Fabio Miranda,
Marcos Lage
Abstract:
Weather forecasting is essential for decision-making and is usually performed using numerical modeling. Numerical weather models, in turn, are complex tools that require specialized training and laborious setup and are challenging even for weather experts. Moreover, weather simulations are data-intensive computations and may take hours to days to complete. When the simulation is finished, the experts face challenges analyzing its outputs, a large mass of spatiotemporal and multivariate data. From the simulation setup to the analysis of results, working with weather simulations involves several manual and error-prone steps. The complexity of the problem increases exponentially when the experts must deal with ensembles of simulations, a frequent task in their daily duties. To tackle these challenges, we propose ProWis: an interactive and provenance-oriented system to help weather experts build, manage, and analyze simulation ensembles at runtime. Our system follows a human-in-the-loop approach to enable the exploration of multiple atmospheric variables and weather scenarios. ProWis was built in close collaboration with weather experts, and we demonstrate its effectiveness by presenting two case studies of rainfall events in Brazil.
Submitted 9 August, 2023;
originally announced August 2023.
-
IRQ Coloring and the Subtle Art of Mitigating Interrupt-generated Interference
Authors:
Diogo Costa,
Luca Cuomo,
Daniel Oliveira,
Ida Maria Savino,
Bruno Morelli,
José Martins,
Alessandro Biasci,
Sandro Pinto
Abstract:
Integrating workloads with differing criticality levels presents a formidable challenge in achieving the stringent spatial and temporal isolation requirements imposed by safety-critical standards such as ISO26262. The shift towards high-performance multicore platforms has been posing increasing issues to the so-called mixed-criticality systems (MCS) due to the reciprocal interference created by consolidated subsystems vying for access to shared (microarchitectural) resources (e.g., caches, bus interconnect, memory controller). The research community has acknowledged all these challenges. Thus, several techniques, such as cache partitioning and memory throttling, have been proposed to mitigate such interference; however, these techniques have some drawbacks and limitations that impact performance, memory footprint, and availability. In this work, we look from a different perspective. Departing from the observation that safety-critical workloads are typically event- and thus interrupt-driven, we mask "colored" interrupts based on the quality-of-service (QoS) assessment, providing fine-grain control to mitigate interference on critical workloads without entirely suspending non-critical workloads. We propose the so-called IRQ coloring technique. We implement and evaluate the IRQ Coloring on a reference high-performance multicore platform, i.e., Xilinx ZCU102. Results demonstrate negligible performance overhead, i.e., <1% for a 100 microseconds period, and reasonable throughput guarantees for medium-critical workloads. We argue that the IRQ coloring technique offers advantages in predictability and intermediate guarantees compared to state-of-the-art mechanisms.
Submitted 2 August, 2023;
originally announced August 2023.
-
Kernels, Data & Physics
Authors:
Francesco Cagnetta,
Deborah Oliveira,
Mahalakshmi Sabanayagam,
Nikolaos Tsilivis,
Julia Kempe
Abstract:
Lecture notes from the course given by Professor Julia Kempe at the summer school "Statistical physics of Machine Learning" in Les Houches. The notes discuss the so-called NTK approach to problems in machine learning, which consists of gaining an understanding of generally unsolvable problems by finding a tractable kernel formulation. The notes are mainly focused on practical applications such as data distillation and adversarial robustness; examples of inductive bias are also discussed.
Submitted 5 July, 2023;
originally announced July 2023.
-
On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings
Authors:
Danilo de Oliveira,
Julius Richter,
Jean-Marie Lemercier,
Tal Peer,
Timo Gerkmann
Abstract:
Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking. Recently, however, novel generative approaches have been applied to speech enhancement, attaining good denoising performance with high subjective quality scores. At the same time, advances in deep learning also allowed for the creation of neural network-based metrics, which have desirable traits such as being able to work without a reference (non-intrusively). Since generatively enhanced speech tends to exhibit radically different residual distortions, its evaluation using instrumental speech metrics may behave differently compared to predictively enhanced speech. In this paper, we evaluate the performance of the same speech enhancement backbone trained under predictive and generative paradigms on a variety of metrics and show that intrusive and non-intrusive measures correlate differently for each paradigm. This analysis motivates the search for metrics that can together paint a complete and unbiased picture of speech enhancement performance, irrespective of the model's training process.
Submitted 5 June, 2023;
originally announced June 2023.
-
Datasets for Portuguese Legal Semantic Textual Similarity: Comparing weak supervision and an annotation process approaches
Authors:
Daniel da Silva Junior,
Paulo Roberto dos S. Corval,
Aline Paes,
Daniel de Oliveira
Abstract:
The Brazilian judiciary has a large workload, resulting in a long time to finish legal proceedings. Brazilian National Council of Justice has established in Resolution 469/2022 formal guidance for document and process digitalization opening up the possibility of using automatic techniques to help with everyday tasks in the legal field, particularly in a large number of texts yielded on the routine of law procedures. Notably, Artificial Intelligence (AI) techniques allow for processing and extracting useful information from textual data, potentially speeding up the process. However, datasets from the legal domain required by several AI techniques are scarce and difficult to obtain as they need labels from experts. To address this challenge, this article contributes with four datasets from the legal domain, two with documents and metadata but unlabeled, and another two labeled with a heuristic aiming at its use in textual semantic similarity tasks. Also, to evaluate the effectiveness of the proposed heuristic label process, this article presents a small ground truth dataset generated from domain expert annotations. The analysis of ground truth labels highlights that semantic analysis of domain text can be challenging even for domain experts. Also, the comparison between ground truth and heuristic labels shows that heuristic labels are useful.
Submitted 29 May, 2023;
originally announced June 2023.
-
Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models
Authors:
Danilo de Oliveira,
Navin Raj Prabhu,
Timo Gerkmann
Abstract:
In large part due to their implicit semantic modeling, self-supervised learning (SSL) methods have significantly increased the performance of valence recognition in speech emotion recognition (SER) systems. Yet, their large size may often hinder practical implementations. In this work, we take HuBERT as an example of an SSL model and analyze the relevance of each of its layers for SER. We show that shallow layers are more important for arousal recognition while deeper layers are more important for valence. This observation motivates the importance of additional textual information for accurate valence recognition, as the distilled framework lacks the depth of its large-scale SSL teacher. Thus, we propose an audio-textual distilled SSL framework that, while having only ~20% of the trainable parameters of a large SSL model, achieves on par performance across the three emotion dimensions (arousal, valence, dominance) on the MSP-Podcast v1.10 dataset.
Submitted 30 May, 2023;
originally announced May 2023.
-
An interpretable machine learning system for colorectal cancer diagnosis from pathology slides
Authors:
Pedro C. Neto,
Diana Montezuma,
Sara P. Oliveira,
Domingos Oliveira,
João Fraga,
Ana Monteiro,
João Monteiro,
Liliana Ribeiro,
Sofia Gonçalves,
Stefan Reinhard,
Inti Zlobec,
Isabel M. Pinto,
Jaime S. Cardoso
Abstract:
Considering the profound transformation affecting pathology practice, we aimed to develop a scalable artificial intelligence (AI) system to diagnose colorectal cancer from whole-slide images (WSI). For this, we propose a deep learning (DL) system that learns from weak labels, a sampling strategy that reduces the number of training samples by a factor of six without compromising performance, an approach to leverage a small subset of fully annotated samples, and a prototype with explainable predictions, active learning features and parallelisation. Noting some problems in the literature, this study is conducted with one of the largest WSI colorectal sample datasets, with approximately 10,500 WSIs. Of these samples, 900 are testing samples. Furthermore, the robustness of the proposed method is assessed with two additional external datasets (TCGA and PAIP) and a dataset of samples collected directly from the proposed prototype. Our proposed method predicts, for the patch-based tiles, a class based on the severity of the dysplasia and uses that information to classify the whole slide. It is trained with an interpretable mixed-supervision scheme to leverage the domain knowledge introduced by pathologists through spatial annotations. The mixed-supervision scheme allowed for an intelligent sampling strategy effectively evaluated in several different scenarios without compromising the performance. On the internal dataset, the method shows an accuracy of 93.44% and a sensitivity between positive (low-grade and high-grade dysplasia) and non-neoplastic samples of 0.996. On the external test samples, performance varied, with TCGA being the most challenging dataset, with an overall accuracy of 84.91% and a sensitivity of 0.996.
Submitted 30 April, 2024; v1 submitted 6 January, 2023;
originally announced January 2023.
-
Chronic pain patient narratives allow for the estimation of current pain intensity
Authors:
Diogo A. P. Nunes,
Joana Ferreira-Gomes,
Daniela Oliveira,
Carlos Vaz,
Sofia Pimenta,
Fani Neto,
David Martins de Matos
Abstract:
Chronic pain is a multi-dimensional experience, and pain intensity plays an important part, impacting the patient's emotional balance, psychology, and behaviour. Standard self-reporting tools, such as the Visual Analogue Scale for pain, fail to capture this burden. Moreover, this type of tool is susceptible to a degree of subjectivity, dependent on the patient's clear understanding of how to use it, social biases, and their ability to translate a complex experience to a scale. To overcome these and other self-reporting challenges, pain intensity estimation has been previously studied based on facial expressions, electroencephalograms, brain imaging, and autonomic features. However, to the best of our knowledge, it has never been attempted to base this estimation on the patient narratives of the personal experience of chronic pain, which is what we propose in this work. Indeed, in the clinical assessment and management of chronic pain, verbal communication is essential to convey information to physicians that would otherwise not be easily accessible through standard reporting tools, since language, sociocultural, and psychosocial variables are intertwined. We show that language features from patient narratives indeed convey information relevant for pain intensity estimation, and that our computational models can take advantage of that. Specifically, our results show that patients with mild pain focus more on the use of verbs, whilst moderate and severe pain patients focus on adverbs, and nouns and adjectives, respectively, and that these differences allow for the distinction between these three pain classes.
Submitted 17 November, 2022; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Exploring Self-Attention for Crop-type Classification Explainability
Authors:
Ivica Obadic,
Ribana Roscher,
Dario Augusto Borges Oliveira,
Xiao Xiang Zhu
Abstract:
Transformer models have become a promising approach for crop-type classification. Although their attention weights can be used to understand the relevant time points for crop disambiguation, the validity of these insights depends on how closely the attention weights approximate the actual workings of these black-box models, which is not always clear. In this paper, we introduce a novel explainability framework that systematically evaluates the explanatory power of the attention weights of a standard transformer encoder for crop-type classification. Our results show that attention patterns strongly relate to key dates, which are often associated with critical phenological events for crop-type classification. Further, the sensitivity analysis reveals the limited capability of the attention weights to characterize crop phenology, as the identified phenological events depend on the other crops considered during training. This limitation highlights the relevance of future work towards the development of deep learning approaches capable of automatically learning the temporal vegetation dynamics for accurate crop disambiguation.
Submitted 20 April, 2025; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
Authors:
Moacir Antonelli Ponti,
Lucas de Angelis Oliveira,
Mathias Esteban,
Valentina Garcia,
Juan Martín Román,
Luis Argerich
Abstract:
Real-world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might contribute differently towards learning. This motivates studies to better understand the role of data instances with respect to their contribution to good model metrics. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Tree ensembles is still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics and a robust boosting algorithm. We show results on detecting noisy labels in order to clean datasets, improving models' metrics in synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
Submitted 22 February, 2024; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Transfer-learning for video classification: Video Swin Transformer on multiple domains
Authors:
Daniel A. P. Oliveira,
David Martins de Matos
Abstract:
The computer vision community has seen a shift from convolutional-based to pure transformer architectures for both image and video tasks. Training a transformer from zero for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something using a transfer learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model which is equal to the state-of-the-art for the dataset and a 21% accuracy on Something-Something. The experiments also suggest that the performance of the VST decreases on average when the video duration increases which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are from the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where most datasets target mostly objects. On the other hand, if the classes are not from the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
Submitted 28 March, 2025; v1 submitted 18 October, 2022;
originally announced October 2022.