-
Advances on Affordable Hardware Platforms for Human Demonstration Acquisition in Agricultural Applications
Authors:
Alberto San-Miguel-Tello,
Gennaro Scarati,
Alejandro Hernández,
Mario Cavero-Vidal,
Aakash Maroti,
Néstor García
Abstract:
This paper presents advances on the Universal Manipulation Interface (UMI), a low-cost hand-held gripper for robot Learning from Demonstration (LfD), for complex in-the-wild scenarios found in agricultural settings. The focus is on improving the acquisition of suitable samples with minimal additional setup. First, idle times and the user's cognitive load are reduced by extracting individual samples from a continuous demonstration based on task events. Second, the reliability of the generated task-sample trajectories is increased by combining on-board inertial measurements with external visual marker localization using Extended Kalman Filtering (EKF). Results are presented for a fruit harvesting task, outperforming the default pipeline.
Submitted 11 June, 2025;
originally announced June 2025.
-
Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification
Authors:
Lu Wei,
Liangzhi Li,
Tong Xiang,
Xiao Liu,
Noa Garcia
Abstract:
The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.
Submitted 5 June, 2025;
originally announced June 2025.
-
AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis
Authors:
Enmin Zhong,
Carlos R. del-Blanco,
Daniel Berjón,
Fernando Jaureguizar,
Narciso García
Abstract:
Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.
Submitted 30 April, 2025;
originally announced May 2025.
-
ImageSet2Text: Describing Sets of Images through Text
Authors:
Piera Riccio,
Francesco Galati,
Kajetan Schweighofer,
Noa Garcia,
Nuria Oliver
Abstract:
We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.
Submitted 25 March, 2025;
originally announced March 2025.
-
RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware
Authors:
Gonzalo Santamaría Gómez,
Guillem García Subies,
Pablo Gutiérrez Ruiz,
Mario González Valero,
Natàlia Fuertes,
Helena Montoro Zamorano,
Carmen Muñoz Sanz,
Leire Rosado Plaza,
Nuria Aldama García,
David Betancur Sánchez,
Kateryna Sushkova,
Marta Guerrero Nieto,
Álvaro Barbero Jiménez
Abstract:
Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing this kind of model to minimize these requirements is crucial. In this article, we demonstrate that, using a relatively small pretrained LLM as a basis, it is possible to enhance a state-of-the-art model for a given language task with minimal resources and in a remarkably short time, without compromising its overall capabilities. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.
Submitted 11 March, 2025;
originally announced March 2025.
-
Interpreting and Steering Protein Language Models through Sparse Autoencoders
Authors:
Edith Natalia Villegas Garcia,
Alessio Ansuini
Abstract:
The rapid advancements in transformer-based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAE) to interpret the internal representations of protein language models, specifically focusing on the ESM-2 8M parameter model. By performing a statistical analysis on each latent component's relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs.
We then leverage these insights to guide sequence generation, shortlisting the relevant latent components that can steer the model towards desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
Submitted 13 February, 2025;
originally announced February 2025.
-
6KSFx Synth Dataset
Authors:
Nelly Garcia,
Joshua Reiss
Abstract:
Procedural audio, often referred to as "digital Foley", generates sound from scratch using computational processes. It represents an innovative approach to sound-effects creation. However, the development and adoption of procedural audio has been constrained by a lack of publicly available datasets and models, which hinders evaluation and optimization. To address this important gap, this paper presents a dataset of 6000 synthetic audio samples specifically designed to advance research and development in sound synthesis within 30 sound categories. By offering a description of the diverse synthesis methods used in each sound category and supporting the creation of robust evaluation frameworks, this dataset not only highlights the potential of procedural audio, but also provides a resource for researchers, audio developers, and sound designers. This contribution can accelerate the progress of procedural audio, opening up new possibilities in digital sound design.
Submitted 27 January, 2025;
originally announced January 2025.
-
MEL: Legal Spanish Language Model
Authors:
David Betancur Sánchez,
Nuria Aldama García,
Álvaro Barbero Jiménez,
Marta Guerrero Nieto,
Patricia Marsà Morales,
Nicolás Serrano Salas,
Carlos García Hernán,
Pablo Haya Coll,
Elena Montiel Ponsoda,
Pablo Calleja Ibáñez
Abstract:
Legal texts, characterized by complex and specialized terminology, present a significant challenge for Language Models. Adding an underrepresented language, such as Spanish, to the mix makes it even more challenging. While pre-trained models like XLM-RoBERTa have shown capabilities in handling multilingual corpora, their performance on domain-specific documents remains underexplored. This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large, fine-tuned on legal documents such as BOE (Boletín Oficial del Estado, the Spanish official gazette of laws) and congress texts. We detail the data collection, processing, training, and evaluation processes. Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language. We also present case studies demonstrating the model's application to new legal texts, highlighting its potential to achieve top results across different NLP tasks.
Submitted 27 January, 2025;
originally announced January 2025.
-
3CEL: A corpus of legal Spanish contract clauses
Authors:
Nuria Aldama García,
Patricia Marsà Morales,
David Betancur Sánchez,
Álvaro Barbero Jiménez,
Marta Guerrero Nieto,
Pablo Haya Coll,
Patricia Martín Chozas,
Elena Montiel Ponsoda
Abstract:
Legal corpora for Natural Language Processing (NLP) are valuable and scarce resources in languages like Spanish for two main reasons: data accessibility and the availability of legal expert knowledge. INESData 2024 is a European Union-funded project led by the Universidad Politécnica de Madrid (UPM) and developed by Instituto de Ingeniería del Conocimiento (IIC) to create a series of state-of-the-art NLP resources applied to the legal/administrative domain in Spanish. The goal of this paper is to present the Corpus of Legal Spanish Contract Clauses (3CEL), which is a contract information extraction corpus developed within the framework of INESData 2024. 3CEL contains 373 manually annotated tenders using 19 defined categories (4 782 total tags) that identify key information for contract understanding and reviewing.
Submitted 27 January, 2025;
originally announced January 2025.
-
No Annotations for Object Detection in Art through Stable Diffusion
Authors:
Patrick Ramos,
Nicolas Gonthier,
Selina Khan,
Yuta Nakashima,
Noa Garcia
Abstract:
Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models' art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at https://github.com/patrick-john-ramos/nada
Submitted 17 December, 2024; v1 submitted 9 December, 2024;
originally announced December 2024.
-
Best Practices for Responsible Machine Learning in Credit Scoring
Authors:
Giovani Valdrighi,
Athyrson M. Ribeiro,
Jansen S. B. Pereira,
Vitoria Guardieiro,
Arthur Hendricks,
Décio Miranda Filho,
Juan David Nieto Garcia,
Felipe F. Bocca,
Thalita B. Veronese,
Lucas Wanner,
Marcos Medeiros Raimundo
Abstract:
The widespread use of machine learning in credit scoring has brought significant advancements in risk assessment and decision-making. However, it has also raised concerns about potential biases, discrimination, and lack of transparency in these automated systems. This tutorial paper performed a non-systematic literature review to guide best practices for developing responsible machine learning models in credit scoring, focusing on fairness, reject inference, and explainability. We discuss definitions, metrics, and techniques for mitigating biases and ensuring equitable outcomes across different groups. Additionally, we address the issue of limited data representativeness by exploring reject inference methods that incorporate information from rejected loan applications. Finally, we emphasize the importance of transparency and explainability in credit models, discussing techniques that provide insights into the decision-making process and enable individuals to understand and potentially improve their creditworthiness. By adopting these best practices, financial institutions can harness the power of machine learning while upholding ethical and responsible lending practices.
Submitted 30 September, 2024;
originally announced September 2024.
-
Gender Bias Evaluation in Text-to-image Generation: A Survey
Authors:
Yankun Wu,
Yuta Nakashima,
Noa Garcia
Abstract:
The rapid development of text-to-image generation has brought rising ethical considerations, especially regarding gender bias. Given a text prompt as input, text-to-image models generate images according to the prompt. Pioneering models such as Stable Diffusion and DALL-E 2 have demonstrated remarkable capabilities in producing high-fidelity images from natural language prompts. However, these models often exhibit gender bias, as shown by their tendency to generate men from prompts such as "a photo of a software developer". Given the widespread application and increasing accessibility of these models, bias evaluation is crucial for regulating the development of text-to-image generation. Unlike well-established metrics for evaluating image quality or fidelity, the evaluation of bias presents challenges and lacks standard approaches. Although biases related to other factors, such as skin tone, have been explored, gender bias remains the most extensively studied. In this paper, we review recent work on gender bias evaluation in text-to-image generation, covering bias evaluation setup, bias evaluation metrics, and findings and trends. We primarily focus on the evaluation of recent popular models such as Stable Diffusion, a diffusion model operating in the latent space and using CLIP text embedding, and DALL-E 2, a diffusion model leveraging Seq2Seq architectures like BART. By analyzing recent work and discussing trends, we aim to provide insights for future work.
Submitted 21 August, 2024;
originally announced August 2024.
-
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
Authors:
Xiao Liu,
Liangzhi Li,
Tong Xiang,
Fuying Ye,
Lu Wei,
Wangyue Li,
Noa Garcia
Abstract:
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?
Submitted 22 July, 2024;
originally announced July 2024.
-
Would Deep Generative Models Amplify Bias in Future Models?
Authors:
Tianwei Chen,
Yusuke Hirota,
Mayu Otani,
Noa Garcia,
Yuta Nakashima
Abstract:
We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images, concerns arise regarding inherent biases that may accompany them, potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop, resulting in bias amplification, would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models, which we evaluate in terms of quality and bias. Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead, instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena, such as artifacts in image generation (e.g., blurry faces) or pre-existing biases in the original datasets.
Submitted 4 April, 2024;
originally announced April 2024.
-
Can multiple-choice questions really be useful in detecting the abilities of LLMs?
Authors:
Wangyue Li,
Liangzhi Li,
Tong Xiang,
Xiao Liu,
Wei Deng,
Noa Garcia
Abstract:
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs' capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQs' efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit an order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is not only reflected in the evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.
Submitted 23 May, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Combined Task and Motion Planning Via Sketch Decompositions (Extended Version with Supplementary Material)
Authors:
Magí Dalmau-Moreno,
Néstor García,
Vicenç Gómez,
Héctor Geffner
Abstract:
The challenge in combined task and motion planning (TAMP) is the effective integration of a search over a combinatorial space, usually carried out by a task planner, and a search over a continuous configuration space, carried out by a motion planner. Using motion planners for testing the feasibility of task plans and filling out the details is not effective because it makes the geometrical constraints play a passive role. This work introduces a new interleaved approach for integrating the two dimensions of TAMP that makes use of sketches, a recent simple but powerful language for expressing the decomposition of problems into subproblems. A sketch has width 1 if it decomposes the problem into subproblems that can be solved greedily in linear time. In the paper, a general sketch is introduced for several classes of TAMP problems which has width 1 under suitable assumptions. While sketch decompositions have been developed for classical planning, they offer two important benefits in the context of TAMP. First, when a task plan is found to be unfeasible due to the geometric constraints, the combinatorial search resumes in a specific sub-problem. Second, the sampling of object configurations is not done once, globally, at the start of the search, but locally, at the start of each subproblem. Optimizations of this basic setting are also considered and experimental results over existing and new pick-and-place benchmarks are reported.
Submitted 24 March, 2024;
originally announced March 2024.
-
Sim-to-Real gap in RL: Use Case with TIAGo and Isaac Sim/Gym
Authors:
Jaume Albardaner,
Alberto San Miguel,
Néstor García,
Magí Dalmau-Moreno
Abstract:
This paper explores policy-learning approaches in the context of sim-to-real transfer for robotic manipulation using a TIAGo mobile manipulator, focusing on two state-of-the-art simulators, Isaac Gym and Isaac Sim, both developed by Nvidia. Control architectures are discussed, with a particular emphasis on achieving collision-free movement in both simulation and the real environment. The results presented demonstrate successful sim-to-real transfer, showcasing similar movements executed by an RL-trained model in both simulated and real setups.
Submitted 27 March, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Advancing dermatological diagnosis: Development of a hyperspectral dermatoscope for enhanced skin imaging
Authors:
Martin J. Hetz,
Carina Nogueira Garcia,
Sarah Haggenmüller,
Titus J. Brinker
Abstract:
Clinical dermatology necessitates precision and innovation for efficient diagnosis and treatment of various skin conditions. This paper introduces the development of a cutting-edge hyperspectral dermatoscope (the Hyperscope) tailored for human skin analysis. We detail the requirements for such a device and the design considerations, from optical configurations to sensor selection, necessary to capture a wide spectral range with high fidelity. Preliminary results from 15 individuals and 160 recorded skin images demonstrate the potential of the Hyperscope in identifying and characterizing various skin conditions, offering a promising avenue for non-invasive skin evaluation and a platform for future research in dermatology-related hyperspectral imaging.
Submitted 25 June, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
Multi-organ Self-supervised Contrastive Learning for Breast Lesion Segmentation
Authors:
Hugo Figueiras,
Helena Aidos,
Nuno Cruz Garcia
Abstract:
Self-supervised learning has proven to be an effective way to learn representations in domains where annotated labels are scarce, such as medical imaging. A widely adopted framework for this purpose is contrastive learning and it has been applied to different scenarios. This paper seeks to advance our understanding of the contrastive learning framework by exploring a novel perspective: employing multi-organ datasets for pre-training models tailored to specific organ-related target tasks. More specifically, our target task is breast tumour segmentation in ultrasound images. The pre-training datasets include ultrasound images from other organs, such as the lungs and heart, and large datasets of natural images. Our results show that conventional contrastive learning pre-training improves performance compared to supervised baseline approaches. Furthermore, our pre-trained models achieve comparable performance when fine-tuned with only half of the available labelled data. Our findings also show the advantages of pre-training on diverse organ data for improving performance in the downstream task.
Submitted 21 February, 2024;
originally announced February 2024.
-
Stable Diffusion Exposed: Gender Bias from Prompt to Image
Authors:
Yankun Wu,
Yuta Nakashima,
Noa Garcia
Abstract:
Several studies have raised awareness about social biases in image generative models, demonstrating their predisposition towards stereotypes and imbalances. This paper contributes to this growing body of research by introducing an evaluation protocol that analyzes the impact of gender indicators at every step of the generation process on Stable Diffusion images. Leveraging insights from prior work, we explore how gender indicators not only affect gender presentation but also the representation of objects and layouts within the generated images. Our findings include the existence of differences in the depiction of objects, such as instruments tailored for specific genders, and shifts in overall layouts. We also reveal that neutral prompts tend to produce images more aligned with masculine prompts than their feminine counterparts. We further explore where bias originates through representational disparities and how it manifests in the images via prompt-image dependencies, and provide recommendations for developers and users to mitigate potential bias in image generation.
Submitted 11 August, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
TOP-Former: A Multi-Agent Transformer Approach for the Team Orienteering Problem
Authors:
Daniel Fuertes,
Carlos R. del-Blanco,
Fernando Jaureguizar,
Narciso García
Abstract:
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation, often integrated within larger Intelligent Transportation Systems (ITS). This problem is commonly formulated as a Vehicle Routing Problem (VRP) known as the Team Orienteering Problem (TOP). Existing solvers for this problem primarily rely on either linear programming, which provides accurate solutions but requires computation times that grow with the size of the problem, or heuristic methods, which typically find suboptimal solutions in a shorter time. In this paper, we introduce TOP-Former, a multi-agent route planning neural network designed to efficiently and accurately solve the Team Orienteering Problem. The proposed algorithm is based on a centralized Transformer neural network capable of learning to encode the scenario (modeled as a graph) and analyze the complete context of all agents to deliver fast, precise, and collaborative solutions. Unlike other neural network-based approaches that adopt a more local perspective, TOP-Former is trained to understand the global situation of the vehicle fleet and generate solutions that maximize long-term expected returns. Extensive experiments demonstrate that the presented system outperforms most state-of-the-art methods in terms of both accuracy and computation speed.
Submitted 20 May, 2025; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Situating the social issues of image generation models in the model life cycle: a sociotechnical approach
Authors:
Amelia Katirai,
Noa Garcia,
Kazuki Ide,
Yuta Nakashima,
Atsuo Kishimoto
Abstract:
The race to develop image generation models is intensifying, with a rapid increase in the number of text-to-image models available. This is coupled with growing public awareness of these technologies. Though other generative AI models--notably, large language models--have received recent critical attention for the social and other non-technical issues they raise, there has been relatively little comparable examination of image generation models. This paper reports on a novel, comprehensive categorization of the social issues associated with image generation models. At the intersection of machine learning and the social sciences, we report the results of a survey of the literature, identifying seven issue clusters arising from image generation models: data issues, intellectual property, bias, privacy, and the impacts on the informational, cultural, and natural environments. We situate these social issues in the model life cycle, to aid in considering where potential issues arise, and mitigation may be needed. We then compare these issue clusters with what has been reported for large language models. Ultimately, we argue that the risks posed by image generation models are comparable in severity to the risks posed by large language models, and that the social impact of image generation models must be urgently considered.
Submitted 22 July, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
Authors:
Tong Xiang,
Liangzhi Li,
Wangyue Li,
Mingbai Bai,
Lu Wei,
Bowen Wang,
Noa Garcia
Abstract:
Recent advances in natural language processing (NLP) have led to a new trend of applying large language models (LLMs) to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem by unintentionally generating factually false statements. This can lead to harmful consequences, especially when produced within sensitive contexts, such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form (LF) generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has been mostly conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building LF generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resourced languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect in the topic of maternity and infant care. In an effort to minimize the reliance on human resources for performance evaluation, we offer off-the-shelf judgment models for automatically assessing the LF output of LLMs given benchmark questions. Moreover, we compare potential solutions for LF generation evaluation and provide insights for building better automated metrics.
Submitted 26 October, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
A Cluster-Based Opposition Differential Evolution Algorithm Boosted by a Local Search for ECG Signal Classification
Authors:
Mehran Pourvahab,
Seyed Jalaleddin Mousavirad,
Virginie Felizardo,
Nuno Pombo,
Henriques Zacarias,
Hamzeh Mohammadigheymasi,
Sebastião Pais,
Seyed Nooreddin Jafari,
Nuno M. Garcia
Abstract:
Electrocardiogram (ECG) signals, which capture the heart's electrical activity, are used to diagnose and monitor cardiac problems. The accurate classification of ECG signals, particularly for distinguishing among various types of arrhythmias and myocardial infarctions, is crucial for the early detection and treatment of heart-related diseases. This paper proposes a novel approach based on an improved differential evolution (DE) algorithm to enhance the performance of ECG signal classification. In the initial stages of our approach, the preprocessing step is followed by the extraction of several significant features from the ECG signals. These extracted features are then provided as inputs to an enhanced multi-layer perceptron (MLP). While MLPs are still widely used for ECG signal classification, gradient-based methods, the most widely used algorithms for training them, have significant disadvantages, such as the possibility of becoming stuck in local optima. This paper employs an enhanced DE algorithm, one of the most effective population-based algorithms, for the training process. To this end, we improved DE based on a clustering-based strategy, opposition-based learning, and a local search. Clustering-based strategies can act as crossover operators, while the goal of the opposition operator is to improve the exploration of the DE algorithm. The weights and biases found by the improved DE algorithm are then fed into six gradient-based local search algorithms. In other words, the weights found by the DE are employed as an initialization point. Therefore, we introduced six different algorithms for the training process (in terms of different local search algorithms). In an extensive set of experiments, we showed that our proposed training algorithm could provide better results than the conventional training algorithms.
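The opposition operator mentioned in this abstract can be illustrated with the textbook definition of opposition-based learning: for a candidate x bounded by [lower, upper], the opposite point is lower + upper - x. The sketch below is only an illustration under that assumption; the function names, the bounds, and the sphere objective are hypothetical and are not taken from the paper.

```python
def opposite(candidate, lower, upper):
    """Return the opposite point of a candidate within per-dimension bounds."""
    return [lo + hi - x for x, lo, hi in zip(candidate, lower, upper)]


def opposition_step(population, lower, upper, fitness):
    """Merge a population with its opposites and keep the best half.

    Assumes minimization: a lower fitness value is better.
    """
    merged = population + [opposite(p, lower, upper) for p in population]
    merged.sort(key=fitness)
    return merged[:len(population)]


def sphere(x):
    """Hypothetical stand-in objective (sum of squares)."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    population = [[4.0], [1.0]]
    survivors = opposition_step(population, [0.0], [4.0], sphere)
    print(survivors)
```

In a DE-trained MLP, each candidate would be a flattened vector of weights and biases and the fitness would be the classification error; those details are left out of this sketch.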
Submitted 6 October, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis
Authors:
Yankun Wu,
Yuta Nakashima,
Noa Garcia
Abstract:
The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style that may reflect the author's emotions, social trends, artistic movement, etc., and their deep comprehension undoubtedly requires handling both. A promising step towards a general paradigm for art analysis is to disentangle content and style, whereas relying on human annotations to cull a single aspect of artworks has limitations in learning semantic concepts and the visual appearance of paintings. We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images sufficiently serve as a proxy of the real distribution of artworks, allowing GOYA to separately represent the two elements of art while keeping more information than existing methods.
Submitted 20 April, 2023;
originally announced April 2023.
-
Model-Agnostic Gender Debiased Image Captioning
Authors:
Yusuke Hirota,
Yuta Nakashima,
Noa Garcia
Abstract:
Image captioning models are known to perpetuate and amplify harmful societal bias in the training set. In this work, we aim to mitigate such gender bias in image captioning models. While prior work has addressed this problem by forcing models to focus on people to reduce gender misclassification, it conversely generates gender-stereotypical words at the expense of predicting the correct gender. From this observation, we hypothesize that there are two types of gender bias affecting image captioning models: 1) bias that exploits context to predict gender, and 2) bias in the probability of generating certain (often stereotypical) words because of gender. To mitigate both types of gender biases, we propose a framework, called LIBRA, that learns from synthetically biased samples to decrease both types of biases, correcting gender misclassification and changing gender-stereotypical words to more neutral ones. Code is available at https://github.com/rebnej/LIBRA.
Submitted 21 December, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Authors:
Noa Garcia,
Yusuke Hirota,
Yankun Wu,
Yuta Nakashima
Abstract:
The increasing tendency to collect large and uncurated datasets to train vision-and-language models has raised concerns about fair representations. It is known that even small but manually annotated datasets, such as MSCOCO, are affected by societal bias. This problem, far from being solved, may be getting worse with data crawled from the Internet without much control. In addition, the lack of tools to analyze societal bias in big collections of images makes addressing the problem extremely challenging. Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes. Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented. Our last contribution lies in evaluating three prevailing vision-and-language tasks: image captioning, text-image CLIP embeddings, and text-to-image generation, showing that societal bias is a persistent problem in all of them.
Submitted 5 April, 2023;
originally announced April 2023.
-
Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma
Authors:
Tirtha Chanda,
Katja Hauser,
Sarah Hobelsberger,
Tabea-Clara Bucher,
Carina Nogueira Garcia,
Christoph Wies,
Harald Kittler,
Philipp Tschandl,
Cristian Navarrete-Dechent,
Sebastian Podlipnik,
Emmanouil Chousakos,
Iva Crnaric,
Jovana Majstorovic,
Linda Alhajwan,
Tanya Foreman,
Sandra Peternel,
Sergei Sarap,
İrem Özdemir,
Raymond L. Barnhill,
Mar Llamas Velasco,
Gabriela Poch,
Sören Korsing,
Wiebke Sondermann,
Frank Friedrich Gellrich,
Markus V. Heppt
, et al. (10 additional authors not shown)
Abstract:
Although artificial intelligence (AI) systems have been shown to improve the accuracy of initial melanoma diagnosis, the lack of transparency in how these systems identify melanoma poses severe obstacles to user acceptance. Explainable artificial intelligence (XAI) methods can help to increase transparency, but most XAI methods are unable to produce precisely located domain-specific explanations, making the explanations difficult to interpret. Moreover, the impact of XAI methods on dermatologists has not yet been evaluated. Building on two existing classifiers, we developed an XAI system that produces text- and region-based explanations that are easily interpretable by dermatologists alongside its differential diagnoses of melanomas and nevi. To evaluate this system, we conducted a three-part reader study to assess its impact on clinicians' diagnostic accuracy, confidence, and trust in the XAI support. We showed that our XAI's explanations were highly aligned with clinicians' explanations and that both the clinicians' trust in the support system and their confidence in their diagnoses were significantly increased when using our XAI compared to using a conventional AI system. The clinicians' diagnostic accuracy was numerically, albeit not significantly, increased. This work demonstrates that clinicians are willing to adopt such an XAI system, motivating their future use in the clinic.
Submitted 17 March, 2023;
originally announced March 2023.
-
MROS: A framework for robot self-adaptation
Authors:
Gustavo Rezende Silva,
Darko Bozhinoski,
Mario Garzon Oviedo,
Mariano Ramírez Montero,
Nadia Hammoudeh Garcia,
Harshavardhan Deshpande,
Andrzej Wasowski,
Carlos Hernandez Corbato
Abstract:
Self-adaptation can be used in robotics to increase system robustness and reliability. This work describes the Metacontrol method for self-adaptation in robotics. In particular, it details how the MROS (Metacontrol for ROS Systems) framework implements and packages Metacontrol, and it demonstrates how MROS can be applied in a navigation scenario where a mobile robot navigates on a factory floor. Video: https://www.youtube.com/watch?v=ISe9aMskJuE
Submitted 16 March, 2023;
originally announced March 2023.
-
A Comparative Analysis of Bias Amplification in Graph Neural Network Approaches for Recommender Systems
Authors:
Nikzad Chizari,
Niloufar Shoeibi,
María N. Moreno-García
Abstract:
Recommender Systems (RSs) are used to provide users with personalized item recommendations and help them overcome the problem of information overload. Currently, recommendation methods based on deep learning are gaining ground over traditional methods such as matrix factorization due to their ability to represent the complex relationships between users and items and to incorporate additional information. The fact that these data have a graph structure and the greater capability of Graph Neural Networks (GNNs) to learn from these structures has led to their successful incorporation into recommender systems. However, the bias amplification issue needs to be investigated when using these algorithms. Bias results in unfair decisions, which can negatively affect a company's reputation and financial status due to societal disappointment and environmental harm. In this paper, we aim to comprehensively study this problem through a literature review and an analysis of the behavior against biases of different GNN-based algorithms compared to state-of-the-art methods. We also intend to explore appropriate solutions to tackle this issue with the least possible impact on the model performance.
Submitted 18 January, 2023;
originally announced January 2023.
-
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Authors:
Tianwei Chen,
Noa Garcia,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima,
Hajime Nagahara
Abstract:
Is more data always better to train vision-and-language models? We study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks their overall performance will improve. However, we show that not all the knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conduct an exhaustive analysis based on hundreds of cross-experiments on 12 vision-and-language tasks categorized in 4 groups. Whereas tasks in the same group are prone to improve each other, results show that this is not always the case. Other factors such as dataset size or pre-training stage also have a great impact on how well the knowledge is transferred.
Submitted 23 August, 2022;
originally announced August 2022.
-
RigoBERTa: A State-of-the-Art Language Model For Spanish
Authors:
Alejandro Vaca Serrano,
Guillem Garcia Subies,
Helena Montoro Zamorano,
Nuria Aldama Garcia,
Doaa Samy,
David Betancur Sanchez,
Antonio Moreno Sandoval,
Marta Guerrero Nieto,
Alvaro Barbero Jimenez
Abstract:
This paper presents RigoBERTa, a State-of-the-Art Language Model for Spanish. RigoBERTa is trained over a well-curated corpus built from different subcorpora with key features. It follows the DeBERTa architecture, which has several advantages over other architectures of similar size, such as BERT or RoBERTa. RigoBERTa's performance is assessed over 13 NLU tasks in comparison with other available Spanish language models, namely, MarIA, BERTIN and BETO. RigoBERTa outperformed the three models in 10 out of the 13 tasks, achieving new "State-of-the-Art" results.
Submitted 3 June, 2022; v1 submitted 27 April, 2022;
originally announced May 2022.
-
Gender and Racial Bias in Visual Question Answering Datasets
Authors:
Yusuke Hirota,
Yuta Nakashima,
Noa Garcia
Abstract:
Vision-and-language tasks have increasingly drawn more attention as a means to evaluate human-like reasoning in machine learning models. A popular task in the field is visual question answering (VQA), which aims to answer questions about images. However, VQA models have been shown to exploit language bias by learning the statistical correlations between questions and answers without looking into the image content: e.g., questions about the color of a banana are answered with yellow, even if the banana in the image is green. If societal bias (e.g., sexism, racism, ableism, etc.) is present in the training data, this problem may be causing VQA models to learn harmful stereotypes. For this reason, we investigate gender and racial bias in five VQA datasets. In our analysis, we find that the distribution of answers is highly different between questions about women and men, as well as the existence of detrimental gender-stereotypical samples. Likewise, we identify that specific race-related attributes are underrepresented, whereas potentially discriminatory samples appear in the analyzed datasets. Our findings suggest that there are dangers associated with using VQA datasets without considering and dealing with the potentially harmful stereotypes. We conclude the paper by proposing solutions to alleviate the problem before, during, and after the dataset collection process.
Submitted 3 June, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Emerging Immersive Communication Systems: Overview, Taxonomy, and Good Practises for QoE Assessment
Authors:
Pablo Pérez,
Ester Gonzalez-Sosa,
Jesús Gutiérrez,
Narciso García
Abstract:
Several technological and scientific advances have been achieved recently in the field of immersive systems, which are offering new possibilities to applications and services in different communication domains, such as entertainment, virtual conferencing, working meetings, social relations, healthcare, and industry. Users of these immersive technologies can explore and experience the stimuli in a more interactive and personalized way than with previous technologies. Thus, considering the new technological challenges related to these systems and the new perceptual dimensions and interaction behaviors involved, a deep understanding of the users' Quality of Experience (QoE) is required to satisfy their demands and expectations. In this sense, it is essential to foster research on evaluating the QoE of immersive communication systems, since this will provide useful outcomes to optimize them and to identify the factors that can deteriorate the user experience. With this aim, subjective tests are usually performed following standard methodologies, which are designed for specific technologies and services. Although numerous user studies have already been published, there are no recommendations or standards that define common testing methodologies to be applied to evaluate immersive communication systems, such as those developed for images and video. Therefore, a revision of the QoE evaluation methods designed for previous technologies is required to develop robust and reliable methodologies for immersive communication systems. Thus, the objective of this paper is to provide an overview of existing immersive communication systems and related user studies, which can help in defining basic guidelines and testing methodologies to be used when performing user tests of immersive communication systems, such as 360-degree video-based telepresence, avatar-based social VR, cooperative AR, etc.
Submitted 1 September, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Quantifying Societal Bias Amplification in Image Captioning
Authors:
Yusuke Hirota,
Yuta Nakashima,
Noa Garcia
Abstract:
We study societal bias amplification in image captioning. Image captioning models have been shown to perpetuate gender and racial biases; however, metrics to measure, quantify, and evaluate the societal bias in captions are not yet standardized. We provide a comprehensive study on the strengths and limitations of each metric, and propose LIC, a metric to study captioning bias amplification. We argue that, for image captioning, it is not enough to focus on the correct prediction of the protected attribute, and the whole context should be taken into account. We conduct extensive evaluation on traditional and state-of-the-art image captioning models, and surprisingly find that, by only focusing on the protected attribute prediction, bias mitigation models are unexpectedly amplifying bias.
Submitted 29 March, 2022;
originally announced March 2022.
-
The Met Dataset: Instance-level Recognition for Artworks
Authors:
Nikolaos-Antonios Ypsilantis,
Noa Garcia,
Guangxing Han,
Sarah Ibrahimi,
Nanne Van Noord,
Giorgos Tolias
Abstract:
This work introduces a dataset for large-scale instance-level recognition in the domain of artworks. The proposed benchmark exhibits a number of different challenges such as large inter-class similarity, long tail distribution, and many classes. We rely on the open access collection of The Met museum to form a large training set of about 224k classes, where each class corresponds to a museum exhibit with photos taken under studio conditions. Testing is primarily performed on photos taken by museum guests depicting exhibits, which introduces a distribution shift between training and testing. Testing is additionally performed on a set of images not related to Met exhibits making the task resemble an out-of-distribution detection problem. The proposed benchmark follows the paradigm of other recent datasets for instance-level recognition on different domains to encourage research on domain independent approaches. A number of suitable approaches are evaluated to offer a testbed for future comparisons. Self-supervised and supervised contrastive learning are effectively combined to train the backbone which is used for non-parametric classification that is shown as a promising direction. Dataset webpage: http://cmp.felk.cvut.cz/met/
Submitted 3 February, 2022;
originally announced February 2022.
-
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties
Authors:
Mohamed S. Kraiem,
Fernando Sánchez-Hernández,
María N. Moreno-García
Abstract:
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
Submitted 15 December, 2021;
originally announced January 2022.
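As a rough illustration of two terms used in the abstract above, the sketch below computes an imbalance ratio and applies plain random oversampling. It is a minimal example under stated assumptions, not the resampling strategies or selection models evaluated in the paper: the function names `imbalance_ratio` and `random_oversample` are hypothetical, and the ratio is assumed to be majority count divided by minority count.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class has as many examples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # Original examples of this class, used as the sampling pool.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: 8 negative and 2 positive examples.
labels = ["neg"] * 8 + ["pos"] * 2
samples = list(range(10))
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

On the toy data the ratio is 8 / 2, and after oversampling both classes hold eight examples. Random duplication is only the most basic strategy; the abstract notes that the best choice depends on dataset properties such as overlap and dimensionality.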
-
Transferring Domain-Agnostic Knowledge in Video Question Answering
Authors:
Tianran Wu,
Noa Garcia,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima,
Haruo Takemura
Abstract:
Video question answering (VideoQA) is designed to answer a given question based on a relevant video clip. The currently available large-scale datasets have made it possible to formulate VideoQA as the joint understanding of visual and language information. However, this training procedure is costly and still falls short of human performance. In this paper, we investigate a transfer learning method by the introduction of domain-agnostic knowledge and domain-specific knowledge. First, we develop a novel transfer learning framework, which finetunes the pre-trained model by applying domain-agnostic knowledge as the medium. Second, we construct a new VideoQA dataset with 21,412 human-generated question-answer samples for comparable transfer of knowledge. Our experiments show that: (i) domain-agnostic knowledge is transferable and (ii) our proposed transfer learning framework can boost VideoQA performance effectively.
Submitted 25 October, 2021;
originally announced October 2021.
-
Detecting Renewal States in Chains of Variable Length via Intrinsic Bayes Factors
Authors:
Victor Freguglia,
Nancy Garcia
Abstract:
Markov chains with variable length are useful parsimonious stochastic models able to generate most stationary sequences of discrete symbols. The idea is to identify the suffixes of the past, called contexts, that are relevant to predict the future symbol. Sometimes a single state is a context, and looking at the past and finding this specific state makes the further past irrelevant. States with such a property are called renewal states and they can be used to split the chain into independent and identically distributed blocks. In order to identify renewal states for chains with variable length, we propose the use of the Intrinsic Bayes Factor to evaluate the hypothesis that some particular state is a renewal state. In this case, the difficulty lies in integrating the marginal posterior distribution for the random context trees for a general prior distribution on the space of context trees, with a Dirichlet prior for the transition probabilities, and Monte Carlo methods are applied. To show the strength of our method, we analyzed artificial datasets generated from different binary models and one example coming from the field of Linguistics.
Submitted 6 January, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
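To make concrete what "splitting the chain into blocks" at a renewal state means in the abstract above, here is a minimal sketch. It assumes the renewal state is already known; it does not implement the Intrinsic Bayes Factor test that the paper proposes for detecting such states, and the helper name `split_at_renewal` is hypothetical.

```python
def split_at_renewal(sequence, renewal_state):
    """Cut a symbol sequence right after each occurrence of the renewal state.

    Every block except possibly the last ends with the renewal state. The
    first block starts at the beginning of the observed data, not at a
    renewal, and a trailing block may be cut short by the end of the data.
    """
    blocks, current = [], []
    for symbol in sequence:
        current.append(symbol)
        if symbol == renewal_state:
            blocks.append(current)
            current = []
    if current:
        # Leftover symbols after the last renewal form an incomplete block.
        blocks.append(current)
    return blocks


# Toy binary chain, assuming (for illustration only) that 0 is a renewal state.
chain = [0, 1, 1, 0, 1, 0, 0, 1]
blocks = split_at_renewal(chain, 0)
```

If the state is truly a renewal state, the complete blocks between consecutive renewals can be treated as independent and identically distributed, which is the property the abstract refers to.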
-
Dynamic inference of user context through social tag embedding for music recommendation
Authors:
Diego Sánchez-Moreno,
Álvaro Lozano Murciego,
Vivian F. López Batista,
María Dolores Muñoz Vicente,
María N. Moreno-García
Abstract:
Music listening preferences at a given time depend on a wide range of contextual factors, such as user emotional state, location and activity at listening time, the day of the week, the time of the day, etc. It is therefore of great importance to take them into account when recommending music. However, it is very difficult to develop context-aware recommender systems that consider these factors, both because of the difficulty of detecting some of them, such as emotional state, and because of the drawbacks derived from the inclusion of many factors, such as sparsity problems in contextual pre-filtering. This work proposes a method for detecting the user's contextual state when listening to music based on the social tags of music items. The intrinsic characteristics of social tagging that allow for the description of items in multiple dimensions can be exploited to capture many contextual dimensions in the user listening sessions. The embeddings of the tags of the first items played in each session are used to represent the context of that session. Recommendations are then generated based on both user preferences and the similarity of the items computed from tag embeddings. Social tags have been used extensively in many recommender systems; however, to our knowledge, they have hardly been used to dynamically infer contextual states.
Submitted 23 September, 2021;
originally announced September 2021.
-
Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation
Authors:
Zechen Bai,
Yuta Nakashima,
Noa Garcia
Abstract:
Have you ever looked at a painting and wondered what is the story behind it? This work presents a framework to bring art closer to people by generating comprehensive descriptions of fine-art paintings. Generating informative descriptions for artworks, however, is extremely challenging, as it requires 1) describing multiple aspects of the image such as its style, content, or composition, and 2) providing background and contextual knowledge about the artist, their influences, or the historical period. To address these challenges, we introduce a multi-topic and knowledgeable art description framework, which modules the generated sentences according to three artistic topics and, additionally, enhances each description with external knowledge. The framework is validated through an exhaustive analysis, both quantitative and qualitative, as well as a comparative human evaluation, demonstrating outstanding results in terms of both topic diversity and information veracity.
Submitted 13 September, 2021;
originally announced September 2021.
-
Soccer line mark segmentation and classification with stochastic watershed transform
Authors:
Daniel Berjón,
Carlos Cuevas,
Narciso García
Abstract:
Augmented reality applications are beginning to change the way sports are broadcast, providing richer experiences and valuable insights to fans. The first step of augmented reality systems is camera calibration, possibly based on detecting the line markings of the playing field. Most existing proposals for line detection rely on edge detection and Hough transform, but radial distortion and extraneous edges cause inaccurate or spurious detections of line markings. We propose a novel strategy to automatically and accurately segment and classify line markings. First, line points are segmented thanks to a stochastic watershed transform that is robust to radial distortions, since it makes no assumptions about line straightness, and is unaffected by the presence of players or the ball. The line points are then linked to primitive structures (straight lines and ellipses) thanks to a very efficient procedure that makes no assumptions about the number of primitives that appear in each image. The strategy has been tested on a new and public database composed of 60 annotated images from matches in five stadiums. The results obtained show that the proposed strategy is more robust and accurate than existing approaches, achieving successful line mark detection even in challenging conditions.
Submitted 3 August, 2022; v1 submitted 13 August, 2021;
originally announced August 2021.
-
A Picture May Be Worth a Hundred Words for Visual Question Answering
Authors:
Yusuke Hirota,
Noa Garcia,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima,
Ittetsu Taniguchi,
Takao Onoye
Abstract:
How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are prevalently used in multiple tasks, and especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with recent language models' progress, descriptive text may be an alternative to this problem. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and reducing the computational cost. We also experiment with data augmentation techniques to increase the diversity in the training set and avoid learning statistical bias. Extensive evaluations have shown that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
Submitted 25 June, 2021;
originally announced June 2021.
-
GCNBoost: Artwork Classification by Label Propagation through a Knowledge Graph
Authors:
Cheikh Brahim El Vaigh,
Noa Garcia,
Benjamin Renoust,
Chenhui Chu,
Yuta Nakashima,
Hajime Nagahara
Abstract:
The rise of digitization of cultural documents offers large-scale content, opening the road for development of AI systems in order to preserve, search, and deliver cultural heritage. To organize such cultural content also means to classify it, a task that is very familiar to modern computer science. Contextual information is often the key to structuring such real-world data, and we propose to use it in the form of a knowledge graph. Such a knowledge graph, combined with content analysis, enhances the notion of proximity between artworks and so improves performance in classification tasks. In this paper, we propose a novel use of a knowledge graph that is constructed on annotated data and pseudo-labeled data. With label propagation, we boost artwork classification by training a model using a graph convolutional network, relying on the relationships between entities of the knowledge graph. Following a transductive learning framework, our experiments show that relying on a knowledge graph modeling the relations between labeled data and unlabeled data allows us to achieve state-of-the-art results on multiple classification tasks on a dataset of paintings, and on a dataset of Buddha statues. Additionally, we show state-of-the-art results for the difficult case of dealing with unbalanced data, with the limitation of disregarding classes with extremely low degrees in the knowledge graph.
Submitted 25 May, 2021;
originally announced May 2021.
-
Subjective Assessment Experiments That Recruit Few Observers With Repetitions (FOWR)
Authors:
Pablo Perez,
Lucjan Janowski,
Narciso Garcia,
Margaret Pinson
Abstract:
Recent studies have shown that it is possible to characterize subject bias and variance in subjective assessment tests. Apparent differences among subjects can, for the most part, be explained by random factors. Building on that theory, we propose a subjective test design where three to four team members each rate the stimuli multiple times. The results are comparable to a high performing objective metric. This provides a quick and simple way to analyze new technologies and perform pre-tests for subjective assessment.
Submitted 20 July, 2022; v1 submitted 6 April, 2021;
originally announced April 2021.
-
The Internet Protocol -- Past, some current limitations and a glimpse of a possible future
Authors:
Nuno M. Garcia
Abstract:
The network layer is central to the networking scientific area. It is around the network layer that all the data communications develop, and one of its main tasks is to allow the identification of each single interface/machine among the potentially many interfaces in a network. This seminar addresses some of the issues that are usually presented to young Computer Science Engineering students in the course of several classes, but also presents some topics that are not addressed in networking courses. It is mostly focused on using Internet Protocol addresses in Local Area Networks, also considering issues that belong to the Wide Area Networks, such as data aggregation. This document summarizes the content of a seminar; it therefore comprises both teaching and research subjects. The seminar starts with a history of the evolution of the communication protocols from the early days of networks up until IPv6. It describes a new approach to define the addresses of network interfaces using Variable Length Subnet Masks, as this is usually not an easy task for Computer Science Engineering undergraduate students. This summary also describes some of the limitations of data communication in today's networks, proposing some solutions, where possible, including a novel means of connectionless data transmission by using IPv6 addresses, by extension of previously published research. The way the seminar is organized provides a history of the Internet Protocol, a view of some of its well-known current limitations, and a glimpse into a possible future regarding an improved connectionless layer 3 data transfer protocol.
Submitted 3 April, 2021;
originally announced April 2021.
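For readers unfamiliar with Variable Length Subnet Masks, mentioned in the abstract above, the following sketch shows the conventional textbook allocation: the largest host requirement gets the first and largest block, and each subnet uses the smallest prefix that still fits. This is not the new approach described in the seminar, whose details the abstract does not give. The function name `allocate_vlsm` and the example address block are assumptions for illustration; only Python's standard `ipaddress` module is used.

```python
import ipaddress


def allocate_vlsm(base, host_counts):
    """Carve IPv4 subnets out of `base`, one per requested host count.

    Requests are served largest first, so each power-of-two block stays
    aligned on its own size without extra padding.
    """
    network = ipaddress.ip_network(base)
    next_addr = int(network.network_address)
    end_addr = int(network.broadcast_address) + 1
    subnets = []
    for hosts in sorted(host_counts, reverse=True):
        # +2 reserves the network and broadcast addresses of the subnet.
        host_bits = (hosts + 2 - 1).bit_length()
        size = 2 ** host_bits
        if next_addr + size > end_addr:
            raise ValueError(f"{base} is too small for {hosts} more hosts")
        prefix = 32 - host_bits
        subnets.append(
            ipaddress.ip_network(f"{ipaddress.ip_address(next_addr)}/{prefix}")
        )
        next_addr += size
    return subnets


# Hypothetical requirement: four LAN segments inside one /24.
plan = allocate_vlsm("192.168.1.0/24", [10, 100, 2, 50])
for subnet in plan:
    print(subnet, "usable hosts:", subnet.num_addresses - 2)
```

With these inputs the plan uses a /25, a /26, a /28 and a /30, leaving the rest of the /24 free for later growth. A fixed-length mask would instead need four /25 blocks to cover the largest segment, which does not fit in a /24.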
-
Methodology to Assess Quality, Presence, Empathy, Attitude, and Attention in 360-degree Videos for Immersive Communications
Authors:
Marta Orduna,
Pablo Pérez,
Jesús Gutiérrez,
Narciso García
Abstract:
This paper analyzes the joint assessment of quality, spatial and social presence, empathy, attitude, and attention in three conditions: (A) visualizing and rating the quality of contents in a Head-Mounted Display (HMD), (B) visualizing the contents in an HMD, and (C) visualizing the contents in an HMD where participants can see their hands and take notes. The experiment simulates an immersive communication where participants attend conversations of different genres and from different acquisition perspectives in the context of international experiences. Video quality is evaluated with the Single-Stimulus Discrete Quality Evaluation (SSDQE) methodology. Spatial and social presence are evaluated with questionnaires adapted from the literature. Initial empathy is assessed with the Interpersonal Reactivity Index (IRI) and a questionnaire is designed to evaluate attitude. Attention is evaluated with 3 questions that had pass/fail answers. 54 participants were evenly distributed among the A, B, and C conditions taking into account their international experience backgrounds, obtaining a diverse sample of participants. The results from the subjective test validate the proposed methodology in VR communications, showing that video quality experiments can be adapted to conditions imposed by experiments focused on the evaluation of socioemotional features in terms of long-duration contents, actor and observer acquisition perspectives, and genre. In addition, the positive results related to the sense of presence imply that technology can be relevant in the analyzed use case. The acquisition perspective greatly influences social presence, and all the contents have a positive impact on participants' attitude towards international experiences. The annotated dataset obtained from the experiment, the Student Experiences Around the World dataset (SEAW-dataset), is made publicly available.
Submitted 9 February, 2022; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Understanding the Role of Scene Graphs in Visual Question Answering
Authors:
Vinay Damodaran,
Sharanya Chakravarthy,
Akshay Kumar,
Anjana Umapathy,
Teruko Mitamura,
Yuta Nakashima,
Noa Garcia,
Chenhui Chu
Abstract:
Visual Question Answering (VQA) is of tremendous interest to the research community with important applications such as aiding visually impaired users and image-based search. In this work, we explore the use of scene graphs for solving the VQA task. We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability, and provides scene graphs for a large number of images. We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, propose a training curriculum to leverage human-annotated and auto-generated scene graphs, and build late fusion architectures to learn from multiple image representations. We present a multi-faceted study into the use of scene graphs for VQA, making this work the first of its kind.
Submitted 16 January, 2021; v1 submitted 14 January, 2021;
originally announced January 2021.
-
MROS: Runtime Adaptation For Robot Control Architectures
Authors:
Darko Bozhinoski,
Carlos Hernandez Corbato,
Mario Garzon Oviedo,
Gijs van der Hoorn,
Nadia Hammoudeh Garcia,
Harshavardhan Deshpande,
Jon Tjerngren,
Andrzej Wasowski
Abstract:
Known attempts to build autonomous robots rely on complex control architectures, often implemented with the Robot Operating System platform (ROS). Runtime adaptation is needed in these systems to cope with component failures and with contingencies arising from dynamic environments; otherwise, these affect the reliability and quality of the mission execution. Existing proposals on how to build self-adaptive systems in robotics usually require a major re-design of the control architecture and rely on complex tools unfamiliar to the robotics community. Moreover, they are hard to reuse across applications.
This paper presents MROS: a model-based framework for run-time adaptation of robot control architectures based on ROS. MROS uses a combination of domain-specific languages to model architectural variants and capture mission quality concerns, and an ontology-based implementation of the MAPE-K and meta-control visions for run-time adaptation. The experimental results obtained by applying MROS in two realistic ROS-based robotic demonstrators show the benefits of our approach in terms of the quality of the mission execution, and MROS' extensibility and re-usability across robotic applications.
Submitted 23 November, 2021; v1 submitted 18 October, 2020;
originally announced October 2020.
-
Demographic Influences on Contemporary Art with Unsupervised Style Embeddings
Authors:
Nikolai Huckle,
Noa Garcia,
Yuta Nakashima
Abstract:
Computational art analysis has, through its reliance on classification tasks, prioritised historical datasets in which the artworks are already well sorted with the necessary annotations. Art produced today, on the other hand, is numerous and easily accessible, through the internet and social networks that are used by professional and amateur artists alike to display their work. Although this art, yet unsorted in terms of style and genre, is less suited for supervised analysis, the data sources come with novel information that may help frame the visual content in equally novel ways. As a first step in this direction, we present contempArt, a multi-modal dataset of exclusively contemporary artworks. contempArt is a collection of paintings and drawings, a detailed graph network based on social connections on Instagram and additional socio-demographic information; all attached to 442 artists at the beginning of their career. We evaluate three methods suited for generating unsupervised style embeddings of images and correlate them with the remaining data. We find no connections between visual style on the one hand and social proximity, gender, and nationality on the other.
Submitted 1 December, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.