-
Bisecle: Binding and Separation in Continual Learning for Video Language Understanding
Authors:
Yue Tan,
Xiaoqian Hu,
Hao Xue,
Celso De Melo,
Flora D. Salim
Abstract:
Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models…
▽ More
Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation
Authors:
Prasanna Reddy Pulakurthi,
Majid Rabbani,
Jamison Heard,
Sohail Dianat,
Celso M. de Melo,
Raghuveer Rao
Abstract:
This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable p…
▽ More
This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise. These techniques are particularly effective on smaller datasets like PACS, where overfitting and pseudo-label noise pose greater risks. State-of-the-art results are achieved on three major benchmarks: PACS, VisDA-C, and DomainNet-126. Notably, on PACS, improvements of 7.3% (79.4% to 86.7%) and 7.2% are observed in single-target and multi-target settings, respectively, while gains of 2.8% and 0.7% are attained on DomainNet-126 and VisDA-C. This combination of advanced augmentation and robust pseudo-label reweighting establishes a new benchmark for SFDA. The code is available at: https://github.com/PrasannaPulakurthi/SPM
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios
Authors:
Simon Malzard,
Nitish Mital,
Richard Walters,
Victoria Nockles,
Raghuveer Rao,
Celso M. De Melo
Abstract:
Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in min…
▽ More
Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in mind. This is exemplified by Skeletal Human Action Recognition (SHAR), which is critical in many CV pipelines operating in real-time and at the edge, but robustness to degraded data has previously only been shallowly and inconsistently assessed. Here we address this issue for SHAR by providing an important first data degradation benchmark on the most detailed and largest 3D open dataset, NTU-RGB+D-120, and assess the robustness of five leading SHAR models to three forms of degradation that represent real-world issues. We demonstrate the need for this benchmark by showing that the form of degradation, which has not previously been considered, has a large impact on model accuracy; at the same effective frame rate, model accuracy can vary by >40% depending on degradation type. We also identify that temporal regularity of frames in degraded SHAR data is likely a major driver of differences in model performance, and harness this to improve performance of existing models by up to >40%, through employing a simple mitigation approach based on interpolation. Finally, we highlight how our benchmark has helped identify an important degradation-resistant SHAR model based in Rough Path Theory; the LogSigRNN SHAR model outperforms the SoTA DeGCN model in five out of six cases at low frame rates by an average accuracy of 6%, despite trailing the SoTA model by 11-12% on un-degraded data at high frame rates (30 FPS).
△ Less
Submitted 27 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Authors:
Wufei Ma,
Luoxin Ye,
Celso M de Melo,
Jieneng Chen,
Alan Yuille
Abstract:
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack of this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study th…
▽ More
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack of this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on object's 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction. Our project page is available at: https://3d-spatial-reasoning.github.io/spatial-llm/
△ Less
Submitted 10 June, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Authors:
Wufei Ma,
Yu-Cheng Chou,
Qihao Liu,
Xingrui Wang,
Celso de Melo,
Jianwen Xie,
Alan Yuille
Abstract:
Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner an…
▽ More
Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages--3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.
△ Less
Submitted 10 June, 2025; v1 submitted 28 April, 2025;
originally announced April 2025.
-
Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data
Authors:
Prasanna Reddy Pulakurthi,
Majid Rabbani,
Celso M. de Melo,
Sohail A. Dianat,
Raghuveer M. Rao
Abstract:
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground o…
▽ More
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques. The code is available at https://github.com/PrasannaPulakurthi/Foreground-Background-Augmentation
△ Less
Submitted 3 June, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
F-ViTA: Foundation Model Guided Visible to Thermal Translation
Authors:
Jay N. Paranjape,
Celso de Melo,
Vishal M. Patel
Abstract:
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks…
▽ More
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Authors:
Arun Reddy,
Alexander Martin,
Eugene Yang,
Andrew Yates,
Kate Sanders,
Kenton Murray,
Reno Kriz,
Celso M. de Melo,
Benjamin Van Durme,
Rama Chellappa
Abstract:
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temp…
▽ More
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
PromptGAR: Flexible Promptive Group Activity Recognition
Authors:
Zhangyu Jin,
Andrew Feng,
Ankur Chemburkar,
Celso M. De Melo
Abstract:
We present PromptGAR, a novel framework that addresses the limitations of current Group Activity Recognition (GAR) approaches by leveraging multi-modal prompts to achieve both input flexibility and high recognition accuracy. The existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, the lack of long-term actor consistency, and under-explo…
▽ More
We present PromptGAR, a novel framework that addresses the limitations of current Group Activity Recognition (GAR) approaches by leveraging multi-modal prompts to achieve both input flexibility and high recognition accuracy. The existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, the lack of long-term actor consistency, and under-exploration of multi-group scenarios. To bridge the gap, we proposed PromptGAR, which is the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. Specifically, we unify bounding boxes, skeletal keypoints, and areas as point prompts and employ a recognition decoder for cross-updating class and prompt tokens. To ensure long-term consistency for extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance IDs. Finally, PromptGAR explores the use of area prompts to enable the selective recognition of the particular group activity within videos that contain multiple concurrent groups. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performances both on full prompts and diverse prompt inputs, establishing its effectiveness on input flexibility and generalization ability for real-world applications.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Availability Modeling for Blockchain Provisioning in Private Clouds
Authors:
J Dantas,
P Silva,
L Fiondella,
C Melo,
P Maciel
Abstract:
Blockchain technology has emerged, and many previous studies have assessed its performance issues. However, less attention has been paid to the dependability attributes, which have been a critical topic in service provisioning, considering public or private infrastructures. This paper introduces analytical models to assess the availability of private blockchain infrastructure for Hyperledger Fabri…
▽ More
Blockchain technology has emerged, and many previous studies have assessed its performance issues. However, less attention has been paid to the dependability attributes, which have been a critical topic in service provisioning, considering public or private infrastructures. This paper introduces analytical models to assess the availability of private blockchain infrastructure for Hyperledger Fabric-based applications. Furthermore, a case study will be presented to demonstrate the feasibility of the proposed model, which may assist stakeholders in deciding whether to migrate from old to new technology. Some of the obtained results indicate that, unlike most conventional systems, general availability may decrease as new nodes are added to the environment. This phenomenon occurs due to the adopted endorsement policy, which determines the proportion of required nodes to sign the authenticity of a transaction.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context
Authors:
Bryan L. M. de Oliveira,
Luana G. B. Martins,
Bruno Brandão,
Luckeciano C. Melo
Abstract:
Large language models excel at following explicit instructions, but they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses instead of seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. This benchmark presents intentionally ambiguous s…
▽ More
Large language models excel at following explicit instructions, but they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses instead of seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. This benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue by asking clarifying questions before providing appropriate responses. Our evaluation of both open and closed models reveals that, while proprietary models generally perform better, all current assistants struggle to gather critical information effectively. They often require multiple turns to infer user intent and frequently default to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models' information-seeking capabilities, which can be leveraged to automatically generate data for self-improvement. We also offer insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.
△ Less
Submitted 25 April, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Uncertainty-Aware Step-wise Verification with Generative Reward Models
Authors:
Zihuiwen Ye,
Luckeciano Carvalho Melo,
Younesse Kaddar,
Phil Blunsom,
Sam Staton,
Yarin Gal
Abstract:
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susce…
▽ More
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
△ Less
Submitted 16 February, 2025;
originally announced February 2025.
-
Distributed Application Provisioning over Ethereum based private and permissioned Blockchain: Availability modeling, capacity, and costs planning
Authors:
Carlos Melo,
Jamilson Dantas,
Paulo Pereira,
Paulo Maciel
Abstract:
Blockchain and Cloud Computing are two of the main topics related to the distributed computing paradigm, and in the last decade, they have seen exponential growth in their adoption. Cloud computing has long been established as the main mechanism to test, develop, and deliver new applications and services in a distributed manner across the World Wide Web. Large data centers host many services and s…
▽ More
Blockchain and Cloud Computing are two of the main topics related to the distributed computing paradigm, and in the last decade, they have seen exponential growth in their adoption. Cloud computing has long been established as the main mechanism to test, develop, and deliver new applications and services in a distributed manner across the World Wide Web. Large data centers host many services and store petabytes of user data. Infrastructure and services owners rule the access to data and may even be able to change contents and attest to its veracity. Blockchain is a step towards a future where the user's data are considered safer, besides being public. Advances in blockchain-based technologies, now, support service provisioning over permissioned and private infrastructures. Therefore, organizations or groups of individuals may share information, service even if they do not trust each other, besides supporting infrastructure management tasks. This paper presents and evaluates models for assessing the availability and capacity-oriented availability of cloud computing infrastructures. It aims at running Blockchain's distributed applications based on the Ethereum blockchain platform and the required expenses to perform service delivery in public and private infrastructures. Most of the obtained results also apply to other blockchains based platforms.
△ Less
Submitted 14 February, 2025;
originally announced February 2025.
-
A Comprehensive Hyperledger Fabric Performance Evaluation based on Resources Capacity Planning
Authors:
Carlos Melo,
Glauber Gonçalves,
Francisco A. Silva,
André Soares
Abstract:
Hyperledger Fabric is a platform for permissioned blockchain networks that enables secure and auditable distributed data storage for enterprise applications. There is a growing interest in applications based on this platform, but its use requires the configuration of different blockchain parameters. Various configurations impact the system's non-functional qualities, especially performance and cos…
▽ More
Hyperledger Fabric is a platform for permissioned blockchain networks that enables secure and auditable distributed data storage for enterprise applications. There is a growing interest in applications based on this platform, but its use requires the configuration of different blockchain parameters. Various configurations impact the system's non-functional qualities, especially performance and cost. In this article, we propose a Stochastic Petri Net to model the performance of the Hyperledger Fabric platform with different blockchain parameters, computer capacity, and transaction rates. We also present a set of case studies to demonstrate the feasibility of the proposed model. This model serves as a practical guide to help administrators of permissioned blockchain networks find the best performance for their applications. The proposed model allowed us to identify the block size that leads to a high mean response time (ranging from 1 to 25 seconds) caused by a change in the arrival rate.
△ Less
Submitted 14 February, 2025;
originally announced February 2025.
-
Transactional Dynamics in Hyperledger Fabric: A Stochastic Modeling and Performance Evaluation of Permissioned Blockchains
Authors:
Carlos Melo,
Glauber Gonçalves,
Francisco Airton Silva,
Iure Fé,
Ericksulino Moura,
André Soares,
Eunmi Choi,
Dugki Min,
Jae-Woo Lee,
Tuan Anh Nguyen
Abstract:
Blockchain, often integrated with distributed systems and security enhancements, has significant potential in various industries. However, environmental concerns and the efficiency of consortia-controlled permissioned networks remain critical issues. We use a Stochastic Petri Net model to analyze transaction flows in Hyperledger Fabric networks, achieving a 95% confidence interval for response tim…
▽ More
Blockchain, often integrated with distributed systems and security enhancements, has significant potential in various industries. However, environmental concerns and the efficiency of consortia-controlled permissioned networks remain critical issues. We use a Stochastic Petri Net model to analyze transaction flows in Hyperledger Fabric networks, achieving a 95% confidence interval for response times. This model enables administrators to assess the impact of system changes on resource utilization. Sensitivity analysis reveals major factors influencing response times and throughput. Our case studies demonstrate that block size can alter throughput and response times by up to 200%, underscoring the need for performance optimization with resource efficiency.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Optimal Resource Utilization in Hyperledger Fabric: A Comprehensive SPN-Based Performance Evaluation Paradigm
Authors:
Carlos Melo,
Glauber Gonçalves,
Francisco A. Silva,
Leonel Feitosa,
Iure Fé,
André Soares,
Eunmi Choi,
Tuan Anh Nguyen,
Dugki Min
Abstract:
Hyperledger Fabric stands as a leading framework for permissioned blockchain systems, ensuring data security and auditability for enterprise applications. As applications on this platform grow, understanding its complex configuration concerning various blockchain parameters becomes vital. These configurations significantly affect the system's performance and cost. In this research, we introduce a…
▽ More
Hyperledger Fabric stands as a leading framework for permissioned blockchain systems, ensuring data security and auditability for enterprise applications. As applications on this platform grow, understanding its complex configuration concerning various blockchain parameters becomes vital. These configurations significantly affect the system's performance and cost. In this research, we introduce a Stochastic Petri Net (SPN) model to analyze Hyperledger Fabric's performance, considering variations in blockchain parameters, computational resources, and transaction rates. We provide case studies to validate the utility of our model, aiding blockchain administrators in determining optimal configurations for their applications. A key observation from our model highlights the block size's role in system response time. We noted an increased mean response time, between 1 to 25 seconds, due to variations in transaction arrival rates.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Performance Modeling and Evaluation of Hyperledger Fabric: An Analysis Based on Transaction Flow and Endorsement Policies
Authors:
Carlos Melo,
Glauber Gonçalves,
Francisco A. Silva,
André Soares
Abstract:
Blockchain is a paradigm derived from distributed systems, protocols, and security concepts. However, can blockchain applications provide services in industrial environments, especially concerning performance issues? In blockchains, long response times can impair both user and service experience, and intensive resource use may increase the costs of service provision. The proposed paper tries to an…
▽ More
Blockchain is a paradigm derived from distributed systems, protocols, and security concepts. However, can blockchain applications provide services in industrial environments, especially concerning performance issues? In blockchains, long response times can impair both user and service experience, and intensive resource use may increase the costs of service provision. The proposed paper tries to answer this question by evaluating the performance of one of the most popular permissioned blockchain platforms, the Hyperledger Fabric (HLF). We provide a framework for performance evaluation based on modeling and experimentation. The results indicate that block size and arrival rate can compromise throughput (by -70%), latency (by +1,500%), and environment utilization (by +28%) and that multiple gateways can reduce latency (by -75%), and throughput (by -60%)
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Authors:
Xingrui Wang,
Wufei Ma,
Tiezheng Zhang,
Celso M de Melo,
Jieneng Chen,
Alan Yuille
Abstract:
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this…
▽ More
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capability for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released in https://github.com/XingruiWang/Spatial457.
△ Less
Submitted 8 June, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
Authors:
Yasiru Ranasinghe,
Vibashan VS,
James Uplinger,
Celso De Melo,
Vishal M. Patel
Abstract:
Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the abili…
▽ More
Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel objects or operate in unknown environments, as they have not been exposed to these new conditions. However, Large Vision-Language Models (LVLMs) exhibit emergent properties that enable them to recognize objects in varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize objects effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often underrepresented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on the recognition performance, providing insights into the development of more reliable ATR systems for novel conditions and classes.
△ Less
Submitted 13 January, 2025;
originally announced January 2025.
-
Texture- and Shape-based Adversarial Attacks for Vehicle Detection in Synthetic Overhead Imagery
Authors:
Mikael Yeghiazaryan,
Sai Abhishek Siddhartha Namburu,
Emily Kim,
Stanislav Panev,
Celso de Melo,
Brent Lance,
Fernando De la Torre,
Jessica K. Hodgins
Abstract:
Detecting vehicles in aerial images can be very challenging due to complex backgrounds, small resolution, shadows, and occlusions. Despite the effectiveness of SOTA detectors such as YOLO, they remain vulnerable to adversarial attacks (AAs), compromising their reliability. Traditional AA strategies often overlook the practical constraints of physical implementation, focusing solely on attack perfo…
▽ More
Detecting vehicles in aerial images can be very challenging due to complex backgrounds, small resolution, shadows, and occlusions. Despite the effectiveness of SOTA detectors such as YOLO, they remain vulnerable to adversarial attacks (AAs), compromising their reliability. Traditional AA strategies often overlook the practical constraints of physical implementation, focusing solely on attack performance. Our work addresses this issue by proposing practical implementation constraints for AA in texture and/or shape. These constraints include pixelation, masking, limiting the color palette of the textures, and constraining the shape modifications. We evaluated the proposed constraints through extensive experiments using three widely used object detector architectures, and compared them to previous works. The results demonstrate the effectiveness of our solutions and reveal a trade-off between practicality and performance. Additionally, we introduce a labeled dataset of overhead images featuring vehicles of various categories. We will make the code/dataset public upon paper acceptance.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Authors:
Wufei Ma,
Haoyu Chen,
Guofeng Zhang,
Yu-Cheng Chou,
Celso M de Melo,
Alan Yuille
Abstract:
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable pr…
▽ More
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provide valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset is available https://3dsrbench.github.io.
△ Less
Submitted 8 May, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
-
Improving Object Detection by Modifying Synthetic Data with Explainable AI
Authors:
Nitish Mital,
Simon Malzard,
Richard Walters,
Celso M. De Melo,
Raghuveer Rao,
Victoria Nockles
Abstract:
Limited real-world data severely impacts model performance in many computer vision domains, particularly for samples that are underrepresented in training. Synthetically generated images are a promising solution, but 1) it remains unclear how to design synthetic training data to optimally improve model performance (e.g, whether and where to introduce more realism or more abstraction) and 2) the do…
▽ More
Limited real-world data severely impacts model performance in many computer vision domains, particularly for samples that are underrepresented in training. Synthetically generated images are a promising solution, but 1) it remains unclear how to design synthetic training data to optimally improve model performance (e.g, whether and where to introduce more realism or more abstraction) and 2) the domain expertise, time and effort required from human operators for this design and optimisation process represents a major practical challenge. Here we propose a novel conceptual approach to improve the efficiency of designing synthetic images, by using robust Explainable AI (XAI) techniques to guide a human-in-the-loop process of modifying 3D mesh models used to generate these images. Importantly, this framework allows both modifications that increase and decrease realism in synthetic data, which can both improve model performance. We illustrate this concept using a real-world example where data are sparse; detection of vehicles in infrared imagery. We fine-tune an initial YOLOv8 model on the ATR DSIAC infrared dataset and synthetic images generated from 3D mesh models in the Unity gaming engine, and then use XAI saliency maps to guide modification of our Unity models. We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6% (to mAP50 = 94.6%). We further improve performance by an additional 1.5% (to 96.1%) through our new XAI-guided approach, which reduces misclassifications through both increasing and decreasing the realism of different parts of the synthetic data. Our proof-of-concept results pave the way for fine, XAI-controlled curation of synthetic datasets tailored to improve object detection performance, whilst simultaneously reducing the burden on human operators in designing and optimising these datasets.
△ Less
Submitted 3 April, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning
Authors:
Bryan L. M. de Oliveira,
Luana G. B. Martins,
Bruno Brandão,
Murilo L. da Luz,
Telma W. de L. Soares,
Luckeciano C. Melo
Abstract:
Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding…
▽ More
Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual RL task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools, while maintaining fixed environment dynamics, observation, and action spaces. This design enables researchers to isolate and scale the visual representation challenge independently of other learning components. Through extensive experiments with model-free and model-based RL algorithms, we uncover fundamental limitations in current methods' ability to handle visual diversity. As we increase the pool of possible images, all algorithms exhibit in- and out-of-distribution performance degradation, with sophisticated representation learning techniques often underperforming simpler approaches like data augmentation. These findings highlight critical gaps in visual representation learning for RL and establish SPGym as a valuable tool for driving progress in robust, generalizable decision-making systems.
△ Less
Submitted 1 July, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Temporal-Difference Variational Continual Learning
Authors:
Luckeciano C. Melo,
Alessandro Abate,
Yarin Gal
Abstract:
Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines t…
▽ More
Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
△ Less
Submitted 14 May, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
-
ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution
Authors:
Corban Rivera,
Grayson Byrd,
William Paul,
Tyler Feldman,
Meghan Booker,
Emma Holmes,
David Handelman,
Bethany Kemp,
Andrew Badger,
Aurora Schmidt,
Krishna Murthy Jatavallabhula,
Celso M de Melo,
Lalithkumar Seenivasan,
Mathias Unberath,
Rama Chellappa
Abstract:
Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching t…
▽ More
Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high- or low-levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self reflection. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy level embodied tasks outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
An Evaluation of Large Pre-Trained Models for Gesture Recognition using Synthetic Videos
Authors:
Arun Reddy,
Ketul Shah,
Corban Rivera,
William Paul,
Celso M. De Melo,
Rama Chellappa
Abstract:
In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors c…
▽ More
In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach -- zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos. We also observe that video backbones that were fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest neighbors performance. Lastly, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described through natural language.
△ Less
Submitted 2 October, 2024;
originally announced October 2024.
-
A Mamba-based Siamese Network for Remote Sensing Change Detection
Authors:
Jay N. Paranjape,
Celso de Melo,
Vishal M. Patel
Abstract:
Change detection in remote sensing images is an essential tool for analyzing a region at different times. It finds varied applications in monitoring environmental changes, man-made changes as well as corresponding decision-making and prediction of future trends. Deep learning methods like Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable success in detecting significan…
▽ More
Change detection in remote sensing images is an essential tool for analyzing a region at different times. It finds varied applications in monitoring environmental changes, man-made changes as well as corresponding decision-making and prediction of future trends. Deep learning methods like Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable success in detecting significant changes, given two images at different times. In this paper, we propose a Mamba-based Change Detector (M-CD) that segments out the regions of interest even better. Mamba-based architectures demonstrate linear-time training capabilities and an improved receptive field over transformers. Our experiments on four widely used change detection datasets demonstrate significant improvements over existing state-of-the-art (SOTA) methods. Our code and pre-trained models are available at https://github.com/JayParanjape/M-CD
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
ViLCo-Bench: VIdeo Language COntinual learning Benchmark
Authors:
Tianqi Tang,
Shohreh Deldari,
Hao Xue,
Celso De Melo,
Flora D. Salim
Abstract:
Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark…
▽ More
Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at https://github.com/cruiseresearchgroup/ViLCo.
△ Less
Submitted 15 December, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Authors:
Luckeciano C. Melo,
Panagiotis Tigas,
Alessandro Abate,
Yarin Gal
Abstract:
Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the furth…
▽ More
Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.
△ Less
Submitted 28 October, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
WeatherProof: Leveraging Language Guidance for Semantic Segmentation in Adverse Weather
Authors:
Blake Gella,
Howard Zhang,
Rishi Upadhyay,
Tiffany Chang,
Nathan Wei,
Matthew Waliman,
Yunhao Ba,
Celso de Melo,
Alex Wong,
Achuta Kadambi
Abstract:
We propose a method to infer semantic segmentation maps from images captured under adverse weather conditions. We begin by examining existing models on images degraded by weather conditions such as rain, fog, or snow, and found that they exhibit a large performance drop as compared to those captured under clear weather. To control for changes in scene structures, we propose WeatherProof, the first…
▽ More
We propose a method to infer semantic segmentation maps from images captured under adverse weather conditions. We begin by examining existing models on images degraded by weather conditions such as rain, fog, or snow, and found that they exhibit a large performance drop as compared to those captured under clear weather. To control for changes in scene structures, we propose WeatherProof, the first semantic segmentation dataset with accurate clear and adverse weather image pairs that share an underlying scene. Through this dataset, we analyze the error modes in existing models and found that they were sensitive to the highly complex combination of different weather effects induced on the image during capture. To improve robustness, we propose a way to use language as guidance by identifying contributions of adverse weather conditions and injecting that as "side information". Models trained using our language guidance exhibit performance gains by up to 10.2% in mIoU on WeatherProof, up to 8.44% in mIoU on the widely used ACDC dataset compared to standard training techniques, and up to 6.21% in mIoU on the ACDC dataset as compared to previous SOTA methods.
△ Less
Submitted 7 May, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Entropic Open-set Active Learning
Authors:
Bardia Safaei,
Vibashan VS,
Celso M. de Melo,
Vishal M. Patel
Abstract:
Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting.…
▽ More
Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at \url{https://github.com/bardisafa/EOAL}.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Authors:
Arun Reddy,
William Paul,
Corban Rivera,
Ketul Shah,
Celso M. de Melo,
Rama Chellappa
Abstract:
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then…
▽ More
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
△ Less
Submitted 4 March, 2025; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Guarding Barlow Twins Against Overfitting with Mixed Samples
Authors:
Wele Gedara Chaminda Bandara,
Celso M. De Melo,
Vishal M. Patel
Abstract:
Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing fo…
▽ More
Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Interpreting Indirect Answers to Yes-No Questions in Multiple Languages
Authors:
Zijie Wang,
Md Mosharaf Hossain,
Shivam Mathur,
Terry Cruz Melo,
Kadir Bulut Ozler,
Keun Hee Park,
Jacob Quintero,
MohammadHossein Rezaei,
Shreya Nupur Shakya,
Md Nayem Uddin,
Eduardo Blanco
Abstract:
Yes-no questions expect a yes or no for an answer, but people often skip polar keywords. Instead, they answer with long explanations that must be interpreted. In this paper, we focus on this challenging problem and release new benchmarks in eight languages. We present a distant supervision approach to collect training data. We also demonstrate that direct answers (i.e., with polar keywords) are us…
▽ More
Yes-no questions expect a yes or no for an answer, but people often skip polar keywords. Instead, they answer with long explanations that must be interpreted. In this paper, we focus on this challenging problem and release new benchmarks in eight languages. We present a distant supervision approach to collect training data. We also demonstrate that direct answers (i.e., with polar keywords) are useful to train models to interpret indirect answers (i.e., without polar keywords). Experimental results demonstrate that monolingual fine-tuning is beneficial if training data can be obtained via distant supervision for the language of interest (5 languages). Additionally, we show that cross-lingual fine-tuning is always beneficial (8 languages).
△ Less
Submitted 20 October, 2023;
originally announced October 2023.
-
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
Authors:
Qiao Gu,
Alihusein Kuwajerwala,
Sacha Morin,
Krishna Murthy Jatavallabhula,
Bipasha Sen,
Aditya Agarwal,
Corban Rivera,
William Paul,
Kirsty Ellis,
Rama Chellappa,
Chuang Gan,
Celso Miguel de Melo,
Joshua B. Tenenbaum,
Antonio Torralba,
Florian Shkurti,
Liam Paull
Abstract:
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, whi…
▽ More
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023
Authors:
Aline Lima de Oliveira,
Cauê Addae da Silva Gomes,
Cecília Virginia Santos da Silva,
Charles Matheus de Sousa Alves,
Danilo Andrade Martins de Souza,
Driele Pires Ferreira Araújo Xavier,
Edgleyson Pereira da Silva,
Felipe Bezerra Martins,
Lucas Henrique Cavalcanti Santos,
Lucas Dias Maciel,
Matheus Paixão Gumercindo dos Santos,
Matheus Lafayette Vasconcelos,
Matheus Vinícius Teotonio do Nascimento Andrade,
João Guilherme Oliveira Carvalho de Melo,
João Pedro Souza Pereira de Moura,
José Ronald da Silva,
José Victor Silva Cruz,
Pedro Henrique Santana de Morais,
Pedro Paulo Salman de Oliveira,
Riei Joaquim Matos Rodrigues,
Roberto Costa Fernandes,
Ryan Vinicius Santos Morais,
Tamara Mayara Ramos Teobaldo,
Washington Igor dos Santos Silva,
Edna Natividade Silva Barros
Abstract:
RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-times Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Ou…
▽ More
RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-times Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Our team has successfully published 2 articles related to SSL at two high-impact conferences: the 25th RoboCup International Symposium and the 19th IEEE Latin American Robotics Symposium (LARS 2022). Over the last year, we have been continuously migrating from our past codebase to Unification. We will describe the new architecture implemented and some points of software and AI refactoring. In addition, we discuss the process of integrating machined components into the mechanical system, our development for participating in the vision blackout challenge last year and what we are preparing for this year.
△ Less
Submitted 19 July, 2023;
originally announced July 2023.
-
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Authors:
Xiaoyu Zhu,
Po-Yao Huang,
Junwei Liang,
Celso M. de Melo,
Alexander Hauptmann
Abstract:
We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-fra…
▽ More
We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.
△ Less
Submitted 26 July, 2024; v1 submitted 31 March, 2023;
originally announced March 2023.
-
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances
Authors:
Arun V. Reddy,
Ketul Shah,
William Paul,
Rohita Mocharla,
Judy Hoffman,
Kapil D. Katyal,
Dinesh Manocha,
Celso M. de Melo,
Rama Chellappa
Abstract:
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data h…
▽ More
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real-world. However, synthetic data may differ from real data in important ways. This phenomenon, known as \textit{domain shift}, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.
△ Less
Submitted 1 August, 2024; v1 submitted 17 March, 2023;
originally announced March 2023.
-
AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning
Authors:
Xijun Wang,
Ruiqi Xian,
Tianrui Guan,
Celso M. de Melo,
Stephen M. Nogar,
Aniket Bera,
Dinesh Manocha
Abstract:
We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also presen…
▽ More
We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.
△ Less
Submitted 2 March, 2023;
originally announced March 2023.
-
ConceptFusion: Open-set Multimodal 3D Mapping
Authors:
Krishna Murthy Jatavallabhula,
Alihusein Kuwajerwala,
Qiao Gu,
Mohd Omama,
Tao Chen,
Alaa Maalouf,
Shuang Li,
Ganesh Iyer,
Soroush Saryazdi,
Nikhil Keetha,
Ayush Tewari,
Joshua B. Tenenbaum,
Celso Miguel de Melo,
Madhava Krishna,
Liam Paull,
Florian Shkurti,
Antonio Torralba
Abstract:
Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent wor…
▽ More
Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts.
We address both these issues with ConceptFusion, a scene representation that is (1) fundamentally open-set, enabling reasoning beyond a closed set of concepts and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by more than 40% margin on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping.
For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs
△ Less
Submitted 23 October, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Open-Set Automatic Target Recognition
Authors:
Bardia Safaei,
Vibashan VS,
Celso M. de Melo,
Shuowen Hu,
Vishal M. Patel
Abstract:
Automatic Target Recognition (ATR) is a category of computer vision algorithms which attempts to recognize targets on data obtained from different sensors. ATR algorithms are extensively used in real-world scenarios such as military and surveillance applications. Existing ATR algorithms are developed for traditional closed-set methods where training and testing have the same class distribution. Th…
▽ More
Automatic Target Recognition (ATR) is a category of computer vision algorithms which attempts to recognize targets on data obtained from different sensors. ATR algorithms are extensively used in real-world scenarios such as military and surveillance applications. Existing ATR algorithms are developed for traditional closed-set methods where training and testing have the same class distribution. Thus, these algorithms have not been robust to unknown classes not seen during the training phase, limiting their utility in real-world applications. To this end, we propose an Open-set Automatic Target Recognition framework where we enable open-set recognition capability for ATR algorithms. In addition, we introduce a plugin Category-aware Binary Classifier (CBC) module to effectively tackle unknown classes seen during inference. The proposed CBC module can be easily integrated with any existing ATR algorithms and can be trained in an end-to-end manner. Experimental results show that the proposed approach outperforms many open-set methods on the DSIAC and CIFAR-10 datasets. To the best of our knowledge, this is the first work to address the open-set classification problem for ATR algorithms. Source code is available at: https://github.com/bardisafa/Open-set-ATR.
△ Less
Submitted 10 November, 2022;
originally announced November 2022.
-
The Impact of Partner Expressions on Felt Emotion in the Iterated Prisoner's Dilemma: An Event-level Analysis
Authors:
Maria Angelika-Nikita,
Celso M. de Melo,
Kazunori Terada,
Gale Lucas,
Jonathan Gratch
Abstract:
Social games like the prisoner's dilemma are often used to develop models of the role of emotion in social decision-making. Here we examine an understudied aspect of emotion in such games: how an individual's feelings are shaped by their partner's expressions. Prior research has tended to focus on other aspects of emotion. Research on felt-emotion has focused on how an individual's feelings shape…
▽ More
Social games like the prisoner's dilemma are often used to develop models of the role of emotion in social decision-making. Here we examine an understudied aspect of emotion in such games: how an individual's feelings are shaped by their partner's expressions. Prior research has tended to focus on other aspects of emotion. Research on felt-emotion has focused on how an individual's feelings shape how they treat their partner, or whether these feelings are authentically expressed. Research on expressed-emotion has focused on how an individual's decisions are shaped by their partner's expressions, without regard for whether these expressions actually evoke feelings. Here, we use computer-generated characters to examine how an individual's moment-to-moment feelings are shaped by (1) how they are treated by their partner and (2) what their partner expresses during this treatment. Surprisingly, we find that partner expressions are far more important than actions in determining self-reported feelings. In other words, our partner can behave in a selfish and exploitive way, but if they show a collaborative pattern of expressions, we will feel greater pleasure collaborating with them. These results also emphasize the importance of context in determining how someone will feel in response to an expression (i.e., knowing a partner is happy is insufficient; we must know what they are happy-at). We discuss the implications of this work for cognitive-system design, emotion theory, and methodological practice in affective computing.
△ Less
Submitted 2 July, 2022;
originally announced July 2022.
-
On Structuring Functional Programs with Monoidal Profunctors
Authors:
Alexandre Garcia de Oliveira,
Mauro Jaskelioff,
Ana Cristina Vieira de Melo
Abstract:
We study monoidal profunctors as a tool to reason and structure pure functional programs both from a categorical perspective and as a Haskell implementation. From the categorical point of view we approach them as monoids in a certain monoidal category of profunctors. We study properties of this monoidal category and construct and implement the free monoidal profunctor. We study the relationship of…
▽ More
We study monoidal profunctors as a tool to reason and structure pure functional programs both from a categorical perspective and as a Haskell implementation. From the categorical point of view we approach them as monoids in a certain monoidal category of profunctors. We study properties of this monoidal category and construct and implement the free monoidal profunctor. We study the relationship of the monoidal construction to optics, and introduce a promising generalization of the implementation which we illustrate by introducing effectful monoidal profunctors.
△ Less
Submitted 2 July, 2022;
originally announced July 2022.
-
Not Just Streaks: Towards Ground Truth for Single Image Deraining
Authors:
Yunhao Ba,
Howard Zhang,
Ethan Yang,
Akira Suzuki,
Arnold Pfahnl,
Chethan Chinder Chandrappa,
Celso de Melo,
Suya You,
Stefano Soatto,
Alex Wong,
Achuta Kadambi
Abstract:
We propose a large-scale dataset of real-world rainy and clean image pairs and a method to remove degradations, induced by rain streaks and rain accumulation, from the image. As there exists no real-world dataset for deraining, current state-of-the-art methods rely on synthetic data and thus are limited by the sim2real domain gap; moreover, rigorous evaluation remains a challenge due to the absenc…
▽ More
We propose a large-scale dataset of real-world rainy and clean image pairs and a method to remove degradations, induced by rain streaks and rain accumulation, from the image. As there exists no real-world dataset for deraining, current state-of-the-art methods rely on synthetic data and thus are limited by the sim2real domain gap; moreover, rigorous evaluation remains a challenge due to the absence of a real paired dataset. We fill this gap by collecting a real paired deraining dataset through meticulous control of non-rain variations. Our dataset enables paired training and quantitative evaluation for diverse real-world rain phenomena (e.g. rain streaks and rain accumulation). To learn a representation robust to rain phenomena, we propose a deep neural network that reconstructs the underlying scene by minimizing a rain-robust loss between rainy and clean images. Extensive experiments demonstrate that our model outperforms the state-of-the-art deraining methods on real rainy images under various conditions. Project website: https://visual.ee.ucla.edu/gt_rain.htm/.
△ Less
Submitted 29 July, 2024; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Transformers are Meta-Reinforcement Learners
Authors:
Luckeciano C. Melo
Abstract:
The transformer architecture and variants presented remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL a…
▽ More
The transformer architecture and variants presented remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task -- which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.
△ Less
Submitted 14 June, 2022;
originally announced June 2022.
-
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
Authors:
R. Gnana Praveen,
Wheidima Carneiro de Melo,
Nasib Ullah,
Haseeb Aslam,
Osama Zeeshan,
Théo Denorme,
Marco Pedersoli,
Alessandro Koerich,
Simon Bacon,
Patrick Cardinal,
Eric Granger
Abstract:
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively lever…
▽ More
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on correlation between the combined feature representation and individual modalities. By deploying the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
△ Less
Submitted 6 July, 2024; v1 submitted 28 March, 2022;
originally announced March 2022.
-
Facial Expression Analysis Using Decomposed Multiscale Spatiotemporal Networks
Authors:
Wheidima Carneiro de Melo,
Eric Granger,
Miguel Bordallo Lopez
Abstract:
Video-based analysis of facial expressions has been increasingly applied to infer health states of individuals, such as depression and pain. Among the existing approaches, deep learning models composed of structures for multiscale spatiotemporal processing have shown strong potential for encoding facial dynamics. However, such models have high computational complexity, making for a difficult deplo…
▽ More
Video-based analysis of facial expressions has been increasingly applied to infer health states of individuals, such as depression and pain. Among the existing approaches, deep learning models composed of structures for multiscale spatiotemporal processing have shown strong potential for encoding facial dynamics. However, such models have high computational complexity, making for a difficult deployment of these solutions. To address this issue, we introduce a new technique to decompose the extraction of multiscale spatiotemporal features. Particularly, a building block structure called Decomposed Multiscale Spatiotemporal Network (DMSN) is presented along with three variants: DMSN-A, DMSN-B, and DMSN-C blocks. The DMSN-A block generates multiscale representations by analyzing spatiotemporal features at multiple temporal ranges, while the DMSN-B block analyzes spatiotemporal features at multiple ranges, and the DMSN-C block analyzes spatiotemporal features at multiple spatial sizes. Using these variants, we design our DMSN architecture which has the ability to explore a variety of multiscale spatiotemporal features, favoring the adaptation to different facial behaviors. Our extensive experiments on challenging datasets show that the DMSN-C block is effective for depression detection, whereas the DMSN-A block is efficient for pain estimation. Results also indicate that our DMSN architecture provides a cost-effective solution for expressions that range from fewer facial variations over time, as in depression detection, to greater variations, as in pain estimation.
△ Less
Submitted 21 March, 2022;
originally announced March 2022.
-
Towards Multi-Criteria Prioritization of Best Practices in Research Artifact Sharing
Authors:
Carlos Diego Nascimento Damasceno,
Isotilia Costa Melo,
Daniel Struber
Abstract:
Research artifact sharing is known to strengthen the transparency of scientific studies. However, in the lack of common discipline-specific guidelines for artifacts evaluation, subjective and conflicting expectations may happen and threaten artifact quality. In this paper, we discuss our preliminary ideas for a framework based on quality management principles (5W2H) that can aid in the establishme…
▽ More
Research artifact sharing is known to strengthen the transparency of scientific studies. However, in the lack of common discipline-specific guidelines for artifacts evaluation, subjective and conflicting expectations may happen and threaten artifact quality. In this paper, we discuss our preliminary ideas for a framework based on quality management principles (5W2H) that can aid in the establishment of common guidelines for artifact evaluation and sharing. Also, using the Analytic Hierarchy Process, we discuss how research communities could join efforts to aid the guidelines' adequacy to research priorities. These combined methodologies constitute a novelty for software engineering research which can foster research software sustainability.
△ Less
Submitted 6 September, 2021;
originally announced September 2021.
-
Sustainability: Delivering Agility's Promise
Authors:
Jutta Eckstein,
Claudia de O. Melo
Abstract:
Sustainability is a promise by agile development, as it is part of both the Agile Alliance's and the Scrum Alliance's vision. Thus far, not much has been delivered on this promise. This paper explores the Agile Manifesto and points out how agility could contribute to sustainability in its three dimensions - social, economic, and environmental. Additionally, this paper provides some sample cases of…
▽ More
Sustainability is a promise by agile development, as it is part of both the Agile Alliance's and the Scrum Alliance's vision. Thus far, not much has been delivered on this promise. This paper explores the Agile Manifesto and points out how agility could contribute to sustainability in its three dimensions - social, economic, and environmental. Additionally, this paper provides some sample cases of companies focusing on both sustainability (partially or holistically) and agile development.
△ Less
Submitted 9 March, 2021;
originally announced March 2021.
-
MARS-Gym: A Gym framework to model, train, and evaluate Recommender Systems for Marketplaces
Authors:
Marlesson R. O. Santana,
Luckeciano C. Melo,
Fernando H. F. Camargo,
Bruno Brandão,
Anderson Soares,
Renan M. Oliveira,
Sandor Caetano
Abstract:
Recommender Systems are especially challenging for marketplaces since they must maximize user satisfaction while maintaining the healthiness and fairness of such ecosystems. In this context, we observed a lack of resources to design, train, and evaluate agents that learn by interacting within these environments. For this matter, we propose MARS-Gym, an open-source framework to empower researchers…
▽ More
Recommender Systems are especially challenging for marketplaces since they must maximize user satisfaction while maintaining the healthiness and fairness of such ecosystems. In this context, we observed a lack of resources to design, train, and evaluate agents that learn by interacting within these environments. For this matter, we propose MARS-Gym, an open-source framework to empower researchers and engineers to quickly build and evaluate Reinforcement Learning agents for recommendations in marketplaces. MARS-Gym addresses the whole development pipeline: data processing, model design and optimization, and multi-sided evaluation. We also provide the implementation of a diverse set of baseline agents, with a metrics-driven analysis of them in the Trivago marketplace dataset, to illustrate how to conduct a holistic assessment using the available metrics of recommendation, off-policy estimation, and fairness. With MARS-Gym, we expect to bridge the gap between academic research and production systems, as well as to facilitate the design of new algorithms and applications.
△ Less
Submitted 30 September, 2020;
originally announced October 2020.