Showing 1–21 of 21 results for author: de Melo, C M

  1. arXiv:2505.24216  [pdf, ps, other]

    cs.CV

    Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation

    Authors: Prasanna Reddy Pulakurthi, Majid Rabbani, Jamison Heard, Sohail Dianat, Celso M. de Melo, Raghuveer Rao

    Abstract: This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable p…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: 6 pages, 3 figures, 5 tables, Accepted to IEEE ICIP 2025

  2. arXiv:2505.18048  [pdf, ps, other]

    cs.CV

    SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios

    Authors: Simon Malzard, Nitish Mital, Richard Walters, Victoria Nockles, Raghuveer Rao, Celso M. De Melo

    Abstract: Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in min…

    Submitted 27 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: 19 pages, 2 images, updated acknowledgements versus previous versions to be compliant with funders

  3. arXiv:2505.00788  [pdf, ps, other]

    cs.CV

    SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

    Authors: Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille

    Abstract: Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study th…

    Submitted 10 June, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 highlight

  4. Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

    Authors: Prasanna Reddy Pulakurthi, Majid Rabbani, Celso M. de Melo, Sohail A. Dianat, Raghuveer M. Rao

    Abstract: This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground o…

    Submitted 3 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: 9 pages, 2 figures, 4 tables, Accepted to SPIE DSC 2025 Conference: Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications III

    Journal ref: Proc. SPIE 13459, Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications III, 134590I (2025)

  5. arXiv:2503.19009  [pdf, other]

    cs.CV cs.IR

    Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

    Authors: Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, Rama Chellappa

    Abstract: In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temp…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted at CVPR 2025. 13 pages, 4 figures. Approved for public release: distribution unlimited

  6. arXiv:2503.08933  [pdf, other]

    cs.CV

    PromptGAR: Flexible Promptive Group Activity Recognition

    Authors: Zhangyu Jin, Andrew Feng, Ankur Chemburkar, Celso M. De Melo

    Abstract: We present PromptGAR, a novel framework that addresses the limitations of current Group Activity Recognition (GAR) approaches by leveraging multi-modal prompts to achieve both input flexibility and high recognition accuracy. The existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, the lack of long-term actor consistency, and under-explo…

    Submitted 11 March, 2025; originally announced March 2025.

  7. arXiv:2502.08636  [pdf, ps, other]

    cs.CV

    Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

    Authors: Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille

    Abstract: Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this…

    Submitted 8 June, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Published in CVPR 2025 as Highlight. Data and code are released at https://github.com/XingruiWang/Spatial457

  8. arXiv:2412.07825  [pdf, other]

    cs.CV

    3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

    Authors: Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Celso M de Melo, Alan Yuille

    Abstract: 3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable pr…

    Submitted 8 May, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

    Comments: Project page: https://3dsrbench.github.io

  9. arXiv:2412.01477  [pdf, other]

    cs.CV

    Improving Object Detection by Modifying Synthetic Data with Explainable AI

    Authors: Nitish Mital, Simon Malzard, Richard Walters, Celso M. De Melo, Raghuveer Rao, Victoria Nockles

    Abstract: Limited real-world data severely impacts model performance in many computer vision domains, particularly for samples that are underrepresented in training. Synthetically generated images are a promising solution, but 1) it remains unclear how to design synthetic training data to optimally improve model performance (e.g., whether and where to introduce more realism or more abstraction) and 2) the do…

    Submitted 3 April, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

  10. arXiv:2410.06108  [pdf, other]

    cs.AI

    ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution

    Authors: Corban Rivera, Grayson Byrd, William Paul, Tyler Feldman, Meghan Booker, Emma Holmes, David Handelman, Bethany Kemp, Andrew Badger, Aurora Schmidt, Krishna Murthy Jatavallabhula, Celso M de Melo, Lalithkumar Seenivasan, Mathias Unberath, Rama Chellappa

    Abstract: Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching t…

    Submitted 8 October, 2024; originally announced October 2024.

  11. An Evaluation of Large Pre-Trained Models for Gesture Recognition using Synthetic Videos

    Authors: Arun Reddy, Ketul Shah, Corban Rivera, William Paul, Celso M. De Melo, Rama Chellappa

    Abstract: In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors c…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II (SPIE Defense + Commercial Sensing, 2024)

    Journal ref: Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II. Vol. 13035. SPIE, 2024

  12. arXiv:2312.14126  [pdf, other]

    cs.CV

    Entropic Open-set Active Learning

    Authors: Bardia Safaei, Vibashan VS, Celso M. de Melo, Vishal M. Patel

    Abstract: Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting.…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted in AAAI 2024

  13. arXiv:2312.02914  [pdf, other]

    cs.CV cs.LG

    Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

    Authors: Arun Reddy, William Paul, Corban Rivera, Ketul Shah, Celso M. de Melo, Rama Chellappa

    Abstract: In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then…

    Submitted 4 March, 2025; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted at CVPR 2024. 13 pages, 4 figures. Approved for public release: distribution unlimited

  14. arXiv:2312.02151  [pdf, other]

    cs.CV cs.AI cs.LG

    Guarding Barlow Twins Against Overfitting with Mixed Samples

    Authors: Wele Gedara Chaminda Bandara, Celso M. De Melo, Vishal M. Patel

    Abstract: Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing fo…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: Code and checkpoints are available at: https://github.com/wgcban/mix-bt.git

  15. arXiv:2309.16650  [pdf, other]

    cs.RO cs.CV

    ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

    Authors: Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull

    Abstract: For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, whi…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc

  16. arXiv:2303.18177  [pdf, other]

    cs.CV

    STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

    Authors: Xiaoyu Zhu, Po-Yao Huang, Junwei Liang, Celso M. de Melo, Alexander Hauptmann

    Abstract: We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-fra…

    Submitted 26 July, 2024; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  17. arXiv:2303.10280  [pdf, other]

    cs.CV

    Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances

    Authors: Arun V. Reddy, Ketul Shah, William Paul, Rohita Mocharla, Judy Hoffman, Kapil D. Katyal, Dinesh Manocha, Celso M. de Melo, Rama Chellappa

    Abstract: Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data h…

    Submitted 1 August, 2024; v1 submitted 17 March, 2023; originally announced March 2023.

    Comments: ICRA 2023. The first two authors contributed equally. Dataset available at: https://github.com/reddyav1/RoCoG-v2

  18. AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

    Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Celso M. de Melo, Stephen M. Nogar, Aniket Bera, Dinesh Manocha

    Abstract: We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also presen…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted for publication at ICRA 2023

  19. arXiv:2302.07241  [pdf, other]

    cs.CV cs.AI cs.RO

    ConceptFusion: Open-set Multimodal 3D Mapping

    Authors: Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, Antonio Torralba

    Abstract: Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent wor…

    Submitted 23 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: RSS 2023. Project page: https://concept-fusion.github.io Explainer video: https://www.youtube.com/watch?v=rkXgws8fiDs Code: https://github.com/concept-fusion/concept-fusion

  20. arXiv:2211.05883  [pdf, other]

    cs.CV

    Open-Set Automatic Target Recognition

    Authors: Bardia Safaei, Vibashan VS, Celso M. de Melo, Shuowen Hu, Vishal M. Patel

    Abstract: Automatic Target Recognition (ATR) is a category of computer vision algorithms which attempts to recognize targets on data obtained from different sensors. ATR algorithms are extensively used in real-world scenarios such as military and surveillance applications. Existing ATR algorithms are developed for traditional closed-set methods where training and testing have the same class distribution. Th…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: 5 pages, 3 figures. Submitted to ICASSP 2023

  21. arXiv:2207.00925  [pdf]

    cs.GT

    The Impact of Partner Expressions on Felt Emotion in the Iterated Prisoner's Dilemma: An Event-level Analysis

    Authors: Maria Angelika-Nikita, Celso M. de Melo, Kazunori Terada, Gale Lucas, Jonathan Gratch

    Abstract: Social games like the prisoner's dilemma are often used to develop models of the role of emotion in social decision-making. Here we examine an understudied aspect of emotion in such games: how an individual's feelings are shaped by their partner's expressions. Prior research has tended to focus on other aspects of emotion. Research on felt-emotion has focused on how an individual's feelings shape…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: 18 pages, 7 figures, Ninth Annual Conference on Advances in Cognitive Systems