-
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Authors:
Bingda Tang,
Boyang Zheng,
Xichen Pan,
Sayak Paul,
Saining Xie
Abstract:
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed com…
▽ More
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Authors:
Le Zhuo,
Liangbing Zhao,
Sayak Paul,
Yue Liao,
Renrui Zhang,
Yi Xin,
Peng Gao,
Mohamed Elhoseiny,
Hongsheng Li
Abstract:
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and…
▽ More
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Yufei Wang,
Wenhan Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Qiyu Rong,
Hongyuan Jing,
Mengmeng Zhang,
Jinglong Li,
Xiangyu Lu,
Yi Ren,
Yuting Liu,
Meng Zhang,
Xiang Chen,
Qiyuan Guan,
Jiangxin Dong,
Jinshan Pan,
Conglin Gou
, et al. (112 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…
▽ More
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
△ Less
Submitted 19 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Resolving Nondeterminism by Chance
Authors:
Soumyajit Paul,
David Purser,
Sven Schewe,
Qiyi Tang,
Patrick Totzke,
Di-De Yen
Abstract:
History-deterministic automata are those in which nondeterministic choices can be correctly resolved stepwise: there is a strategy to select a continuation of a run given the next input letter so that if the overall input word admits some accepting run, then the constructed run is also accepting.
Motivated by checking qualitative properties in probabilistic verification, we consider the setting…
▽ More
History-deterministic automata are those in which nondeterministic choices can be correctly resolved stepwise: there is a strategy to select a continuation of a run given the next input letter so that if the overall input word admits some accepting run, then the constructed run is also accepting.
Motivated by checking qualitative properties in probabilistic verification, we consider the setting where the resolver strategy can randomize and only needs to succeed with lower-bounded probability. We study the expressiveness of such stochastically-resolvable automata as well as consider the decision questions of whether a given automaton has this property. In particular, we show that it is undecidable to check if a given NFA is $λ$-stochastically resolvable. This problem is decidable for finitely-ambiguous automata. We also present complexity upper and lower bounds for several well-studied classes of automata for which this problem remains decidable.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling
Authors:
Tasmiah Haque,
Md. Asif Bin Syed,
Byungheon Jeong,
Xue Bai,
Sumit Mohan,
Somdyuti Paul,
Imtiaz Ahmed,
Srinjoy Das
Abstract:
We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoin…
▽ More
We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints and their associated local affine transformations. These keypoints are identified using a self-supervised keypoint detector and arranged into a time series corresponding to the successive frames. Forecasting is performed on these keypoints by integrating two advanced generative time series models into the motion transfer pipeline, namely the Variational Recurrent Neural Network (VRNN) and the Gated Recurrent Unit with Normalizing Flow (GRU-NF). The predicted keypoints are subsequently synthesized into realistic video frames using an optical flow estimator paired with a generator network, thereby facilitating accurate video forecasting and enabling efficient, low-frame-rate video transmission. We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement. Our results confirm that by utilizing the superior reconstruction property of the Variational Autoencoder, the VRNN integrated FOMM excels in applications involving multi-step ahead forecasts such as video conferencing. On the other hand, by leveraging the Normalizing Flow architecture for exact likelihood estimation, and enabling efficient latent space sampling, the GRU-NF based FOMM exhibits superior capabilities for producing diverse future samples while maintaining high visual quality for tasks like real-time video-based anomaly detection.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
LLM-Assisted Proactive Threat Intelligence for Automated Reasoning
Authors:
Shuva Paul,
Farhad Alemi,
Richard Macwan
Abstract:
Successful defense against dynamically evolving cyber threats requires advanced and sophisticated techniques. This research presents a novel approach to enhance real-time cybersecurity threat detection and response by integrating large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems with continuous threat intelligence feeds. Leveraging recent advancements in LLMs, specifica…
▽ More
Successful defense against dynamically evolving cyber threats requires advanced and sophisticated techniques. This research presents a novel approach to enhance real-time cybersecurity threat detection and response by integrating large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems with continuous threat intelligence feeds. Leveraging recent advancements in LLMs, specifically GPT-4o, and the innovative application of RAG techniques, our approach addresses the limitations of traditional static threat analysis by incorporating dynamic, real-time data sources. We leveraged RAG to get the latest information in real-time for threat intelligence, which is not possible in the existing GPT-4o model. We employ the Patrowl framework to automate the retrieval of diverse cybersecurity threat intelligence feeds, including Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), Exploit Prediction Scoring System (EPSS), and Known Exploited Vulnerabilities (KEV) databases, and integrate these with the all-mpnet-base-v2 model for high-dimensional vector embeddings, stored and queried in Milvus. We demonstrate our system's efficacy through a series of case studies, revealing significant improvements in addressing recently disclosed vulnerabilities, KEVs, and high-EPSS-score CVEs compared to the baseline GPT-4o. This work not only advances the role of LLMs in cybersecurity but also establishes a robust foundation for the development of automated intelligent cyberthreat information management systems, addressing crucial gaps in current cybersecurity practices.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Monitoring Spatially Distributed Cyber-Physical Systems with Alternating Finite Automata
Authors:
Anand Balakrishnan,
Sheryl Paul,
Simone Silvetti,
Laura Nenzi,
Jyotirmoy V. Deshmukh
Abstract:
Modern cyber-physical systems (CPS) can consist of various networked components and agents interacting and communicating with each other. In the context of spatially distributed CPS, these connections can be dynamically dependent on the spatial configuration of the various components and agents. In these settings, robust monitoring of the distributed components is vital to ensuring complex behavio…
▽ More
Modern cyber-physical systems (CPS) can consist of various networked components and agents interacting and communicating with each other. In the context of spatially distributed CPS, these connections can be dynamically dependent on the spatial configuration of the various components and agents. In these settings, robust monitoring of the distributed components is vital to ensuring complex behaviors are achieved, and safety properties are maintained. To this end, we look at defining the automaton semantics for the Spatio-Temporal Reach and Escape Logic (STREL), a formal logic designed to express and monitor spatio-temporal requirements over mobile, spatially distributed CPS. Specifically, STREL reasons about spatio-temporal behavior over dynamic weighted graphs. While STREL is endowed with well defined qualitative and quantitative semantics, in this paper, we propose a novel construction of (weighted) alternating finite automata from STREL specifications that efficiently encodes these semantics. Moreover, we demonstrate how this automaton semantics can be used to perform both, offline and online monitoring for STREL specifications using a simulated drone swarm environment.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Exploring the Efficacy of Partial Denoising Using Bit Plane Slicing for Enhanced Fracture Identification: A Comparative Study of Deep Learning-Based Approaches and Handcrafted Feature Extraction Techniques
Authors:
Snigdha Paul,
Sambit Mallick,
Anindya Sen
Abstract:
Computer vision has transformed medical diagnosis, treatment, and research through advanced image processing and machine learning techniques. Fracture classification, a critical area in healthcare, has greatly benefited from these advancements, yet accurate detection is challenged by complex patterns and image noise. Bit plane slicing enhances medical images by reducing noise interference and extr…
▽ More
Computer vision has transformed medical diagnosis, treatment, and research through advanced image processing and machine learning techniques. Fracture classification, a critical area in healthcare, has greatly benefited from these advancements, yet accurate detection is challenged by complex patterns and image noise. Bit plane slicing enhances medical images by reducing noise interference and extracting informative features. This research explores partial denoising techniques to provide practical solutions for improved fracture analysis, ultimately enhancing patient care. The study explores deep learning model DenseNet and handcrafted feature extraction. Decision Tree and Random Forest, were employed to train and evaluate distinct image representations. These include the original image, the concatenation of the four bit planes from the LSB as well as MSB, the fully denoised image, and an image consisting of 6 bit planes from MSB and 2 denoised bit planes from LSB. The purpose of forming these diverse image representations is to analyze SNR as well as classification accuracy and identify the bit planes that contain the most informative features. Moreover, the study delves into the significance of partial denoising techniques in preserving crucial features, leading to improvements in classification results. Notably, this study shows that employing the Random Forest classifier, the partially denoised image representation exhibited a testing accuracy of 95.61% surpassing the performance of other image representations. The outcomes of this research provide valuable insights into the development of efficient preprocessing, feature extraction and classification approaches for fracture identification. By enhancing diagnostic accuracy, these advancements hold the potential to positively impact patient care and overall medical outcomes.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Authors:
Junsong Chen,
Shuchen Xue,
Yuyang Zhao,
Jincheng Yu,
Sayak Paul,
Junyu Chen,
Han Cai,
Enze Xie,
Song Han
Abstract:
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-t…
▽ More
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step - outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10x faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024 x 1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.
△ Less
Submitted 23 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Simplifying imperfect recall games
Authors:
Hugo Gimbert,
Soumyajit Paul,
B. Srivathsan
Abstract:
In games with imperfect recall, players may forget the sequence of decisions they made in the past. When players also forget whether they have already encountered their current decision point, they are said to be absent-minded. Solving one-player imperfect recall games is known to be NP-hard, even when the players are not absent-minded. This motivates the search for polynomial-time solvable subcla…
▽ More
In games with imperfect recall, players may forget the sequence of decisions they made in the past. When players also forget whether they have already encountered their current decision point, they are said to be absent-minded. Solving one-player imperfect recall games is known to be NP-hard, even when the players are not absent-minded. This motivates the search for polynomial-time solvable subclasses. A special type of imperfect recall, called A-loss recall, is amenable to efficient polynomial-time algorithms. In this work, we present novel techniques to simplify non-absent-minded imperfect recall games into equivalent A-loss recall games. The first idea involves shuffling the order of actions, and leads to a new polynomial-time solvable class of imperfect recall games that extends A-loss recall. The second idea generalises the first one, by constructing a new set of action sequences which can be "linearly combined" to give the original game. The equivalent game has a simplified information structure, but it could be exponentially bigger in size (in accordance with the NP-hardness). We present an algorithm to generate an equivalent A-loss recall game with the smallest size.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Continual Learning Should Move Beyond Incremental Classification
Authors:
Rupert Mitchell,
Antonio Alliegro,
Raffaello Camoriano,
Dustin Carrión-Ojeda,
Antonio Carta,
Georgia Chalvatzaki,
Nikhil Churamani,
Carlo D'Eramo,
Samin Hamidi,
Robin Hesse,
Fabian Hinder,
Roshni Ramanna Kamath,
Vincenzo Lomonaco,
Subarnaduti Paul,
Francesca Pistilli,
Tinne Tuytelaars,
Gido M van de Ven,
Kristian Kersting,
Simone Schaub-Meyer,
Martin Mundt
Abstract:
Continual learning (CL) is the sub-field of machine learning concerned with accumulating knowledge in dynamic environments. So far, CL research has mainly focused on incremental classification tasks, where models learn to classify new categories while retaining knowledge of previously learned ones. Here, we argue that maintaining such a focus limits both theoretical development and practical appli…
▽ More
Continual learning (CL) is the sub-field of machine learning concerned with accumulating knowledge in dynamic environments. So far, CL research has mainly focused on incremental classification tasks, where models learn to classify new categories while retaining knowledge of previously learned ones. Here, we argue that maintaining such a focus limits both theoretical development and practical applicability of CL methods. Through a detailed analysis of concrete examples - including multi-target classification, robotics with constrained output spaces, learning in continuous task domains, and higher-level concept memorization - we demonstrate how current CL approaches often fail when applied beyond standard classification. We identify three fundamental challenges: (C1) the nature of continuity in learning problems, (C2) the choice of appropriate spaces and metrics for measuring similarity, and (C3) the role of learning objectives beyond classification. For each challenge, we provide specific recommendations to help move the field forward, including formalizing temporal dynamics through distribution processes, developing principled approaches for continuous task spaces, and incorporating density estimation and generative objectives. In so doing, this position paper aims to broaden the scope of CL research while strengthening its theoretical foundations, making it more applicable to real-world problems.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
The Cake that is Intelligence and Who Gets to Bake it: An AI Analogy and its Implications for Participation
Authors:
Martin Mundt,
Anaelia Ovalle,
Felix Friedrich,
A Pranav,
Subarnaduti Paul,
Manuel Brack,
Kristian Kersting,
William Agnew
Abstract:
In a widely popular analogy by Turing Award Laureate Yann LeCun, machine intelligence has been compared to cake - where unsupervised learning forms the base, supervised learning adds the icing, and reinforcement learning is the cherry on top. We expand this 'cake that is intelligence' analogy from a simple structural metaphor to the full life-cycle of AI systems, extending it to sourcing of ingred…
▽ More
In a widely popular analogy by Turing Award Laureate Yann LeCun, machine intelligence has been compared to cake - where unsupervised learning forms the base, supervised learning adds the icing, and reinforcement learning is the cherry on top. We expand this 'cake that is intelligence' analogy from a simple structural metaphor to the full life-cycle of AI systems, extending it to sourcing of ingredients (data), conception of recipes (instructions), the baking process (training), and the tasting and selling of the cake (evaluation and distribution). Leveraging our re-conceptualization, we describe each step's entailed social ramifications and how they are bounded by statistical assumptions within machine learning. Whereas these technical foundations and social impacts are deeply intertwined, they are often studied in isolation, creating barriers that restrict meaningful participation. Our re-conceptualization paves the way to bridge this gap by mapping where technical foundations interact with social outcomes, highlighting opportunities for cross-disciplinary dialogue. Finally, we conclude with actionable recommendations at each stage of the metaphorical AI cake's life-cycle, empowering prospective AI practitioners, users, and researchers, with increased awareness and ability to engage in broader AI discourse.
△ Less
Submitted 6 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Masked Generative Nested Transformers with Decode Time Scaling
Authors:
Sahil Goyal,
Debapriya Tula,
Gagan Jain,
Pradeep Shenoy,
Prateek Jain,
Sujoy Paul
Abstract:
Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iteratio…
▽ More
Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet256$\times$256 , UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
△ Less
Submitted 1 February, 2025;
originally announced February 2025.
-
EmoFormer: A Text-Independent Speech Emotion Recognition using a Hybrid Transformer-CNN model
Authors:
Rashedul Hasan,
Meher Nigar,
Nursadul Mamun,
Sayan Paul
Abstract:
Speech Emotion Recognition is a crucial area of research in human-computer interaction. While significant work has been done in this field, many state-of-the-art networks struggle to accurately recognize emotions in speech when the data is both speech and speaker-independent. To address this limitation, this study proposes, EmoFormer, a hybrid model combining CNNs (CNNs) with Transformer encoders…
▽ More
Speech Emotion Recognition is a crucial area of research in human-computer interaction. While significant work has been done in this field, many state-of-the-art networks struggle to accurately recognize emotions in speech when the data is both speech and speaker-independent. To address this limitation, this study proposes, EmoFormer, a hybrid model combining CNNs (CNNs) with Transformer encoders to capture emotion patterns in speech data for such independent datasets. The EmoFormer network was trained and tested using the Expressive Anechoic Recordings of Speech (EARS) dataset, recently released by META. We experimented with two feature extraction techniques: MFCCs and x-vectors. The model was evaluated on different emotion sets comprising 5, 7, 10, and 23 distinct categories. The results demonstrate that the model achieved its best performance with five emotions, attaining an accuracy of 90%, a precision of 0.92, a recall, and an F1-score of 0.91. However, performance decreased as the number of emotions increased, with an accuracy of 83% for seven emotions compared to 70% for the baseline network. This study highlights the effectiveness of combining CNNs and Transformer-based architectures for emotion recognition from speech, particularly when using MFCC features.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
IoT-Based Real-Time Medical-Related Human Activity Recognition Using Skeletons and Multi-Stage Deep Learning for Healthcare
Authors:
Subrata Kumer Paul,
Abu Saleh Musa Miah,
Rakhi Rani Paul,
Md. Ekramul Hamid,
Jungpil Shin,
Md Abdur Rahim
Abstract:
The Internet of Things (IoT) and mobile technology have significantly transformed healthcare by enabling real-time monitoring and diagnosis of patients. Recognizing medical-related human activities (MRHA) is pivotal for healthcare systems, particularly for identifying actions that are critical to patient well-being. However, challenges such as high computational demands, low accuracy, and limited…
▽ More
The Internet of Things (IoT) and mobile technology have significantly transformed healthcare by enabling real-time monitoring and diagnosis of patients. Recognizing medical-related human activities (MRHA) is pivotal for healthcare systems, particularly for identifying actions that are critical to patient well-being. However, challenges such as high computational demands, low accuracy, and limited adaptability persist in Human Motion Recognition (HMR). While some studies have integrated HMR with IoT for real-time healthcare applications, limited research has focused on recognizing MRHA as essential for effective patient monitoring. This study proposes a novel HMR method for MRHA detection, leveraging multi-stage deep learning techniques integrated with IoT. The approach employs EfficientNet to extract optimized spatial features from skeleton frame sequences using seven Mobile Inverted Bottleneck Convolutions (MBConv) blocks, followed by ConvLSTM to capture spatio-temporal patterns. A classification module with global average pooling, a fully connected layer, and a dropout layer generates the final predictions. The model is evaluated on the NTU RGB+D 120 and HMDB51 datasets, focusing on MRHA, such as sneezing, falling, walking, sitting, etc. It achieves 94.85% accuracy for cross-subject evaluations and 96.45% for cross-view evaluations on NTU RGB+D 120, along with 89.00% accuracy on HMDB51. Additionally, the system integrates IoT capabilities using a Raspberry Pi and GSM module, delivering real-time alerts via Twilios SMS service to caregivers and patients. This scalable and efficient solution bridges the gap between HMR and IoT, advancing patient monitoring, improving healthcare outcomes, and reducing costs.
△ Less
Submitted 12 January, 2025;
originally announced January 2025.
-
Heterogeneous transfer learning for high dimensional regression with feature mismatch
Authors:
Jae Ho Chang,
Massimiliano Russo,
Subhadeep Paul
Abstract:
We consider the problem of transferring knowledge from a source, or proxy, domain to a new target domain for learning a high-dimensional regression model with possibly different features. Recently, the statistical properties of homogeneous transfer learning have been investigated. However, most homogeneous transfer and multi-task learning methods assume that the target and proxy domains have the s…
▽ More
We consider the problem of transferring knowledge from a source, or proxy, domain to a new target domain for learning a high-dimensional regression model with possibly different features. Recently, the statistical properties of homogeneous transfer learning have been investigated. However, most homogeneous transfer and multi-task learning methods assume that the target and proxy domains have the same feature space, limiting their practical applicability. In applications, target and proxy feature spaces are frequently inherently different, for example, due to the inability to measure some variables in the target data-poor environments. Conversely, existing heterogeneous transfer learning methods do not provide statistical error guarantees, limiting their utility for scientific discovery. We propose a two-stage method that involves learning the relationship between the missing and observed features through a projection step in the proxy data and then solving a joint penalized regression optimization problem in the target data. We develop an upper bound on the method's parameter estimation risk and prediction risk, assuming that the proxy and the target domain parameters are sparsely different. Our results elucidate how estimation and prediction error depend on the complexity of the model, sample size, the extent of overlap, and correlation between matched and mismatched features.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.
-
FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability
Authors:
Estelle Aflalo,
Gabriela Ben Melech Stan,
Tiep Le,
Man Luo,
Shachar Rosenman,
Sayak Paul,
Shao-Yen Tseng,
Vasudev Lal
Abstract:
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective vis…
▽ More
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks are not specifically measuring the degree to which the answer require the visual input. This limitation makes it challenging to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and also evaluate their effectiveness in achieving it. We demonstrate the value of our datasets through three approaches. First, we introduce a novel training task based on our augmented training dataset, resulting in better performance than the baseline. Second, we present benchmarks to assess the model's ability to use image as substantive evidence, rather than relying solely on linguistic priors. Finally, we identify attention heads with the strongest vision-language alignment, enabling explainability on visual-driven hallucinations. The code is available at https://github.com/IntelLabs/fivl.
△ Less
Submitted 19 March, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
Performance Evaluation of ROS2-DDS middleware implementations facilitating Cooperative Driving in Autonomous Vehicle
Authors:
Sumit Paul,
Danh Lephuoc,
Manfred Hauswirth
Abstract:
In the autonomous vehicle and self-driving paradigm, cooperative perception or exchanging sensor information among vehicles over wireless communication has added a new dimension. Generally, an autonomous vehicle is a special type of robot that requires real-time, highly reliable sensor inputs due to functional safety. Autonomous vehicles are equipped with a considerable number of sensors to provid…
▽ More
In the autonomous vehicle and self-driving paradigm, cooperative perception or exchanging sensor information among vehicles over wireless communication has added a new dimension. Generally, an autonomous vehicle is a special type of robot that requires real-time, highly reliable sensor inputs due to functional safety. Autonomous vehicles are equipped with a considerable number of sensors to provide different required sensor data to make the driving decision and share with other surrounding vehicles. The inclusion of Data Distribution Service(DDS) as a communication middleware in ROS2 has proved its potential capability to be a reliable real-time distributed system. DDS comes with a scoping mechanism known as domain. Whenever a ROS2 process is initiated, it creates a DDS participant. It is important to note that there is a limit to the number of participants allowed in a single domain.
The efficient handling of numerous in-vehicle sensors and their messages demands the use of multiple ROS2 nodes in a single vehicle. Additionally, in the cooperative perception paradigm, a significant number of ROS2 nodes can be required when a vehicle functions as a single ROS2 node. These ROS2 nodes cannot be part of a single domain due to DDS participant limitation; thus, different domain communication is unavoidable. Moreover, there are different vendor-specific implementations of DDS, and each vendor has their configurations, which is an inevitable communication catalyst between the ROS2 nodes. The communication between vehicles or robots or ROS2 nodes depends directly on the vendor-specific configuration, data type, data size, and the DDS implementation used as middleware; in our study, we evaluate and investigate the limitations, capabilities, and prospects of the different domain communication for various vendor-specific DDS implementations for diverse sensor data type.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
A Noise is Worth Diffusion Guidance
Authors:
Donghoon Ahn,
Jiwon Kang,
Sanghyun Lee,
Jaewon Min,
Minjae Kim,
Wooseok Jang,
Hyoungwon Cho,
Sayak Paul,
SeonHwa Kim,
Eunju Cha,
Kyong Hwan Jin,
Seungryong Kim
Abstract:
Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By…
▽ More
Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to `guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose \ours, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
FastRM: An efficient and automatic explainability framework for multimodal generative models
Authors:
Gabriela Ben-Melech Stan,
Estelle Aflalo,
Man Luo,
Shachar Rosenman,
Tiep Le,
Sayak Paul,
Shao-Yen Tseng,
Vasudev Lal
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation. Identifying and mitigating ungrounded responses is crucial for developing trustworthy AI. Traditional explainability methods such as gradient-based relevancy maps, offer insight into the decision process of models,…
▽ More
Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation. Identifying and mitigating ungrounded responses is crucial for developing trustworthy AI. Traditional explainability methods such as gradient-based relevancy maps, offer insight into the decision process of models, but are often computationally expensive and unsuitable for real-time output validation. In this work, we introduce FastRM, an efficient method for predicting explainable Relevancy Maps of LVLMs. Furthermore, FastRM provides both quantitative and qualitative assessment of model confidence. Experimental results demonstrate that FastRM achieves a 99.8% reduction in computation time and a 44.4% reduction in memory footprint compared to traditional relevancy map generation. FastRM allows explainable AI to be more practical and scalable, thereby promoting its deployment in real-world applications and enabling users to more effectively evaluate the reliability of model outputs.
△ Less
Submitted 6 May, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
A Talent-infused Policy-gradient Approach to Efficient Co-Design of Morphology and Task Allocation Behavior of Multi-Robot Systems
Authors:
Prajit KrisshnaKumar,
Steve Paul,
Souma Chowdhury
Abstract:
Interesting and efficient collective behavior observed in multi-robot or swarm systems emerges from the individual behavior of the robots. The functional space of individual robot behaviors is in turn shaped or constrained by the robot's morphology or physical design. Thus the full potential of multi-robot systems can be realized by concurrently optimizing the morphology and behavior of individual…
▽ More
Interesting and efficient collective behavior observed in multi-robot or swarm systems emerges from the individual behavior of the robots. The functional space of individual robot behaviors is in turn shaped or constrained by the robot's morphology or physical design. Thus the full potential of multi-robot systems can be realized by concurrently optimizing the morphology and behavior of individual robots, informed by the environment's feedback about their collective performance, as opposed to treating morphology and behavior choices disparately or in sequence (the classical approach). This paper presents an efficient concurrent design or co-design method to explore this potential and understand how morphology choices impact collective behavior, particularly in an MRTA problem focused on a flood response scenario, where the individual behavior is designed via graph reinforcement learning. Computational efficiency in this case is attributed to a new way of near exact decomposition of the co-design problem into a series of simpler optimization and learning problems. This is achieved through i) the identification and use of the Pareto front of Talent metrics that represent morphology-dependent robot capabilities, and ii) learning the selection of Talent best trade-offs and individual robot policy that jointly maximizes the MRTA performance. Applied to a multi-unmanned aerial vehicle flood response use case, the co-design outcomes are shown to readily outperform sequential design baselines. Significant differences in morphology and learned behavior are also observed when comparing co-designed single robot vs. co-designed multi-robot systems for similar operations.
△ Less
Submitted 27 November, 2024;
originally announced November 2024.
-
Rate-Informed Discovery via Bayesian Adaptive Multifidelity Sampling
Authors:
Aman Sinha,
Payam Nikdel,
Supratik Paul,
Shimon Whiteson
Abstract:
Ensuring the safety of autonomous vehicles (AVs) requires both accurate estimation of their performance and efficient discovery of potential failure cases. This paper introduces Bayesian adaptive multifidelity sampling (BAMS), which leverages the power of adaptive Bayesian sampling to achieve efficient discovery while simultaneously estimating the rate of adverse events. BAMS prioritizes explorati…
▽ More
Ensuring the safety of autonomous vehicles (AVs) requires both accurate estimation of their performance and efficient discovery of potential failure cases. This paper introduces Bayesian adaptive multifidelity sampling (BAMS), which leverages the power of adaptive Bayesian sampling to achieve efficient discovery while simultaneously estimating the rate of adverse events. BAMS prioritizes exploration of regions with potentially low performance, leading to the identification of novel and critical scenarios that traditional methods might miss. Using real-world AV data we demonstrate that BAMS discovers 10 times as many issues as Monte Carlo (MC) and importance sampling (IS) baselines, while at the same time generating rate estimates with variances 15 and 6 times narrower than MC and IS baselines respectively.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
Temporally Non-Uniform Cellular Automata (t-NUCA): Reversibility and Cyclic behavior
Authors:
Subrata Paul,
Sukanta Das
Abstract:
In this work, we propose a variant of non-uniform cellular automata, named as Temporally Non-Uniform Cellular Automata (t-NUCAs), which temporally use two rules, $f$ and $g$ in a sequence $\mathcal{R}$. To observe reversibility in t-NUCAs, we study their injectivity and surjectivity properties. Unlike classical CAs, some irreversible t-NUCAs show the behavior similar to reversible t-NUCAs. To stud…
▽ More
In this work, we propose a variant of non-uniform cellular automata, named as Temporally Non-Uniform Cellular Automata (t-NUCAs), which temporally use two rules, $f$ and $g$ in a sequence $\mathcal{R}$. To observe reversibility in t-NUCAs, we study their injectivity and surjectivity properties. Unlike classical CAs, some irreversible t-NUCAs show the behavior similar to reversible t-NUCAs. To study this behavior, we define restricted surjectivity of t-NUCA and introduce restricted reversibility which shows reversibility of t-NUCA for a set of initial configurations. By further investigating the remaining irreversible t-NUCAs, some t-NUCAs are found which have many-to-one mapping in their configuration space, but do not have non-reachable (Garden-of-Eden) configurations. We name these t-NUCAs as weakly reversible t-NUCAs. Under finite lattice size, a t-NUCA, like any classical CA, shows cyclic behavior. We explore this cyclic behavior and discuss its relation with rule sequence. Finally, we note down the possible longest cycle length of a t-NUCA, based on the lattice size and rule sequence.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
Authors:
Soumava Paul,
Prakhar Kaushik,
Alan Yuille
Abstract:
In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360 scenes from a sparse set of 2D images. Pose-free scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large complex scenes (with high degree o…
▽ More
In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360 scenes from a sparse set of 2D images. Pose-free scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large complex scenes (with high degree of foreground and background detail) with known camera poses using view-conditioned generative priors, these methods cannot be directly adapted for the pose-free setting when ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) modulation layers as a lightweight alternative to cross-attention and also propose a novel confidence measure for 3D Gaussian splat representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex 360 scenes.
△ Less
Submitted 5 April, 2025; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Aim My Robot: Precision Local Navigation to Any Object
Authors:
Xiangyun Meng,
Xuning Yang,
Sanghun Jung,
Fabio Ramos,
Srid Sadhan Jujjavarapu,
Sanjoy Paul,
Dieter Fox
Abstract:
Existing navigation systems mostly consider "success" when the robot reaches within 1m radius to a goal. This precision is insufficient for emerging applications where the robot needs to be positioned precisely relative to an object for downstream tasks, such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a…
▽ More
Existing navigation systems mostly consider "success" when the robot reaches within 1m radius to a goal. This precision is insufficient for emerging applications where the robot needs to be positioned precisely relative to an object for downstream tasks, such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR achieves high precision and robustness by leveraging multi-modal perception, precise action prediction, and is trained on large-scale photorealistic data generated in simulation. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning.
△ Less
Submitted 27 December, 2024; v1 submitted 22 November, 2024;
originally announced November 2024.
-
(Independent) Roman Domination Parameterized by Distance to Cluster
Authors:
Pradeesha Ashok,
Gautam K. Das,
Arti Pandey,
Kaustav Paul,
Subhabrata Paul
Abstract:
Given a graph $G=(V,E)$, a function $f:V\to \{0,1,2\}$ is said to be a \emph{Roman Dominating function} (RDF) if for every $v\in V$ with $f(v)=0$, there exists a vertex $u\in N(v)$ such that $f(u)=2$. A Roman Dominating function $f$ is said to be an \emph{Independent Roman Dominating function} (IRDF), if $V_1\cup V_2$ forms an independent set, where $V_i=\{v\in V~\vert~f(v)=i\}$, for…
▽ More
Given a graph $G=(V,E)$, a function $f:V\to \{0,1,2\}$ is said to be a \emph{Roman Dominating function} (RDF) if for every $v\in V$ with $f(v)=0$, there exists a vertex $u\in N(v)$ such that $f(u)=2$. A Roman Dominating function $f$ is said to be an \emph{Independent Roman Dominating function} (IRDF), if $V_1\cup V_2$ forms an independent set, where $V_i=\{v\in V~\vert~f(v)=i\}$, for $i\in \{0,1,2\}$. The total weight of $f$ is equal to $\sum_{v\in V} f(v)$, and is denoted as $w(f)$. The \emph{Roman Domination Number} (resp. \emph{Independent Roman Domination Number}) of $G$, denoted by $γ_R(G)$ (resp. $i_R(G)$), is defined as min$\{w(f)~\vert~f$ is an RDF (resp. IRDF) of $G\}$. For a given graph $G$, the problem of computing $γ_R(G)$ (resp. $i_R(G)$) is defined as the \emph{Roman Domination problem} (resp. \emph{Independent Roman Domination problem}).
In this paper, we examine structural parameterizations of the (Independent) Roman Domination problem. We propose fixed-parameter tractable (FPT) algorithms for the (Independent) Roman Domination problem in graphs that are $k$ vertices away from a cluster graph. These graphs have a set of $k$ vertices whose removal results in a cluster graph. We refer to $k$ as the distance to the cluster graph. Specifically, we prove the following results when parameterized by the deletion distance $k$ to cluster graphs: we can find the Roman Domination Number (and Independent Roman Domination Number) in time $4^kn^{O(1)}$. In terms of lower bounds, we show that the Roman Domination number can not be computed in time $2^{εk}n^{O(1)}$, for any $0<ε<1$ unless a well-known conjecture, SETH fails. In addition, we also show that the Roman Domination problem parameterized by distance to cluster, does not admit a polynomial kernel unless NP $\subseteq$ coNP$/$poly.
△ Less
Submitted 20 November, 2024;
originally announced November 2024.
-
Multi-agent Path Finding for Timed Tasks using Evolutionary Games
Authors:
Sheryl Paul,
Anand Balakrishnan,
Xin Qin,
Jyotirmoy V. Deshmukh
Abstract:
Autonomous multi-agent systems such as hospital robots and package delivery drones often operate in highly uncertain environments and are expected to achieve complex temporal task objectives while ensuring safety. While learning-based methods such as reinforcement learning are popular methods to train single and multi-agent autonomous systems under user-specified and state-based reward functions,…
▽ More
Autonomous multi-agent systems such as hospital robots and package delivery drones often operate in highly uncertain environments and are expected to achieve complex temporal task objectives while ensuring safety. While learning-based methods such as reinforcement learning are popular methods to train single and multi-agent autonomous systems under user-specified and state-based reward functions, applying these methods to satisfy trajectory-level task objectives is a challenging problem. Our first contribution is the use of weighted automata to specify trajectory-level objectives, such that, maximal paths induced in the weighted automaton correspond to desired trajectory-level behaviors. We show how weighted automata-based specifications go beyond timeliness properties focused on deadlines to performance properties such as expeditiousness. Our second contribution is the use of evolutionary game theory (EGT) principles to train homogeneous multi-agent teams targeting homogeneous task objectives. We show how shared experiences of agents and EGT-based policy updates allow us to outperform state-of-the-art reinforcement learning (RL) methods in minimizing path length by nearly 30\% in large spaces. We also show that our algorithm is computationally faster than deep RL methods by at least an order of magnitude. Additionally our results indicate that it scales better with an increase in the number of agents as compared to other methods.
△ Less
Submitted 15 November, 2024;
originally announced November 2024.
-
MPVO: Motion-Prior based Visual Odometry for PointGoal Navigation
Authors:
Sayan Paul,
Ruddra dev Roychoudhury,
Brojeshwar Bhowmick
Abstract:
Visual odometry (VO) is essential for enabling accurate point-goal navigation of embodied agents in indoor environments where GPS and compass sensors are unreliable and inaccurate. However, traditional VO methods face challenges in wide-baseline scenarios, where fast robot motions and low frames per second (FPS) during inference hinder their performance, leading to drift and catastrophic failures…
▽ More
Visual odometry (VO) is essential for enabling accurate point-goal navigation of embodied agents in indoor environments where GPS and compass sensors are unreliable and inaccurate. However, traditional VO methods face challenges in wide-baseline scenarios, where fast robot motions and low frames per second (FPS) during inference hinder their performance, leading to drift and catastrophic failures in point-goal navigation. Recent deep-learned VO methods show robust performance but suffer from sample inefficiency during training; hence, they require huge datasets and compute resources. So, we propose a robust and sample-efficient VO pipeline based on motion priors available while an agent is navigating an environment. It consists of a training-free action-prior based geometric VO module that estimates a coarse relative pose which is further consumed as a motion prior by a deep-learned VO model, which finally produces a fine relative pose to be used by the navigation policy. This strategy helps our pipeline achieve up to 2x sample efficiency during training and demonstrates superior accuracy and robustness in point-goal navigation tasks compared to state-of-the-art VO method(s). Realistic indoor environments of the Gibson dataset is used in the AI-Habitat simulator to evaluate the proposed approach using navigation metrics (like success/SPL) and pose metrics (like RPE/ATE). We hope this method further opens a direction of work where motion priors from various sources can be utilized to improve VO estimates and achieve better results in embodied navigation tasks.
△ Less
Submitted 7 November, 2024;
originally announced November 2024.
-
Empowering Library Users: Creative Strategies for Engagement and Innovation
Authors:
Snehasish Paul,
Shivali Chauhan,
Atul Kumar Pal
Abstract:
This study investigated the integration of cutting-edge technologies and methodologies for creating dynamic, user-centered library environments. In creative strategies for engagement and innovation, library users must be empowered to undertake the new role of modernizing library services and enhancing user experiences. It also enhances the information management and user engagement. This can be at…
▽ More
This study investigated the integration of cutting-edge technologies and methodologies for creating dynamic, user-centered library environments. In creative strategies for engagement and innovation, library users must be empowered to undertake the new role of modernizing library services and enhancing user experiences. It also enhances the information management and user engagement. This can be attained from personalized approaches, such as recommendation systems to interactive platforms that will have effective experiences tailored to users of different natures. It investigates the consumer engagement practices of enthusiasm, sharing, and learning about their roles in cognitive, affective, and behavioural engagements. Combined, these new approaches will help promote learning, interaction, and growth, add value, and have a more positive impact on users. The challenge for libraries in this rapidly changing, technologically advancing, and digitally networked world, with a base of expectant users, is to remain relevant and engaging. This study discusses innovative strategies for empowering library users and enhancing their engagement through creative and technological approaches. This investigation was conducted to integrate cutting-edge technologies and methodologies into creating dynamic library settings that are user-centered and foster learning, interaction, and personal growth.
△ Less
Submitted 5 November, 2024;
originally announced November 2024.
-
A Comparative Study of Multiple Deep Learning Algorithms for Efficient Localization of Bone Joints in the Upper Limbs of Human Body
Authors:
Soumalya Bose,
Soham Basu,
Indranil Bera,
Sambit Mallick,
Snigdha Paul,
Saumodip Das,
Swarnendu Sil,
Swarnava Ghosh,
Anindya Sen
Abstract:
This paper addresses the medical imaging problem of joint detection in the upper limbs, viz. elbow, shoulder, wrist and finger joints. Localization of joints from X-Ray and Computerized Tomography (CT) scans is an essential step for the assessment of various bone-related medical conditions like Osteoarthritis, Rheumatoid Arthritis, and can even be used for automated bone fracture detection. Automa…
▽ More
This paper addresses the medical imaging problem of joint detection in the upper limbs, viz. elbow, shoulder, wrist and finger joints. Localization of joints from X-Ray and Computerized Tomography (CT) scans is an essential step for the assessment of various bone-related medical conditions like Osteoarthritis, Rheumatoid Arthritis, and can even be used for automated bone fracture detection. Automated joint localization also detects the corresponding bones and can serve as input to deep learning-based models used for the computerized diagnosis of the aforementioned medical disorders. This in-creases the accuracy of prediction and aids the radiologists with analyzing the scans, which is quite a complex and exhausting task. This paper provides a detailed comparative study between diverse Deep Learning (DL) models - YOLOv3, YOLOv7, EfficientDet and CenterNet in multiple bone joint detections in the upper limbs of the human body. The research analyses the performance of different DL models, mathematically, graphically and visually. These models are trained and tested on a portion of the openly available MURA (musculoskeletal radiographs) dataset. The study found that the best Mean Average Precision (mAP at 0.5:0.95) values of YOLOv3, YOLOv7, EfficientDet and CenterNet are 35.3, 48.3, 46.5 and 45.9 respectively. Besides, it has been found YOLOv7 performed the best for accurately predicting the bounding boxes while YOLOv3 performed the worst in the Visual Analysis test. Code available at https://github.com/Sohambasu07/BoneJointsLocalization
△ Less
Submitted 27 October, 2024;
originally announced October 2024.
-
Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts
Authors:
Sheryl Paul,
Jyotirmoy V. Deshmukh
Abstract:
Reinforcement learning (RL) has been successfully applied to solve the problem of finding obstacle-free paths for autonomous agents operating in stochastic and uncertain environments. However, when the underlying stochastic dynamics of the environment experiences drastic distribution shifts, the optimal policy obtained in the trained environment may be sub-optimal or may entirely fail in helping f…
▽ More
Reinforcement learning (RL) has been successfully applied to solve the problem of finding obstacle-free paths for autonomous agents operating in stochastic and uncertain environments. However, when the underlying stochastic dynamics of the environment experiences drastic distribution shifts, the optimal policy obtained in the trained environment may be sub-optimal or may entirely fail in helping find goal-reaching paths for the agent. Approaches like domain randomization and robust RL can provide robust policies, but typically assume minor (bounded) distribution shifts. For substantial distribution shifts, retraining (either with a warm-start policy or from scratch) is an alternative approach. In this paper, we develop a novel approach called {\em Evolutionary Robust Policy Optimization} (ERPO), an adaptive re-training algorithm inspired by evolutionary game theory (EGT). ERPO learns an optimal policy for the shifted environment iteratively using a temperature parameter that controls the trade off between exploration and adherence to the old optimal policy. The policy update itself is an instantiation of the replicator dynamics used in EGT. We show that under fairly common sparsity assumptions on rewards in such environments, ERPO converges to the optimal policy in the shifted environment. We empirically demonstrate that for path finding tasks in a number of environments, ERPO outperforms several popular RL and deep RL algorithms (PPO, A3C, DQN) in many scenarios and popular environments. This includes scenarios where the RL algorithms are allowed to train from scratch in the new environment, when they are retrained on the new environment, or when they are used in conjunction with domain randomization. ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
Learning Robust Representations for Communications over Interference-limited Channels
Authors:
Shubham Paul,
Sudharsan Senthil,
Preethi Seshadri,
Nambi Seshadri,
R David Koilpillai
Abstract:
In the context of cellular networks, users located at the periphery of cells are particularly vulnerable to substantial interference from neighbouring cells, which can be represented as a two-user interference channel. This study introduces two highly effective methodologies, namely TwinNet and SiameseNet, using autoencoders, tailored for the design of encoders and decoders for block transmission…
▽ More
In the context of cellular networks, users located at the periphery of cells are particularly vulnerable to substantial interference from neighbouring cells, which can be represented as a two-user interference channel. This study introduces two highly effective methodologies, namely TwinNet and SiameseNet, using autoencoders, tailored for the design of encoders and decoders for block transmission and detection in interference-limited environments. The findings unambiguously illustrate that the developed models are capable of leveraging the interference structure to outperform traditional methods reliant on complete orthogonality. While it is recognized that systems employing coordinated transmissions and independent detection can offer greater capacity, the specific gains of data-driven models have not been thoroughly quantified or elucidated. This paper conducts an analysis to demonstrate the quantifiable advantages of such models in particular scenarios. Additionally, a comprehensive examination of the characteristics of codewords generated by these models is provided to offer a more intuitive comprehension of how these models achieve superior performance.
△ Less
Submitted 13 October, 2024;
originally announced October 2024.
-
Formal Explanations for Neuro-Symbolic AI
Authors:
Sushmita Paul,
Jinqiang Yu,
Jip J. Dekker,
Alexey Ignatiev,
Peter J. Stuckey
Abstract:
Despite the practical success of Artificial Intelligence (AI), current neural AI algorithms face two significant issues. First, the decisions made by neural architectures are often prone to bias and brittleness. Second, when a chain of reasoning is required, neural systems often perform poorly. Neuro-symbolic artificial intelligence is a promising approach that tackles these (and other) weaknesses…
▽ More
Despite the practical success of Artificial Intelligence (AI), current neural AI algorithms face two significant issues. First, the decisions made by neural architectures are often prone to bias and brittleness. Second, when a chain of reasoning is required, neural systems often perform poorly. Neuro-symbolic artificial intelligence is a promising approach that tackles these (and other) weaknesses by combining the power of neural perception and symbolic reasoning. Meanwhile, the success of AI has made it critical to understand its behaviour, leading to the development of explainable artificial intelligence (XAI). While neuro-symbolic AI systems have important advantages over purely neural AI, we still need to explain their actions, which are obscured by the interactions of the neural and symbolic components. To address the issue, this paper proposes a formal approach to explaining the decisions of neuro-symbolic systems. The approach hinges on the use of formal abductive explanations and on solving the neuro-symbolic explainability problem hierarchically. Namely, it first computes a formal explanation for the symbolic component of the system, which serves to identify a subset of the individual parts of neural information that needs to be explained. This is followed by explaining only those individual neural inputs, independently of each other, which facilitates succinctness of hierarchical formal explanations and helps to increase the overall performance of the approach. Experimental results for a few complex reasoning tasks demonstrate practical efficiency of the proposed approach, in comparison to purely neural systems, from the perspective of explanation size, explanation time, training time, model sizes, and the quality of explanations reported.
△ Less
Submitted 18 October, 2024;
originally announced October 2024.
-
Core Tokensets for Data-efficient Sequential Training of Transformers
Authors:
Subarnaduti Paul,
Manuel Brack,
Patrick Schramowski,
Kristian Kersting,
Martin Mundt
Abstract:
Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points - formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer…
▽ More
Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points - formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16x16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. Our respectively named core tokensets both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1\% of the data performs comparably to at least a twice as large and up to 10 times larger coreset.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
A Sinkhorn Regularized Adversarial Network for Image Guided DEM Super-resolution using Frequency Selective Hybrid Graph Transformer
Authors:
Subhajit Paul,
Ashutosh Gupta
Abstract:
Digital Elevation Model (DEM) is an essential aspect in the remote sensing (RS) domain to analyze various applications related to surface elevations. Here, we address the generation of high-resolution (HR) DEMs using HR multi-spectral (MX) satellite imagery as a guide by introducing a novel hybrid transformer model consisting of Densely connected Multi-Residual Block (DMRB) and multi-headed Freque…
▽ More
Digital Elevation Model (DEM) is an essential aspect in the remote sensing (RS) domain to analyze various applications related to surface elevations. Here, we address the generation of high-resolution (HR) DEMs using HR multi-spectral (MX) satellite imagery as a guide by introducing a novel hybrid transformer model consisting of Densely connected Multi-Residual Block (DMRB) and multi-headed Frequency Selective Graph Attention (M-FSGA). To promptly regulate this process, we utilize the notion of discriminator spatial maps as the conditional attention to the MX guide. Further, we present a novel adversarial objective related to optimizing Sinkhorn distance with classical GAN. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. Based on our experiments on 4 different DEM datasets, we demonstrate both qualitative and quantitative comparisons with available baseline methods and show that the performance of our proposed model is superior to others with sharper details and minimal errors.
△ Less
Submitted 21 September, 2024;
originally announced September 2024.
-
Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data
Authors:
Sneha Paul,
Zachary Patterson,
Nizar Bouguila
Abstract:
Semi-supervised learning (SSL) has shown its effectiveness in learning effective 3D representation from a small amount of labelled data while utilizing large unlabelled data. Traditional semi-supervised approaches rely on the fundamental concept of predicting pseudo-labels for unlabelled data and incorporating them into the learning process. However, we identify that the existing methods do not fu…
▽ More
Semi-supervised learning (SSL) has shown its effectiveness in learning effective 3D representation from a small amount of labelled data while utilizing large unlabelled data. Traditional semi-supervised approaches rely on the fundamental concept of predicting pseudo-labels for unlabelled data and incorporating them into the learning process. However, we identify that the existing methods do not fully utilize all the unlabelled samples and consequently limit their potential performance. To address this issue, we propose AllMatch, a novel SSL-based 3D classification framework that effectively utilizes all the unlabelled samples. AllMatch comprises three modules: (1) an adaptive hard augmentation module that applies relatively hard augmentations to the high-confident unlabelled samples with lower loss values, thereby enhancing the contribution of such samples, (2) an inverse learning module that further improves the utilization of unlabelled data by learning what not to learn, and (3) a contrastive learning module that ensures learning from all the samples in both supervised and unsupervised settings. Comprehensive experiments on two popular 3D datasets demonstrate a performance improvement of up to 11.2% with 1% labelled data, surpassing the SOTA by a significant margin. Furthermore, AllMatch exhibits its efficiency in effectively leveraging all the unlabelled data, demonstrated by the fact that only 10% of labelled data reaches nearly the same performance as fully-supervised learning with all labelled data. The code of our work is available at: https://github.com/snehaputul/AllMatch.
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
F2former: When Fractional Fourier Meets Deep Wiener Deconvolution and Selective Frequency Transformer for Image Deblurring
Authors:
Subhajit Paul,
Sahil Kumawat,
Ashutosh Gupta,
Deepak Mishra
Abstract:
Recent progress in image deblurring techniques focuses mainly on operating in both frequency and spatial domains using the Fourier transform (FT) properties. However, their performance is limited due to the dependency of FT on stationary signals and its lack of capability to extract spatial-frequency properties. In this paper, we propose a novel approach based on the Fractional Fourier Transform (…
▽ More
Recent progress in image deblurring techniques focuses mainly on operating in both frequency and spatial domains using the Fourier transform (FT) properties. However, their performance is limited due to the dependency of FT on stationary signals and its lack of capability to extract spatial-frequency properties. In this paper, we propose a novel approach based on the Fractional Fourier Transform (FRFT), a unified spatial-frequency representation leveraging both spatial and frequency components simultaneously, making it ideal for processing non-stationary signals like images. Specifically, we introduce a Fractional Fourier Transformer (F2former), where we combine the classical fractional Fourier based Wiener deconvolution (F2WD) as well as a multi-branch encoder-decoder transformer based on a new fractional frequency aware transformer block (F2TB). We design F2TB consisting of a fractional frequency aware self-attention (F2SA) to estimate element-wise product attention based on important frequency components and a novel feed-forward network based on frequency division multiplexing (FM-FFN) to refine high and low frequency features separately for efficient latent clear image restoration. Experimental results for the cases of both motion deblurring as well as defocus deblurring show that the performance of our proposed method is superior to other state-of-the-art (SOTA) approaches.
△ Less
Submitted 3 September, 2024;
originally announced September 2024.
-
Learning Robust Representations for Communications over Noisy Channels
Authors:
Sudharsan Senthil,
Shubham Paul,
Nambi Seshadri,
R. David Koilpillai
Abstract:
We explore the use of FCNNs (Fully Connected Neural Networks) for designing end-to-end communication systems without taking any inspiration from existing classical communications models or error control coding. This work relies solely on the tools of information theory and machine learning. We investigate the impact of using various cost functions based on mutual information and pairwise distances…
▽ More
We explore the use of FCNNs (Fully Connected Neural Networks) for designing end-to-end communication systems without taking any inspiration from existing classical communications models or error control coding. This work relies solely on the tools of information theory and machine learning. We investigate the impact of using various cost functions based on mutual information and pairwise distances between codewords to generate robust representations for transmission under strict power constraints. Additionally, we introduce a novel encoder structure inspired by the Barlow Twins framework. Our results show that iterative training with randomly chosen noise power levels while minimizing block error rate provides the best error performance.
△ Less
Submitted 7 September, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
-
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
Authors:
Chansung Park,
Juyong Jiang,
Fan Wang,
Sayak Paul,
Jing Tang
Abstract:
The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable m…
▽ More
The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is enhanced by further fine-tuning with additional similar data created by the service LLM. This iterative process guarantees that the smaller model can eventually match or even surpass the service LLM's capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at https://github.com/deep-diver/llamaduo.
△ Less
Submitted 28 August, 2024; v1 submitted 24 August, 2024;
originally announced August 2024.
-
InfLocNet: Enhanced Lung Infection Localization and Disease Detection from Chest X-Ray Images Using Lightweight Deep Learning
Authors:
Md. Asiful Islam Miah,
Shourin Paul,
Sunanda Das,
M. M. A. Hashem
Abstract:
In recent years, the integration of deep learning techniques into medical imaging has revolutionized the diagnosis and treatment of lung diseases, particularly in the context of COVID-19 and pneumonia. This paper presents a novel, lightweight deep learning based segmentation-classification network designed to enhance the detection and localization of lung infections using chest X-ray images. By le…
▽ More
In recent years, the integration of deep learning techniques into medical imaging has revolutionized the diagnosis and treatment of lung diseases, particularly in the context of COVID-19 and pneumonia. This paper presents a novel, lightweight deep learning based segmentation-classification network designed to enhance the detection and localization of lung infections using chest X-ray images. By leveraging the power of transfer learning with pre-trained VGG-16 weights, our model achieves robust performance even with limited training data. The architecture incorporates refined skip connections within the UNet++ framework, reducing semantic gaps and improving precision in segmentation tasks. Additionally, a classification module is integrated at the end of the encoder block, enabling simultaneous classification and segmentation. This dual functionality enhances the model's versatility, providing comprehensive diagnostic insights while optimizing computational efficiency. Experimental results demonstrate that our proposed lightweight network outperforms existing methods in terms of accuracy and computational requirements, making it a viable solution for real-time and resource constrained medical imaging applications. Furthermore, the streamlined design facilitates easier hyperparameter tuning and deployment on edge devices. This work underscores the potential of advanced deep learning architectures in improving clinical outcomes through precise and efficient medical image analysis. Our model achieved remarkable results with an Intersection over Union (IoU) of 93.59% and a Dice Similarity Coefficient (DSC) of 97.61% in lung area segmentation, and an IoU of 97.67% and a DSC of 87.61% for infection region localization. Additionally, it demonstrated high accuracy of 93.86% and sensitivity of 89.55% in detecting chest diseases, highlighting its efficacy and reliability.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
Process-constrained batch Bayesian approaches for yield optimization in multi-reactor systems
Authors:
Markus Grimm,
Sébastien Paul,
Pierre Chainais
Abstract:
The optimization of yields in multi-reactor systems, which are advanced tools in heterogeneous catalysis research, presents a significant challenge due to hierarchical technical constraints. To this respect, this work introduces a novel approach called process-constrained batch Bayesian optimization via Thompson sampling (pc-BO-TS) and its generalized hierarchical extension (hpc-BO-TS). This metho…
▽ More
The optimization of yields in multi-reactor systems, which are advanced tools in heterogeneous catalysis research, presents a significant challenge due to hierarchical technical constraints. To this respect, this work introduces a novel approach called process-constrained batch Bayesian optimization via Thompson sampling (pc-BO-TS) and its generalized hierarchical extension (hpc-BO-TS). This method, tailored for the efficiency demands in multi-reactor systems, integrates experimental constraints and balances between exploration and exploitation in a sequential batch optimization strategy. It offers an improvement over other Bayesian optimization methods. The performance of pc-BO-TS and hpc-BO-TS is validated in synthetic cases as well as in a realistic scenario based on data obtained from high-throughput experiments done on a multi-reactor system available in the REALCAT platform. The proposed methods often outperform other sequential Bayesian optimizations and existing process-constrained batch Bayesian optimization methods. This work proposes a novel approach to optimize the yield of a reaction in a multi-reactor system, marking a significant step forward in digital catalysis and generally in optimization methods for chemical engineering.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Authors:
Gagan Jain,
Nidhi Hegde,
Aditya Kusupati,
Arsha Nagrani,
Shyamal Buch,
Prateek Jain,
Anurag Arnab,
Sujoy Paul
Abstract:
The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate…
▽ More
The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve equivalent performance as the baseline models, while reducing inference time compute by over two-fold. We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE$'$s adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
△ Less
Submitted 30 July, 2024; v1 submitted 29 July, 2024;
originally announced July 2024.
-
Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence
Authors:
Zahidul Islam,
Sujoy Paul,
Mrigank Rochan
Abstract:
With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level an…
▽ More
With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome this, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose a novel unsupervised approach which capitalizes on the premise that significant moments tend to recur across multiple videos of the similar category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we also compute visual pseudo-highlight scores for each video using visual features. Then, we combine audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlight of each video for training an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks showcase the superior performance of our method over prior work.
△ Less
Submitted 14 May, 2025; v1 submitted 18 July, 2024;
originally announced July 2024.
-
LookupViT: Compressing visual information to a limited number of tokens
Authors:
Rajat Koner,
Gagan Jain,
Prateek Jain,
Volker Tresp,
Sujoy Paul
Abstract:
Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually spa…
▽ More
Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, that aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general purpose vision transformer block that operates by compressing information from higher resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages - (a) easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) applicable to standard ViT and its variants, thus generalizes to various tasks, (c) can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We show LookupViT's effectiveness on multiple domains - (a) for image-classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics400 and Something-Something V2), (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides $2\times$ reduction in FLOPs while upholding or improving accuracy across these domains. In addition, LookupViT also demonstrates out-of-the-box robustness and generalization on image classification (ImageNet-C,R,A,O), improving by up to $4\%$ over ViT.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
A Graph-based Adversarial Imitation Learning Framework for Reliable & Realtime Fleet Scheduling in Urban Air Mobility
Authors:
Prithvi Poddar,
Steve Paul,
Souma Chowdhury
Abstract:
The advent of Urban Air Mobility (UAM) presents the scope for a transformative shift in the domain of urban transportation. However, its widespread adoption and economic viability depends in part on the ability to optimally schedule the fleet of aircraft across vertiports in a UAM network, under uncertainties attributed to airspace congestion, changing weather conditions, and varying demands. This…
▽ More
The advent of Urban Air Mobility (UAM) presents the scope for a transformative shift in the domain of urban transportation. However, its widespread adoption and economic viability depends in part on the ability to optimally schedule the fleet of aircraft across vertiports in a UAM network, under uncertainties attributed to airspace congestion, changing weather conditions, and varying demands. This paper presents a comprehensive optimization formulation of the fleet scheduling problem, while also identifying the need for alternate solution approaches, since directly solving the resulting integer nonlinear programming problem is computationally prohibitive for daily fleet scheduling. Previous work has shown the effectiveness of using (graph) reinforcement learning (RL) approaches to train real-time executable policy models for fleet scheduling. However, such policies can often be brittle on out-of-distribution scenarios or edge cases. Moreover, training performance also deteriorates as the complexity (e.g., number of constraints) of the problem increases. To address these issues, this paper presents an imitation learning approach where the RL-based policy exploits expert demonstrations yielded by solving the exact optimization using a Genetic Algorithm. The policy model comprises Graph Neural Network (GNN) based encoders that embed the space of vertiports and aircraft, Transformer networks to encode demand, passenger fare, and transport cost profiles, and a Multi-head attention (MHA) based decoder. Expert demonstrations are used through the Generative Adversarial Imitation Learning (GAIL) algorithm. Interfaced with a UAM simulation environment involving 8 vertiports and 40 aircrafts, in terms of the daily profits earned reward, the new imitative approach achieves better mean performance and remarkable improvement in the case of unseen worst-case scenarios, compared to pure RL results.
△ Less
Submitted 5 September, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Muzzle-Based Cattle Identification System Using Artificial Intelligence (AI)
Authors:
Hasan Zohirul Islam,
Safayet Khan,
Sanjib Kumar Paul,
Sheikh Imtiaz Rahi,
Fahim Hossain Sifat,
Md. Mahadi Hasan Sany,
Md. Shahjahan Ali Sarker,
Tareq Anam,
Ismail Hossain Polas
Abstract:
Absence of tamper-proof cattle identification technology was a significant problem preventing insurance companies from providing livestock insurance. This lack of technology had devastating financial consequences for marginal farmers as they did not have the opportunity to claim compensation for any unexpected events such as the accidental death of cattle in Bangladesh. Using machine learning and…
▽ More
Absence of tamper-proof cattle identification technology was a significant problem preventing insurance companies from providing livestock insurance. This lack of technology had devastating financial consequences for marginal farmers as they did not have the opportunity to claim compensation for any unexpected events such as the accidental death of cattle in Bangladesh. Using machine learning and deep learning algorithms, we have solved the bottleneck of cattle identification by developing and introducing a muzzle-based cattle identification system. The uniqueness of cattle muzzles has been scientifically established, which resembles human fingerprints. This is the fundamental premise that prompted us to develop a cattle identification system that extracts the uniqueness of cattle muzzles. For this purpose, we collected 32,374 images from 826 cattle. Contrast-limited adaptive histogram equalization (CLAHE) with sharpening filters was applied in the preprocessing steps to remove noise from images. We used the YOLO algorithm for cattle muzzle detection in the image and the FaceNet architecture to learn unified embeddings from muzzle images using squared $L_2$ distances. Our system performs with an accuracy of $96.489\%$, $F_1$ score of $97.334\%$, and a true positive rate (tpr) of $87.993\%$ at a remarkably low false positive rate (fpr) of $0.098\%$. This reliable and efficient system for identifying cattle can significantly advance livestock insurance and precision farming.
△ Less
Submitted 9 October, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning
Authors:
Abhinav Joshi,
Shounak Paul,
Akshat Sharma,
Pawan Goyal,
Saptarshi Ghosh,
Ashutosh Modi
Abstract:
Legal systems worldwide are inundated with exponential growth in cases and documents. There is an imminent need to develop NLP and ML techniques for automatically processing and understanding legal documents to streamline the legal system. However, evaluating and comparing various NLP models designed specifically for the legal domain is challenging. This paper addresses this challenge by proposing…
▽ More
Legal systems worldwide are inundated with exponential growth in cases and documents. There is an imminent need to develop NLP and ML techniques for automatically processing and understanding legal documents to streamline the legal system. However, evaluating and comparing various NLP models designed specifically for the legal domain is challenging. This paper addresses this challenge by proposing IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning. IL-TUR contains monolingual (English, Hindi) and multi-lingual (9 Indian languages) domain-specific tasks that address different aspects of the legal system from the point of view of understanding and reasoning over Indian legal documents. We present baseline models (including LLM-based) for each task, outlining the gap between models and the ground truth. To foster further research in the legal domain, we create a leaderboard (available at: https://exploration-lab.github.io/IL-TUR/) where the research community can upload and compare legal text understanding systems.
△ Less
Submitted 26 November, 2024; v1 submitted 7 July, 2024;
originally announced July 2024.
-
Remembering Everything Makes You Vulnerable: A Limelight on Machine Unlearning for Personalized Healthcare Sector
Authors:
Ahan Chatterjee,
Sai Anirudh Aryasomayajula,
Rajat Chaudhari,
Subhajit Paul,
Vishwa Mohan Singh
Abstract:
As the prevalence of data-driven technologies in healthcare continues to rise, concerns regarding data privacy and security become increasingly paramount. This thesis aims to address the vulnerability of personalized healthcare models, particularly in the context of ECG monitoring, to adversarial attacks that compromise patient privacy. We propose an approach termed "Machine Unlearning" to mitigat…
▽ More
As the prevalence of data-driven technologies in healthcare continues to rise, concerns regarding data privacy and security become increasingly paramount. This thesis aims to address the vulnerability of personalized healthcare models, particularly in the context of ECG monitoring, to adversarial attacks that compromise patient privacy. We propose an approach termed "Machine Unlearning" to mitigate the impact of exposed data points on machine learning models, thereby enhancing model robustness against adversarial attacks while preserving individual privacy. Specifically, we investigate the efficacy of Machine Unlearning in the context of personalized ECG monitoring, utilizing a dataset of clinical ECG recordings. Our methodology involves training a deep neural classifier on ECG data and fine-tuning the model for individual patients. We demonstrate the susceptibility of fine-tuned models to adversarial attacks, such as the Fast Gradient Sign Method (FGSM), which can exploit additional data points in personalized models. To address this vulnerability, we propose a Machine Unlearning algorithm that selectively removes sensitive data points from fine-tuned models, effectively enhancing model resilience against adversarial manipulation. Experimental results demonstrate the effectiveness of our approach in mitigating the impact of adversarial attacks while maintaining the pre-trained model accuracy.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Towards Physically Talented Aerial Robots with Tactically Smart Swarm Behavior thereof: An Efficient Co-design Approach
Authors:
Prajit KrisshnaKumar,
Steve Paul,
Hemanth Manjunatha,
Mary Corra,
Ehsan Esfahani,
Souma Chowdhury
Abstract:
The collective performance or capacity of collaborative autonomous systems such as a swarm of robots is jointly influenced by the morphology and the behavior of individual systems in that collective. In that context, this paper explores how morphology impacts the learned tactical behavior of unmanned aerial/ground robots performing reconnaissance and search & rescue. This is achieved by presenting…
▽ More
The collective performance or capacity of collaborative autonomous systems such as a swarm of robots is jointly influenced by the morphology and the behavior of individual systems in that collective. In that context, this paper explores how morphology impacts the learned tactical behavior of unmanned aerial/ground robots performing reconnaissance and search & rescue. This is achieved by presenting a computationally efficient framework to solve this otherwise challenging problem of jointly optimizing the morphology and tactical behavior of swarm robots. Key novel developments to this end include the use of physical talent metrics and modification of graph reinforcement learning architectures to allow joint learning of the swarm tactical policy and the talent metrics (search speed, flight range, and cruising speed) that constrain mobility and object/victim search capabilities of the aerial robots executing these tactics. Implementation of this co-design approach is supported by advancements to an open-source Pybullet-based swarm simulator that allows the use of variable aerial asset capabilities. The results of the co-design are observed to outperform those of tactics learning with a fixed Pareto design, when compared in terms of mission performance metrics. Significant differences in morphology and learned behavior are also observed by comparing the baseline design and the co-design outcomes.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
From Pixels to Prose: A Large Dataset of Dense Image Captions
Authors:
Vasu Singla,
Kaiyu Yue,
Sukriti Paul,
Reza Shirkavand,
Mayuka Jayawardhana,
Alireza Ganjdanesh,
Heng Huang,
Abhinav Bhatele,
Gowthami Somepalli,
Tom Goldstein
Abstract:
Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure d…
▽ More
Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose
△ Less
Submitted 14 June, 2024;
originally announced June 2024.