Search | arXiv e-print repository

arXiv:2507.02092 [pdf, ps, other]

Energy-Based Transformers are Scalable Learners and Thinkers

Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal

Abstract: Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pret… ▽ More Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01788 [pdf, ps, other]

Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging

Authors: Montasir Shams, Chashi Mahiul Islam, Shaeke Salman, Phat Tran, Xiuwen Liu

Abstract: Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produc… ▽ More Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60\%. %. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: 9 pages

arXiv:2506.23714 [pdf, ps, other]

Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Authors: Md Moinul Islam, Sofoklis Kakouros, Janne Heikkilä, Mourad Oussalah

Abstract: The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues an… ▽ More The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive method, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: Accepted to HHAI WS 2025: Workshops at the Fourth International Conference on Hybrid Human-Artificial Intelligence (HHAI)

arXiv:2506.22742 [pdf, ps, other]

RAILS: Retrieval-Augmented Intelligence for Learning Software Development

Authors: Wali Mohammad Abdullah, Md. Morshedul Islam, Devraj Parmar, Happy Hasmukhbhai Patel, Sindhuja Prabhakaran, Baidya Saha

Abstract: Large Language Models (LLMs) like GPT-3.5-Turbo are increasingly used to assist software development, yet they often produce incomplete code or incorrect imports, especially when lacking access to external or project-specific documentation. We introduce RAILS (Retrieval-Augmented Intelligence for Learning Software Development), a framework that augments LLM prompts with semantically retrieved cont… ▽ More Large Language Models (LLMs) like GPT-3.5-Turbo are increasingly used to assist software development, yet they often produce incomplete code or incorrect imports, especially when lacking access to external or project-specific documentation. We introduce RAILS (Retrieval-Augmented Intelligence for Learning Software Development), a framework that augments LLM prompts with semantically retrieved context from curated Java resources using FAISS and OpenAI embeddings. RAILS incorporates an iterative validation loop guided by compiler feedback to refine suggestions. We evaluated RAILS on 78 real-world Java import error cases spanning standard libraries, GUI APIs, external tools, and custom utilities. Despite using the same LLM, RAILS outperforms baseline prompting by preserving intent, avoiding hallucinations, and surfacing correct imports even when libraries are unavailable locally. Future work will integrate symbolic filtering via PostgreSQL and extend support to other languages and IDEs. △ Less

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.19870 [pdf]

doi 10.63332/joph.v5i6.2198

Secure Energy Transactions Using Blockchain Leveraging AI for Fraud Detection and Energy Market Stability

Authors: Md Asif Ul Hoq Khan, MD Zahedul Islam, Istiaq Ahmed, Md Masud Karim Rabbi, Farhana Rahman Anonna, MD Abdul Fahim Zeeshan, Mehedi Hasan Ridoy, Bivash Ranjan Chowdhury, Md Nazmul Shakir Rabbi, GM Alamin Sadnan

Abstract: Peer-to-peer trading and the move to decentralized grids have reshaped the energy markets in the United States. Notwithstanding, such developments lead to new challenges, mainly regarding the safety and authenticity of energy trade. This study aimed to develop and build a secure, intelligent, and efficient energy transaction system for the decentralized US energy market. This research interlinks t… ▽ More Peer-to-peer trading and the move to decentralized grids have reshaped the energy markets in the United States. Notwithstanding, such developments lead to new challenges, mainly regarding the safety and authenticity of energy trade. This study aimed to develop and build a secure, intelligent, and efficient energy transaction system for the decentralized US energy market. This research interlinks the technological prowess of blockchain and artificial intelligence (AI) in a novel way to solve long-standing challenges in the distributed energy market, specifically those of security, fraudulent behavior detection, and market reliability. The dataset for this research is comprised of more than 1.2 million anonymized energy transaction records from a simulated peer-to-peer (P2P) energy exchange network emulating real-life blockchain-based American microgrids, including those tested by LO3 Energy and Grid+ Labs. Each record contains detailed fields of transaction identifier, timestamp, energy volume (kWh), transaction type (buy/sell), unit price, prosumer/consumer identifier (hashed for privacy), smart meter readings, geolocation regions, and settlement confirmation status. The dataset also includes system-calculated behavior metrics of transaction rate, variability of energy production, and historical pricing patterns. The system architecture proposed involves the integration of two layers, namely a blockchain layer and artificial intelligence (AI) layer, each playing a unique but complementary function in energy transaction securing and market intelligence improvement. The machine learning models used in this research were specifically chosen for their established high performance in classification tasks, specifically in the identification of energy transaction fraud in decentralized markets. △ Less

Submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.18927 [pdf, ps, other]

From Tiny Machine Learning to Tiny Deep Learning: A Survey

Authors: Shriyank Somvanshi, Md Monzurul Islam, Gaurab Chhetri, Rohit Chakraborty, Mahmuda Sultana Mimi, Sawgat Ahmed Shuvo, Kazi Sifatul Islam, Syed Aaqib Javed, Sharif Ahmed Rafat, Anandi Dutta, Subasish Das

Abstract: The rapid growth of edge devices has driven the demand for deploying artificial intelligence (AI) at the edge, giving rise to Tiny Machine Learning (TinyML) and its evolving counterpart, Tiny Deep Learning (TinyDL). While TinyML initially focused on enabling simple inference tasks on microcontrollers, the emergence of TinyDL marks a paradigm shift toward deploying deep learning models on severely… ▽ More The rapid growth of edge devices has driven the demand for deploying artificial intelligence (AI) at the edge, giving rise to Tiny Machine Learning (TinyML) and its evolving counterpart, Tiny Deep Learning (TinyDL). While TinyML initially focused on enabling simple inference tasks on microcontrollers, the emergence of TinyDL marks a paradigm shift toward deploying deep learning models on severely resource-constrained hardware. This survey presents a comprehensive overview of the transition from TinyML to TinyDL, encompassing architectural innovations, hardware platforms, model optimization techniques, and software toolchains. We analyze state-of-the-art methods in quantization, pruning, and neural architecture search (NAS), and examine hardware trends from MCUs to dedicated neural accelerators. Furthermore, we categorize software deployment frameworks, compilers, and AutoML tools enabling practical on-device learning. Applications across domains such as computer vision, audio recognition, healthcare, and industrial monitoring are reviewed to illustrate the real-world impact of TinyDL. Finally, we identify emerging directions including neuromorphic computing, federated TinyDL, edge-native foundation models, and domain-specific co-design approaches. This survey aims to serve as a foundational resource for researchers and practitioners, offering a holistic view of the ecosystem and laying the groundwork for future advancements in edge AI. △ Less

Submitted 25 June, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.16009 [pdf]

Bridging Brain with Foundation Models through Self-Supervised Learning

Authors: Hamdi Altaheri, Fakhri Karray, Md. Milon Islam, S M Taslim Uddin Raju, Amir-Hossein Karimi

Abstract: Foundation models (FMs), powered by self-supervised learning (SSL), have redefined the capabilities of artificial intelligence, demonstrating exceptional performance in domains like natural language processing and computer vision. These advances present a transformative opportunity for brain signal analysis. Unlike traditional supervised learning, which is limited by the scarcity of labeled neural… ▽ More Foundation models (FMs), powered by self-supervised learning (SSL), have redefined the capabilities of artificial intelligence, demonstrating exceptional performance in domains like natural language processing and computer vision. These advances present a transformative opportunity for brain signal analysis. Unlike traditional supervised learning, which is limited by the scarcity of labeled neural data, SSL offers a promising solution by enabling models to learn meaningful representations from unlabeled data. This is particularly valuable in addressing the unique challenges of brain signals, including high noise levels, inter-subject variability, and low signal-to-noise ratios. This survey systematically reviews the emerging field of bridging brain signals with foundation models through the innovative application of SSL. It explores key SSL techniques, the development of brain-specific foundation models, their adaptation to downstream tasks, and the integration of brain signals with other modalities in multimodal SSL frameworks. The review also covers commonly used evaluation metrics and benchmark datasets that support comparative analysis. Finally, it highlights key challenges and outlines future research directions. This work aims to provide researchers with a structured understanding of this rapidly evolving field and a roadmap for developing generalizable brain foundation models powered by self-supervision. △ Less

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.14629 [pdf, ps, other]

VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning

Authors: Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Muhammad Ziaur Rahman, Shahanur Rahman Bappy, Raiyan Rahman, Swakkhar Shatabda

Abstract: Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for objec… ▽ More Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito △ Less

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.11451 [pdf, ps, other]

Understanding the Issue Types in Open Source Blockchain-based Software Projects with the Transformer-based BERTopic

Authors: Md Nahidul Islam Opu, Md Shahidul Islam, Sara Rouhani, Shaiful Chowdhury

Abstract: Blockchain-based software systems are increasingly deployed across diverse domains, yet a systematic understanding of their development challenges remains limited. This paper presents a large-scale empirical study of 497,742 issues mined from 1,209 open-source blockchain projects hosted on GitHub. Employing BERTopic, a transformer-based topic modeling technique, we identify 49 distinct issue topic… ▽ More Blockchain-based software systems are increasingly deployed across diverse domains, yet a systematic understanding of their development challenges remains limited. This paper presents a large-scale empirical study of 497,742 issues mined from 1,209 open-source blockchain projects hosted on GitHub. Employing BERTopic, a transformer-based topic modeling technique, we identify 49 distinct issue topics and organize them hierarchically into 11 major subcategories. Our analysis reveals that both general software development issues and blockchain-specific concerns are nearly equally represented, with Wallet Management and UI Enhancement emerging as the most prominent topics. We further examine the temporal evolution of issue categories and resolution times, finding that Wallet issues not only dominate in frequency but also exhibit the longest resolution time. Conversely, Mechanisms issues are resolved significantly faster. Issue frequency surged after 2016 with the rise of Ethereum and decentralized applications, but declined after 2022. These findings enhance our understanding of blockchain software maintenance, informing the development of specialized tools and practices to improve robustness and maintainability. △ Less

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.08051 [pdf, ps, other]

ST-GraphNet: A Spatio-Temporal Graph Neural Network for Understanding and Predicting Automated Vehicle Crash Severity

Authors: Mahmuda Sultana Mimi, Md Monzurul Islam, Anannya Ghosh Tusti, Shriyank Somvanshi, Subasish Das

Abstract: Understanding the spatial and temporal dynamics of automated vehicle (AV) crash severity is critical for advancing urban mobility safety and infrastructure planning. In this work, we introduce ST-GraphNet, a spatio-temporal graph neural network framework designed to model and predict AV crash severity by using both fine-grained and region-aggregated spatial graphs. Using a balanced dataset of 2,35… ▽ More Understanding the spatial and temporal dynamics of automated vehicle (AV) crash severity is critical for advancing urban mobility safety and infrastructure planning. In this work, we introduce ST-GraphNet, a spatio-temporal graph neural network framework designed to model and predict AV crash severity by using both fine-grained and region-aggregated spatial graphs. Using a balanced dataset of 2,352 real-world AV-related crash reports from Texas (2024), including geospatial coordinates, crash timestamps, SAE automation levels, and narrative descriptions, we construct two complementary graph representations: (1) a fine-grained graph with individual crash events as nodes, where edges are defined via spatio-temporal proximity; and (2) a coarse-grained graph where crashes are aggregated into Hexagonal Hierarchical Spatial Indexing (H3)-based spatial cells, connected through hexagonal adjacency. Each node in the graph is enriched with multimodal data, including semantic, spatial, and temporal attributes, including textual embeddings from crash narratives using a pretrained Sentence-BERT model. We evaluate various graph neural network (GNN) architectures, such as Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Dynamic Spatio-Temporal GCN (DSTGCN), to classify crash severity and predict high-risk regions. Our proposed ST-GraphNet, which utilizes a DSTGCN backbone on the coarse-grained H3 graph, achieves a test accuracy of 97.74\%, substantially outperforming the best fine-grained model (64.7\% test accuracy). These findings highlight the effectiveness of spatial aggregation, dynamic message passing, and multi-modal feature integration in capturing the complex spatio-temporal patterns underlying AV crash severity. △ Less

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.06989 [pdf, ps, other]

Correcting for Position Bias in Learning to Rank: A Control Function Approach

Authors: Md Aminul Islam, Kathryn Vasilaky, Elena Zheleva

Abstract: Implicit feedback data, such as user clicks, is commonly used in learning-to-rank (LTR) systems because it is easy to collect and it often reflects user preferences. However, this data is prone to various biases, and training an LTR system directly on biased data can result in suboptimal ranking performance. One of the most prominent and well-studied biases in implicit feedback data is position bi… ▽ More Implicit feedback data, such as user clicks, is commonly used in learning-to-rank (LTR) systems because it is easy to collect and it often reflects user preferences. However, this data is prone to various biases, and training an LTR system directly on biased data can result in suboptimal ranking performance. One of the most prominent and well-studied biases in implicit feedback data is position bias, which occurs because users are more likely to interact with higher-ranked documents regardless of their true relevance. In this paper, we propose a novel control function-based method that accounts for position bias in a two-stage process. The first stage uses exogenous variation from the residuals of the ranking process to correct for position bias in the second stage click equation. Unlike previous position bias correction methods, our method does not require knowledge of the click or propensity model and allows for nonlinearity in the underlying ranking model. Moreover, our method is general and allows for debiasing any state-of-the-art ranking algorithm by plugging it into the second stage. We also introduce a technique to debias validation clicks for hyperparameter tuning to select the optimal model in the absence of unbiased validation data. Experimental results demonstrate that our method outperforms state-of-the-art approaches in correcting for position bias. △ Less

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.04696 [pdf, ps, other]

Enhanced Drought Analysis in Bangladesh: A Machine Learning Approach for Severity Classification Using Satellite Data

Authors: Tonmoy Paul, Mrittika Devi Mati, Md. Mahmudul Islam

Abstract: Drought poses a pervasive environmental challenge in Bangladesh, impacting agriculture, socio-economic stability, and food security due to its unique geographic and anthropogenic vulnerabilities. Traditional drought indices, such as the Standardized Precipitation Index (SPI) and Palmer Drought Severity Index (PDSI), often overlook crucial factors like soil moisture and temperature, limiting their… ▽ More Drought poses a pervasive environmental challenge in Bangladesh, impacting agriculture, socio-economic stability, and food security due to its unique geographic and anthropogenic vulnerabilities. Traditional drought indices, such as the Standardized Precipitation Index (SPI) and Palmer Drought Severity Index (PDSI), often overlook crucial factors like soil moisture and temperature, limiting their resolution. Moreover, current machine learning models applied to drought prediction have been underexplored in the context of Bangladesh, lacking a comprehensive integration of satellite data across multiple districts. To address these gaps, we propose a satellite data-driven machine learning framework to classify drought across 38 districts of Bangladesh. Using unsupervised algorithms like K-means and Bayesian Gaussian Mixture for clustering, followed by classification models such as KNN, Random Forest, Decision Tree, and Naive Bayes, the framework integrates weather data (humidity, soil moisture, temperature) from 2012-2024. This approach successfully classifies drought severity into different levels. However, it shows significant variabilities in drought vulnerabilities across regions which highlights the aptitude of machine learning models in terms of identifying and predicting drought conditions. △ Less

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.04238 [pdf, other]

A Comprehensive Survey on Bio-Inspired Algorithms: Taxonomy, Applications, and Future Directions

Authors: Shriyank Somvanshi, Md Monzurul Islam, Syed Aaqib Javed, Gaurab Chhetri, Kazi Sifatul Islam, Tausif Islam Chowdhury, Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das

Abstract: Bio-inspired algorithms (BIAs) utilize natural processes such as evolution, swarm behavior, foraging, and plant growth to solve complex, nonlinear, high-dimensional optimization problems. This survey categorizes BIAs into eight groups: evolutionary, swarm intelligence, physics-inspired, ecosystem and plant-based, predator-prey, neural-inspired, human-inspired, and hybrid approaches, and reviews th… ▽ More Bio-inspired algorithms (BIAs) utilize natural processes such as evolution, swarm behavior, foraging, and plant growth to solve complex, nonlinear, high-dimensional optimization problems. This survey categorizes BIAs into eight groups: evolutionary, swarm intelligence, physics-inspired, ecosystem and plant-based, predator-prey, neural-inspired, human-inspired, and hybrid approaches, and reviews their core principles, strengths, and limitations. We illustrate the usage of these algorithms in machine learning, engineering design, bioinformatics, and intelligent systems, and highlight recent advances in hybridization, parameter tuning, and adaptive strategies. Finally, we identify open challenges such as scalability, convergence, reliability, and interpretability to suggest directions for future research. This work aims to serve as a foundational resource for both researchers and practitioners interested in understanding the current landscape and future directions of bio-inspired computing. △ Less

Submitted 25 May, 2025; originally announced June 2025.

arXiv:2506.03191 [pdf, ps, other]

Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward

Authors: Muhammad Islam, Tao Huang, Euijoon Ahn, Usman Naseem

Abstract: This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research inv… ▽ More This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research investigates how textual descriptions can guide the generation of complex, human-like motion sequences. The paper explores various generative approaches, including autoregressive models, diffusion models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based models, by analyzing their strengths and limitations in terms of motion quality, computational efficiency, and adaptability. It highlights recent advances in text-conditioned motion generation, where textual inputs are used to control and refine motion outputs with greater precision. The integration of LLMs further enhances these models by enabling semantic alignment between instructions and motion, improving coherence and contextual relevance. This systematic survey underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies, while addressing ongoing challenges in generating efficient and realistic human motion. △ Less

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.03184 [pdf]

Impact of Tuning Parameters in Deep Convolutional Neural Network Using a Crack Image Dataset

Authors: Mahe Zabin, Ho-Jin Choi, Md. Monirul Islam, Jia Uddin

Abstract: The performance of a classifier depends on the tuning of its parame ters. In this paper, we have experimented the impact of various tuning parameters on the performance of a deep convolutional neural network (DCNN). In the ex perimental evaluation, we have considered a DCNN classifier that consists of 2 convolutional layers (CL), 2 pooling layers (PL), 1 dropout, and a dense layer. To observe the… ▽ More The performance of a classifier depends on the tuning of its parame ters. In this paper, we have experimented the impact of various tuning parameters on the performance of a deep convolutional neural network (DCNN). In the ex perimental evaluation, we have considered a DCNN classifier that consists of 2 convolutional layers (CL), 2 pooling layers (PL), 1 dropout, and a dense layer. To observe the impact of pooling, activation function, and optimizer tuning pa rameters, we utilized a crack image dataset having two classes: negative and pos itive. The experimental results demonstrate that with the maxpooling, the DCNN demonstrates its better performance for adam optimizer and tanh activation func tion. △ Less

Submitted 30 May, 2025; originally announced June 2025.

Comments: 8 pages, 2 figures, published at Proceedings of the 15th KIPS International Conference on Ubiquitous Information Technologies and Applications (CUTE 2021), Jeju, Repubilc of Korea

arXiv:2506.03160 [pdf]

Applying MambaAttention, TabPFN, and TabTransformers to Classify SAE Automation Levels in Crashes

Authors: Shriyank Somvanshi, Anannya Ghosh Tusti, Mahmuda Sultana Mimi, Md Monzurul Islam, Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das

Abstract: The increasing presence of automated vehicles (AVs) presents new challenges for crash classification and safety analysis. Accurately identifying the SAE automation level involved in each crash is essential to understanding crash dynamics and system accountability. However, existing approaches often overlook automation-specific factors and lack model sophistication to capture distinctions between d… ▽ More The increasing presence of automated vehicles (AVs) presents new challenges for crash classification and safety analysis. Accurately identifying the SAE automation level involved in each crash is essential to understanding crash dynamics and system accountability. However, existing approaches often overlook automation-specific factors and lack model sophistication to capture distinctions between different SAE levels. To address this gap, this study evaluates the performance of three advanced tabular deep learning models MambaAttention, TabPFN, and TabTransformer for classifying SAE automation levels using structured crash data from Texas (2024), covering 4,649 cases categorized as Assisted Driving (SAE Level 1), Partial Automation (SAE Level 2), and Advanced Automation (SAE Levels 3-5 combined). Following class balancing using SMOTEENN, the models were trained and evaluated on a unified dataset of 7,300 records. MambaAttention demonstrated the highest overall performance (F1-scores: 88% for SAE 1, 97% for SAE 2, and 99% for SAE 3-5), while TabPFN excelled in zero-shot inference with high robustness for rare crash categories. In contrast, TabTransformer underperformed, particularly in detecting Partial Automation crashes (F1-score: 55%), suggesting challenges in modeling shared human-system control dynamics. These results highlight the capability of deep learning models tailored for tabular data to enhance the accuracy and efficiency of automation-level classification. Integrating such models into crash analysis frameworks can support policy development, AV safety evaluation, and regulatory decisions, especially in distinguishing high-risk conditions for mid- and high-level automation technologies. △ Less

Submitted 22 May, 2025; originally announced June 2025.

arXiv:2506.02312 [pdf]

doi 10.19907/j.0490-6756.240074

Dual encoding feature filtering generalized attention UNET for retinal vessel segmentation

Authors: Md Tauhidul Islam, Wu Da-Wen, Tang Qing-Qing, Zhao Kai-Yang, Yin Teng, Li Yan-Fei, Shang Wen-Yi, Liu Jing-Yu, Zhang Hai-Xian

Abstract: Retinal blood vessel segmentation is crucial for diagnosing ocular and cardiovascular diseases. Although the introduction of U-Net in 2015 by Olaf Ronneberger significantly advanced this field, yet issues like limited training data, imbalance data distribution, and inadequate feature extraction persist, hindering both the segmentation performance and optimal model generalization. Addressing these… ▽ More Retinal blood vessel segmentation is crucial for diagnosing ocular and cardiovascular diseases. Although the introduction of U-Net in 2015 by Olaf Ronneberger significantly advanced this field, yet issues like limited training data, imbalance data distribution, and inadequate feature extraction persist, hindering both the segmentation performance and optimal model generalization. Addressing these critical issues, the DEFFA-Unet is proposed featuring an additional encoder to process domain-invariant pre-processed inputs, thereby improving both richer feature encoding and enhanced model generalization. A feature filtering fusion module is developed to ensure the precise feature filtering and robust hybrid feature fusion. In response to the task-specific need for higher precision where false positives are very costly, traditional skip connections are replaced with the attention-guided feature reconstructing fusion module. Additionally, innovative data augmentation and balancing methods are proposed to counter data scarcity and distribution imbalance, further boosting the robustness and generalization of the model. With a comprehensive suite of evaluation metrics, extensive validations on four benchmark datasets (DRIVE, CHASEDB1, STARE, and HRF) and an SLO dataset (IOSTAR), demonstrate the proposed method's superiority over both baseline and state-of-the-art models. Particularly the proposed method significantly outperforms the compared methods in cross-validation model generalization. △ Less

Submitted 2 June, 2025; originally announced June 2025.

ACM Class: I.4; I.5

Journal ref: J Sichuan Univ: Nat Sci Ed, 2025, 62: 79-95.

arXiv:2506.01587 [pdf, other]

Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings

Authors: Muhammad Islam, Javed Ali Khan, Mohammed Abaker, Ali Daud, Azeem Irshad

Abstract: The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detec… ▽ More The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detection (FND) techniques rely heavily on large, accurately labeled datasets. However, FND in under-resourced languages like Urdu faces substantial challenges due to the scarcity of extensive corpora and the lack of validated lexical resources. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. They also lack human verification, relying mainly on unverified English-to-Urdu translations, which compromises their reliability in practical applications. This study highlights the necessity of developing reliable, expert-verified, and domain-independent Urdu-enhanced FND datasets to improve fake news detection in Urdu and other resource-constrained languages. This paper presents the first benchmark large FND dataset for Urdu news, which is publicly available for validation and deep analysis. We also evaluate this dataset using multiple state-of-the-art pre-trained large language models (LLMs), such as XLNet, mBERT, XLM-RoBERTa, RoBERTa, DistilBERT, and DeBERTa. Additionally, we propose a unified LLM model that outperforms the others with different embedding and feature extraction techniques. The performance of these models is compared based on accuracy, F1 score, precision, recall, and human judgment for vetting the sample results of news. △ Less

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2506.00735 [pdf, ps, other]

Involution-Infused DenseNet with Two-Step Compression for Resource-Efficient Plant Disease Classification

Authors: T. Ahmed, S. Jannat, Md. F. Islam, J. Noor

Abstract: Agriculture is vital for global food security, but crops are vulnerable to diseases that impact yield and quality. While Convolutional Neural Networks (CNNs) accurately classify plant diseases using leaf images, their high computational demands hinder their deployment in resource-constrained settings such as smartphones, edge devices, and real-time monitoring systems. This study proposes a two-ste… ▽ More Agriculture is vital for global food security, but crops are vulnerable to diseases that impact yield and quality. While Convolutional Neural Networks (CNNs) accurately classify plant diseases using leaf images, their high computational demands hinder their deployment in resource-constrained settings such as smartphones, edge devices, and real-time monitoring systems. This study proposes a two-step model compression approach integrating Weight Pruning and Knowledge Distillation, along with the hybridization of DenseNet with Involutional Layers. Pruning reduces model size and computational load, while distillation improves the smaller student models performance by transferring knowledge from a larger teacher network. The hybridization enhances the models ability to capture spatial features efficiently. These compressed models are suitable for real-time applications, promoting precision agriculture through rapid disease identification and crop management. The results demonstrate ResNet50s superior performance post-compression, achieving 99.55% and 98.99% accuracy on the PlantVillage and PaddyLeaf datasets, respectively. The DenseNet-based model, optimized for efficiency, recorded 99.21% and 93.96% accuracy with a minimal parameter count. Furthermore, the hybrid model achieved 98.87% and 97.10% accuracy, supporting the practical deployment of energy-efficient devices for timely disease intervention and sustainable farming practices. △ Less

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.24002 [pdf, ps, other]

DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment

Authors: Vaishnav Ramesh, Junliang Liu, Haining Wang, Md Jahidul Islam

Abstract: A long-held challenge in no-reference image quality assessment (NR-IQA) learning from human subjective perception is the lack of objective generalization to unseen natural distortions. To address this, we integrate a novel Depth-Guided cross-attention and refinement (Depth-CAR) mechanism, which distills scene depth and spatial features into a structure-aware representation for improved NR-IQA. Thi… ▽ More A long-held challenge in no-reference image quality assessment (NR-IQA) learning from human subjective perception is the lack of objective generalization to unseen natural distortions. To address this, we integrate a novel Depth-Guided cross-attention and refinement (Depth-CAR) mechanism, which distills scene depth and spatial features into a structure-aware representation for improved NR-IQA. This brings in the knowledge of object saliency and relative contrast of the scene for more discriminative feature learning. Additionally, we introduce the idea of TCB (Transformer-CNN Bridge) to fuse high-level global contextual dependencies from a transformer backbone with local spatial features captured by a set of hierarchical CNN (convolutional neural network) layers. We implement TCB and Depth-CAR as multimodal attention-based projection functions to select the most informative features, which also improve training time and inference efficiency. Experimental results demonstrate that our proposed DGIQA model achieves state-of-the-art (SOTA) performance on both synthetic and authentic benchmark datasets. More importantly, DGIQA outperforms SOTA models on cross-dataset evaluations as well as in assessing natural image distortions such as low-light effects, hazy conditions, and lens flares. △ Less

Submitted 29 May, 2025; originally announced May 2025.

Comments: 18 pages

arXiv:2505.22305 [pdf, ps, other]

doi 10.1145/3715336.3735754

IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

Authors: Md Touhidul Islam, Imran Kabir, Md Alimoor Reza, Syed Masum Billah

Abstract: We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to… ▽ More We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects": adversarial instances users know are absent, to discern models hallucinating on nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted at DIS'25 (Funchal, Portugal)

arXiv:2505.21715 [pdf, ps, other]

Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2

Authors: Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu, Md. Rakibul Islam

Abstract: The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset… ▽ More The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system utilizes a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies: FedAvg, Krum Aggregation and a novel Loss-aware Federated Averaging (L-FedAvg) were evaluated. Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality. △ Less

Submitted 27 May, 2025; originally announced May 2025.

Comments: Preprint, manuscript under-review

arXiv:2505.20324 [pdf, ps, other]

Evaluating the Energy-Efficiency of the Code Generated by LLMs

Authors: Md Arman Islam, Devi Varaprasad Jonnala, Ritika Rekhi, Pratik Pokharel, Siddharth Cilamkoti, Asif Imran, Tevfik Kosar, Bekir Turkkan

Abstract: As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by… ▽ More As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform by comparing them against canonical human-written solutions. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-generated canonical solutions are approximately 1.17 times more energy efficient than DeepSeek-v3, 1.21 times more energy efficient than GPT-4o, and over 2 times more energy efficient than Grok-2 and Gemini-1.5-Pro. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-generated canonical solutions. △ Less

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.18709 [pdf]

doi 10.1109/ICCCNT61001.2024.10723886

Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla

Authors: Sourav Kumar Das, Md. Julkar Naeen, MD. Jahidul Islam, Md. Anisul Haque Sajeeb, Narayan Ranjan Chakraborty, Mayen Uddin Mojumdar

Abstract: Bangla or Bengali is the national language of Bangladesh, people from different regions don't talk in proper Bangla. Every division of Bangladesh has its own local language like Sylheti, Chittagong etc. In recent years some papers were published on Bangla language like sentiment analysis, fake news detection and classifications, but a few of them were on Bangla languages. This research is for the… ▽ More Bangla or Bengali is the national language of Bangladesh, people from different regions don't talk in proper Bangla. Every division of Bangladesh has its own local language like Sylheti, Chittagong etc. In recent years some papers were published on Bangla language like sentiment analysis, fake news detection and classifications, but a few of them were on Bangla languages. This research is for the local language and this particular paper is on Sylheti language. It presented a comprehensive system using Natural Language Processing or NLP techniques for translating Pure or Modern Bangla to locally spoken Sylheti Bangla language. Total 1200 data used for training 3 models LSTM, Bi-LSTM and Seq2Seq and LSTM scored the best in performance with 89.3% accuracy. The findings of this research may contribute to the growth of Bangla NLP researchers for future more advanced innovations. △ Less

Submitted 24 May, 2025; originally announced May 2025.

Comments: 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT)

Journal ref: 2024 15th Int. Conf. on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, pp. 1-7, 2024

arXiv:2505.17236 [pdf]

LogStamping: A blockchain-based log auditing approach for large-scale systems

Authors: Md Shariful Islam, M. Sohel Rahman

Abstract: Log management is crucial for ensuring the security, integrity, and compliance of modern information systems. Traditional log management solutions face challenges in achieving tamper-proofing, scalability, and real-time processing in distributed environments. This paper presents a blockchain-based log management framework that addresses these limitations by leveraging blockchain's decentralized, i… ▽ More Log management is crucial for ensuring the security, integrity, and compliance of modern information systems. Traditional log management solutions face challenges in achieving tamper-proofing, scalability, and real-time processing in distributed environments. This paper presents a blockchain-based log management framework that addresses these limitations by leveraging blockchain's decentralized, immutable, and transparent features. The framework integrates a hybrid on-chain and off-chain storage model, combining blockchain's integrity guarantees with the scalability of distributed storage solutions like IPFS. Smart contracts automate log validation and access control, while cryptographic techniques ensure privacy and confidentiality. With a focus on real-time log processing, the framework is designed to handle the high-volume log generation typical in large-scale systems, such as data centers and network infrastructure. Performance evaluations demonstrate the framework's scalability, low latency, and ability to manage millions of log entries while maintaining strong security guarantees. Additionally, the paper discusses challenges like blockchain storage overhead and energy consumption, offering insights for enhancing future systems. △ Less

Submitted 22 May, 2025; originally announced May 2025.

Comments: 7 Figures, 2 tables

arXiv:2505.16290 [pdf]

Multimodal Generative AI for Story Point Estimation in Software Development

Authors: Mohammad Rubyet Islam, Peter Sandborn

Abstract: This research explores the application of Multimodal Generative AI to enhance story point estimation in Agile software development. By integrating text, image, and categorical data using advanced models like BERT, CNN, and XGBoost, our approach surpasses the limitations of traditional single-modal estimation methods. The results demonstrate strong accuracy for simpler story points, while also high… ▽ More This research explores the application of Multimodal Generative AI to enhance story point estimation in Agile software development. By integrating text, image, and categorical data using advanced models like BERT, CNN, and XGBoost, our approach surpasses the limitations of traditional single-modal estimation methods. The results demonstrate strong accuracy for simpler story points, while also highlighting challenges in more complex categories due to data imbalance. This study further explores the impact of categorical data, particularly severity, on the estimation process, emphasizing its influence on model performance. Our findings emphasize the transformative potential of multimodal data integration in refining AI-driven project management, paving the way for more precise, adaptable, and domain-specific AI capabilities. Additionally, this work outlines future directions for addressing data variability and enhancing the robustness of AI in Agile methodologies. △ Less

Submitted 22 May, 2025; originally announced May 2025.

MSC Class: 68T07; 68T45 ACM Class: I.2.6; I.2.10; D.2.9; H.2.8

Journal ref: A revised version of this work is published in the proceedings of the IEEE Conference on Artificial Intelligence 2025

arXiv:2505.14727 [pdf]

The Evolution of Alpha in Finance Harnessing Human Insight and LLM Agents

Authors: Mohammad Rubyet Islam

Abstract: The pursuit of alpha returns that exceed market benchmarks has undergone a profound transformation, evolving from intuition-driven investing to autonomous, AI powered systems. This paper introduces a comprehensive five stage taxonomy that traces this progression across manual strategies, statistical models, classical machine learning, deep learning, and agentic architectures powered by large langu… ▽ More The pursuit of alpha returns that exceed market benchmarks has undergone a profound transformation, evolving from intuition-driven investing to autonomous, AI powered systems. This paper introduces a comprehensive five stage taxonomy that traces this progression across manual strategies, statistical models, classical machine learning, deep learning, and agentic architectures powered by large language models (LLMs). Unlike prior surveys focused narrowly on modeling techniques, this review adopts a system level lens, integrating advances in representation learning, multimodal data fusion, and tool augmented LLM agents. The strategic shift from static predictors to contextaware financial agents capable of real time reasoning, scenario simulation, and cross modal decision making is emphasized. Key challenges in interpretability, data fragility, governance, and regulatory compliance areas critical to production deployment are examined. The proposed taxonomy offers a unified framework for evaluating maturity, aligning infrastructure, and guiding the responsible development of next generation alpha systems. △ Less

Submitted 19 May, 2025; originally announced May 2025.

MSC Class: 91G70 Statistical methods; risk measures 91B84 Economic models (financial models; industrial models; growth models) ACM Class: I.2.6; I.5.1; I.2.7

arXiv:2505.12333 [pdf]

Predicting Gas Well Performance with Decline Curve Analysis: A Case Study on Semutang Gas Field

Authors: Md. Shakil Rahaman, Ahmed Sakib, Ataharuse Samad, Md. Ashraful Islam

Abstract: Decline-curve analysis (DCA) is a widely utilized method for production forecasting and estimating remaining reserves in gas reservoir. Based on the assumptions that past production trend can be mathematically characterized and used to predict future performance. It relies on historical production data and assumes that production methods remain unchanged throughout the analysis. This method is par… ▽ More Decline-curve analysis (DCA) is a widely utilized method for production forecasting and estimating remaining reserves in gas reservoir. Based on the assumptions that past production trend can be mathematically characterized and used to predict future performance. It relies on historical production data and assumes that production methods remain unchanged throughout the analysis. This method is particularly valuable due to its accuracy in forecasting and its broad acceptance within the industry. Wells in the same geographical area and producing from similar geological formations often exhibit similar decline curve parameters. This study applies DCA to forecast the future production performance and estimate the ultimate recovery for the Semutang gas field's well 5 in Bangladesh. Using historical production data, decline curves were generated based on exponential, hyperbolic, and harmonic model equations. The cumulative production estimations were 11,139.34 MMSCF for the exponential model, 11,620.26 MMSCF for the hyperbolic model, and 14,021.92 MMSCF for the harmonic model. In terms of the well's productive life, the estimates were 335.13 days, 1,152 days, and 22,611 days, respectively. Among these models, the hyperbolic decline provided the most realistic forecast, closely aligning with observed production trend. The study highlights the importance of selecting an appropriate decline model for accurate production forecasting and reserve estimation, which is essential for effective reservoir management and resource optimization. △ Less

Submitted 18 May, 2025; originally announced May 2025.

Comments: 8th International Conference on Mechanical, Industrial and Energy Engineering 2024

arXiv:2505.11845 [pdf, ps, other]

ElderFallGuard: Real-Time IoT and Computer Vision-Based Fall Detection System for Elderly Safety

Authors: Tasrifur Riahi, Md. Azizul Hakim Bappy, Md. Mehedi Islam

Abstract: For the elderly population, falls pose a serious and increasing risk of serious injury and loss of independence. In order to overcome this difficulty, we present ElderFallGuard: A Computer Vision Based IoT Solution for Elderly Fall Detection and Notification, a cutting-edge, non-invasive system intended for quick caregiver alerts and real-time fall detection. Our approach leverages the power of co… ▽ More For the elderly population, falls pose a serious and increasing risk of serious injury and loss of independence. In order to overcome this difficulty, we present ElderFallGuard: A Computer Vision Based IoT Solution for Elderly Fall Detection and Notification, a cutting-edge, non-invasive system intended for quick caregiver alerts and real-time fall detection. Our approach leverages the power of computer vision, utilizing MediaPipe for accurate human pose estimation from standard video streams. We developed a custom dataset comprising 7200 samples across 12 distinct human poses to train and evaluate various machine learning classifiers, with Random Forest ultimately selected for its superior performance. ElderFallGuard employs a specific detection logic, identifying a fall when a designated prone pose ("Pose6") is held for over 3 seconds coupled with a significant drop in motion detected for more than 2 seconds. Upon confirmation, the system instantly dispatches an alert, including a snapshot of the event, to a designated Telegram group via a custom bot, incorporating cooldown logic to prevent notification overload. Rigorous testing on our dataset demonstrated exceptional results, achieving 100% accuracy, precision, recall, and F1-score. ElderFallGuard offers a promising, vision-based IoT solution to enhance elderly safety and provide peace of mind for caregivers through intelligent, timely alerts. △ Less

Submitted 17 May, 2025; originally announced May 2025.

Comments: 9 page, 1 table, 5 figure

arXiv:2505.08704 [pdf, ps, other]

LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

Authors: K M Sajjadul Islam, Ayesha Siddika Nipu, Jiawei Wu, Praveen Madiraju

Abstract: Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically… ▽ More Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting. △ Less

Submitted 25 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

Comments: IEEE 26th International Conference on Information Reuse and Integration for Data Science (IRI 2025), San Jose, CA, USA

arXiv:2505.08468 [pdf, ps, other]

Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

Abstract: Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comp… ▽ More Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist. △ Less

Submitted 7 July, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

Comments: Accepted at ACL 2025 Industry Track

arXiv:2505.05538 [pdf, ps, other]

Cardioformer: Advancing AI in ECG Analysis with Multi-Granularity Patching and ResNet

Authors: Md Kamrujjaman Mobin, Md Saiful Islam, Sadik Al Barid, Md Masum

Abstract: Electrocardiogram (ECG) classification is crucial for automated cardiac disease diagnosis, yet existing methods often struggle to capture local morphological details and long-range temporal dependencies simultaneously. To address these challenges, we propose Cardioformer, a novel multi-granularity hybrid model that integrates cross-channel patching, hierarchical residual learning, and a two-stage… ▽ More Electrocardiogram (ECG) classification is crucial for automated cardiac disease diagnosis, yet existing methods often struggle to capture local morphological details and long-range temporal dependencies simultaneously. To address these challenges, we propose Cardioformer, a novel multi-granularity hybrid model that integrates cross-channel patching, hierarchical residual learning, and a two-stage self-attention mechanism. Cardioformer first encodes multi-scale token embeddings to capture fine-grained local features and global contextual information and then selectively fuses these representations through intra- and inter-granularity self-attention. Extensive evaluations on three benchmark ECG datasets under subject-independent settings demonstrate that model consistently outperforms four state-of-the-art baselines. Our Cardioformer model achieves the AUROC of 96.34$\pm$0.11, 89.99$\pm$0.12, and 95.59$\pm$1.66 in MIMIC-IV, PTB-XL and PTB dataset respectively outperforming PatchTST, Reformer, Transformer, and Medformer models. It also demonstrates strong cross-dataset generalization, achieving 49.18% AUROC on PTB and 68.41% on PTB-XL when trained on MIMIC-IV. These findings underscore the potential of Cardioformer to advance automated ECG analysis, paving the way for more accurate and robust cardiovascular disease diagnosis. We release the source code at https://github.com/KMobin555/Cardioformer. △ Less

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.04948

Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations

Authors: Md Aminul Islam, Ahmed Sayeed Faruk

Abstract: Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, ineff… ▽ More Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs' ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking. △ Less

Submitted 26 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

Comments: We have decided to withdraw the manuscript as it requires substantial revisions that go beyond what is appropriate for a versioned update on arXiv. We plan to resubmit once the necessary improvements are made

arXiv:2505.04308 [pdf]

Guardians of the Web: The Evolution and Future of Website Information Security

Authors: Md Saiful Islam, Li Xiangdong

Abstract: Website information security has become a critical concern in the digital age. This article explores the evolution of website information security, examining its historical development, current practices, and future directions. The early beginnings from the 1960s to the 1980s laid the groundwork for modern cybersecurity, with the development of ARPANET, TCP/IP, public-key cryptography, and the fir… ▽ More Website information security has become a critical concern in the digital age. This article explores the evolution of website information security, examining its historical development, current practices, and future directions. The early beginnings from the 1960s to the 1980s laid the groundwork for modern cybersecurity, with the development of ARPANET, TCP/IP, public-key cryptography, and the first antivirus programs. The 1990s marked a transformative era, driven by the commercialization of the Internet and the emergence of web-based services. As the Internet grew, so did the range and sophistication of cyber threats, leading to advancements in security technologies such as the Secure Sockets Layer (SSL) protocol, password protection, and firewalls. Current practices in website information security involve a multi-layered approach, including encryption, secure coding practices, regular security audits, and user education. The future of website information security is expected to be shaped by emerging technologies such as artificial intelligence, blockchain, and quantum computing, as well as the increasing importance of international cooperation and standardization efforts. As cyber threats continue to evolve, ongoing research and innovation in website information security will be essential to protect sensitive information and maintain trust in the digital world. △ Less

Submitted 7 May, 2025; originally announced May 2025.

Comments: 22 pages

ACM Class: F.2.2, I.2.7

arXiv:2505.03799 [pdf, other]

Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling

Authors: Hyun Lee, Chris Yi, Maminur Islam, B. D. S. Aritra

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in various natural language processing tasks; however, their application to graph-related problems remains limited, primarily due to scalability constraints and the absence of dedicated mechanisms for processing graph structures. Existing approaches predominantly integrate LLMs with Graph Neural Networks (GNNs), using GNNs as featu… ▽ More Large Language Models (LLMs) have demonstrated strong capabilities in various natural language processing tasks; however, their application to graph-related problems remains limited, primarily due to scalability constraints and the absence of dedicated mechanisms for processing graph structures. Existing approaches predominantly integrate LLMs with Graph Neural Networks (GNNs), using GNNs as feature encoders or auxiliary components. However, directly encoding graph structures within LLMs has been underexplored, particularly in the context of large-scale graphs where token limitations hinder effective representation. To address these challenges, we propose SDM-InstructGLM, a novel instruction-tuned Graph Language Model (InstructGLM) framework that enhances scalability and efficiency without relying on GNNs. Our method introduces a similarity-degree-based biased random walk mechanism, which selectively samples and encodes graph information based on node-feature similarity and degree centrality, ensuring an adaptive and structured representation within the LLM. This approach significantly improves token efficiency, mitigates information loss due to random sampling, and enhances performance on graph-based tasks such as node classification and link prediction. Furthermore, our results demonstrate the feasibility of LLM-only graph processing, enabling scalable and interpretable Graph Language Models (GLMs) optimized through instruction-based fine-tuning. This work paves the way for GNN-free approaches to graph learning, leveraging LLMs as standalone graph reasoning models. Our source code is available on GitHub. △ Less

Submitted 2 May, 2025; originally announced May 2025.

Comments: To be published in International Joint Conference on Neural Networks (IJCNN), 2025

arXiv:2505.03770 [pdf, other]

Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind

Authors: Mouad Abrini, Omri Abend, Dina Acklin, Henny Admoni, Gregor Aichinger, Nitay Alon, Zahra Ashktorab, Ashish Atreja, Moises Auron, Alexander Aufreiter, Raghav Awasthi, Soumya Banerjee, Joe M. Barnby, Rhea Basappa, Severin Bergsmann, Djallel Bouneffouf, Patrick Callaghan, Marc Cavazza, Thierry Chaminade, Sonia Chernova, Mohamed Chetouan, Moumita Choudhury, Axel Cleeremans, Jacek B. Cywinski, Fabio Cuzzolin , et al. (83 additional authors not shown)

Abstract: This volume includes a selection of papers presented at the Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2025 in Philadelphia US on 3rd March 2025. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community. This volume includes a selection of papers presented at the Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2025 in Philadelphia US on 3rd March 2025. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community. △ Less

Submitted 28 April, 2025; originally announced May 2025.

Comments: workshop proceedings

arXiv:2505.01766 [pdf, other]

Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement

Authors: Long Bai, Boyi Ma, Ruohan Wang, Guankun Wang, Beilei Cui, Zhongliang Jiang, Mobarakol Islam, Zhe Min, Jiewen Lai, Nassir Navab, Hongliang Ren

Abstract: Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust… ▽ More Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures. △ Less

Submitted 3 May, 2025; originally announced May 2025.

Comments: Accepted by Information Fusion

arXiv:2505.01635 [pdf]

Dendritic Computing with Multi-Gate Ferroelectric Field-Effect Transistors

Authors: A N M Nafiul Islam, Xuezhong Niu, Jiahui Duan, Shubham Kumar, Kai Ni, Abhronil Sengupta

Abstract: Although inspired by neuronal systems in the brain, artificial neural networks generally employ point-neurons, which offer far less computational complexity than their biological counterparts. Neurons have dendritic arbors that connect to different sets of synapses and offer local non-linear accumulation - playing a pivotal role in processing and learning. Inspired by this, we propose a novel neur… ▽ More Although inspired by neuronal systems in the brain, artificial neural networks generally employ point-neurons, which offer far less computational complexity than their biological counterparts. Neurons have dendritic arbors that connect to different sets of synapses and offer local non-linear accumulation - playing a pivotal role in processing and learning. Inspired by this, we propose a novel neuron design based on a multi-gate ferroelectric field-effect transistor that mimics dendrites. It leverages ferroelectric nonlinearity for local computations within dendritic branches, while utilizing the transistor action to generate the final neuronal output. The branched architecture paves the way for utilizing smaller crossbar arrays in hardware integration, leading to greater efficiency. Using an experimentally calibrated device-circuit-algorithm co-simulation framework, we demonstrate that networks incorporating our dendritic neurons achieve superior performance in comparison to much larger networks without dendrites ($\sim$17$\times$ fewer trainable weight parameters). These findings suggest that dendritic hardware can significantly improve computational efficiency, and learning capacity of neuromorphic systems optimized for edge applications. △ Less

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2505.01429 [pdf, other]

Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis

Authors: Md. Zahid Hossain, Md. Rakibul Islam, Most. Sharmin Sultana Samu

Abstract: Since mpox can spread from person to person, it is a zoonotic viral illness that poses a significant public health concern. It is difficult to make an early clinical diagnosis because of how closely its symptoms match those of measles and chickenpox. Medical imaging combined with deep learning (DL) techniques has shown promise in improving disease detection by analyzing affected skin areas. Our st… ▽ More Since mpox can spread from person to person, it is a zoonotic viral illness that poses a significant public health concern. It is difficult to make an early clinical diagnosis because of how closely its symptoms match those of measles and chickenpox. Medical imaging combined with deep learning (DL) techniques has shown promise in improving disease detection by analyzing affected skin areas. Our study explore the feasibility to train deep learning and vision transformer-based models from scratch with publicly available skin lesion image dataset. Our experimental results show dataset limitation as a major drawback to build better classifier models trained from scratch. We used transfer learning with the help of pre-trained models to get a better classifier. The MobileNet-v2 outperformed other state of the art pre-trained models with 93.15% accuracy and 93.09% weighted average F1 score. ViT B16 and ResNet-50 also achieved satisfactory performance compared to already available studies with accuracy 92.12% and 86.21% respectively. To further validate the performance of the models, we applied explainable AI techniques. △ Less

Submitted 3 April, 2025; originally announced May 2025.

arXiv:2504.21025 [pdf]

doi 10.1109/ICCIT64611.2024.11021969

Durghotona GPT: A Web Scraping and Large Language Model Based Framework to Generate Road Accident Dataset Automatically in Bangladesh

Authors: MD Thamed Bin Zaman Chowdhury, Moazzem Hossain, Md. Ridwanul Islam

Abstract: Road accidents pose significant concerns globally. They lead to large financial losses, injuries, disabilities, and societal challenges. Accurate and timely accident data is essential for predicting and mitigating these events. This paper presents a novel framework named 'Durghotona GPT' that integrates web scraping and Large Language Models (LLMs) to automate the generation of comprehensive accid… ▽ More Road accidents pose significant concerns globally. They lead to large financial losses, injuries, disabilities, and societal challenges. Accurate and timely accident data is essential for predicting and mitigating these events. This paper presents a novel framework named 'Durghotona GPT' that integrates web scraping and Large Language Models (LLMs) to automate the generation of comprehensive accident datasets from prominent national dailies in Bangladesh. The authors collected accident reports from three major newspapers: Prothom Alo, Dhaka Tribune, and The Daily Star. The collected news was then processed using the newest available LLMs: GPT-4, GPT-3.5, and Llama-3. The framework efficiently extracts relevant information, categorizes reports, and compiles detailed datasets. Thus, this framework overcomes limitations of manual data collection methods such as delays, errors, and communication gaps. The authors' evaluation demonstrates that Llama-3, an open-source model, performs comparably to GPT-4. It achieved 89% accuracy in the authors' evaluation. Therefore, it can be considered a cost-effective alternative for similar tasks. The results suggest that the framework developed by the authors can drastically enhance the quality and availability of accident data. As a result, it can support critical applications in traffic safety analysis, urban planning, and public health. The authors also developed an interface for 'Durghotona GPT' for ease of use as part of this paper. Future work will focus on expanding data collection methods and refining LLMs to further increase dataset accuracy and applicability. △ Less

Submitted 23 April, 2025; originally announced April 2025.

Comments: It has been accepted in IEEE 27th International Conference on Computer and Information Technology (ICCIT). Now, we are waiting for it to get published in IEEE Xplore

arXiv:2504.20123 [pdf, other]

Multiscale modelling of thermally stressed superelastic polyimide

Authors: Jerome Samuel S, Puneet Kumar Patra, Md Rushdie Ibne Islam

Abstract: Many thermo-mechanical processes, such as thermal expansion and stress relaxation, originate at the atomistic scale. We develop a sequential multiscale approach to study thermally stressed superelastic polyimide to explore these effects. The continuum-scale smoothed particle hydrodynamics (SPH) model is coupled with atomistic molecular dynamics (MD) through constitutive modelling, where thermo-mec… ▽ More Many thermo-mechanical processes, such as thermal expansion and stress relaxation, originate at the atomistic scale. We develop a sequential multiscale approach to study thermally stressed superelastic polyimide to explore these effects. The continuum-scale smoothed particle hydrodynamics (SPH) model is coupled with atomistic molecular dynamics (MD) through constitutive modelling, where thermo-mechanical properties and equations of state are derived from MD simulations. The results are verified through benchmark problems of heat transfer. Finally, we analyse the insulating capabilities of superelastic polyimide by simulating the thermal response of an aluminium plate. The result shows a considerable reduction in the thermal stress, strain and temperature field development in the aluminium plate when superelastic polyimide is used as an insulator. The present work demonstrates the effectiveness of the multi-scale method in capturing thermo-mechanical interactions in superelastic polyimide. △ Less

Submitted 28 April, 2025; originally announced April 2025.

Comments: 25 pages, 17 figures

arXiv:2504.17725 [pdf, other]

STGen: A Novel Lightweight IoT Testbed for Generating Sensor Traffic for the Experimentation of IoT Protocol and its Application in Hybrid Network

Authors: Hasan MA Islam, S. Nath, M. Rahman, N. Shahriar, M. K. M. Khan, R. Islam

Abstract: A Wireless Sensor Network (WSN) is a network that does not rely on a fixed infrastructure and consists of numerous sensors, such as temperature, humidity, GPS, and cameras, equipped with onboard processors that manage and monitor the environment in a specific area. As a result, building a real sensor network testbed for verifying, validating, or experimenting with a newly designed protocol present… ▽ More A Wireless Sensor Network (WSN) is a network that does not rely on a fixed infrastructure and consists of numerous sensors, such as temperature, humidity, GPS, and cameras, equipped with onboard processors that manage and monitor the environment in a specific area. As a result, building a real sensor network testbed for verifying, validating, or experimenting with a newly designed protocol presents considerable challenges in adapting a laboratory scenario due to the significant financial and logistical barriers, such as the need for specialized hardware and large-scale deployments. Additionally, WSN suffers from severe constraints such as restricted power supply, short communication range, limited bandwidth availability, and restricted memory storage. Addressing these challenges, this work presents a flexible testbed solution named STGen that enables researchers to experiment with IoT protocols in a hybrid environment that emulates WSN implementations with the physical Internet through a dedicated physical server named STGen core, which receives sensor traffic and processes it for further actions. The STGen testbed is lightweight in memory usage and easy to deploy. Most importantly, STGen supports large-scale distributed systems, facilitates experimentation with IoT protocols, and enables integration with back-end services for big data analytics and statistical insights. The key feature of STGen is the integration of real-world IoT protocols and their applications with WSN. Its modular and lightweight design makes STGen efficient and enables it to outperform other popular testbeds, such as Gotham and GothX, reducing memory usage by 89\%. While GothX takes approximately 26 minutes to establish a large topology with four VM nodes and 498 Docker nodes, STGen requires only 1.645 seconds to initialize the platform with 500 sensor nodes. △ Less

Submitted 24 April, 2025; originally announced April 2025.

Comments: 23 Pages, 12 Figures, Submitted to ACM Transactions on Sensor Networks

arXiv:2504.16741 [pdf, other]

Search Timelines: Visualizing Search History to Enable Cross-Session Exploratory Search

Authors: Orland Hoeber, Md Nazmul Islam, Miriam Boon, Dale Storie, Veronica Ramshaw

Abstract: Purpose: The timespan over which exploratory searching can occur, as well as the scope and volume of the search activities undertaken, can make it difficult for searchers to remember key details about their search activities. These difficulties are present both in the midst of searching as well as when resuming a search that spans multiple sessions. In this paper, we present a search interface des… ▽ More Purpose: The timespan over which exploratory searching can occur, as well as the scope and volume of the search activities undertaken, can make it difficult for searchers to remember key details about their search activities. These difficulties are present both in the midst of searching as well as when resuming a search that spans multiple sessions. In this paper, we present a search interface designed to support cross-session exploratory search in a public digital library context. Methods: Search Timelines provides a visualization of current and past search activities via a dynamic timeline of the search activity (queries and saved resources). This timeline is presented at two levels of detail. An overview timeline is provided alongside the search results in a typical search engine results page design. A detailed timeline is provided in the workspace, where searchers can review the history of their search activities and their saved resources. A controlled laboratory study was conducted to compare this approach to a baseline interface modelled after a typical public digital library search/workspace interface. Results: Participants who used Search Timelines reported higher levels of user engagement, usability, and perceived knowledge gain, during an initial search session and when resuming the search after a 7-8 day interval. This came at the expense of the searchers taking more time to complete the search task, which we view as positive evidence of engagement in cross-session exploratory search processes. Conclusion: Search Timelines serves as an example of how lightweight visualization approaches can be used to enhance typical search interface designs to support exploratory search. The results highlight the value of providing persistent representations of past search activities within the search interface. △ Less

Submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.14466 [pdf, other]

A Bio-inspired Asymmetric Double-Gate Ferroelectric FET for Emulating Astrocyte and Dendrite Dynamics in Neuromorphic Systems

Authors: Zhouhang Jiang, A N M Nafiul Islam, Zhuangyu Han, Zijian Zhao, Franz Müller, Jiahui Duan, Halid Mulaosmanovic, Stefan Dünkel, Sven Beyer, Sourav Dutta, Vijaykrishnan Narayanan, Thomas Kämpfe, Suma George Cardwell, Frances Chance, Abhronil Sengupta, Kai Ni

Abstract: Neuromorphic systems seek to replicate the functionalities of biological neural networks to attain significant improvements in performance and efficiency of AI computing platforms. However, these systems have generally remained limited to emulation of simple neurons and synapses; and ignored higher order functionalities enabled by other components of the brain like astrocytes and dendrites. In thi… ▽ More Neuromorphic systems seek to replicate the functionalities of biological neural networks to attain significant improvements in performance and efficiency of AI computing platforms. However, these systems have generally remained limited to emulation of simple neurons and synapses; and ignored higher order functionalities enabled by other components of the brain like astrocytes and dendrites. In this work, drawing inspiration from biology, we introduce a compact Double-Gate Ferroelectric Field Effect Transistor (DG-FeFET) cell that can emulate the dynamics of both astrocytes and dendrites within neuromorphic architectures. We demonstrate that with a ferroelectric top gate for synaptic weight programming as in conventional synapses and a non-ferroelectric back gate, the DG-FeFET realizes a synapse with a dynamic gain modulation mechanism. This can be leveraged as an analog for a compact astrocyte-tripartite synapse, as well as enabling dendrite-like gain modulation operations. By employing a fully-depleted silicon-on-insulator (FDSOI) FeFET as our double-gate device, we validate the linear control of the synaptic weight via the back gate terminal (i.e., the gate underneath the buried oxide (BOX) layer) through comprehensive theoretical and experimental studies. We showcase the promise such a tripartite synaptic device holds for numerous important neuromorphic applications, including autonomous self-repair of faulty neuromorphic hardware mediated by astrocytic functionality. Coordinate transformations based on dragonfly prey-interception circuitry models are also demonstrated based on dendritic function emulation by the device. This work paves the way forward for developing truly "brain-like" neuromorphic hardware that go beyond the current dogma focusing only on neurons and synapses. △ Less

Submitted 19 April, 2025; originally announced April 2025.

Comments: 37 pages, 6 figure, 2 tables

arXiv:2504.14068 [pdf, other]

Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement

Authors: K M Sajjadul Islam, Ravi Teja Karri, Srujan Vegesna, Jiawei Wu, Praveen Madiraju

Abstract: Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents significant challenges due to limited data and domain-specific nuances. Traditional supervised learning approaches require extensive labeled datasets, making unsupervised methods more viable for uncovering meaningful insights from patient feedback. This study explores u… ▽ More Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents significant challenges due to limited data and domain-specific nuances. Traditional supervised learning approaches require extensive labeled datasets, making unsupervised methods more viable for uncovering meaningful insights from patient feedback. This study explores unsupervised methods to extract meaningful topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach was applied to isolate complaint-related feedback using a domain-specific lexicon. To delve deeper and analyze dominant topics in feedback, we explored traditional topic modeling methods, including Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, an advanced neural embedding-based clustering approach. To improve coherence and interpretability where data are scarce and consist of short-texts, we propose kBERT, an integration of BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv ) for topic interpretability and average Inverted Rank-Biased Overlap (IRBOavg) for topic diversity. Results indicate that kBERT achieves the highest coherence (Cv = 0.53) and distinct topic separation (IRBOavg = 1.00), outperforming all other models in short-text healthcare feedback analysis. Our findings emphasize the importance of embedding-based techniques for topic identification and highlight the need for context-aware models in healthcare analytics. △ Less

Submitted 18 April, 2025; originally announced April 2025.

Comments: Full version of the paper accepted at the 2025 IEEE COMPSAC, Toronto, Canada

arXiv:2504.13990 [pdf, other]

PC-DeepNet: A GNSS Positioning Error Minimization Framework Using Permutation-Invariant Deep Neural Network

Authors: M. Humayun Kabir, Md. Ali Hasan, Md. Shafiqul Islam, Kyeongjun Ko, Wonjae Shin

Abstract: Global navigation satellite systems (GNSS) face significant challenges in urban and sub-urban areas due to non-line-of-sight (NLOS) propagation, multipath effects, and low received power levels, resulting in highly non-linear and non-Gaussian measurement error distributions. In light of this, conventional model-based positioning approaches, which rely on Gaussian error approximations, struggle to… ▽ More Global navigation satellite systems (GNSS) face significant challenges in urban and sub-urban areas due to non-line-of-sight (NLOS) propagation, multipath effects, and low received power levels, resulting in highly non-linear and non-Gaussian measurement error distributions. In light of this, conventional model-based positioning approaches, which rely on Gaussian error approximations, struggle to achieve precise localization under these conditions. To overcome these challenges, we put forth a novel learning-based framework, PC-DeepNet, that employs a permutation-invariant (PI) deep neural network (DNN) to estimate position corrections (PC). This approach is designed to ensure robustness against changes in the number and/or order of visible satellite measurements, a common issue in GNSS systems, while leveraging NLOS and multipath indicators as features to enhance positioning accuracy in challenging urban and sub-urban environments. To validate the performance of the proposed framework, we compare the positioning error with state-of-the-art model-based and learning-based positioning methods using two publicly available datasets. The results confirm that proposed PC-DeepNet achieves superior accuracy than existing model-based and learning-based methods while exhibiting lower computational complexity compared to previous learning-based approaches. △ Less

Submitted 18 April, 2025; originally announced April 2025.

Comments: 31 pages, 14 figures, 6 tables

arXiv:2504.13272 [pdf, other]

Using LLMs for Library Migration

Authors: Md Mohayeminul Islam, Ajay Kumar Jha, May Mahmoud, Ildar Akhmetov, Sarah Nadi

Abstract: Library migration is the process of replacing a used software library with another library that provides similar functionality. Manual library migration is time-consuming and error prone, as it requires developers to understand the APIs of both libraries, map them, and perform the necessary code transformations. Due to its difficulty, most of the existing automated techniques and tooling stop at t… ▽ More Library migration is the process of replacing a used software library with another library that provides similar functionality. Manual library migration is time-consuming and error prone, as it requires developers to understand the APIs of both libraries, map them, and perform the necessary code transformations. Due to its difficulty, most of the existing automated techniques and tooling stop at the API mapping stage or support a limited set of code transformations. On the other hand, Large Language Models (LLMs) are good at generating and transforming code and finding similar code, which are necessary upstream tasks for library migration. Such capabilities suggest that LLMs may be suitable for library migration. Therefore, in this paper, we investigate the effectiveness of LLMs for migration between Python libraries. We evaluate three LLMs, LLama 3.1, GPT-4o mini, and GPT-4o on PyMigBench, where we migrate 321 real-world library migrations that include 2,989 migration-related code changes. We measure the correctness of the migration results in two ways. We first compare the LLM's migrated code with the developers' migrated code in the benchmark and then run the unit tests available in the client repositories. We find that LLama 3.1, GPT-4o mini, and GPT-4o correctly migrate 89%, 89%, and 94% of the migration-related code changes. respectively. We also find that 36%, 52% and 64% of the LLama 3.1, GPT-4o mini, and GPT-4o migrations pass the same tests that passed in the developer's migration. Overall, our results suggest that LLMs can be effective in migrating code between libraries, but we also identify cases that pose difficulties for the LLM. △ Less

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.12809 [pdf]

doi 10.1145/3701716.3715519

Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

Authors: Inzamamul Alam, Md Tanvir Islam, Simon S. Woo

Abstract: As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diff… ▽ More As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks although preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available on~\href{https://github.com/inzamamulDU/SADRE}{\textbf{https://github.com/inzamamulDU/SADRE}}. △ Less

Submitted 17 April, 2025; originally announced April 2025.

Comments: Accepted at The Web Conference 2025

ACM Class: I.4.5; I.5.4

arXiv:2504.12711 [pdf, other]

NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou , et al. (112 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ… ▽ More This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/. △ Less

Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

arXiv:2504.09896 [pdf, other]

TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

Authors: Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals

Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The… ▽ More Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94\% and 95\%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Comments: 41 pages, 12 figures, includes algorithm and comparative tables

Showing 1–50 of 911 results for author: Islam, M