-
Position-Normal Manifold for Efficient Glint Rendering on High-Resolution Normal Maps
Authors:
Liwen Wu,
Fujun Luan,
Miloš Hašan,
Ravi Ramamoorthi
Abstract:
Detailed microstructures on specular objects often exhibit intriguing glinty patterns under high-frequency lighting, which is challenging to render using a conventional normal-mapped BRDF. In this paper, we present a manifold-based formulation of the glint normal distribution functions (NDF) that precisely captures the surface normal distributions over queried footprints. The manifold-based formul…
▽ More
Detailed microstructures on specular objects often exhibit intriguing glinty patterns under high-frequency lighting, which is challenging to render using a conventional normal-mapped BRDF. In this paper, we present a manifold-based formulation of the glint normal distribution functions (NDF) that precisely captures the surface normal distributions over queried footprints. The manifold-based formulation transfers the integration for the glint NDF construction to a problem of mesh intersections. Compared to previous works that rely on complex numerical approximations, our integral solution is exact and much simpler to compute, which also allows an easy adaptation of a mesh clustering hierarchy to accelerate the NDF evaluation of large footprints. Our performance and quality analysis shows that our NDF formulation achieves similar glinty appearance compared to the baselines but is an order of magnitude faster. Within this framework, we further present a novel derivation of analytical shadow-masking for normal-mapped diffuse surfaces -- a component that is often ignored in previous works.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
IntrinsicEdit: Precise generative image manipulation in intrinsic space
Authors:
Linjie Lyu,
Valentin Deschaintre,
Yannick Hold-Geoffroy,
Miloš Hašan,
Jae Shin Yoon,
Thomas Leimkühler,
Christian Theobalt,
Iliyan Georgiev
Abstract:
Generative diffusion models have advanced image editing with high-quality results and intuitive interfaces such as prompts and semantic drawing. However, these interfaces lack precise control, and the associated methods typically specialize on a single editing task. We introduce a versatile, generative workflow that operates in an intrinsic-image latent space, enabling semantic, local manipulation…
▽ More
Generative diffusion models have advanced image editing with high-quality results and intuitive interfaces such as prompts and semantic drawing. However, these interfaces lack precise control, and the associated methods typically specialize on a single editing task. We introduce a versatile, generative workflow that operates in an intrinsic-image latent space, enabling semantic, local manipulation with pixel precision for a range of editing operations. Building atop the RGB-X diffusion framework, we address key challenges of identity preservation and intrinsic-channel entanglement. By incorporating exact diffusion inversion and disentangled channel manipulation, we enable precise, efficient editing with automatic resolution of global illumination effects -- all without additional data collection or model fine-tuning. We demonstrate state-of-the-art performance across a variety of tasks on complex images, including color and texture adjustments, object insertion and removal, global relighting, and their combinations.
△ Less
Submitted 15 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Predicting Diabetes Using Machine Learning: A Comparative Study of Classifiers
Authors:
Mahade Hasan,
Farhana Yasmin
Abstract:
Diabetes remains a significant health challenge globally, contributing to severe complications like kidney disease, vision loss, and heart issues. The application of machine learning (ML) in healthcare enables efficient and accurate disease prediction, offering avenues for early intervention and patient support. Our study introduces an innovative diabetes prediction framework, leveraging both trad…
▽ More
Diabetes remains a significant health challenge globally, contributing to severe complications like kidney disease, vision loss, and heart issues. The application of machine learning (ML) in healthcare enables efficient and accurate disease prediction, offering avenues for early intervention and patient support. Our study introduces an innovative diabetes prediction framework, leveraging both traditional ML techniques such as Logistic Regression, SVM, Naïve Bayes, and Random Forest and advanced ensemble methods like AdaBoost, Gradient Boosting, Extra Trees, and XGBoost. Central to our approach is the development of a novel model, DNet, a hybrid architecture combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers for effective feature extraction and sequential learning. The DNet model comprises an initial convolutional block for capturing essential features, followed by a residual block with skip connections to facilitate efficient information flow. Batch Normalization and Dropout are employed for robust regularization, and an LSTM layer captures temporal dependencies within the data. Using a Kaggle-sourced real-world diabetes dataset, our model evaluation spans cross-validation accuracy, precision, recall, F1 score, and ROC-AUC. Among the models, DNet demonstrates the highest efficacy with an accuracy of 99.79% and an AUC-ROC of 99.98%, establishing its potential for superior diabetes prediction. This robust hybrid architecture showcases the value of combining CNN and LSTM layers, emphasizing its applicability in medical diagnostics and disease prediction tasks.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Unmasking Deep Fakes: Leveraging Deep Learning for Video Authenticity Detection
Authors:
Mahmudul Hasan
Abstract:
Deepfake videos, produced through advanced artificial intelligence methods now a days, pose a new challenge to the truthfulness of the digital media. As Deepfake becomes more convincing day by day, detecting them requires advanced methods capable of identifying subtle inconsistencies. The primary motivation of this paper is to recognize deepfake videos using deep learning techniques, specifically…
▽ More
Deepfake videos, produced through advanced artificial intelligence methods now a days, pose a new challenge to the truthfulness of the digital media. As Deepfake becomes more convincing day by day, detecting them requires advanced methods capable of identifying subtle inconsistencies. The primary motivation of this paper is to recognize deepfake videos using deep learning techniques, specifically by using convolutional neural networks. Deep learning excels in pattern recognition, hence, makes it an ideal approach for detecting the intricate manipulations in deepfakes. In this paper, we consider using MTCNN as a face detector and EfficientNet-B5 as encoder model to predict if a video is deepfake or not. We utilize training and evaluation dataset from Kaggle DFDC. The results shows that our deepfake detection model acquired 42.78% log loss, 93.80% AUC and 86.82% F1 score on kaggle's DFDC dataset.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Authors:
Md Rakibul Hasan,
Pouria Behnoudfar,
Dan MacKinlay,
Thomas Poulet
Abstract:
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Nois…
▽ More
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional methods, even with limited training data (e.g., only 13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning, offering improved accuracy and efficiency for image processing, enhanced process understanding, and broader applications to scientific research. The source codes and data will be made publicly available at https://github.com/hasan-rakibul/PC-SRGAN upon acceptance of this paper.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Sponge Attacks on Sensing AI: Energy-Latency Vulnerabilities and Defense via Model Pruning
Authors:
Syed Mhamudul Hasan,
Hussein Zangoti,
Iraklis Anagnostopoulos,
Abdur R. Shahid
Abstract:
Recent studies have shown that sponge attacks can significantly increase the energy consumption and inference latency of deep neural networks (DNNs). However, prior work has focused primarily on computer vision and natural language processing tasks, overlooking the growing use of lightweight AI models in sensing-based applications on resource-constrained devices, such as those in Internet of Thing…
▽ More
Recent studies have shown that sponge attacks can significantly increase the energy consumption and inference latency of deep neural networks (DNNs). However, prior work has focused primarily on computer vision and natural language processing tasks, overlooking the growing use of lightweight AI models in sensing-based applications on resource-constrained devices, such as those in Internet of Things (IoT) environments. These attacks pose serious threats of energy depletion and latency degradation in systems where limited battery capacity and real-time responsiveness are critical for reliable operation. This paper makes two key contributions. First, we present the first systematic exploration of energy-latency sponge attacks targeting sensing-based AI models. Using wearable sensing-based AI as a case study, we demonstrate that sponge attacks can substantially degrade performance by increasing energy consumption, leading to faster battery drain, and by prolonging inference latency. Second, to mitigate such attacks, we investigate model pruning, a widely adopted compression technique for resource-constrained AI, as a potential defense. Our experiments show that pruning-induced sparsity significantly improves model resilience against sponge poisoning. We also quantify the trade-offs between model efficiency and attack resilience, offering insights into the security implications of model compression in sensing-based AI systems deployed in IoT environments.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
AI Standardized Patient Improves Human Conversations in Advanced Cancer Care
Authors:
Kurtis Haut,
Masum Hasan,
Thomas Carroll,
Ronald Epstein,
Taylan Sen,
Ehsan Hoque
Abstract:
Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation a…
▽ More
Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation and automated feedback system. SOPHIE combines large language models (LLMs), a lifelike virtual avatar, and automated, personalized feedback based on clinical literature to provide remote, on-demand SIC training. In a randomized control study with healthcare students and professionals, SOPHIE users demonstrated significant improvement across three critical SIC domains: Empathize, Be Explicit, and Empower. These results suggest that AI-driven tools can enhance complex interpersonal communication skills, offering scalable, accessible solutions to address a critical gap in clinician education.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Steering Semantic Data Processing With DocWrangler
Authors:
Shreya Shankar,
Bhavya Chopra,
Mawil Hasan,
Stephen Lee,
Björn Hartmann,
Joseph M. Hellerstein,
Aditya G. Parameswaran,
Eugene Wu
Abstract:
Unstructured text has long been difficult to automatically analyze at scale. Large language models (LLMs) now offer a way forward by enabling {\em semantic data processing}, where familiar data processing operators (e.g., map, reduce, filter) are powered by LLMs instead of code. However, building effective semantic data processing pipelines presents a departure from traditional data pipelines: use…
▽ More
Unstructured text has long been difficult to automatically analyze at scale. Large language models (LLMs) now offer a way forward by enabling {\em semantic data processing}, where familiar data processing operators (e.g., map, reduce, filter) are powered by LLMs instead of code. However, building effective semantic data processing pipelines presents a departure from traditional data pipelines: users need to understand their data to write effective pipelines, yet they need to construct pipelines to extract the data necessary for that understanding -- all while navigating LLM idiosyncrasies and inconsistencies. We present \docwrangler, a mixed-initiative integrated development environment (IDE) for semantic data processing with three novel features to address the gaps between the user, their data, and their pipeline: {\em (i) In-Situ User Notes} that allows users to inspect, annotate, and track observations across documents and LLM outputs, {\em (ii) LLM-Assisted Prompt Refinement} that transforms user notes into improved operations, and {\em (iii) LLM-Assisted Operation Decomposition} that identifies when operations or documents are too complex for the LLM to correctly process and suggests decompositions. Our evaluation combines a think-aloud study with 10 participants and a public-facing deployment (available at \href{https://docetl.org/playground}{docetl.org/playground}) with 1,500+ recorded sessions, revealing how users develop systematic strategies for their semantic data processing tasks; e.g., transforming open-ended operations into classifiers for easier validation and intentionally using vague prompts to learn more about their data or LLM capabilities.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
PC-DeepNet: A GNSS Positioning Error Minimization Framework Using Permutation-Invariant Deep Neural Network
Authors:
M. Humayun Kabir,
Md. Ali Hasan,
Md. Shafiqul Islam,
Kyeongjun Ko,
Wonjae Shin
Abstract:
Global navigation satellite systems (GNSS) face significant challenges in urban and sub-urban areas due to non-line-of-sight (NLOS) propagation, multipath effects, and low received power levels, resulting in highly non-linear and non-Gaussian measurement error distributions. In light of this, conventional model-based positioning approaches, which rely on Gaussian error approximations, struggle to…
▽ More
Global navigation satellite systems (GNSS) face significant challenges in urban and sub-urban areas due to non-line-of-sight (NLOS) propagation, multipath effects, and low received power levels, resulting in highly non-linear and non-Gaussian measurement error distributions. In light of this, conventional model-based positioning approaches, which rely on Gaussian error approximations, struggle to achieve precise localization under these conditions. To overcome these challenges, we put forth a novel learning-based framework, PC-DeepNet, that employs a permutation-invariant (PI) deep neural network (DNN) to estimate position corrections (PC). This approach is designed to ensure robustness against changes in the number and/or order of visible satellite measurements, a common issue in GNSS systems, while leveraging NLOS and multipath indicators as features to enhance positioning accuracy in challenging urban and sub-urban environments. To validate the performance of the proposed framework, we compare the positioning error with state-of-the-art model-based and learning-based positioning methods using two publicly available datasets. The results confirm that proposed PC-DeepNet achieves superior accuracy than existing model-based and learning-based methods while exhibiting lower computational complexity compared to previous learning-based approaches.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Tabular foundation model to detect empathy from visual cues
Authors:
Md Rakibul Hasan,
Shafin Rahman,
Md Zakir Hossain,
Aneesh Krishna,
Tom Gedeon
Abstract:
Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual fo…
▽ More
Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models $-$ TabPFN v2 and TabICL $-$ through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). In addition to performance improvement, we contribute novel insights and an evaluation setup to ensure generalisation on unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Stochastic Ray Tracing of 3D Transparent Gaussians
Authors:
Xin Sun,
Iliyan Georgiev,
Yun Fei,
Miloš Hašan
Abstract:
3D Gaussian splatting has recently been widely adopted as a 3D representation for novel-view synthesis, relighting, and text-to-3D generation tasks, offering realistic and detailed results through a collection of explicit 3D Gaussians carrying opacities and view-dependent colors. However, efficient rendering of many transparent primitives remains a significant challenge. Existing approaches either…
▽ More
3D Gaussian splatting has recently been widely adopted as a 3D representation for novel-view synthesis, relighting, and text-to-3D generation tasks, offering realistic and detailed results through a collection of explicit 3D Gaussians carrying opacities and view-dependent colors. However, efficient rendering of many transparent primitives remains a significant challenge. Existing approaches either rasterize the 3D Gaussians with approximate sorting per view or rely on high-end RTX GPUs to exhaustively process all ray-Gaussian intersections (bounding Gaussians by meshes). This paper proposes a stochastic ray tracing method to render 3D clouds of transparent primitives. Instead of processing all ray-Gaussian intersections in sequential order, each ray traverses the acceleration structure only once, randomly accepting and shading a single intersection (or N intersections, using a simple extension). This approach minimizes shading time and avoids sorting the Gaussians along the ray while minimizing the register usage and maximizing parallelism even on low-end GPUs. The cost of rays through the Gaussian asset is comparable to that of standard mesh-intersection rays. While our method introduces noise, the shading is unbiased, and the variance is slight, as stochastic acceptance is importance-sampled based on accumulated opacity. The alignment with the Monte Carlo philosophy simplifies implementation and easily integrates our method into a conventional path-tracing framework.
△ Less
Submitted 10 April, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge
Authors:
Firoj Alam,
Md Arid Hasan,
Sahinur Rahman Laskar,
Mucahid Kutlu,
Shammur Absar Chowdhury
Abstract:
The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose a framework, NativQA, that c…
▽ More
The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose a framework, NativQA, that can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages, ranging from extremely low-resource to high-resource languages, which resulted over 300K Question Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework).
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Online Facility Assignments on Polygons
Authors:
Sumaiya Malik,
Reyan Ahmed,
Md. Manzurul Hasan
Abstract:
We study the online facility assignment problem on regular polygons, where all sides are of equal length. The influence of specific geometric settings has remained mostly unexplored, even though classical online facility assignment problems have mainly dealt with linear and general metric spaces. We fill this gap by considering the following four basic geometric settings: equilateral triangles, re…
▽ More
We study the online facility assignment problem on regular polygons, where all sides are of equal length. The influence of specific geometric settings has remained mostly unexplored, even though classical online facility assignment problems have mainly dealt with linear and general metric spaces. We fill this gap by considering the following four basic geometric settings: equilateral triangles, rectangles, regular $n$-polygons, and circles. The facilities are situated at fixed positions on the boundary, and customers appear sequentially on the boundary. A customer needs to be assigned immediately without any information about future customer arrivals. We study a natural greedy algorithm. First, we study an equilateral triangle with three facilities at its corners; customers can appear anywhere on the boundary. We then analyze regular $n$-sided polygons, obtaining a competitive ratio of $2n-1$, showing that the algorithm performance degrades linearly with the number of corner points for polygons. For the circular configuration, the competitive ratio is $2n-1$ when the distance between two adjacent facilities is the same. And the competitive ratios are $n^2-n+1$ and $2^n - 1$ for varying distances linearly and exponentially respectively. Each facility has a fixed capacity proportional to the geometric configuration, and customers appear only along the boundary edges. Our results also show that simpler geometric configurations have more efficient performance bounds and that spacing facilities uniformly apart prevent worst-case scenarios. The findings have many practical implications because large networks of facilities are best partitioned into smaller and geometrically simple pieces to guarantee good overall performance.
△ Less
Submitted 6 April, 2025;
originally announced April 2025.
-
WorldPrompter: Traversable Text-to-Scene Generation
Authors:
Zhaoyang Zhang,
Yannick Hold-Geoffroy,
Miloš Hašan,
Chen Ziwen,
Fujun Luan,
Julie Dorsey,
Yiwei Hu
Abstract:
Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporate…
▽ More
Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Enhancing stroke disease classification through machine learning models by feature selection techniques
Authors:
Mahade Hasan,
Farhana Yasmin,
Xue Yu
Abstract:
Heart disease remains a leading cause of mortality and morbidity worldwide, necessitating the development of accurate and reliable predictive models to facilitate early detection and intervention. While state of the art work has focused on various machine learning approaches for predicting heart disease, but they could not able to achieve remarkable accuracy. In response to this need, we applied n…
▽ More
Heart disease remains a leading cause of mortality and morbidity worldwide, necessitating the development of accurate and reliable predictive models to facilitate early detection and intervention. While state of the art work has focused on various machine learning approaches for predicting heart disease, but they could not able to achieve remarkable accuracy. In response to this need, we applied nine machine learning algorithms XGBoost, logistic regression, decision tree, random forest, k-nearest neighbors (KNN), support vector machine (SVM), gaussian naïve bayes (NB gaussian), adaptive boosting, and linear regression to predict heart disease based on a range of physiological indicators. Our approach involved feature selection techniques to identify the most relevant predictors, aimed at refining the models to enhance both performance and interpretability. The models were trained, incorporating processes such as grid search hyperparameter tuning, and cross-validation to minimize overfitting. Additionally, we have developed a novel voting system with feature selection techniques to advance heart disease classification. Furthermore, we have evaluated the models using key performance metrics including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC AUC). Among the models, XGBoost demonstrated exceptional performance, achieving 99% accuracy, precision, F1-Score, 98% recall, and 100% ROC AUC. This study offers a promising approach to early heart disease diagnosis and preventive healthcare.
△ Less
Submitted 11 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Faster Releases, Fewer Risks: A Study on Maven Artifact Vulnerabilities and Lifecycle Management
Authors:
Md Shafiullah Shafin,
Md Fazle Rabbi,
S. M. Mahedy Hasan,
Minhaz F. Zibran
Abstract:
In modern software ecosystems, dependency management plays a critical role in ensuring secure and maintainable applications. However, understanding the relationship between release practices and their impact on vulnerabilities and update cycles remains a challenge. In this study, we analyze the release histories of 10,000 Maven artifacts, covering over 203,000 releases and 1.7 million dependencies…
▽ More
In modern software ecosystems, dependency management plays a critical role in ensuring secure and maintainable applications. However, understanding the relationship between release practices and their impact on vulnerabilities and update cycles remains a challenge. In this study, we analyze the release histories of 10,000 Maven artifacts, covering over 203,000 releases and 1.7 million dependencies. We evaluate how release speed affects software security and lifecycle. Our results show an inverse relationship between release speed and dependency outdatedness. Artifacts with more frequent releases maintain significantly shorter outdated times. We also find that faster release cycles are linked to fewer CVEs in dependency chains, indicating a strong negative correlation. These findings emphasize the importance of accelerated release strategies in reducing security risks and ensuring timely updates. Our research provides valuable insights for software developers, maintainers, and ecosystem managers.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation
Authors:
Md Mahfuz Al Hasan,
Mahdi Zaman,
Abdul Jawad,
Alberto Santamaria-Pang,
Ho Hin Lee,
Ivan Tarapov,
Kyle See,
Md Shah Imran,
Antika Roy,
Yaser Pourmohammadi Fallah,
Navid Asadizanjani,
Reza Forghani
Abstract:
Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D-transformer that: i) leverages the fundamental frequency-domain properties of features for con…
▽ More
Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D-transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transformations (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
△ Less
Submitted 31 March, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Insights into Dependency Maintenance Trends in the Maven Ecosystem
Authors:
Barisha Chowdhury,
Md Fazle Rabbi,
S. M. Mahedy Hasan,
Minhaz F. Zibran
Abstract:
As modern software development increasingly relies on reusable libraries and components, managing dependencies has become critical for ensuring software stability and security. However, challenges such as outdated dependencies, missed releases, and the complexity of interdependent libraries can significantly impact project maintenance. In this paper, we present a quantitative analysis of the Neo4j…
▽ More
As modern software development increasingly relies on reusable libraries and components, managing dependencies has become critical for ensuring software stability and security. However, challenges such as outdated dependencies, missed releases, and the complexity of interdependent libraries can significantly impact project maintenance. In this paper, we present a quantitative analysis of the Neo4j dataset using the Goblin framework to uncover patterns of freshness in projects with different numbers of dependencies. Our analysis reveals that releases with fewer dependencies have a higher number of missed releases. Additionally, our study shows that the dependencies in the latest releases have positive freshness scores, indicating better software management efficacy. These results can encourage better management practices and contribute to the overall health of software ecosystems.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Assessing the influence of cybersecurity threats and risks on the adoption and growth of digital banking: a systematic literature review
Authors:
Md. Waliullah,
Md Zahin Hossain George,
Md Tarek Hasan,
Md Khorshed Alam,
Mosa Sumaiya Khatun Munira,
Noor Alam Siddiqui
Abstract:
The rapid digitalization of banking services has significantly transformed financial transactions, offering enhanced convenience and efficiency for consumers. However, the increasing reliance on digital banking has also exposed financial institutions and users to a wide range of cybersecurity threats, including phishing, malware, ransomware, data breaches, and unauthorized access. This study syste…
▽ More
The rapid digitalization of banking services has significantly transformed financial transactions, offering enhanced convenience and efficiency for consumers. However, the increasing reliance on digital banking has also exposed financial institutions and users to a wide range of cybersecurity threats, including phishing, malware, ransomware, data breaches, and unauthorized access. This study systematically examines the influence of cybersecurity threats on digital banking security, adoption, and regulatory compliance by conducting a comprehensive review of 78 peer-reviewed articles published between 2015 and 2024. Using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology, this research critically evaluates the most prevalent cyber threats targeting digital banking platforms, the effectiveness of modern security measures, and the role of regulatory frameworks in mitigating financial cybersecurity risks. The findings reveal that phishing and malware attacks remain the most commonly exploited cyber threats, leading to significant financial losses and consumer distrust. Multi-factor authentication (MFA) and biometric security have been widely adopted to combat unauthorized access, while AI-driven fraud detection and blockchain technology offer promising solutions for securing financial transactions. However, the integration of third-party FinTech solutions introduces additional security risks, necessitating stringent regulatory oversight and cybersecurity protocols. The study also highlights that compliance with global cybersecurity regulations, such as GDPR, PSD2, and GLBA, enhances digital banking security by enforcing strict authentication measures, encryption protocols, and real-time fraud monitoring.
△ Less
Submitted 22 March, 2025;
originally announced March 2025.
-
Leveraging Language Models for Analyzing Longitudinal Experiential Data in Education
Authors:
Ahatsham Hayat,
Bilal Khan,
Mohammad Rashedul Hasan
Abstract:
We propose a novel approach to leveraging pre-trained language models (LMs) for early forecasting of academic trajectories in STEM students using high-dimensional longitudinal experiential data. This data, which captures students' study-related activities, behaviors, and psychological states, offers valuable insights for forecasting-based interventions. Key challenges in handling such data include…
▽ More
We propose a novel approach to leveraging pre-trained language models (LMs) for early forecasting of academic trajectories in STEM students using high-dimensional longitudinal experiential data. This data, which captures students' study-related activities, behaviors, and psychological states, offers valuable insights for forecasting-based interventions. Key challenges in handling such data include high rates of missing values, limited dataset size due to costly data collection, and complex temporal variability across modalities. Our approach addresses these issues through a comprehensive data enrichment process, integrating strategies for managing missing values, augmenting data, and embedding task-specific instructions and contextual cues to enhance the models' capacity for learning temporal patterns. Through extensive experiments on a curated student learning dataset, we evaluate both encoder-decoder and decoder-only LMs. While our findings show that LMs effectively integrate data across modalities and exhibit resilience to missing data, they primarily rely on high-level statistical patterns rather than demonstrating a deeper understanding of temporal dynamics. Furthermore, their ability to interpret explicit temporal information remains limited. This work advances educational data science by highlighting both the potential and limitations of LMs in modeling student trajectories for early intervention based on longitudinal experiential data.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
A Space-Efficient Algorithm for Longest Common Almost Increasing Subsequence of Two Sequences
Authors:
Md Tanzeem Rahat,
Md. Manzurul Hasan,
Debajyoti Mondal
Abstract:
Let $A$ and $B$ be two number sequences of length $n$ and $m$, respectively, where $m\le n$. Given a positive number $δ$, a common almost increasing sequence $s_1\ldots s_k$ is a common subsequence for both $A$ and $B$ such that for all $2\le i\le k$, $s_i+δ> \max_{1\le j < i} s_j$. The LCaIS problem seeks to find the longest common almost increasing subsequence (LCaIS) of $A$ and $B$. An LCaIS ca…
▽ More
Let $A$ and $B$ be two number sequences of length $n$ and $m$, respectively, where $m\le n$. Given a positive number $δ$, a common almost increasing sequence $s_1\ldots s_k$ is a common subsequence for both $A$ and $B$ such that for all $2\le i\le k$, $s_i+δ> \max_{1\le j < i} s_j$. The LCaIS problem seeks to find the longest common almost increasing subsequence (LCaIS) of $A$ and $B$. An LCaIS can be computed in $O(nm\ell)$ time and $O(nm)$ space [Ta, Shieh, Lu (TCS 2021)], where $\ell$ is the length of the LCaIS of $A$ and $B$. In this paper we first give an $O(nm\ell)$-time and $O(n+m\ell)$-space algorithm to find LCaIS, which improves the space complexity. We then design an $O((n+m)\log n +\mathcal{M}\log \mathcal{M} + \mathcal{C}\ell)$-time and $O(\mathcal{M}(\ell+\log \mathcal{M}))$-space algorithm, which is faster when the number of matching pairs $\mathcal{M}$ and the number of compatible matching pairs $\mathcal{C}$ are in $o(nm/\log m)$.
△ Less
Submitted 7 May, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Designing and Deploying AI Models for Sustainable Logistics Optimization: A Case Study on Eco-Efficient Supply Chains in the USA
Authors:
Reza E Rabbi Shawon,
MD Rokibul Hasan,
Md Anisur Rahman,
Mohamed Ghandri,
Iman Ahmed Lamari,
Mohammed Kawsar,
Rubi Akter
Abstract:
The rapid evolution of Artificial Intelligence (AI) and Machine Learning (ML) has significantly transformed logistics and supply chain management, particularly in the pursuit of sustainability and eco-efficiency. This study explores AI-based methodologies for optimizing logistics operations in the USA, focusing on reducing environmental impact, improving fuel efficiency, and minimizing costs. Key…
▽ More
The rapid evolution of Artificial Intelligence (AI) and Machine Learning (ML) has significantly transformed logistics and supply chain management, particularly in the pursuit of sustainability and eco-efficiency. This study explores AI-based methodologies for optimizing logistics operations in the USA, focusing on reducing environmental impact, improving fuel efficiency, and minimizing costs. Key AI applications include predictive analytics for demand forecasting, route optimization through machine learning, and AI-powered fuel efficiency strategies. Various models, such as Linear Regression, XGBoost, Support Vector Machine, and Neural Networks, are applied to real-world logistics datasets to reduce carbon emissions based on logistics operations, optimize travel routes to minimize distance and travel time, and predict future deliveries to plan optimal routes. Other models such as K-Means and DBSCAN are also used to optimize travel routes to minimize distance and travel time for logistics operations. This study utilizes datasets from logistics companies' databases. The study also assesses model performance using metrics such as mean absolute error (MAE), mean squared error (MSE), and R2 score. This study also explores how these models can be deployed to various platforms for real-time logistics and supply chain use. The models are also examined through a thorough case study, highlighting best practices and regulatory frameworks that promote sustainability. The findings demonstrate AI's potential to enhance logistics efficiency, reduce carbon footprints, and contribute to a more resilient and adaptive supply chain ecosystem.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Authors:
James Burgess,
Jeffrey J Nirschl,
Laura Bravo-Sánchez,
Alejandro Lozano,
Sanket Rajan Gupte,
Jesus G. Galaz-Montoya,
Yuhui Zhang,
Yuchang Su,
Disha Bhowmik,
Zachary Coman,
Sarina M. Hasan,
Alexandra Johannesson,
William D. Leineweber,
Malvika G Nair,
Ridhi Yarlagadda,
Connor Zuraski,
Wah Chiu,
Sarah Cohen,
Jan N. Hansen,
Manuel D Leonetti,
Chad Liu,
Emma Lundberg,
Serena Yeung-Levy
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimo…
▽ More
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveal a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification
Authors:
Shoaib Meraj Sami,
Md Mahedi Hasan,
Nasser M. Nasrabadi,
Raghuveer Rao
Abstract:
In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granul…
▽ More
In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granularity gaps. In addition, the multi-sensor images can be misaligned due to intricate background clutters, fluctuating illumination conditions, and uncontrolled sensor settings. In this paper, to overcome these issues, we decompose, align, and fuse multiple image sensor data for target classification. We extract the domain-specific and domain-invariant features from each sensor data. We propose to develop a shared unified discrete token (UDT) space between sensors to reduce the domain and granularity gaps. Additionally, we develop an alignment module to overcome the misalignment between multi-sensors and emphasize the discriminative representation of the UDT space. In the alignment module, we introduce sparsity constraints to provide a better cross-modal representation of the UDT space and robustness against various sensor settings. We achieve superior classification performance compared to single-modality classifiers and several state-of-the-art multi-modal fusion algorithms on four multi-sensor ATR datasets.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Solving Bayesian inverse problems with diffusion priors and off-policy RL
Authors:
Luca Scimeca,
Siddarth Venkatraman,
Moksh Jain,
Minsu Kim,
Marcin Sendera,
Mohsin Hasan,
Luke Rowe,
Sarthak Mittal,
Pablo Lemos,
Emmanuel Bengio,
Alexandre Adam,
Jarrid Rector-Brooks,
Yashar Hezaveh,
Laurence Perreault-Levasseur,
Yoshua Bengio,
Glen Berseth,
Nikolay Malkin
Abstract:
This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems…
▽ More
This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Authors:
Mst. Fahmida Sultana Naznin,
Adnan Ibney Faruq,
Mostafa Rifat Tazwar,
Md Jobayer,
Md. Mehedi Hasan Shawon,
Md Rakibul Hasan
Abstract:
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the…
▽ More
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with the Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL-Context-driven Sequential TRansfer Learning-achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROGUE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at TBA.
△ Less
Submitted 21 February, 2025;
originally announced March 2025.
-
Free Your Hands: Lightweight Relightable Turntable Capture Pipeline
Authors:
Jiahui Fan,
Fujun Luan,
Jian Yang,
Miloš Hašan,
Beibei Wang
Abstract:
Novel view synthesis (NVS) from multiple captured photos of an object is a widely studied problem. Achieving high quality typically requires dense sampling of input views, which can lead to frustrating and tedious manual labor. Manually positioning cameras to maintain an optimal desired distribution can be difficult for humans, and if a good distribution is found, it is not easy to replicate. Addi…
▽ More
Novel view synthesis (NVS) from multiple captured photos of an object is a widely studied problem. Achieving high quality typically requires dense sampling of input views, which can lead to frustrating and tedious manual labor. Manually positioning cameras to maintain an optimal desired distribution can be difficult for humans, and if a good distribution is found, it is not easy to replicate. Additionally, the captured data can suffer from motion blur and defocus due to human error. In this paper, we present a lightweight object capture pipeline to reduce the manual workload and standardize the acquisition setup. We use a consumer turntable to carry the object and a tripod to hold the camera. As the turntable rotates, we automatically capture dense samples from various views and lighting conditions; we can repeat this for several camera positions. This way, we can easily capture hundreds of valid images in several minutes without hands-on effort. However, in the object reference frame, the light conditions vary; this is harmful to a standard NVS method like 3D Gaussian splatting (3DGS) which assumes fixed lighting. We design a neural radiance representation conditioned on light rotations, which addresses this issue and allows relightability as an additional benefit. We demonstrate our pipeline using 3DGS as the underlying framework, achieving competitive quality compared to previous methods with exhaustive acquisition and showcasing its potential for relighting and harmonization tasks.
△ Less
Submitted 14 April, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization
Authors:
Zhanhong Jiang,
Md Zahid Hasan,
Aditya Balu,
Joshua R. Waite,
Genyi Huang,
Soumik Sarkar
Abstract:
Stochastic optimization methods have actively been playing a critical role in modern machine learning algorithms to deliver decent performance. While numerous works have proposed and developed diverse approaches, first-order and second-order methods are in entirely different situations. The former is significantly pivotal and dominating in emerging deep learning but only leads convergence to a sta…
▽ More
Stochastic optimization methods have actively been playing a critical role in modern machine learning algorithms to deliver decent performance. While numerous works have proposed and developed diverse approaches, first-order and second-order methods are in entirely different situations. The former is significantly pivotal and dominating in emerging deep learning but only leads convergence to a stationary point. However, second-order methods are less popular due to their computational intensity in large-dimensional problems. This paper presents a novel method that leverages both the first-order and second-order methods in a unified algorithmic framework, termed FUSE, from which a practical version (PV) is derived accordingly. FUSE-PV stands as a simple yet efficient optimization method involving a switch-over between first and second orders. Additionally, we develop different criteria that determine when to switch. FUSE-PV has provably shown a smaller computational complexity than SGD and Adam. To validate our proposed scheme, we present an ablation study on several simple test functions and show a comparison with baselines for benchmark datasets.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders
Authors:
Souvika Sarkar,
Md. Najib Hasan,
Santu Karmaker
Abstract:
Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. While recent Large Decoder Based models (LLMs), such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, t…
▽ More
Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. While recent Large Decoder Based models (LLMs), such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, their effectiveness in Bangla remains largely unexplored. In this paper, we establish the first benchmark comparing decoder-based LLMs with classic encoder-based models for Zero-Shot Multi-Label Classification (Zero-Shot-MLC) task in Bangla. Our evaluation of 32 state-of-the-art models reveals that, existing so-called powerful encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task, suggesting a need for more research and resources for Bangla NLP.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
BdSLW401: Transformer-Based Word-Level Bangla Sign Language Recognition Using Relative Quantization Encoding (RQE)
Authors:
Husne Ara Rubaiyeat,
Njayou Youssouf,
Md Kamrul Hasan,
Hasan Mahmud
Abstract:
Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce…
▽ More
Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach anchoring landmarks to physiological reference points and quantize motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in 44.3% WER reduction in WLASL100, 21.0% in SignBD-200, and significant gains in BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, achieves improvements in pose consistency at the cost of small trade-offs in lateral view recognition. The attention graphs prove that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. Introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
CQ CNN: A Hybrid Classical Quantum Convolutional Neural Network for Alzheimer's Disease Detection Using Diffusion Generated and U Net Segmented 3D MRI
Authors:
Mominul Islam,
Mohammad Junayed Hasan,
M. R. C. Mahdy
Abstract:
The detection of Alzheimer disease (AD) from clinical MRI data is an active area of research in medical imaging. Recent advances in quantum computing, particularly the integration of parameterized quantum circuits (PQCs) with classical machine learning architectures, offer new opportunities to develop models that may outperform traditional methods. However, quantum machine learning (QML) remains i…
▽ More
The detection of Alzheimer disease (AD) from clinical MRI data is an active area of research in medical imaging. Recent advances in quantum computing, particularly the integration of parameterized quantum circuits (PQCs) with classical machine learning architectures, offer new opportunities to develop models that may outperform traditional methods. However, quantum machine learning (QML) remains in its early stages and requires further experimental analysis to better understand its behavior and limitations. In this paper, we propose an end to end hybrid classical quantum convolutional neural network (CQ CNN) for AD detection using clinically formatted 3D MRI data. Our approach involves developing a framework to make 3D MRI data usable for machine learning, designing and training a brain tissue segmentation model (Skull Net), and training a diffusion model to generate synthetic images for the minority class. Our converged models exhibit potential quantum advantages, achieving higher accuracy in fewer epochs than classical models. The proposed beta8 3 qubit model achieves an accuracy of 97.50%, surpassing state of the art (SOTA) models while requiring significantly fewer computational resources. In particular, the architecture employs only 13K parameters (0.48 MB), reducing the parameter count by more than 99.99% compared to current SOTA models. Furthermore, the diffusion-generated data used to train our quantum models, in conjunction with real samples, preserve clinical structural standards, representing a notable first in the field of QML. We conclude that CQCNN architecture like models, with further improvements in gradient optimization techniques, could become a viable option and even a potential alternative to classical models for AD detection, especially in data limited and resource constrained clinical settings.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads
Authors:
Istiaq Ahmed Fahad,
Abdullah Ibne Hanif Arean,
Nazmus Sakib Ahmed,
Mahmudul Hasan
Abstract:
Automatic Vehicle Detection (AVD) in diverse driving environments presents unique challenges due to varying lighting conditions, road types, and vehicle types. Traditional methods, such as YOLO and Faster R-CNN, often struggle to cope with these complexities. As computer vision evolves, combining Convolutional Neural Networks (CNNs) with Transformer-based approaches offers promising opportunities…
▽ More
Automatic Vehicle Detection (AVD) in diverse driving environments presents unique challenges due to varying lighting conditions, road types, and vehicle types. Traditional methods, such as YOLO and Faster R-CNN, often struggle to cope with these complexities. As computer vision evolves, combining Convolutional Neural Networks (CNNs) with Transformer-based approaches offers promising opportunities for improving detection accuracy and efficiency. This study is the first to experiment with Detection Transformer (DETR) for automatic vehicle detection in complex and varied settings. We employ a Collaborative Hybrid Assignments Training scheme, Co-DETR, to enhance feature learning and attention mechanisms in DETR. By leveraging versatile label assignment strategies and introducing multiple parallel auxiliary heads, we provide more effective supervision during training and extract positive coordinates to boost training efficiency. Through extensive experiments on DETR variants and YOLO models, conducted using the BadODD dataset, we demonstrate the advantages of our approach. Our method achieves superior results, and improved accuracy in diverse conditions, making it practical for real-world deployment. This work significantly advances autonomous navigation technology and opens new research avenues in object detection for autonomous vehicles. By integrating the strengths of CNNs and Transformers, we highlight the potential of DETR for robust and efficient vehicle detection in challenging driving environments.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Authors:
Mohamed Bayan Kmainasi,
Abul Hasnat,
Md Arid Hasan,
Ali Ezzat Shahroor,
Firoj Alam
Abstract:
The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to label detection and the generation of explanation-based ra…
▽ More
The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to label detection and the generation of explanation-based rationales for predicted labels. To address this challenge, we introduce MemeIntel, an explanation-enhanced dataset for propaganda memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results demonstrate that this approach significantly improves performance over the base model for both \textbf{label detection} and explanation generation, outperforming the current state-of-the-art with an absolute improvement of ~3% on ArMeme and ~7% on Hateful Memes. For reproducibility and future research, we aim to make the MemeIntel dataset and experimental resources publicly available.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
Reasoning About Persuasion: Can LLMs Enable Explainable Propaganda Detection?
Authors:
Maram Hasanain,
Md Arid Hasan,
Mohamed Bayan Kmainasi,
Elisa Sartori,
Ali Ezzat Shahroor,
Giovanni Da San Martino,
Firoj Alam
Abstract:
There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i…
▽ More
There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
A CNN Approach to Automated Detection and Classification of Brain Tumors
Authors:
Md. Zahid Hasan,
Abdullah Tamim,
D. M. Asadujjaman,
Md. Mahfujur Rahman,
Md. Abu Ahnaf Mollick,
Nosin Anjum Dristi,
Abdullah-Al-Noman
Abstract:
Brain tumors require an assessment to ensure timely diagnosis and effective patient treatment. Morphological factors such as size, location, texture, and variable appearance complicate tumor inspection. Medical imaging presents challenges, including noise and incomplete images. This research article presents a methodology for processing Magnetic Resonance Imaging (MRI) data, encompassing technique…
▽ More
Brain tumors require an assessment to ensure timely diagnosis and effective patient treatment. Morphological factors such as size, location, texture, and variable appearance complicate tumor inspection. Medical imaging presents challenges, including noise and incomplete images. This research article presents a methodology for processing Magnetic Resonance Imaging (MRI) data, encompassing techniques for image classification and denoising. The effective use of MRI images allows medical professionals to detect brain disorders, including tumors. This research aims to categorize healthy brain tissue and brain tumors by analyzing the provided MRI data. Unlike alternative methods like Computed Tomography (CT), MRI technology offers a more detailed representation of internal anatomical components, making it a suitable option for studying data related to brain tumors. The MRI picture is first subjected to a denoising technique utilizing an Anisotropic diffusion filter. The dataset utilized for the models creation is a publicly accessible and validated Brain Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE was employed for data augmentation and dataset balancing. Convolutional Neural Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for the classification procedure. EfficientNet attained an accuracy of 98%, the highest recorded.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Distributed Approach to Haskell Based Applications Refactoring with LLMs Based Multi-Agent Systems
Authors:
Shahbaz Siddeeq,
Zeeshan Rasheed,
Malik Abdul Sami,
Mahade Hasan,
Muhammad Waseem,
Jussi Rasku,
Mika Saari,
Kai-Kristian Kemell,
Pekka Abrahamsson
Abstract:
We present a large language models (LLMs) based multi-agent system to automate the refactoring of Haskell codebases. The multi-agent system consists of specialized agents performing tasks such as context analysis, refactoring, validation, and testing. Refactoring improvements are using metrics such as cyclomatic complexity, run-time, and memory allocation. Experimental evaluations conducted on Has…
▽ More
We present a large language models (LLMs) based multi-agent system to automate the refactoring of Haskell codebases. The multi-agent system consists of specialized agents performing tasks such as context analysis, refactoring, validation, and testing. Refactoring improvements are using metrics such as cyclomatic complexity, run-time, and memory allocation. Experimental evaluations conducted on Haskell codebases demonstrate improvements in code quality. Cyclomatic complexity was reduced by 13.64% and 47.06% in the respective codebases. Memory allocation improved by 4.17% and 41.73%, while runtime efficiency increased by up to 50%. These metrics highlight the systems ability to optimize Haskells functional paradigms while maintaining correctness and scalability. Results show reductions in complexity and performance enhancements across codebases. The integration of LLMs based multi-agent system enables precise task execution and inter-agent collaboration, addressing the challenges of refactoring in functional programming. This approach aims to address the challenges of refactoring functional programming languages through distributed and modular systems.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models
Authors:
Siddarth Venkatraman,
Mohsin Hasan,
Minsu Kim,
Luca Scimeca,
Marcin Sendera,
Yoshua Bengio,
Glen Berseth,
Nikolay Malkin
Abstract:
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_θ(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_θ(\mathbf{x})$ is straightforward, but sampling from a posterio…
▽ More
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_θ(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_θ(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_θ(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_θ(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably both with current amortized and non-amortized inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
An Optimized YOLOv5 Based Approach For Real-time Vehicle Detection At Road Intersections Using Fisheye Cameras
Authors:
Md. Jahin Alam,
Muhammad Zubair Hasan,
Md Maisoon Rahman,
Md Awsafur Rahman,
Najibul Haque Sarker,
Shariar Azad,
Tasnim Nishat Islam,
Bishmoy Paul,
Tanvir Anjum,
Barproda Halder,
Shaikh Anowarul Fattah
Abstract:
Real time vehicle detection is a challenging task for urban traffic surveillance. Increase in urbanization leads to increase in accidents and traffic congestion in junction areas resulting in delayed travel time. In order to solve these problems, an intelligent system utilizing automatic detection and tracking system is significant. But this becomes a challenging task at road intersection areas wh…
▽ More
Real time vehicle detection is a challenging task for urban traffic surveillance. Increase in urbanization leads to increase in accidents and traffic congestion in junction areas resulting in delayed travel time. In order to solve these problems, an intelligent system utilizing automatic detection and tracking system is significant. But this becomes a challenging task at road intersection areas which require a wide range of field view. For this reason, fish eye cameras are widely used in real time vehicle detection purpose to provide large area coverage and 360 degree view at junctions. However, it introduces challenges such as light glare from vehicles and street lights, shadow, non-linear distortion, scaling issues of vehicles and proper localization of small vehicles. To overcome each of these challenges, a modified YOLOv5 object detection scheme is proposed. YOLOv5 is a deep learning oriented convolutional neural network (CNN) based object detection method. The proposed scheme for detecting vehicles in fish-eye images consists of a light-weight day-night CNN classifier so that two different solutions can be implemented to address the day-night detection issues. Furthurmore, challenging instances are upsampled in the dataset for proper localization of vehicles and later on the detection model is ensembled and trained in different combination of vehicle datasets for better generalization, detection and accuracy. For testing, a real world fisheye dataset provided by the Video and Image Processing (VIP) Cup organizer ISSD has been used which includes images from video clips of different fisheye cameras at junction of different cities during day and night time. Experimental results show that our proposed model has outperformed the YOLOv5 model on the dataset by 13.7% mAP @ 0.5.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
Querying Databases with Function Calling
Authors:
Connor Shorten,
Charles Pierse,
Thomas Benjamin Smith,
Karel D'Oosterlinck,
Tuana Celik,
Erika Cardenas,
Leonie Monigatti,
Mohd Shukri Hasan,
Edward Schmuhl,
Daniel Williams,
Aravind Kesiraju,
Bob van Luijt
Abstract:
The capabilities of Large Language Models (LLMs) are rapidly accelerating largely thanks to their integration with external tools. Querying databases is among the most effective of these integrations, enabling LLMs to access private or continually updating data. While Function Calling is the most common method for interfacing external tools to LLMs, its application to database querying as a tool h…
▽ More
The capabilities of Large Language Models (LLMs) are rapidly accelerating largely thanks to their integration with external tools. Querying databases is among the most effective of these integrations, enabling LLMs to access private or continually updating data. While Function Calling is the most common method for interfacing external tools to LLMs, its application to database querying as a tool has been underexplored. We propose a tool definition for database querying that unifies accessing data with search queries, filters, or a combination both, as well as transforming results with aggregation and groupby operators. To evaluate its effectiveness, we conduct a study with 8 LLMs spanning 5 model families. We present a novel pipeline adapting the Gorilla LLM framework to create synthetic database schemas and queries. We primarily evaluate the models with the Exact Match of predicted and ground truth query APIs. Among the models tested, Claude 3.5 Sonnet achieves the highest performance with an Exact Match score of 74.3%, followed by GPT-4o mini at 73.7%, and GPT-4o at 71.8%. We further breakdown these results per API component utilized and across synthetic use cases. We find that LLMs are highly effective at utilizing operators on boolean properties, but struggle with text property filters. Across use cases we find robust results with the higher performing models such as GPT-4o, but significant performance variance across use cases from lower performing models. We additionally conduct ablation studies exploring the impact of parallel tool calling, adding a rationale as an argument of the tool call, using a separate tool per database collection, and tool calling with structured outputs. Our findings demonstrate the effectiveness of enabling LLMs to query databases with Function Calling. We have open-sourced our experimental code and results at github.com/weaviate/gorilla.
△ Less
Submitted 23 January, 2025;
originally announced February 2025.
-
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
Authors:
Joshua R. Waite,
Md. Zahid Hasan,
Qisai Liu,
Zhanhong Jiang,
Chinmay Hegde,
Soumik Sarkar
Abstract:
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insuffici…
▽ More
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method utilizes the RL agent to manipulate objects within an indoor setting to create synthetic data for fine-tuning to address certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent to generate informative data that efficiently fine-tune the VLM over the targeted task (e.g. spatial reasoning). The key contribution of this work is developing a framework where the RL agent serves as an informative data sampling tool and assists the VLM in order to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process to address the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data allows us to have precise control over each scene and generate granular ground truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, which demonstrates the benefits of using RL-guided data generation in vision-language tasks.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
Motion-enhancement to Echocardiography Segmentation via Inserting a Temporal Attention Module: An Efficient, Adaptable, and Scalable Approach
Authors:
Md. Kamrul Hasan,
Guang Yang,
Choon Hwai Yap
Abstract:
Cardiac anatomy segmentation is essential for clinical assessment of cardiac function and disease diagnosis to inform treatment and intervention. In performing segmentation, deep learning (DL) algorithms improved accuracy significantly compared to traditional image processing approaches. More recently, studies showed that enhancing DL segmentation with motion information can further improve it. A…
▽ More
Cardiac anatomy segmentation is essential for clinical assessment of cardiac function and disease diagnosis to inform treatment and intervention. In performing segmentation, deep learning (DL) algorithms improved accuracy significantly compared to traditional image processing approaches. More recently, studies showed that enhancing DL segmentation with motion information can further improve it. A range of methods for injecting motion information has been proposed, but many of them increase the dimensionality of input images (which is computationally expensive) or have not used an optimal method to insert motion information, such as non-DL registration, non-attention-based networks or single-headed attention. Here, we present a novel, computation-efficient alternative where a novel, scalable temporal attention module (TAM) extracts temporal feature interactions multiple times and where TAM has a multi-headed, KQV projection cross-attention architecture. The module can be seamlessly integrated into a wide range of existing CNN- or Transformer-based networks, providing novel flexibility for inclusion in future implementations. Extensive evaluations on different cardiac datasets, 2D echocardiography (CAMUS), and 3D echocardiography (MITEA) demonstrate the model's effectiveness when integrated into well-established backbone networks like UNet, FCN8s, UNetR, SwinUNetR, and the recent I2UNet. We further find that the optimized TAM-enhanced FCN8s network performs well compared to contemporary alternatives. Our results confirm TAM's robustness, scalability, and generalizability across diverse datasets and backbones.
△ Less
Submitted 24 January, 2025;
originally announced January 2025.
-
AI-Enhanced Decision-Making for Sustainable Supply Chains: Reducing Carbon Footprints in the USA
Authors:
MD Rokibul Hasan
Abstract:
Organizations increasingly need to reassess their supply chain strategies in the rapidly modernizing world towards sustainability. This is particularly true in the United States, where supply chains are very extensive and consume a large number of resources. This research paper discusses how AI can support decision-making for sustainable supply chains with a special focus on carbon footprints. The…
▽ More
Organizations increasingly need to reassess their supply chain strategies in the rapidly modernizing world towards sustainability. This is particularly true in the United States, where supply chains are very extensive and consume a large number of resources. This research paper discusses how AI can support decision-making for sustainable supply chains with a special focus on carbon footprints. These AI technologies, including machine learning, predictive analytics, and optimization algorithms, will enable companies to be more efficient, reduce emissions, and display regulatory and consumer demands for sustainability, among other aspects. The paper reviews challenges and opportunities regarding implementing AI-driven solutions to promote sustainable supply chain practices in the USA.
△ Less
Submitted 8 December, 2024;
originally announced January 2025.
-
Multimodal Marvels of Deep Learning in Medical Diagnosis: A Comprehensive Review of COVID-19 Detection
Authors:
Md Shofiqul Islam,
Khondokar Fida Hasan,
Hasibul Hossain Shajeeb,
Humayan Kabir Rana,
Md Saifur Rahmand,
Md Munirul Hasan,
AKM Azad,
Ibrahim Abdullah,
Mohammad Ali Moni
Abstract:
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilienc…
▽ More
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (ie, cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy of 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
△ Less
Submitted 21 January, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
TWIX: Automatically Reconstructing Structured Data from Templatized Documents
Authors:
Yiming Lin,
Mawil Hasan,
Rohan Kosalge,
Alvin Cheung,
Aditya G. Parameswaran
Abstract:
Many documents, that we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and often require significant human effort, whe…
▽ More
Many documents, that we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and often require significant human effort, when extracting tables or values given user-specified fields from documents. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents, modeling the visual and structural commonalities across documents. Data extraction based on this predicted template provides a more principled, accurate, and efficient solution at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from templatized documents. TWIX achieves over 90% precision and recall on average, outperforming tools from industry: Textract and Azure Document Intelligence, and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. TWIX scales easily to large datasets and is 734X faster and 5836X cheaper than vision-based LLMs for extracting data from a large document collection with 817 pages.
△ Less
Submitted 11 January, 2025;
originally announced January 2025.
-
A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision
Authors:
Ali Rohan,
Md Junayed Hasan,
Andrei Petrovski
Abstract:
Depth estimation (DE) provides spatial information about a scene and enables tasks such as 3D reconstruction, object detection, and scene understanding. Recently, there has been an increasing interest in using deep learning (DL)-based methods for DE. Traditional techniques rely on handcrafted features that often struggle to generalise to diverse scenes and require extensive manual tuning. However,…
▽ More
Depth estimation (DE) provides spatial information about a scene and enables tasks such as 3D reconstruction, object detection, and scene understanding. Recently, there has been an increasing interest in using deep learning (DL)-based methods for DE. Traditional techniques rely on handcrafted features that often struggle to generalise to diverse scenes and require extensive manual tuning. However, DL models for DE can automatically extract relevant features from input data, adapt to various scene conditions, and generalise well to unseen environments. Numerous DL-based methods have been developed, making it necessary to survey and synthesize the state-of-the-art (SOTA). Previous reviews on DE have mainly focused on either monocular or stereo-based techniques, rather than comprehensively reviewing DE. Furthermore, to the best of our knowledge, there is no systematic literature review (SLR) that comprehensively focuses on DE. Therefore, this SLR study is being conducted. Initially, electronic databases were searched for relevant publications, resulting in 1284 publications. Using defined exclusion and quality criteria, 128 publications were shortlisted and further filtered to select 59 high-quality primary studies. These studies were analysed to extract data and answer defined research questions. Based on the results, DL methods were developed for mainly three different types of DE: monocular, stereo, and multi-view. 20 publicly available datasets were used to train, test, and evaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most used datasets. 29 evaluation metrics were used to assess the performance of DE. 35 base models were reported in the primary studies, and the top five most-used base models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally, the lack of ground truth data was among the most significant challenges reported by primary studies.
△ Less
Submitted 9 January, 2025;
originally announced January 2025.
-
Design of a 6-bit Threshold Inverter Quantization (TIQ) Flash Analog to Digital Converter (ADC)
Authors:
Noyon Kumar Sarkar,
Moumita Roy,
Md. Tariq Hasan
Abstract:
An ADC is used to convert analog signals into binary signals. Compared with many other types of ADCs, flash converters are incredibly quick. A typical Flash ADC consists of 2n resistors, 2n-1 op-amp comparators, and an encoder which requires more area. The resistors and comparators can be eliminated by using threshold inverter quantization (TIQ) comparators. As a voltage comparator, TIQ technique…
▽ More
An ADC is used to convert analog signals into binary signals. Compared with many other types of ADCs, flash converters are incredibly quick. A typical Flash ADC consists of 2n resistors, 2n-1 op-amp comparators, and an encoder which requires more area. The resistors and comparators can be eliminated by using threshold inverter quantization (TIQ) comparators. As a voltage comparator, TIQ technique uses two cascaded CMOS inverters. So that there will be no variation in the fabrication process, and temperature. A 6-bit flash ADC based on threshold inverter quantization (TIQ) comparator was designed and software implementation was performed employing a fat tree encoder with 0.25 micrometer CMOS technology. The design consists of 2n-1 TIQ comparator arrays, a gain booster, a 1-out-of-n code generator, and a fat tree encoder. This TIQ flash ADC is simulated with the Tanner EDA Tool. Here the supply voltage is 2.5 V, input frequency of 10 kHz and 10 MHz. The power consumption of the ADC is 6.25 mW, and the propagation delay is 1.07 microseconds for 10 kHz input frequency. For 10 MHz input frequency, power consumption is 12.12 mW and propagation delay is 947.14 ms.
△ Less
Submitted 8 January, 2025;
originally announced January 2025.
-
Labels Generated by Large Language Model Helps Measuring People's Empathy in Vitro
Authors:
Md Rakibul Hasan,
Yue Yao,
Md Zakir Hossain,
Aneesh Krishna,
Imre Rudas,
Shafin Rahman,
Tom Gedeon
Abstract:
Large language models (LLMs) have revolutionised numerous fields, with LLM-as-a-service (LLMSaaS) having a strong generalisation ability that offers accessible solutions directly without the need for costly training. In contrast to the widely studied prompt engineering for task solving directly (in vivo), this paper explores its potential in in-vitro applications. These involve using LLM to genera…
▽ More
Large language models (LLMs) have revolutionised numerous fields, with LLM-as-a-service (LLMSaaS) having a strong generalisation ability that offers accessible solutions directly without the need for costly training. In contrast to the widely studied prompt engineering for task solving directly (in vivo), this paper explores its potential in in-vitro applications. These involve using LLM to generate labels to help the supervised training of mainstream models by (1) noisy label correction and (2) training data augmentation with LLM-generated labels. In this paper, we evaluate this approach in the emerging field of empathy computing -- automating the prediction of psychological questionnaire outcomes from inputs like text sequences. Specifically, crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. By leveraging LLM-generated labels to train pre-trained language models (PLMs) like RoBERTa, we achieve statistically significant accuracy improvements over baselines, achieving a state-of-the-art Pearson correlation coefficient of 0.648 on NewsEmp benchmarks. In addition, we bring insightful discussions, including current challenges in empathy computing, data biases in training data and evaluation metric selection. Code and LLM-generated data are available at https://github.com/hasan-rakibul/LLMPathy (available once the paper is accepted).
△ Less
Submitted 31 December, 2024;
originally announced January 2025.
-
Adaptive Tabu Dropout for Regularization of Deep Neural Network
Authors:
Md. Tarek Hasan,
Arifa Akter,
Mohammad Nazmush Shamael,
Md Al Emran Hossain,
H. M. Mutasim Billah,
Sumayra Islam,
Swakkhar Shatabda
Abstract:
Dropout is an effective strategy for the regularization of deep neural networks. Applying tabu to the units that have been dropped in the recent epoch and retaining them for training ensures diversification in dropout. In this paper, we improve the Tabu Dropout mechanism for training deep neural networks in two ways. Firstly, we propose to use tabu tenure, or the number of epochs a particular unit…
▽ More
Dropout is an effective strategy for the regularization of deep neural networks. Applying tabu to the units that have been dropped in the recent epoch and retaining them for training ensures diversification in dropout. In this paper, we improve the Tabu Dropout mechanism for training deep neural networks in two ways. Firstly, we propose to use tabu tenure, or the number of epochs a particular unit will not be dropped. Different tabu tenures provide diversification to boost the training of deep neural networks based on the search landscape. Secondly, we propose an adaptive tabu algorithm that automatically selects the tabu tenure based on the training performances through epochs. On several standard benchmark datasets, the experimental results show that the adaptive tabu dropout and tabu tenure dropout diversify and perform significantly better compared to the standard dropout and basic tabu dropout mechanisms.
△ Less
Submitted 31 December, 2024;
originally announced January 2025.
-
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Authors:
Mahir Labib Dihan,
Md Tanvir Hassan,
Md Tanvir Parvez,
Md Hasebul Hasan,
Md Almash Alam,
Muhammad Aamir Cheema,
Mohammed Eunus Ali,
Md Rizwan Parvez
Abstract:
Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to ass…
▽ More
Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more amplified when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
△ Less
Submitted 31 December, 2024;
originally announced January 2025.
-
HyperQ-Opt: Q-learning for Hyperparameter Optimization
Authors:
Md. Tarek Hasan
Abstract:
Hyperparameter optimization (HPO) is critical for enhancing the performance of machine learning models, yet it often involves a computationally intensive search across a large parameter space. Traditional approaches such as Grid Search and Random Search suffer from inefficiency and limited scalability, while surrogate models like Sequential Model-based Bayesian Optimization (SMBO) rely heavily on…
▽ More
Hyperparameter optimization (HPO) is critical for enhancing the performance of machine learning models, yet it often involves a computationally intensive search across a large parameter space. Traditional approaches such as Grid Search and Random Search suffer from inefficiency and limited scalability, while surrogate models like Sequential Model-based Bayesian Optimization (SMBO) rely heavily on heuristic predictions that can lead to suboptimal results. This paper presents a novel perspective on HPO by formulating it as a sequential decision-making problem and leveraging Q-learning, a reinforcement learning technique, to optimize hyperparameters. The study explores the works of H.S. Jomaa et al. and Qi et al., which model HPO as a Markov Decision Process (MDP) and utilize Q-learning to iteratively refine hyperparameter settings. The approaches are evaluated for their ability to find optimal or near-optimal configurations within a limited number of trials, demonstrating the potential of reinforcement learning to outperform conventional methods. Additionally, this paper identifies research gaps in existing formulations, including the limitations of discrete search spaces and reliance on heuristic policies, and suggests avenues for future exploration. By shifting the paradigm toward policy-based optimization, this work contributes to advancing HPO methods for scalable and efficient machine learning applications.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.