-
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language
Authors:
Ijazul Haq,
Yingjie Zhang,
Irfan Ali Khan
Abstract:
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images ann…
▽ More
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Advancing Mobile UI Testing by Learning Screen Usage Semantics
Authors:
Safwat Ali Khan
Abstract:
The demand for quality in mobile applications has increased greatly given users' high reliance on them for daily tasks. Developers work tirelessly to ensure that their applications are both functional and user-friendly. In pursuit of this, Automated Input Generation (AIG) tools have emerged as a promising solution for testing mobile applications by simulating user interactions and exploring app fu…
▽ More
The demand for quality in mobile applications has increased greatly given users' high reliance on them for daily tasks. Developers work tirelessly to ensure that their applications are both functional and user-friendly. In pursuit of this, Automated Input Generation (AIG) tools have emerged as a promising solution for testing mobile applications by simulating user interactions and exploring app functionalities. However, these tools face significant challenges in navigating complex Graphical User Interfaces (GUIs), and developers often have trouble understanding their output. More specifically, AIG tools face difficulties in navigating out of certain screens, such as login pages and advertisements, due to a lack of contextual understanding which leads to suboptimal testing coverage. Furthermore, while AIG tools can provide interaction traces consisting of action and screen details, there is limited understanding of its coverage of higher level functionalities, such as logging in, setting alarms, or saving notes. Understanding these covered use cases are essential to ensure comprehensive test coverage of app functionalities. Difficulty in testing mobile UIs can lead to the design of complex interfaces, which can adversely affect users of advanced age who often face usability barriers due to small buttons, cluttered layouts, and unintuitive navigation. There exists many studies that highlight these issues, but automated solutions for improving UI accessibility needs more attention. This research seeks to enhance automated UI testing techniques by learning the screen usage semantics of mobile apps and helping them navigate more efficiently, offer more insights about tested functionalities and also improve the usability of a mobile app's interface by identifying and mitigating UI design issues.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Generating Skyline Explanations for Graph Neural Networks
Authors:
Dazhuo Qiu,
Haolai Che,
Arijit Khan,
Yinghui Wu
Abstract:
This paper proposes a novel approach to generate subgraph explanations for graph neural networks GNNs that simultaneously optimize multiple measures for explainability. Existing GNN explanation methods often compute subgraphs (called ``explanatory subgraphs'') that optimize a pre-defined, single explainability measure, such as fidelity or conciseness. This can lead to biased explanations that cann…
▽ More
This paper proposes a novel approach to generate subgraph explanations for graph neural networks GNNs that simultaneously optimize multiple measures for explainability. Existing GNN explanation methods often compute subgraphs (called ``explanatory subgraphs'') that optimize a pre-defined, single explainability measure, such as fidelity or conciseness. This can lead to biased explanations that cannot provide a comprehensive explanation to clarify the output of GNN models. We introduce skyline explanation, a GNN explanation paradigm that aims to identify k explanatory subgraphs by simultaneously optimizing multiple explainability measures. (1) We formulate skyline explanation generation as a multi-objective optimization problem, and pursue explanations that approximate a skyline set of explanatory subgraphs. We show the hardness for skyline explanation generation. (2) We design efficient algorithms with an onion-peeling approach that strategically removes edges from neighbors of nodes of interests, and incrementally improves explanations as it explores an interpretation domain, with provable quality guarantees. (3) We further develop an algorithm to diversify explanations to provide more comprehensive perspectives. Using real-world graphs, we empirically verify the effectiveness, efficiency, and scalability of our algorithms.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Neural Brain: A Neuroscience-inspired Framework for Embodied Agents
Authors:
Jian Liu,
Xiongtao Shi,
Thai Duy Nguyen,
Haitian Zhang,
Tianxiang Zhang,
Wei Sun,
Yanjie Li,
Athanasios V. Vasilakos,
Giovanni Iacca,
Arshad Ali Khan,
Arvind Kumar,
Jae Won Cho,
Ajmal Mian,
Lihua Xie,
Erik Cambria,
Lin Wang
Abstract:
The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the ris…
▽ More
The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the rise of embodied AI, where autonomous agents, such as humanoid robots, must navigate and manipulate unstructured environments with human-like adaptability. At the core of this challenge lies the concept of Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability. A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities. Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments. This paper introduces a unified framework for the Neural Brain of embodied agents, addressing two fundamental challenges: (1) defining the core components of Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment. To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization. Furthermore, we also review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence. By synthesizing insights from neuroscience, we outline a roadmap towards the development of generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Neural Network Operator-Based Fractal Approximation: Smoothness Preservation and Convergence Analysis
Authors:
Aaqib Ayoub Bhat,
Asif Khan,
M. Mursaleen
Abstract:
This paper presents a new approach of constructing $α$-fractal interpolation functions (FIFs) using neural network operators, integrating concepts from approximation theory. Initially, we construct $α$-fractals utilizing neural network-based operators, providing an approach to generating fractal functions with interpolation properties. Based on the same foundation, we have developed fractal interp…
▽ More
This paper presents a new approach of constructing $α$-fractal interpolation functions (FIFs) using neural network operators, integrating concepts from approximation theory. Initially, we construct $α$-fractals utilizing neural network-based operators, providing an approach to generating fractal functions with interpolation properties. Based on the same foundation, we have developed fractal interpolation functions that utilize only the values of the original function at the nodes or partition points, unlike traditional methods that rely on the entire original function.
Further, we have constructed \(α\)-fractals that preserve the smoothness of functions under certain constraints by employing a four-layered neural network operator, ensuring that if \(f \in C^{r}[a,b]\), then the corresponding fractal \(f^α \in C^{r}[a,b]\). Furthermore, we analyze the convergence of these $α$-fractals to the original function under suitable conditions. The work uses key approximation theory tools, such as the modulus of continuity and interpolation operators, to develop convergence results and uniform approximation error bounds.
△ Less
Submitted 22 March, 2025;
originally announced May 2025.
-
Detecting Concept Drift in Neural Networks Using Chi-squared Goodness of Fit Testing
Authors:
Jacob Glenn Ayers,
Buvaneswari A. Ramanan,
Manzoor A. Khan
Abstract:
As the adoption of deep learning models has grown beyond human capacity for verification, meta-algorithms are needed to ensure reliable model inference. Concept drift detection is a field dedicated to identifying statistical shifts that is underutilized in monitoring neural networks that may encounter inference data with distributional characteristics diverging from their training data. Given the…
▽ More
As the adoption of deep learning models has grown beyond human capacity for verification, meta-algorithms are needed to ensure reliable model inference. Concept drift detection is a field dedicated to identifying statistical shifts that is underutilized in monitoring neural networks that may encounter inference data with distributional characteristics diverging from their training data. Given the wide variety of model architectures, applications, and datasets, it is important that concept drift detection algorithms are adaptable to different inference scenarios. In this paper, we introduce an application of the $χ^2$ Goodness of Fit Hypothesis Test as a drift detection meta-algorithm applied to a multilayer perceptron, a convolutional neural network, and a transformer trained for machine vision as they are exposed to simulated drift during inference. To that end, we demonstrate how unexpected drops in accuracy due to concept drift can be detected without directly examining the inference outputs. Our approach enhances safety by ensuring models are continually evaluated for reliability across varying conditions.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
NMPC-Lander: Nonlinear MPC with Barrier Function for UAV Landing on a Mobile Platform
Authors:
Amber Batool,
Faryal Batool,
Roohan Ahmed Khan,
Muhammad Ahsan Mustafa,
Aleksey Fedoseev,
Dzmitry Tsetserukou
Abstract:
Quadcopters are versatile aerial robots gaining popularity in numerous critical applications. However, their operational effectiveness is constrained by limited battery life and restricted flight range. To address these challenges, autonomous drone landing on stationary or mobile charging and battery-swapping stations has become an essential capability. In this study, we present NMPC-Lander, a nov…
▽ More
Quadcopters are versatile aerial robots gaining popularity in numerous critical applications. However, their operational effectiveness is constrained by limited battery life and restricted flight range. To address these challenges, autonomous drone landing on stationary or mobile charging and battery-swapping stations has become an essential capability. In this study, we present NMPC-Lander, a novel control architecture that integrates Nonlinear Model Predictive Control (NMPC) with Control Barrier Functions (CBF) to achieve precise and safe autonomous landing on both static and dynamic platforms. Our approach employs NMPC for accurate trajectory tracking and landing, while simultaneously incorporating CBF to ensure collision avoidance with static obstacles. Experimental evaluations on the real hardware demonstrate high precision in landing scenarios, with an average final position error of 9.0 cm and 11 cm for stationary and mobile platforms, respectively. Notably, NMPC-Lander outperforms the B-spline combined with the A* planning method by nearly threefold in terms of position tracking, underscoring its superior robustness and practical effectiveness.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
ArrhythmiaVision: Resource-Conscious Deep Learning Models with Visual Explanations for ECG Arrhythmia Classification
Authors:
Zuraiz Baig,
Sidra Nasir,
Rizwan Ahmed Khan,
Muhammad Zeeshan Ul Haque
Abstract:
Cardiac arrhythmias are a leading cause of life-threatening cardiac events, highlighting the urgent need for accurate and timely detection. Electrocardiography (ECG) remains the clinical gold standard for arrhythmia diagnosis; however, manual interpretation is time-consuming, dependent on clinical expertise, and prone to human error. Although deep learning has advanced automated ECG analysis, many…
▽ More
Cardiac arrhythmias are a leading cause of life-threatening cardiac events, highlighting the urgent need for accurate and timely detection. Electrocardiography (ECG) remains the clinical gold standard for arrhythmia diagnosis; however, manual interpretation is time-consuming, dependent on clinical expertise, and prone to human error. Although deep learning has advanced automated ECG analysis, many existing models abstract away the signal's intrinsic temporal and morphological features, lack interpretability, and are computationally intensive-hindering their deployment on resource-constrained platforms. In this work, we propose two novel lightweight 1D convolutional neural networks, ArrhythmiNet V1 and V2, optimized for efficient, real-time arrhythmia classification on edge devices. Inspired by MobileNet's depthwise separable convolutional design, these models maintain memory footprints of just 302.18 KB and 157.76 KB, respectively, while achieving classification accuracies of 0.99 (V1) and 0.98 (V2) on the MIT-BIH Arrhythmia Dataset across five classes: Normal Sinus Rhythm, Left Bundle Branch Block, Right Bundle Branch Block, Atrial Premature Contraction, and Premature Ventricular Contraction. In order to ensure clinical transparency and relevance, we integrate Shapley Additive Explanations and Gradient-weighted Class Activation Mapping, enabling both local and global interpretability. These techniques highlight physiologically meaningful patterns such as the QRS complex and T-wave that contribute to the model's predictions. We also discuss performance-efficiency trade-offs and address current limitations related to dataset diversity and generalizability. Overall, our findings demonstrate the feasibility of combining interpretability, predictive accuracy, and computational efficiency in practical, wearable, and embedded ECG monitoring systems.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation
Authors:
Mohammad Shoaib Ansari,
Mohd Sohail Ali Khan,
Shubham Revankar,
Aditya Varma,
Anil S. Mokhade
Abstract:
This research paper investigates the application of Large Language Models (LLMs) in healthcare, specifically focusing on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By embedding and retrieving cont…
▽ More
This research paper investigates the application of Large Language Models (LLMs) in healthcare, specifically focusing on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By embedding and retrieving context-relevant healthcare information, the system significantly improves response accuracy. QLoRA facilitates notable parameter efficiency and memory optimization, preserving the integrity of medical information through specialized quantization techniques. Our research also shows that our model performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions. This paper details the system's technical components, including its architecture, quantization methods, and key healthcare applications such as enhanced disease prediction from patient symptoms and medical history, treatment suggestions, and efficient summarization of complex medical reports. We touch on the ethical considerations-patient privacy, data security, and the need for rigorous clinical validation-as well as the practical challenges of integrating such systems into real-world healthcare workflows. Furthermore, the lightweight quantized weights ensure scalability and ease of deployment even in low-resource hospital environments. Finally, the paper concludes with an analysis of the broader impact of LLMs on healthcare and outlines future directions for LLMs in medical settings.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Authors:
Carlo Siebenschuh,
Kyle Hippe,
Ozan Gokdemir,
Alexander Brace,
Arham Khan,
Khalid Hossain,
Yadu Babuji,
Nicholas Chia,
Venkatram Vishwanath,
Rick Stevens,
Arvind Ramanathan,
Ian Foster,
Robert Underwood
Abstract:
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accurac…
▽ More
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
△ Less
Submitted 23 April, 2025;
originally announced May 2025.
-
Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
Authors:
Anas Anwarul Haq Khan,
Utkarsh Verma,
Prateek Chanda,
Ganesh Ramakrishnan
Abstract:
We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment wise video summarization. Leveraging multi modal prompts that combine textual and audio derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and…
▽ More
We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment wise video summarization. Leveraging multi modal prompts that combine textual and audio derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, competing the performance of significantly larger models, all while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
The Role of Generative AI in Strengthening Secure Software Coding Practices: A Systematic Perspective
Authors:
Hathal S. Alwageed,
Rafiq Ahmad Khan
Abstract:
As software security threats continue to evolve, the demand for innovative ways of securing coding has tremendously grown. The integration of Generative AI (GenAI) into software development holds significant potential for improving secure coding practices. This paper aims at systematically studying the impact of GenAI in enhancing secure coding practices from improving software security, setting f…
▽ More
As software security threats continue to evolve, the demand for innovative ways of securing coding has tremendously grown. The integration of Generative AI (GenAI) into software development holds significant potential for improving secure coding practices. This paper aims at systematically studying the impact of GenAI in enhancing secure coding practices from improving software security, setting forth its potential benefits, challenges, and implications. To outline the contribution of AI driven code generation tools, we analyze via a structured review of recent literature, application to the industry, and empirical studies on how these tools help to mitigate security risks, comply with the secure coding standards, and make software development efficient. We hope that our findings will benefit researchers, software engineers and cybersecurity professionals alike in integrating GenAI into a secure development workflow without losing the advantages GenAI provides. Finally, the state of the art advances and future directions of AI assisted in secure software engineering discussed in this study can contribute to the ongoing discourse on AI assisted in secure software engineering.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection
Authors:
Athul M. Mathew,
Arshad Ali Khan,
Thariq Khalid,
Faroq AL-Tam,
Riad Souissi
Abstract:
Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the…
▽ More
Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient (\textit{attention-grabbing}) regions in image for the subject in consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including the ablation analysis on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Authors:
Shahad Albastaki,
Anabia Sohail,
Iyyakutti Iyappan Ganapathi,
Basit Alawode,
Asim Khan,
Sajid Javed,
Naoufel Werghi,
Mohammed Bennamoun,
Arif Mahmood
Abstract:
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution i…
▽ More
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Our model aims to capture a broader range of information, supported by novel loss functions, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
OPUS-VFL: Incentivizing Optimal Privacy-Utility Tradeoffs in Vertical Federated Learning
Authors:
Sindhuja Madabushi,
Ahmad Faraz Khan,
Haider Ali,
Jin-Hee Cho
Abstract:
Vertical Federated Learning (VFL) enables organizations with disjoint feature spaces but shared user bases to collaboratively train models without sharing raw data. However, existing VFL systems face critical limitations: they often lack effective incentive mechanisms, struggle to balance privacy-utility tradeoffs, and fail to accommodate clients with heterogeneous resource capabilities. These cha…
▽ More
Vertical Federated Learning (VFL) enables organizations with disjoint feature spaces but shared user bases to collaboratively train models without sharing raw data. However, existing VFL systems face critical limitations: they often lack effective incentive mechanisms, struggle to balance privacy-utility tradeoffs, and fail to accommodate clients with heterogeneous resource capabilities. These challenges hinder meaningful participation, degrade model performance, and limit practical deployment. To address these issues, we propose OPUS-VFL, an Optimal Privacy-Utility tradeoff Strategy for VFL. OPUS-VFL introduces a novel, privacy-aware incentive mechanism that rewards clients based on a principled combination of model contribution, privacy preservation, and resource investment. It employs a lightweight leave-one-out (LOO) strategy to quantify feature importance per client, and integrates an adaptive differential privacy mechanism that enables clients to dynamically calibrate noise levels to optimize their individual utility. Our framework is designed to be scalable, budget-balanced, and robust to inference and poisoning attacks. Extensive experiments on benchmark datasets (MNIST, CIFAR-10, and CIFAR-100) demonstrate that OPUS-VFL significantly outperforms state-of-the-art VFL baselines in both efficiency and robustness. It reduces label inference attack success rates by up to 20%, increases feature inference reconstruction error (MSE) by over 30%, and achieves up to 25% higher incentives for clients that contribute meaningfully while respecting privacy and cost constraints. These results highlight the practicality and innovation of OPUS-VFL as a secure, fair, and performance-driven solution for real-world VFL.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models
Authors:
Feiyang Li,
Peng Fang,
Zhan Shi,
Arijit Khan,
Fang Wang,
Dan Feng,
Weihao Wang,
Xin Zhang,
Yongjian Cui
Abstract:
While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still has two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key…
▽ More
While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still has two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to execute reasoning tasks in pseudo-programs with greater logical rigor. We conduct a comprehensive evaluation on nine public datasets, covering three reasoning problems. Compared with the-state-of-the-art methods, CoT-RAG exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%. Furthermore, testing on four domain-specific datasets, CoT-RAG shows remarkable accuracy and efficient execution, highlighting its strong practical applicability and scalability.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Dynamic Memory-enhanced Transformer for Hyperspectral Image Classification
Authors:
Muhammad Ahmad,
Manuel Mazzara,
Salvatore Distefano,
Adil Mehmood Khan
Abstract:
Hyperspectral image (HSI) classification remains a challenging task due to the intricate spatial-spectral correlations. Existing transformer models excel in capturing long-range dependencies but often suffer from information redundancy and attention inefficiencies, limiting their ability to model fine-grained relationships crucial for HSI classification. To overcome these limitations, this work pr…
▽ More
Hyperspectral image (HSI) classification remains a challenging task due to the intricate spatial-spectral correlations. Existing transformer models excel in capturing long-range dependencies but often suffer from information redundancy and attention inefficiencies, limiting their ability to model fine-grained relationships crucial for HSI classification. To overcome these limitations, this work proposes MemFormer, a lightweight and memory-enhanced transformer. MemFormer introduces a memory-enhanced multi-head attention mechanism that iteratively refines a dynamic memory module, enhancing feature extraction while reducing redundancy across layers. Additionally, a dynamic memory enrichment strategy progressively captures complex spatial and spectral dependencies, leading to more expressive feature representations. To further improve structural consistency, we incorporate a spatial-spectral positional encoding (SSPE) tailored for HSI data, ensuring continuity without the computational burden of convolution-based approaches. Extensive experiments on benchmark datasets demonstrate that MemFormer achieves superior classification accuracy, outperforming state-of-the-art methods.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
AttentionDrop: A Novel Regularization Method for Transformer Models
Authors:
Mirza Samad Ahmed Baig,
Syeda Anshrah Gillani,
Abdul Akbar Khan,
Shahid Munir Shah
Abstract:
Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. We propose AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention dis…
▽ More
Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. We propose AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions. We introduces three variants: 1. Hard Attention Masking: randomly zeroes out top-k attention logits per query to encourage diverse context utilization. 2. Blurred Attention Smoothing: applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions. 3. Consistency-Regularized AttentionDrop: enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning
Authors:
Muhammad Al-Zafar Khan,
Jamal Al-Karaki
Abstract:
In this paper, we present a multi-agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient t…
▽ More
In this paper, we present a multi-agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agent through progressively complex repair scenarios. In silico experiments demonstrate emergent repair strategies, including dynamic secretion control and spatial coordination.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Ctrl-Z: Controlling AI Agents via Resampling
Authors:
Aryan Bhatt,
Cody Rushing,
Adam Kaufman,
Tyler Tracy,
Vasil Georgiev,
David Matolcsi,
Akbir Khan,
Buck Shlegeris
Abstract:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially cons…
▽ More
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
A Full Spectrum of 3D Ferroelectric Memory Architectures Shaped by Polarization Sensing
Authors:
Jiahui Duan,
Asif Khan,
Xiao Gong,
Vijaykrishnan Narayanan,
Kai Ni
Abstract:
Ferroelectric memories have attracted significant interest due to their non-volatile storage, energy efficiency, and fast operation, making them prime candidates for future memory technologies. As commercial Dynamic Random Access Memory (DRAM) and NAND flash memory are transiting or have moved toward three-dimensional (3D) integration, 3D ferroelectric memory architectures are also emerging, provi…
▽ More
Ferroelectric memories have attracted significant interest due to their non-volatile storage, energy efficiency, and fast operation, making them prime candidates for future memory technologies. As commercial Dynamic Random Access Memory (DRAM) and NAND flash memory are transiting or have moved toward three-dimensional (3D) integration, 3D ferroelectric memory architectures are also emerging, provided they can achieve a competitive position within the modern memory hierarchy. Given the excellent scalability of ferroelectric HfO2, various dense 3D integrated ferroelectric memory architectures are feasible, each offering unique strengths and facing distinct challenges. In this work, we present a comprehensive classification of 3D ferroelectric memory architectures based on polarization sensing methods, highlighting their critical role in shaping memory cell design and operational efficiency. Through a systematic evaluation of these architectures, we develop a unified framework to assess their advantages and trade-offs. This classification not only enhances the understanding of current 3D ferroelectric memory technologies but also lays the foundation for designing next-generation architectures optimized for advanced computing and high-performance applications.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
All-in-Memory Stochastic Computing using ReRAM
Authors:
João Paulo C. de Lima,
Mehran Shoushtari Moghadam,
Sercan Aygun,
Jeronimo Castrillon,
M. Hassan Najafi,
Asif Ali Khan
Abstract:
As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC…
▽ More
As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39x, 2.16x) and energy consumption (1.15x, 2.8x) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5% across multiple SBS lengths and image processing tasks.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
How Good Are Large Language Models for Course Recommendation in MOOCs?
Authors:
Boxuan Ma,
Md Akib Zabed Khan,
Tianyuan Yang,
Agoritsa Polyzou,
Shin'ichi Konomi
Abstract:
Large Language Models (LLMs) have made significant strides in natural language processing and are increasingly being integrated into recommendation systems. However, their potential in educational recommendation systems has yet to be fully explored. This paper investigates the use of LLMs as a general-purpose recommendation model, leveraging their vast knowledge derived from large-scale corpora fo…
▽ More
Large Language Models (LLMs) have made significant strides in natural language processing and are increasingly being integrated into recommendation systems. However, their potential in educational recommendation systems has yet to be fully explored. This paper investigates the use of LLMs as a general-purpose recommendation model, leveraging their vast knowledge derived from large-scale corpora for course recommendation tasks. We explore a variety of approaches, ranging from prompt-based methods to more advanced fine-tuning techniques, and compare their performance against traditional recommendation models. Extensive experiments were conducted on a real-world MOOC dataset, evaluating using LLMs as course recommendation systems across key dimensions such as accuracy, diversity, and novelty. Our results demonstrate that LLMs can achieve good performance comparable to traditional models, highlighting their potential to enhance educational recommendation systems. These findings pave the way for further exploration and development of LLM-based approaches in the context of educational recommendations.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
TactileNet: Bridging the Accessibility Gap with AI-Generated Tactile Graphics for Individuals with Vision Impairment
Authors:
Adnan Khan,
Alireza Choubineh,
Mai A. Shaaban,
Abbas Akkasi,
Majid Komeili
Abstract:
Tactile graphics are essential for providing access to visual information for the 43 million people globally living with vision loss. Traditional methods for creating these graphics are labor-intensive and cannot meet growing demand. We introduce TactileNet, the first comprehensive dataset and AI-driven framework for generating embossing-ready 2D tactile templates using text-to-image Stable Diffus…
▽ More
Tactile graphics are essential for providing access to visual information for the 43 million people globally living with vision loss. Traditional methods for creating these graphics are labor-intensive and cannot meet growing demand. We introduce TactileNet, the first comprehensive dataset and AI-driven framework for generating embossing-ready 2D tactile templates using text-to-image Stable Diffusion (SD) models. By integrating Low-Rank Adaptation (LoRA) and DreamBooth, our method fine-tunes SD models to produce high-fidelity, guideline-compliant graphics while reducing computational costs. Quantitative evaluations with tactile experts show 92.86% adherence to accessibility standards. Structural fidelity analysis revealed near-human design similarity, with an SSIM of 0.538 between generated graphics and expert-designed tactile images. Notably, our method preserves object silhouettes better than human designs (SSIM = 0.259 vs. 0.215 for binary masks), addressing a key limitation of manual tactile abstraction. The framework scales to 32,000 images (7,050 high-quality) across 66 classes, with prompt editing enabling customizable outputs (e.g., adding or removing details). By automating the 2D template generation step-compatible with standard embossing workflows-TactileNet accelerates production while preserving design flexibility. This work demonstrates how AI can augment (not replace) human expertise to bridge the accessibility gap in education and beyond. Code, data, and models will be publicly released to foster further research.
△ Less
Submitted 15 May, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
How Accurately Do Large Language Models Understand Code?
Authors:
Sabaat Haroon,
Ahmad Faraz Khan,
Ahmad Humayun,
Waris Gill,
Abdul Haddi Amjad,
Ali R. Butt,
Mohammad Taha Khan,
Muhammad Ali Gulzar
Abstract:
Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. A key factor in these tasks' success is the model's deep understanding of code. However, the extent to which LLMs truly understand code remains largely unevaluated. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. Previously, t…
▽ More
Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. A key factor in these tasks' success is the model's deep understanding of code. However, the extent to which LLMs truly understand code remains largely unevaluated. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. Previously, this was assessed through developer surveys, which are not feasible for evaluating LLMs. Existing LLM benchmarks focus primarily on code generation, fundamentally different from code comprehension. Additionally, fixed benchmarks quickly become obsolete as they become part of the training data. This paper presents the first large-scale empirical investigation into LLMs' ability to understand code. Inspired by mutation testing, we use an LLM's fault-finding ability as a proxy for its deep code understanding. This approach is based on the insight that a model capable of identifying subtle functional discrepancies must understand the code well. We inject faults in real-world programs and ask the LLM to localize them, ensuring the specifications suffice for fault localization. Next, we apply semantic-preserving code mutations (SPMs) to the faulty programs and test whether the LLMs still locate the faults, verifying their confidence in code understanding. We evaluate nine popular LLMs on 600,010 debugging tasks from 670 Java and 637 Python programs. We find that LLMs lose the ability to debug the same bug in 78% of faulty programs when SPMs are applied, indicating a shallow understanding of code and reliance on features irrelevant to semantics. We also find that LLMs understand code earlier in the program better than later. This suggests that LLMs' code comprehension remains tied to lexical and syntactic features due to tokenization designed for natural languages, which overlooks code semantics.
△ Less
Submitted 9 April, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
EMF: Event Meta Formers for Event-based Real-time Traffic Object Detection
Authors:
Muhammad Ahmed Ullah Khan,
Abdul Hannan Khan,
Andreas Dengel
Abstract:
Event cameras have higher temporal resolution, and require less storage and bandwidth compared to traditional RGB cameras. However, due to relatively lagging performance of event-based approaches, event cameras have not yet replace traditional cameras in performance-critical applications like autonomous driving. Recent approaches in event-based object detection try to bridge this gap by employing…
▽ More
Event cameras have higher temporal resolution, and require less storage and bandwidth compared to traditional RGB cameras. However, due to relatively lagging performance of event-based approaches, event cameras have not yet replace traditional cameras in performance-critical applications like autonomous driving. Recent approaches in event-based object detection try to bridge this gap by employing computationally expensive transformer-based solutions. However, due to their resource-intensive components, these solutions fail to exploit the sparsity and higher temporal resolution of event cameras efficiently. Moreover, these solutions are adopted from the vision domain, lacking specificity to the event cameras. In this work, we explore efficient and performant alternatives to recurrent vision transformer models and propose a novel event-based object detection backbone. The proposed backbone employs a novel Event Progression Extractor module, tailored specifically for event data, and uses Metaformer concept with convolution-based efficient components. We evaluate the resultant model on well-established traffic object detection benchmarks and conduct cross-dataset evaluation to test its ability to generalize. The proposed model outperforms the state-of-the-art on Prophesee Gen1 dataset by 1.6 mAP while reducing inference time by 14%. Our proposed EMF becomes the fastest DNN-based architecture in the domain by outperforming most efficient event-based object detectors. Moreover, the proposed model shows better ability to generalize to unseen data and scales better with the abundance of data.
△ Less
Submitted 5 April, 2025;
originally announced April 2025.
-
Characterizing Creativity in Visualization Design
Authors:
Naimul Hoque,
Zinat Ara,
Safwat Ali Khan,
Fanny Chevalier,
Niklas Elmqvist
Abstract:
Understanding the role of creativity in visualization design becomes increasingly important as the field matures, particularly with the emergence of various visualization authoring and recommendation systems. In this paper, we examine how creativity manifests in visualization design processes and how academic research has conceptualized it over time. Through a systematic review of 58 visualization…
▽ More
Understanding the role of creativity in visualization design becomes increasingly important as the field matures, particularly with the emergence of various visualization authoring and recommendation systems. In this paper, we examine how creativity manifests in visualization design processes and how academic research has conceptualized it over time. Through a systematic review of 58 visualization papers that use the terms "creativity" or "creative," we analyze the evolution of creative practices in visualization design. Our findings show that prior literature predominantly used atypical designs through free-form drawings, infographics, pictorials, and data comics to define creative representations. However, creativity in visualization design extends beyond visual representations to encompass early needfinding design activities such as sketching, storyboarding, discussion, and card sorting. Data visualization can also support a wide variety of creative tasks (e.g., fiction writing). We discuss the implications of these findings for fostering innovation within established design paradigms and for developing more sophisticated visualization authoring systems. The full list of coded papers are available here: https://vizcreativity.notion.site/coded-papers.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations
Authors:
Adrián Sánchez-Mompó,
Ioannis Mavromatis,
Peizheng Li,
Konstantinos Katsaros,
Aftab Khan
Abstract:
This study presents an empirical investigation into the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. For Discriminative models, we examine various architectures and hyperparameters during training and inference and identify energy-efficient practices. For Generative AI, Large Language Models (LLMs) are assessed, focusing primarily on energy consu…
▽ More
This study presents an empirical investigation into the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. For Discriminative models, we examine various architectures and hyperparameters during training and inference and identify energy-efficient practices. For Generative AI, Large Language Models (LLMs) are assessed, focusing primarily on energy consumption across different model sizes and varying service requests. Our study employs software-based power measurements, ensuring ease of replication across diverse configurations, models, and datasets. We analyse multiple models and hardware setups to uncover correlations among various metrics, identifying key contributors to energy consumption. The results indicate that for Discriminative models, optimising architectures, hyperparameters, and hardware can significantly reduce energy consumption without sacrificing performance. For LLMs, energy efficiency depends on balancing model size, reasoning complexity, and request-handling capacity, as larger models do not necessarily consume more energy when utilisation remains low. This analysis provides practical guidelines for designing green and sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. This paper can serve as a benchmark for accurately estimating total energy use across different types of AI models.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Optimal Path Planning and Cost Minimization for a Drone Delivery System Via Model Predictive Control
Authors:
Muhammad Al-Zafar Khan,
Jamal Al-Karaki
Abstract:
In this study, we formulate the drone delivery problem as a control problem and solve it using Model Predictive Control. Two experiments are performed: The first is on a less challenging grid world environment with lower dimensionality, and the second is with a higher dimensionality and added complexity. The MPC method was benchmarked against three popular Multi-Agent Reinforcement Learning (MARL)…
▽ More
In this study, we formulate the drone delivery problem as a control problem and solve it using Model Predictive Control. Two experiments are performed: The first is on a less challenging grid world environment with lower dimensionality, and the second is with a higher dimensionality and added complexity. The MPC method was benchmarked against three popular Multi-Agent Reinforcement Learning (MARL): Independent $Q$-Learning (IQL), Joint Action Learners (JAL), and Value-Decomposition Networks (VDN). It was shown that the MPC method solved the problem quicker and required fewer optimal numbers of drones to achieve a minimized cost and navigate the optimal path.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Improved Approximation Algorithms for Three-Dimensional Knapsack
Authors:
Klaus Jansen,
Debajyoti Kar,
Arindam Khan,
K. V. N. Sreenivas,
Malte Tutas
Abstract:
We study the three-dimensional Knapsack (3DK) problem, in which we are given a set of axis-aligned cuboids with associated profits and an axis-aligned cube knapsack. The objective is to find a non-overlapping axis-aligned packing (by translation) of the maximum profit subset of cuboids into the cube. The previous best approximation algorithm is due to Diedrich, Harren, Jansen, Thöle, and Thomas (2…
▽ More
We study the three-dimensional Knapsack (3DK) problem, in which we are given a set of axis-aligned cuboids with associated profits and an axis-aligned cube knapsack. The objective is to find a non-overlapping axis-aligned packing (by translation) of the maximum profit subset of cuboids into the cube. The previous best approximation algorithm is due to Diedrich, Harren, Jansen, Thöle, and Thomas (2008), who gave a $(7+\varepsilon)$-approximation algorithm for 3DK and a $(5+\varepsilon)$-approximation algorithm for the variant when the items can be rotated by 90 degrees around any axis, for any constant $\varepsilon>0$. Chlebík and Chlebíková (2009) showed that the problem does not admit an asymptotic polynomial-time approximation scheme.
We provide an improved polynomial-time $(139/29+\varepsilon) \approx 4.794$-approximation algorithm for 3DK and $(30/7+\varepsilon) \approx 4.286$-approximation when rotations by 90 degrees are allowed. We also provide improved approximation algorithms for several variants such as the cardinality case (when all items have the same profit) and uniform profit-density case (when the profit of an item is equal to its volume). Our key technical contribution is container packing -- a structured packing in 3D such that all items are assigned into a constant number of containers, and each container is packed using a specific strategy based on its type. We first show the existence of highly profitable container packings. Thereafter, we show that one can find near-optimal container packing efficiently using a variant of the Generalized Assignment Problem (GAP).
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Efficient IoT Intrusion Detection with an Improved Attention-Based CNN-BiLSTM Architecture
Authors:
Amna Naeem,
Muazzam A. Khan,
Nada Alasbali,
Jawad Ahmad,
Aizaz Ahmad Khattak,
Muhammad Shahbaz Khan
Abstract:
The ever-increasing security vulnerabilities in the Internet-of-Things (IoT) systems require improved threat detection approaches. This paper presents a compact and efficient approach to detect botnet attacks by employing an integrated approach that consists of traffic pattern analysis, temporal support learning, and focused feature extraction. The proposed attention-based model benefits from a hy…
▽ More
The ever-increasing security vulnerabilities in the Internet-of-Things (IoT) systems require improved threat detection approaches. This paper presents a compact and efficient approach to detect botnet attacks by employing an integrated approach that consists of traffic pattern analysis, temporal support learning, and focused feature extraction. The proposed attention-based model benefits from a hybrid CNN-BiLSTM architecture and achieves 99% classification accuracy in detecting botnet attacks utilizing the N-BaIoT dataset, while maintaining high precision and recall across various scenarios. The proposed model's performance is further validated by key parameters, such as Mathews Correlation Coefficient and Cohen's kappa Correlation Coefficient. The close-to-ideal results for these parameters demonstrate the proposed model's ability to detect botnet attacks accurately and efficiently in practical settings and on unseen data. The proposed model proved to be a powerful defence mechanism for IoT networks to face emerging security challenges.
△ Less
Submitted 1 May, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
Gene42: Long-Range Genomic Foundation Model With Dense Attention
Authors:
Kirill Vishniakov,
Boulbaba Ben Amor,
Engin Tekin,
Nancy A. ElNaker,
Karthik Viswanathan,
Aleksandr Medvedev,
Aahan Singh,
Maryam Nadeem,
Mohammad Amaan Sayeed,
Praveenkumar Kanithi,
Tiago Magalhaes,
Natalia Vassilieva,
Dwarikanath Mahapatra,
Marco Pimentel,
and Shadab Khan
Abstract:
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context…
▽ More
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
Safety Aware Task Planning via Large Language Models in Robotics
Authors:
Azal Ahmad Khan,
Michael Andrev,
Muhammad Ali Murtaza,
Sergio Aguilera,
Rui Zhang,
Jie Ding,
Seth Hutchinson,
Ali Anwar
Abstract:
The integration of large language models (LLMs) into robotic task planning has unlocked better reasoning capabilities for complex, long-horizon workflows. However, ensuring safety in LLM-driven plans remains a critical challenge, as these models often prioritize task completion over risk mitigation. This paper introduces SAFER (Safety-Aware Framework for Execution in Robotics), a multi-LLM framewo…
▽ More
The integration of large language models (LLMs) into robotic task planning has unlocked better reasoning capabilities for complex, long-horizon workflows. However, ensuring safety in LLM-driven plans remains a critical challenge, as these models often prioritize task completion over risk mitigation. This paper introduces SAFER (Safety-Aware Framework for Execution in Robotics), a multi-LLM framework designed to embed safety awareness into robotic task planning. SAFER employs a Safety Agent that operates alongside the primary task planner, providing safety feedback. Additionally, we introduce LLM-as-a-Judge, a novel metric leveraging LLMs as evaluators to quantify safety violations within generated task plans. Our framework integrates safety feedback at multiple stages of execution, enabling real-time risk assessment, proactive error correction, and transparent safety evaluation. We also integrate a control framework using Control Barrier Functions (CBFs) to ensure safety guarantees within SAFER's task planning. We evaluated SAFER against state-of-the-art LLM planners on complex long-horizon tasks involving heterogeneous robotic agents, demonstrating its effectiveness in reducing safety violations while maintaining task efficiency. We also verify the task planner and safety planner through actual hardware experiments involving multiple robots and a human.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis
Authors:
Asifullah Khan,
Laiba Asmatullah,
Anza Malik,
Shahzaib Khan,
Hamna Asif
Abstract:
Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and n…
▽ More
Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.
△ Less
Submitted 18 April, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Auditing language models for hidden objectives
Authors:
Samuel Marks,
Johannes Treutlein,
Trenton Bricken,
Jack Lindsey,
Jonathan Marcus,
Siddharth Mishra-Sharma,
Daniel Ziegler,
Emmanuel Ameisen,
Joshua Batson,
Tim Belonax,
Samuel R. Bowman,
Shan Carter,
Brian Chen,
Hoagy Cunningham,
Carson Denison,
Florian Dietz,
Satvik Golechha,
Akbir Khan,
Jan Kirchner,
Jan Leike,
Austin Meek,
Kei Nishimura-Gasparian,
Euan Ong,
Christopher Olah,
Adam Pearce
, et al. (10 additional authors not shown)
Abstract:
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model…
▽ More
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
△ Less
Submitted 27 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
A Chaotic Image Encryption Scheme Using Novel Geometric Block Permutation and Dynamic Substitution
Authors:
Muhammad Ali,
Jawad Ahmad,
Muhammad Abdullah Hussain Khan,
Safee Ullah,
Mujeeb Ur Rehman,
Syed Aziz Shah,
Muhammad Shahbaz Khan
Abstract:
In this digital era, ensuring the security of digital data during transmission and storage is crucial. Digital data, particularly image data, needs to be protected against unauthorized access. To address this, this paper presents a novel image encryption scheme based on a confusion diffusion architecture. The diffusion module introduces a novel geometric block permutation technique, which effectiv…
▽ More
In this digital era, ensuring the security of digital data during transmission and storage is crucial. Digital data, particularly image data, needs to be protected against unauthorized access. To address this, this paper presents a novel image encryption scheme based on a confusion diffusion architecture. The diffusion module introduces a novel geometric block permutation technique, which effectively scrambles the pixels based on geometric shape extraction of pixels. The image is converted into four blocks, and pixels are extracted from these blocks using L-shape, U-shape, square-shape, and inverted U-shape patterns for each block, respectively. This robust extraction and permutation effectively disrupts the correlation within the image. Furthermore, the confusion module utilises bit-XOR and dynamic substitution techniques. For the bit-XOR operation, 2D Henon map has been utilised to generate a chaotic seed matrix, which is bit-XORed with the scrambled image. The resultant image then undergoes the dynamic substitution process to complete confusion phase. A statistical security analysis demonstrates the superior security of the proposed scheme, with being high uncertainty and unpredictability, achieving an entropy of 7.9974 and a correlation coefficient of 0.0014. These results validate the proposed scheme's effectiveness in securing digital images.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Factorio Learning Environment
Authors:
Jack Hopkins,
Mart Bakler,
Akbir Khan
Abstract:
Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization. FLE provides exponentially scaling challenges -- from basic automation to complex factories processing millions o…
▽ More
Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization. FLE provides exponentially scaling challenges -- from basic automation to complex factories processing millions of resource units per second. We provide two settings: (1) lab-play consisting of eight structured tasks with fixed resources, and (2) open-play with the unbounded task of building the largest factory on an procedurally generated map. We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g electric-powered drilling), they fail to achieve complex automation (e.g electronic-circuit manufacturing).
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
A Hybrid Neural Network with Smart Skip Connections for High-Precision, Low-Latency EMG-Based Hand Gesture Recognition
Authors:
Hafsa Wazir,
Jawad Ahmad,
Muazzam A. Khan,
Sana Ullah Jan,
Fadia Ali Khan,
Muhammad Shahbaz Khan
Abstract:
Electromyography (EMG) is extensively used in key biomedical areas, such as prosthetics, and assistive and interactive technologies. This paper presents a new hybrid neural network named ConSGruNet for precise and efficient hand gesture recognition. The proposed model comprises convolutional neural networks with smart skip connections in conjunction with a Gated Recurrent Unit (GRU). The proposed…
▽ More
Electromyography (EMG) is extensively used in key biomedical areas, such as prosthetics, and assistive and interactive technologies. This paper presents a new hybrid neural network named ConSGruNet for precise and efficient hand gesture recognition. The proposed model comprises convolutional neural networks with smart skip connections in conjunction with a Gated Recurrent Unit (GRU). The proposed model is trained on the complete Ninapro DB1 dataset. The proposed model boasts an accuracy of 99.7\% in classifying 53 classes in just 25 milliseconds. In addition to being fast, the proposed model is lightweight with just 3,946 KB in size. Moreover, the proposed model has also been evaluated for the reliability parameters, i.e., Cohen's kappa coefficient, Matthew's correlation coefficient, and confidence intervals. The close to ideal results of these parameters validate the models performance on unseen data.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Improved Approximation Algorithms for Three-Dimensional Bin Packing
Authors:
Debajyoti Kar,
Arindam Khan,
Malin Rau
Abstract:
We study three fundamental three-dimensional (3D) geometric packing problems: 3D (Geometric) Bin Packing (3D-BP), 3D Strip Packing (3D-SP), and Minimum Volume Bounding Box (3D-MVBB), where given a set of 3D (rectangular) cuboids, the goal is to find an axis-aligned nonoverlapping packing of all cuboids. In 3D-BP, we need to pack the given cuboids into the minimum number of unit cube bins. In 3D-SP…
▽ More
We study three fundamental three-dimensional (3D) geometric packing problems: 3D (Geometric) Bin Packing (3D-BP), 3D Strip Packing (3D-SP), and Minimum Volume Bounding Box (3D-MVBB), where given a set of 3D (rectangular) cuboids, the goal is to find an axis-aligned nonoverlapping packing of all cuboids. In 3D-BP, we need to pack the given cuboids into the minimum number of unit cube bins. In 3D-SP, we need to pack them into a 3D cuboid with a unit square base and minimum height. Finally, in 3D-MVBB, the goal is to pack into a cuboid box of minimum volume.
It is NP-hard to even decide whether a set of rectangles can be packed into a unit square bin -- giving an (absolute) approximation hardness of 2 for 3D-BP and 3D-SP. The previous best (absolute) approximation for all three problems is by Li and Cheng (SICOMP, 1990), who gave algorithms with approximation ratios of 13, $46/7$, and $46/7+\varepsilon$, respectively, for 3D-BP, 3D-SP, and 3D-MVBB. We provide improved approximation ratios of 6, 6, and $3+\varepsilon$, respectively, for the three problems, for any constant $\varepsilon > 0$.
For 3D-BP, in the asymptotic regime, Bansal, Correa, Kenyon, and Sviridenko (Math.~Oper.~Res., 2006) showed that there is no asymptotic polynomial-time approximation scheme (APTAS) even when all items have the same height. Caprara (Math.~Oper.~Res., 2008) gave an asymptotic approximation ratio of $T_{\infty}^2 + \varepsilon\approx 2.86$, where $T_{\infty}$ is the well-known Harmonic constant in Bin Packing. We provide an algorithm with an improved asymptotic approximation ratio of $3 T_{\infty}/2 +\varepsilon \approx 2.54$. Further, we show that unlike 3D-BP (and 3D-SP), 3D-MVBB admits an APTAS.
△ Less
Submitted 19 April, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Authors:
Ariba Khan,
Stephen Casper,
Dylan Hadfield-Menell
Abstract:
Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behin…
▽ More
Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively.
△ Less
Submitted 8 April, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
AttentionSwarm: Reinforcement Learning with Attention Control Barier Function for Crazyflie Drones in Dynamic Environments
Authors:
Grik Tadevosyan,
Valerii Serpiva,
Aleksey Fedoseev,
Roohan Ahmed Khan,
Demetros Aschu,
Faryal Batool,
Nickolay Efanov,
Artem Mikhaylov,
Dzmitry Tsetserukou
Abstract:
We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control across three challenging environments: a landing environment with obstacles, a competitive drone game setting, and a dynamic drone racing scenario. Central to our approach is the Attention Model Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety-critic…
▽ More
We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control across three challenging environments: a landing environment with obstacles, a competitive drone game setting, and a dynamic drone racing scenario. Central to our approach is the Attention Model Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety-critical control theory to enable real-time collision avoidance and trajectory optimization. This framework dynamically prioritizes critical obstacles and agents in the swarms vicinity using attention weights, while CBFs formally guarantee safety by enforcing collision-free constraints. The safe attention net algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves landing accuracy of 3.02 cm with a mean time of 23 s and collision-free landings in a dynamic landing environment, 100% and collision-free navigation in a drone game environment, and 95% and collision-free navigation for a dynamic multiagent drone racing environment, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for applications in dynamic environments where safety and fastness are paramount.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Authors:
Md Azim Khan,
Aryya Gangopadhyay,
Jianwu Wang,
Robert F. Erbacher
Abstract:
Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in…
▽ More
Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in real environments. This research presents a novel vision language model (VLM) framework that leverages frequency domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT) based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
Mining Q&A Platforms for Empirical Evidence on Quantum Software Programming
Authors:
Arif Ali Khan,
Boshuai Ye,
Muhammad Azeem Akbar,
Javed Ali Khan,
Davoud Mougouei,
Xinyuan Ma
Abstract:
The rise of quantum computing has driven the need for quantum software engineering, yet its programming landscape remains largely unexplored in empirical research. As quantum technologies advance toward industrial adoption, understanding programming aspects is crucial to addressing software development challenges. This study analyzes 6,935 quantum software programming discussion posts from Stack E…
▽ More
The rise of quantum computing has driven the need for quantum software engineering, yet its programming landscape remains largely unexplored in empirical research. As quantum technologies advance toward industrial adoption, understanding programming aspects is crucial to addressing software development challenges. This study analyzes 6,935 quantum software programming discussion posts from Stack Exchange platforms (Quantum Computing, Stack Overflow, Software Engineering, and Code Review). Using topic modeling and qualitative analysis, we identified key discussion topics, trends (popular and difficult), tools/frameworks, and practitioner challenges. Twenty topics were identified, including popular ones such as physical theories and mathematical foundations, as well as security and encryption algorithms, while the most difficult were object-oriented programming and parameter control in quantum algorithms. Additionally, we identified nine frameworks that support quantum programming, with Qiskit emerging as the most widely adopted. Our findings also reveal core challenges in quantum software programming, thematically mapped into four areas: theories and mathematical concepts, algorithms and applications, experimental practices and software development, and education and community engagement. This study provides empirical insights that can inform future research, tool development, and educational efforts, supporting the evolution of the quantum software ecosystem.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
AUTOFRAME -- A Software-driven Integration Framework for Automotive Systems
Authors:
Sven Kirchner,
Nils Purschke,
Chengdong Wu,
Muhammed Aqib Khan,
Divye Dixit,
Alois C. Knoll
Abstract:
The evolution of automotive technologies towards more integrated and sophisticated systems requires a shift from traditional distributed architectures to centralized vehicle architectures. This work presents a novel framework that addresses the increasing complexity of Software Defined Vehicles (SDV) through a centralized approach that optimizes software and hardware integration. Our approach intr…
▽ More
The evolution of automotive technologies towards more integrated and sophisticated systems requires a shift from traditional distributed architectures to centralized vehicle architectures. This work presents a novel framework that addresses the increasing complexity of Software Defined Vehicles (SDV) through a centralized approach that optimizes software and hardware integration. Our approach introduces a scalable, modular, and secure automotive deployment framework that leverages a hardware abstraction layer and dynamic software deployment capabilities to meet the growing demands of the industry. The framework supports centralized computing of vehicle functions, making software development more dynamic and easier to update and upgrade. We demonstrate the capabilities of our framework by implementing it in a simulated environment where it effectively handles several automotive operations such as lane detection, motion planning, and vehicle control. Our results highlight the framework's potential to facilitate the development and maintenance of future vehicles, emphasizing its adaptability to different hardware configurations and its readiness for real-world applications. This work lays the foundation for further exploration of robust, scalable, and secure SDV systems, setting a new standard for future automotive architectures.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment
Authors:
Faryal Batool,
Malaika Zafar,
Yasheerah Yaqoot,
Roohan Ahmed Khan,
Muhammad Haris Khan,
Aleksey Fedoseev,
Dzmitry Tsetserukou
Abstract:
Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains ensuring safe and efficient navigation in environments filled with both dynamic alive (e.g., humans) and dynamic inanimate (e.g., non-living objects) obstacles. In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Mode…
▽ More
Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains ensuring safe and efficient navigation in environments filled with both dynamic alive (e.g., humans) and dynamic inanimate (e.g., non-living objects) obstacles. In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Model (VLM) with retrieval-augmented generation (RAG) to enable real-time reasoning for adaptive navigation of mini-drone swarms in complex environments.
The key innovation of ImpedanceGPT lies in the integration of VLM and RAG, which provides the drones with enhanced semantic understanding of their surroundings. This enables the system to dynamically adjust impedance control parameters in response to obstacle types and environmental conditions. Our approach not only ensures safe and precise navigation but also improves coordination between drones in the swarm.
Experimental evaluations demonstrate the effectiveness of the system. The VLM-RAG framework achieved an obstacle detection and retrieval accuracy of 80 % under optimal lighting. In static environments, drones navigated dynamic inanimate obstacles at 1.4 m/s but slowed to 0.7 m/s with increased separation around humans. In dynamic environments, speed adjusted to 1.0 m/s near hard obstacles, while reducing to 0.6 m/s with higher deflection to safely avoid moving humans.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
QDCNN: Quantum Deep Learning for Enhancing Safety and Reliability in Autonomous Transportation Systems
Authors:
Ashtakala Meghanath,
Subham Das,
Bikash K. Behera,
Muhammad Attique Khan,
Saif Al-Kuwari,
Ahmed Farouk
Abstract:
In transportation cyber-physical systems (CPS), ensuring safety and reliability in real-time decision-making is essential for successfully deploying autonomous vehicles and intelligent transportation networks. However, these systems face significant challenges, such as computational complexity and the ability to handle ambiguous inputs like shadows in complex environments. This paper introduces a…
▽ More
In transportation cyber-physical systems (CPS), ensuring safety and reliability in real-time decision-making is essential for successfully deploying autonomous vehicles and intelligent transportation networks. However, these systems face significant challenges, such as computational complexity and the ability to handle ambiguous inputs like shadows in complex environments. This paper introduces a Quantum Deep Convolutional Neural Network (QDCNN) designed to enhance the safety and reliability of CPS in transportation by leveraging quantum algorithms. At the core of QDCNN is the UU† method, which is utilized to improve shadow detection through a propagation algorithm that trains the centroid value with preprocessing and postprocessing operations to classify shadow regions in images accurately. The proposed QDCNN is evaluated on three datasets on normal conditions and one road affected by rain to test its robustness. It outperforms existing methods in terms of computational efficiency, achieving a shadow detection time of just 0.0049352 seconds, faster than classical algorithms like intensity-based thresholding (0.03 seconds), chromaticity-based shadow detection (1.47 seconds), and local binary pattern techniques (2.05 seconds). This remarkable speed, superior accuracy, and noise resilience demonstrate the key factors for safe navigation in autonomous transportation in real-time. This research demonstrates the potential of quantum-enhanced models in addressing critical limitations of classical methods, contributing to more dependable and robust autonomous transportation systems within the CPS framework.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks
Authors:
Umar Ali Khan,
Ekram Khan,
Fiza Khan,
Athar Ali Moinuddin
Abstract:
Large Language Models (LLMs) have proven immensely beneficial in education by capturing vast amounts of literature-based information, allowing them to generate context without relying on external sources. In this paper, we propose a generative AI-powered GATE question-answering framework (GATE stands for Graduate Aptitude Test in Engineering) that leverages LLMs to explain GATE solutions and suppo…
▽ More
Large Language Models (LLMs) have proven immensely beneficial in education by capturing vast amounts of literature-based information, allowing them to generate context without relying on external sources. In this paper, we propose a generative AI-powered GATE question-answering framework (GATE stands for Graduate Aptitude Test in Engineering) that leverages LLMs to explain GATE solutions and support students in their exam preparation. We conducted extensive benchmarking to select the optimal embedding model and LLM, evaluating our framework based on criteria such as latency, faithfulness, and relevance, with additional validation through human evaluation. Our chatbot integrates state-of-the-art embedding models and LLMs to deliver accurate, context-aware responses. Through rigorous experimentation, we identified configurations that balance performance and computational efficiency, ensuring a reliable chatbot to serve students' needs. Additionally, we discuss the challenges faced in data processing and modeling and implemented solutions. Our work explores the application of Retrieval-Augmented Generation (RAG) for GATE Q/A explanation tasks, and our findings demonstrate significant improvements in retrieval accuracy and response quality. This research offers practical insights for developing effective AI-driven educational tools while highlighting areas for future enhancement in usability and scalability.
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
FLStore: Efficient Federated Learning Storage for non-training workloads
Authors:
Ahmad Faraz Khan,
Samuel Fountain,
Ahmed M. Abdelmoniem,
Ali R. Butt,
Ali Anwar
Abstract:
Federated Learning (FL) is an approach for privacy-preserving Machine Learning (ML), enabling model training across multiple clients without centralized data collection. With an aggregator server coordinating training, aggregating model updates, and storing metadata across rounds. In addition to training, a substantial part of FL systems are the non-training workloads such as scheduling, personali…
▽ More
Federated Learning (FL) is an approach for privacy-preserving Machine Learning (ML), enabling model training across multiple clients without centralized data collection. With an aggregator server coordinating training, aggregating model updates, and storing metadata across rounds. In addition to training, a substantial part of FL systems are the non-training workloads such as scheduling, personalization, clustering, debugging, and incentivization. Most existing systems rely on the aggregator to handle non-training workloads and use cloud services for data storage. This results in high latency and increased costs as non-training workloads rely on large volumes of metadata, including weight parameters from client updates, hyperparameters, and aggregated updates across rounds, making the situation even worse. We propose FLStore, a serverless framework for efficient FL non-training workloads and storage. FLStore unifies the data and compute planes on a serverless cache, enabling locality-aware execution via tailored caching policies to reduce latency and costs. Per our evaluations, compared to cloud object store based aggregator server FLStore reduces per request average latency by 71% and costs by 92.45%, with peak improvements of 99.7% and 98.8%, respectively. Compared to an in-memory cloud cache based aggregator server, FLStore reduces average latency by 64.6% and costs by 98.83%, with peak improvements of 98.8% and 99.6%, respectively. FLStore integrates seamlessly with existing FL frameworks with minimal modifications, while also being fault-tolerant and highly scalable.
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
LADs: Leveraging LLMs for AI-Driven DevOps
Authors:
Ahmad Faraz Khan,
Azal Ahmad Khan,
Anas Mohamed,
Haider Ali,
Suchithra Moolinti,
Sabaat Haroon,
Usman Tahir,
Mattia Fazzini,
Ali R. Butt,
Ali Anwar
Abstract:
Automating cloud configuration and deployment remains a critical challenge due to evolving infrastructures, heterogeneous hardware, and fluctuating workloads. Existing solutions lack adaptability and require extensive manual tuning, leading to inefficiencies and misconfigurations. We introduce LADs, the first LLM-driven framework designed to tackle these challenges by ensuring robustness, adaptabi…
▽ More
Automating cloud configuration and deployment remains a critical challenge due to evolving infrastructures, heterogeneous hardware, and fluctuating workloads. Existing solutions lack adaptability and require extensive manual tuning, leading to inefficiencies and misconfigurations. We introduce LADs, the first LLM-driven framework designed to tackle these challenges by ensuring robustness, adaptability, and efficiency in automated cloud management. Instead of merely applying existing techniques, LADs provides a principled approach to configuration optimization through in-depth analysis of what optimization works under which conditions. By leveraging Retrieval-Augmented Generation, Few-Shot Learning, Chain-of-Thought, and Feedback-Based Prompt Chaining, LADs generates accurate configurations and learns from deployment failures to iteratively refine system settings. Our findings reveal key insights into the trade-offs between performance, cost, and scalability, helping practitioners determine the right strategies for different deployment scenarios. For instance, we demonstrate how prompt chaining-based adaptive feedback loops enhance fault tolerance in multi-tenant environments and how structured log analysis with example shots improves configuration accuracy. Through extensive evaluations, LADs reduces manual effort, optimizes resource utilization, and improves system reliability. By open-sourcing LADs, we aim to drive further innovation in AI-powered DevOps automation.
△ Less
Submitted 28 February, 2025;
originally announced February 2025.
-
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Authors:
Yingying Sun,
Jun A,
Zhiwei Liu,
Rui Sun,
Liujia Qian,
Samuel H. Payne,
Wout Bittremieux,
Markus Ralser,
Chen Li,
Yi Chen,
Zhen Dong,
Yasset Perez-Riverol,
Asif Khan,
Chris Sander,
Ruedi Aebersold,
Juan Antonio Vizcaíno,
Jonathan R Krieger,
Jianhua Yao,
Han Wen,
Linfeng Zhang,
Yunping Zhu,
Yue Xuan,
Benjamin Boyang Sun,
Liang Qiao,
Henning Hermjakob
, et al. (37 additional authors not shown)
Abstract:
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights.…
▽ More
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
△ Less
Submitted 21 February, 2025;
originally announced February 2025.