-
A Hybrid Ensemble Learning Framework for Image-Based Solar Panel Classification
Authors:
Vivek Tetarwal,
Sandeep Kumar
Abstract:
Solar energy installations are on the rise, so appropriate maintenance techniques are needed to sustain maximum performance. One of the major challenges is the automated discrimination between clean and dirty solar panels. This paper presents a novel Dual Ensemble Neural Network (DENN) to classify solar panels using image-based features. The suggested approach combines the advantages of various ensemble models by integrating them into a dual framework, aimed at improving both classification accuracy and robustness. The DENN model is evaluated against current ensemble methods and shows superior performance across a range of assessment metrics. It outperforms the other methods and reaches state-of-the-art accuracy on the Deep Solar Eye dataset, effectively serving predictive maintenance purposes in solar energy systems. The results reveal the potential of hybrid ensemble learning techniques to advance automated solar panel inspection as a scalable solution to real-world challenges.
Submitted 2 July, 2025;
originally announced July 2025.
-
Cooperative Target Capture in 3D Engagements over Switched Dynamic Graphs
Authors:
Abhinav Sinha,
Shashi Ranjan Kumar
Abstract:
This paper presents a leaderless cooperative guidance strategy for simultaneous time-constrained interception of a stationary target when the interceptors exchange information over switched dynamic graphs. We specifically focus on scenarios when the interceptors lack radial acceleration capabilities, relying solely on their lateral acceleration components. This consideration aligns with their inherent kinematic turn constraints. The proposed strategy explicitly addresses the complexities of coupled 3D engagements, thereby mitigating performance degradation that typically arises when the pitch and yaw channels are decoupled into two separate, mutually orthogonal planar engagements. Moreover, our formulation incorporates modeling uncertainties associated with the time-to-go estimation into the derivation of cooperative guidance commands to ensure robustness against inaccuracies in dynamic engagement scenarios. To optimize control efficiency, we analytically derive the lateral acceleration components in the orthogonal pitch and yaw channels by solving an instantaneous optimization problem, subject to an affine constraint. We show that the proposed cooperative guidance commands guarantee consensus in time-to-go values within a predefined time, which can be prescribed as a design parameter, regardless of the interceptors' initial configurations. We provide simulations to attest to the efficacy of the proposed method.
Submitted 2 July, 2025;
originally announced July 2025.
-
Parallel Transmission Aware Co-Design: Enhancing Manipulator Performance Through Actuation-Space Optimization
Authors:
Rohit Kumar,
Melya Boukheddimi,
Dennis Mronga,
Shivesh Kumar,
Frank Kirchner
Abstract:
In robotics, structural design and behavior optimization have long been considered separate processes, resulting in the development of systems with limited capabilities. Recently, co-design methods have gained popularity, where bi-level formulations are used to simultaneously optimize the robot design and behavior for specific tasks. However, most implementations assume a serial or tree-type model of the robot, overlooking the fact that many robot platforms incorporate parallel mechanisms. In this paper, we present a novel co-design approach that explicitly incorporates parallel coupling constraints into the dynamic model of the robot. In this framework, an outer optimization loop focuses on the design parameters, in our case the transmission ratios of a parallel belt-driven manipulator, which map the desired torques from the joint space to the actuation space. An inner loop performs trajectory optimization in the actuation space, thus exploiting the entire dynamic range of the manipulator. We compare the proposed method with a conventional co-design approach based on a simplified tree-type model. By taking advantage of the actuation space representation, our approach leads to a significant increase in dynamic payload capacity compared to the conventional co-design implementation.
Submitted 1 July, 2025;
originally announced July 2025.
-
Container damage detection using advanced computer vision model Yolov12 vs Yolov11 vs RF-DETR A comparative analysis
Authors:
Subhadip Kumar
Abstract:
Containers are an integral part of the logistics industry and act as a barrier for cargo. A typical service life for a container is more than 20 years. However, over time containers suffer various types of damage due to mechanical as well as natural factors. A damaged container is a safety hazard for the employees handling it and a liability for the logistics company. Therefore, timely inspection and detection of damaged containers is key to prolonging service life as well as avoiding safety hazards. In this paper, we compare the damage detection performance of three state-of-the-art computer vision models: Yolov12, Yolov11 and RF-DETR. We use a dataset of 278 annotated images to train, validate and test the models, and we compare their mAP and precision. The objective of this paper is to identify the model best suited for container damage detection. The results are mixed. The mAP@50 score of Yolov11 and Yolov12 was 81.9%, compared to 77.7% for RF-DETR. However, when tested on less common damaged containers, the RF-DETR model outperformed the others overall, more accurately detecting both damaged containers and damage occurrences with high confidence.
Submitted 26 June, 2025;
originally announced June 2025.
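The mAP@50 figure quoted in the abstract above counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that matching criterion only, not the paper's evaluation code; the function names and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region (zero when the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(predicted_box, truth_box):
    """Hypothetical helper: True when a prediction counts as a hit at the 0.5 IoU threshold."""
    return iou(predicted_box, truth_box) >= 0.5
```

Average precision is then computed per class from the confidence-ranked hits and misses, and mAP is the mean over classes; that step is omitted here.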
-
ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation
Authors:
Reza Yousefi Maragheh,
Pratheek Vadla,
Priyank Gupta,
Kai Zhao,
Aysenur Inan,
Kehui Yao,
Jianpeng Xu,
Praveen Kanumala,
Jason Cho,
Sushant Kumar
Abstract:
Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a Context Summary Agent that summarizes the findings of the NLI Agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG across three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also conduct an ablation study to analyze the effect of the different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
Submitted 27 June, 2025;
originally announced June 2025.
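For readers unfamiliar with the two metrics reported in the abstract above, here is a minimal sketch of NDCG@5 and Hit@5 in their common binary-relevance form. It is an illustration under assumptions, not the authors' evaluation code: the function names and the binary relevance convention are choices made here, and the paper may use graded relevance.

```python
import math


def hit_at_k(ranked_items, relevant_items, k=5):
    """Return 1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, relevant_items, k=5):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A relevant item at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Both functions score a single user or session; the reported numbers would be averages over all evaluated users.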
-
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
Authors:
Devesh Pant,
Rishi Raj Grandhe,
Vipin Samaria,
Mukul Paul,
Sudhir Kumar,
Saransh Khanna,
Jatin Agrawal,
Jushaan Singh Kalra,
Akhil VSSG,
Satish V Khalikar,
Vipin Garg,
Himanshu Chauhan,
Pranay Verma,
Neha Khandelwal,
Soma S Dhavala,
Minesh Mathew
Abstract:
Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by the public health experts at NCDC as potential outbreaks.
Submitted 24 June, 2025;
originally announced June 2025.
-
RareSpot: Spotting Small and Rare Wildlife in Aerial Imagery with Multi-Scale Consistency and Context-Aware Augmentation
Authors:
Bowen Zhang,
Jesse T. Boulerice,
Nikhil Kuniyil,
Charvi Mendiratta,
Satish Kumar,
Hila Shamon,
B. S. Manjunath
Abstract:
Automated detection of small and rare wildlife in aerial imagery is crucial for effective conservation, yet remains a significant technical challenge. Prairie dogs exemplify this issue: their ecological importance as keystone species contrasts sharply with their elusive presence--marked by small size, sparse distribution, and subtle visual features--which undermines existing detection approaches. To address these challenges, we propose RareSpot, a robust detection framework integrating multi-scale consistency learning and context-aware augmentation. Our multi-scale consistency approach leverages structured alignment across feature pyramids, enhancing fine-grained object representation and mitigating scale-related feature loss. Complementarily, context-aware augmentation strategically synthesizes challenging training instances by embedding difficult-to-detect samples into realistic environmental contexts, significantly boosting model precision and recall. Evaluated on an expert-annotated prairie dog drone imagery benchmark, our method achieves state-of-the-art performance, improving detection accuracy by over 35% compared to baseline methods. Importantly, it generalizes effectively across additional wildlife datasets, demonstrating broad applicability. The RareSpot benchmark and approach not only support critical ecological monitoring but also establish a new foundation for detecting small, rare species in complex aerial scenes.
Submitted 23 June, 2025;
originally announced June 2025.
-
When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking
Authors:
Manu Pande,
Shahil Kumar,
Anay Yatin Damle
Abstract:
This paper investigates the counterintuitive phenomenon where fine-tuning pre-trained transformer models degrades performance on the MS MARCO passage ranking task. Through comprehensive experiments involving five model variants, including full-parameter fine-tuning and parameter-efficient LoRA adaptations, we demonstrate that all fine-tuning approaches underperform the base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Our analysis reveals that fine-tuning disrupts the optimal embedding space structure learned during the base model's extensive pre-training on 1 billion sentence pairs, including 9.1 million MS MARCO samples. UMAP visualizations show progressive embedding space flattening, while training dynamics analysis and computational efficiency metrics further support our findings. These results challenge conventional wisdom about transfer learning effectiveness on saturated benchmarks and suggest architectural innovations may be necessary for meaningful improvements.
Submitted 23 June, 2025;
originally announced June 2025.
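The MRR@10 score cited in the abstract above is the mean, over queries, of the reciprocal rank of the first relevant passage within the top 10 results. A minimal sketch follows, assuming rankings and relevance judgments are held in plain dictionaries; the function name and data layout are hypothetical and not taken from the paper.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings: dict mapping query id -> list of passage ids, best first.
    relevant: dict mapping query id -> set of relevant passage ids.
    """
    total = 0.0
    for query_id, ranked in rankings.items():
        judged = relevant.get(query_id, set())
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in judged:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0
```

A query with no relevant passage in its top k contributes zero, which is why small ranking shifts near the top of the list move this metric noticeably.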
-
Comparative Analysis of Lion and AdamW Optimizers for Cross-Encoder Reranking with MiniLM, GTE, and ModernBERT
Authors:
Shahil Kumar,
Manu Pande,
Anay Yatin Damle
Abstract:
Modern information retrieval systems often employ a two-stage pipeline: an efficient initial retrieval stage followed by a computationally intensive reranking stage. Cross-encoders have shown strong effectiveness for reranking due to their deep analysis of query-document pairs. This paper studies the impact of the Lion optimizer, a recent alternative to AdamW, during fine-tuning of cross-encoder rerankers. We fine-tune three transformer models (MiniLM, GTE, and ModernBERT) on the MS MARCO passage ranking dataset using both optimizers. GTE and ModernBERT support extended context lengths (up to 8192 tokens). We evaluate effectiveness using the TREC 2019 Deep Learning Track and the MS MARCO dev set (MRR@10). Experiments, run on the Modal cloud platform, reveal that ModernBERT with Lion achieves the best NDCG@10 (0.7225) and MAP (0.5121) on TREC DL 2019, while MiniLM with Lion ties ModernBERT for MRR@10 (0.5988) on MS MARCO dev. Lion also provides superior GPU efficiency, improving utilization by 2.67% to 10.33% across models. We analyze performance trends using standard IR metrics and discuss the optimizer's impact on training dynamics across architectures.
Submitted 23 June, 2025;
originally announced June 2025.
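To make the optimizer comparison in the abstract above concrete, here is a minimal sketch of a single Lion update on a flat list of scalar parameters. Lion applies the sign of an interpolation between the momentum and the current gradient, so it keeps one state value per parameter where AdamW keeps two. The function name and the hyperparameter defaults below are assumptions for illustration; the abstract does not state the settings used in the paper.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update; returns the new parameters and the new momentum."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is only the sign of the interpolated momentum.
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay, in the same style as AdamW.
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum is tracked with a second, slower coefficient.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum
```

Because every coordinate moves by exactly `lr` in magnitude (before weight decay), Lion is typically run with a smaller learning rate than AdamW.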
-
CARTS: Collaborative Agents for Recommendation Textual Summarization
Authors:
Jiao Chen,
Kehui Yao,
Reza Yousefi Maragheh,
Kai Zhao,
Jianpeng Xu,
Jason Cho,
Evren Korpeoglu,
Sushant Kumar,
Kannan Achan
Abstract:
Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets and adhere to strict word limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages: Generation Augmented Generation (GAG), refinement circle, and arbitration. Successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large-scale e-commerce data and live A/B testing show that CARTS significantly outperforms single-pass and chain-of-thought LLM baselines, delivering higher title relevance and improved user engagement metrics.
Submitted 1 July, 2025; v1 submitted 21 June, 2025;
originally announced June 2025.
-
AndroIDS: Android-based Intrusion Detection System using Federated Learning
Authors:
Akarsh K Nair,
Shanik Hubert Satheesh Kumar,
Deepti Gupta
Abstract:
The exponential growth of Android-based mobile IoT systems has significantly increased the susceptibility of devices to cyberattacks, particularly in smart homes, UAVs, and other connected mobile environments. This article presents a federated learning-based intrusion detection framework called AndroIDS that leverages system call traces as a personalized and privacy-preserving data source. Unlike conventional centralized approaches, the proposed method enables collaborative anomaly detection without sharing raw data, thus preserving user privacy across distributed nodes. A generalized system call dataset was generated to reflect realistic Android system behavior and serves as the foundation for experimentation. Extensive evaluation demonstrates the effectiveness of the FL model under both IID and non-IID conditions, achieving accuracies of 96.46% and 92.87% and F1-scores of 89% and 86%, respectively. These results highlight the model's robustness to data heterogeneity, with only a minor performance drop in the non-IID case. A detailed comparison with centralized deep learning further illustrates trade-offs in detection performance and deployment feasibility. Overall, the results validate the practical applicability of the proposed approach for secure and scalable intrusion detection in real-world mobile IoT scenarios.
Submitted 19 June, 2025;
originally announced June 2025.
-
SIDE: Semantic ID Embedding for effective learning from sequences
Authors:
Dinesh Ramasamy,
Shakti Kumar,
Chris Cadonic,
Jiaxin Yang,
Sohini Roychowdhury,
Esam Abdel Rhman,
Srihari Reddy
Abstract:
Sequence-based recommendation models are driving the state of the art for industrial ad-recommendation systems. Such systems typically deal with user histories or sequence lengths on the order of O(10^3) to O(10^4) events. While adding embeddings at this scale is manageable in pre-trained models, incorporating them into real-time prediction models is challenging due to both storage and inference costs. To address this scaling challenge, we propose a novel approach that leverages vector quantization (VQ) to inject a compact Semantic ID (SID) as input to the recommendation models instead of a collection of embeddings. Our method builds on recent work on SIDs by introducing three key innovations: (i) a multi-task VQ-VAE framework, called VQ fusion, that fuses multiple content embeddings and categorical predictions into a single Semantic ID; (ii) a parameter-free, highly granular SID-to-embedding conversion technique, called SIDE, that is validated with two content embedding collections, thereby eliminating the need for a large parameterized lookup table; and (iii) a novel quantization method called Discrete-PCA (DPCA), which generalizes and enhances residual quantization techniques. The proposed enhancements, when applied to a large-scale industrial ads-recommendation system, achieve a 2.4X improvement in normalized entropy (NE) gain and a 3X reduction in data footprint compared to traditional SID methods.
Submitted 19 June, 2025;
originally announced June 2025.
-
From Local Interactions to Global Operators: Scalable Gaussian Process Operator for Physical Systems
Authors:
Sawan Kumar,
Tapas Tripura,
Rajdip Nayek,
Souvik Chakraborty
Abstract:
Operator learning offers a powerful paradigm for solving parametric partial differential equations (PDEs), but scaling probabilistic neural operators such as the recently proposed Gaussian Processes Operators (GPOs) to high-dimensional, data-intensive regimes remains a significant challenge. In this work, we introduce a novel, scalable GPO, which capitalizes on sparsity, locality, and structural information through judicious kernel design. Addressing the fundamental limitation of cubic computational complexity, our method leverages nearest-neighbor-based local kernel approximations in the spatial domain, sparse kernel approximation in the parameter space, and structured Kronecker factorizations to enable tractable inference on large-scale datasets and high-dimensional input. While local approximations often introduce accuracy trade-offs due to limited kernel interactions, we overcome this by embedding operator-aware kernel structures and employing expressive, task-informed mean functions derived from neural operator architectures. Through extensive evaluations on a broad class of nonlinear PDEs - including Navier-Stokes, wave advection, Darcy flow, and Burgers' equations - we demonstrate that our framework consistently achieves high accuracy across varying discretization scales. These results underscore the potential of our approach to bridge the gap between scalability and fidelity in GPO, offering a compelling foundation for uncertainty-aware modeling in complex physical systems.
Submitted 18 June, 2025;
originally announced June 2025.
-
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Authors:
Kartik Sharma,
Yiqiao Jin,
Vineeth Rakesh,
Yingtong Dou,
Menghai Pan,
Mahashweta Das,
Srijan Kumar
Abstract:
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose $\textbf{Sysformer}$, a trans$\textbf{former}$ model that updates an initial $\textbf{sys}$tem prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on $5$ LLMs from different families and $2$ recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to up to $80\%$ gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by up to $90\%$. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to $100\%$ more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
Submitted 18 June, 2025;
originally announced June 2025.
-
Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation
Authors:
Hussam Al Daas,
Grey Ballard,
Laura Grigori,
Suraj Kumar,
Kathryn Rouse,
Mathieu Vérité
Abstract:
In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a $3$-dimensional symmetric tensor and in gradient-based methods for computing a symmetric CP decomposition. We establish communication lower bounds that determine how much data movement is required to perform the specified computation in parallel. The core idea of the proof relies on extending a key geometric inequality for $3$-dimensional symmetric computations. We demonstrate that the communication lower bounds are tight by presenting an optimal algorithm where the data distribution is a natural extension of the triangle block partition scheme for symmetric matrices to 3-dimensional symmetric tensors.
Submitted 18 June, 2025;
originally announced June 2025.
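The computation studied in the abstract above multiplies the same vector along two modes of a symmetric 3-dimensional tensor, that is, y_i = sum over j and k of a_ijk * x_j * x_k. The sketch below is a sequential illustration that stores only the unique entries with i >= j >= k and expands them by symmetry. It is not the paper's parallel algorithm and says nothing about its data distribution; the function name and the dictionary storage format are assumptions.

```python
from itertools import permutations


def sttsv(unique_entries, x):
    """Compute y_i = sum_{j,k} a_ijk * x_j * x_k for a symmetric 3-way tensor.

    unique_entries: dict mapping (i, j, k) with i >= j >= k to the value a_ijk.
    x: list of floats whose length is the tensor dimension.
    """
    y = [0.0] * len(x)
    for index, value in unique_entries.items():
        # Each distinct permutation of the index is one entry of the full tensor.
        for p, q, r in set(permutations(index)):
            y[p] += value * x[q] * x[r]
    return y
```

Storing only the unique entries cuts memory to roughly one sixth of the full tensor, which is the structure that symmetric partitioning schemes aim to preserve across processors.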
-
Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints
Authors:
Sunil Kumar,
Bowen Zhao,
Leo Dirac,
Paulina Varshavskaya
Abstract:
Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
Submitted 10 June, 2025;
originally announced June 2025.
-
Unifying Streaming and Non-streaming Zipformer-based ASR
Authors:
Bidisha Sharma,
Karthik Pandia Durai,
Shankar Venkatesan,
Jeena J Prakash,
Shashi Kumar,
Malolan Chetlur,
Andreas Stolcke
Abstract:
There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through the chunked atten…
▽ More
There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through the chunked attention masking in the training of zipformer-based ASR models. We demonstrate that using right-context is more effective in zipformer models compared to other conformer models due to its multi-scale nature. We analyze the effect of varying the number of right-context frames on accuracy and latency of the streaming ASR models. We use Librispeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production grade server-client setup across diverse testsets of different domains. The proposed strategy reduces word error by relative 7.9\% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customers requirements.
Submitted 17 June, 2025;
originally announced June 2025.
-
Enhancing Clinical Decision Support and EHR Insights through LLMs and the Model Context Protocol: An Open-Source MCP-FHIR Framework
Authors:
Abul Ehtesham,
Aditi Singh,
Saket Kumar
Abstract:
Enhancing clinical decision support (CDS), reducing documentation burdens, and improving patient health literacy remain persistent challenges in digital health. This paper presents an open-source, agent-based framework that integrates Large Language Models (LLMs) with HL7 FHIR data via the Model Context Protocol (MCP) for dynamic extraction and reasoning over electronic health records (EHRs). Built on the established MCP-FHIR implementation, the framework enables declarative access to diverse FHIR resources through JSON-based configurations, supporting real-time summarization, interpretation, and personalized communication across multiple user personas, including clinicians, caregivers, and patients. To ensure privacy and reproducibility, the framework is evaluated using synthetic EHR data from the SMART Health IT sandbox (https://r4.smarthealthit.org/), which conforms to the FHIR R4 standard. Unlike traditional approaches that rely on hardcoded retrieval and static workflows, the proposed method delivers scalable, explainable, and interoperable AI-powered EHR applications. The agentic architecture further supports multiple FHIR formats, laying a robust foundation for advancing personalized digital health solutions.
Submitted 13 June, 2025;
originally announced June 2025.
-
Adaptive Model-Base Control of Quadrupeds via Online System Identification using Kalman Filter
Authors:
Jonas Haack,
Franek Stark,
Shubham Vyas,
Frank Kirchner,
Shivesh Kumar
Abstract:
Many real-world applications require legged robots to be able to carry variable payloads. Model-based controllers such as model predictive control (MPC) have become the de facto standard in research for controlling these systems. However, most model-based control architectures use fixed plant models, which limits their applicability to different tasks. In this paper, we present a Kalman filter (KF) formulation for online identification of the mass and center of mass (COM) of a four-legged robot. We evaluate our method on a quadrupedal robot carrying various payloads and find that it is more robust to strong measurement noise than classical recursive least squares (RLS) methods. Moreover, it improves the tracking performance of the model-based controller with varying payloads when the model parameters are adjusted at runtime.
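To make the idea of Kalman-filter-based online mass identification concrete, here is a minimal scalar sketch. It is not the paper's formulation: the measurement model (summed vertical contact force of a robot standing still), the random-walk process model, and all numeric values are assumptions chosen for illustration, and center-of-mass estimation is left out.

```python
import random

G = 9.81  # gravitational acceleration in m/s^2


def kalman_mass_estimate(force_measurements, mass0, var0, process_var, meas_var):
    """Scalar Kalman filter with a random-walk model for the unknown mass.

    Measurement model (an assumption of this sketch): z = G * mass + noise,
    i.e. the summed vertical contact force while the robot stands still.
    """
    mass, var = mass0, var0
    for z in force_measurements:
        var += process_var                         # predict: mass may drift slowly
        gain = var * G / (G * G * var + meas_var)  # Kalman gain for H = G
        mass += gain * (z - G * mass)              # correct with the innovation
        var *= (1.0 - gain * G)                    # posterior variance
    return mass, var


rng = random.Random(0)
true_mass = 15.0  # hypothetical robot plus payload, in kg
forces = [G * true_mass + rng.gauss(0.0, 20.0) for _ in range(500)]
estimate, variance = kalman_mass_estimate(forces, mass0=10.0, var0=25.0,
                                          process_var=1e-6, meas_var=400.0)
```

The small process variance lets the estimate keep tracking if the payload changes at runtime, at the cost of never fully freezing the estimate.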
Submitted 16 June, 2025;
originally announced June 2025.
-
NaSh: Guardrails for an LLM-Powered Natural Language Shell
Authors:
Bimal Raj Gyawali,
Saikrishna Achalla,
Konstantinos Kallas,
Sam Kumar
Abstract:
We explore how a shell that uses an LLM to accept natural language input might be designed differently from the shells of today. As LLMs may produce unintended or unexplainable outputs, we argue that a natural language shell should provide guardrails that empower users to recover from such errors. We concretize some ideas for doing so by designing a new shell called NaSh, identify remaining open problems in this space, and discuss research directions to address them.
Submitted 15 June, 2025;
originally announced June 2025.
-
Beyond Sin-Squared Error: Linear-Time Entrywise Uncertainty Quantification for Streaming PCA
Authors:
Syamantak Kumar,
Shourya Pandey,
Purnamrita Sarkar
Abstract:
We propose a novel statistical inference framework for streaming principal component analysis (PCA) using Oja's algorithm, enabling the construction of confidence intervals for individual entries of the estimated eigenvector. Most existing works on streaming PCA focus on providing sharp sin-squared error guarantees. Recently, there has been some interest in uncertainty quantification for the sin-squared error. However, uncertainty quantification or sharp error guarantees for entries of the estimated eigenvector in the streaming setting remains largely unexplored. We derive a sharp Bernstein-type concentration bound for elements of the estimated vector matching the optimal error rate up to logarithmic factors. We also establish a Central Limit Theorem for a suitably centered and scaled subset of the entries. To efficiently estimate the coordinate-wise variance, we introduce a provably consistent subsampling algorithm that leverages the median-of-means approach, empirically achieving similar accuracy to multiplier bootstrap methods while being significantly more computationally efficient. Numerical experiments demonstrate its effectiveness in providing reliable uncertainty estimates with a fraction of the computational cost of existing methods.
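For readers unfamiliar with Oja's algorithm, the sketch below shows the basic streaming update and the sin-squared error the abstract refers to. It covers only the point estimate; the confidence intervals and the subsampling variance estimator are not reproduced. The synthetic data, step size, and function names are assumptions for this example.

```python
import math
import random


def oja_top_eigenvector(samples, dim, step_size):
    """One pass of Oja's rule: w <- normalize(w + step_size * x * (x . w))."""
    w = [1.0 / math.sqrt(dim)] * dim  # deterministic unit-norm start
    for x in samples:
        proj = sum(xi * wi for xi, wi in zip(x, w))
        w = [wi + step_size * proj * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
    return w


def sin_squared_error(w, v):
    """sin^2 of the angle between unit vectors w and v."""
    dot = sum(wi * vi for wi, vi in zip(w, v))
    return 1.0 - dot * dot


rng = random.Random(0)
scales = [3.0, 0.5, 0.5]  # synthetic data: top eigenvector is the first axis
stream = ([s * rng.gauss(0.0, 1.0) for s in scales] for _ in range(5000))
w_hat = oja_top_eigenvector(stream, dim=3, step_size=0.01)
err = sin_squared_error(w_hat, [1.0, 0.0, 0.0])
```

Each sample is touched once and only the current vector is stored, which is what makes the method suitable for streams. Entrywise guarantees concern the individual coordinates of `w_hat` rather than the single scalar `err`.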
Submitted 14 June, 2025;
originally announced June 2025.
-
Forward Target Propagation: A Forward-Only Approach to Global Error Credit Assignment via Local Losses
Authors:
Nazmus Saadat As-Saquib,
A N M Nafiz Abeer,
Hung-Ta Chien,
Byung-Jun Yoon,
Suhas Kumar,
Su-in Yi
Abstract:
Training neural networks has traditionally relied on backpropagation (BP), a gradient-based algorithm that, despite its widespread success, suffers from key limitations in both biological and hardware perspectives. These include backward error propagation by symmetric weights, non-local credit assignment, and frozen activity during backward passes. We propose Forward Target Propagation (FTP), a biologically plausible and computationally efficient alternative that replaces the backward pass with a second forward pass. FTP estimates layerwise targets using only feedforward computations, eliminating the need for symmetric feedback weights or learnable inverse functions, hence enabling modular and local learning. We evaluate FTP on fully connected networks, CNNs, and RNNs, demonstrating accuracies competitive with BP on MNIST, CIFAR10, and CIFAR100, as well as effective modeling of long-term dependencies in sequential tasks. Moreover, FTP outperforms BP under quantized low-precision and emerging hardware constraints while also demonstrating substantial efficiency gains over other biologically inspired methods such as target propagation variants and forward-only learning algorithms. With its minimal computational overhead, forward-only nature, and hardware compatibility, FTP provides a promising direction for energy-efficient on-device learning and neuromorphic computing.
Submitted 20 May, 2025;
originally announced June 2025.
-
Automated Validation of COBOL to Java Transformation
Authors:
Atul Kumar,
Diptikalyan Saha,
Toshikai Yasue,
Kohichi Ono,
Saravanan Krishnan,
Sandeep Hans,
Fumiko Satoh,
Gerald Mitchell,
Sachin Kumar
Abstract:
Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterprise-level code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code. We propose a framework and a tool to help validate the equivalence of COBOL and translated Java. The results can also help repair the code if there are some issues and provide feedback to the AI model to improve. We have developed a symbolic-execution-based test generator that automatically generates unit tests for the source COBOL programs and also mocks the external resource calls. We generate equivalent JUnit test cases, with mocking equivalent to that of the COBOL tests, and run them to check semantic equivalence between the original and translated programs.
Submitted 14 April, 2025;
originally announced June 2025.
-
Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Authors:
Xanh Ho,
Sunisth Kumar,
Yun-Ang Wu,
Florian Boudin,
Atsuhiro Takasu,
Akiko Aizawa
Abstract:
Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
Submitted 12 June, 2025;
originally announced June 2025.
-
A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma
Authors:
Garima Jain,
Sanghamitra Pati,
Mona Duggal,
Amit Sethi,
Abhijeet Patil,
Gururaj Malekar,
Nilesh Kowe,
Jitender Kumar,
Jatin Kashyap,
Divyajeet Rout,
Deepali,
Hitesh,
Nishi Halduniya,
Sharat Kumar,
Heena Tabassum,
Rupinder Singh Dhaliwal,
Sucheta Devi Khuraijam,
Sushma Khuraijam,
Sharmila Laishram,
Simmi Kharb,
Sunita Singh,
K. Swaminadtan,
Ranjana Solanki,
Deepika Hemranjani,
Shashank Nath Singh
, et al. (12 additional authors not shown)
Abstract:
Oral squamous cell carcinoma (OSCC) is a major global health burden, particularly in several regions across Asia, Africa, and South America, where it accounts for a significant proportion of cancer cases. Early detection dramatically improves outcomes, with stage I cancers achieving up to 90 percent survival. However, traditional diagnosis based on histopathology has limited accessibility in low-resource settings because it is invasive, resource-intensive, and reliant on expert pathologists. On the other hand, oral cytology of brush biopsy offers a minimally invasive and lower-cost alternative, provided that the remaining challenges, inter-observer variability and unavailability of expert pathologists, can be addressed using artificial intelligence. Development and validation of robust AI solutions requires access to large, labeled, and multi-source datasets to train high-capacity models that generalize across domain shifts. We introduce the first large and multicenter oral cytology dataset, comprising annotated slides stained with Papanicolaou (PAP) and May-Grunwald-Giemsa (MGG) protocols, collected from ten tertiary medical centers in India. The dataset, labeled and annotated by expert pathologists for cellular anomaly classification and detection, is designed to advance AI-driven diagnostic methods. By filling the gap in publicly available oral cytology datasets, this resource aims to enhance automated detection, reduce diagnostic errors, and improve early OSCC diagnosis in resource-constrained settings, ultimately contributing to reduced mortality and better patient outcomes worldwide.
Submitted 11 June, 2025;
originally announced June 2025.
-
Spark Transformer: Reactivating Sparsity in FFN and Attention
Authors:
Chong You,
Kan Wu,
Zhipeng Jia,
Lin Chen,
Srinadh Bhojanapalli,
Jiaxian Guo,
Utku Evci,
Jan Wassenberg,
Praneeth Netrapalli,
Jeremiah J. Willcock,
Suvinay Subramanian,
Felix Chern,
Alek Andreev,
Shreya Pathak,
Felix Yu,
Prateek Jain,
David E. Culler,
Henry M. Levy,
Sanjiv Kumar
Abstract:
The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges.
This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-k operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
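The contrast between sort-based top-k masking and a sort-free approximation can be sketched as follows. The abstract states only that statistical top-k is a linear-time approximation that avoids sorting. The specific rule used here (a threshold at the Gaussian quantile computed from the sample mean and standard deviation) is an assumption of this example and may differ from the paper's operator. All names and data are hypothetical.

```python
import random
from statistics import NormalDist, fmean, pstdev


def exact_topk_mask(values, k):
    """Keep the k largest entries and zero the rest (uses a full sort)."""
    keep = set(sorted(range(len(values)), key=values.__getitem__, reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def gaussian_threshold_mask(values, k):
    """Approximate top-k without sorting.

    Assumption of this sketch: entries are roughly Gaussian, so a threshold at
    the (1 - k/n) quantile of a fitted normal keeps about k entries.
    """
    n = len(values)
    mean, std = fmean(values), pstdev(values)
    threshold = mean + std * NormalDist().inv_cdf(1.0 - k / n)
    return [v if v > threshold else 0.0 for v in values]


rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
k = 800  # 8% of entries, matching the FFN sparsity quoted in the abstract
exact = exact_topk_mask(activations, k)
approx = gaussian_threshold_mask(activations, k)
kept_exact = sum(v != 0.0 for v in exact)
kept_approx = sum(v != 0.0 for v in approx)
```

The exact variant keeps precisely `k` entries at O(n log n) cost; the threshold variant needs only two linear passes but keeps roughly `k` entries, with the count depending on how Gaussian the activations actually are.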
Submitted 6 June, 2025;
originally announced June 2025.
-
Mapping Human-Agent Co-Learning and Co-Adaptation: A Scoping Review
Authors:
Shruti Kumar,
Xiaoyu Chen,
Xiaomei Wang
Abstract:
Several papers have delved into the challenges of human-AI-robot co-learning and co-adaptation. It has been noted that the terminology used to describe this collaborative relationship in existing studies needs to be more consistent. For example, the prefix "co" is used interchangeably to represent both "collaborative" and "mutual," and the terms "co-learning" and "co-adaptation" are sometimes used interchangeably. However, they can reflect subtle differences in the focus of the studies. The current scoping review's primary research question (RQ1) aims to gather existing papers discussing this collaboration pattern and examine the terms researchers use to describe this human-agent relationship. Given the relative newness of this area of study, we are also keen on exploring the specific types of intelligent agents and task domains that have been considered in existing research (RQ2). This exploration is significant as it can shed light on the diversity of human-agent interactions, from one-time to continuous learning/adaptation scenarios. It can also help us understand the dynamics of human-agent interactions in different task domains, guiding our expectations towards research situated in dynamic, complex domains. Our third objective (RQ3) is to investigate the cognitive theories and frameworks that have been utilized in existing studies to measure human-agent co-learning and co-adaptation. This investigation is crucial as it can help us understand the theoretical underpinnings of human-agent collaboration and adaptation, and it can also guide us in identifying any new frameworks proposed specifically for this type of relationship.
Submitted 29 May, 2025;
originally announced June 2025.
-
Optimizing Recall or Relevance? A Multi-Task Multi-Head Approach for Item-to-Item Retrieval in Recommendation
Authors:
Jiang Zhang,
Sumit Kumar,
Wei Chang,
Yubo Wang,
Feng Zhang,
Weize Mao,
Hanchao Yu,
Aashu Singh,
Min Li,
Qifan Wang
Abstract:
The task of item-to-item (I2I) retrieval is to identify a set of relevant and highly engaging items based on a given trigger item. It is a crucial component in modern recommendation systems, where users' previously engaged items serve as trigger items to retrieve relevant content for future engagement. However, existing I2I retrieval models in industry are primarily built on co-engagement data and optimized using the recall measure, which overly emphasizes co-engagement patterns while failing to capture semantic relevance. This often leads to overfitting short-term co-engagement trends at the expense of long-term benefits such as discovering novel interests and promoting content diversity. To address this challenge, we propose MTMH, a Multi-Task and Multi-Head I2I retrieval model that achieves both high recall and semantic relevance. Our model consists of two key components: 1) a multi-task learning loss for formally optimizing the trade-off between recall and semantic relevance, and 2) a multi-head I2I retrieval architecture for retrieving both highly co-engaged and semantically relevant items. We evaluate MTMH using proprietary data from a commercial platform serving billions of users and demonstrate that it can improve recall by up to 14.4% and semantic relevance by up to 56.6% compared with prior state-of-the-art models. We also conduct live experiments to verify that MTMH can enhance both short-term consumption metrics and long-term user-experience-related metrics. Our work provides a principled approach for jointly optimizing I2I recall and semantic relevance, which has significant implications for improving the overall performance of recommendation systems.
Submitted 6 June, 2025;
originally announced June 2025.
-
Efficient Memory Tiering in a Virtual Machine
Authors:
Chandra Prakash,
Aravinda Prasad,
Sandeep Kumar,
Sreenivas Subramoney
Abstract:
Memory tiering is the norm to effectively tackle the increasing server memory total cost of ownership (TCO) and the growing data demands of modern data center workloads. However, the host-based state-of-the-art memory tiering solutions can be inefficient for a virtualized environment when (i) the frequently accessed data are scattered across the guest physical address space or (ii) the accesses to a huge page inside the guest are skewed due to a small number of subpages being hot. Scattered or skewed accesses make the whole huge page look hot in the host address space. This results in the host selecting and placing sparsely accessed huge pages in near memory, wasting costly near memory resources.
We propose a host-agnostic technique employed inside the guest that exploits the two-level address translation in a virtualized environment to consolidate the scattered and skewed accesses to a set of guest physical address ranges. Consolidation transforms sparsely hot huge pages to densely hot huge pages in the host address space context. As a consequence, host-based tiering solutions can place densely hot huge pages in near memory, improving near memory utilization. Evaluation of our technique on standalone real-world benchmarks with state-of-the-art host-based tiering shows a 50-70% reduction in near memory consumption at similar performance levels, while evaluation at scale improves performance by 10-13% with similar memory TCO.
Submitted 6 June, 2025;
originally announced June 2025.
-
Analysis of cost-efficiency of serverless approaches
Authors:
Nakhat Syeda,
Harsh Shah,
Rajvinder Singh,
Suraj Jaju,
Sumedha Kumar,
Gourav Chhabra,
Maria Spichkova
Abstract:
In this paper, we present a survey of research studies related to the cost-effectiveness of the serverless approach and corresponding cost savings. We conducted a systematic literature review using the Google Scholar search engine, covering the period from 2010 to 2024. We identified 34 related studies, from which we extracted 17 parameters that might influence the relative cost savings of applying the serverless approach.
Submitted 6 June, 2025;
originally announced June 2025.
-
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Authors:
Andres Carofilis,
Pradeep Rangappa,
Srikanth Madikeri,
Shashi Kumar,
Sergio Burdisso,
Jeena Prakash,
Esau Villatoro-Tello,
Petr Motlicek,
Bidisha Sharma,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
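A consensus-based pseudo-label filter can be sketched as keeping only utterances on which independent recognizers nearly agree. The version below compares two hypotheses per utterance with a word-level edit distance. The two-model setup, the dictionary field names, and the 0.1 disagreement threshold are assumptions for illustration, not the pipeline's actual configuration.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def disagreement(hyp_a, hyp_b):
    """Edit distance normalized by the longer transcript's word count."""
    a, b = hyp_a.split(), hyp_b.split()
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0


def consensus_filter(utterances, max_disagreement=0.1):
    """Keep utterances whose two pseudo-labels nearly agree."""
    return [u for u in utterances
            if disagreement(u["model_a"], u["model_b"]) <= max_disagreement]


utterances = [
    {"id": 1, "model_a": "please reset my password", "model_b": "please reset my password"},
    {"id": 2, "model_a": "i want to cancel", "model_b": "i won't cancel it"},
]
kept = consensus_filter(utterances)
```

In an incremental setup, the kept utterances would be added to the training set, the model retrained, and the remaining audio re-decoded and filtered again.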
Submitted 5 June, 2025;
originally announced June 2025.
-
Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care
Authors:
Saurabh Kumar,
Sourav Bansal,
Neeraj Agrawal,
Priyanka Bhatt
Abstract:
Customer care is an essential pillar of the e-commerce shopping experience with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, thus building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations, is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care dataset with few-shot intent detection show an increase in accuracy by 20-23% against the existing state-of-the-art pre-trained models.
Submitted 4 June, 2025;
originally announced June 2025.
-
Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy
Authors:
Neeraj Agrawal,
Saurabh Kumar,
Priyanka Bhatt,
Tanishka Agarwal
Abstract:
Hierarchical Text Classification (HTC) has recently gained traction given its ability to handle complex label hierarchies. It has found applications in domains such as e-commerce, customer care, and the medical industry, among other real-world settings. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probability distributions, and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99 - 2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.
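The inference-time pooling of two probability outputs can be illustrated with a few lines of Python. The abstract does not say which pooling operator is used. Mean pooling and the single-label argmax below are assumptions of this sketch, and the function name is hypothetical.

```python
def pooled_prediction(probs_a, probs_b):
    """Average two per-class probability vectors and return the best class.

    Mean pooling is an assumption; the abstract only says the two outputs
    are pooled per class.
    """
    pooled = [(pa + pb) / 2.0 for pa, pb in zip(probs_a, probs_b)]
    best = max(range(len(pooled)), key=pooled.__getitem__)
    return best, pooled


# Hypothetical outputs from the two representations for one document.
best, pooled = pooled_prediction([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

Here the two heads disagree on the top class, and pooling lets the more confident head decide; in a multi-label hierarchical setting one would instead threshold each pooled probability.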
Submitted 4 June, 2025;
originally announced June 2025.
-
RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations
Authors:
Swagat Kumar,
Jan-Nico Zaech,
Colin Michael Wilmott,
Luc Van Gool
Abstract:
Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs for physics simulations are determined by the physical properties of the systems, identifying effective QNN architectures for general machine learning tasks is a difficult challenge due to the lack of domain-specific priors. Indeed, existing Quantum Architecture Search (QAS) algorithms, adaptations of classical neural architecture search techniques, often overlook the inherent quantum nature of the circuits they produce. By approaching QAS from the ground up and from a quantum perspective, we resolve this limitation by proposing $ρ$DARTS, a differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, emerging from the search space of quantum architectures. We validate our method by finding circuits for state initialization, Hamiltonian optimization, and image classification. Further, we demonstrate better convergence than existing QAS techniques and show improved robustness to noise.
Submitted 4 June, 2025;
originally announced June 2025.
-
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Authors:
Pradeep Rangappa,
Andres Carofilis,
Jeena Prakash,
Shashi Kumar,
Sergio Burdisso,
Srikanth Madikeri,
Esau Villatoro-Tello,
Bidisha Sharma,
Petr Motlicek,
Kadri Hacioglu,
Shankar Venkatesan,
Saurabh Vyas,
Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
Submitted 4 June, 2025;
originally announced June 2025.
-
Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry
Authors:
Sujeet Kumar,
Pretam Ray,
Abhinay Beerukuri,
Shrey Kamoji,
Manoj Balaji Jagadeeshan,
Pawan Goyal
Abstract:
Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of its poetic verses, which are characterized by intricate prosodic and rhythmic patterns. This gap in research raises the question: How can we develop an effective ASR system for Sanskrit, particularly one that captures the nuanced features of its poetic form? In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models. Experimentation revealed that IndicWhisper performed the best among the SOTA models.
Submitted 30 May, 2025;
originally announced June 2025.
-
Interactive Video Generation via Domain Adaptation
Authors:
Ishaan Rawal,
Suryansh Kumar
Abstract:
Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address the initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over the existing state-of-the-art IVG techniques.
Submitted 30 May, 2025;
originally announced May 2025.
-
SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Authors:
Arvindh Arun,
Sumit Kumar,
Mojtaba Nayyeri,
Bo Xiong,
Ponnurangam Kumaraguru,
Antonio Vergari,
Steffen Staab
Abstract:
Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
Submitted 26 May, 2025;
originally announced May 2025.
-
Generative AI in Computer Science Education: Accelerating Python Learning with ChatGPT
Authors:
Ian McCulloh,
Pedro Rodriguez,
Srivaths Kumar,
Manu Gupta,
Viplove Raj Sharma,
Benjamin Johnson,
Anthony N. Johnson
Abstract:
The increasing demand for digital literacy and artificial intelligence (AI) fluency in the workforce has highlighted the need for scalable, efficient programming instruction. This study evaluates the effectiveness of integrating generative AI, specifically OpenAI's ChatGPT, into a self-paced Python programming module embedded within a sixteen-week professional training course on applied generative AI. A total of 86 adult learners with varying levels of programming experience completed asynchronous Python instruction in weeks three and four, using ChatGPT to generate, interpret, and debug code. Python proficiency and general coding knowledge were assessed across 30 different assessments during the first 13 weeks of the course through timed, code-based evaluations. A mixed-design ANOVA revealed that learners without prior programming experience scored significantly lower than their peers on early assessments. However, following the completion of the accelerated Python instruction module, these group differences were no longer statistically significant, indicating that the intervention effectively closed initial performance gaps and supported proficiency gains across all learner groups. These findings suggest that generative AI can support accelerated learning outcomes and reduce entry barriers for learners with no prior coding background. While ChatGPT effectively facilitated foundational skill acquisition, the study also highlights the importance of balancing AI assistance with opportunities for independent problem-solving. The results support the potential of AI-augmented instruction as a scalable model for reskilling in the digital economy.
Submitted 23 May, 2025;
originally announced May 2025.
-
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors:
Shahriar Noroozizadeh,
Sayantan Kumar,
George H. Chen,
Jeremy C. Weiss
Abstract:
Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index > 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 $\pm$ 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: https://huggingface.co/datasets/snoroozi/pmoa-tts .
Submitted 23 May, 2025;
originally announced May 2025.
-
Private Geometric Median in Nearly-Linear Time
Authors:
Syamantak Kumar,
Daogao Liu,
Kevin Tian,
Chutong Yang
Abstract:
Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an $(\varepsilon, \delta)$-differentially private algorithm obtaining an $\alpha$-multiplicative approximation to the geometric median objective, $\frac 1 n \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$, given a dataset $\mathcal{D} := \{\mathbf{x}_i\}_{i \in [n]} \subset \mathbb{R}^d$. Their algorithm requires $n \gtrsim \sqrt d \cdot \frac 1 {\alpha\varepsilon}$ samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \emph{effective radius} of $\mathcal{D}$ (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using $n \gtrsim \sqrt d \cdot \frac 1 {\alpha\varepsilon}$ samples, but in time $\widetilde{O}(nd + \frac d {\alpha^2})$. Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the "warm start" component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD's sensitivity for the geometric median objective.
Submitted 26 May, 2025;
originally announced May 2025.
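For orientation, the geometric median objective named in this abstract, the average Euclidean distance from a candidate point to the data, can be evaluated and minimized in a few lines. The sketch below uses the classical Weiszfeld iteration, which is non-private and is not the algorithm of this paper or of [HSU24]; the function names, the synthetic data, and the iteration count are illustrative assumptions.

```python
import numpy as np

def geometric_median_objective(x, points):
    """Average Euclidean distance from x to each row of points."""
    return float(np.mean(np.linalg.norm(points - x, axis=1)))

def weiszfeld(points, iters=200, eps=1e-12):
    """Classical, NON-private Weiszfeld iteration (illustration only)."""
    x = points.mean(axis=0)  # start from the mean
    for _ in range(iters):
        # Distances are clamped below by eps to avoid division by zero.
        dists = np.maximum(np.linalg.norm(points - x, axis=1), eps)
        weights = 1.0 / dists
        # Re-weighted average: far-away points get small weight.
        x = (weights[:, None] * points).sum(axis=0) / weights.sum()
    return x

# Synthetic data (assumption): 95 inliers near the origin, 5 gross outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(95, 2))
outliers = np.full((5, 2), 1000.0)
data = np.vstack([inliers, outliers])

mean = data.mean(axis=0)
median = weiszfeld(data)
```

On data like this the mean is dragged toward the outliers while the geometric median stays with the bulk of the points, which is the robustness property the abstract refers to.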
-
Navigating Pitfalls: Evaluating LLMs in Machine Learning Programming Education
Authors:
Smitha Kumar,
Michael A. Lones,
Manuel Maarek,
Hind Zantout
Abstract:
The rapid advancement of Large Language Models (LLMs) has opened new avenues in education. This study examines the use of LLMs in supporting learning in machine learning education; in particular, it focuses on the ability of LLMs to identify common errors of practice (pitfalls) in machine learning code, and their ability to provide feedback that can guide learning. Using a portfolio of code samples, we consider four different LLMs: one closed model and three open models. Whilst the most basic pitfalls are readily identified by all models, many common pitfalls are not. They particularly struggle to identify pitfalls in the early stages of the ML pipeline, especially those which can lead to information leaks, a major source of failure within applied ML projects. They also exhibit limited success at identifying pitfalls around model selection, which is a concept that students often struggle with when first transitioning from theory to practice. This calls into question the use of current LLMs to support machine learning education, and also raises important questions about their use by novice practitioners. Nevertheless, when LLMs successfully identify pitfalls in code, they do provide feedback that includes advice on how to proceed, emphasising their potential role in guiding learners. We also compare the capability of closed and open LLMs, and find that the gap is relatively small given the large difference in model sizes. This presents an opportunity to deploy, and potentially customise, smaller, more efficient LLMs within education, avoiding risks around cost and data sharing associated with commercial models.
Submitted 23 May, 2025;
originally announced May 2025.
-
Self-heating electrochemical memory for high-precision analog computing
Authors:
Adam L. Gross,
Sangheon Oh,
François Léonard,
Wyatt Hodges,
T. Patrick Xiao,
Joshua D. Sugar,
Jacklyn Zhu,
Sritharini Radhakrishnan,
Sangyong Lee,
Jolie Wang,
Adam Christensen,
Sam Lilak,
Patrick S. Finnegan,
Patrick Crandall,
Christopher H. Bennett,
William Wahby,
Robin Jacobs-Gedrim,
Matthew J. Marinella,
Suhas Kumar,
Sapan Agarwal,
Yiyang Li,
A. Alec Talin,
Elliot J. Fuller
Abstract:
Analog computers hold promise to significantly reduce the energy consumption of artificial intelligence algorithms, but commercialization has been hampered by a fundamental scientific challenge: how to reliably store and process analog information with high precision. We present an approach based upon metal oxide memory cells that undergo controlled self-heating during programming with a newly developed electro-thermo-chemical gate. The gate uniformly spreads heat and electrochemical reactions to enable wide, bulk-vacancy modulation, which yields nine orders of magnitude in tunable analog resistance, three orders greater than other devices reported, with thousands of states. The gating profoundly reduces noise and drift to enable precision programming to targeted states within a few operations, lowering conductance errors by two orders of magnitude relative to other devices reported. Simulations show improvement in computational energy efficiency by at least 10x over other devices due to far greater scalability at higher precision. The results overturn long-held assumptions about the poor reliability and precision of analog resistance devices and open the door to manufacturable, bulk metal-oxide devices and new applications that leverage high precision.
Submitted 1 July, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
AH-UGC: Adaptive and Heterogeneous-Universal Graph Coarsening
Authors:
Mohit Kataria,
Shreyash Bhilwade,
Sandeep Kumar,
Jayadeva
Abstract:
$\textbf{Graph Coarsening (GC)}$ is a prominent graph reduction technique that compresses large graphs to enable efficient learning and inference. However, existing GC methods generate only one coarsened graph per run and must recompute from scratch for each new coarsening ratio, resulting in unnecessary overhead. Moreover, most prior approaches are tailored to $\textit{homogeneous}$ graphs and fail to accommodate the semantic constraints of $\textit{heterogeneous}$ graphs, which comprise multiple node and edge types. To overcome these limitations, we introduce a novel framework that combines Locality Sensitive Hashing (LSH) with Consistent Hashing to enable $\textit{adaptive graph coarsening}$. Leveraging hashing techniques, our method is inherently fast and scalable. For heterogeneous graphs, we propose a $\textit{type isolated coarsening}$ strategy that ensures semantic consistency by restricting merges to nodes of the same type. Our approach is the first unified framework to support both adaptive and heterogeneous coarsening. Extensive evaluations on 23 real-world datasets including homophilic, heterophilic, homogeneous, and heterogeneous graphs demonstrate that our method achieves superior scalability while preserving the structural and semantic integrity of the original graph.
Submitted 18 May, 2025;
originally announced May 2025.
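The type-isolated coarsening idea in this abstract, merging only nodes of the same type, can be illustrated with a toy hash-based grouping. This is a minimal sketch under stated assumptions, not the AH-UGC implementation: it uses a single random-projection hash as a stand-in for the paper's LSH scheme, omits consistent hashing entirely, and the function name `type_isolated_coarsen`, the `bin_width` parameter, and the example data are all hypothetical.

```python
import numpy as np
from collections import defaultdict

def type_isolated_coarsen(features, node_types, bin_width=1.0, seed=0):
    """Group nodes into supernodes keyed by (node type, hash bucket).

    Nodes can only share a supernode if they have the same type AND fall
    into the same random-projection bucket (a simple LSH stand-in).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=features.shape[1])  # random projection direction
    b = rng.uniform(0.0, bin_width)         # random offset
    buckets = np.floor((features @ w + b) / bin_width).astype(int)

    groups = defaultdict(list)
    for node, (node_type, bucket) in enumerate(zip(node_types, buckets)):
        groups[(node_type, int(bucket))].append(node)
    return list(groups.values())

# Hypothetical heterogeneous graph: four nodes of two types.
features = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [5.0, 5.0]])
node_types = ["author", "paper", "author", "paper"]
supernodes = type_isolated_coarsen(features, node_types)
```

Because the node type is part of the grouping key, nodes 0 and 1 can never be merged even though their features are nearly identical. In this sketch a larger `bin_width` tends to give fewer, larger supernodes.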
-
A Hybrid Quantum Classical Pipeline for X Ray Based Fracture Diagnosis
Authors:
Sahil Tomar,
Rajeshwar Tripathi,
Sandeep Kumar
Abstract:
Bone fractures are a leading cause of morbidity and disability worldwide, imposing significant clinical and economic burdens on healthcare systems. Traditional X-ray interpretation is time-consuming and error-prone, while existing machine learning and deep learning solutions often demand extensive feature engineering, large annotated datasets, and high computational resources. To address these challenges, a distributed hybrid quantum-classical pipeline is proposed that first applies Principal Component Analysis (PCA) for dimensionality reduction and then leverages a 4-qubit quantum amplitude encoding circuit for feature enrichment. By fusing eight PCA-derived features with eight quantum-enhanced features into a 16-dimensional vector and then classifying with different machine learning models, the pipeline achieves 99% accuracy on a public multi-region X-ray dataset, on par with state-of-the-art transfer learning models, while reducing feature extraction time by 82%.
Submitted 19 May, 2025;
originally announced May 2025.
-
Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
Authors:
Somnath Banerjee,
Pratyush Chatterjee,
Shanu Kumar,
Sayan Layek,
Parag Agrawal,
Rima Hazra,
Animesh Mukherjee
Abstract:
Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts causing the model's harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.
Submitted 20 May, 2025;
originally announced May 2025.
-
JIR-Arena: The First Benchmark Dataset for Just-in-time Information Recommendation
Authors:
Ke Yang,
Kevin Ros,
Shankar Kumar Senthil Kumar,
ChengXiang Zhai
Abstract:
Just-in-time Information Recommendation (JIR) is a service designed to deliver the most relevant information precisely when users need it, addressing their knowledge gaps with minimal effort and boosting decision-making and efficiency in daily life. Advances in device-efficient deployment of foundation models and the growing use of intelligent wearable devices have made always-on JIR assistants feasible. However, there has been no systematic effort to formally define JIR tasks or establish evaluation frameworks. To bridge this gap, we present the first mathematical definition of JIR tasks and associated evaluation metrics. Additionally, we introduce JIR-Arena, a multimodal benchmark dataset featuring diverse, information-request-intensive scenarios to evaluate JIR systems across critical dimensions: i) accurately inferring user information needs, ii) delivering timely and relevant recommendations, and iii) avoiding irrelevant content that may distract users.
Developing a JIR benchmark dataset poses challenges due to subjectivity in estimating user information needs and uncontrollable system variables affecting reproducibility. To address these, JIR-Arena: i) combines input from multiple humans and large AI models to approximate information need distributions; ii) assesses JIR quality through information retrieval outcomes using static knowledge base snapshots; and iii) employs a multi-turn, multi-entity validation framework to improve objectivity and generality. Furthermore, we implement a baseline JIR system capable of processing real-time information streams aligned with user inputs. Our evaluation of this baseline system on JIR-Arena indicates that while foundation model-based JIR systems simulate user needs with reasonable precision, they face challenges in recall and effective content retrieval. To support future research in this new area, we fully release our code and data.
Submitted 19 May, 2025;
originally announced May 2025.
-
ADALog: Adaptive Unsupervised Anomaly detection in Logs with Self-attention Masked Language Model
Authors:
Przemek Pospieszny,
Wojciech Mormul,
Karolina Szyndler,
Sanjeev Kumar
Abstract:
Modern software systems generate extensive heterogeneous log data with dynamic formats, fragmented event sequences, and varying temporal patterns, making anomaly detection both crucial and challenging. To address these complexities, we propose ADALog, an adaptive, unsupervised anomaly detection framework designed for practical applicability across diverse real-world environments. Unlike traditional methods reliant on log parsing, strict sequence dependencies, or labeled data, ADALog operates on individual unstructured logs, extracts intra-log contextual relationships, and performs adaptive thresholding on normal data. The proposed approach utilizes a transformer-based, pretrained bidirectional encoder with a masked language modeling task, fine-tuned on normal logs to capture domain-specific syntactic and semantic patterns essential for accurate anomaly detection. Anomalies are identified via token-level reconstruction probabilities, aggregated into log-level scores, with adaptive percentile-based thresholding calibrated only on normal data. This allows the model to dynamically adapt to evolving system behaviors while avoiding rigid, heuristic-based thresholds common in traditional systems. We evaluate ADALog on benchmark datasets BGL, Thunderbird, and Spirit, showing strong generalization and competitive performance compared to state-of-the-art supervised and unsupervised methods. Additional ablation studies examine the effects of masking, fine-tuning, and token positioning on model behavior and interpretability.
Submitted 15 May, 2025;
originally announced May 2025.
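The scoring step described in this abstract (token-level probabilities aggregated into a log-level score, with a percentile threshold calibrated only on normal data) can be sketched as follows. This is an illustration under assumptions, not the ADALog implementation: the aggregation by mean negative log-probability, the 99th percentile, the function names, and the synthetic probabilities standing in for masked-language-model outputs are all hypothetical choices.

```python
import numpy as np

def log_score(token_probs):
    """Aggregate token-level probabilities into one log-level anomaly score.

    Assumed aggregation: mean negative log-probability, so a higher score
    means the tokens were harder for the model to reconstruct.
    """
    probs = np.clip(np.asarray(token_probs, dtype=float), 1e-12, 1.0)
    return float(-np.mean(np.log(probs)))

def fit_threshold(normal_scores, percentile=99.0):
    """Calibrate the decision threshold on scores from normal logs only."""
    return float(np.percentile(normal_scores, percentile))

# Synthetic stand-in for model output on 500 normal logs of 20 tokens each.
rng = np.random.default_rng(0)
normal_logs = [rng.uniform(0.7, 1.0, size=20) for _ in range(500)]
normal_scores = [log_score(p) for p in normal_logs]
threshold = fit_threshold(normal_scores)

# A log line with two poorly reconstructed tokens.
suspicious = [0.9, 0.8, 0.01, 0.02, 0.85]
is_anomaly = log_score(suspicious) > threshold
```

Calibrating on normal data alone means no labeled anomalies are needed; the percentile controls the expected fraction of normal logs that get flagged.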
-
TSPulse: Dual Space Tiny Pre-Trained Models for Rapid Time-Series Analysis
Authors:
Vijay Ekambaram,
Subodh Kumar,
Arindam Jati,
Sumanta Mukherjee,
Tomoya Sakai,
Pankaj Dayama,
Wesley M. Gifford,
Jayant Kalagnanam
Abstract:
The rise of time-series pre-trained models has advanced temporal representation learning, but current state-of-the-art models are often large-scale, requiring substantial compute. We introduce TSPulse, ultra-compact time-series pre-trained models with only 1M parameters, specialized to perform strongly across classification, anomaly detection, imputation, and retrieval tasks. TSPulse introduces innovations at both the architecture and task levels. At the architecture level, it employs a dual-space masked reconstruction, learning from both time and frequency domains to capture complementary signals. This is further enhanced by a dual-embedding disentanglement, generating both detailed embeddings for fine-grained analysis and high-level semantic embeddings for broader task understanding. Notably, TSPulse's semantic embeddings are robust to shifts in time, magnitude, and noise, which is important for robust retrieval. At the task level, TSPulse incorporates TSLens, a fine-tuning component enabling task-specific feature attention. It also introduces a multi-head triangulation technique that correlates deviations from multiple prediction heads, enhancing anomaly detection by fusing complementary model outputs. Additionally, a hybrid mask pretraining is proposed to improve zero-shot imputation by reducing pre-training bias. These architecture and task innovations collectively contribute to TSPulse's significant performance gains: 5-16% on the UEA classification benchmarks, +20% on the TSB-AD anomaly detection leaderboard, +50% in zero-shot imputation, and +25% in time-series retrieval. Remarkably, these results are achieved with just 1M parameters (10-100X smaller than existing SOTA models) and allow GPU-free inference, setting a new standard for efficient time-series pre-trained models. The models can be accessed from https://huggingface.co/ibm-granite/granite-timeseries-tspulse-r1
Submitted 25 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
GraphFLEx: Structure Learning Framework for Large Expanding Graphs
Authors:
Mohit Kataria,
Nikita Malik,
Sandeep Kumar,
Jayadeva
Abstract:
Graph structure learning is a core problem in graph-based machine learning, essential for uncovering latent relationships and ensuring model interpretability. However, most existing approaches are ill-suited for large-scale and dynamically evolving graphs, as they often require complete re-learning of the structure upon the arrival of new nodes and incur substantial computational and memory costs. In this work, we propose GraphFLEx: a unified and scalable framework for Graph Structure Learning in Large and Expanding Graphs. GraphFLEx mitigates the scalability bottlenecks by restricting edge formation to structurally relevant subsets of nodes identified through a combination of clustering and coarsening techniques. This dramatically reduces the search space and enables efficient, incremental graph updates. The framework supports 48 flexible configurations by integrating diverse choices of learning paradigms, coarsening strategies, and clustering methods, making it adaptable to a wide range of graph settings and learning objectives. Extensive experiments across 26 diverse datasets and Graph Neural Network architectures demonstrate that GraphFLEx achieves state-of-the-art performance with significantly improved scalability.
Submitted 18 May, 2025;
originally announced May 2025.