-
Is External Information Useful for Stance Detection with LLMs?
Authors:
Quang Minh Nguyen,
Taegyoon Kim
Abstract:
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning t…
▽ More
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9\%. We explain this through experiments showing LLMs' tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at https://github.com/ngqm/acl2025-stance-detection.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Privacy-Preserving Quantized Federated Learning with Diverse Precision
Authors:
Dang Qua Nguyen,
Morteza Hashemi,
Erik Perrins,
Sergiy A. Vorobyov,
David J. Love,
Taejoon Kim
Abstract:
Federated learning (FL) has emerged as a promising paradigm for distributed machine learning, enabling collaborative training of a global model across multiple local devices without requiring them to share raw data. Despite its advancements, FL is limited by factors such as: (i) privacy risks arising from the unprotected transmission of local model updates to the fusion center (FC) and (ii) decrea…
▽ More
Federated learning (FL) has emerged as a promising paradigm for distributed machine learning, enabling collaborative training of a global model across multiple local devices without requiring them to share raw data. Despite its advancements, FL is limited by factors such as: (i) privacy risks arising from the unprotected transmission of local model updates to the fusion center (FC) and (ii) decreased learning utility caused by heterogeneity in model quantization resolution across participating devices. Prior work typically addresses only one of these challenges because maintaining learning utility under both privacy risks and quantization heterogeneity is a non-trivial task. In this paper, our aim is therefore to improve the learning utility of a privacy-preserving FL that allows clusters of devices with different quantization resolutions to participate in each FL round. Specifically, we introduce a novel stochastic quantizer (SQ) that is designed to simultaneously achieve differential privacy (DP) and minimum quantization error. Notably, the proposed SQ guarantees bounded distortion, unlike other DP approaches. To address quantization heterogeneity, we introduce a cluster size optimization technique combined with a linear fusion approach to enhance model aggregation accuracy. Numerical simulations validate the benefits of our approach in terms of privacy protection and learning utility compared to the conventional LaplaceSQ-FL algorithm.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning
Authors:
Phan Quoc Hung Mai,
Quang Hung Nguyen,
Phuong Giang Duong,
Hong Hanh Nguyen,
Nguyen Tuan Long
Abstract:
In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from univers…
▽ More
In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), in which we showed that it achieves performance up to 83.7% and 79.8% accuracy for sentiment and topic classification tasks. We also benchmark our dataset and model with other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: https://huggingface.co/datasets/hung20gg/NEU-ESC.
△ Less
Submitted 30 June, 2025;
originally announced June 2025.
-
FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis
Authors:
Quang Hung Nguyen,
Phuong Anh Trinh,
Phan Quoc Hung Mai,
Tuan Phong Trinh
Abstract:
Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain-specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financi…
▽ More
Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain-specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financial statements. Tailored to local standards like VAS, it combines large and small language models in a multi-agent setup for entity extraction, SQL generation, and self-correction. We build a domain-specific database and evaluate models on a synthetic QA dataset. A fine-tuned 7B model achieves 61.33\% accuracy with sub-4-second response times on consumer hardware, outperforming GPT-4o-mini. FinStat2SQL offers a scalable, cost-efficient solution for financial analysis, making AI-powered querying accessible to Vietnamese enterprises.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition
Authors:
Nam Quan Nguyen,
Xuan Phong Pham,
Tuan-Anh Tran
Abstract:
The automated reconstruction of the logical arrangement of tables from image data, termed Table Structure Recognition (TSR), is fundamental for semantic data extraction. Recently, researchers have explored a wide range of techniques to tackle this problem, demonstrating significant progress. Each table is a set of vertical and horizontal separators. Following this realization, we present SepFormer…
▽ More
The automated reconstruction of the logical arrangement of tables from image data, termed Table Structure Recognition (TSR), is fundamental for semantic data extraction. Recently, researchers have explored a wide range of techniques to tackle this problem, demonstrating significant progress. Each table is a set of vertical and horizontal separators. Following this realization, we present SepFormer, which integrates the split-and-merge paradigm into a single step through separator regression with a DETR-style architecture, improving speed and robustness. SepFormer is a coarse-to-fine approach that predicts table separators from single-line to line-strip separators with a stack of two transformer decoders. In the coarse-grained stage, the model learns to gradually refine single-line segments through decoder layers with additional angle loss. At the end of the fine-grained stage, the model predicts line-strip separators by refining sampled points from each single-line segment. Our SepFormer can run on average at 25.6 FPS while achieving comparable performance with state-of-the-art methods on several benchmark datasets, including SciTSR, PubTabNet, WTW, and iFLYTAB.
△ Less
Submitted 27 June, 2025;
originally announced June 2025.
-
Automated Image Recognition Framework
Authors:
Quang-Binh Nguyen,
Trong-Vu Hoang,
Ngoc-Do Tran,
Tam V. Nguyen,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
While the efficacy of deep learning models heavily relies on data, gathering and annotating data for specific tasks, particularly when addressing novel or sensitive subjects lacking relevant datasets, poses significant time and resource challenges. In response to this, we propose a novel Automated Image Recognition (AIR) framework that harnesses the power of generative AI. AIR empowers end-users t…
▽ More
While the efficacy of deep learning models heavily relies on data, gathering and annotating data for specific tasks, particularly when addressing novel or sensitive subjects lacking relevant datasets, poses significant time and resource challenges. In response to this, we propose a novel Automated Image Recognition (AIR) framework that harnesses the power of generative AI. AIR empowers end-users to synthesize high-quality, pre-annotated datasets, eliminating the necessity for manual labeling. It also automatically trains deep learning models on the generated datasets with robust image recognition performance. Our framework includes two main data synthesis processes, AIR-Gen and AIR-Aug. The AIR-Gen enables end-users to seamlessly generate datasets tailored to their specifications. To improve image quality, we introduce a novel automated prompt engineering module that leverages the capabilities of large language models. We also introduce a distribution adjustment algorithm to eliminate duplicates and outliers, enhancing the robustness and reliability of generated datasets. On the other hand, the AIR-Aug enhances a given dataset, thereby improving the performance of deep classifier models. AIR-Aug is particularly beneficial when users have limited data for specific tasks. Through comprehensive experiments, we demonstrated the efficacy of our generated data in training deep learning models and showcased the system's potential to provide image recognition models for a wide range of objects. We also conducted a user study that achieved an impressive score of 4.4 out of 5.0, underscoring the AI community's positive perception of AIR.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation
Authors:
Trong-Vu Hoang,
Quang-Binh Nguyen,
Thanh-Toan Do,
Tam V. Nguyen,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowF…
▽ More
Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as the plug-and-play module. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System
Authors:
Quang Nguyen,
Tri Le,
Huy Nguyen,
Thieu Vo,
Tung D. Ta,
Baoru Huang,
Minh N. Vu,
Anh Nguyen
Abstract:
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a tr…
▽ More
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or finetuning step to adapt to new domains, limiting their generation in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Intensive experiments on two large-scale datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles
Authors:
Thien Tran,
Quang Nguyen,
Jonathan Kua,
Minh Tran,
Toan Luu,
Thuong Hoang,
Jiong Jin
Abstract:
Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challeng…
▽ More
Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challenges, we propose a distributed Cloud-Edge-IoT architecture tailored for maritime ICPS by leveraging design principles from the recently proposed Cloud-Fog Automation paradigm. Our proposed architecture comprises three hierarchical layers: a Cloud Layer for centralized and decentralized data aggregation, advanced analytics, and future model refinement; an Edge Layer that executes localized AI-driven processing and decision-making; and an IoT Layer responsible for low-latency sensor data acquisition. Our experimental results demonstrated improvements in computational efficiency, responsiveness, and scalability. When compared with our conventional approaches, we achieved a classification accuracy of 86\%, with an improved latency performance. By adopting Cloud-Fog Automation, we address the low-latency processing constraints and scalability challenges in maritime ICPS applications. Our work offers a practical, modular, and scalable framework to advance robust autonomy and AI-driven decision-making and autonomy for intelligent USVs in future maritime ICPS.
△ Less
Submitted 22 June, 2025;
originally announced June 2025.
-
Theoretically Unmasking Inference Attacks Against LDP-Protected Clients in Federated Vision Models
Authors:
Quan Nguyen,
Minh N. Vu,
Truc Nguyen,
My T. Thai
Abstract:
Federated Learning enables collaborative learning among clients via a coordinating server while avoiding direct data sharing, offering a perceived solution to preserve privacy. However, recent studies on Membership Inference Attacks (MIAs) have challenged this notion, showing high success rates against unprotected training data. While local differential privacy (LDP) is widely regarded as a gold s…
▽ More
Federated Learning enables collaborative learning among clients via a coordinating server while avoiding direct data sharing, offering a perceived solution to preserve privacy. However, recent studies on Membership Inference Attacks (MIAs) have challenged this notion, showing high success rates against unprotected training data. While local differential privacy (LDP) is widely regarded as a gold standard for privacy protection in data analysis, most studies on MIAs either neglect LDP or fail to provide theoretical guarantees for attack success rates against LDP-protected data. To address this gap, we derive theoretical lower bounds for the success rates of low-polynomial time MIAs that exploit vulnerabilities in fully connected or self-attention layers. We establish that even when data are protected by LDP, privacy risks persist, depending on the privacy budget. Practical evaluations on federated vision models confirm considerable privacy risks, revealing that the noise required to mitigate these attacks significantly degrades models' utility.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations
Authors:
Quang Hieu Pham,
Thuy Duong Nguyen,
Tung Pham,
Anh Tuan Luu,
Dat Quoc Nguyen
Abstract:
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine…
▽ More
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
VerificAgent: Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning
Authors:
Thong Q. Nguyen,
Shubhang Desai,
Yash Jain,
Tanvir Aumi,
Vishal Chowdhary
Abstract:
Continual memory augmentation allows computer-use agents (CUAs) to learn from past interactions and refine their task-solving strategies over time. However, unchecked memory accumulation can introduce spurious or hallucinated "learnings" that degrade agent performance, particularly in domain-specific workflows such as productivity software. We present a novel framework, VerificAgent, that effectiv…
▽ More
Continual memory augmentation allows computer-use agents (CUAs) to learn from past interactions and refine their task-solving strategies over time. However, unchecked memory accumulation can introduce spurious or hallucinated "learnings" that degrade agent performance, particularly in domain-specific workflows such as productivity software. We present a novel framework, VerificAgent, that effectively manages memory for CUAs through (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory refinement during training, and (3) a post-hoc fact-checking pass by human experts to sanitize accumulated memory before deployment. On OSWorld productivity tasks, VerificAgent achieves a 111.1% relative improvement in success rate over baseline CUA without any additional fine-tuning.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Zero-Shot Text-to-Speech for Vietnamese
Authors:
Thi Vu,
Linh The Nguyen,
Dat Quoc Nguyen
Abstract:
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit supe…
▽ More
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
VM14K: First Vietnamese Medical Benchmark
Authors:
Thong Nguyen,
Duc Nguyen,
Minh Dang,
Thai Dao,
Long Nguyen,
Quan H. Nguyen,
Dat Nguyen,
Kien Tran,
Minh Tran
Abstract:
Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities,therefore help ensuring the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmark, and available non-English medical data is normally fragmented and diffi…
▽ More
Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities,therefore help ensuring the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmark, and available non-English medical data is normally fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed using various verifiable sources, including carefully curated medical exams and clinical records, and eventually annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models' medical understanding in the target language thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set contains all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.
△ Less
Submitted 13 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation
Authors:
Hieu Tran,
Phuong-Anh Nguyen-Le,
Huy Nghiem,
Quang-Nhan Nguyen,
Wei Ai,
Marine Carpuat
Abstract:
Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure…
▽ More
Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows our naturalistic and complementary synthetic data boost models' performance, measured by translation quality estimation scores, of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
DiffCoTune: Differentiable Co-Tuning for Cross-domain Robot Control
Authors:
Lokesh Krishna,
Sheng Cheng,
Junheng Li,
Naira Hovakimyan,
Quan Nguyen
Abstract:
The deployment of robot controllers is hindered by modeling discrepancies due to necessary simplifications for computational tractability or inaccuracies in data-generating simulators. Such discrepancies typically require ad-hoc tuning to meet the desired performance, thereby ensuring successful transfer to a target domain. We propose a framework for automated, gradient-based tuning to enhance per…
▽ More
The deployment of robot controllers is hindered by modeling discrepancies due to necessary simplifications for computational tractability or inaccuracies in data-generating simulators. Such discrepancies typically require ad-hoc tuning to meet the desired performance, thereby ensuring successful transfer to a target domain. We propose a framework for automated, gradient-based tuning to enhance performance in the deployment domain by leveraging differentiable simulators. Our method collects rollouts in an iterative manner to co-tune the simulator and controller parameters, enabling systematic transfer within a few trials in the deployment domain. Specifically, we formulate multi-step objectives for tuning and employ alternating optimization to effectively adapt the controller to the deployment domain. The scalability of our framework is demonstrated by co-tuning model-based and learning-based controllers of arbitrary complexity for tasks ranging from low-dimensional cart-pole stabilization to high-dimensional quadruped and biped tracking, showing performance improvements across different deployment domains.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning
Authors:
Tuan Van Vo,
Tan Quang Nguyen,
Khang Minh Nguyen,
Duy Ho Minh Nguyen,
Minh Nhat Vu
Abstract:
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into robotic actions. Despite their recent advancements, VLAs often overlook the explicit reasoning and only learn the functional input-action mappings, omitting these crucial logical steps for interpretability and g…
▽ More
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into robotic actions. Despite their recent advancements, VLAs often overlook the explicit reasoning and only learn the functional input-action mappings, omitting these crucial logical steps for interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose \textit{ReFineVLA}, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we use \textit{ReFineVLA} to fine-tune pre-trained VLAs with the reasoning-enriched datasets, while maintaining their inherent generalization abilities and boosting reasoning capabilities. In addition, we conduct an attention map visualization to analyze the alignment among visual attention, linguistic prompts, and to-be-executed actions of \textit{ReFineVLA}, showcasing its ability to focus on relevant tasks and actions. Through the latter step, we explore that \textit{ReFineVLA}-trained models exhibit a meaningful attention shift towards relevant objects, highlighting the enhanced multimodal understanding and improved generalization.
Evaluated across manipulation tasks, \textit{ReFineVLA} outperforms the state-of-the-art baselines. Specifically, it achieves an average increase of $5.0\%$ success rate on SimplerEnv WidowX Robot tasks, improves by an average of $8.6\%$ in variant aggregation settings, and by $1.7\%$ in visual matching settings for SimplerEnv Google Robot tasks. The source code will be publicly available.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
Reduce Computational Cost In Deep Reinforcement Learning Via Randomized Policy Learning
Authors:
Zhuochen Liu,
Rahul Jain,
Quan Nguyen
Abstract:
Recent advancements in reinforcement learning (RL) have leveraged neural networks to achieve state-of-the-art performance across various control tasks. However, these successes often come at the cost of significant computational resources, as training deep neural networks requires substantial time and data. In this paper, we introduce an actor-critic algorithm that utilizes randomized neural netwo…
▽ More
Recent advancements in reinforcement learning (RL) have leveraged neural networks to achieve state-of-the-art performance across various control tasks. However, these successes often come at the cost of significant computational resources, as training deep neural networks requires substantial time and data. In this paper, we introduce an actor-critic algorithm that utilizes randomized neural networks to drastically reduce computational costs while maintaining strong performance. Despite its simple architecture, our method effectively solves a range of control problems, including the locomotion control of a highly dynamic 12-motor quadruped robot, and achieves results comparable to leading algorithms such as Proximal Policy Optimization (PPO). Notably, our approach does not outperform other algorithms in terms of sample efficnency but rather in terms of wall-clock training time. That is, although our algorithm requires more timesteps to converge to an optimal policy, the actual time required for training turns out to be lower.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
Authors:
Quan Nguyen,
Thanh Nguyen-Tang
Abstract:
We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities wer…
▽ More
We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting
Authors:
Bao-Ngoc Dao,
Quang Nguyen,
Luyen Ngo Dinh,
Minh Le,
Nam Le,
Linh Ngo Van
Abstract:
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE,…
▽ More
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
A Cautionary Tale on Integrating Studies with Disparate Outcome Measures for Causal Inference
Authors:
Harsh Parikh,
Trang Quynh Nguyen,
Elizabeth A. Stuart,
Kara E. Rudolph,
Caleb H. Miles
Abstract:
Data integration approaches are increasingly used to enhance the efficiency and generalizability of studies. However, a key limitation of these methods is the assumption that outcome measures are identical across datasets -- an assumption that often does not hold in practice. Consider the following opioid use disorder (OUD) studies: the XBOT trial and the POAT study, both evaluating the effect of…
▽ More
Data integration approaches are increasingly used to enhance the efficiency and generalizability of studies. However, a key limitation of these methods is the assumption that outcome measures are identical across datasets -- an assumption that often does not hold in practice. Consider the following opioid use disorder (OUD) studies: the XBOT trial and the POAT study, both evaluating the effect of medications for OUD on withdrawal symptom severity (not the primary outcome of either trial). While XBOT measures withdrawal severity using the subjective opiate withdrawal scale, POAT uses the clinical opiate withdrawal scale. We analyze this realistic yet challenging setting where outcome measures differ across studies and where neither study records both types of outcomes. Our paper studies whether and when integrating studies with disparate outcome measures leads to efficiency gains. We introduce three sets of assumptions -- with varying degrees of strength -- linking both outcome measures. Our theoretical and empirical results highlight a cautionary tale: integration can improve asymptotic efficiency only under the strongest assumption linking the outcomes. However, misspecification of this assumption leads to bias. In contrast, a milder assumption may yield finite-sample efficiency gains, yet these benefits diminish as sample size increases. We illustrate these trade-offs via a case study integrating the XBOT and POAT datasets to estimate the comparative effect of two medications for opioid use disorder on withdrawal symptoms. By systematically varying the assumptions linking the SOW and COW scales, we show potential efficiency gains and the risks of bias. Our findings emphasize the need for careful assumption selection when fusing datasets with differing outcome measures, offering guidance for researchers navigating this common challenge in modern data integration.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Vendi Information Gain: An Alternative To Mutual Information For Science And Machine Learning
Authors:
Quan Nguyen,
Adji Bousso Dieng
Abstract:
In his 1948 seminal paper A Mathematical Theory of Communication that birthed information theory, Claude Shannon introduced mutual information (MI), which he called "rate of transmission", as a way to quantify information gain (IG) and defined it as the difference between the marginal and conditional entropy of a random variable. While MI has become a standard tool in science and engineering, it h…
▽ More
In his 1948 seminal paper A Mathematical Theory of Communication that birthed information theory, Claude Shannon introduced mutual information (MI), which he called "rate of transmission", as a way to quantify information gain (IG) and defined it as the difference between the marginal and conditional entropy of a random variable. While MI has become a standard tool in science and engineering, it has several shortcomings. First, MI is often intractable - it requires a density over samples with tractable Shannon entropy - and existing techniques for approximating it often fail, especially in high dimensions. Moreover, in settings where MI is tractable, its symmetry and insensitivity to sample similarity are undesirable. In this paper, we propose the Vendi Information Gain (VIG), a novel alternative to MI that leverages the Vendi scores, a flexible family of similarity-based diversity metrics. We call the logarithm of the VS the Vendi entropy and define VIG as the difference between the marginal and conditional Vendi entropy of a variable. Being based on the VS, VIG accounts for similarity. Furthermore, VIG generalizes MI and recovers it under the assumption that the samples are completely dissimilar. Importantly, VIG only requires samples and not a probability distribution over them. Finally, it is asymmetric, a desideratum for a good measure of IG that MI fails to meet. VIG extends information theory to settings where MI completely fails. For example, we use VIG to describe a novel, unified framework for active data acquisition, a popular paradigm of modern data-driven science. We demonstrate the advantages of VIG over MI in diverse applications, including in cognitive science to model human response times to external stimuli and in epidemiology to learn epidemic processes and identify disease hotspots in different countries via level-set estimation.
△ Less
Submitted 16 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Revealing economic facts: LLMs know more than they say
Authors:
Marcus Buckmann,
Quynh Anh Nguyen,
Edward Hill
Abstract:
We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models' text outputs. This suggests that hidden states capture ric…
▽ More
We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models' text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Anatomical Attention Alignment representation for Radiology Report Generation
Authors:
Quang Vinh Nguyen,
Minh Duc Nguyen,
Thanh Hoang Son Vo,
Hyung-Jeong Yang,
Soo-Hyung Kim
Abstract:
Automated Radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models only rely on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal…
▽ More
Automated Radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models only rely on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose Anatomical Attention Alignment Network (A3Net), a framework that enhance visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at \href{https://github.com/Vinh-AI/A3Net}{GitHub}.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications
Authors:
Da Wu,
Zhanliang Wang,
Quan Nguyen,
Zhuoran Xu,
Kai Wang
Abstract:
The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through pref…
▽ More
The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate its effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight Llama 3.2-3B-Instruct. Despite relying on text input only, the MINT-derived model outperforms models trained with SFT, RAG, or DPO, and even outperforms Llama 3.1-405B-Instruct. (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, containing knowledge learnt from both text and histopathological images to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Pretraining Large Brain Language Model for Active BCI: Silent Speech
Authors:
Jinzhao Zhou,
Zehong Cao,
Yiqun Duan,
Connor Barkley,
Daniel Leong,
Xiaowei Jiang,
Quoc-Toan Nguyen,
Ziyi Zhao,
Thomas Do,
Yu-Cheng Chang,
Sheng-Fu Liang,
Chin-teng Lin
Abstract:
This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the re…
▽ More
This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose Large Brain Language Model (LBLM) pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0\% accuracy on semantic-level classification and 39.6\% in word-level classification, outperforming baseline methods by 5.4\% and 7.3\%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.
△ Less
Submitted 3 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
Network Sampling: An Overview and Comparative Analysis
Authors:
Quoc Chuong Nguyen
Abstract:
Network sampling is a crucial technique for analyzing large or partially observable networks. However, the effectiveness of different sampling methods can vary significantly depending on the context. In this study, we empirically compare representative methods from three main categories: node-based, edge-based, and exploration-based sampling. We used two real-world datasets for our analysis: a sci…
▽ More
Network sampling is a crucial technique for analyzing large or partially observable networks. However, the effectiveness of different sampling methods can vary significantly depending on the context. In this study, we empirically compare representative methods from three main categories: node-based, edge-based, and exploration-based sampling. We used two real-world datasets for our analysis: a scientific collaboration network and a temporal message-sending network. Our results indicate that no single sampling method consistently outperforms the others in both datasets. Although advanced methods tend to provide better accuracy on static networks, they often perform poorly on temporal networks, where simpler techniques can be more effective. These findings suggest that the best sampling strategy depends not only on the structural characteristics of the network but also on the specific metrics that need to be preserved or analyzed. Our work offers practical insights for researchers in choosing sampling approaches that are tailored to different types of networks and analytical objectives.
△ Less
Submitted 2 May, 2025; v1 submitted 24 April, 2025;
originally announced April 2025.
-
HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation
Authors:
Tran Quoc Khanh Le,
Nguyen Lan Vi Vu,
Ha-Hieu Pham,
Xuan-Loc Huynh,
Tien-Huy Nguyen,
Minh Huu Nhat Le,
Quan Nguyen,
Hien D. Nguyen
Abstract:
Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and indistinct boundaries. While convolutional neural networks (CNNs) have demonstrated efficacy in medical image segmentation, their reliance on large-scale annot…
▽ More
Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and indistinct boundaries. While convolutional neural networks (CNNs) have demonstrated efficacy in medical image segmentation, their reliance on large-scale annotated datasets presents a significant limitation in clinical ultrasound imaging. Semi-supervised learning (SSL) offers a potential solution by utilizing unlabeled data, yet existing teacher-student frameworks often encounter confirmation bias and high computational costs. In this paper, a novel semi-supervised segmentation framework, called HDC, is proposed incorporating adaptive consistency learning with a single-teacher architecture. The framework introduces a hierarchical distillation mechanism with two objectives: Correlation Guidance Loss for aligning feature representations and Mutual Information Loss for stabilizing noisy student learning. The proposed approach reduces model complexity while enhancing generalization. Experiments on fetal ultrasound datasets, FUGC and PSFH, demonstrate competitive performance with reduced computational overhead compared to multi-teacher models.
△ Less
Submitted 16 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme
Authors:
Dinh Dai Quan Tran,
Hoang-Thien Nguyen,
Thanh-Huy Nguyen,
Gia-Van To,
Tien-Huy Nguyen,
Quan Nguyen
Abstract:
Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. T…
▽ More
Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.
△ Less
Submitted 23 May, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
-
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Authors:
Tinh-Anh Nguyen-Nhu,
Huu-Loc Tran,
Nguyen-Khang Le,
Minh-Nhat Nguyen,
Tien-Huy Nguyen,
Hoang-Long Nguyen-Huu,
Huu-Phong Phan-Nguyen,
Huy-Thach Pham,
Quan Nguyen,
Hoang M. Le,
Quang-Vinh Dinh
Abstract:
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this pape…
▽ More
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection
Authors:
Huu-Phong Phan-Nguyen,
Anh Dao,
Tien-Huy Nguyen,
Tuan Quang,
Huu-Loc Tran,
Tinh-Anh Nguyen-Nhu,
Huy-Thach Pham,
Quan Nguyen,
Hoang M. Le,
Quang-Vinh Dinh
Abstract:
Nowadays, smartphones are ubiquitous, and almost everyone owns one. At the same time, the rapid development of AI has spurred extensive research on applying deep learning techniques to image classification. However, due to the limited resources available on mobile devices, significant challenges remain in balancing accuracy with computational efficiency. In this paper, we propose a novel training…
▽ More
Nowadays, smartphones are ubiquitous, and almost everyone owns one. At the same time, the rapid development of AI has spurred extensive research on applying deep learning techniques to image classification. However, due to the limited resources available on mobile devices, significant challenges remain in balancing accuracy with computational efficiency. In this paper, we propose a novel training framework called Cycle Training, which adopts a three-stage training process that alternates between exploration and stabilization phases to optimize model performance. Additionally, we incorporate Semi-Supervised Domain Adaptation (SSDA) to leverage the power of large models and unlabeled data, thereby effectively expanding the training dataset. Comprehensive experiments on the CamSSD dataset for mobile scene detection demonstrate that our framework not only significantly improves classification accuracy but also ensures real-time inference efficiency. Specifically, our method achieves a 94.00% in Top-1 accuracy and a 99.17% in Top-3 accuracy and runs inference in just 1.61ms using CPU, demonstrating its suitability for real-world mobile deployment.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking
Authors:
Huu-Loc Tran,
Tinh-Anh Nguyen-Nhu,
Huu-Phong Phan-Nguyen,
Tien-Huy Nguyen,
Nhat-Minh Nguyen-Dich,
Anh Dao,
Huy-Duc Do,
Quan Nguyen,
Hoang M. Le,
Quang-Vinh Dinh
Abstract:
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive vid…
▽ More
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
QUADRO: A Hybrid Quantum Optimization Framework for Drone Delivery
Authors:
James B. Holliday,
Darren Blount,
Hoang Quan Nguyen,
Samee U. Khan,
Khoa Luu
Abstract:
Quantum computing holds transformative potential for optimizing large-scale drone fleet operations, yet its near-term limitations necessitate hybrid approaches blending classical and quantum techniques. This work introduces Quantum Unmanned Aerial Delivery Routing Optimization (QUADRO), a novel hybrid framework addressing the Energy-Constrained Capacitated Unmanned Aerial Vehicle Routing Problem a…
▽ More
Quantum computing holds transformative potential for optimizing large-scale drone fleet operations, yet its near-term limitations necessitate hybrid approaches blending classical and quantum techniques. This work introduces Quantum Unmanned Aerial Delivery Routing Optimization (QUADRO), a novel hybrid framework addressing the Energy-Constrained Capacitated Unmanned Aerial Vehicle Routing Problem and the Unmanned Aerial Vehicle Scheduling Problem. By formulating these challenges as Quadratic Unconstrained Binary Optimization problems, QUADRO leverages the Quantum Approximate Optimization Algorithm for routing and scheduling, enhanced by classical heuristics and post-processing. We minimize total transit time in routing, considering payload and battery constraints, and optimize makespan scheduling across various drone fleets. Evaluated on adapted Augerat benchmarks (16-51 nodes), QUADRO competes against classical and prior hybrid methods, achieving scalable solutions with fewer than one hundred qubits. The proposed results underscore the viability of hybrid quantum-classical strategies for real-world drone logistics, paving the way for quantum-enhanced optimization in the Noisy Intermediate Scale Quantum era.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy
Authors:
Christian John Hurry,
Jinjie Zhang,
Olubukola Ishola,
Emma Slade,
Cuong Q. Nguyen
Abstract:
Developing computer vision for high-content screening is challenging due to various sources of distribution-shift caused by changes in experimental conditions, perturbagens, and fluorescent markers. The impact of different sources of distribution-shift are confounded in typical evaluations of models based on transfer learning, which limits interpretations of how changes to model design and trainin…
▽ More
Developing computer vision for high-content screening is challenging due to various sources of distribution-shift caused by changes in experimental conditions, perturbagens, and fluorescent markers. The impact of different sources of distribution-shift are confounded in typical evaluations of models based on transfer learning, which limits interpretations of how changes to model design and training affect generalisation. We propose an evaluation scheme that isolates sources of distribution-shift using the JUMP-CP dataset, allowing researchers to evaluate generalisation with respect to specific sources of distribution-shift. We then present a channel-agnostic masked autoencoder $\mathbf{Campfire}$ which, via a shared decoder for all channels, scales effectively to datasets containing many different fluorescent markers, and show that it generalises to out-of-distribution experimental batches, perturbagens, and fluorescent markers, and also demonstrates successful transfer learning from one cell type to another.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Combating the Effects of Cyber-Psychosis: Using Object Security to Facilitate Critical Thinking
Authors:
Robert H. Thomson,
Quan Nguyen,
Essien Ayanam,
Matthew Canham,
Thomas C. Schmidt,
Matthias Wählisch,
Eric Osterweil
Abstract:
Humanity is currently facing an existential crisis about the nature of truth and reality driven by the availability of information online which overloads and overwhelms our cognitive capabilities, which we call Cyber-Psychosis. The results of this Cyber-Psychosis include the decline of critical thinking coupled with deceptive influences on the Internet which have become so prolific that they are c…
▽ More
Humanity is currently facing an existential crisis about the nature of truth and reality driven by the availability of information online which overloads and overwhelms our cognitive capabilities, which we call Cyber-Psychosis. The results of this Cyber-Psychosis include the decline of critical thinking coupled with deceptive influences on the Internet which have become so prolific that they are challenging our ability to form a shared understanding of reality in either the digital or physical world. Fundamental to mending our fractured digital universe is establishing the ability to know where a digital object (i.e. a piece of information like text, audio, or video) came from, whether it was modified, what it is derived from, where it has been circulated, and what (if any) lifetime that information should have. Furthermore, we argue that on-by-default object security for genuine objects will provide the necessary grounding to support critical thinking and rational online behavior, even with the ubiquity of deceptive content. To this end, we propose that the Internet needs an object security service layer. This proposition may not be as distant as it may first seem. Through an examination of several venerable (and new) protocols, we show how pieces of this problem have already been addressed. While interdisciplinary research will be key to properly crafting the architectural changes needed, here we propose an approach for how we can already use fallow protections to begin turning the tide of this emerging Cyber-Psychosis today!
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Authors:
Da Wu,
Zhanliang Wang,
Quan Nguyen,
Kai Wang
Abstract:
Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet…
▽ More
Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Childrens Hospital of Philadelphia. Results: We found that recent foundations models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has advantage when processing lengthy and noisy notes.
△ Less
Submitted 15 March, 2025;
originally announced March 2025.
-
Lightweight Models for Emotional Analysis in Video
Authors:
Quoc-Tien Nguyen,
Hong-Hai Nguyen,
Van-Thong Huynh
Abstract:
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic enco…
▽ More
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
△ Less
Submitted 24 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Revisiting semi-supervised learning in the era of foundation models
Authors:
Ping Zhang,
Zheda Mai,
Quang-Huy Nguyen,
Wei-Lun Chao
Abstract:
Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate rep…
▽ More
Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.
△ Less
Submitted 7 June, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
RoboDesign1M: A Large-scale Dataset for Robot Design Understanding
Authors:
Tri Le,
Toan Nguyen,
Quang Tran,
Quang Nguyen,
Baoru Huang,
Hoan Nguyen,
Minh Nhat Vu,
Tung D. Ta,
Anh Nguyen
Abstract:
Robot design is a complex and time-consuming process that requires specialized expertise. Gaining a deeper understanding of robot design data can enable various applications, including automated design generation, retrieving example designs from text, and developing AI-powered design assistants. While recent advancements in foundation models present promising approaches to addressing these challen…
▽ More
Robot design is a complex and time-consuming process that requires specialized expertise. Gaining a deeper understanding of robot design data can enable various applications, including automated design generation, retrieving example designs from text, and developing AI-powered design assistants. While recent advancements in foundation models present promising approaches to addressing these challenges, progress in this field is hindered by the lack of large-scale design datasets. In this paper, we introduce RoboDesign1M, a large-scale dataset comprising 1 million samples. Our dataset features multimodal data collected from scientific literature, covering various robotics domains. We propose a semi-automated data collection pipeline, enabling efficient and diverse data acquisition. To assess the effectiveness of RoboDesign1M, we conduct extensive experiments across multiple tasks, including design image generation, visual question answering about designs, and design image retrieval. The results demonstrate that our dataset serves as a challenging new benchmark for design understanding tasks and has the potential to advance research in this field. RoboDesign1M will be released to support further developments in AI-driven robotic design automation.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
BERT-based model for Vietnamese Fact Verification Dataset
Authors:
Bao Tran,
T. N. Khanh,
Khang Nguyen Tuong,
Thien Dang,
Quang Nguyen,
Nguyen T. Thinh,
Vo T. Hung
Abstract:
The rapid advancement of information and communication technology has facilitated easier access to information. However, this progress has also necessitated more stringent verification measures to ensure the accuracy of information, particularly within the context of Vietnam. This paper introduces an approach to address the challenges of Fact Verification using the Vietnamese dataset by integratin…
▽ More
The rapid advancement of information and communication technology has facilitated easier access to information. However, this progress has also necessitated more stringent verification measures to ensure the accuracy of information, particularly within the context of Vietnam. This paper introduces an approach to address the challenges of Fact Verification using the Vietnamese dataset by integrating both sentence selection and classification modules into a unified network architecture. The proposed approach leverages the power of large language models by utilizing pre-trained PhoBERT and XLM-RoBERTa as the backbone of the network. The proposed model was trained on a Vietnamese dataset, named ISE-DSC01, and demonstrated superior performance compared to the baseline model across all three metrics. Notably, we achieved a Strict Accuracy level of 75.11\%, indicating a remarkable 28.83\% improvement over the baseline model.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
CAMEx: Curvature-aware Merging of Experts
Authors:
Dung V. Nguyen,
Minh H. Nguyen,
Luc Q. Nguyen,
Rachel S. Y. Teo,
Tan M. Nguyen,
Linh Duy Tran
Abstract:
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information a…
▽ More
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method. The code is publicly available at: https://github.com/kpup1710/CAMEx.
△ Less
Submitted 3 March, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
SQLong: Enhanced NL2SQL for Longer Contexts with LLMs
Authors:
Dai Quoc Nguyen,
Cong Duy Vu Hoang,
Duy Vu,
Gioacchino Tangari,
Thanh Tien Vu,
Don Dharmasiri,
Yuan-Fang Li,
Long Duong
Abstract:
Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios…
▽ More
Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These imply SQLong's practical implementation and its impact on improving NL2SQL capabilities in real-world settings with complex database schemas.
△ Less
Submitted 20 May, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
DUPRE: Data Utility Prediction for Efficient Data Valuation
Authors:
Kieu Thao Nguyen Pham,
Rachael Hwee Ling Sim,
Quoc Phong Nguyen,
See Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient e…
▽ More
Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, \texttt{DUPRE}, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, \texttt{DUPRE} fits a \emph{Gaussian process} (GP) regression model to predict the utility of every other data subset. Our key contribution lies in the design of our GP kernel based on the sliced Wasserstein distance between empirical data distributions. In particular, we show that the kernel is valid and positive semi-definite, encodes prior knowledge of similarities between different data subsets, and can be efficiently computed. We empirically verify that \texttt{DUPRE} introduces low prediction error and speeds up data valuation for various ML models, datasets, and utility functions.
△ Less
Submitted 22 February, 2025;
originally announced February 2025.
-
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Authors:
Longxu Dou,
Qian Liu,
Fan Zhou,
Changyu Chen,
Zili Wang,
Ziqi Jin,
Zichen Liu,
Tongyao Zhu,
Cunxiao Du,
Penghui Yang,
Haonan Wang,
Jiaheng Liu,
Yongchi Zhao,
Xiachong Feng,
Xin Mao,
Man Tsung Yeung,
Kunat Pipatanakul,
Fajri Koto,
Min Si Thu,
Hynek Kydlíček,
Zeyi Liu,
Qunshu Lin,
Sittipong Sripaisarnmongkol,
Kridtaphad Sae-Khow,
Nirattisai Thongchim
, et al. (16 additional authors not shown)
Abstract:
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50…
▽ More
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Spatiotemporal Graph Neural Networks in short term load forecasting: Does adding Graph Structure in Consumption Data Improve Predictions?
Authors:
Quoc Viet Nguyen,
Joaquin Delgado Fernandez,
Sergio Potenciano Menci
Abstract:
Short term Load Forecasting (STLF) plays an important role in traditional and modern power systems. Most STLF models predominantly exploit temporal dependencies from historical data to predict future consumption. Nowadays, with the widespread deployment of smart meters, their data can contain spatiotemporal dependencies. In particular, their consumption data is not only correlated to historical va…
▽ More
Short term Load Forecasting (STLF) plays an important role in traditional and modern power systems. Most STLF models predominantly exploit temporal dependencies from historical data to predict future consumption. Nowadays, with the widespread deployment of smart meters, their data can contain spatiotemporal dependencies. In particular, their consumption data is not only correlated to historical values but also to the values of neighboring smart meters. This new characteristic motivates researchers to explore and experiment with new models that can effectively integrate spatiotemporal interrelations to increase forecasting performance. Spatiotemporal Graph Neural Networks (STGNNs) can leverage such interrelations by modeling relationships between smart meters as a graph and using these relationships as additional features to predict future energy consumption. While extensively studied in other spatiotemporal forecasting domains such as traffic, environments, or renewable energy generation, their application to load forecasting remains relatively unexplored, particularly in scenarios where the graph structure is not inherently available. This paper overviews the current literature focusing on STGNNs with application in STLF. Additionally, from a technical perspective, it also benchmarks selected STGNN models for STLF at the residential and aggregate levels. The results indicate that incorporating graph features can improve forecasting accuracy at the residential level; however, this effect is not reflected at the aggregate level
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Model-Free Counterfactual Subset Selection at Scale
Authors:
Minh Hieu Nguyen,
Viet Hung Doan,
Anh Tuan Nguyen,
Jun Jo,
Quoc Viet Hung Nguyen
Abstract:
Ensuring transparency in AI decision-making requires interpretable explanations, particularly at the instance level. Counterfactual explanations are a powerful tool for this purpose, but existing techniques frequently depend on synthetic examples, introducing biases from unrealistic assumptions, flawed models, or skewed data. Many methods also assume full dataset availability, an impractical const…
▽ More
Ensuring transparency in AI decision-making requires interpretable explanations, particularly at the instance level. Counterfactual explanations are a powerful tool for this purpose, but existing techniques frequently depend on synthetic examples, introducing biases from unrealistic assumptions, flawed models, or skewed data. Many methods also assume full dataset availability, an impractical constraint in real-time environments where data flows continuously. In contrast, streaming explanations offer adaptive, real-time insights without requiring persistent storage of the entire dataset. This work introduces a scalable, model-free approach to selecting diverse and relevant counterfactual examples directly from observed data. Our algorithm operates efficiently in streaming settings, maintaining $O(\log k)$ update complexity per item while ensuring high-quality counterfactual selection. Empirical evaluations on both real-world and synthetic datasets demonstrate superior performance over baseline methods, with robust behavior even under adversarial conditions.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching
Authors:
Quan Nguyen,
Shinji Ito,
Junpei Komiyama,
Nishant A. Mehta
Abstract:
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent or have sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new met…
▽ More
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent or have sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandits problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Authors:
Anh-Tien Nguyen,
Duy Minh Ho Nguyen,
Nghiem Tuong Diep,
Trung Quoc Nguyen,
Nhat Ho,
Jacqueline Michelle Metsch,
Miriam Cindy Maurer,
Daniel Sonntag,
Hanibal Bohnenberger,
Anne-Christin Hauschild
Abstract:
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a visio…
▽ More
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this MGPATH.
△ Less
Submitted 14 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
Authors:
Sang Quang Nguyen,
Kiet Van Nguyen
Abstract:
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models l…
▽ More
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Can ChatGPT Diagnose Alzheimer's Disease?
Authors:
Quoc-Toan Nguyen,
Linh Le,
Xuan-The Tran,
Thomas Do,
Chin-Teng Lin
Abstract:
Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating neurodegenerative condition that affects approximately 1 in 9 individuals aged 65 and older, profoundly impairing memory and cognitive function. This paper utilises 9300 electronic health records (EHRs) with data from Magnetic Resonance Imaging (MRI) and cognitive tests to address an intriguing question: As a general-purpose task s…
▽ More
Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating neurodegenerative condition that affects approximately 1 in 9 individuals aged 65 and older, profoundly impairing memory and cognitive function. This paper utilises 9300 electronic health records (EHRs) with data from Magnetic Resonance Imaging (MRI) and cognitive tests to address an intriguing question: As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? We present an in-depth evaluation of ChatGPT using a black-box approach with zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to analyse MRI and cognitive test results, as well as its potential as a diagnostic tool for AD. By automating aspects of the diagnostic process, this research opens a transformative approach for the healthcare system, particularly in addressing disparities in resource-limited regions where AD specialists are scarce. Hence, it offers a foundation for a promising method for early detection, supporting individuals with timely interventions, which is paramount for Quality of Life (QoL).
△ Less
Submitted 9 February, 2025;
originally announced February 2025.