-
A Note on Reconfiguration Graphs of Cliques
Authors:
Quan N. Lam,
Huu An Phan,
Duc A. Hoang
Abstract:
In a reconfiguration setting, each clique of a graph $G$ is viewed as a set of tokens placed on vertices of $G$ such that no vertex has more than one token and any two tokens are adjacent. Additionally, three well-known reconfiguration rules have been studied in the literature: Token Jumping ($\mathsf{TJ}$, which involves moving a token to any unoccupied vertex), Token Sliding ($\mathsf{TS}$, which involves moving a token to an adjacent unoccupied vertex), and Token Addition/Removal ($\mathsf{TAR}$, which involves either adding or removing exactly one token). Given a graph $G$ and a reconfiguration rule $\mathsf{R} \in \{\mathsf{TS}, \mathsf{TJ}, \mathsf{TAR}\}$, a reconfiguration graph of cliques of $G$ is the graph whose vertices are cliques of $G$ and two vertices (cliques of $G$) are adjacent if one can be obtained from the other by applying $\mathsf{R}$ exactly once. In this paper, we initiate the study on structural properties of reconfiguration graphs of cliques and prove a number of interesting results, most of which are under the $\mathsf{TS}$ and $\mathsf{TJ}$ rules.
Submitted 9 June, 2025;
originally announced June 2025.
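The definition of a reconfiguration graph of cliques in the abstract above can be made concrete with a small brute-force sketch. It is illustrative only and is not taken from the paper: the function names `cliques` and `reconfiguration_graph`, the adjacency-dictionary input format, and the choice to enumerate only non-empty cliques are assumptions made here.

```python
from itertools import combinations

def cliques(adj):
    """Return all non-empty cliques of a graph given as {vertex: set_of_neighbours}.

    Brute force over vertex subsets; only suitable for very small graphs.
    """
    verts = sorted(adj)
    found = []
    for k in range(1, len(verts) + 1):
        for cand in combinations(verts, k):
            if all(v in adj[u] for u, v in combinations(cand, 2)):
                found.append(frozenset(cand))
    return found

def reconfiguration_graph(adj, rule):
    """Build the reconfiguration graph of cliques under rule 'TS', 'TJ' or 'TAR'.

    Nodes are cliques (frozensets of vertices); two cliques are joined when one
    is obtained from the other by a single application of the rule.
    """
    nodes = cliques(adj)
    edges = set()
    for a, b in combinations(nodes, 2):
        removed, added = a - b, b - a
        if rule == "TAR":
            # Exactly one token is added or removed.
            ok = len(removed) + len(added) == 1
        else:
            # TJ: exactly one token moves to an unoccupied vertex.
            ok = len(removed) == 1 and len(added) == 1
            if ok and rule == "TS":
                # TS: the move must additionally follow an edge of the graph.
                (u,), (v,) = removed, added
                ok = v in adj[u]
        if ok:
            edges.add(frozenset((a, b)))
    return nodes, edges

# Example: the path 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
nodes, ts_edges = reconfiguration_graph(path, "TS")
_, tj_edges = reconfiguration_graph(path, "TJ")
_, tar_edges = reconfiguration_graph(path, "TAR")
```

On this path the sketch finds five cliques. The cliques {0} and {2} are adjacent under TJ but not under TS, because vertices 0 and 2 are not adjacent in the path.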
-
IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling
Authors:
Kuan-Po Huang,
Shu-wen Yang,
Huy Phan,
Bo-Ru Lu,
Byeonggeun Kim,
Sashank Macha,
Qingming Tang,
Shalini Ghosh,
Hung-yi Lee,
Chieh-Chi Kao,
Chao Wang
Abstract:
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
Submitted 31 May, 2025;
originally announced June 2025.
-
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning
Authors:
Jinquan Guan,
Qi Chen,
Lizhou Liang,
Yuhang Liu,
Vu Minh Hieu Phan,
Minh-Son To,
Jian Chen,
Yutong Xie
Abstract:
Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
Submitted 29 May, 2025;
originally announced May 2025.
-
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
Authors:
Ta Duc Huy,
Duy Anh Huynh,
Yutong Xie,
Yuankai Qi,
Qi Chen,
Phi Le Nguyen,
Sen Kim Tran,
Son Lam Phung,
Anton van den Hengel,
Zhibin Liao,
Minh-Son To,
Johan W. Verjans,
Vu Minh Hieu Phan
Abstract:
Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers identifying correlations between the text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
Submitted 21 May, 2025;
originally announced May 2025.
-
SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition
Authors:
Anh Tuan Ha,
Hoang Khang Phan,
Thai Minh Tien Ngo,
Anh Phan Truong,
Nhat Tan Le
Abstract:
In the realm of Human Activity Recognition (HAR), obtaining high-quality, high-variance data is still a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated by deep learning approaches (Attention Autoencoder and conditional Generative Adversarial Networks). Data heterogeneity is another critical challenge; one solution is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.
Submitted 15 May, 2025;
originally announced May 2025.
-
Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs
Authors:
Dung Nguyen,
Minh Khoi Ho,
Huy Ta,
Thanh Tam Nguyen,
Qi Chen,
Kumar Rav,
Quy Duong Dang,
Satwik Ramchandre,
Son Lam Phung,
Zhibin Liao,
Minh-Son To,
Johan Verjans,
Phi Le Nguyen,
Vu Minh Hieu Phan
Abstract:
Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs' localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.
Submitted 21 May, 2025; v1 submitted 30 April, 2025;
originally announced May 2025.
-
From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems
Authors:
Huan Zhang,
Jinhua Liang,
Huy Phan,
Wenwu Wang,
Emmanouil Benetos
Abstract:
Evaluating generative models remains a fundamental challenge, particularly when the goal is to reflect human preferences. In this paper, we use music generation as a case study to investigate the gap between automatic evaluation metrics and human preferences. We conduct comparative experiments across five state-of-the-art music generation approaches, assessing both perceptual quality and distributional similarity to human-composed music. Specifically, we evaluate synthesized music along various perceptual dimensions and examine reference-based metrics such as Mauve Audio Divergence (MAD) and Kernel Audio Distance (KAD). Our findings reveal significant inconsistencies across the different metrics, highlighting the limitations of current evaluation practice. To support further research, we release a benchmark dataset comprising samples from multiple models. This study provides a broader perspective on the alignment of human preference in generative modeling, advocating for more human-centered evaluation strategies across domains.
Submitted 30 April, 2025;
originally announced April 2025.
-
SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs
Authors:
Minh V. T. Pham,
Huy N. Phan,
Hoang N. Phan,
Cuong Le Chi,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces, limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM agents to simulate debugging workflows, producing not only bug-fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench Lite. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
Submitted 20 April, 2025;
originally announced April 2025.
-
FedX: Adaptive Model Decomposition and Quantization for IoT Federated Learning
Authors:
Phung Lai,
Xiaopeng Jiang,
Hai Phan,
Cristian Borcea,
Khang Tran,
An Chen,
Vijaya Datta Mayyuri,
Ruoming Jin
Abstract:
Federated Learning (FL) allows collaborative training among multiple devices without data sharing, thus enabling privacy-sensitive applications on mobile or Internet of Things (IoT) devices, such as mobile health and asset tracking. However, designing an FL system with good model utility that works with low computation/communication overhead on heterogeneous, resource-constrained mobile/IoT devices is challenging. To address this problem, this paper proposes FedX, a novel adaptive model decomposition and quantization FL system for IoT. To balance utility with resource constraints on IoT devices, FedX decomposes a global FL model into different sub-networks with adaptive numbers of quantized bits for different devices. The key idea is that a device with fewer resources receives a smaller sub-network for lower overhead but utilizes a larger number of quantized bits for higher model utility, and vice versa. The quantization operations in FedX are done at the server to reduce the computational load on devices. FedX iteratively minimizes the losses in the devices' local data and in the server's public data using quantized sub-networks under a regularization term, and thus it maximizes the benefits of combining FL with model quantization through knowledge sharing among the server and devices in a cost-effective training process. Extensive experiments show that FedX significantly improves quantization times by up to 8.43X, on-device computation time by 1.5X, and total end-to-end training time by 1.36X, compared with baseline FL systems. We guarantee the global model convergence theoretically and validate local model convergence empirically, highlighting FedX's optimization efficiency.
Submitted 9 June, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Dividing sums of cycles in the semiring of functional digraphs
Authors:
Florian Bridoux,
Christophe Crespelle,
Thi Ha Duong Phan,
Adrien Richard
Abstract:
Functional digraphs are unlabelled finite digraphs where each vertex has exactly one out-neighbor. They are isomorphism classes of finite discrete-time dynamical systems. Endowed with the direct sum and product, functional digraphs form a semiring with an interesting multiplicative structure. For instance, we do not know if the following division problem can be solved in polynomial time: given two functional digraphs $A$ and $B$, does $A$ divide $B$? That $A$ divides $B$ means that there exists a functional digraph $X$ such that $AX$ is isomorphic to $B$, and many such $X$ can exist. We can thus ask for the number of solutions $X$. In this paper, we focus on the case where $B$ is a sum of cycles (a disjoint union of cycles, corresponding to the limit behavior of finite discrete-time dynamical systems). There is then a naïve sub-exponential algorithm to compute the non-isomorphic solutions $X$, and our main result is an improvement of this algorithm that runs in polynomial time when $A$ is fixed. It uses a divide-and-conquer technique that should be useful for further developments on the division problem.
Submitted 16 April, 2025;
originally announced April 2025.
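The remark in the abstract above that many solutions $X$ can exist is easy to check by hand when both factors are sums of cycles. The sketch below is not the paper's algorithm. It only relies on the standard fact that the direct product of a cycle of length $a$ with a cycle of length $b$ consists of $\gcd(a,b)$ cycles of length $\mathrm{lcm}(a,b)$, and the representation of a sum of cycles as a `Counter` mapping cycle length to multiplicity, as well as the names `cycle_product` and `is_solution`, are assumptions made here.

```python
from collections import Counter
from math import gcd

def cycle_product(a, b):
    """Product of two sums of cycles, each given as Counter {length: count}.

    Uses C_p x C_q = gcd(p, q) copies of C_lcm(p, q), extended by
    distributivity of the product over the sum.
    """
    out = Counter()
    for la, ca in a.items():
        for lb, cb in b.items():
            g = gcd(la, lb)
            out[la * lb // g] += ca * cb * g
    return out

def is_solution(a, x, b):
    """Check whether A * X equals B for sums of cycles A, X and B."""
    return cycle_product(a, x) == b

# A = C_2 and B = C_2 + C_2 (two cycles of length 2).
A = Counter({2: 1})
B = Counter({2: 2})

# Two non-isomorphic solutions of A * X = B:
X1 = Counter({2: 1})  # X = C_2
X2 = Counter({1: 2})  # X = C_1 + C_1 (two fixed points)
both_solve = is_solution(A, X1, B) and is_solution(A, X2, B)
```

Here both $X = C_2$ and $X = C_1 + C_1$ satisfy $AX = B$, which illustrates why counting or enumerating solutions is a meaningful question.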
-
Transient gamma rays from the 2021 outburst of the recurrent nova RS Ophiuchi: the effect of gamma-ray absorption
Authors:
Vo Hong Minh Phan,
Pierre Cristofari,
Enrico Peretti,
Vincent Tatischeff,
Andrea Ciardi
Abstract:
In 2021, RS Ophiuchi was the first nova to be detected in the very-high-energy (TeV) gamma-ray domain, directly testifying to efficient acceleration of charged particles up to at least the TeV range at the nova shock. Surprisingly, the TeV gamma-ray signal peaks $\sim 2$ days after the GeV signal and the origin of this delay is still not clearly understood. We investigate the possibility that this delay is due to the effect of gamma-ray absorption resulting from interactions between gamma rays and optical photons copiously emitted during the outburst. We model particle acceleration at a nova shock to obtain the gamma-ray emission produced in interactions between the accelerated particles and the shocked gas. The effect of gamma-ray absorption is then included in detail using the radiative transfer equation. We find that this can naturally account for the delay between the peaks of GeV and TeV gamma-ray lightcurves. This result emphasizes the importance of gamma-ray absorption for interpreting gamma-ray observations of novae in the TeV range which, in turn, demonstrates the necessity of a multi-wavelength view for unraveling the underlying physics of particle acceleration in these systems.
Submitted 2 April, 2025;
originally announced April 2025.
-
VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation
Authors:
Hoang Hai Phan,
Nguyen Duc Minh Vu,
Nam Dang Phuong
Abstract:
Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
Submitted 31 March, 2025;
originally announced April 2025.
-
General Formulas for Loop-Induced Decays of $A \to Zγγ$ and Their Applications
Authors:
Dzung Tri Tran,
Thanh Huy Nguyen,
Khiem Hong Phan
Abstract:
Within the framework of the Standard Model Higgs extensions, including the Two-Higgs-Doublet Model with vector-like fermions and the Triplet-Higgs Model, we derive general one-loop contributions to the rare decay process $A \rightarrow Z γγ$. The analytical expressions are formulated with Passarino-Veltman scalar functions, which represent the scalar coefficients of one-loop Lorentz-covariant tensor integrals. These functions are written in accordance with the input conventions of the {\tt LoopTools} and {\tt Collier} packages, facilitating the efficient numerical generation of decay rates and their distributions using these computational tools. As part of our phenomenological study, we examine the branching ratios of this decay channel within the viable parameter space of the considered models. Our results indicate that the branching ratio can reach $\mathcal{O}(10^{-4})$ in the Two-Higgs-Doublet Model and $\mathcal{O}(10^{-2})$ in the Triplet-Higgs Model at specific points within the allowed parameter regions. Furthermore, the inclusion of vector-like fermions in the loop leads to a significant modification of the partial decay rates, with an observed variation of approximately $10\%$. Additionally, we explore the dependence of the branching ratios on key model parameters, including the charged Higgs mass, mixing angles, and the soft-breaking scale, providing deeper insights into the phenomenological implications of these Higgs-extended frameworks.
Submitted 29 March, 2025;
originally announced March 2025.
-
NH-rich organic compounds from the carbonaceous asteroid (162173) Ryugu: nanoscale spectral and isotopic characterizations
Authors:
L. G. Vacher,
V. T. H. Phan,
L. Bonal,
M. Iskakova,
O. Poch,
P. Beck,
E. Quirico,
R. C. Ogliore
Abstract:
The detection of spectral bands at 3.06 µm by MicrOmega, combined with the chemical identification of other NH-containing organic molecules in Ryugu samples, suggests the presence of potential NH-bearing compounds. However, the chemical forms of these NH-rich compounds, whether associated with N-rich organics, ammonium (NH4+) salts, NH4 or NH-organics-bearing phyllosilicates, or other forms, remain to be better understood. In this study, we report the characterization of two Ryugu particles (C0050 and C0052) using multi-scale infrared (mm-reflectance, micro-FTIR, and nano-AFM-IR) and NanoSIMS techniques to constrain the nature and origin of NH-bearing components in the Ryugu asteroid. Our findings show that Ryugu's C0052 particle contains rare, micrometer-sized NH-rich organic compounds with peaks at 1660 cm-1 (mainly due to C=O stretching of the amide I band) and 1550 cm-1 (mainly due to the N-H bending vibration mode of the amide II band), indicative of amide-related compounds. In contrast, these compounds are absent in C0050. Notably, nitrogen isotopic analysis reveals that these amides in C0052 are depleted in 15N (d15N = -215 +/- 92 permil), confirming their indigenous origin, while carbon and hydrogen isotopic compositions are indistinguishable from terrestrial values within errors (d13C = -22 +/- 52 and dD = 194 +/- 368 permil). The amides detected in C0052 could have formed through hydrothermal alteration from carboxylic acid and amine precursors on Ryugu's parent planetesimal. Alternatively, they could have originated from the irradiation of 15N-depleted N-bearing ice by UV light or galactic cosmic rays, either at the surface of the asteroid in the outer Solar System or on the mantles of interstellar dust grains in the interstellar medium. Amides delivered to early Earth by primitive small bodies, such as asteroid Ryugu, may have played a crucial role in prebiotic chemistry.
Submitted 14 March, 2025;
originally announced March 2025.
-
Knowledge Consultation for Semi-Supervised Semantic Segmentation
Authors:
Thuan Than,
Nhat-Anh Nguyen-Dang,
Dung Nguyen,
Salwa K. Al Khatib,
Ahmed Elhagry,
Hai Phan,
Yihui He,
Zhiqiang Shen,
Marios Savvides,
Dang Huynh
Abstract:
Semi-Supervised Semantic Segmentation reduces reliance on extensive annotations by using unlabeled data and state-of-the-art models to improve overall performance. Despite the success of deep co-training methods, their underlying mechanisms remain underexplored. This work revisits Cross Pseudo Supervision with dual heterogeneous backbones and introduces Knowledge Consultation (SegKC) to further enhance segmentation performance. The proposed SegKC achieves significant improvements on Pascal and Cityscapes benchmarks, with mIoU scores of 87.1%, 89.2%, and 89.8% on Pascal VOC with the 1/4, 1/2, and full split partition, respectively, while maintaining a compact model architecture.
Submitted 12 March, 2025;
originally announced March 2025.
-
Interactive Medical Image Analysis with Concept-based Similarity Reasoning
Authors:
Ta Duc Huy,
Sen Kim Tran,
Phan Nguyen,
Nguyen Hoang Tran,
Tran Bao Sam,
Anton van den Hengel,
Zhibin Liao,
Johan W. Verjans,
Minh-Son To,
Vu Minh Hieu Phan
Abstract:
The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5% across three biomedical datasets. Our code is released at https://github.com/tadeephuy/InteractCSR.
Submitted 11 March, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Heterogeneous bimodal attention fusion for speech emotion recognition
Authors:
Jiachen Luo,
Huy Phan,
Lin Wang,
Joshua Reiss
Abstract:
Multi-modal emotion recognition in conversations is a challenging problem due to the complex and complementary interactions between different modalities. Audio and textual cues are particularly important for understanding emotions from a human perspective. Most existing studies focus on exploring interactions between audio and text modalities at the same representation level. However, a critical issue is often overlooked: the heterogeneous modality gap between low-level audio representations and high-level text representations. To address this problem, we propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The proposed method comprises three key modules: the uni-modal representation module, the multi-modal fusion module, and the inter-modal contrastive learning module. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap, enabling more effective fusion. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter incorrect cross-modal relationships and fully exploit both intra-modal and inter-modal interactions. Finally, the inter-modal contrastive learning module captures complex absolute and relative interactions between audio and text modalities. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed HBAF method outperforms existing state-of-the-art baselines.
Submitted 31 March, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
Bimodal Connection Attention Fusion for Speech Emotion Recognition
Authors:
Jiachen Luo,
Huy Phan,
Lin Wang,
Joshua D. Reiss
Abstract:
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
Submitted 22 March, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
Large Language Models for Education: ChemTAsk -- An Open-Source Paradigm for Automated Q&A in the Graduate Classroom
Authors:
Ryann M. Perez,
Marie Shimogawa,
Yanan Chang,
Hoang Anh T. Phan,
Jason G. Marmorstein,
Evan S. K. Yanagawa,
E. James Petersson
Abstract:
Large language models (LLMs) show promise for aiding graduate-level education, but are limited by their training data and potential confabulations. We developed ChemTAsk, an open-source pipeline that combines LLMs with retrieval-augmented generation (RAG) to provide accurate, context-specific assistance. ChemTAsk utilizes course materials, including lecture transcripts and primary publications, to generate accurate responses to student queries. Over nine weeks in an advanced biological chemistry course at the University of Pennsylvania, students could opt in to use ChemTAsk for assistance in any assignment or to understand class material. Comparative analysis showed ChemTAsk performed on par with human teaching assistants (TAs) in understanding student queries and providing accurate information, particularly excelling in creative problem-solving tasks. In contrast, TAs were more precise in their responses and tailored their assistance to the specifics of the class. Student feedback indicated that ChemTAsk was perceived as correct, helpful, and faster than TAs. Open-source and proprietary models, from Meta and OpenAI respectively, were tested on an original biological chemistry benchmark for future iterations of ChemTAsk. It was found that OpenAI models were more tolerant to deviations in the input prompt and excelled in self-assessment to safeguard against potential confabulations. Taken together, ChemTAsk demonstrates the potential of integrating LLMs with RAG to enhance educational support, offering a scalable tool for students and educators.
Submitted 6 February, 2025; v1 submitted 9 January, 2025;
originally announced February 2025.
-
Momentum Contrastive Learning with Enhanced Negative Sampling and Hard Negative Filtering
Authors:
Duy Hoang,
Huy Ngo,
Khoi Pham,
Tri Nguyen,
Gia Bao,
Huy Phan
Abstract:
Contrastive learning has become pivotal in unsupervised representation learning, with frameworks like Momentum Contrast (MoCo) effectively utilizing large negative sample sets to extract discriminative features. However, traditional approaches often overlook the full potential of key embeddings and are susceptible to performance degradation from noisy negative samples in the memory bank. This study addresses these challenges by proposing an enhanced contrastive learning framework that incorporates two key innovations. First, we introduce a dual-view loss function, which ensures balanced optimization of both query and key embeddings, improving representation quality. Second, we develop a selective negative sampling strategy that emphasizes the most challenging negatives based on cosine similarity, mitigating the impact of noise and enhancing feature discrimination. Extensive experiments demonstrate that our framework achieves superior performance on downstream tasks, delivering robust and well-structured representations. These results highlight the potential of optimized contrastive mechanisms to advance unsupervised learning and extend its applicability across domains such as computer vision and natural language processing.
Submitted 20 January, 2025;
originally announced January 2025.
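As an illustration of the selective negative sampling idea in the abstract above, the sketch below keeps only the k memory-bank negatives most similar to the query (by cosine similarity) and uses them in an InfoNCE-style loss. This is not the authors' implementation: the function names, the top-k rule, and the `k` and `temperature` values are assumptions made for illustration, and the dual-view loss is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hard_negative_info_nce(query, key, bank, k=2, temperature=0.2):
    """InfoNCE-style loss using only the k hardest negatives from the bank.

    Hypothetical helper: 'hardest' here means highest cosine similarity
    to the query, which is one plausible reading of the abstract.
    """
    # Rank negatives by similarity to the query and keep the top k.
    hardest = sorted((cosine(query, n) for n in bank), reverse=True)[:k]
    pos = cosine(query, key) / temperature
    logits = [pos] + [s / temperature for s in hardest]
    # Numerically stable log-sum-exp over positive and selected negatives.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos


query = [1.0, 0.0]
key = [0.9, 0.1]
bank = [[1.0, 0.2], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.6]]
loss = hard_negative_info_nce(query, key, bank, k=2)
```

With `k=0` no negatives are used and the loss is zero; increasing `k` can only add terms to the denominator, so the loss does not decrease.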
-
One-loop induced contributions to the rare decay of $A_0 \rightarrow h_0h_0\gamma$ in Two Higgs Doublet Models
Authors:
Dzung Tri Tran,
L. T. Hue,
Thanh Huy Nguyen,
Vo Quoc Phong,
Khiem Hong Phan
Abstract:
The analytic expressions for one-loop contributions to the rare decay process $A_0 \rightarrow h_0h_0\gamma$ within CP-conserving Two Higgs Doublet Models are reported for the first time in this paper. Analytic results are presented in terms of scalar one-loop Passarino-Veltman functions following the standard output of the packages LoopTools and Collier. In this context, physical results for the computed process are easily generated by using one of these packages. Numerical checks are proposed to verify the analytic results in this paper. The checks rely on the conditions that the decay amplitude must be ultraviolet finite and infrared finite. The amplitude, which involves an external photon, must also obey the Ward identity; this is confirmed numerically in this article. In the phenomenological results, the decay rates of $A_0 \rightarrow h_0h_0\gamma$ are evaluated at several points in the allowed regions of the parameter space. Furthermore, the differential decay widths with respect to the invariant mass of the Higgs pair in the final state are studied.
Submitted 25 January, 2025;
originally announced January 2025.
-
Projected proximal gradient trust-region algorithm for nonsmooth optimization
Authors:
Minh N. Dao,
Hung M. Phan,
Lindon Roberts
Abstract:
We consider trust-region methods for solving optimization problems where the objective is the sum of a smooth, nonconvex function and a nonsmooth, convex regularizer. We extend the global convergence theory of such methods to include worst-case complexity bounds in the case of unbounded model Hessian growth, and introduce a new, simple nonsmooth trust-region subproblem solver based on combining several iterations of proximal gradient descent with a single projection into the trust region, which meets the sufficient descent requirements for algorithm convergence and has promising numerical results.
Submitted 8 January, 2025;
originally announced January 2025.
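The subproblem solver described in the abstract above (several proximal gradient iterations followed by a single projection into the trust region) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes a quadratic model with gradient `g` and Hessian approximation `B`, an l1 regularizer with weight `lam`, a Euclidean trust region of radius `delta`, and a fixed step size `step`. All names are hypothetical.

```python
import math


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied componentwise."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]


def projected_prox_grad_step(x, g, B, lam, delta, step, iters=5):
    """Approximately minimize g.s + 0.5 s.B.s + lam*||x + s||_1 over ||s|| <= delta.

    Runs `iters` proximal gradient iterations ignoring the trust region,
    then projects the resulting step onto the Euclidean ball once.
    """
    n = len(x)
    s = [0.0] * n
    for _ in range(iters):
        # Gradient of the smooth part of the model at the current step s.
        grad = [g[i] + sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
        # Proximal step in the shifted variable y = x + s (assumed l1 regularizer).
        y = soft_threshold(
            [x[i] + s[i] - step * grad[i] for i in range(n)], step * lam
        )
        s = [y[i] - x[i] for i in range(n)]
    # Single projection into the trust region.
    norm = math.sqrt(sum(si * si for si in s))
    if norm > delta:
        s = [si * delta / norm for si in s]
    return s


identity = [[1.0, 0.0], [0.0, 1.0]]
s = projected_prox_grad_step(
    [1.0, -2.0], [1.0, -1.0], identity, lam=0.1, delta=0.5, step=0.5
)
```

The returned step always lies inside the trust region. Whether it also gives sufficient decrease depends on conditions analysed in the paper, which this sketch does not check.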
-
LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging
Authors:
Shubhr Singh,
Emmanouil Benetos,
Huy Phan,
Dan Stowell
Abstract:
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.
Submitted 29 January, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
Authors:
Xuyin Qi,
Zeyu Zhang,
Aaron Berliano Handoko,
Huazhan Zheng,
Mingxi Chen,
Ta Duc Huy,
Vu Minh Hieu Phan,
Lei Zhang,
Linqi Cheng,
Shiyu Jiang,
Zhiwei Zhang,
Zhibin Liao,
Yang Zhao,
Minh-Son To
Abstract:
Prostate cancer, a growing global health concern, necessitates precise diagnostic tools, with Magnetic Resonance Imaging (MRI) offering high-resolution soft tissue imaging that significantly enhances diagnostic accuracy. Recent advancements in explainable AI and representation learning have significantly improved prostate cancer diagnosis by enabling automated and precise lesion classification. However, existing explainable AI methods, particularly those based on frameworks like generative adversarial networks (GANs), are predominantly developed for natural image generation, and their application to medical imaging often leads to suboptimal performance due to the unique characteristics and complexity of medical images. To address these challenges, our paper introduces three key contributions. First, we propose ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations, effectively linking medical image features to classifier decisions. Second, we enhance the encoder module by incorporating feature pyramids, which enables multiscale feedback to refine the latent space and improves the quality of generated explanations. Third, we conduct comprehensive experiments on both the generator and classifier, demonstrating the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings. Code will be released at https://github.com/Richardqiyi/ProjectedEx
Submitted 2 January, 2025;
originally announced January 2025.
-
Personalized Large Vision-Language Models
Authors:
Chau Pham,
Hoang Phan,
David Doermann,
Yunjie Tian
Abstract:
Model personalization has gained significant attention in image generation yet remains underexplored for large vision-language models (LVLMs). With personalization, LVLMs can go beyond generic models and handle interactive dialogues using referential concepts (e.g., "Mike and Susan are talking.") instead of the generic form (e.g., "a boy and a girl are talking."), making the conversation more customizable and referentially friendly. In addition, PLVM is equipped to continuously add new concepts during a dialogue without incurring additional costs, which significantly enhances its practicality. PLVM proposes Aligner, a pre-trained visual encoder to align referential concepts with the queried images. During the dialogues, it extracts features of reference images with these corresponding concepts and recognizes them in the queried image, enabling personalization. We note that the computational cost and parameter count of the Aligner are negligible within the entire framework. With comprehensive qualitative and quantitative analyses, we reveal the effectiveness and superiority of PLVM.
Submitted 23 December, 2024;
originally announced December 2024.
-
Wilson Loop and Topological Properties in 3D Woodpile Photonic Crystal
Authors:
Huyen Thanh Phan,
Shun Takahashi,
Satoshi Iwamoto,
Katsunori Wakabayashi
Abstract:
We numerically study the first- and second-order topological states of electromagnetic (EM) waves in the three-dimensional (3D) woodpile photonic crystal (PhC). Recent studies on 3D PhCs have mainly focused on the observation of the topological states. Here, we not only focus on finding the topological states but also propose a numerical calculation method for topological invariants, which is based on the Wilson loop. For the 3D woodpile PhC, the topological states emerge due to the finite difference in the winding number or partial Chern number. The selection rule for the emergence of topological hinge states is also pointed out based on the topological invariants. Our numerical calculation results are a step toward the experimental realization of topological waveguides in 3D PhCs.
Submitted 15 December, 2024;
originally announced December 2024.
-
Unveiling Concept Attribution in Diffusion Models
Authors:
Quang H. Nguyen,
Hoang Phan,
Khoa D. Doan
Abstract:
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely black-box; little do we know about the roles of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach diffusion models' interpretability problem from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?". To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically the model parameter) in generating a concept. The proposed framework, called Component Attribution for Diffusion Model (CAD), discovers the localization of concept-inducing (positive) components, while also uncovering another type of component that contributes negatively to generating a concept, which is missing in the previous knowledge localization work. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available at https://github.com/mail-research/CAD-attribution4diffusion.
Submitted 12 March, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
-
Fast ground-to-air transition with avian-inspired multifunctional legs
Authors:
Won Dong Shin,
Hoang-Vu Phan,
Monica A. Daley,
Auke J. Ijspeert,
Dario Floreano
Abstract:
Most birds can navigate seamlessly between aerial and terrestrial environments. Whereas the forelimbs evolved into wings primarily for flight, the hindlimbs serve diverse functions such as walking, hopping, leaping, and jumping take-off for transitions into flight. These capabilities have inspired engineers to aim for similar multi-modality in aerial robots, expanding their range of applications across diverse environments. However, challenges remain in reproducing multi-modal locomotion across gaits with distinct kinematics and propulsive characteristics, such as walking and jumping, while preserving lightweight mass for flight. This tradeoff between mechanical complexity and versatility limits most existing aerial robots to only one additional locomotor mode. Here, we overcome the complexity-versatility tradeoff with RAVEN (Robotic Avian-inspired Vehicle for multiple ENvironments), which uses its bird-inspired multi-functional legs to jump rapidly into flight, walk on the ground, and hop over obstacles and gaps, similar to the multi-modal locomotion of birds. We show that jumping for take-off contributes substantially to initial flight take-off speed and, remarkably, that it is more energy-efficient than solely propeller-based take-off. Our analysis suggests an important tradeoff in mass distribution between legs and body among birds adapted for different locomotor strategies, with greater investment in leg mass among terrestrial birds with multi-modal gait demands. Multi-functional robot legs expand opportunities to deploy traditional fixed-wing aircraft in complex terrains through autonomous take-offs and multi-modal gaits.
Submitted 3 December, 2024;
originally announced December 2024.
-
SemiKong: Curating, Training, and Evaluating A Semiconductor Industry-Specific Large Language Model
Authors:
Christopher Nguyen,
William Nguyen,
Atsushi Suzuki,
Daisuke Oku,
Hong An Phan,
Sang Dinh,
Zooey Nguyen,
Anh Ha,
Shruti Raghavan,
Huy Vo,
Thang Nguyen,
Lan Nguyen,
Yoshikuni Hirayama
Abstract:
Large Language Models (LLMs) have demonstrated the potential to address some issues within the semiconductor industry. However, they are often general-purpose models that lack the specialized knowledge needed to tackle the unique challenges of this sector, such as the intricate physics and chemistry of semiconductor devices and processes. SemiKong, the first industry-specific LLM for the semiconductor domain, provides a foundation that can be used to develop tailored proprietary models. With SemiKong 1.0, we aim to develop a foundational model capable of understanding etching problems at an expert level. Our key contributions include (a) curating a comprehensive corpus of semiconductor-related texts, (b) creating a foundational model with in-depth semiconductor knowledge, and (c) introducing a framework for integrating expert knowledge, thereby advancing the evaluation process of domain-specific AI models. Through fine-tuning a pre-trained LLM using our curated dataset, we have shown that SemiKong outperforms larger, general-purpose LLMs in various semiconductor manufacturing and design tasks. Our extensive experiments underscore the importance of developing domain-specific LLMs as a foundation for company- or tool-specific proprietary models, paving the way for further research and applications in the semiconductor domain. Code and dataset will be available at https://github.com/aitomatic/semikong
Submitted 21 November, 2024; v1 submitted 20 November, 2024;
originally announced November 2024.
-
A Survey of Medical Vision-and-Language Applications and Their Techniques
Authors:
Qi Chen,
Ruoshan Zhao,
Sinuo Wang,
Vu Minh Hieu Phan,
Anton van den Hengel,
Johan Verjans,
Zhibin Liao,
Minh-Son To,
Yong Xia,
Jian Chen,
Yutong Xie,
Qi Wu
Abstract:
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration/exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and codes is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey.
Submitted 18 November, 2024;
originally announced November 2024.
-
DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation
Authors:
Hao Phung,
Quan Dao,
Trung Dao,
Hoang Phan,
Dimitris Metaxas,
Anh Tran
Abstract:
We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The codes and pretrained models are released at https://github.com/VinAIResearch/DiMSUM.git.
Submitted 10 April, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
CIT: Rethinking Class-incremental Semantic Segmentation with a Class Independent Transformation
Authors:
Jinchao Ge,
Bowen Zhang,
Akide Liu,
Minh Hieu Phan,
Qi Chen,
Yangyang Shu,
Yang Zhao
Abstract:
Class-incremental semantic segmentation (CSS) requires that a model learn to segment new classes without forgetting how to segment previous ones: this is typically achieved by distilling the current knowledge and incorporating the latest data. However, bypassing iterative distillation by directly transferring outputs of initial classes to the current learning task is not supported in existing class-specific CSS methods. Via Softmax, they enforce dependency between classes and adjust the output distribution at each learning step, resulting in a large probability distribution gap between initial and current tasks. We introduce a simple, yet effective Class Independent Transformation (CIT) that converts the outputs of existing semantic segmentation models into class-independent forms with negligible cost or performance loss. By utilizing class-independent predictions facilitated by CIT, we establish an accumulative distillation framework, ensuring equitable incorporation of all class information. We conduct extensive experiments on various segmentation architectures, including DeepLabV3, Mask2Former, and SegViTv2. Results from these experiments show minimal task forgetting across different datasets, with less than 5% for ADE20K in the most challenging 11 task configurations and less than 1% across all configurations for the PASCAL VOC 2012 dataset.
Submitted 4 November, 2024;
originally announced November 2024.
-
VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning
Authors:
Cuong Chi Le,
Hoang-Chau Truong-Vinh,
Huy Nhat Phan,
Dung Duy Le,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static syntax, they often struggle with dynamic reasoning tasks. We introduce VisualCoder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG). By aligning code snippets with their corresponding CFGs, VisualCoder provides deeper insights into execution flows. We address challenges in multimodal CoT integration through a reference mechanism, ensuring consistency between code and its execution path, thereby improving performance in program behavior prediction, error detection, and output generation.
Submitted 9 February, 2025; v1 submitted 30 October, 2024;
originally announced October 2024.
-
Single-word Auditory Attention Decoding Using Deep Learning Model
Authors:
Nhan Duc Thanh Nguyen,
Huy Phan,
Kaare Mikkelsen,
Preben Kidmose
Abstract:
Identifying auditory attention by comparing auditory stimuli with the corresponding brain responses is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case, neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD but leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for the unseen subjects. To our knowledge, this is the first study dealing with this problem.
Submitted 15 October, 2024;
originally announced October 2024.
-
AADNet: An End-to-End Deep Learning Model for Auditory Attention Decoding
Authors:
Nhan Duc Thanh Nguyen,
Huy Phan,
Simon Geirnaert,
Kaare Mikkelsen,
Preben Kidmose
Abstract:
Auditory attention decoding (AAD) is the process of identifying the attended speech in a multi-talker environment using brain signals, typically recorded through electroencephalography (EEG). Over the past decade, AAD has undergone continuous development, driven by its promising application in neuro-steered hearing devices. Most AAD algorithms rely on the increase in neural entrainment to the envelope of attended speech, as compared to unattended speech, typically using a two-step approach. First, the algorithm predicts representations of the attended speech signal envelopes; second, it identifies the attended speech by finding the highest correlation between the predictions and the representations of the actual speech signals. In this study, we propose a novel end-to-end neural network architecture, named AADNet, which combines these two stages into a direct approach to address the AAD problem. We compare the proposed network against the traditional approaches, including linear stimulus reconstruction, canonical correlation analysis, and an alternative non-linear stimulus reconstruction, using two different datasets. AADNet shows a significant performance improvement for both subject-specific and subject-independent models. Notably, the average subject-independent classification accuracies range from 56.1% to 82.7% for analysis window lengths from 1 to 40 seconds, showing a significantly improved ability to generalize to data from unseen subjects. These results highlight the potential of deep learning models for advancing AAD, with promising implications for future hearing aids, assistive devices, and clinical assessments.
Submitted 16 October, 2024;
originally announced October 2024.
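The second step of the traditional two-step approach described in the AADNet abstract above (picking the speech stream whose envelope correlates best with the reconstruction) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the use of Pearson correlation, and the synthetic signals are assumptions made purely for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D signals of equal length."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decode_attention(reconstructed_envelope, candidate_envelopes):
    """Return the index of the candidate envelope most correlated with the
    envelope reconstructed from EEG, together with all correlation scores."""
    scores = [pearson(reconstructed_envelope, env) for env in candidate_envelopes]
    return int(np.argmax(scores)), scores

# Synthetic stand-ins (assumption): two speaker envelopes and a noisy
# reconstruction that follows speaker A, playing the role of step one's output.
rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(640)
speaker_b = rng.standard_normal(640)
reconstruction = speaker_a + 0.5 * rng.standard_normal(640)

attended, scores = decode_attention(reconstruction, [speaker_a, speaker_b])
```

In a real pipeline the reconstruction would come from a decoder trained on EEG, and the correlation would be computed per analysis window; an end-to-end model replaces both steps with a single network.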
-
One-loop analytical expressions for $γγ\rightarrow φ_iφ_j$ in Higgs Extensions of the Standard Models and its applications
Authors:
Khiem Hong Phan,
Dzung Tri Tran,
Thanh Huy Nguyen
Abstract:
General one-loop formulas for loop-induced processes $γγ\rightarrow φ_iφ_j$ with $φ_iφ_j = hh,~hH,~HH$ are presented in this paper. Analytic expressions evaluated in this work are valid for a class of Higgs Extensions of the Standard Models, e.g. Inert Doublet Higgs Models, Two Higgs Doublet Models, Zee-Babu Models as well as Triplet Higgs Models, etc. Analytic expressions for one-loop form factors are written in terms of the basic scalar one-loop two-, three- and four-point functions following the output format of both the packages LoopTools and Collier. Physical results can hence be evaluated numerically by using either of the mentioned packages. Analytic results are tested by several checks, such as the ultraviolet and infrared finiteness of the one-loop amplitudes. Furthermore, the amplitudes also obey the Ward identity due to the on-shell initial photons. This identity is also verified numerically in this work. In the applications, we present the phenomenological results for the Zee-Babu model as a typical example. Production cross-sections for the process $γγ\rightarrow hh$ are scanned over the parameter space of the Zee-Babu Models.
Submitted 9 October, 2024;
originally announced October 2024.
-
Leveraging Hierarchical Taxonomies in Prompt-based Continual Learning
Authors:
Quyen Tran,
Hoang Phan,
Minh Le,
Tuan Truong,
Dinh Phung,
Linh Ngo,
Thien Nguyen,
Nhat Ho,
Trung Le
Abstract:
Humans perceive the world as a series of sequential events, which can be hierarchically organized with different levels of abstraction based on conceptual knowledge. Drawing inspiration from human learning behaviors, this work proposes a novel approach to mitigate catastrophic forgetting in Prompt-based Continual Learning models by exploiting the relationships between continuously emerging class data. We find that applying human habits of organizing and connecting information can serve as an efficient strategy when training deep learning models. Specifically, by building a hierarchical tree structure based on the expanding set of labels, we gain fresh insights into the data, identifying groups of similar classes that could easily cause confusion. Additionally, we delve deeper into the hidden connections between classes by exploring the original pretrained model's behavior through an optimal transport-based approach. From these insights, we propose a novel regularization loss function that encourages models to focus more on challenging knowledge areas, thereby enhancing overall performance. Experimentally, our method demonstrated significant superiority over the most robust state-of-the-art models on various benchmarks.
Submitted 8 March, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Authors:
Huy Nhat Phan,
Tien N. Nguyen,
Phong X. Nguyen,
Nghi D. Q. Bui
Abstract:
Large Language Models (LLMs) have revolutionized software engineering (SE), showcasing remarkable proficiency in various coding tasks. Despite recent advancements that have enabled the creation of autonomous software agents utilizing LLMs for end-to-end development tasks, these systems are typically designed for specific SE functions. We introduce HyperAgent, an innovative generalist multi-agent system designed to tackle a wide range of SE tasks across different programming languages by mimicking the workflows of human developers. HyperAgent features four specialized agents (Planner, Navigator, Code Editor, and Executor) capable of handling the entire lifecycle of SE tasks, from initial planning to final verification. HyperAgent sets new benchmarks in diverse SE tasks, including GitHub issue resolution on the renowned SWE-Bench benchmark, outperforming robust baselines. Furthermore, HyperAgent demonstrates exceptional performance in repository-level code generation (RepoExec) and fault localization and program repair (Defects4J), often surpassing state-of-the-art baselines.
Submitted 5 November, 2024; v1 submitted 9 September, 2024;
originally announced September 2024.
-
MixNet: Joining Force of Classical and Modern Approaches Toward the Comprehensive Pipeline in Motor Imagery EEG Classification
Authors:
Phairot Autthasan,
Rattanaphon Chaisaen,
Huy Phan,
Maarten De Vos,
Theerawit Wilaiprasitporn
Abstract:
Recent advances in deep learning (DL) have significantly impacted motor imagery (MI)-based brain-computer interface (BCI) systems, enhancing the decoding of electroencephalography (EEG) signals. However, most studies struggle to identify discriminative patterns across subjects during MI tasks, limiting MI classification performance. In this article, we propose MixNet, a novel classification framework designed to overcome this limitation by utilizing spectral-spatial signals from MI data, along with a multitask learning architecture named MIN2Net, for classification. Here, the spectral-spatial signals are generated using the filter-bank common spatial patterns (FBCSPs) method on MI data. Since the multitask learning architecture is used for the classification task, the learning in each task may exhibit different generalization rates and potential overfitting across tasks. To address this issue, we implement adaptive gradient blending, simultaneously regulating multiple loss weights and adjusting the learning pace for each task based on its generalization/overfitting tendencies. Experimental results on six benchmark data sets of different data sizes demonstrate that MixNet consistently outperforms all state-of-the-art algorithms in subject-dependent and -independent settings. Finally, the low-density EEG MI classification results show that MixNet outperforms all state-of-the-art algorithms, offering promising implications for Internet of Things (IoT) applications, such as lightweight and portable EEG wearable devices based on low-density montages.
Submitted 6 September, 2024;
originally announced September 2024.
-
$(g-2)_{e,μ}$ and Lepton flavor violating decays in a left-right model
Authors:
L. T. Hue,
Khiem Hong Phan,
T. T. Hong,
T. Phong Nguyen,
N. H. T. Nha
Abstract:
General expressions for one-loop contributions associated with lepton-flavor violating decays of the standard model-like Higgs boson $h\to e_b^\pm e_a^\mp$ and gauge boson $Z\to e^\pm_b e_a^\mp$ are introduced in the unitary gauge. The results are used to discuss these decays as new physics signals in a minimal left-right symmetric model containing only one bidoublet Higgs and an $SU(2)_R$ Higgs doublet accommodating data of neutrino oscillations and $(g-2)_μ$. The numerical investigation indicates that some of these decay rates can reach near-future experimental sensitivities.
Submitted 14 December, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
-
Processes $γγ\rightarrow φ_iφ_j$ in Inert Higgs Doublet Models and Two Higgs Doublet Models
Authors:
Khiem Hong Phan,
Dzung Tri Tran,
Thanh Huy Nguyen
Abstract:
In this paper, we present a phenomenological analysis of one-loop induced processes $γγ\rightarrow φ_iφ_j$, where the CP-even Higgs bosons are denoted as $φ_{i,j} \equiv h,~H$, in high-energy photon-photon collisions, within the frameworks of the Inert Higgs Doublet Model and the Two Higgs Doublet Model. The total cross sections are evaluated as functions of the center-of-mass energy, finding that the cross sections for the considered processes in all the models under investigation are significantly enhanced around the threshold for charged Higgs boson pair production ($\sim 2 M_{H^\pm}$). Furthermore, the enhancement factors, defined as the ratios of cross sections of $γγ\rightarrow φ_iφ_j$ in the investigated models to those for $γγ\rightarrow hh$ in the Standard Model, are examined in the relevant regions of the model's parameter space. In the Inert Higgs Doublet Model, the factors are studied in the parameter space of $(M_{H^\pm},~μ^2_2)$ and $(M_{H^\pm},~λ_2)$. In the Two Higgs Doublet Model, the factors are examined in the planes defined by $(M_{H^\pm},~t_β)$ as well as in the space of the charged Higgs mass $M_{H^\pm}$ and the soft-breaking $Z_2$ parameter $m_{12}^2$. Two scenarios characterized by $c_{β-α} > 0$ and $c_{β-α} < 0$ are studied in further detail. The factors exhibit distinct behaviors between these two scenarios. As a result, it is possible to discriminate between them at future colliders. The dependence of the cross section for the process $γγ\rightarrow hH$ on $m_{12}^2$ provides a potential probe of the soft $Z_2$-breaking scale in the Two Higgs Doublet Model.
Submitted 7 April, 2025; v1 submitted 1 September, 2024;
originally announced September 2024.
-
TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation
Authors:
Anh-Dzung Doan,
Vu Minh Hieu Phan,
Surabhi Gupta,
Markus Wagner,
Tat-Jun Chin,
Ian Reid
Abstract:
Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in a few scenarios, its lack of visual detail compared to daytime visible images poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM), for infrared-to-visible video translation. Our method, extending the Patch Diffusion Model, consists of two key components. Firstly, we propose semantic-guided denoising, leveraging the strong representations of foundational models. As such, our method faithfully preserves the semantic structure of generated visible images. Secondly, we propose a novel temporal blending module to guide the denoising trajectory, ensuring temporal consistency between consecutive frames. Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at https://github.com/dzungdoan6/tc-pdm
Submitted 26 August, 2024;
originally announced August 2024.
-
ESA: Annotation-Efficient Active Learning for Semantic Segmentation
Authors:
Jinchao Ge,
Zeyu Zhang,
Minh Hieu Phan,
Bowen Zhang,
Akide Liu,
Yang Zhao
Abstract:
Active learning enhances annotation efficiency by selecting the most revealing samples for labeling, thereby reducing reliance on extensive human input. Previous methods in semantic segmentation have centered on individual pixels or small areas, neglecting the rich patterns in natural images and the power of advanced pre-trained models. To address these challenges, we propose three key contributions. First, we introduce Entity-Superpixel Annotation (ESA), an innovative and efficient active learning strategy which utilizes a class-agnostic mask proposal network coupled with superpixel grouping to capture local structural cues. Additionally, our method selects a subset of entities within each image of the target domain, prioritizing superpixels with high entropy to ensure comprehensive representation. Simultaneously, it focuses on a limited number of key entities, thereby optimizing for efficiency. By utilizing an annotator-friendly design that capitalizes on the inherent structure of images, our approach significantly outperforms existing pixel-based methods, achieving superior results with minimal queries, specifically reducing click cost by 98% and enhancing performance by 1.71%. For instance, our technique requires a mere 40 clicks for annotation, a stark contrast to the 5000 clicks demanded by conventional methods.
Submitted 24 August, 2024;
originally announced August 2024.
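The ESA abstract above mentions prioritizing superpixels with high entropy. A minimal sketch of one way such a ranking could be computed follows. The abstract does not specify the scoring rule, so the mean per-pixel Shannon entropy, the function names, and the toy inputs are all assumptions rather than the authors' method.

```python
import numpy as np

def superpixel_entropy(probs, superpixels):
    """Mean per-pixel prediction entropy for each superpixel id.

    probs: (H, W, C) array of class probabilities per pixel.
    superpixels: (H, W) integer array assigning each pixel to a superpixel.
    """
    eps = 1e-12  # avoids log(0) for fully confident predictions
    pixel_entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    ids = np.unique(superpixels)
    return {int(i): float(pixel_entropy[superpixels == i].mean()) for i in ids}

def select_top_k(entropies, k):
    """Ids of the k superpixels with the highest mean entropy."""
    return sorted(entropies, key=entropies.get, reverse=True)[:k]

# Toy 2x2 image with two classes (assumption): the top row is predicted
# confidently, the bottom row is maximally uncertain.
probs = np.array([
    [[0.99, 0.01], [0.99, 0.01]],
    [[0.50, 0.50], [0.50, 0.50]],
])
superpixels = np.array([[0, 0], [1, 1]])

entropies = superpixel_entropy(probs, superpixels)
to_annotate = select_top_k(entropies, k=1)
```

Here the uncertain bottom superpixel is the one queried for annotation; in practice the superpixel map would come from a segmentation algorithm and the probabilities from the current model.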
-
WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain
Authors:
Rounak Meyur,
Hung Phan,
Sridevi Wagle,
Jan Strube,
Mahantesh Halappanavar,
Sameera Horawalavithana,
Anurag Acharya,
Sai Munikoti
Abstract:
Wind energy project assessments present significant challenges for decision-makers, who must navigate and synthesize hundreds of pages of environmental and scientific documentation. These documents often span different regions and project scales, covering multiple domains of expertise. This process traditionally demands immense time and specialized knowledge from decision-makers. The advent of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approaches offers a transformative solution, enabling rapid, accurate cross-document information retrieval and synthesis. As the landscape of Natural Language Processing (NLP) and text generation continues to evolve, benchmarking becomes essential to evaluate and compare the performance of different RAG-based LLMs. In this paper, we present a comprehensive framework to generate a domain-relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI (LLM) teaming. As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark on the wind energy domain which comprises multiple scientific documents/reports related to environmental aspects of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity levels, providing a foundation for rigorous assessment of RAG-based systems in complex scientific domains and enabling researchers to identify areas for improvement in domain-specific applications.
Submitted 9 June, 2025; v1 submitted 21 August, 2024;
originally announced August 2024.
-
CodeFlow: Program Behavior Prediction with Dynamic Dependencies Learning
Authors:
Cuong Chi Le,
Hoang Nhat Phan,
Huy Nhat Phan,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Predicting program behavior without execution is a critical task in software engineering. Existing models often fall short in capturing the dynamic dependencies among program elements. To address this, we present CodeFlow, a novel machine learning-based approach that predicts code coverage and detects runtime errors by learning both static and dynamic dependencies within the code. By using control flow graphs (CFGs), CodeFlow effectively represents all possible execution paths and the statistic relations between different statements, providing a more comprehensive understanding of program behaviors. CodeFlow constructs CFGs to represent possible execution paths and learns vector representations (embeddings) for CFG nodes, capturing static control-flow dependencies. Additionally, it learns dynamic dependencies by leveraging execution traces, which reflect the impacts among statements during execution. This combination enables CodeFlow to accurately predict code coverage and identify runtime errors. Our empirical evaluation demonstrates that CodeFlow significantly improves code coverage prediction accuracy and effectively localizes runtime errors, outperforming state-of-the-art models.
Submitted 9 February, 2025; v1 submitted 5 August, 2024;
originally announced August 2024.
-
AdaCBM: An Adaptive Concept Bottleneck Model for Explainable and Accurate Diagnosis
Authors:
Townim F. Chowdhury,
Vu Minh Hieu Phan,
Kewen Liao,
Minh-Son To,
Yutong Xie,
Anton van den Hengel,
Johan W. Verjans,
Zhibin Liao
Abstract:
The integration of vision-language models such as CLIP and Concept Bottleneck Models (CBMs) offers a promising approach to explaining deep neural network (DNN) decisions using concepts understandable by humans, addressing the black-box concern of DNNs. While CLIP provides both explainability and zero-shot classification capability, its pre-training on generic image and text data may limit its classification accuracy and applicability to medical image diagnostic tasks, creating a transfer learning problem. To maintain explainability and address transfer learning needs, CBM methods commonly design post-processing modules after the bottleneck module. However, this approach has been ineffective. This paper takes an unconventional approach by re-examining the CBM framework through the lens of its geometrical representation as a simple linear classification system. The analysis uncovers that post-CBM fine-tuning modules merely rescale and shift the classification outcome of the system, failing to fully leverage the system's learning potential. We introduce an adaptive module strategically positioned between CLIP and CBM to bridge the gap between source and downstream domains. This simple yet effective approach enhances classification performance while preserving the explainability afforded by the framework. Our work offers a comprehensive solution that encompasses the entire process, from concept discovery to model training, providing a holistic recipe for leveraging the strengths of GPT, CLIP, and CBM.
Submitted 4 August, 2024;
originally announced August 2024.
-
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Authors:
Biao Wu,
Yutong Xie,
Zeyu Zhang,
Minh Hieu Phan,
Qi Chen,
Ling Chen,
Qi Wu
Abstract:
Vision-and-language pretraining (VLP) in the medical field utilizes contrastive learning on image-text pairs to achieve effective transfer across tasks. Yet, current VLP approaches with the masked modeling strategy face two challenges when applied to the medical domain. First, current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data. Second, most methods only adopt either paired image-text or image-only data, failing to exploit the combination of both paired and unpaired data. To this end, this paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework to enhance pathological learning and feature learning via unpaired data. First, we introduce the attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) modules, which learn to reconstruct pathological visual and textual tokens via multi-modal feature interaction, thus improving medical-enhanced features. The AttMIM module masks a portion of the image features that are highly responsive to textual features. This allows MMCLIP to reconstruct highly similar medical image data more efficiently. Second, MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts. The experimental results show that MMCLIP achieves SOTA for zero-shot and fine-tuning classification performance on five datasets. Our code will be available at https://github.com/AIGeeksGroup/MMCLIP.
Submitted 16 April, 2025; v1 submitted 28 July, 2024;
originally announced July 2024.
-
Passive wing deployment and retraction in beetles and flapping microrobots
Authors:
Hoang-Vu Phan,
Hoon Cheol Park,
Dario Floreano
Abstract:
Birds, bats and many insects can tuck their wings against their bodies at rest and deploy them to power flight. Whereas birds and bats use well-developed pectoral and wing muscles and tendons, how insects control these movements remains unclear, as mechanisms of wing deployment and retraction vary among insect species. Beetles (Coleoptera) display one of the most complex wing mechanisms. For example, in rhinoceros beetles, wing deployment begins by fully opening the elytra and partially releasing the hindwings from the abdomen. Subsequently, the beetle starts flapping, elevates the hindwings at the bases, and unfolds the wingtips in an origami-like fashion. Whilst the origami-like folds have been extensively explored, limited attention has been given to the hindwing base deployment and retraction, which are believed to be driven by thoracic muscles. Using high-speed cameras and robotic flapping-wing models, here we demonstrate that rhinoceros beetles can effortlessly elevate the hindwings to flight position without the need for muscular activity. We show that opening the elytra triggers a spring-like partial release of the hindwings from the body, allowing the clearance needed for subsequent flapping motion that brings the hindwings into flight position. The results also show that after flight, beetles can leverage the elytra to push the hindwings back into the resting position, further strengthening the hypothesis of a passive deployment mechanism. Finally, we validate the hypothesis with a flapping microrobot that passively deploys its wings for stable controlled flight and retracts them neatly upon landing, which offers a simple yet effective approach to the design of insect-like flying micromachines.
Submitted 25 July, 2024;
originally announced July 2024.
-
Data-Centric Human Preference Optimization with Rationales
Authors:
Hoang Anh Just,
Ming Jin,
Anit Sahu,
Huy Phan,
Ruoxi Jia
Abstract:
Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences, traditionally represented through comparisons between pairs or sets of responses within a given context. While many studies have enhanced algorithmic techniques to optimize learning from such data, this work shifts focus to improving preference learning through a data-centric appr…
▽ More
Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences, traditionally represented through comparisons between pairs or sets of responses within a given context. While many studies have enhanced algorithmic techniques to optimize learning from such data, this work shifts focus to improving preference learning through a data-centric approach. Specifically, we propose enriching existing preference datasets with machine-generated rationales that explain the reasons behind choices. We develop a simple and principled framework to augment current preference learning methods with rationale information. Our comprehensive analysis highlights how rationales enhance learning efficiency. Extensive experiments reveal that rationale-enriched preference learning offers multiple advantages: it improves data efficiency, accelerates convergence to higher-performing models, and reduces verbosity bias and hallucination. Furthermore, this framework is versatile enough to integrate with various preference optimization algorithms. Overall, our findings highlight the potential of re-imagining data design for preference learning, demonstrating that even freely available machine-generated rationales can significantly boost performance across multiple dimensions. The code repository is available at https: //github.com/reds-lab/preference-learning-with-rationales
Submitted 3 August, 2024; v1 submitted 19 July, 2024;
originally announced July 2024.
-
Benchmarking LLMs for Environmental Review and Permitting
Authors:
Rounak Meyur,
Hung Phan,
Koby Hayashi,
Ian Stewart,
Shivam Sharma,
Sarthak Chaturvedi,
Mike Parker,
Dan Nally,
Sadie Montgomery,
Karl Pazdernik,
Ali Jannesari,
Mahantesh Halappanavar,
Sai Munikoti,
Sameera Horawalavithana,
Anurag Acharya
Abstract:
The National Environmental Policy Act (NEPA) stands as a foundational piece of environmental legislation in the United States, requiring federal agencies to consider the environmental impacts of their proposed actions. The primary mechanism for achieving this is the preparation of Environmental Assessments (EAs) and, for significant impacts, comprehensive Environmental Impact Statements (EISs). The effectiveness of Large Language Models (LLMs) in specialized domains like NEPA remains untested for adoption in federal decision-making processes. To address this gap, we present the NEPA Question and Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from EIS documents, along with a modular and transparent evaluation pipeline, MAPLE, to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our benchmark leverages actual EIS documents to create diverse question types, ranging from factual to complex problem-solving ones. We built a modular and transparent evaluation pipeline to test both closed- and open-source models in zero-shot or context-driven QA benchmarks. We evaluate five state-of-the-art LLMs using our framework to assess both their prior knowledge and their ability to process NEPA-specific information. The experimental results reveal that all the models consistently achieve their highest performance when provided with the gold passage as context. When comparing the other context-driven approaches for each model, Retrieval Augmented Generation (RAG)-based approaches substantially outperform PDF document contexts, indicating that none of the models is well suited for long-context question-answering tasks. Our analysis suggests that NEPA-focused regulatory reasoning tasks pose a significant challenge for LLMs, particularly in terms of understanding the complex semantics and effectively processing the lengthy regulatory documents.
Submitted 11 June, 2025; v1 submitted 9 July, 2024;
originally announced July 2024.