-
CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers
Authors:
Anh Duc Le,
Tu Vu,
Nam Le Hai,
Nguyen Thi Ngoc Diep,
Linh Ngo Van,
Trung Le,
Thien Huu Nguyen
Abstract:
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their…
▽ More
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their flexibility. While approaches like Universal Logit Distillation (ULD) and Dual-Space Knowledge Distillation (DSKD) address vocabulary mismatches, they overlook the critical \textbf{reasoning-aware distillation} aspect. To bridge this gap, we propose CoT2Align a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer. Additionally, we extend Optimal Transport beyond token-wise alignment to a sequence-level and layer-wise alignment approach that adapts to varying sequence lengths while preserving contextual integrity. Comprehensive experiments demonstrate that CoT2Align outperforms existing KD methods across different vocabulary settings, improving reasoning capabilities and robustness in domain-specific tasks.
△ Less
Submitted 1 March, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Authors:
Anh-Tien Nguyen,
Duy Minh Ho Nguyen,
Nghiem Tuong Diep,
Trung Quoc Nguyen,
Nhat Ho,
Jacqueline Michelle Metsch,
Miriam Cindy Maurer,
Daniel Sonntag,
Hanibal Bohnenberger,
Anne-Christin Hauschild
Abstract:
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a visio…
▽ More
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this MGPATH.
△ Less
Submitted 14 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Authors:
Nghiem T. Diep,
Huy Nguyen,
Chau Nguyen,
Minh Le,
Duy M. H. Nguyen,
Daniel Sonntag,
Mathias Niepert,
Nhat Ho
Abstract:
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between ze…
▽ More
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
△ Less
Submitted 18 June, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
EXGRA-MED: Extended Context Graph Alignment for Medical Vision- Language Models
Authors:
Duy M. H. Nguyen,
Nghiem T. Diep,
Trung Q. Nguyen,
Hoang-Bao Le,
Tai Nguyen,
Tien Nguyen,
TrungTin Nguyen,
Nhat Ho,
Pengtao Xie,
Roger Wattenhofer,
James Zou,
Daniel Sonntag,
Mathias Niepert
Abstract:
State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLAVA-MED and BIOMEDGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce EXGRA-MED,…
▽ More
State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLAVA-MED and BIOMEDGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce EXGRA-MED, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMa-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, EXGRA-MED matches LLAVA-MED's performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BIOMEDGPT and RADFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.
△ Less
Submitted 17 June, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
Authors:
Duy M. H. Nguyen,
An T. Le,
Trung Q. Nguyen,
Nghiem T. Diep,
Tai Nguyen,
Duy Duong-Tran,
Jan Peters,
Li Shen,
Mathias Niepert,
Daniel Sonntag
Abstract:
Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we c…
▽ More
Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly align discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements, ensuring that the preservation of mass does not restrict transport solutions. Furthermore, UOT's characteristics integrate seamlessly with image augmentation, expanding the training sample pool while maintaining a reasonable distance between perturbed images and prompt inputs. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation
Authors:
Duy Minh Ho Nguyen,
Tan Ngoc Pham,
Nghiem Tuong Diep,
Nghi Quoc Phan,
Quang Pham,
Vinh Tong,
Binh T. Nguyen,
Ngan Hoang Le,
Nhat Ho,
Pengtao Xie,
Daniel Sonntag,
Mathias Niepert
Abstract:
Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for…
▽ More
Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for only a limited amount of annotated samples. While numerous techniques have focused on developing better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to higher out-of-distribution (OOD) performance.
△ Less
Submitted 18 November, 2023;
originally announced November 2023.
-
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching
Authors:
Duy M. H. Nguyen,
Hoang Nguyen,
Nghiem T. Diep,
Tan N. Pham,
Tri Cao,
Binh T. Nguyen,
Paul Swoboda,
Nhat Ho,
Shadi Albarqouni,
Pengtao Xie,
Daniel Sonntag,
Mathias Niepert
Abstract:
Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained deep networks on ImageNet and vision-language foundation models trained on web-scale data are prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and me…
▽ More
Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained deep networks on ImageNet and vision-language foundation models trained on web-scale data are prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed via a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, and both for the in and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.
△ Less
Submitted 18 November, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
TextANIMAR: Text-based 3D Animal Fine-Grained Retrieval
Authors:
Trung-Nghia Le,
Tam V. Nguyen,
Minh-Quan Le,
Trong-Thuan Nguyen,
Viet-Tham Huynh,
Trong-Le Do,
Khanh-Duy Le,
Mai-Khiem Tran,
Nhat Hoang-Xuan,
Thang-Long Nguyen-Ho,
Vinh-Tiep Nguyen,
Tuong-Nghiem Diep,
Khanh-Duy Ho,
Xuan-Hieu Nguyen,
Thien-Phuc Tran,
Tuan-Anh Yang,
Kim-Phat Tran,
Nhu-Vinh Hoang,
Minh-Quang Nguyen,
E-Ro Nguyen,
Minh-Khoi Nguyen-Nhat,
Tuan-An To,
Trung-Truc Huynh-Le,
Nham-Tan Nguyen,
Hoang-Chau Luong
, et al. (8 additional authors not shown)
Abstract:
3D object retrieval is an important yet challenging task that has drawn more and more attention in recent years. While existing approaches have made strides in addressing this issue, they are often limited to restricted settings such as image and sketch queries, which are often unfriendly interactions for common users. In order to overcome these limitations, this paper presents a novel SHREC chall…
▽ More
3D object retrieval is an important yet challenging task that has drawn more and more attention in recent years. While existing approaches have made strides in addressing this issue, they are often limited to restricted settings such as image and sketch queries, which are often unfriendly interactions for common users. In order to overcome these limitations, this paper presents a novel SHREC challenge track focusing on text-based fine-grained retrieval of 3D animal models. Unlike previous SHREC challenge tracks, the proposed task is considerably more challenging, requiring participants to develop innovative approaches to tackle the problem of text-based retrieval. Despite the increased difficulty, we believe this task can potentially drive useful applications in practice and facilitate more intuitive interactions with 3D objects. Five groups participated in our competition, submitting a total of 114 runs. While the results obtained in our competition are satisfactory, we note that the challenges presented by this task are far from fully solved. As such, we provide insights into potential areas for future research and improvements. We believe we can help push the boundaries of 3D object retrieval and facilitate more user-friendly interactions via vision-language technologies. https://aichallenge.hcmus.edu.vn/textanimar
△ Less
Submitted 9 August, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.
-
SketchANIMAR: Sketch-based 3D Animal Fine-Grained Retrieval
Authors:
Trung-Nghia Le,
Tam V. Nguyen,
Minh-Quan Le,
Trong-Thuan Nguyen,
Viet-Tham Huynh,
Trong-Le Do,
Khanh-Duy Le,
Mai-Khiem Tran,
Nhat Hoang-Xuan,
Thang-Long Nguyen-Ho,
Vinh-Tiep Nguyen,
Nhat-Quynh Le-Pham,
Huu-Phuc Pham,
Trong-Vu Hoang,
Quang-Binh Nguyen,
Trong-Hieu Nguyen-Mau,
Tuan-Luc Huynh,
Thanh-Danh Le,
Ngoc-Linh Nguyen-Ha,
Tuong-Vy Truong-Thuy,
Truong Hoai Phong,
Tuong-Nghiem Diep,
Khanh-Duy Ho,
Xuan-Hieu Nguyen,
Thien-Phuc Tran
, et al. (9 additional authors not shown)
Abstract:
The retrieval of 3D objects has gained significant importance in recent years due to its broad range of applications in computer vision, computer graphics, virtual reality, and augmented reality. However, the retrieval of 3D objects presents significant challenges due to the intricate nature of 3D models, which can vary in shape, size, and texture, and have numerous polygons and vertices. To this…
▽ More
The retrieval of 3D objects has gained significant importance in recent years due to its broad range of applications in computer vision, computer graphics, virtual reality, and augmented reality. However, the retrieval of 3D objects presents significant challenges due to the intricate nature of 3D models, which can vary in shape, size, and texture, and have numerous polygons and vertices. To this end, we introduce a novel SHREC challenge track that focuses on retrieving relevant 3D animal models from a dataset using sketch queries and expedites accessing 3D models through available sketches. Furthermore, a new dataset named ANIMAR was constructed in this study, comprising a collection of 711 unique 3D animal models and 140 corresponding sketch queries. Our contest requires participants to retrieve 3D models based on complex and detailed sketches. We receive satisfactory results from eight teams and 204 runs. Although further improvement is necessary, the proposed task has the potential to incentivize additional research in the domain of 3D object retrieval, potentially yielding benefits for a wide range of applications. We also provide insights into potential areas of future research, such as improving techniques for feature extraction and matching and creating more diverse datasets to evaluate retrieval performance. https://aichallenge.hcmus.edu.vn/sketchanimar
△ Less
Submitted 9 August, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.