Search | arXiv e-print repository

VergeIO: Depth-Aware Eye Interaction on Glasses

Authors: Xiyuxing Zhang, Duc Vu, Chengyi Shen, Yuntao Wang, Yuanchun Shi, Justin Chan

Abstract: There is growing industry interest in creating unobtrusive designs for electrooculography (EOG) sensing of eye gestures on glasses (e.g. JINS MEME and Apple eyewear). We present VergeIO, the first EOG-based glasses that enables depth-aware eye interaction using vergence with an optimized electrode layout and novel smart glass prototype. It can distinguish between four and six depth-based eye gestu… ▽ More There is growing industry interest in creating unobtrusive designs for electrooculography (EOG) sensing of eye gestures on glasses (e.g. JINS MEME and Apple eyewear). We present VergeIO, the first EOG-based glasses that enables depth-aware eye interaction using vergence with an optimized electrode layout and novel smart glass prototype. It can distinguish between four and six depth-based eye gestures with 83-98% accuracy using personalized models in a user study across 11 users and 1,320 gesture instances. It generalizes to unseen users with an accuracy of 80-98% without any calibration. To reduce false detections, we incorporate a motion artifact detection pipeline and a preamble-based activation scheme. The system uses dry sensors without any adhesives or gel, and operates in real time with 3 mW power consumption by the sensing front-end, making it suitable for always-on sensing. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.22760 [pdf, ps, other]

Jan-nano Technical Report

Authors: Alan Dao, Dinh Bach Vu

Abstract: Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR sy… ▽ More Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn't about scale, it's about strategy. △ Less

Submitted 28 June, 2025; originally announced June 2025.

arXiv:2506.20944 [pdf, ps, other]

E-FreeM2: Efficient Training-Free Multi-Scale and Cross-Modal News Verification via MLLMs

Authors: Van-Hoang Phan, Long-Khanh Pham, Dang Vu, Anh-Duy Tran, Minh-Son Dao

Abstract: The rapid spread of misinformation in mobile and wireless networks presents critical security challenges. This study introduces a training-free, retrieval-based multimodal fact verification system that leverages pretrained vision-language models and large language models for credibility assessment. By dynamically retrieving and cross-referencing trusted data sources, our approach mitigates vulnera… ▽ More The rapid spread of misinformation in mobile and wireless networks presents critical security challenges. This study introduces a training-free, retrieval-based multimodal fact verification system that leverages pretrained vision-language models and large language models for credibility assessment. By dynamically retrieving and cross-referencing trusted data sources, our approach mitigates vulnerabilities of traditional training-based models, such as adversarial attacks and data poisoning. Additionally, its lightweight design enables seamless edge device integration without extensive on-device processing. Experiments on two fact-checking benchmarks achieve SOTA results, confirming its effectiveness in misinformation detection and its robustness against various attack vectors, highlighting its potential to enhance security in mobile and wireless communication environments. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted to AsiaCCS 2025 @ SCID

arXiv:2506.14835 [pdf, ps, other]

MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

Authors: Kiet Dang Vu, Trung Thai Tran, Duc Dung Nguyen

Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based mono… ▽ More Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities. △ Less

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12437 [pdf]

Feeling Machines: Ethics, Culture, and the Rise of Emotional AI

Authors: Vivek Chavan, Arsen Cenaj, Shuyuan Shen, Ariane Bar, Srishti Binwani, Tommaso Del Becaro, Marius Funk, Lynn Greschner, Roberto Hung, Stina Klein, Romina Kleiner, Stefanie Krause, Sylwia Olbrych, Vishvapalsinhji Parmar, Jaleh Sarafraz, Daria Soroko, Daksitha Withanage Don, Chang Zhou, Hoang Thuy Duong Vu, Parastoo Semnani, Daniel Weinhardt, Elisabeth Andre, Jörg Krüger, Xavier Fresquet

Abstract: This paper explores the growing presence of emotionally responsive artificial intelligence through a critical and interdisciplinary lens. Bringing together the voices of early-career researchers from multiple fields, it explores how AI systems that simulate or interpret human emotions are reshaping our interactions in areas such as education, healthcare, mental health, caregiving, and digital life… ▽ More This paper explores the growing presence of emotionally responsive artificial intelligence through a critical and interdisciplinary lens. Bringing together the voices of early-career researchers from multiple fields, it explores how AI systems that simulate or interpret human emotions are reshaping our interactions in areas such as education, healthcare, mental health, caregiving, and digital life. The analysis is structured around four central themes: the ethical implications of emotional AI, the cultural dynamics of human-machine interaction, the risks and opportunities for vulnerable populations, and the emerging regulatory, design, and technical considerations. The authors highlight the potential of affective AI to support mental well-being, enhance learning, and reduce loneliness, as well as the risks of emotional manipulation, over-reliance, misrepresentation, and cultural bias. Key challenges include simulating empathy without genuine understanding, encoding dominant sociocultural norms into AI systems, and insufficient safeguards for individuals in sensitive or high-risk contexts. Special attention is given to children, elderly users, and individuals with mental health challenges, who may interact with AI in emotionally significant ways. However, there remains a lack of cognitive or legal protections which are necessary to navigate such engagements safely. The report concludes with ten recommendations, including the need for transparency, certification frameworks, region-specific fine-tuning, human oversight, and longitudinal research. A curated supplementary section provides practical tools, models, and datasets to support further work in this domain. △ Less

Submitted 14 June, 2025; originally announced June 2025.

Comments: From the Spring School 2025 by AI Grid and SCAI (Sorbonne University), 16 pages

arXiv:2506.09162 [pdf]

The RSNA Lumbar Degenerative Imaging Spine Classification (LumbarDISC) Dataset

Authors: Tyler J. Richards, Adam E. Flanders, Errol Colak, Luciano M. Prevedello, Robyn L. Ball, Felipe Kitamura, John Mongan, Maryam Vazirabad, Hui-Ming Lin, Anne Kendell, Thanat Kanthawang, Salita Angkurawaranon, Emre Altinmakas, Hakan Dogan, Paulo Eduardo de Aguiar Kuriki, Arjuna Somasundaram, Christopher Ruston, Deniz Bulja, Naida Spahovic, Jennifer Sommer, Sirui Jiang, Eduardo Moreno Judice de Mattos Farina, Eduardo Caminha Nunes, Michael Brassil, Megan McNamara , et al. (11 additional authors not shown)

Abstract: The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging Spine Classification (LumbarDISC) dataset is the largest publicly available dataset of adult MRI lumbar spine examinations annotated for degenerative changes. The dataset includes 2,697 patients with a total of 8,593 image series from 8 institutions across 6 countries and 5 continents. The dataset is available for free fo… ▽ More The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging Spine Classification (LumbarDISC) dataset is the largest publicly available dataset of adult MRI lumbar spine examinations annotated for degenerative changes. The dataset includes 2,697 patients with a total of 8,593 image series from 8 institutions across 6 countries and 5 continents. The dataset is available for free for non-commercial use via Kaggle and RSNA Medical Imaging Resource of AI (MIRA). The dataset was created for the RSNA 2024 Lumbar Spine Degenerative Classification competition where competitors developed deep learning models to grade degenerative changes in the lumbar spine. The degree of spinal canal, subarticular recess, and neural foraminal stenosis was graded at each intervertebral disc level in the lumbar spine. The images were annotated by expert volunteer neuroradiologists and musculoskeletal radiologists from the RSNA, American Society of Neuroradiology, and the American Society of Spine Radiology. This dataset aims to facilitate research and development in machine learning and lumbar spine imaging to lead to improved patient care and clinical efficiency. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2505.21441 [pdf, ps, other]

Autoencoding Random Forests

Authors: Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson

Abstract: We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest n… ▽ More We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data. △ Less

Submitted 27 May, 2025; originally announced May 2025.

Comments: 10 pages main text, 25 pages total. 5 figures main text, 9 figures total

arXiv:2505.17417 [pdf, ps, other]

Speechless: Speech Instruction Training Without Speech for Low Resource Languages

Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip

Abstract: The rapid growth of voice assistants powered by large language models (LLM) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good t… ▽ More The rapid growth of voice assistants powered by large language models (LLM) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good text-to-speech (TTS) model, which may not be available to low resource languages. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference. This simplified training process is a promising approach to building voice assistant for low-resource languages. △ Less

Submitted 22 May, 2025; originally announced May 2025.

Comments: This paper was accepted by INTERSPEECH 2025

arXiv:2505.15570 [pdf, ps, other]

Refining Neural Activation Patterns for Layer-Level Concept Discovery in Neural Network-Based Receivers

Authors: Marko Tuononen, Duy Vu, Dani Korpi, Vesa Starck, Ville Hautamäki

Abstract: Concept discovery in neural networks often targets individual neurons or human-interpretable features, overlooking distributed layer-wide patterns. We study the Neural Activation Pattern (NAP) methodology, which clusters full-layer activation distributions to identify such layer-level concepts. Applied to visual object recognition and radio receiver models, we propose improved normalization, distr… ▽ More Concept discovery in neural networks often targets individual neurons or human-interpretable features, overlooking distributed layer-wide patterns. We study the Neural Activation Pattern (NAP) methodology, which clusters full-layer activation distributions to identify such layer-level concepts. Applied to visual object recognition and radio receiver models, we propose improved normalization, distribution estimation, distance metrics, and varied cluster selection. In the radio receiver model, distinct concepts did not emerge; instead, a continuous activation manifold shaped by Signal-to-Noise Ratio (SNR) was observed -- highlighting SNR as a key learned factor, consistent with classical receiver behavior and supporting physical plausibility. Our enhancements to NAP improved in-distribution vs. out-of-distribution separation, suggesting better generalization and indirectly validating clustering quality. These results underscore the importance of clustering design and activation manifolds in interpreting and troubleshooting neural network behavior. △ Less

Submitted 21 May, 2025; originally announced May 2025.

Comments: 46 pages, 40 figures, 28 tables, 10 equations, and 5 listings

MSC Class: 68T07 (Primary) 62H30; 94A05 (Secondary) ACM Class: I.2.6; I.5.3; C.2.1

arXiv:2504.18729 [pdf, other]

Multimodal graph representation learning for website generation based on visual sketch

Authors: Tung D. Vu, Chung Hoang, Truong-Son Hy

Abstract: The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle with accurately interpreting the intricate visual details and structural relationships inherent in webpage designs, leading to limitations in automation and efficienc… ▽ More The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle with accurately interpreting the intricate visual details and structural relationships inherent in webpage designs, leading to limitations in automation and efficiency. In this paper, we propose a novel method that leverages multimodal graph representation learning to address these challenges. By integrating both visual and structural information from design sketches, our approach enhances the accuracy and efficiency of code generation, particularly in producing semantically correct and structurally sound HTML code. We present a comprehensive evaluation of our method, demonstrating significant improvements in both accuracy and efficiency compared to existing techniques. Extensive evaluation demonstrates significant improvements of multimodal graph learning over existing techniques, highlighting the potential of our method to revolutionize design-to-code automation. Code available at https://github.com/HySonLab/Design2Code △ Less

Submitted 25 April, 2025; originally announced April 2025.

arXiv:2504.15252 [pdf, other]

SuoiAI: Building a Dataset for Aquatic Invertebrates in Vietnam

Authors: Tue Vo, Lakshay Sharma, Tuan Dinh, Khuong Dinh, Trang Nguyen, Trung Phan, Minh Do, Duong Vu

Abstract: Understanding and monitoring aquatic biodiversity is critical for ecological health and conservation efforts. This paper proposes SuoiAI, an end-to-end pipeline for building a dataset of aquatic invertebrates in Vietnam and employing machine learning (ML) techniques for species classification. We outline the methods for data collection, annotation, and model training, focusing on reducing annotati… ▽ More Understanding and monitoring aquatic biodiversity is critical for ecological health and conservation efforts. This paper proposes SuoiAI, an end-to-end pipeline for building a dataset of aquatic invertebrates in Vietnam and employing machine learning (ML) techniques for species classification. We outline the methods for data collection, annotation, and model training, focusing on reducing annotation effort through semi-supervised learning and leveraging state-of-the-art object detection and classification models. Our approach aims to overcome challenges such as data scarcity, fine-grained classification, and deployment in diverse environmental conditions. △ Less

Submitted 21 April, 2025; originally announced April 2025.

Comments: Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2025

arXiv:2504.03292 [pdf, other]

FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

Authors: Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, Trung-Nghia Le, Minh-Triet Tran

Abstract: Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: Conce… ▽ More Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: Concept Fusion technique and Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods. △ Less

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2504.00339 [pdf, other]

VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Authors: Hoang Hai Phan, Nguyen Duc Minh Vu, Nam Dang Phuong

Abstract: Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-qualit… ▽ More Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines. △ Less

Submitted 31 March, 2025; originally announced April 2025.

arXiv:2503.18769 [pdf, other]

AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Authors: Alan Dao, Dinh Bach Vu, Bui Quang Huy

Abstract: This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height info… ▽ More This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific (x, y, z) coordinates. Experimental results suggest that AlphaSpace demonstrates promising potential for improving manipulation tasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. These results demonstrate the potential of structured spatial encoding for manipulation tasks and warrant further exploration. △ Less

Submitted 27 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.16286 [pdf, other]

Explainable Graph-theoretical Machine Learning: with Application to Alzheimer's Disease Prediction

Authors: Narmina Baghirova, Duy-Thanh Vũ, Duy-Cat Can, Christelle Schneuwly Diaz, Julien Bodlet, Guillaume Blanc, Georgi Hrusanov, Bernard Ries, Oliver Y. Chén

Abstract: Alzheimer's disease (AD) affects 50 million people worldwide and is projected to overwhelm 152 million by 2050. AD is characterized by cognitive decline due partly to disruptions in metabolic brain connectivity. Thus, early and accurate detection of metabolic brain network impairments is crucial for AD management. Chief to identifying such impairments is FDG-PET data. Despite advancements, most gr… ▽ More Alzheimer's disease (AD) affects 50 million people worldwide and is projected to overwhelm 152 million by 2050. AD is characterized by cognitive decline due partly to disruptions in metabolic brain connectivity. Thus, early and accurate detection of metabolic brain network impairments is crucial for AD management. Chief to identifying such impairments is FDG-PET data. Despite advancements, most graph-based studies using FDG-PET data rely on group-level analysis or thresholding. Yet, group-level analysis can veil individual differences and thresholding may overlook weaker but biologically critical brain connections. Additionally, machine learning-based AD prediction largely focuses on univariate outcomes, such as disease status. Here, we introduce explainable graph-theoretical machine learning (XGML), a framework employing kernel density estimation and dynamic time warping to construct individual metabolic brain graphs that capture the distance between pair-wise brain regions and identify subgraphs most predictive of multivariate AD-related outcomes. Using FDG-PET data from the Alzheimer's Disease Neuroimaging Initiative, XGML builds metabolic brain graphs and uncovers subgraphs predictive of eight AD-related cognitive scores in new subjects. XGML shows robust performance, particularly for predicting scores measuring learning, memory, language, praxis, and orientation, such as CDRSB ($r = 0.74$), ADAS11 ($r = 0.73$), and ADAS13 ($r = 0.71$). Moreover, XGML unveils key edges jointly but differentially predictive of several AD-related outcomes; they may serve as potential network biomarkers for assessing overall cognitive decline. Together, we show the promise of graph-theoretical machine learning in biomarker discovery and disease prediction and its potential to improve our understanding of network neural mechanisms underlying AD. △ Less

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.11282 [pdf, other]

OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values

Authors: Christelle Schneuwly Diaz, Duy-Thanh Vu, Julien Bodelet, Duy-Cat Can, Guillaume Blanc, Haiting Jiang, Lin Yao, Guiseppe Pantaleo, ADNI, Oliver Y. Chén

Abstract: Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease predictio… ▽ More Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI, we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to uncover data from each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine-learning and extracts biomarkers respectively predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine-learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD. △ Less

Submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.07111 [pdf, other]

PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM

Authors: Alan Dao, Dinh Bach Vu, Tuan Le Duc Anh, Bui Quang Huy

Abstract: This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using projected representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from rob… ▽ More This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using projected representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By projecting visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset. △ Less

Submitted 10 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

arXiv:2503.04790 [pdf, other]

SuperRAG: Beyond RAG with Layout-Aware Graph Modeling

Authors: Jeff Yang, Duy-Khanh Vu, Minh-Tien Nguyen, Xuan-Quang Nguyen, Linh Nguyen, Hung Le

Abstract: This paper introduces layout-aware graph modeling for multimodal RAG. Different from traditional RAG methods that mostly deal with flat text chunks, the proposed method takes into account the relationship of multimodalities by using a graph structure. To do that, a graph modeling structure is defined based on document layout parsing. The structure of an input document is retained with the connecti… ▽ More This paper introduces layout-aware graph modeling for multimodal RAG. Different from traditional RAG methods that mostly deal with flat text chunks, the proposed method takes into account the relationship of multimodalities by using a graph structure. To do that, a graph modeling structure is defined based on document layout parsing. The structure of an input document is retained with the connection of text chunks, tables, and figures. This representation allows the method to handle complex questions that require information from multimodalities. To confirm the efficiency of the graph modeling, a flexible RAG pipeline is developed using robust components. Experimental results on four benchmark test sets confirm the contribution of the layout-aware modeling for performance improvement of the RAG pipeline. △ Less

Submitted 28 February, 2025; originally announced March 2025.

Comments: NAACL 2025, Industry Track

arXiv:2502.17909 [pdf, other]

FactFlow: Automatic Fact Sheet Generation and Customization from Tabular Dataset via AI Chain Design & Implementation

Authors: Minh Duc Vu, Jieshan Chen, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Qian Fu

Abstract: With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate… ▽ More With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate narratives that fully capture the semantics of the dataset or align the fact sheet with specific user needs. Addressing these shortcomings, this paper introduces \tool, a novel tool designed for the automatic generation and customisation of fact sheets. \tool applies the concept of collaborative AI workers to transform raw tabular dataset into comprehensive, visually compelling fact sheets. We define effective taxonomy to profile AI worker for specialised tasks. Furthermore, \tool empowers users to refine these fact sheets through intuitive natural language commands, ensuring the final outputs align closely with individual preferences and requirements. Our user evaluation with 18 participants confirms that \tool not only surpasses state-of-the-art baselines in automated fact sheet production but also provides a positive user experience during customization tasks. △ Less

Submitted 25 February, 2025; originally announced February 2025.

Comments: 11 pages, 6 figures

ACM Class: I.2; H.4

arXiv:2502.16747 [pdf, other]

SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

Authors: Dai Quoc Nguyen, Cong Duy Vu Hoang, Duy Vu, Gioacchino Tangari, Thanh Tien Vu, Don Dharmasiri, Yuan-Fang Li, Long Duong

Abstract: Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios… ▽ More Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These imply SQLong's practical implementation and its impact on improving NL2SQL capabilities in real-world settings with complex database schemas. △ Less

Submitted 20 May, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

Comments: Accepted to Table Representation Learning Workshop at ACL 2025

arXiv:2502.14669 [pdf, other]

AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Authors: Alan Dao, Dinh Bach Vu

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) on a curated dataset of toke… ▽ More Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO)-a technique used in DeepSeekR1-with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning. △ Less

Submitted 25 February, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

arXiv:2502.12591 [pdf, other]

CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base

Authors: Cong-Duy Nguyen, Xiaobao Wu, Duc Anh Vu, Shuai Zhao, Thong Nguyen, Anh Tuan Luu

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validat… ▽ More Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste\&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, demonstrate that CutPaste\&Find achieves competitive hallucination detection performance while being significantly more efficient and cost-effective than previous methods. △ Less

Submitted 18 February, 2025; originally announced February 2025.

arXiv:2501.07192 [pdf, other]

A4O: All Trigger for One sample

Authors: Duc Anh Vu, Anh Tuan Tran, Cong Tran, Cuong Pham

Abstract: Backdoor attacks have become a critical threat to deep neural networks (DNNs), drawing many research interests. However, most of the studied attacks employ a single type of trigger. Consequently, proposed backdoor defenders often rely on the assumption that triggers would appear in a unified way. In this paper, we show that this naive assumption can create a loophole, allowing more sophisticated b… ▽ More Backdoor attacks have become a critical threat to deep neural networks (DNNs), drawing many research interests. However, most of the studied attacks employ a single type of trigger. Consequently, proposed backdoor defenders often rely on the assumption that triggers would appear in a unified way. In this paper, we show that this naive assumption can create a loophole, allowing more sophisticated backdoor attacks to bypass. We design a novel backdoor attack mechanism that incorporates multiple types of backdoor triggers, focusing on stealthiness and effectiveness. Our journey begins with the intriguing observation that the performance of a backdoor attack in deep learning models, as well as its detectability and removability, are all proportional to the magnitude of the trigger. Based on this correlation, we propose reducing the magnitude of each trigger type and combining them to achieve a strong backdoor relying on the combined trigger while still staying safely under the radar of defenders. Extensive experiments on three standard datasets demonstrate that our method can achieve high attack success rates (ASRs) while consistently bypassing state-of-the-art defenses. △ Less

Submitted 13 January, 2025; originally announced January 2025.

arXiv:2411.18126 [pdf, other]

Curriculum Demonstration Selection for In-Context Learning

Authors: Duc Anh Vu, Nguyen Tran Cong Duy, Xiaobao Wu, Hoang Minh Nhat, Du Mingzhe, Nguyen Thanh Thong, Anh Tuan Luu

Abstract: Large Language Models (LLMs) have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples… ▽ More Large Language Models (LLMs) have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples by their complexity measurements. Following curriculum learning, CDS then selects demonstrations from easy to difficult. Thus the selected demonstrations cover a wide range of difficulty levels, enabling LLMs to learn from varied complexities within the training set. Experiments demonstrate that our CDS consistently outperforms baseline methods, achieving notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves especially effective in enhancing LLM performance in solving challenging problems. △ Less

Submitted 15 December, 2024; v1 submitted 27 November, 2024; originally announced November 2024.

Comments: Accepted at the 40th ACM/SIGAPP Symposium On Applied Computing (SAC 2025), Main Conference

arXiv:2411.11017 [pdf, other]

A Study of Malware Prevention in Linux Distributions

Authors: Duc-Ly Vu, Trevor Dunlap, Karla Obermeier-Velazquez, Paul Gibert, John Speed Meyers, Santiago Torres-Arias

Abstract: Malicious attacks on open source software packages are a growing concern. This concern morphed into a panic-inducing crisis after the revelation of the XZ Utils backdoor, which would have provided the attacker with, according to one observer, a "skeleton key" to the internet. This study therefore explores the challenges of preventing and detecting malware in Linux distribution package repositories… ▽ More Malicious attacks on open source software packages are a growing concern. This concern morphed into a panic-inducing crisis after the revelation of the XZ Utils backdoor, which would have provided the attacker with, according to one observer, a "skeleton key" to the internet. This study therefore explores the challenges of preventing and detecting malware in Linux distribution package repositories. To do so, we ask two research questions: (1) What measures have Linux distributions implemented to counter malware, and how have maintainers experienced these efforts? (2) How effective are current malware detection tools at identifying malicious Linux packages? To answer these questions, we conduct interviews with maintainers at several major Linux distributions and introduce a Linux package malware benchmark dataset. Using this dataset, we evaluate the performance of six open source malware detection scanners. Distribution maintainers, according to the interviews, have mostly focused on reproducible builds to date. Our interviews identified only a single Linux distribution, Wolfi OS, that performs active malware scanning. Using this new benchmark dataset, the evaluation found that the performance of existing open-source malware scanners is underwhelming. Most studied tools excel at producing false positives but only infrequently detect true malware. Those that avoid high false positive rates often do so at the expense of a satisfactory true positive. Our findings provide insights into Linux distribution package repositories' current practices for malware detection and demonstrate the current inadequacy of open-source tools designed to detect malicious Linux packages. △ Less

Submitted 25 November, 2024; v1 submitted 17 November, 2024; originally announced November 2024.

Comments: 14 pages, 3 figures, 11 tables

arXiv:2411.00005 [pdf, other]

Mastering the Craft of Data Synthesis for CodeLLMs

Authors: Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li

Abstract: Large language models (LLMs) have shown impressive performance in \emph{code} understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and tax… ▽ More Large language models (LLMs) have shown impressive performance in \emph{code} understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field. △ Less

Submitted 7 February, 2025; v1 submitted 16 October, 2024; originally announced November 2024.

Comments: Accepted at NAACL 2025

arXiv:2410.17410 [pdf, other]

Learning Graph Filters for Structure-Function Coupling based Hub Node Identification

Authors: Meiby Ortiz-Bouza, Duc Vu, Abdullah Karaaslanli, Selin Aviyente

Abstract: Over the past two decades, tools from network science have been leveraged to characterize the organization of both structural and functional networks of the brain. One such measure of network organization is hub node identification. Hubs are specialized nodes within a network that link distinct brain units corresponding to specialized functional processes. Conventional methods for identifying hub… ▽ More Over the past two decades, tools from network science have been leveraged to characterize the organization of both structural and functional networks of the brain. One such measure of network organization is hub node identification. Hubs are specialized nodes within a network that link distinct brain units corresponding to specialized functional processes. Conventional methods for identifying hub nodes utilize different types of centrality measures and participation coefficient to profile various aspects of nodal importance. These methods solely rely on the functional connectivity networks constructed from functional magnetic resonance imaging (fMRI), ignoring the structure-function coupling in the brain. In this paper, we introduce a graph signal processing (GSP) based hub detection framework that utilizes both the structural connectivity and the functional activation to identify hub nodes. The proposed framework models functional activity as graph signals on the structural connectivity. Hub nodes are then detected based on the premise that hub nodes are sparse, have higher level of activity compared to their neighbors, and the non-hub nodes' activity can be modeled as the output of a graph-based filter. Based on these assumptions, an optimization framework, GraFHub, is formulated to learn the coefficients of the optimal polynomial graph filter and detect the hub nodes. The proposed framework is evaluated on both simulated data and resting state fMRI (rs-fMRI) data from Human Connectome Project (HCP). △ Less

Submitted 22 October, 2024; originally announced October 2024.

Comments: 13 pages, 4 figures

arXiv:2410.15316 [pdf, other]

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha

Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into… ▽ More Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models. △ Less

Submitted 4 April, 2025; v1 submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.01173 [pdf, other]

Low depth amplitude estimation without really trying

Authors: Dinh-Long Vu, Bin Cheng, Patrick Rebentrost

Abstract: Standard quantum amplitude estimation algorithms provide quadratic speedup to Monte-Carlo simulations but require a circuit depth that scales as inverse of the estimation error. In view of the shallow depth in near-term devices, the precision achieved by these algorithms would be low. In this paper we bypass this limitation by performing the classical Monte-Carlo method on the quantum algorithm it… ▽ More Standard quantum amplitude estimation algorithms provide quadratic speedup to Monte-Carlo simulations but require a circuit depth that scales as inverse of the estimation error. In view of the shallow depth in near-term devices, the precision achieved by these algorithms would be low. In this paper we bypass this limitation by performing the classical Monte-Carlo method on the quantum algorithm itself, achieving a higher than classical precision using low-depth circuits. We require the quantum algorithm to be weakly biased in order to avoid error accumulation during this process. Our method is parallel and can be as weakly biased as the constituent algorithm in some cases. △ Less

Submitted 1 October, 2024; originally announced October 2024.

Comments: 31 pages, 7 figures, 2 tables

arXiv:2408.14176 [pdf, other]

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Authors: Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, Anh Tran

Abstract: In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modificat… ▽ More In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models. The project page is available at https://swiftbrushv2.github.io. △ Less

Submitted 27 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

Comments: Accepted to ECCV'24

arXiv:2408.00122 [pdf, other]

A Course Shared Task on Evaluating LLM Output for Clinical Questions

Authors: Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych

Abstract: This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students.… ▽ More This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments. △ Less

Submitted 31 July, 2024; originally announced August 2024.

Comments: accepted at the sixth Workshop on Teaching NLP (co-located with ACL 2024)

arXiv:2407.18789 [pdf, other]

Granularity is crucial when applying differential privacy to text: An investigation for neural machine translation

Authors: Doan Nam Long Vu, Timour Igamberdiev, Ivan Habernal

Abstract: Applying differential privacy (DP) by means of the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the choice of granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates on the sentence-level granularity. From the perspective of DP, this setup assumes that each senten… ▽ More Applying differential privacy (DP) by means of the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the choice of granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates on the sentence-level granularity. From the perspective of DP, this setup assumes that each sentence belongs to a single person and any two sentences in the training dataset are independent. This assumption is however violated in many real-world NMT datasets, e.g., those including dialogues. For proper application of DP we thus must shift from sentences to entire documents. In this paper, we investigate NMT at both the sentence and document levels, analyzing the privacy/utility trade-off for both scenarios, and evaluating the risks of not using the appropriate privacy granularity in terms of leaking personally identifiable information (PII). Our findings indicate that the document-level NMT system is more resistant to membership inference attacks, emphasizing the significance of using the appropriate granularity when working with DP. △ Less

Submitted 26 September, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

Comments: Accepted at EMNLP Findings 2024

arXiv:2405.18368 [pdf, other]

The 2024 Brain Tumor Segmentation (BraTS) Challenge: Glioma Segmentation on Post-treatment MRI

Authors: Maria Correia de Verdier, Rachit Saluja, Louis Gagnon, Dominic LaBella, Ujjwall Baid, Nourel Hoda Tahon, Martha Foltyn-Dumitru, Jikai Zhang, Maram Alafif, Saif Baig, Ken Chang, Gennaro D'Anna, Lisa Deptula, Diviya Gupta, Muhammad Ammar Haider, Ali Hussain, Michael Iv, Marinos Kontzialis, Paul Manning, Farzan Moodi, Teresa Nunes, Aaron Simon, Nico Sollmann, David Vu, Maruf Adewole , et al. (60 additional authors not shown)

Abstract: Gliomas are the most common malignant primary brain tumors in adults and one of the deadliest types of cancer. There are many challenges in treatment and monitoring due to the genetic diversity and high intrinsic heterogeneity in appearance, shape, histology, and treatment response. Treatments include surgery, radiation, and systemic therapies, with magnetic resonance imaging (MRI) playing a key r… ▽ More Gliomas are the most common malignant primary brain tumors in adults and one of the deadliest types of cancer. There are many challenges in treatment and monitoring due to the genetic diversity and high intrinsic heterogeneity in appearance, shape, histology, and treatment response. Treatments include surgery, radiation, and systemic therapies, with magnetic resonance imaging (MRI) playing a key role in treatment planning and post-treatment longitudinal assessment. The 2024 Brain Tumor Segmentation (BraTS) challenge on post-treatment glioma MRI will provide a community standard and benchmark for state-of-the-art automated segmentation models based on the largest expert-annotated post-treatment glioma MRI dataset. Challenge competitors will develop automated segmentation models to predict four distinct tumor sub-regions consisting of enhancing tissue (ET), surrounding non-enhancing T2/fluid-attenuated inversion recovery (FLAIR) hyperintensity (SNFH), non-enhancing tumor core (NETC), and resection cavity (RC). Models will be evaluated on separate validation and test datasets using standardized performance metrics utilized across the BraTS 2024 cluster of challenges, including lesion-wise Dice Similarity Coefficient and Hausdorff Distance. Models developed during this challenge will advance the field of automated MRI segmentation and contribute to their integration into clinical practice, ultimately enhancing patient care. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 10 pages, 4 figures, 1 table

arXiv:2403.16685 [pdf, other]

ToXCL: A Unified Framework for Toxic Speech Detection and Explanation

Authors: Nhat M. Hoang, Xuan Long Do, Duc Anh Do, Duc Anh Vu, Luu Anh Tuan

Abstract: The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit one consists of coded or indirect language. Therefore, it is crucial for models not only to detect implicit toxic speech but also to explain its toxicity. This draws a unique need for unified frameworks that can effectively d… ▽ More The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit one consists of coded or indirect language. Therefore, it is crucial for models not only to detect implicit toxic speech but also to explain its toxicity. This draws a unique need for unified frameworks that can effectively detect and explain implicit toxic speech. Prior works mainly formulated the task of toxic speech detection and explanation as a text generation problem. Nonetheless, models trained using this strategy can be prone to suffer from the consequent error propagation problem. Moreover, our experiments reveal that the detection results of such models are much lower than those that focus only on the detection task. To bridge these gaps, we introduce ToXCL, a unified framework for the detection and explanation of implicit toxic speech. Our model consists of three modules: a (i) Target Group Generator to generate the targeted demographic group(s) of a given post; an (ii) Encoder-Decoder Model in which the encoder focuses on detecting implicit toxic speech and is boosted by a (iii) Teacher Classifier via knowledge distillation, and the decoder generates the necessary explanation. ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly. △ Less

Submitted 20 May, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: Accepted at NAACL 2024 (Main Conference)

arXiv:2403.09302 [pdf, other]

StainFuser: Controlling Diffusion for Faster Neural Style Transfer in Multi-Gigapixel Histology Images

Authors: Robert Jewsbury, Ruoyu Wang, Abhir Bhalerao, Nasir Rajpoot, Quoc Dang Vu

Abstract: Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Dif… ▽ More Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M the largest stain normalization dataset to date of over 2 million histology images with neural style transfer for high-quality transformations. Trained on this data, StainFuser outperforms current state-of-the-art deep learning and handcrafted methods in terms of the quality of normalized images and in terms of downstream model performance on the CoNIC dataset. △ Less

Submitted 12 July, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.19296 [pdf]

An AI based Digital Score of Tumour-Immune Microenvironment Predicts Benefit to Maintenance Immunotherapy in Advanced Oesophagogastric Adenocarcinoma

Authors: Quoc Dang Vu, Caroline Fong, Anderley Gordon, Tom Lund, Tatiany L Silveira, Daniel Rodrigues, Katharina von Loga, Shan E Ahmed Raza, David Cunningham, Nasir Rajpoot

Abstract: Gastric and oesophageal (OG) cancers are the leading causes of cancer mortality worldwide. In OG cancers, recent studies have showed that PDL1 immune checkpoint inhibitors (ICI) in combination with chemotherapy improves patient survival. However, our understanding of the tumour immune microenvironment in OG cancers remains limited. In this study, we interrogate multiplex immunofluorescence (mIF) i… ▽ More Gastric and oesophageal (OG) cancers are the leading causes of cancer mortality worldwide. In OG cancers, recent studies have showed that PDL1 immune checkpoint inhibitors (ICI) in combination with chemotherapy improves patient survival. However, our understanding of the tumour immune microenvironment in OG cancers remains limited. In this study, we interrogate multiplex immunofluorescence (mIF) images taken from patients with advanced Oesophagogastric Adenocarcinoma (OGA) who received first-line fluoropyrimidine and platinum-based chemotherapy in the PLATFORM trial (NCT02678182) to predict the efficacy of the treatment and to explore the biological basis of patients responding to maintenance durvalumab (PDL1 inhibitor). Our proposed Artificial Intelligence (AI) based marker successfully identified responder from non-responder (p < 0.05) as well as those who could potentially benefit from ICI with statistical significance (p < 0.05) for both progression free and overall survival. Our findings suggest that T cells that express FOXP3 seem to heavily influence the patient treatment response and survival outcome. We also observed that higher levels of CD8+PD1+ cells are consistently linked to poor prognosis for both OS and PFS, regardless of ICI. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2401.14268 [pdf, other]

doi 10.1145/3654777.3676356

GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning

Authors: Minh Duc Vu, Han Wang, Zhuang Li, Jieshan Chen, Shengdong Zhao, Zhenchang Xing, Chunyang Chen

Abstract: Virtual assistants have the potential to play an important role in helping users achieves different tasks. However, these systems face challenges in their real-world usability, characterized by inefficiency and struggles in grasping user intentions. Leveraging recent advances in Large Language Models (LLMs), we introduce GptVoiceTasker, a virtual assistant poised to enhance user experiences and ta… ▽ More Virtual assistants have the potential to play an important role in helping users achieves different tasks. However, these systems face challenges in their real-world usability, characterized by inefficiency and struggles in grasping user intentions. Leveraging recent advances in Large Language Models (LLMs), we introduce GptVoiceTasker, a virtual assistant poised to enhance user experiences and task efficiency on mobile devices. GptVoiceTasker excels at intelligently deciphering user commands and executing relevant device interactions to streamline task completion. The system continually learns from historical user commands to automate subsequent usages, further enhancing execution efficiency. Our experiments affirm GptVoiceTasker's exceptional command interpretation abilities and the precision of its task automation module. In our user study, GptVoiceTasker boosted task efficiency in real-world scenarios by 34.85%, accompanied by positive participant feedback. We made GptVoiceTasker open-source, inviting further research into LLMs utilization for diverse tasks through prompt engineering and leveraging user usage data to improve efficiency. △ Less

Submitted 13 August, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: This paper has been accepted by UIST 2024

arXiv:2312.17205 [pdf, other]

EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

Authors: Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

Abstract: The existing facial datasets, while having plentiful images at near frontal views, lack images with extreme head poses, leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k high-quality images of… ▽ More The existing facial datasets, while having plentiful images at near frontal views, lack images with extreme head poses, leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k high-quality images of faces at extreme poses. To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks, such as facial synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, and face reenactment. Specifically, training with EFHQ helps models generalize well across diverse poses, significantly improving performance in scenarios involving extreme views, confirmed by extensive experiments. Additionally, we utilize EFHQ to define a challenging cross-view face verification benchmark, in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios, aiming to stimulate studies on face recognition under severe pose conditions in the wild. △ Less

Submitted 11 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Project Page: https://bomcon123456.github.io/efhq/

arXiv:2312.09982 [pdf, other]

doi 10.1109/CASES60062.2024.00011

ACPO: AI-Enabled Compiler Framework

Authors: Amir H. Ashouri, Muhammad Asif Manzoor, Duc Minh Vu, Raymond Zhang, Colin Toft, Ziwen Wang, Angel Zhang, Bryan Chan, Tomasz S. Czajkowski, Yaoqing Gao

Abstract: The key to performance optimization of a program is to decide correctly when a certain transformation should be applied by a compiler. This is an ideal opportunity to apply machine-learning models to speed up the tuning process; while this realization has been around since the late 90s, only recent advancements in ML enabled a practical application of ML to compilers as an end-to-end framework.… ▽ More The key to performance optimization of a program is to decide correctly when a certain transformation should be applied by a compiler. This is an ideal opportunity to apply machine-learning models to speed up the tuning process; while this realization has been around since the late 90s, only recent advancements in ML enabled a practical application of ML to compilers as an end-to-end framework. This paper presents ACPO: An AI-Enabled Compiler Framework, a novel framework that provides LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes. We first showcase the high-level view, class hierarchy, and functionalities of ACPO and subsequently, demonstrate \taco{a couple of use cases of ACPO by ML-enabling the Loop Unroll and Function Inlining passes used in LLVM's O3. and finally, describe how ACPO can be leveraged to optimize other passes. Experimental results reveal that the ACPO model for Loop Unroll can gain on average 4%, 3%, 5.4%, and 0.2% compared to LLVM's vanilla O3 optimization when deployed on Polybench, Coral-2, CoreMark, and Graph-500, respectively. Furthermore, by including both Function Inlining and Loop Unroll models, ACPO can provide a combined speedup of 4.5% on Polybench and 2.4% on Cbench when compared with LLVM's O3, respectively. △ Less

Submitted 13 January, 2025; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: ACPO (12 pages)

ACM Class: I.2.5; D.3.0; I.2.6

arXiv:2312.02227 [pdf, other]

Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation

Authors: Cong-Duy Nguyen, Thong Nguyen, Duc Anh Vu, Luu Anh Tuan

Abstract: The effectiveness of a model is heavily reliant on the quality of the fusion representation of multiple modalities in multimodal sentiment analysis. Moreover, each modality is extracted from raw input and integrated with the rest to construct a multimodal representation. Although previous methods have proposed multimodal representations and achieved promising results, most of them focus on forming… ▽ More The effectiveness of a model is heavily reliant on the quality of the fusion representation of multiple modalities in multimodal sentiment analysis. Moreover, each modality is extracted from raw input and integrated with the rest to construct a multimodal representation. Although previous methods have proposed multimodal representations and achieved promising results, most of them focus on forming positive and negative pairs, neglecting the variation in sentiment scores within the same class. Additionally, they fail to capture the significance of unimodal representations in the fusion vector. To address these limitations, we introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality. Our experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of our approach. △ Less

Submitted 3 December, 2023; originally announced December 2023.

arXiv:2312.01661 [pdf, other]

ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions

Authors: Phuoc Pham Van Long, Duc Anh Vu, Nhat M. Hoang, Xuan Long Do, Anh Tuan Luu

Abstract: Mathematical questioning is crucial for assessing students problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models(LLMs)… ▽ More Mathematical questioning is crucial for assessing students problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models(LLMs) such as ChatGPT have excelled in many NLP tasks involving logical and arithmetic reasoning. Nonetheless, their applications in generating educational questions are underutilized, especially in the field of mathematics. To bridge this gap, we take the first step to conduct an in-depth analysis of ChatGPT in generating pre-university math questions. Our analysis is categorized into two main settings: context-aware and context-unaware. In the context-aware setting, we evaluate ChatGPT on existing math question-answering benchmarks covering elementary, secondary, and ternary classes. In the context-unaware setting, we evaluate ChatGPT in generating math questions for each lesson from pre-university math curriculums that we crawl. Our crawling results in TopicMath, a comprehensive and novel collection of pre-university math curriculums collected from 121 math topics and 428 lessons from elementary, secondary, and tertiary classes. Through this analysis, we aim to provide insight into the potential of ChatGPT as a math questioner. △ Less

Submitted 27 February, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted at the 39th ACM/SIGAPP Symposium On Applied Computing (SAC 2024), Main Conference

arXiv:2311.14465 [pdf, other]

DP-NMT: Scalable Differentially-Private Machine Translation

Authors: Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal

Abstract: Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implemen… ▽ More Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community. △ Less

Submitted 24 April, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

Comments: Accepted at EACL 2024

arXiv:2311.08834 [pdf, ps, other]

A* search algorithm for an optimal investment problem in vehicle-sharing systems

Authors: Ba Luat Le, Layla Martin, Emrah Demir, Duc Minh Vu

Abstract: We study an optimal investment problem that arises in the context of the vehicle-sharing system. Given a set of locations to build stations, we need to determine i) the sequence of stations to be built and the number of vehicles to acquire in order to obtain the target state where all stations are built, and ii) the number of vehicles to acquire and their allocation in order to maximize the total… ▽ More We study an optimal investment problem that arises in the context of the vehicle-sharing system. Given a set of locations to build stations, we need to determine i) the sequence of stations to be built and the number of vehicles to acquire in order to obtain the target state where all stations are built, and ii) the number of vehicles to acquire and their allocation in order to maximize the total profit returned by operating the system when some or all stations are open. The profitability associated with operating open stations, measured over a specific time period, is represented as a linear optimization problem applied to a collection of open stations. With operating capital, the owner of the system can open new stations. This property introduces a set-dependent aspect to the duration required for opening a new station, and the optimal investment problem can be viewed as a variant of the Traveling Salesman Problem (TSP) with set-dependent cost. We propose an A* search algorithm to address this particular variant of the TSP. Computational experiments highlight the benefits of the proposed algorithm in comparison to the widely recognized Dijkstra algorithm and propose future research to explore new possibilities and applications for both exact and approximate A* algorithms. △ Less

Submitted 15 November, 2023; originally announced November 2023.

Comments: Full version of the conference paper which is accepted to be appear in the proceeding of the The 12th International Conference on Computational Data and Social Networks - SCONET2023

arXiv:2311.08111 [pdf, ps, other]

Solving Time-Dependent Traveling Salesman Problem with Time Windows under Generic Time-Dependent Travel Cost

Authors: Duc Minh Vu, Mike Hewitt, Duc Duy Vu

Abstract: In this paper, we present formulations and an exact method to solve the Time Dependent Traveling Salesman Problem with Time Window (TD-TSPTW) under a generic travel cost function where waiting is allowed. A particular case in which the travel cost is a non-decreasing function has been addressed recently. With that assumption, because of both the First-In-First-Out property of the travel time funct… ▽ More In this paper, we present formulations and an exact method to solve the Time Dependent Traveling Salesman Problem with Time Window (TD-TSPTW) under a generic travel cost function where waiting is allowed. A particular case in which the travel cost is a non-decreasing function has been addressed recently. With that assumption, because of both the First-In-First-Out property of the travel time function and the non-decreasing property of the travel cost function, we can ignore the possibility of waiting. However, for generic travel cost functions, waiting after visiting some locations can be part of optimal solutions. To handle the general case, we introduce new lower-bound formulations that allow us to ensure the existence of optimal solutions. We adapt the existing algorithm for TD-TSPTW with non-decreasing travel costs to solve the TD-TSPTW with generic travel costs. In the experiment, we evaluate the strength of the proposed lower bound formulations and algorithm by applying them to solve the TD-TSPTW with the total travel time objective. The results indicate that the proposed algorithm is competitive with and even outperforms the state-of-art solver in various benchmark instances. △ Less

Submitted 14 November, 2023; originally announced November 2023.

Comments: Full version with Appendix - accepted to appear in the proceeding of SCONET2003 conference - https://csonet-conf.github.io/csonet23/index.html

arXiv:2309.01656 [pdf, other]

Building Footprint Extraction in Dense Areas using Super Resolution and Frame Field Learning

Authors: Vuong Nguyen, Anh Ho, Duc-Anh Vu, Nguyen Thi Ngoc Anh, Tran Ngoc Thang

Abstract: Despite notable results on standard aerial datasets, current state-of-the-arts fail to produce accurate building footprints in dense areas due to challenging properties posed by these areas and limited data availability. In this paper, we propose a framework to address such issues in polygonal building extraction. First, super resolution is employed to enhance the spatial resolution of aerial imag… ▽ More Despite notable results on standard aerial datasets, current state-of-the-arts fail to produce accurate building footprints in dense areas due to challenging properties posed by these areas and limited data availability. In this paper, we propose a framework to address such issues in polygonal building extraction. First, super resolution is employed to enhance the spatial resolution of aerial image, allowing for finer details to be captured. This enhanced imagery serves as input to a multitask learning module, which consists of a segmentation head and a frame field learning head to effectively handle the irregular building structures. Our model is supervised by adaptive loss weighting, enabling extraction of sharp edges and fine-grained polygons which is difficult due to overlapping buildings and low data quality. Extensive experiments on a slum area in India that mimics a dense area demonstrate that our proposed approach significantly outperforms the current state-of-the-art methods by a large margin. △ Less

Submitted 4 September, 2023; originally announced September 2023.

Comments: Accepted at The 12th International Conference on Awareness Science and Technology

arXiv:2308.16139 [pdf, other]

doi 10.1515/bmt-2024-0396

MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision

Authors: Jianning Li, Zongwei Zhou, Jiancheng Yang, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Chongyu Qu, Tiezheng Zhang, Xiaoxi Chen, Wenxuan Li, Marek Wodzinski, Paul Friedrich, Kangxian Xie, Yuan Jin, Narmada Ambigapathy, Enrico Nasca, Naida Solak, Gian Marco Melito, Viet Duc Vu, Afaque R. Memon, Christopher Schlachta, Sandrine De Ribaupierre, Rajnikant Patel, Roy Eagleson, Xiaojun Chen , et al. (132 additional authors not shown)

Abstract: Prior to the deep learning era, shape was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of Shape… ▽ More Prior to the deep learning era, shape was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of ShapeNet (about 51,300 models) and Princeton ModelNet (127,915 models). For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instrument, called MedShapeNet, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. As of today, MedShapeNet includes 23 dataset with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface (API) and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. Exemplary, we present use cases in the fields of classification of brain tumors, facial and skull reconstructions, multi-class anatomy completion, education, and 3D printing. In future, we will extend the data and improve the interfaces. The project pages are: https://medshapenet.ikim.nrw/ and https://github.com/Jianningli/medshapenet-feedback △ Less

Submitted 12 December, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: 16 pages

MSC Class: 68T01

arXiv:2307.10212 [pdf, other]

doi 10.1587/transfun.2020EAP1101

Capsule network with shortcut routing

Authors: Dang Thanh Vu, Vo Hoang Trong, Yu Gwang-Hyun, Kim Jin-Young

Abstract: This study introduces "shortcut routing," a novel routing mechanism in capsule networks that addresses computational inefficiencies by directly activating global capsules from local capsules, eliminating intermediate layers. An attention-based approach with fuzzy coefficients is also explored for improved efficiency. Experimental results on Mnist, smallnorb, and affNist datasets show comparable cl… ▽ More This study introduces "shortcut routing," a novel routing mechanism in capsule networks that addresses computational inefficiencies by directly activating global capsules from local capsules, eliminating intermediate layers. An attention-based approach with fuzzy coefficients is also explored for improved efficiency. Experimental results on Mnist, smallnorb, and affNist datasets show comparable classification performance, achieving accuracies of 99.52%, 93.91%, and 89.02% respectively. The proposed fuzzy-based and attention-based routing methods significantly reduce the number of calculations by 1.42 and 2.5 times compared to EM routing, highlighting their computational advantages in capsule networks. These findings contribute to the advancement of efficient and accurate hierarchical pattern representation models. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: 8 pages, published at IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences E104.A(8)

arXiv:2305.08409 [pdf, other]

Validity Constraints for Data Analysis Workflows

Authors: Florian Schintke, Ninon De Mecquenem, David Frantz, Vanessa Emanuela Guarino, Marcus Hilbrich, Fabian Lehmann, Rebecca Sattler, Jan Arne Sparka, Daniel Speckhard, Hermann Stolte, Anh Duc Vu, Ulf Leser

Abstract: Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even only a new dataset with some notably different properties is often challenging. Despite the structured definition of the steps (tasks) and their interdependencies during a complex data analysis in the DAW specification, relevant assumptions may remain unspecified and implicit. Such hidden a… ▽ More Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even only a new dataset with some notably different properties is often challenging. Despite the structured definition of the steps (tasks) and their interdependencies during a complex data analysis in the DAW specification, relevant assumptions may remain unspecified and implicit. Such hidden assumptions often lead to crashing tasks without a reasonable error message, poor performance in general, non-terminating executions, or silent wrong results of the DAW, to name only a few possible consequences. Searching for the causes of such errors and drawbacks in a distributed compute cluster managed by a complex infrastructure stack, where DAWs for large datasets typically are executed, can be tedious and time-consuming. We propose validity constraints (VCs) as a new concept for DAW languages to alleviate this situation. A VC is a constraint specifying some logical conditions that must be fulfilled at certain times for DAW executions to be valid. When defined together with a DAW, VCs help to improve the portability, adaptability, and reusability of DAWs by making implicit assumptions explicit. Once specified, VC can be controlled automatically by the DAW infrastructure, and violations can lead to meaningful error messages and graceful behaviour (e.g., termination or invocation of repair mechanisms). We provide a broad list of possible VCs, classify them along multiple dimensions, and compare them to similar concepts one can find in related fields. We also provide a first sketch for VCs' implementation into existing DAW infrastructures. △ Less

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2305.05198 [pdf, other]

doi 10.1145/3581998

Voicify Your UI: Towards Android App Control with Voice Commands

Authors: Minh Duc Vu, Han Wang, Zhuang Li, Gholamreza Haffari, Zhenchang Xing, Chunyang Chen

Abstract: Nowadays, voice assistants help users complete tasks on the smartphone with voice commands, replacing traditional touchscreen interactions when such interactions are inhibited. However, the usability of those tools remains moderate due to the problems in understanding rich language variations in human commands, along with efficiency and comprehensibility issues. Therefore, we introduce Voicify, an… ▽ More Nowadays, voice assistants help users complete tasks on the smartphone with voice commands, replacing traditional touchscreen interactions when such interactions are inhibited. However, the usability of those tools remains moderate due to the problems in understanding rich language variations in human commands, along with efficiency and comprehensibility issues. Therefore, we introduce Voicify, an Android virtual assistant that allows users to interact with on-screen elements in mobile apps through voice commands. Using a novel deep learning command parser, Voicify interprets human verbal input and performs matching with UI elements. In addition, the tool can directly open a specific feature from installed applications by fetching application code information to explore the set of in-app components. Our command parser achieved 90\% accuracy on the human command dataset. Furthermore, the direct feature invocation module achieves better feature coverage in comparison to Google Assistant. The user study demonstrates the usefulness of Voicify in real-world scenarios. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Journal ref: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, no. 1 (2023): 1-22

arXiv:2303.06274 [pdf]

CoNIC Challenge: Pushing the Frontiers of Nuclear Detection, Segmentation, Classification and Counting

Authors: Simon Graham, Quoc Dang Vu, Mostafa Jahanifar, Martin Weigert, Uwe Schmidt, Wenhua Zhang, Jun Zhang, Sen Yang, Jinxi Xiang, Xiyue Wang, Josef Lorenz Rumberger, Elias Baumann, Peter Hirsch, Lihao Liu, Chenyang Hong, Angelica I. Aviles-Rivero, Ayushi Jain, Heeyoung Ahn, Yiyu Hong, Hussam Azzuni, Min Xu, Mohammad Yaqub, Marie-Claire Blache, Benoît Piégu, Bertrand Vernay , et al. (64 additional authors not shown)

Abstract: Nuclear detection, segmentation and morphometric profiling are essential in helping us further understand the relationship between histology and patient outcome. To drive innovation in this area, we setup a community-wide challenge using the largest available dataset of its kind to assess nuclear segmentation and cellular composition. Our challenge, named CoNIC, stimulated the development of repro… ▽ More Nuclear detection, segmentation and morphometric profiling are essential in helping us further understand the relationship between histology and patient outcome. To drive innovation in this area, we setup a community-wide challenge using the largest available dataset of its kind to assess nuclear segmentation and cellular composition. Our challenge, named CoNIC, stimulated the development of reproducible algorithms for cellular recognition with real-time result inspection on public leaderboards. We conducted an extensive post-challenge analysis based on the top-performing models using 1,658 whole-slide images of colon tissue. With around 700 million detected nuclei per model, associated features were used for dysplasia grading and survival analysis, where we demonstrated that the challenge's improvement over the previous state-of-the-art led to significant boosts in downstream performance. Our findings also suggest that eosinophils and neutrophils play an important role in the tumour microevironment. We release challenge models and WSI-level results to foster the development of further methods for biomarker discovery. △ Less

Submitted 14 March, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

Showing 1–50 of 93 results for author: Vu, D