Search | arXiv e-print repository

Embedding-Based Federated Data Sharing via Differentially Private Conditional VAEs

Authors: Francesco Di Salvo, Hanh Huyen My Nguyen, Christian Ledig

Abstract: Deep Learning (DL) has revolutionized medical imaging, yet its adoption is constrained by data scarcity and privacy regulations, limiting access to diverse datasets. Federated Learning (FL) enables decentralized training but suffers from high communication costs and is often restricted to a single downstream task, reducing flexibility. We propose a data-sharing method via Differentially Private (DP) generative models. By adopting foundation models, we extract compact, informative embeddings, reducing redundancy and lowering computational overhead. Clients collaboratively train a Differentially Private Conditional Variational Autoencoder (DP-CVAE) to model a global, privacy-aware data distribution, supporting diverse downstream tasks. Our approach, validated across multiple feature extractors, enhances privacy, scalability, and efficiency, outperforming traditional FL classifiers while ensuring differential privacy. Additionally, DP-CVAE produces higher-fidelity embeddings than DP-CGAN while requiring $5{\times}$ fewer parameters.

Submitted 3 July, 2025; originally announced July 2025.

Comments: Accepted to MICCAI 2025

arXiv:2506.08681 [pdf, ps, other]

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Authors: Phuc Minh Nguyen, Ngoc-Hieu Nguyen, Duy H. M. Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan

Abstract: Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called (IS-DAAs), multiplies the DAA objective with an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem. Our implementations are provided publicly at this link.

Submitted 11 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: First version
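
Illustrative note on the entry above: the abstract describes multiplying a DAA objective by an importance ratio and clipping that ratio to a maximum value. The sketch below shows one way this could look for a DPO-style loss on a single preference pair. It is not the authors' implementation: the function names, the choice of ratio (policy over reference probability of the pair), and the default `max_ratio` are all assumptions made for illustration.

```python
import math


def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities.

    lp_w / lp_l: policy log-probs of the chosen / rejected response.
    ref_w / ref_l: reference-policy log-probs of the same responses.
    """
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def clipped_ratio(lp_w, lp_l, ref_w, ref_l, max_ratio=5.0):
    """Importance ratio of the pair, clipped from above (assumed form)."""
    log_ratio = (lp_w + lp_l) - (ref_w + ref_l)
    return min(math.exp(log_ratio), max_ratio)


def is_weighted_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1, max_ratio=5.0):
    """DPO loss scaled by the clipped importance ratio (hypothetical sketch)."""
    weight = clipped_ratio(lp_w, lp_l, ref_w, ref_l, max_ratio)
    return weight * dpo_loss(lp_w, lp_l, ref_w, ref_l, beta)
```

In this sketch the clip bounds the weight of any single pair, which is the variance-control idea the abstract mentions; in a gradient-based setup the weight would typically be treated as a constant, which is again an assumption here.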

arXiv:2505.19080 [pdf, ps, other]

ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning

Authors: Tuan Van Vo, Tan Quang Nguyen, Khang Minh Nguyen, Duy Ho Minh Nguyen, Minh Nhat Vu

Abstract: Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into robotic actions. Despite their recent advancements, VLAs often overlook the explicit reasoning and only learn the functional input-action mappings, omitting these crucial logical steps for interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we use ReFineVLA to fine-tune pre-trained VLAs with the reasoning-enriched datasets, while maintaining their inherent generalization abilities and boosting reasoning capabilities. In addition, we conduct an attention map visualization to analyze the alignment among visual attention, linguistic prompts, and to-be-executed actions of ReFineVLA, showcasing its ability to focus on relevant tasks and actions. Through the latter step, we explore that ReFineVLA-trained models exhibit a meaningful attention shift towards relevant objects, highlighting the enhanced multimodal understanding and improved generalization. Evaluated across manipulation tasks, ReFineVLA outperforms the state-of-the-art baselines. Specifically, it achieves an average increase of $5.0\%$ success rate on SimplerEnv WidowX Robot tasks, improves by an average of $8.6\%$ in variant aggregation settings, and by $1.7\%$ in visual matching settings for SimplerEnv Google Robot tasks. The source code will be publicly available.

Submitted 25 May, 2025; originally announced May 2025.

Comments: 10 pages

arXiv:2505.03770 [pdf, other]

Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind

Authors: Mouad Abrini, Omri Abend, Dina Acklin, Henny Admoni, Gregor Aichinger, Nitay Alon, Zahra Ashktorab, Ashish Atreja, Moises Auron, Alexander Aufreiter, Raghav Awasthi, Soumya Banerjee, Joe M. Barnby, Rhea Basappa, Severin Bergsmann, Djallel Bouneffouf, Patrick Callaghan, Marc Cavazza, Thierry Chaminade, Sonia Chernova, Mohamed Chetouan, Moumita Choudhury, Axel Cleeremans, Jacek B. Cywinski, Fabio Cuzzolin, et al. (83 additional authors not shown)

Abstract: This volume includes a selection of papers presented at the Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2025 in Philadelphia US on 3rd March 2025. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.

Submitted 28 April, 2025; originally announced May 2025.

Comments: workshop proceedings

arXiv:2504.14898 [pdf, other]

Expected Free Energy-based Planning as Variational Inference

Authors: Bert de Vries, Wouter Nuijten, Thijs van de Laar, Wouter Kouw, Sepideh Adamiat, Tim Nisslbeck, Mykola Lukashchuk, Hoang Minh Huu Nguyen, Marco Hidalgo Araya, Raphael Tresor, Thijs Jenneskens, Ivana Nikoloska, Raaja Ganapathy Subramanian, Bart van Erp, Dmitry Bagaev, Albert Podusenko

Abstract: We address the problem of planning under uncertainty, where an agent must choose actions that not only achieve desired outcomes but also reduce uncertainty. Traditional methods often treat exploration and exploitation as separate objectives, lacking a unified inferential foundation. Active inference, grounded in the Free Energy Principle, provides such a foundation by minimizing Expected Free Energy (EFE), a cost function that combines utility with epistemic drives, such as ambiguity resolution and novelty seeking. However, the computational burden of EFE minimization had remained a significant obstacle to its scalability. In this paper, we show that EFE-based planning arises naturally from minimizing a variational free energy functional on a generative model augmented with preference and epistemic priors. This result reinforces theoretical consistency with the Free Energy Principle by casting planning under uncertainty itself as a form of variational inference. Our formulation yields policies that jointly support goal achievement and information gain, while incorporating a complexity term that accounts for bounded computational resources. This unifying framework connects and extends existing methods, enabling scalable, resource-aware implementations of active inference agents.

Submitted 23 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

Comments: 18 pages

arXiv:2503.12722 [pdf, other]

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Authors: Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

Abstract: As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.

Submitted 16 March, 2025; originally announced March 2025.

Comments: Poster, Technical AI Safety Conference 2025

arXiv:2503.10728 [pdf, other]

DarkBench: Benchmarking Dark Patterns in Large Language Models

Authors: Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz

Abstract: We introduce DarkBench, a comprehensive benchmark for detecting dark design patterns--manipulative techniques that influence user behavior--in interactions with large language models (LLMs). Our benchmark comprises 660 prompts across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation, and sneaking. We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors. Companies developing LLMs should recognize and mitigate the impact of dark design patterns to promote more ethical AI.

Submitted 13 March, 2025; originally announced March 2025.

Comments: Accepted as an Oral paper at ICLR 2025

arXiv:2502.14412 [pdf, other]

Evaluating Precise Geolocation Inference Capabilities of Vision Language Models

Authors: Neel Jay, Hieu Minh Nguyen, Trung Dung Hoang, Jacob Haimes

Abstract: The prevalence of Vision-Language Models (VLMs) raises important questions about privacy in an era where visual information is increasingly available. While foundation VLMs demonstrate broad knowledge and learned capabilities, we specifically investigate their ability to infer geographic location from previously unseen image data. This paper introduces a benchmark dataset collected from Google Street View that represents its global distribution of coverage. Foundation models are evaluated on single-image geolocation inference, with many achieving median distance errors of <300 km. We further evaluate VLM "agents" with access to supplemental tools, observing up to a 30.6% decrease in distance error. Our findings establish that modern foundation VLMs can act as powerful image geolocation tools, without being specifically trained for this task. When coupled with increasing accessibility of these models, our findings have greater implications for online privacy. We discuss these risks, as well as future work in this area.

Submitted 20 February, 2025; originally announced February 2025.

Comments: AAAI 2025 Workshop DATASAFE

arXiv:2502.06470 [pdf, ps, other]

A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks

Authors: Hieu Minh "Jord" Nguyen

Abstract: Theory of Mind (ToM), the ability to attribute mental states to others and predict their behaviour, is fundamental to social intelligence. In this paper, we survey studies evaluating behavioural and representational ToM in Large Language Models (LLMs), identify important safety risks from advanced LLM ToM capabilities, and suggest several research directions for effective evaluation and mitigation of these risks.

Submitted 10 February, 2025; originally announced February 2025.

Comments: Advancing Artificial Intelligence through Theory of Mind Workshop, AAAI 2025

arXiv:2502.02118 [pdf, other]

BRIDLE: Generalized Self-supervised Learning with Quantization

Authors: Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu

Abstract: Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.

Submitted 4 February, 2025; originally announced February 2025.
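
Illustrative note on the entry above: the abstract contrasts single-codebook vector quantization with residual quantization (RQ) over multiple hierarchical codebooks. The sketch below shows the generic RQ encoding step only, where each stage quantizes whatever the previous stages left over. It is a textbook-style illustration, not BRIDLE's tokenizer; the function names and the toy codebooks are assumptions.

```python
def nearest(codebook, vec):
    """Index of the code vector closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def residual_quantize(vec, codebooks):
    """Encode vec with a sequence of codebooks, one code index per stage.

    Returns the list of chosen indices and the reconstruction obtained by
    summing the chosen code vectors.
    """
    residual = list(vec)
    approx = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Add this stage's code to the reconstruction and pass on what is left.
        approx = [a + c for a, c in zip(approx, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, approx


# Toy example: a coarse codebook followed by a finer one (hypothetical values).
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, -0.25]]
codes, approx = residual_quantize([1.2, 0.8], [coarse, fine])
```

With these toy values the second stage refines the coarse choice, so the two-stage reconstruction is closer to the input than the first stage alone.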

arXiv:2406.13997 [pdf, other]

"Global is Good, Local is Bad?": Understanding Brand Bias in LLMs

Authors: Mahammed Kamruzzaman, Hieu Minh Nguyen, Gene Louis Kim

Abstract: Many recent studies have investigated social biases in LLMs but brand bias has received little attention. This research examines the biases exhibited by LLMs towards different brands, a significant concern given the widespread use of LLMs in affected use cases such as product recommendation and market analysis. Biased models may perpetuate societal inequalities, unfairly favoring established global brands while marginalizing local ones. Using a curated dataset across four brand categories, we probe the behavior of LLMs in this space. We find a consistent pattern of bias in this space -- both in terms of disproportionately associating global brands with positive attributes and disproportionately recommending luxury gifts for individuals in high-income countries. We also find LLMs are subject to country-of-origin effects which may boost local brand preference in LLM outputs in specific contexts.

Submitted 27 September, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted at EMNLP-2024 (main)

arXiv:2404.02949 [pdf, other]

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Authors: Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell

Abstract: Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.

Submitted 3 April, 2024; originally announced April 2024.

Comments: Competition for SaTML 2024

arXiv:2312.07784 [pdf, other]

Robust MRI Reconstruction by Smoothed Unrolling (SMUG)

Authors: Shijun Liang, Van Hoang Minh Nguyen, Jinghan Jia, Ismail Alkhouri, Sijia Liu, Saiprasad Ravishankar

Abstract: As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case additive perturbations. This sensitivity often leads to unstable, aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that can be robust to train-test variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noises, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of a DL-based MRI reconstruction model. Compared to the vanilla RS approach, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps. Furthermore, we theoretically analyze the robustness of our method in the presence of perturbations.

Submitted 19 August, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.11003 [pdf, other]

Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models

Authors: Xuefeng Gao, Hoang M. Nguyen, Lingjiong Zhu

Abstract: Score-based generative models (SGMs) is a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.

Submitted 15 February, 2025; v1 submitted 18 November, 2023; originally announced November 2023.

arXiv:2211.07166 [pdf, other]

Optimal Privacy Preserving for Federated Learning in Mobile Edge Computing

Authors: Hai M. Nguyen, Nam H. Chu, Diep N. Nguyen, Dinh Thai Hoang, Van-Dinh Nguyen, Minh Hoang Ha, Eryk Dutkiewicz, Marwan Krunz

Abstract: Federated Learning (FL) with quantization and deliberately added noise over wireless networks is a promising approach to preserve user differential privacy (DP) while reducing wireless resources. Specifically, an FL process can be fused with quantized Binomial mechanism-based updates contributed by multiple users. However, optimizing quantization parameters, communication resources (e.g., transmit power, bandwidth, and quantization bits), and the added noise to guarantee the DP requirement and performance of the learned FL model remains an open and challenging problem. This article aims to jointly optimize the quantization and Binomial mechanism parameters and communication resources to maximize the convergence rate under the constraints of the wireless network and DP requirement. To that end, we first derive a novel DP budget estimation of the FL with quantization/noise that is tighter than the state-of-the-art bound. We then provide a theoretical bound on the convergence rate. This theoretical bound is decomposed into two components, including the variance of the global gradient and the quadratic bias that can be minimized by optimizing the communication resources, and quantization/noise parameters. The resulting optimization turns out to be a Mixed-Integer Non-linear Programming (MINLP) problem. To tackle it, we first transform this MINLP problem into a new problem whose solutions are proved to be the optimal solutions of the original one. We then propose an approximate algorithm to solve the transformed problem with an arbitrary relative error guarantee. Extensive simulations show that under the same wireless resource constraints and DP protection requirements, the proposed approximate algorithm achieves an accuracy close to the accuracy of the conventional FL without quantization/noise. The results can achieve a higher convergence rate while preserving users' privacy.

Submitted 20 May, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

Comments: 16 pages, 10 figures

arXiv:2201.00132 [pdf, other]

doi 10.1109/ICMLA51294.2020.00223

SAFL: A Self-Attention Scene Text Recognizer with Focal Loss

Authors: Bao Hieu Tran, Thanh Le-Cong, Huu Manh Nguyen, Duc Anh Le, Thanh Hung Nguyen, Phi Le Nguyen

Abstract: In the last decades, scene text recognition has gained worldwide attention from both the academic community and actual users due to its importance in a wide range of applications. Despite achievements in optical character recognition, scene text recognition remains challenging due to inherent problems such as distortions or irregular layout. Most of the existing approaches mainly leverage recurrence or convolution-based neural networks. However, while recurrent neural networks (RNNs) usually suffer from slow training speed due to sequential computation and encounter problems as vanishing gradient or bottleneck, CNN endures a trade-off between complexity and performance. In this paper, we introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition, to overcome the limitation of the existing approaches. The use of focal loss instead of negative log-likelihood helps the model focus more on low-frequency samples training. Moreover, to deal with the distortions and irregular texts, we exploit Spatial Transformer Network (STN) to rectify text before passing to the recognition network. We perform experiments to compare the performance of the proposed model with seven benchmarks. The numerical results show that our model achieves the best performance.

Submitted 1 January, 2022; originally announced January 2022.

Comments: Accepted to ICMLA 2020

Journal ref: 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA)
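
Illustrative note on the entry above: the abstract says focal loss is used in place of negative log-likelihood so that training focuses more on low-frequency samples. The sketch below uses the commonly cited focal loss form, -(1 - p)^gamma * log(p), for a single prediction, where p is the probability assigned to the correct class. The default `gamma=2.0` and the function names are assumptions for illustration; SAFL's exact settings are not given in this listing.

```python
import math


def nll_loss(p_true):
    """Negative log-likelihood of the correct class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: NLL scaled by (1 - p_true) ** gamma.

    Confident, correct predictions (p_true near 1) are down-weighted,
    so poorly predicted samples dominate the total loss.
    """
    return ((1.0 - p_true) ** gamma) * nll_loss(p_true)


# How much of the NLL survives for an easy vs. a hard sample.
easy_share = focal_loss(0.9) / nll_loss(0.9)
hard_share = focal_loss(0.1) / nll_loss(0.1)
```

Setting `gamma` to 0 recovers plain negative log-likelihood, which makes the modulating factor the only difference between the two losses.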

arXiv:2106.05190 [pdf, ps, other]

DPER: Efficient Parameter Estimation for Randomly Missing Data

Authors: Thu Nguyen, Khoi Minh Nguyen-Duy, Duy Ho Minh Nguyen, Binh T. Nguyen, Bruce Alan Wade

Abstract: The missing data problem has been broadly studied in the last few decades and has various applications in different areas such as statistics or bioinformatics. Even though many methods have been developed to tackle this challenge, most of those are imputation techniques that require multiple iterations through the data before yielding convergence. In addition, such approaches may introduce extra biases and noises to the estimated parameters. In this work, we propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set under some mild assumptions. As the computation is direct without any imputation, our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods while maintaining superior estimation performance. We validate these claims by empirical results on various data sets of different sizes and release all codes in a GitHub repository to contribute to the research community related to this problem.

Submitted 6 June, 2021; originally announced June 2021.

Comments: 28 pages, 3 tables, 40 references

arXiv:2009.11360 [pdf, other]

EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data

Authors: Thu Nguyen, Duy H. M. Nguyen, Huy Nguyen, Binh T. Nguyen, Bruce A. Wade

Abstract: The problem of monotone missing data has been broadly studied during the last two decades and has many applications in different fields such as bioinformatics or statistics. Commonly used imputation techniques require multiple iterations through the data before yielding convergence. Moreover, those approaches may introduce extra noises and biases to the subsequent modeling. In this work, we derive exact formulas and propose a novel algorithm to compute the maximum likelihood estimators (MLEs) of a multiple class, monotone missing dataset when all the covariance matrices of all categories are assumed to be equal, namely EPEM. We then illustrate an application of our proposed methods in Linear Discriminant Analysis (LDA). As the computation is exact, our EPEM algorithm does not require multiple iterations through the data as other imputation approaches, thus promising to handle much less time-consuming than other methods. This effectiveness was validated by empirical results when EPEM reduced the error rates significantly and required a short computation time compared to several imputation-based approaches. We also release all codes and data of our experiments in one GitHub repository to contribute to the research community related to this problem.

Submitted 23 September, 2020; originally announced September 2020.

Comments: version 1

arXiv:2004.07967 [pdf, other]

doi 10.3390/app11073214

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Authors: Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi

Abstract: Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate various videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We propose to produce a final similarity between instances by fusing similarities measured in each embedding space using a weighted sum strategy. We determine the weights according to a sentence. Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive to state-of-the-art methods. These experimental results demonstrated the effectiveness of the proposed multiple embedding approach compared to existing methods.

Submitted 16 April, 2020; originally announced April 2020.

Comments: 8 pages, 5 figures

Journal ref: Applied Sciences, 2021

arXiv:1905.06509 [pdf, other]

TRk-CNN: Transferable Ranking-CNN for image classification of glaucoma, glaucoma suspect, and normal eyes

Authors: Tae Joon Jun, Youngsub Eom, Dohyeun Kim, Cherry Kim, Ji-Hye Park, Hoang Minh Nguyen, Daeyoung Kim

Abstract: In this paper, we propose the Transferable Ranking Convolutional Neural Network (TRk-CNN), which can be effectively applied when the classes of images to be classified show a high correlation with each other. The multi-class classification method based on the softmax function, which is generally used, is not effective in this case because the inter-class relationship is ignored. Although there is a Ranking-CNN that takes into account the ordinal classes, it cannot reflect the inter-class relationship to the final prediction. TRk-CNN, on the other hand, combines the weights of the primitive classification model to reflect the inter-class information to the final classification phase. We evaluated TRk-CNN on a glaucoma image dataset that was labeled into three classes: normal, glaucoma suspect, and glaucoma eyes. Based on the literature we surveyed, this study is the first to classify a glaucoma fundus image dataset into these three classes. We compared the evaluation results of TRk-CNN with Ranking-CNN (Rk-CNN) and multi-class CNN (MC-CNN) using DenseNet as the backbone CNN model. As a result, TRk-CNN achieved an average accuracy of 92.96%, specificity of 93.33%, sensitivity for glaucoma suspect of 95.12% and sensitivity for glaucoma of 93.98%. Based on average accuracy, TRk-CNN is 8.04% and 9.54% higher than Rk-CNN and MC-CNN, respectively, and surprisingly 26.83% higher in sensitivity for glaucoma suspect than MC-CNN. Our TRk-CNN is expected to be effectively applied to the medical image classification problem where the disease state is continuous and increases in the positive class direction.

Submitted 15 May, 2019; originally announced May 2019.

Comments: 49 pages, 12 figures

arXiv:1805.05727 [pdf, other]

2sRanking-CNN: A 2-stage ranking-CNN for diagnosis of glaucoma from fundus images using CAM-extracted ROI as an intermediate input

Authors: Tae Joon Jun, Dohyeun Kim, Hoang Minh Nguyen, Daeyoung Kim, Youngsub Eom

Abstract: Glaucoma is a disease in which the optic nerve is chronically damaged by the elevation of the intra-ocular pressure, resulting in visual field defect. Therefore, it is important to monitor and treat suspected patients before they are confirmed with glaucoma. In this paper, we propose a 2-stage ranking-CNN that classifies fundus images as normal, suspicious, and glaucoma. Furthermore, we propose a method of using the class activation map as a mask filter and combining it with the original fundus image as an intermediate input. Our results have improved the average accuracy by about 10% over the existing 3-class CNN and ranking-CNN, and especially improved the sensitivity of suspicious class by more than 20% over 3-class CNN. In addition, the extracted ROI was also found to overlap with the diagnostic criteria of the physician. The method we propose is expected to be efficiently applied to any medical data where there is a suspicious condition between normal and disease.

Submitted 4 July, 2018; v1 submitted 15 May, 2018; originally announced May 2018.

Comments: Accepted at BMVC 2018

arXiv:1804.06812 [pdf, other]

ECG arrhythmia classification using a 2-D convolutional neural network

Authors: Tae Joon Jun, Hoang Minh Nguyen, Daeyoun Kang, Dohyeun Kim, Daeyoung Kim, Young-Hak Kim

Abstract: In this paper, we propose an effective electrocardiogram (ECG) arrhythmia classification method using a deep two-dimensional convolutional neural network (CNN), which has recently shown outstanding performance in the field of pattern recognition. Every ECG beat was transformed into a two-dimensional grayscale image as input data for the CNN classifier. Optimization of the proposed CNN classifier includes various deep learning techniques such as batch normalization, data augmentation, Xavier initialization, and dropout. In addition, we compared our proposed classifier with two well-known CNN models: AlexNet and VGGNet. ECG recordings from the MIT-BIH arrhythmia database were used for the evaluation of the classifier. As a result, our classifier achieved 99.05% average accuracy with 97.85% average sensitivity. To precisely validate our CNN classifier, 10-fold cross-validation was performed in the evaluation, which involves every ECG recording as test data. Our experimental results have successfully validated that the proposed CNN classifier with the transformed ECG images can achieve excellent classification accuracy without any manual pre-processing of the ECG signals such as noise filtering, feature extraction, and feature reduction.

Submitted 18 April, 2018; originally announced April 2018.

Comments: Submitted to journal

arXiv:1802.01268 [pdf, other]

doi 10.1016/j.ins.2022.01.011

ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks

Authors: Duy H. M. Nguyen, Duy M. Nguyen, Mai T. N. Truong, Thu Nguyen, Khanh T. Tran, Nguyen A. Triet, Pham T. Bao, Binh T. Nguyen

Abstract: Brain extraction (skull stripping) is a challenging problem in neuroimaging. This is due to the variability in conditions from data acquisition or abnormalities in images, which makes brain morphology and intensity characteristics changeable and complicated. In this paper, we propose an algorithm for skull stripping in Magnetic Resonance Imaging (MRI) scans, namely ASMCNN, which combines the Active Shape Model (ASM) and Convolutional Neural Network (CNN) to take full advantage of both and achieve remarkable results. Instead of working with 3D structures, we process 2D image sequences in the sagittal plane. First, we divide images into different groups such that, in each group, shapes and structures of brain boundaries have similar appearances. Second, a modified version of ASM is used to detect brain boundaries by utilizing prior knowledge of each group. Finally, CNN and post-processing methods, including Conditional Random Field (CRF), Gaussian processes, and several special rules are applied to refine the segmentation contours. Experimental results show that our proposed method outperforms current state-of-the-art algorithms by a significant margin in all experiments.

Submitted 27 January, 2022; v1 submitted 5 February, 2018; originally announced February 2018.

Comments: 47 pages, 20 figures

MSC Class: 68T10

Showing 1–23 of 23 results for author: Nguyen, H M