-
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Authors:
Xinzhuo Li,
Adheesh Juvekar,
Xingyou Liu,
Muntasir Wahed,
Kiet A. Nguyen,
Ismini Lourentzou
Abstract:
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations withou…
▽ More
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Authors:
Tianjiao Yu,
Vedant Shah,
Muntasir Wahed,
Ying Shen,
Kiet A. Nguyen,
Ismini Lourentzou
Abstract:
Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation t…
▽ More
Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part$^{2}$GS consistently outperforms state-of-the-art methods by up to 10$\times$ in Chamfer Distance for movable parts.
△ Less
Submitted 20 June, 2025;
originally announced June 2025.
-
Uncertainty in Action: Confidence Elicitation in Embodied Agents
Authors:
Tianjiao Yu,
Vedant Shah,
Muntasir Wahed,
Kiet A. Nguyen,
Adheesh Juvekar,
Tal August,
Ismini Lourentzou
Abstract:
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abduc…
▽ More
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thoughts, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations
Authors:
Khoi Anh Nguyen,
Linh Yen Vu,
Thang Dinh Duong,
Thuan Nguyen Duong,
Huy Thanh Nguyen,
Vinh Quang Dinh
Abstract:
Visual Question Answering (VQA) is a multimodal task requiring reasoning across textual and visual inputs, which becomes particularly challenging in low-resource languages like Vietnamese due to linguistic variability and the lack of high-quality datasets. Traditional methods often rely heavily on extensive annotated datasets, computationally expensive pipelines, and large pre-trained models, spec…
▽ More
Visual Question Answering (VQA) is a multimodal task requiring reasoning across textual and visual inputs, which becomes particularly challenging in low-resource languages like Vietnamese due to linguistic variability and the lack of high-quality datasets. Traditional methods often rely heavily on extensive annotated datasets, computationally expensive pipelines, and large pre-trained models, specifically in the domain of Vietnamese VQA, limiting their applicability in such scenarios. To address these limitations, we propose a training framework that combines a paraphrase-based feature augmentation module with a dynamic curriculum learning strategy. Explicitly, augmented samples are considered "easy" while raw samples are regarded as "hard". The framework then utilizes a mechanism that dynamically adjusts the ratio of easy to hard samples during training, progressively modifying the same dataset to increase its difficulty level. By enabling gradual adaptation to task complexity, this approach helps the Vietnamese VQA model generalize well, thus improving overall performance. Experimental results show consistent improvements on the OpenViVQA dataset and mixed outcomes on the ViVQA dataset, highlighting both the potential and challenges of our approach in advancing VQA for Vietnamese language.
△ Less
Submitted 6 March, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models
Authors:
Kiet A. Nguyen,
Adheesh Juvekar,
Tianjiao Yu,
Muntasir Wahed,
Ismini Lourentzou
Abstract:
Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focuse…
▽ More
Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects, as well as common and unique object parts across images. To address this task, we present CALICO, the first LVLM designed for multi-image part-level reasoning segmentation. CALICO features two key components, a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a large-scale multi-image segmentation dataset containing $\sim$2.4M samples across $\sim$44K images spanning diverse object and part categories. Experimental results demonstrate that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
△ Less
Submitted 3 April, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation
Authors:
Muntasir Wahed,
Kiet A. Nguyen,
Adheesh Sunil Juvekar,
Xinzhuo Li,
Xiaona Zhou,
Vedant Shah,
Tianjiao Yu,
Pinar Yanardag,
Ismini Lourentzou
Abstract:
Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reas…
▽ More
Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
Conformalised data synthesis
Authors:
Julia A. Meister,
Khuong An Nguyen
Abstract:
With the proliferation of increasingly complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demand of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output is an open research question with significant associated risks for high-stake domains. To address this challenge, we propose a unique synthesis algorithm…
▽ More
With the proliferation of increasingly complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demand of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output is an open research question with significant associated risks for high-stake domains. To address this challenge, we propose a unique synthesis algorithm that generates data from high-confidence feature space regions based on the Conformal Prediction framework. We support our proposed algorithm with a comprehensive exploration of the core parameter's influence, an in-depth discussion of practical advice, and an extensive empirical evaluation of five benchmark datasets. To show our approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance, and non-separability. In all trials, training sets extended with our confident synthesised data performed at least as well as the original set and frequently significantly improved Deep Learning performance by up to 61 percentage points F1-score.
△ Less
Submitted 10 January, 2025; v1 submitted 14 December, 2023;
originally announced December 2023.
-
A novel Deep Learning approach for one-step Conformal Prediction approximation
Authors:
Julia A. Meister,
Khuong An Nguyen,
Stelios Kapetanakis,
Zhiyuan Luo
Abstract:
Deep Learning predictions with measurable confidence are increasingly desirable for real-world problems, especially in high-risk settings. The Conformal Prediction (CP) framework is a versatile solution that guarantees a maximum error rate given minimal constraints. In this paper, we propose a novel conformal loss function that approximates the traditionally two-step CP approach in a single step.…
▽ More
Deep Learning predictions with measurable confidence are increasingly desirable for real-world problems, especially in high-risk settings. The Conformal Prediction (CP) framework is a versatile solution that guarantees a maximum error rate given minimal constraints. In this paper, we propose a novel conformal loss function that approximates the traditionally two-step CP approach in a single step. By evaluating and penalising deviations from the stringent expected CP output distribution, a Deep Learning model may learn the direct relationship between the input data and the conformal p-values. We carry out a comprehensive empirical evaluation to show our novel loss function's competitiveness for seven binary and multi-class prediction tasks on five benchmark datasets. On the same datasets, our approach achieves significant training time reductions up to 86% compared to Aggregated Conformal Prediction (ACP), while maintaining comparable approximate validity and predictive efficiency.
△ Less
Submitted 7 August, 2023; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Audio feature ranking for sound-based COVID-19 patient detection
Authors:
Julia A. Meister,
Khuong An Nguyen,
Zhiyuan Luo
Abstract:
Audio classification using breath and cough samples has recently emerged as a low-cost, non-invasive, and accessible COVID-19 screening method. However, a comprehensive survey shows that no application has been approved for official use at the time of writing, due to the stringent reliability and accuracy requirements of the critical healthcare setting. To support the development of Machine Learni…
▽ More
Audio classification using breath and cough samples has recently emerged as a low-cost, non-invasive, and accessible COVID-19 screening method. However, a comprehensive survey shows that no application has been approved for official use at the time of writing, due to the stringent reliability and accuracy requirements of the critical healthcare setting. To support the development of Machine Learning classification models, we performed an extensive comparative investigation and ranking of 15 audio features, including less well-known ones. The results were verified on two independent COVID-19 sound datasets. By using the identified top-performing features, we have increased COVID-19 classification accuracy by up to 17% on the Cambridge dataset and up to 10% on the Coswara dataset compared to the original baseline accuracies without our feature ranking.
△ Less
Submitted 23 November, 2022; v1 submitted 14 April, 2021;
originally announced April 2021.
-
A review of smartphones based indoor positioning: challenges and applications
Authors:
Khuong An Nguyen,
Zhiyuan Luo,
Guang Li,
Chris Watkins
Abstract:
The continual proliferation of mobile devices has encouraged much effort in using the smartphones for indoor positioning. This article is dedicated to review the most recent and interesting smartphones based indoor navigation systems, ranging from electromagnetic to inertia to visible light ones, with an emphasis on their unique challenges and potential real-world applications. A taxonomy of smart…
▽ More
The continual proliferation of mobile devices has encouraged much effort in using the smartphones for indoor positioning. This article is dedicated to review the most recent and interesting smartphones based indoor navigation systems, ranging from electromagnetic to inertia to visible light ones, with an emphasis on their unique challenges and potential real-world applications. A taxonomy of smartphones sensors will be introduced, which serves as the basis to categorise different positioning systems for reviewing. A set of criteria to be used for the evaluation purpose will be devised. For each sensor category, the most recent, interesting and practical systems will be examined, with detailed discussion on the open research questions for the academics, and the practicality for the potential clients.
△ Less
Submitted 3 June, 2020;
originally announced June 2020.
-
Epidemic contact tracing with smartphone sensors
Authors:
Khuong An Nguyen,
Zhiyuan Luo,
Chris Watkins
Abstract:
Contact tracing is widely considered as an effective procedure in the fight against epidemic diseases. However, one of the challenges for technology based contact tracing is the high number of false positives, questioning its trust-worthiness and efficiency amongst the wider population for mass adoption. To this end, this paper proposes a novel, yet practical smartphone-based contact tracing appro…
▽ More
Contact tracing is widely considered as an effective procedure in the fight against epidemic diseases. However, one of the challenges for technology based contact tracing is the high number of false positives, questioning its trust-worthiness and efficiency amongst the wider population for mass adoption. To this end, this paper proposes a novel, yet practical smartphone-based contact tracing approach, employing WiFi and acoustic sound for relative distance estimate, in addition to the air pressure and the magnetic field for ambient environment matching. We present a model combining 6 smartphone sensors, prioritising some of them when certain conditions are met. We empirically verified our approach in various realistic environments to demonstrate an achievement of up to 95% fewer false positives, and 62% more accurate than Bluetooth-only system. To the best of our knowledge, this paper was one of the first work to propose a combination of smartphone sensors for contact tracing.
△ Less
Submitted 25 July, 2020; v1 submitted 29 May, 2020;
originally announced June 2020.
-
Attentive Neural Network for Named Entity Recognition in Vietnamese
Authors:
Kim Anh Nguyen,
Ngan Dong,
Cam-Tu Nguyen
Abstract:
We propose an attentive neural network for the task of named entity recognition in Vietnamese. The proposed attentive neural model makes use of character-based language models and word embeddings to encode words as vector representations. A neural network architecture of encoder, attention, and decoder layers is then utilized to encode knowledge of input sentences and to label entity tags. The exp…
▽ More
We propose an attentive neural network for the task of named entity recognition in Vietnamese. The proposed attentive neural model makes use of character-based language models and word embeddings to encode words as vector representations. A neural network architecture of encoder, attention, and decoder layers is then utilized to encode knowledge of input sentences and to label entity tags. The experimental results show that the proposed attentive neural network achieves the state-of-the-art results on the benchmark named entity recognition datasets in Vietnamese in comparison to both hand-crafted features based models and neural models.
△ Less
Submitted 9 June, 2019; v1 submitted 31 October, 2018;
originally announced October 2018.
-
Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness
Authors:
Kim Anh Nguyen,
Sabine Schulte im Walde,
Ngoc Thang Vu
Abstract:
We present two novel datasets for the low-resource language Vietnamese to assess models of semantic similarity: ViCon comprises pairs of synonyms and antonyms across word classes, thus offering data to distinguish between similarity and dissimilarity. ViSim-400 provides degrees of similarity across five semantic relations, as rated by human judges. The two datasets are verified through standard co…
▽ More
We present two novel datasets for the low-resource language Vietnamese to assess models of semantic similarity: ViCon comprises pairs of synonyms and antonyms across word classes, thus offering data to distinguish between similarity and dissimilarity. ViSim-400 provides degrees of similarity across five semantic relations, as rated by human judges. The two datasets are verified through standard co-occurrence and neural network models, showing results comparable to the respective English datasets.
△ Less
Submitted 19 April, 2018; v1 submitted 15 April, 2018;
originally announced April 2018.
-
Hierarchical Embeddings for Hypernymy Detection and Directionality
Authors:
Kim Anh Nguyen,
Maximilian Köper,
Sabine Schulte im Walde,
Ngoc Thang Vu
Abstract:
We present a novel neural model HyperVec to learn hierarchical embeddings for hypernymy detection and directionality. While previous embeddings have shown limitations on prototypical hypernyms, HyperVec represents an unsupervised measure where embeddings are learned in a specific order and capture the hypernym$-$hyponym distributional hierarchy. Moreover, our model is able to generalize over unsee…
▽ More
We present a novel neural model HyperVec to learn hierarchical embeddings for hypernymy detection and directionality. While previous embeddings have shown limitations on prototypical hypernyms, HyperVec represents an unsupervised measure where embeddings are learned in a specific order and capture the hypernym$-$hyponym distributional hierarchy. Moreover, our model is able to generalize over unseen hypernymy pairs, when using only small sets of training data, and by mapping to other languages. Results on benchmark datasets show that HyperVec outperforms both state$-$of$-$the$-$art unsupervised measures and embedding models on hypernymy detection and directionality, and on predicting graded lexical entailment.
△ Less
Submitted 23 July, 2017;
originally announced July 2017.
-
Co-location Epidemic Tracking on London Public Transports Using Low Power Mobile Magnetometer
Authors:
Khuong An Nguyen,
Chris Watkins,
Zhiyuan Luo
Abstract:
The public transports provide an ideal means to enable contagious diseases transmission. This paper introduces a novel idea to detect co-location of people in such environment using just the ubiquitous geomagnetic field sensor on the smart phone. Essentially, given that all passengers must share the same journey between at least two consecutive stations, we have a long window to match the user tra…
▽ More
The public transports provide an ideal means to enable contagious diseases transmission. This paper introduces a novel idea to detect co-location of people in such environment using just the ubiquitous geomagnetic field sensor on the smart phone. Essentially, given that all passengers must share the same journey between at least two consecutive stations, we have a long window to match the user trajectory. Our idea was assessed over a painstakingly survey of over 150 kilometres of travelling distance, covering different parts of London, using the overground trains, the underground tubes and the buses.
△ Less
Submitted 1 April, 2017;
originally announced April 2017.
-
Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network
Authors:
Kim Anh Nguyen,
Sabine Schulte im Walde,
Ngoc Thang Vu
Abstract:
Distinguishing between antonyms and synonyms is a key task to achieve high performance in NLP systems. While they are notoriously difficult to distinguish by distributional co-occurrence models, pattern-based methods have proven effective to differentiate between the relations. In this paper, we present a novel neural network model AntSynNET that exploits lexico-syntactic patterns from syntactic p…
▽ More
Distinguishing between antonyms and synonyms is a key task to achieve high performance in NLP systems. While they are notoriously difficult to distinguish by distributional co-occurrence models, pattern-based methods have proven effective to differentiate between the relations. In this paper, we present a novel neural network model AntSynNET that exploits lexico-syntactic patterns from syntactic parse trees. In addition to the lexical and syntactic information, we successfully integrate the distance between the related words along the syntactic path as a new pattern feature. The results from classification experiments show that AntSynNET improves the performance over prior pattern-based methods.
△ Less
Submitted 11 January, 2017;
originally announced January 2017.
-
Neural-based Noise Filtering from Word Embeddings
Authors:
Kim Anh Nguyen,
Sabine Schulte im Walde,
Ngoc Thang Vu
Abstract:
Word embeddings have been demonstrated to benefit NLP tasks impressively. Yet, there is room for improvement in the vector representations, because current word embeddings typically contain unnecessary information, i.e., noise. We propose two novel models to improve word embeddings by unsupervised learning, in order to yield word denoising embeddings. The word denoising embeddings are obtained by…
▽ More
Word embeddings have been demonstrated to benefit NLP tasks impressively. Yet, there is room for improvement in the vector representations, because current word embeddings typically contain unnecessary information, i.e., noise. We propose two novel models to improve word embeddings by unsupervised learning, in order to yield word denoising embeddings. The word denoising embeddings are obtained by strengthening salient information and weakening noise in the original word embeddings, based on a deep feed-forward neural network filter. Results from benchmark tasks show that the filtered word denoising embeddings outperform the original word embeddings.
△ Less
Submitted 6 October, 2016;
originally announced October 2016.
-
Integrating Distributional Lexical Contrast into Word Embeddings for Antonym-Synonym Distinction
Authors:
Kim Anh Nguyen,
Sabine Schulte im Walde,
Ngoc Thang Vu
Abstract:
We propose a novel vector representation that integrates lexical contrast into distributional vectors and strengthens the most salient features for determining degrees of word similarity. The improved vectors significantly outperform standard models and distinguish antonyms from synonyms with an average precision of 0.66-0.76 across word classes (adjectives, nouns, verbs). Moreover, we integrate t…
▽ More
We propose a novel vector representation that integrates lexical contrast into distributional vectors and strengthens the most salient features for determining degrees of word similarity. The improved vectors significantly outperform standard models and distinguish antonyms from synonyms with an average precision of 0.66-0.76 across word classes (adjectives, nouns, verbs). Moreover, we integrate the lexical contrast vectors into the objective function of a skip-gram model. The novel embedding outperforms state-of-the-art models on predicting word similarities in SimLex-999, and on distinguishing antonyms from synonyms.
△ Less
Submitted 25 May, 2016;
originally announced May 2016.