-
FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data
Authors:
Manuel Nkegoum,
Minh-Tan Pham,
Élisa Fromont,
Bruno Avignon,
Sébastien Lefèvre
Abstract:
Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named "FSMODNet" that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.
Submitted 25 September, 2025;
originally announced September 2025.
-
Explainable AI for Infection Prevention and Control: Modeling CPE Acquisition and Patient Outcomes in an Irish Hospital with Transformers
Authors:
Minh-Khoi Pham,
Tai Tan Mai,
Martin Crane,
Rob Brennan,
Marie E. Ward,
Una Geary,
Declan Byrne,
Brian O Connell,
Colm Bergin,
Donncha Creagh,
Nick McDonald,
Marija Bezbradica
Abstract:
Carbapenemase-Producing Enterobacteriaceae (CPE) poses a critical concern for infection prevention and control in hospitals. However, predictive modeling of previously highlighted CPE-associated risks such as readmission, mortality, and extended length of stay (LOS) remains underexplored, particularly with modern deep learning approaches. This study introduces an eXplainable AI modeling framework to investigate CPE impact on patient outcomes from Electronic Medical Records data of an Irish hospital. We analyzed an inpatient dataset from an Irish acute hospital, incorporating diagnostic codes, ward transitions, patient demographics, infection-related variables and contact network features. Several Transformer-based architectures were benchmarked alongside traditional machine learning models. Clinical outcomes were predicted, and XAI techniques were applied to interpret model decisions. Our framework successfully demonstrated the utility of Transformer-based models, with TabTransformer consistently outperforming baselines across multiple clinical prediction tasks, especially for CPE acquisition (AUROC and sensitivity). We found infection-related features, including historical hospital exposure, admission context, and network centrality measures, to be highly influential in predicting patient outcomes and CPE acquisition risk. Explainability analyses revealed that features like "Area of Residence", "Admission Ward" and prior admissions are key risk factors. Network variables like "Ward PageRank" also ranked highly, reflecting the potential value of structural exposure information. This study presents a robust and explainable AI framework for analyzing complex EMR data to identify key risk factors and predict CPE-related outcomes. Our findings underscore the superior performance of the Transformer models and highlight the importance of diverse clinical and network features.
Submitted 18 September, 2025;
originally announced September 2025.
-
Strategic Cyber Defense via Reinforcement Learning-Guided Combinatorial Auctions
Authors:
Mai Pham,
Vikrant Vaze,
Peter Chin
Abstract:
Cyber defense operations increasingly require long-term strategic planning under uncertainty and resource constraints. We propose a new use of combinatorial auctions for allocating defensive action bundles in a realistic cyber environment, using host-specific valuations derived from reinforcement learning (RL) Q-values. These Q-values encode long-term expected utility, allowing upstream planning. We train CAFormer, a differentiable Transformer-based auction mechanism, to produce allocations that are approximately incentive-compatible under misreporting. Rather than benchmarking against existing agents, we explore the qualitative and strategic properties of the learned mechanisms. Compared to oracle and heuristic allocations, our method achieves competitive revenue while offering robustness to misreporting. In addition, we find that allocation patterns correlate with adversarial and defensive activity, suggesting implicit alignment with operational priorities. Our results demonstrate the viability of auction-based planning in cyber defense and highlight the interpretability benefits of RL-derived value structures.
Submitted 13 September, 2025;
originally announced September 2025.
-
German4All -- A Dataset and Model for Readability-Controlled Paraphrasing in German
Authors:
Miriam Anschütz,
Thanh Mai Pham,
Eslam Nasrallah,
Maximilian Müller,
Cristian-George Craciun,
Georg Groh
Abstract:
The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
Submitted 29 August, 2025; v1 submitted 25 August, 2025;
originally announced August 2025.
-
Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing
Authors:
Minh-Tan Pham
Abstract:
This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.
Submitted 21 August, 2025;
originally announced August 2025.
-
Sparse Partial Optimal Transport via Quadratic Regularization
Authors:
Khang Tran,
Khoa Nguyen,
Anh Nguyen,
Thong Huynh,
Son Pham,
Sy-Hoang Nguyen-Dang,
Manh Pham,
Bang Vo,
Mai Ngoc Tran,
Dung Luong
Abstract:
Partial Optimal Transport (POT) has recently emerged as a central tool in various Machine Learning (ML) applications. It lifts the stringent assumption of the conventional Optimal Transport (OT) that input measures are of equal masses, which is often not guaranteed in real-world datasets, and thus offers greater flexibility by permitting transport between unbalanced input measures. Nevertheless, existing major solvers for POT commonly rely on entropic regularization for acceleration and thus return dense transport plans, hindering the adoption of POT in various applications that favor sparsity. In this paper, as an alternative approach to the entropic POT formulation in the literature, we propose a novel formulation of POT with quadratic regularization, hence termed quadratic regularized POT (QPOT), which induces sparsity to the transport plan and consequently facilitates the adoption of POT in many applications with sparsity requirements. Extensive experiments on synthetic and CIFAR-10 datasets, as well as real-world applications such as color transfer and domain adaptations, consistently demonstrate the improved sparsity and favorable performance of our proposed QPOT formulation.
Submitted 11 August, 2025;
originally announced August 2025.
-
A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support
Authors:
Long S. T. Nguyen,
Truong P. Hua,
Thanh M. Nguyen,
Toan Q. Pham,
Nam K. Ngo,
An X. Nguyen,
Nghi D. M. Pham,
Nghia H. Nguyen,
Tho T. Quan
Abstract:
With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at https://huggingface.co/datasets/ura-hcmut/Vietnamese-Customer-Support-QA.
Submitted 30 July, 2025;
originally announced July 2025.
-
Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation
Authors:
Muzhaffar Hazman,
Minh-Khoi Pham,
Shweta Soundararajan,
Goncalo Mordido,
Leonardo Custode,
David Lynch,
Giorgio Cruciata,
Yucheng Shi,
Hongmeng Song,
Wang Chao,
Pan Yue,
Aleksandar Milenovic,
Alexandros Agapitos
Abstract:
Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.
Submitted 14 July, 2025;
originally announced July 2025.
-
A Spatial Relationship Aware Dataset for Robotics
Authors:
Peng Wang,
Minh Huy Pham,
Zhihao Guo,
Wei Zhou
Abstract:
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a custom annotation tool, the dataset reflects complex scenarios with similar or identical objects and intricate spatial arrangements. We benchmark six state-of-the-art scene-graph generation models on this dataset, analysing their inference speed and relational accuracy. Our results highlight significant differences in model performance and demonstrate that integrating explicit spatial relationships into foundation models, such as ChatGPT 4o, substantially improves their ability to generate executable, spatially-aware plans for robotics. The dataset and annotation tool are publicly available at https://github.com/PengPaulWang/SpatialAwareRobotDataset, supporting further research in spatial reasoning for robotics.
Submitted 14 June, 2025;
originally announced June 2025.
-
OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
Authors:
Alisha Srivastava,
Emir Korukluoglu,
Minh Nhat Le,
Duyen Tran,
Chau Minh Pham,
Marzena Karpinska,
Mohit Iyyer
Abstract:
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.
Submitted 28 May, 2025;
originally announced May 2025.
-
RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
Authors:
Long-Khanh Pham,
Thanh V. T. Tran,
Minh-Tan Pham,
Van Nguyen
Abstract:
Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
Submitted 28 May, 2025;
originally announced May 2025.
-
UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Authors:
Le Thien Phuc Nguyen,
Zhuoran Yu,
Khoa Quang Nhat Cao,
Yuwei Guo,
Tu Ho Manh Pham,
Tuan Tai Nguyen,
Toan Ngo Duc Vo,
Lucas Poon,
Soochahn Lee,
Yong Jae Lee
Abstract:
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.
Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD
Code: https://github.com/plnguyen2908/UniTalk-ASD-code
Submitted 28 May, 2025;
originally announced May 2025.
-
Frankentext: Stitching random text fragments into long-form narratives
Authors:
Chau Minh Pham,
Jenna Russell,
Dzung Pham,
Mohit Iyyer
Abstract:
We introduce Frankentexts, a long-form narrative generation paradigm that treats an LLM as a composer of existing texts rather than as an author. Given a writing prompt and thousands of randomly sampled human-written snippets, the model is asked to produce a narrative under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from the provided paragraphs. This task is effectively intractable for humans: selecting and ordering snippets yields a combinatorial search space that an LLM implicitly explores, before minimally editing and stitching together selected fragments into a coherent long-form story. Despite the extreme challenge of the task, we observe through extensive automatic and human evaluation that Frankentexts significantly improve over vanilla LLM generations in terms of writing quality, diversity, and originality while remaining coherent and relevant to the prompt. Furthermore, Frankentexts pose a fundamental challenge to detectors of AI-generated text: 72% of Frankentexts produced by our best Gemini 2.5 Pro configuration are misclassified as human-written by Pangram, a state-of-the-art detector. Human annotators praise Frankentexts for their inventive premises, vivid descriptions, and dry humor; on the other hand, they identify issues with abrupt tonal shifts and uneven grammar across segments, particularly in longer pieces. The emergence of high-quality Frankentexts raises serious questions about authorship and copyright: when humans provide the raw materials and LLMs orchestrate them into new narratives, who truly owns the result?
Submitted 30 September, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
When Are Concepts Erased From Diffusion Models?
Authors:
Kevin Lu,
Nicky Kriplani,
Rohit Gandikota,
Minh Pham,
David Bau,
Chinmay Hegde,
Niv Cohen
Abstract:
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
Submitted 30 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Can Large Language Models Really Recognize Your Name?
Authors:
Dzung Pham,
Peter Kairouz,
Niloofar Mireshghallah,
Eugene Bagdasarian,
Chau Minh Pham,
Amir Houmansadr
Abstract:
Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regularly overlook human names even in short text snippets due to ambiguous contexts, which cause the names to be misinterpreted or mishandled. We propose AMBENCH, a benchmark dataset of seemingly ambiguous human names, leveraging the name regularity bias phenomenon, embedded within concise text snippets along with benign prompt injections. Our experiments on modern LLMs tasked to detect PII as well as specialized tools show that recall of ambiguous names drops by 20-40% compared to more recognizable names. Furthermore, ambiguous human names are four times more likely to be ignored in supposedly privacy-preserving summaries generated by LLMs when benign prompt injections are present. These findings highlight the underexplored risks of relying solely on LLMs to safeguard user privacy and underscore the need for a more systematic investigation into their privacy failure modes.
Submitted 20 May, 2025;
originally announced May 2025.
-
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach
Authors:
Pierre Adorni,
Minh-Tan Pham,
Stéphane May,
Sébastien Lefèvre
Abstract:
Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.
Submitted 6 May, 2025;
originally announced May 2025.
-
SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Nhi H. Doan,
Cuong A. Pham,
Qinbo Sun,
Weimin Qi,
Kentaro Inui,
Dezhen Song
Abstract:
Efficient path planning in robotics, particularly within large-scale, complex environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability hinder real-time deployment on edge devices. We present SmallPlan, a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scale 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance, providing more efficient path planning. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics. Our source code is available here: https://github.com/quangpham2006/SmallPlan
Submitted 25 September, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs
Authors:
Minh V. T. Pham,
Huy N. Phan,
Hoang N. Phan,
Cuong Le Chi,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces, limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM agents to simulate debugging workflows, producing not only bug-fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench Lite. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
Submitted 20 April, 2025;
originally announced April 2025.
-
Leveraging Angle of Arrival Estimation against Impersonation Attacks in Physical Layer Authentication
Authors:
Thuy M. Pham,
Linda Senigagliesi,
Marco Baldi,
Rafael F. Schaefer,
Gerhard P. Fettweis,
Arsenia Chorti
Abstract:
In this paper, we investigate the utilization of the angle of arrival (AoA) as a feature for robust physical layer authentication (PLA). While most of the existing approaches to PLA focus on common features of the physical layer of communication channels, such as channel frequency response, channel impulse response or received signal strength, the use of AoA in this domain has not yet been studied in depth, particularly regarding the ability to thwart impersonation attacks. In this work, we demonstrate that an impersonation attack targeting AoA based PLA is only feasible under strict conditions on the attacker's location and hardware capabilities, which highlights the AoA's potential as a strong feature for PLA. We extend previous works considering a single-antenna attacker to the case of a multiple-antenna attacker, and we develop a theoretical characterization of the conditions in which a successful impersonation attack can be mounted. Furthermore, we leverage extensive simulations in support of theoretical analyses, to validate the robustness of AoA-based PLA.
Submitted 14 March, 2025;
originally announced March 2025.
-
BEARCUBS: A benchmark for computer-using web agents
Authors:
Yixiao Song,
Katherine Thai,
Chau Minh Pham,
Yapei Chang,
Mazin Nadaf,
Mohit Iyyer
Abstract:
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to e.g., Operator's 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
Submitted 24 July, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
SolidMark: Evaluating Image Memorization in Generative Models
Authors:
Nicky Kriplani,
Minh Pham,
Gowthami Somepalli,
Chinmay Hegde,
Niv Cohen
Abstract:
Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not.
This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce SolidMark, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that SolidMark is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on SolidMark to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at https://github.com/NickyDCFP/SolidMark.
Submitted 1 March, 2025;
originally announced March 2025.
-
CLIPPER: Compression enables long-context synthetic data generation
Authors:
Chau Minh Pham,
Yapei Chang,
Mohit Iyyer
Abstract:
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
Submitted 4 August, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Whose story is it? Personalizing story generation by inferring author styles
Authors:
Nischal Ashok Kumar,
Chau Minh Pham,
Mohit Iyyer,
Andrew Lan
Abstract:
Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author's writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per author, across five distinct sources reflecting diverse story-writing settings. We propose a two-stage pipeline for personalized story generation: first, we infer authors' implicit writing characteristics and organize them into an Author Writing Sheet, which is validated by humans to be of high quality; second, we simulate the author's persona using tailored persona descriptions and personalized story rules. We find that stories personalized using the Author Writing Sheet outperform a non-personalized baseline, achieving a 78% win-rate in capturing authors' past style and 59% in similarity to ground-truth author stories. Human evaluation supports these findings and further highlights trends, such as Reddit stories being easier to personalize, and the Creativity and Language Use aspects of stories being easier to personalize than the Plot.
Submitted 21 May, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Advancing Differentiable Economics: A Neural Network Framework for Revenue-Maximizing Combinatorial Auction Mechanisms
Authors:
Mai Pham,
Vikrant Vaze,
Peter Chin
Abstract:
Differentiable economics, which uses neural networks as function approximators and gradient-based optimization in automated mechanism design (AMD), marked a significant breakthrough with the introduction of RegretNet \citep{regretnet_paper}. It combines the flexibility of deep learning with a regret-based approach to relax incentive compatibility, allowing for approximations of revenue-maximizing auctions. However, applying these techniques to combinatorial auctions (CAs) - where bidders value bundles rather than individual items, capturing item interdependencies - remains a challenge, primarily due to the lack of methodologies that can effectively deal with combinatorial constraints. To tackle this, we propose two architectures: CANet, a fully connected neural network, and CAFormer, a transformer-based model designed to learn optimal randomized mechanisms. Unlike existing methods in traditional AMD, our approach is more scalable and free of assumptions about the structures of allowable bundles or bidder valuations. We demonstrate that our models match current methods in non-combinatorial settings and set new benchmarks for CAs. Specifically, our models consistently outperform benchmark mechanisms derived from heuristic approaches and provide empirical solutions where analytical results are unavailable. This work bridges the gap in applying differentiable economics to combinatorial auctions, offering a scalable and flexible framework for designing revenue-maximizing mechanisms.
Submitted 31 January, 2025;
originally announced January 2025.
-
MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish
Authors:
Xin Huang,
Tarun Kumar Vangani,
Minh Duc Pham,
Xunlong Zou,
Bin Wang,
Zhengyuan Liu,
Ai Ti Aw
Abstract:
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
Submitted 21 January, 2025; v1 submitted 21 December, 2024;
originally announced January 2025.
-
Speech-based Multimodel Pipeline for Vietnamese Services Quality Assessment
Authors:
Quang-Anh N. D.,
Minh-Duc Pham,
Thai Kim Dinh
Abstract:
In the evolving landscape of customer service within the digital economy, traditional methods of service quality assessment have shown significant limitations. This research proposes a novel deep-learning approach to service quality assessment, focusing on the Vietnamese service sector. By leveraging a multi-modal pipeline that analyzes speech, speaker interactions, and emotional content, the research addresses the limitations of conventional assessments and offers a more comprehensive and objective means of understanding customer service interactions. This aims to provide organizations with a sophisticated tool for evaluating and improving service quality in the digital economy.
Submitted 18 December, 2024; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
Authors:
Quang-Anh N. D.,
Manh-Hung Ha,
Thai Kim Dinh,
Minh-Duc Pham,
Ninh Nguyen Van
Abstract:
Major depressive disorder is a prevalent and serious mental health condition that negatively impacts a person's emotions, thoughts, actions, and overall perception of the world. Determining whether a person is depressed is difficult because the symptoms of depression are not always apparent. However, the voice can be one of the factors from which signs of depression can be recognized. People who are depressed express discomfort and sadness, and they may speak slowly, with a trembling voice, and with little emotion. In this study, we propose the Dynamic Convolutional Block Attention Module (Dynamic-CBAM), used within an Attention-GRU network, to classify emotions by analyzing human audio signals. Based on the results, we can identify which patients are depressed or prone to depression so that treatment and prevention can be started as soon as possible. The research delves into the computational steps involved in implementing an Attention-GRU deep learning architecture. In our experiments, the model achieves an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87 on the VNEMOS dataset. Training code is released at https://github.com/fiyud/Emotional-Vietnamese-Speech-Based-Depression-Diagnosis-Using-Dynamic-Attention-Mechanism
Submitted 11 December, 2024;
originally announced December 2024.
-
Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning
Authors:
Hoàng-Ân Lê,
Paul Berg,
Minh-Tan Pham
Abstract:
Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel-wise class labels. Making use of one task's information to train the other would be beneficial for multi-task partially supervised learning where each training example is annotated only for a single task, having the potential to expand training sets with different-task datasets. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil necessary information from one task annotations to train the other. Ablation studies and experimental results on VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at https://github.com/lhoangan/multas.
Submitted 26 November, 2024;
originally announced November 2024.
-
TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Lan C. Ngo,
Truong Do,
Dezhen Song,
Truong-Son Hy
Abstract:
Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose the Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graphs while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture TESGNN outperforms current state-of-the-art methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Last but not least, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: https://github.com/HySonLab/TESGraph
Submitted 2 March, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Authors:
Thang M. Pham,
Phat T. Nguyen,
Seunghyun Yoon,
Viet Dac Lai,
Franck Dernoncourt,
Trung Bui
Abstract:
While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.
Submitted 25 November, 2024; v1 submitted 14 November, 2024;
originally announced November 2024.
-
Raspberry PhenoSet: A Phenology-based Dataset for Automated Growth Detection and Yield Estimation
Authors:
Parham Jafary,
Anna Bazangeya,
Michelle Pham,
Lesley G. Campbell,
Sajad Saeedi,
Kourosh Zareinia,
Habiba Bougherara
Abstract:
The future of the agriculture industry is intertwined with automation. Accurate fruit detection, yield estimation, and harvest time estimation are crucial for optimizing agricultural practices. These tasks can be carried out by robots to reduce labour costs and improve the efficiency of the process. To do so, deep learning models should be trained to perform knowledge-based tasks, which outlines the importance of contributing valuable data to the literature. In this paper, we introduce Raspberry PhenoSet, a phenology-based dataset designed for detecting and segmenting raspberry fruit across seven developmental stages. To the best of our knowledge, Raspberry PhenoSet is the first fruit dataset to integrate biology-based classification with fruit detection tasks, offering valuable insights for yield estimation and precise harvest timing. This dataset contains 1,853 high-resolution images, the highest quality in the literature, captured under controlled artificial lighting in a vertical farm. The dataset has a total of 6,907 instances of mask annotations, manually labelled to reflect the seven phenology stages. We have also benchmarked Raspberry PhenoSet using several state-of-the-art deep learning models, including YOLOv8, YOLOv10, RT-DETR, and Mask R-CNN, to provide a comprehensive evaluation of their performance on the dataset. Our results highlight the challenges of distinguishing subtle phenology stages and underscore the potential of Raspberry PhenoSet for both deep learning model development and practical robotic applications in agriculture, particularly in yield prediction and supply chain management. The dataset and the trained models are publicly available for future studies.
Submitted 1 November, 2024;
originally announced November 2024.
-
Taipan: Efficient and Expressive State Space Language Models with Selective Attention
Authors:
Chien Van Nguyen,
Huy Huu Nguyen,
Thang M. Pham,
Ruiyi Zhang,
Hanieh Deilamsalehy,
Puneet Mathur,
Ryan A. Rossi,
Trung Bui,
Viet Dac Lai,
Franck Dernoncourt,
Thien Huu Nguyen
Abstract:
Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
Submitted 24 October, 2024;
originally announced October 2024.
-
Scaling Analysis in a Multi-Energy System
Authors:
Jan Soeren Schwarz,
Minh Cong Pham,
Quoc Tuan Tran,
Kai Heussen
Abstract:
This paper presents a scaling study on the planning phase of a multi-energy system (MES), which is becoming increasingly prominent in the energy sector. The research aims to investigate the interactions and challenges associated with integrating heat and electrical systems and scaling their components. In this context, the interactions between these two domains are investigated, and the size of the distributed energy resources in the MES is scaled to examine the impact of sizing on the integrating networks and their controlling system. To achieve this, the paper uses sensitivity analysis and a meta-modeling technique, both incorporated in a toolbox for scaling analysis. These methodologies are validated through simulations, and the results obtained from the simulations can contribute to the advancement of MESs and their implementation in laboratory and field testing.
Submitted 23 October, 2024;
originally announced October 2024.
-
A Toolbox for Design of Experiments for Energy Systems in Co-Simulation and Hardware Tests
Authors:
Jan Sören Schwarz,
Leonard Enrique Ramos Perez,
Minh Cong Pham,
Kai Heussen,
Quoc Tuan Tran
Abstract:
In the context of highly complex energy system experiments, sensitivity analysis is becoming increasingly important for investigating the effects that changing parameterization has on the outcome. It is therefore crucial to design experiments that use the available resources efficiently. This paper describes the functionality of a toolbox designed to support users in the design of experiments for (co-)simulation and hardware tests. It provides a structure for an object-oriented description of the parameterization and variations, and performs sample generation based on this description to provide a complete parameterization for the recommended experiment runs. After execution of the runs, it can also be used to analyze the results in order to calculate and visualize the effects. The paper also presents two application cases using the toolbox, which show how it can be applied in sensitivity analysis studies with the co-simulation framework mosaik and in a hybrid energy storage experiment.
Submitted 22 October, 2024;
originally announced October 2024.
-
Time to Retrain? Detecting Concept Drifts in Machine Learning Systems
Authors:
Tri Minh Triet Pham,
Karthikeyan Premkumar,
Mohamed Naili,
Jinqiu Yang
Abstract:
With the boom of machine learning (ML) techniques, software practitioners build ML systems to process the massive volume of streaming data for diverse software engineering tasks such as failure prediction in AIOps. Trained using historical data, such ML models encounter performance degradation caused by concept drift, i.e., data and inter-relationship (concept) changes between training and production. It is essential to use concept drift detection to monitor the deployed ML models and re-train the ML models when needed. In this work, we explore applying state-of-the-art (SOTA) concept drift detection techniques on synthetic and real-world datasets in an industrial setting. Such an industrial setting requires minimal manual effort in labeling and maximal generality in ML model architecture. We find that current SOTA semi-supervised methods not only require significant labeling effort but also only work for certain types of ML models. To overcome such limitations, we propose a novel model-agnostic technique (CDSeer) for detecting concept drift. Our evaluation shows that CDSeer has better precision and recall compared to the state-of-the-art while requiring significantly less manual labeling. We demonstrate the effectiveness of CDSeer at concept drift detection by evaluating it on eight datasets from different domains and use cases. Results from internal deployment of CDSeer on an industrial proprietary dataset show a 57.1% improvement in precision while using 99% fewer labels compared to the SOTA concept drift detection method. The performance is also comparable to the supervised concept drift detection method, which requires 100% of the data to be labeled. The improved performance and ease of adoption of CDSeer are valuable in making ML systems more reliable.
Submitted 3 August, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners
Authors:
Hung Manh Pham,
Aaqib Saeed,
Dong Ma
Abstract:
The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with accompanying textual reports further holds immense potential to enhance clinical diagnostics by combining physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose D-BETA, a novel framework that pre-trains on ECG and text data using a contrastive masked auto-encoder architecture. D-BETA uniquely combines generative strengths with boosted discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that D-BETA significantly outperforms existing methods, achieving an average AUC improvement over state-of-the-art models of 15% in linear probing with only one percent of training data and 2% in zero-shot performance without requiring training data. These results highlight the effectiveness of D-BETA, underscoring its potential to advance automated clinical diagnostics through multi-modal representations. Our sample code and checkpoint are made available at https://github.com/manhph2211/D-BETA.
Submitted 7 May, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space
Authors:
Jacob Fein-Ashley,
Ethan Feng,
Minh Pham
Abstract:
Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations. This enables more effective modeling of hierarchical and relational dependencies in image data. We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization. We report improved performance for image classification on the ImageNet dataset.
Submitted 25 September, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
Robustness of LiDAR-Based Pose Estimation: Evaluating and Improving Odometry and Localization Under Common Point Cloud Corruptions
Authors:
Bo Yang,
Tri Minh Triet Pham,
Jinqiu Yang
Abstract:
Accurate and reliable pose estimation, i.e., determining the precise position and orientation of autonomous robots and vehicles, is critical for tasks like navigation and mapping. LiDAR is a widely used sensor for pose estimation, with odometry and localization being two primary tasks. LiDAR odometry estimates the relative motion between consecutive scans, while LiDAR localization aligns real-time scans with a pre-recorded map to obtain a global pose. Although they have different objectives and application scenarios, both rely on point cloud registration as the underlying technique and face shared challenges of data corruption caused by adverse conditions (e.g., rain). While state-of-the-art (SOTA) pose estimation systems achieved high accuracy on clean data, their robustness to corrupted data remains unclear. In this work, we propose a framework to systematically evaluate five SOTA LiDAR pose estimation systems across 18 synthetic real-world point cloud corruptions. Our experiments reveal that odometry systems degrade significantly under specific corruptions, with relative position errors increasing from 0.5% to more than 80%, while localization systems remain highly robust. We further demonstrate that denoising techniques can effectively mitigate the adverse effects of noise-induced corruptions, and re-training learning-based systems with corrupted data significantly enhances the robustness against various corruption types.
Submitted 4 March, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Log-concavity of the independence polynomials of $\mathbf{W}_{p}$ graphs
Authors:
Do Trong Hoang,
Vadim E. Levit,
Eugen Mandrescu,
My Hanh Pham
Abstract:
Let $G$ be a graph of order $n$. For a positive integer $p$, $G$ is said to be a $\mathbf{W}_{p}$ graph if $n\geq p$ and every $p$ pairwise disjoint independent sets of $G$ are contained within $p$ pairwise disjoint maximum independent sets. In this paper, we establish that every connected $\mathbf{W}_{p}$ graph $G$ is $p$-quasi-regularizable if and only if $n\geq(p+1)\cdot\alpha$, where $\alpha$ is the independence number of $G$ and $p\neq2$. This finding ensures that the independence polynomial of a connected $\mathbf{W}_{p}$ graph $G$ is log-concave whenever $(p+1)\cdot\alpha\leq n\leq p\cdot\alpha+2\sqrt{p\cdot\alpha+p}$ and $\frac{\alpha^{2}}{4\left( \alpha+1\right) }\leq p$, or $p\cdot\alpha+2\sqrt{p\cdot\alpha+p}<n\leq \frac{\left( \alpha^{2}+1\right) \cdot p+\left( \alpha-1\right) ^{2}}{\alpha-1}$ and $\frac{\alpha\left( \alpha-1\right) }{\alpha+1}\leq p$. Moreover, the clique corona graph $G\circ K_{p}$ serves as an example of the $\mathbf{W}_{p}$ graph class. We further demonstrate that the independence polynomial of $G\circ K_{p}$ is always log-concave for sufficiently large $p$.
Keywords: very well-covered graph; quasi-regularizable graph; corona graph; $\mathbf{W}_{p}$ graph; independence polynomial; log-concavity.
Submitted 3 September, 2025; v1 submitted 1 September, 2024;
originally announced September 2024.
-
Mapping earth mounds from space
Authors:
Baki Uzun,
Shivam Pande,
Gwendal Cachin-Bernard,
Minh-Tan Pham,
Sébastien Lefèvre,
Rumais Blatrix,
Doyle McKey
Abstract:
Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure a catastrophic shift of the ecosystem to a homogeneous desert, while termite mounds, which also produce spotted landscapes, were shown to increase robustness to climate change. Yet, their identification at large scale calls for automatic methods, for instance using the popular deep learning framework, able to cope with a vast amount of remote sensing data, e.g., optical satellite imagery. In this paper, we tackle this problem and benchmark some state-of-the-art deep networks on several landscapes and geographical areas. Despite the promising results we obtained, we found that more research is needed to automatically map these earth mounds from space.
△ Less
Submitted 31 August, 2024;
originally announced September 2024.
-
Perception-Guided Fuzzing for Simulated Scenario-Based Testing of Autonomous Driving Systems
Authors:
Tri Minh Triet Pham,
Bo Yang,
Jinqiu Yang
Abstract:
Autonomous Driving Systems (ADS) have made huge progress and have started on-road testing or even commercial trials. ADS are complex and difficult to test: they receive input data from multiple sensors and make decisions using a combination of multiple deep neural network models and code logic. The safety of ADS is of utmost importance as their misbehavior can result in costly catastrophes, including the loss of human life. In this work, we propose SimsV, which performs system-level testing on multi-module ADS. SimsV targets perception failures of ADS and further assesses the impact of perception failure on the system as a whole. SimsV leverages a high-fidelity simulator for test input and oracle generation by continuously applying predefined mutation operators. In addition, SimsV leverages various metrics to guide the testing process. We implemented a prototype of SimsV for testing a commercial-grade Level 4 ADS (i.e., Apollo) using a popular open-source driving platform simulator. Our evaluation shows that SimsV is capable of finding weaknesses in the perception of Apollo. Furthermore, we show that by exploiting such weaknesses, SimsV finds severe problems in Apollo, including collisions.
△ Less
Submitted 24 August, 2024;
originally announced August 2024.
-
Evaluating the Robustness of LiDAR-based 3D Obstacles Detection and Its Impacts on Autonomous Driving Systems
Authors:
Tri Minh Triet Pham,
Bo Yang,
Jinqiu Yang
Abstract:
Autonomous driving systems (ADSs) require real-time input from multiple sensors to make time-sensitive decisions using deep neural networks. This makes the correctness of these decisions crucial to ADSs' adoption as errors can cause significant loss. Sensors such as LiDAR are sensitive to environmental changes and built-in inaccuracies and may fluctuate between frames. While there has been extensive work to test ADSs, it remains unclear whether current ADSs are robust against very subtle changes in LiDAR point cloud data. In this work, we study the impact of the built-in inaccuracies in LiDAR sensors on LiDAR-3D obstacle detection models to provide insight into how they can impact obstacle detection (i.e., robustness) and by extension trajectory prediction (i.e., how the robustness of obstacle detection would impact ADSs).
We propose a framework, SORBET, that applies subtle perturbations to LiDAR data, evaluates the robustness of LiDAR-3D obstacle detection, and assesses the impacts on the trajectory prediction module and ADSs. We applied SORBET to evaluate the robustness of five classic LiDAR-3D obstacle detection models, including one from an industry-grade Level 4 ADS (Baidu's Apollo). Furthermore, we studied how changes in the obstacle detection results would negatively impact trajectory prediction in a cascading fashion. Our evaluation highlights the importance of testing the robustness of LiDAR-3D obstacle detection models against subtle perturbations. We find that even very subtle changes in point cloud data (i.e., removing two points) may introduce a non-trivial decrease in the detection performance. Furthermore, such a negative impact will propagate to other modules and endanger the safety of ADSs.
Submitted 24 August, 2024;
originally announced August 2024.
-
Variational Autoencoder for Anomaly Detection: A Comparative Study
Authors:
Huy Hoang Nguyen,
Cuong Nhat Nguyen,
Xuan Tung Dao,
Quoc Trung Duong,
Dzung Pham Thi Kim,
Minh-Tan Pham
Abstract:
This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE). The findings reveal that ViT-VAE exhibits exemplary performance across various scenarios, whereas VAE-GRF may necessitate more intricate hyperparameter tuning to attain its optimal performance state. Additionally, to mitigate the propensity for over-reliance on results derived from the widely used MVTec dataset, this paper leverages the recently public MiAD dataset for benchmarking. This deliberate inclusion seeks to enhance result competitiveness by alleviating the impact of domain-specific models tailored exclusively for MVTec, thereby contributing to a more robust evaluation framework. Code is available at https://github.com/endtheme123/VAE-compare.git.
△ Less
Submitted 24 August, 2024;
originally announced August 2024.
-
Spatio-temporal neural distance fields for conditional generative modeling of the heart
Authors:
Kristine Sørensen,
Paula Diez,
Jan Margeta,
Yasmin El Youssef,
Michael Pham,
Jonas Jalili Pedersen,
Tobias Kühl,
Ole de Backer,
Klaus Kofoed,
Oscar Camara,
Rasmus Paulsen
Abstract:
The rhythmic pumping motion of the heart stands as a cornerstone in life, as it circulates blood to the entire human body through a series of carefully timed contractions of the individual chambers. Changes in the size, shape and movement of the chambers can be important markers for cardiac disease, and modeling this in relation to clinical demography or disease is therefore of interest. Existing methods for spatio-temporal modeling of the human heart require shape correspondence over time or suffer from large memory requirements, making them difficult to use for complex anatomies. We introduce a novel conditional generative model, where the shape and movement are modeled implicitly in the form of a spatio-temporal neural distance field and conditioned on clinical demography. The model is based on an auto-decoder architecture and aims to disentangle the individual variations from those related to the clinical demography. It is tested on the left atrium (including the left atrial appendage), where it outperforms current state-of-the-art methods for anatomical sequence completion and generates synthetic sequences that realistically mimic the shape and motion of the real left atrium. In practice, this means we can infer functional measurements from a static image, generate synthetic populations with specified demography or disease, and investigate how non-imaging clinical data affect the shape and motion of cardiac anatomies.
Submitted 15 July, 2024;
originally announced July 2024.
-
ProxyGPT: Enabling User Anonymity in LLM Chatbots via (Un)Trustworthy Volunteer Proxies
Authors:
Dzung Pham,
Jade Sheffey,
Chau Minh Pham,
Amir Houmansadr
Abstract:
Popular large language model (LLM) chatbots such as ChatGPT and Claude require users to create an account with an email or a phone number before allowing full access to their services. This practice ties users' personally identifiable information (PII) to their sensitive conversational data, thus posing significant privacy risks. Unfortunately, existing private LLM solutions based on cryptography or trusted execution environments (TEEs) remain unpopular due to their prohibitive computational expense and platform restrictions. To enable practical user anonymity in LLM chatbots, we propose ProxyGPT, a privacy-enhancing system that leverages browser interaction proxies to submit user queries on their behalf. Unlike traditional proxy systems, ProxyGPT operates at the "user" layer by proxying user interactions with the browser in identity-required environments, thus easily supporting a wide range of chatbot services. We prevent malicious proxies by performing regular integrity audits using modern web proof protocols for TLS data provenance. We further utilize state-of-the-art LLM prompt guards on the proxy's side to mitigate unwanted user requests. Additionally, we incorporate a give-and-take economy based on Chaum's blind-signature e-cash to incentivize ProxyGPT users to proxy for others. Our system evaluation and user study demonstrate the practicality of our approach, as each chat request only takes a few additional seconds on average to fully complete. To the best of our knowledge, ProxyGPT is the first comprehensive proxy-based solution for privacy-preserving AI chatbots.
Submitted 11 June, 2025; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Robot Shape and Location Retention in Video Generation Using Diffusion Models
Authors:
Peng Wang,
Zhihao Guo,
Abdul Latheef Sait,
Minh Huy Pham
Abstract:
Diffusion models have marked a significant milestone in the enhancement of image and video generation technologies. However, generating videos that precisely retain the shape and location of moving objects such as robots remains a challenge. This paper presents diffusion models specifically tailored to generate videos that accurately maintain the shape and location of mobile robots. This development offers substantial benefits to those working on detecting dangerous interactions between humans and robots by facilitating the creation of training data for collision detection models, circumventing the need for collecting data from the real world, which often involves legal and ethical issues. Our models incorporate techniques such as embedding accessible robot pose information and applying semantic mask regulation within the ConvNext backbone network. These techniques are designed to refine intermediate outputs, thereby improving the retention of shape and location. Through extensive experimentation, our models have demonstrated notable improvements in maintaining the shape and location of different robots, as well as enhancing overall video generation quality, compared to the benchmark diffusion model. Code will be open-sourced at https://github.com/PengPaulWang/diffusion-robots.
Submitted 3 July, 2024;
originally announced July 2024.
-
ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Lan C. Ngo,
Truong Do,
Truong Son Hy
Abstract:
Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.
Submitted 30 June, 2024;
originally announced July 2024.
-
Interactive Topic Models with Optimal Transport
Authors:
Garima Dhanania,
Sheshera Mysore,
Chau Minh Pham,
Mohit Iyyer,
Hamed Zamani,
Andrew McCallum
Abstract:
Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high-level theoretical framework (e.g., political ideology). In these scenarios, analysts desire a topic modeling approach that incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, an approach for label name supervised topic modeling. EdTM frames topic modeling as an assignment problem while leveraging LM/LLM-based document-topic affinities and using optimal transport to make globally coherent topic assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers and to topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback while remaining robust to noisy analyst inputs.
Submitted 28 June, 2024;
originally announced June 2024.
-
Suri: Multi-constraint Instruction Following for Long-form Text Generation
Authors:
Chau Minh Pham,
Simeng Sun,
Mohit Iyyer
Abstract:
Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challenges associated with collecting human preference judgments on long-form texts, preference-tuning algorithms such as DPO are infeasible in our setting; thus, we propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm. Instead of receiving negative feedback from dispreferred responses, I-ORPO obtains negative feedback from synthetically corrupted instructions generated by an LLM. Using Suri, we perform supervised and I-ORPO fine-tuning on Mistral-7b-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Our human evaluation shows that while both SFT and I-ORPO models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints. We release our code at https://github.com/chtmp223/suri.
Submitted 1 October, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Leveraging knowledge distillation for partial multi-task learning from multiple remote sensing datasets
Authors:
Hoàng-Ân Lê,
Minh-Tan Pham
Abstract:
Partial multi-task learning, where training examples are annotated for only one of the target tasks, is a promising idea in remote sensing, as it allows combining datasets annotated for different tasks and predicting more tasks with fewer network parameters. The naïve approach to partial multi-task learning is sub-optimal due to the lack of all-task annotations for learning joint representations. This paper proposes using knowledge distillation to replace the need for ground truths for the alternate task and enhance the performance of such an approach. Experiments conducted on the public ISPRS 2D Semantic Labeling Contest dataset show the effectiveness of the proposed idea on partial multi-task learning for semantic tasks including object detection and semantic segmentation in aerial images.
Submitted 24 May, 2024;
originally announced May 2024.