-
A Spatial Relationship Aware Dataset for Robotics
Authors:
Peng Wang,
Minh Huy Pham,
Zhihao Guo,
Wei Zhou
Abstract:
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a…
▽ More
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a custom annotation tool, the dataset reflects complex scenarios with similar or identical objects and intricate spatial arrangements. We benchmark six state-of-the-art scene-graph generation models on this dataset, analysing their inference speed and relational accuracy. Our results highlight significant differences in model performance and demonstrate that integrating explicit spatial relationships into foundation models, such as ChatGPT 4o, substantially improves their ability to generate executable, spatially-aware plans for robotics. The dataset and annotation tool are publicly available at https://github.com/PengPaulWang/SpatialAwareRobotDataset, supporting further research in spatial reasoning for robotics.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
Authors:
Alisha Srivastava,
Emir Korukluoglu,
Minh Nhat Le,
Duyen Tran,
Chau Minh Pham,
Marzena Karpinska,
Mohit Iyyer
Abstract:
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented i…
▽ More
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
Authors:
Long-Khanh Pham,
Thanh V. T. Tran,
Minh-Tan Pham,
Van Nguyen
Abstract:
Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two comp…
▽ More
Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Authors:
Le Thien Phuc Nguyen,
Zhuoran Yu,
Khoa Quang Nhat Cao,
Yuwei Guo,
Tu Ho Manh Pham,
Tuan Tai Nguyen,
Toan Ngo Duc Vo,
Lucas Poon,
Soochahn Lee,
Yong Jae Lee
Abstract:
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underre…
▽ More
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.
Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD
Code: https://github.com/plnguyen2908/UniTalk-ASD-code
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Frankentext: Stitching random text fragments into long-form narratives
Authors:
Chau Minh Pham,
Jenna Russell,
Dzung Pham,
Mohit Iyyer
Abstract:
We introduce Frankentexts, a new type of long-form narratives produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we i…
▽ More
We introduce Frankentexts, a new type of long-form narratives produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
△ Less
Submitted 28 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
When Are Concepts Erased From Diffusion Models?
Authors:
Kevin Lu,
Nicky Kriplani,
Rohit Gandikota,
Minh Pham,
David Bau,
Chinmay Hegde,
Niv Cohen
Abstract:
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generatin…
▽ More
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
△ Less
Submitted 30 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Can Large Language Models Really Recognize Your Name?
Authors:
Dzung Pham,
Peter Kairouz,
Niloofar Mireshghallah,
Eugene Bagdasarian,
Chau Minh Pham,
Amir Houmansadr
Abstract:
Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regul…
▽ More
Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regularly overlook human names even in short text snippets due to ambiguous contexts, which cause the names to be misinterpreted or mishandled. We propose AMBENCH, a benchmark dataset of seemingly ambiguous human names, leveraging the name regularity bias phenomenon, embedded within concise text snippets along with benign prompt injections. Our experiments on modern LLMs tasked to detect PII as well as specialized tools show that recall of ambiguous names drops by 20--40% compared to more recognizable names. Furthermore, ambiguous human names are four times more likely to be ignored in supposedly privacy-preserving summaries generated by LLMs when benign prompt injections are present. These findings highlight the underexplored risks of relying solely on LLMs to safeguard user privacy and underscore the need for a more systematic investigation into their privacy failure modes.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach
Authors:
Pierre Adorni,
Minh-Tan Pham,
Stéphane May,
Sébastien Lefèvre
Abstract:
Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate the…
▽ More
Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Nhi H. Doan,
Cuong A. Pham,
Kentaro Inui,
Dezhen Song
Abstract:
Efficient path planning in robotics, particularly within large-scale, dynamic environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability in dynamic scenarios hinder real-time deployment on edge devices. We present SmallPlan -- a novel framework leveraging LLMs as teacher models to train…
▽ More
Efficient path planning in robotics, particularly within large-scale, dynamic environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability in dynamic scenarios hinder real-time deployment on edge devices. We present SmallPlan -- a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scaled 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance and number of trials. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics. Our source code is available here: https://github.com/quangpham2006/SmallPlan
△ Less
Submitted 11 May, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs
Authors:
Minh V. T. Pham,
Huy N. Phan,
Hoang N. Phan,
Cuong Le Chi,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces-limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framew…
▽ More
Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces-limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM agents to simulate debugging workflows, producing not only bug-fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench Lite. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Leveraging Angle of Arrival Estimation against Impersonation Attacks in Physical Layer Authentication
Authors:
Thuy M. Pham,
Linda Senigagliesi,
Marco Baldi,
Rafael F. Schaefer,
Gerhard P. Fettweis,
Arsenia Chorti
Abstract:
In this paper, we investigate the utilization of the angle of arrival (AoA) as a feature for robust physical layer authentication (PLA). While most of the existing approaches to PLA focus on common features of the physical layer of communication channels, such as channel frequency response, channel impulse response or received signal strength, the use of AoA in this domain has not yet been studied…
▽ More
In this paper, we investigate the utilization of the angle of arrival (AoA) as a feature for robust physical layer authentication (PLA). While most of the existing approaches to PLA focus on common features of the physical layer of communication channels, such as channel frequency response, channel impulse response or received signal strength, the use of AoA in this domain has not yet been studied in depth, particularly regarding the ability to thwart impersonation attacks. In this work, we demonstrate that an impersonation attack targeting AoA based PLA is only feasible under strict conditions on the attacker's location and hardware capabilities, which highlights the AoA's potential as a strong feature for PLA. We extend previous works considering a single-antenna attacker to the case of a multiple-antenna attacker, and we develop a theoretical characterization of the conditions in which a successful impersonation attack can be mounted. Furthermore, we leverage extensive simulations in support of theoretical analyses, to validate the robustness of AoA-based PLA.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
BEARCUBS: A benchmark for computer-using web agents
Authors:
Yixiao Song,
Katherine Thai,
Chau Minh Pham,
Yapei Chang,
Mazin Nadaf,
Mohit Iyyer
Abstract:
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seek…
▽ More
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
SolidMark: Evaluating Image Memorization in Generative Models
Authors:
Nicky Kriplani,
Minh Pham,
Gowthami Somepalli,
Chinmay Hegde,
Niv Cohen
Abstract:
Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not.
This paper begins with a comprehensive exploration of issues surrounding memorizat…
▽ More
Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not.
This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce $\rm \style{font-variant: small-caps}{SolidMark}$, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that $\rm \style{font-variant: small-caps}{SolidMark}$ is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on $\rm \style{font-variant: small-caps}{SolidMark}$ to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at https://github.com/NickyDCFP/SolidMark.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
CLIPPER: Compression enables long-context synthetic data generation
Authors:
Chau Minh Pham,
Yapei Chang,
Mohit Iyyer
Abstract:
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw tex…
▽ More
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Whose story is it? Personalizing story generation by inferring author styles
Authors:
Nischal Ashok Kumar,
Chau Minh Pham,
Mohit Iyyer,
Andrew Lan
Abstract:
Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author's writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per au…
▽ More
Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author's writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per author, across five distinct sources reflecting diverse story-writing settings. We propose a two-stage pipeline for personalized story generation: first, we infer authors' implicit writing characteristics and organize them into an Author Writing Sheet, which is validated by humans to be of high quality; second, we simulate the author's persona using tailored persona descriptions and personalized story rules. We find that stories personalized using the Author Writing Sheet outperform a non-personalized baseline, achieving a 78% win-rate in capturing authors' past style and 59% in similarity to ground-truth author stories. Human evaluation supports these findings and further highlights trends, such as Reddit stories being easier to personalize, and the Creativity and Language Use aspects of stories being easier to personalize than the Plot.
△ Less
Submitted 21 May, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Advancing Differentiable Economics: A Neural Network Framework for Revenue-Maximizing Combinatorial Auction Mechanisms
Authors:
Mai Pham,
Vikrant Vaze,
Peter Chin
Abstract:
Differentiable economics, which uses neural networks as function approximators and gradient-based optimization in automated mechanism design (AMD), marked a significant breakthrough with the introduction of RegretNet \citep{regretnet_paper}. It combines the flexibility of deep learning with a regret-based approach to relax incentive compatibility, allowing for approximations of revenue-maximizing…
▽ More
Differentiable economics, which uses neural networks as function approximators and gradient-based optimization in automated mechanism design (AMD), marked a significant breakthrough with the introduction of RegretNet \citep{regretnet_paper}. It combines the flexibility of deep learning with a regret-based approach to relax incentive compatibility, allowing for approximations of revenue-maximizing auctions. However, applying these techniques to combinatorial auctions (CAs) - where bidders value bundles rather than individual items, capturing item interdependencies - remains a challenge, primarily due to the lack of methodologies that can effectively deal with combinatorial constraints. To tackle this, we propose two architectures: CANet, a fully connected neural network, and CAFormer, a transformer-based model designed to learn optimal randomized mechanisms. Unlike existing methods in traditional AMD, our approach is more scalable and free of assumptions about the structures of allowable bundles or bidder valuations. We demonstrate that our models match current methods in non-combinatorial settings and set new benchmarks for CAs. Specifically, our models consistently outperform benchmark mechanisms derived from heuristic approaches and provide empirical solutions where analytical results are unavailable. This work bridges the gap in applying differentiable economics to combinatorial auctions, offering a scalable and flexible framework for designing revenue-maximizing mechanisms.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish
Authors:
Xin Huang,
Tarun Kumar Vangani,
Minh Duc Pham,
Xunlong Zou,
Bin Wang,
Zhengyuan Liu,
Ai Ti Aw
Abstract:
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesi…
▽ More
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
△ Less
Submitted 21 January, 2025; v1 submitted 21 December, 2024;
originally announced January 2025.
-
Speech-based Multimodel Pipeline for Vietnamese Services Quality Assessment
Authors:
Quang-Anh N. D.,
Minh-Duc Pham,
Thai Kim Dinh
Abstract:
In the evolving landscape of customer service within the digital economy, traditional methods of service quality assessment have shown significant limitations, this research proposes a novel deep-learning approach to service quality assessment, focusing on the Vietnamese service sector. By leveraging a multi-modal pipeline that transcends traditional evaluation methods, the research addresses the…
▽ More
In the evolving landscape of customer service within the digital economy, traditional methods of service quality assessment have shown significant limitations, this research proposes a novel deep-learning approach to service quality assessment, focusing on the Vietnamese service sector. By leveraging a multi-modal pipeline that transcends traditional evaluation methods, the research addresses the limitations of conventional assessments by analyzing speech, speaker interactions and emotional content, offering a more comprehensive and objective means of understanding customer service interactions. This aims to provide organizations with a sophisticated tool for evaluating and improving service quality in the digital economy.
△ Less
Submitted 18 December, 2024; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
Authors:
Quang-Anh N. D.,
Manh-Hung Ha,
Thai Kim Dinh,
Minh-Duc Pham,
Ninh Nguyen Van
Abstract:
Major depressive disorder is a prevalent and serious mental health condition that negatively impacts your emotions, thoughts, actions, and overall perception of the world. It is complicated to determine whether a person is depressed due to the symptoms of depression not apparent. However, their voice can be one of the factor from which we can acknowledge signs of depression. People who are depress…
▽ More
Major depressive disorder is a prevalent and serious mental health condition that negatively impacts your emotions, thoughts, actions, and overall perception of the world. It is complicated to determine whether a person is depressed due to the symptoms of depression not apparent. However, their voice can be one of the factor from which we can acknowledge signs of depression. People who are depressed express discomfort, sadness and they may speak slowly, trembly, and lose emotion in their voices. In this study, we proposed the Dynamic Convolutional Block Attention Module (Dynamic-CBAM) to utilized with in an Attention-GRU Network to classify the emotions by analyzing the audio signal of humans. Based on the results, we can diagnose which patients are depressed or prone to depression then so that treatment and prevention can be started as soon as possible. The research delves into the intricate computational steps involved in implementing a Attention-GRU deep learning architecture. Through experimentation, the model has achieved an impressive recognition with Unweighted Accuracy (UA) rate of 0.87 and 0.86 Weighted Accuracy (WA) rate and F1 rate of 0.87 in the VNEMOS dataset. Training code is released in https://github.com/fiyud/Emotional-Vietnamese-Speech-Based-Depression-Diagnosis-Using-Dynamic-Attention-Mechanism
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning
Authors:
Hoàng-Ân Lê,
Paul Berg,
Minh-Tan Pham
Abstract:
Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel-wise class labels. Making use of one task's information to train the other would be beneficial for multi-task partially supervised learning where each training…
▽ More
Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel-wise class labels. Making use of one task's information to train the other would be beneficial for multi-task partially supervised learning where each training example is annotated only for a single task, having the potential to expand training sets with different-task datasets. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box-for-Mask and Mask-for-Box strategies, and their combination BoMBo, to distil necessary information from one task annotations to train the other. Ablation studies and experimental results on VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at https://github.com/lhoangan/multas.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Lan C. Ngo,
Truong Do,
Dezhen Song,
Truong-Son Hy
Abstract:
Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view…
▽ More
Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graph while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture TESGNN outperforms current state-of-the-art methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Last but not least, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: https://github.com/HySonLab/TESGraph
△ Less
Submitted 2 March, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Authors:
Thang M. Pham,
Phat T. Nguyen,
Seunghyun Yoon,
Viet Dac Lai,
Franck Dernoncourt,
Trung Bui
Abstract:
While small language models (SLMs) show promises for mobile deployment, their real-world performance and applications on smartphones remains underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), co…
▽ More
While small language models (SLMs) show promises for mobile deployment, their real-world performance and applications on smartphones remains underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.
△ Less
Submitted 25 November, 2024; v1 submitted 14 November, 2024;
originally announced November 2024.
-
Raspberry PhenoSet: A Phenology-based Dataset for Automated Growth Detection and Yield Estimation
Authors:
Parham Jafary,
Anna Bazangeya,
Michelle Pham,
Lesley G. Campbell,
Sajad Saeedi,
Kourosh Zareinia,
Habiba Bougherara
Abstract:
The future of the agriculture industry is intertwined with automation. Accurate fruit detection, yield estimation, and harvest time estimation are crucial for optimizing agricultural practices. These tasks can be carried out by robots to reduce labour costs and improve the efficiency of the process. To do so, deep learning models should be trained to perform knowledge-based tasks, which outlines t…
▽ More
The future of the agriculture industry is intertwined with automation. Accurate fruit detection, yield estimation, and harvest time estimation are crucial for optimizing agricultural practices. These tasks can be carried out by robots to reduce labour costs and improve the efficiency of the process. To do so, deep learning models should be trained to perform knowledge-based tasks, which outlines the importance of contributing valuable data to the literature. In this paper, we introduce Raspberry PhenoSet, a phenology-based dataset designed for detecting and segmenting raspberry fruit across seven developmental stages. To the best of our knowledge, Raspberry PhenoSet is the first fruit dataset to integrate biology-based classification with fruit detection tasks, offering valuable insights for yield estimation and precise harvest timing. This dataset contains 1,853 high-resolution images, the highest quality in the literature, captured under controlled artificial lighting in a vertical farm. The dataset has a total of 6,907 instances of mask annotations, manually labelled to reflect the seven phenology stages. We have also benchmarked Raspberry PhenoSet using several state-of-the-art deep learning models, including YOLOv8, YOLOv10, RT-DETR, and Mask R-CNN, to provide a comprehensive evaluation of their performance on the dataset. Our results highlight the challenges of distinguishing subtle phenology stages and underscore the potential of Raspberry PhenoSet for both deep learning model development and practical robotic applications in agriculture, particularly in yield prediction and supply chain management. The dataset and the trained models are publicly available for future studies.
△ Less
Submitted 1 November, 2024;
originally announced November 2024.
-
Taipan: Efficient and Expressive State Space Language Models with Selective Attention
Authors:
Chien Van Nguyen,
Huy Huu Nguyen,
Thang M. Pham,
Ruiyi Zhang,
Hanieh Deilamsalehy,
Puneet Mathur,
Ryan A. Rossi,
Trung Bui,
Viet Dac Lai,
Franck Dernoncourt,
Thien Huu Nguyen
Abstract:
Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they un…
▽ More
Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
△ Less
Submitted 24 October, 2024;
originally announced October 2024.
-
Scaling Analysis in a Multi-Energy System
Authors:
Jan Soeren Schwarz,
Minh Cong Pham,
Quoc Tuan Tran,
Kai Heussen
Abstract:
This paper presents a scaling study on the planning phase of a multi-energy system (MES), which is becoming increasingly prominent in the energy sector. The research aims to investigate the interactions and challenges associated with integrating heat and electrical systems and scaling their components. In this context, interaction between these two domains are investigated and the size of the dist…
▽ More
This paper presents a scaling study on the planning phase of a multi-energy system (MES), which is becoming increasingly prominent in the energy sector. The research aims to investigate the interactions and challenges associated with integrating heat and electrical systems and scaling their components. In this context, interaction between these two domains are investigated and the size of the distributed energy resources in the MES is scaled to examine the impact of sizing on the integrating networks and their controlling system. To achieve this, the paper uses sensitivity analysis and a meta-modeling technique, both incorporated in a toolbox for scaling analysis. These methodologies are validated through simulations, and the results obtained from the simulations can contribute to the advancement of MESs and their implementation in laboratory and field testing.
△ Less
Submitted 23 October, 2024;
originally announced October 2024.
-
A Toolbox for Design of Experiments for Energy Systems in Co-Simulation and Hardware Tests
Authors:
Jan Sören Schwarz,
Leonard Enrique Ramos Perez,
Minh Cong Pham,
Kai Heussen,
Quoc Tuan Tran
Abstract:
In context of highly complex energy system experiments, sensitivity analysis is gaining more and more importance to investigate the effects changing parameterization has on the outcome. Thus, it is crucial how to design an experiment to efficiently use the available resources. This paper describes the functionality of a toolbox designed to support the users in design of experiment for (co-)simulat…
▽ More
In context of highly complex energy system experiments, sensitivity analysis is gaining more and more importance to investigate the effects changing parameterization has on the outcome. Thus, it is crucial how to design an experiment to efficiently use the available resources. This paper describes the functionality of a toolbox designed to support the users in design of experiment for (co-)simulation and hardware tests. It provides a structure for object-oriented description of the parameterization and variations and performs sample generation based on this to provide a complete parameterization for the recommended experiment runs. After execution of the runs, it can also be used for analysis of the results to calculate and visualize the effects. The paper also presents two application cases using the toolbox which show how it can be implemented in sensitivity analysis studies with the co-simulation framework mosaik and a hybrid energy storage experiment.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
Time to Retrain? Detecting Concept Drifts in Machine Learning Systems
Authors:
Tri Minh Triet Pham,
Karthikeyan Premkumar,
Mohamed Naili,
Jinqiu Yang
Abstract:
With the boom of machine learning (ML) techniques, software practitioners build ML systems to process the massive volume of streaming data for diverse software engineering tasks such as failure prediction in AIOps. Trained using historical data, such ML models encounter performance degradation caused by concept drift, i.e., data and inter-relationship (concept) changes between training and product…
▽ More
With the boom of machine learning (ML) techniques, software practitioners build ML systems to process the massive volume of streaming data for diverse software engineering tasks such as failure prediction in AIOps. Trained using historical data, such ML models encounter performance degradation caused by concept drift, i.e., data and inter-relationship (concept) changes between training and production. It is essential to use concept rift detection to monitor the deployed ML models and re-train the ML models when needed. In this work, we explore applying state-of-the-art (SOTA) concept drift detection techniques on synthetic and real-world datasets in an industrial setting. Such an industrial setting requires minimal manual effort in labeling and maximal generality in ML model architecture. We find that current SOTA semi-supervised methods not only require significant labeling effort but also only work for certain types of ML models. To overcome such limitations, we propose a novel model-agnostic technique (CDSeer) for detecting concept drift. Our evaluation shows that CDSeer has better precision and recall compared to the state-of-the-art while requiring significantly less manual labeling. We demonstrate the effectiveness of CDSeer at concept drift detection by evaluating it on eight datasets from different domains and use cases. Results from internal deployment of CDSeer on an industrial proprietary dataset show a 57.1% improvement in precision while using 99% fewer labels compared to the SOTA concept drift detection method. The performance is also comparable to the supervised concept drift detection method, which requires 100% of the data to be labeled. The improved performance and ease of adoption of CDSeer are valuable in making ML systems more reliable.
△ Less
Submitted 11 October, 2024;
originally announced October 2024.
-
Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners
Authors:
Hung Manh Pham,
Aaqib Saeed,
Dong Ma
Abstract:
The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with accompanying textual reports further holds immense potential to enhance clinical diagnostics by combining physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarci…
▽ More
The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with accompanying textual reports further holds immense potential to enhance clinical diagnostics by combining physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose D-BETA, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. D-BETA uniquely combines the strengths of generative with boosted discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that D-BETA significantly outperforms existing methods, achieving an average AUC improvement of 15% in linear probing with only one percent of training data and 2% in zero-shot performance without requiring training data over state-of-the-art models. These results highlight the effectiveness of D-BETA, underscoring its potential to advance automated clinical diagnostics through multi-modal representations. Our sample code and checkpoint are made available at https://github.com/manhph2211/D-BETA.
△ Less
Submitted 7 May, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space
Authors:
Jacob Fein-Ashley,
Ethan Feng,
Minh Pham
Abstract:
Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditiona…
▽ More
Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations. This enables more effective modeling of hierarchical and relational dependencies in image data. We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization. We offer improved performance for image classification using the ImageNet dataset.
△ Less
Submitted 25 September, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
Robustness of LiDAR-Based Pose Estimation: Evaluating and Improving Odometry and Localization Under Common Point Cloud Corruptions
Authors:
Bo Yang,
Tri Minh Triet Pham,
Jinqiu Yang
Abstract:
Accurate and reliable pose estimation, i.e., determining the precise position and orientation of autonomous robots and vehicles, is critical for tasks like navigation and mapping. LiDAR is a widely used sensor for pose estimation, with odometry and localization being two primary tasks. LiDAR odometry estimates the relative motion between consecutive scans, while LiDAR localization aligns real-time…
▽ More
Accurate and reliable pose estimation, i.e., determining the precise position and orientation of autonomous robots and vehicles, is critical for tasks like navigation and mapping. LiDAR is a widely used sensor for pose estimation, with odometry and localization being two primary tasks. LiDAR odometry estimates the relative motion between consecutive scans, while LiDAR localization aligns real-time scans with a pre-recorded map to obtain a global pose. Although they have different objectives and application scenarios, both rely on point cloud registration as the underlying technique and face shared challenges of data corruption caused by adverse conditions (e.g., rain). While state-of-the-art (SOTA) pose estimation systems achieved high accuracy on clean data, their robustness to corrupted data remains unclear. In this work, we propose a framework to systematically evaluate five SOTA LiDAR pose estimation systems across 18 synthetic real-world point cloud corruptions. Our experiments reveal that odometry systems degrade significantly under specific corruptions, with relative position errors increasing from 0.5% to more than 80%, while localization systems remain highly robust. We further demonstrate that denoising techniques can effectively mitigate the adverse effects of noise-induced corruptions, and re-training learning-based systems with corrupted data significantly enhances the robustness against various corruption types.
△ Less
Submitted 4 March, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Log-concavity of the independence polynomials of $\mathbf{W}_{p}$ graphs
Authors:
Do Trong Hoang,
Vadim E. Levit,
Eugen Mandrescu,
My Hanh Pham
Abstract:
Let $G$ be a $\mathbf{W}_{p}$ graph if $n\geq p$ and every $p$ pairwise disjoint independent sets of $G$ are contained within $p$ pairwise disjoint maximum independent sets. In this paper, we establish that every $\mathbf{W}_{p}$ graph $G$ is $p$-quasi-regularizable if and only if $n\geq (p+1)α$, where $α$ is the independence number of $G$. This finding ensures that the independence polynomial of…
▽ More
Let $G$ be a $\mathbf{W}_{p}$ graph if $n\geq p$ and every $p$ pairwise disjoint independent sets of $G$ are contained within $p$ pairwise disjoint maximum independent sets. In this paper, we establish that every $\mathbf{W}_{p}$ graph $G$ is $p$-quasi-regularizable if and only if $n\geq (p+1)α$, where $α$ is the independence number of $G$. This finding ensures that the independence polynomial of a connected $\mathbf{W}_{p}$ graph $G$ is log-concave whenever $(p+1)α\leq n\leq 2pα+p+1$. Furthermore, we demonstrate that the independence polynomial of the clique corona $G\circ K_{p}$ is invariably log-concave for all $p\geq 1$. As an application, we validate a long-standing conjecture claiming that the independence polynomial of a very well-covered graph is unimodal.
△ Less
Submitted 1 September, 2024;
originally announced September 2024.
-
Mapping earth mounds from space
Authors:
Baki Uzun,
Shivam Pande,
Gwendal Cachin-Bernard,
Minh-Tan Pham,
Sébastien Lefèvre,
Rumais Blatrix,
Doyle McKey
Abstract:
Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure catastrophic shift of the ecosystem to a homogeneous desert, whil…
▽ More
Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure catastrophic shift of the ecosystem to a homogeneous desert, while termite mounds also producing spotted landscapes were shown to increase robustness to climate change. Yet, their identification at large scale calls for automatic methods, for instance using the popular deep learning framework, able to cope with a vast amount of remote sensing data, e.g., optical satellite imagery. In this paper, we tackle this problem and benchmark some state-of-the-art deep networks on several landscapes and geographical areas. Despite the promising results we obtained, we found that more research is needed to be able to map automatically these earth mounds from space.
△ Less
Submitted 31 August, 2024;
originally announced September 2024.
-
Perception-Guided Fuzzing for Simulated Scenario-Based Testing of Autonomous Driving Systems
Authors:
Tri Minh Triet Pham,
Bo Yang,
Jinqiu Yang
Abstract:
Autonomous Driving Systems (ADS) have made huge progress and started on-road testing or even commercializing trials. ADS are complex and difficult to test: they receive input data from multiple sensors and make decisions using a combination of multiple deep neural network models and code logic. The safety of ADS is of utmost importance as their misbehavior can result in costly catastrophes, includ…
▽ More
Autonomous Driving Systems (ADS) have made huge progress and started on-road testing or even commercializing trials. ADS are complex and difficult to test: they receive input data from multiple sensors and make decisions using a combination of multiple deep neural network models and code logic. The safety of ADS is of utmost importance as their misbehavior can result in costly catastrophes, including the loss of human life. In this work, we propose SimsV, which performs system-level testing on multi-module ADS. SimsV targets perception failures of ADS and further assesses the impact of perception failure on the system as a whole. SimsV leverages a high-fidelity simulator for test input and oracle generation by continuously applying predefined mutation operators. In addition, SimsV leverages various metrics to guide the testing process. We implemented a prototype SimsV for testing a commercial-grade Level 4 ADS (i.e., Apollo) using a popular open-source driving platform simulator. Our evaluation shows that SimsV is capable of finding weaknesses in the perception of Apollo. Furthermore, we show that by exploiting such weakness, SimsV finds severe problems in Apollo, including collisions.
△ Less
Submitted 24 August, 2024;
originally announced August 2024.
-
Evaluating the Robustness of LiDAR-based 3D Obstacles Detection and Its Impacts on Autonomous Driving Systems
Authors:
Tri Minh Triet Pham,
Bo Yang,
Jinqiu Yang
Abstract:
Autonomous driving systems (ADSs) require real-time input from multiple sensors to make time-sensitive decisions using deep neural networks. This makes the correctness of these decisions crucial to ADSs' adoption as errors can cause significant loss. Sensors such as LiDAR are sensitive to environmental changes and built-in inaccuracies and may fluctuate between frames. While there has been extensi…
▽ More
Autonomous driving systems (ADSs) require real-time input from multiple sensors to make time-sensitive decisions using deep neural networks. This makes the correctness of these decisions crucial to ADSs' adoption as errors can cause significant loss. Sensors such as LiDAR are sensitive to environmental changes and built-in inaccuracies and may fluctuate between frames. While there has been extensive work to test ADSs, it remains unclear whether current ADSs are robust against very subtle changes in LiDAR point cloud data. In this work, we study the impact of the built-in inaccuracies in LiDAR sensors on LiDAR-3D obstacle detection models to provide insight into how they can impact obstacle detection (i.e., robustness) and by extension trajectory prediction (i.e., how the robustness of obstacle detection would impact ADSs).
We propose a framework SORBET, that applies subtle perturbations to LiDAR data, evaluates the robustness of LiDAR-3D obstacle detection, and assesses the impacts on the trajectory prediction module and ADSs. We applied SORBET to evaluate the robustness of five classic LiDAR-3D obstacle detection models, including one from an industry-grade Level 4 ADS (Baidu's Apollo). Furthermore, we studied how changes in the obstacle detection results would negatively impact trajectory prediction in a cascading fashion. Our evaluation highlights the importance of testing the robustness of LiDAR-3D obstacle detection models against subtle perturbations. We find that even very subtle changes in point cloud data (i.e., removing two points) may introduce a non-trivial decrease in the detection performance. Furthermore, such a negative impact will further propagate to other modules, and endanger the safety of ADSs.
△ Less
Submitted 24 August, 2024;
originally announced August 2024.
-
Variational Autoencoder for Anomaly Detection: A Comparative Study
Authors:
Huy Hoang Nguyen,
Cuong Nhat Nguyen,
Xuan Tung Dao,
Quoc Trung Duong,
Dzung Pham Thi Kim,
Minh-Tan Pham
Abstract:
This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a…
▽ More
This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE). The findings reveal that ViT-VAE exhibits exemplary performance across various scenarios, whereas VAE-GRF may necessitate more intricate hyperparameter tuning to attain its optimal performance state. Additionally, to mitigate the propensity for over-reliance on results derived from the widely used MVTec dataset, this paper leverages the recently-public MiAD dataset for benchmarking. This deliberate inclusion seeks to enhance result competitiveness by alleviating the impact of domain-specific models tailored exclusively for MVTec, thereby contributing to a more robust evaluation framework. Codes is available at https://github.com/endtheme123/VAE-compare.git.
△ Less
Submitted 24 August, 2024;
originally announced August 2024.
-
Spatio-temporal neural distance fields for conditional generative modeling of the heart
Authors:
Kristine Sørensen,
Paula Diez,
Jan Margeta,
Yasmin El Youssef,
Michael Pham,
Jonas Jalili Pedersen,
Tobias Kühl,
Ole de Backer,
Klaus Kofoed,
Oscar Camara,
Rasmus Paulsen
Abstract:
The rhythmic pumping motion of the heart stands as a cornerstone in life, as it circulates blood to the entire human body through a series of carefully timed contractions of the individual chambers. Changes in the size, shape and movement of the chambers can be important markers for cardiac disease and modeling this in relation to clinical demography or disease is therefore of interest. Existing m…
▽ More
The rhythmic pumping motion of the heart stands as a cornerstone in life, as it circulates blood to the entire human body through a series of carefully timed contractions of the individual chambers. Changes in the size, shape and movement of the chambers can be important markers for cardiac disease and modeling this in relation to clinical demography or disease is therefore of interest. Existing methods for spatio-temporal modeling of the human heart require shape correspondence over time or suffer from large memory requirements, making it difficult to use for complex anatomies. We introduce a novel conditional generative model, where the shape and movement is modeled implicitly in the form of a spatio-temporal neural distance field and conditioned on clinical demography. The model is based on an auto-decoder architecture and aims to disentangle the individual variations from that related to the clinical demography. It is tested on the left atrium (including the left atrial appendage), where it outperforms current state-of-the-art methods for anatomical sequence completion and generates synthetic sequences that realistically mimics the shape and motion of the real left atrium. In practice, this means we can infer functional measurements from a static image, generate synthetic populations with specified demography or disease and investigate how non-imaging clinical data effect the shape and motion of cardiac anatomies.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
ProxyGPT: Enabling User Anonymity in LLM Chatbots via (Un)Trustworthy Volunteer Proxies
Authors:
Dzung Pham,
Jade Sheffey,
Chau Minh Pham,
Amir Houmansadr
Abstract:
Popular large language model (LLM) chatbots such as ChatGPT and Claude require users to create an account with an email or a phone number before allowing full access to their services. This practice ties users' personally identifiable information (PII) to their sensitive conversational data, thus posing significant privacy risks. Unfortunately, existing private LLM solutions based on cryptography…
▽ More
Popular large language model (LLM) chatbots such as ChatGPT and Claude require users to create an account with an email or a phone number before allowing full access to their services. This practice ties users' personally identifiable information (PII) to their sensitive conversational data, thus posing significant privacy risks. Unfortunately, existing private LLM solutions based on cryptography or trusted execution environments (TEEs) remain unpopular due to their prohibitive computational expense and platform restrictions. To enable practical user anonymity in LLM chatbots, we propose ProxyGPT, a privacy-enhancing system that leverages browser interaction proxies to submit user queries on their behalf. Unlike traditional proxy systems, ProxyGPT operates at the "user" layer by proxying user interactions with the browser in identity-required environments, thus easily supporting a wide range of chatbot services. We prevent malicious proxies by performing regular integrity audits using modern web proof protocols for TLS data provenance. We further utilize state-of-the-art LLM prompt guards on the proxy's side to mitigate unwanted user requests. Additionally, we incorporate a give-and-take economy based on Chaum's blind-signature e-cash to incentivize ProxyGPT users to proxy for others. Our system evaluation and user study demonstrate the practicality of our approach, as each chat request only takes a few additional seconds on average to fully complete. To the best of our knowledge, ProxyGPT is the first comprehensive proxy-based solution for privacy-preserving AI chatbots.
△ Less
Submitted 11 June, 2025; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Robot Shape and Location Retention in Video Generation Using Diffusion Models
Authors:
Peng Wang,
Zhihao Guo,
Abdul Latheef Sait,
Minh Huy Pham
Abstract:
Diffusion models have marked a significant milestone in the enhancement of image and video generation technologies. However, generating videos that precisely retain the shape and location of moving objects such as robots remains a challenge. This paper presents diffusion models specifically tailored to generate videos that accurately maintain the shape and location of mobile robots. This developme…
▽ More
Diffusion models have marked a significant milestone in the enhancement of image and video generation technologies. However, generating videos that precisely retain the shape and location of moving objects such as robots remains a challenge. This paper presents diffusion models specifically tailored to generate videos that accurately maintain the shape and location of mobile robots. This development offers substantial benefits to those working on detecting dangerous interactions between humans and robots by facilitating the creation of training data for collision detection models, circumventing the need for collecting data from the real world, which often involves legal and ethical issues. Our models incorporate techniques such as embedding accessible robot pose information and applying semantic mask regulation within the ConvNext backbone network. These techniques are designed to refine intermediate outputs, therefore improving the retention performance of shape and location. Through extensive experimentation, our models have demonstrated notable improvements in maintaining the shape and location of different robots, as well as enhancing overall video generation quality, compared to the benchmark diffusion model. Codes will be opensourced at \href{https://github.com/PengPaulWang/diffusion-robots}{Github}.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding
Authors:
Quang P. M. Pham,
Khoi T. N. Nguyen,
Lan C. Ngo,
Truong Do,
Truong Son Hy
Abstract:
Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, m…
▽ More
Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Interactive Topic Models with Optimal Transport
Authors:
Garima Dhanania,
Sheshera Mysore,
Chau Minh Pham,
Mohit Iyyer,
Hamed Zamani,
Andrew McCallum
Abstract:
Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of…
▽ More
Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high level theoretical framework (e.g. political ideology). In these scenarios analysts desire a topic modeling approach which incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, as an approach for label name supervised topic modeling. EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities and using optimal transport for making globally coherent topic-assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers, and topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback and while remaining robust to noisy analyst inputs.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Suri: Multi-constraint Instruction Following for Long-form Text Generation
Authors:
Chau Minh Pham,
Simeng Sun,
Mohit Iyyer
Abstract:
Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challe…
▽ More
Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challenges associated with collecting human preference judgments on long-form texts, preference-tuning algorithms such as DPO are infeasible in our setting; thus, we propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm. Instead of receiving negative feedback from dispreferred responses, I-ORPO obtains negative feedback from synthetically corrupted instructions generated by an LLM. Using Suri, we perform supervised and I-ORPO fine-tuning on Mistral-7b-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Our human evaluation shows that while both SFT and I-ORPO models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints. We release our code at https://github.com/chtmp223/suri.
△ Less
Submitted 1 October, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Leveraging knowledge distillation for partial multi-task learning from multiple remote sensing datasets
Authors:
Hoàng-Ân Lê,
Minh-Tan Pham
Abstract:
Partial multi-task learning where training examples are annotated for one of the target tasks is a promising idea in remote sensing as it allows combining datasets annotated for different tasks and predicting more tasks with fewer network parameters. The naïve approach to partial multi-task learning is sub-optimal due to the lack of all-task annotations for learning joint representations. This pap…
▽ More
Partial multi-task learning where training examples are annotated for one of the target tasks is a promising idea in remote sensing as it allows combining datasets annotated for different tasks and predicting more tasks with fewer network parameters. The naïve approach to partial multi-task learning is sub-optimal due to the lack of all-task annotations for learning joint representations. This paper proposes using knowledge distillation to replace the need of ground truths for the alternate task and enhance the performance of such approach. Experiments conducted on the public ISPRS 2D Semantic Labeling Contest dataset show the effectiveness of the proposed idea on partial multi-task learning for semantic tasks including object detection and semantic segmentation in aerial images.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
Authors:
Nastaran Saadati,
Minh Pham,
Nasla Saleem,
Joshua R. Waite,
Aditya Balu,
Zhanhong Jiang,
Chinmay Hegde,
Soumik Sarkar
Abstract:
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However, a pivotal prerequisite for achieving this level of competitiveness is the significant communication and computation overheads when updating these models, which prohibits the applications of them to real-world scenarios. To address this issue,…
▽ More
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However, a pivotal prerequisite for achieving this level of competitiveness is the significant communication and computation overheads when updating these models, which prohibits the applications of them to real-world scenarios. To address this issue, drawing inspiration from advanced model merging techniques without requiring additional training, we introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm--a novel decentralized deep learning framework. Within DIMAT, each agent is trained on their local data and periodically merged with their neighboring agents using advanced model merging techniques like activation matching until convergence is achieved. DIMAT provably converges with the best available rate for nonconvex functions with various first-order methods, while yielding tighter error bounds compared to the popular existing approaches. We conduct a comprehensive empirical analysis to validate DIMAT's superiority over baselines across diverse computer vision tasks sourced from multiple datasets. Empirical results validate our theoretical claims by showing that DIMAT attains faster and higher initial gain in accuracy with independent and identically distributed (IID) and non-IID data, incurring lower communication overhead. This DIMAT paradigm presents a new opportunity for the future decentralized learning, enhancing its adaptability to real-world with sparse and light-weight communication and computation.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Robust Concept Erasure Using Task Vectors
Authors:
Minh Pham,
Kelly O. Marshall,
Chinmay Hegde,
Niv Cohen
Abstract:
With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's pro…
▽ More
With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
△ Less
Submitted 19 February, 2025; v1 submitted 4 April, 2024;
originally announced April 2024.
-
TwinLiteNetPlus: A Real-Time Multi-Task Segmentation Model for Autonomous Driving
Authors:
Quang-Huy Che,
Duc-Tri Le,
Minh-Quan Pham,
Vinh-Tiep Nguyen,
Duc-Khai Lam
Abstract:
Semantic segmentation is crucial for autonomous driving, particularly for the tasks of Drivable Area and Lane Segmentation, ensuring safety and navigation. To address the high computational costs of current state-of-the-art (SOTA) models, this paper introduces TwinLiteNetPlus, a model capable of balancing efficiency and accuracy. TwinLiteNetPlus incorporates standard and depth-wise separable dilat…
▽ More
Semantic segmentation is crucial for autonomous driving, particularly for the tasks of Drivable Area and Lane Segmentation, ensuring safety and navigation. To address the high computational costs of current state-of-the-art (SOTA) models, this paper introduces TwinLiteNetPlus, a model capable of balancing efficiency and accuracy. TwinLiteNetPlus incorporates standard and depth-wise separable dilated convolutions, reducing complexity while maintaining high accuracy. It is available in four configurations, from the robust 1.94 million-parameter TwinLiteNetPlus_{Large} to the ultra-lightweight 34K-parameter TwinLiteNetPlus_{Nano}. Notably, TwinLiteNetPlus_{Large} attains a 92.9% mIoU (mean Intersection over Union) for Drivable Area Segmentation and a 34.2% IoU (Intersection over Union) for Lane Segmentation. These results achieve remarkable performance, surpassing current state-of-the-art models while only requiring 11 times fewer Floating Point Operations (FLOPs) for computation. Rigorously evaluated on various embedded devices, TwinLiteNetPlus demonstrates promising latency and power efficiency, underscoring its potential for real-world autonomous vehicle applications. The code is available on https://github.com/chequanghuy/TwinLiteNetPlus.
△ Less
Submitted 10 April, 2025; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Insight Into the Collocation of Multi-Source Satellite Imagery for Multi-Scale Vessel Detection
Authors:
Tran-Vu La,
Minh-Tan Pham,
Marco Chini
Abstract:
Ship detection from satellite imagery using Deep Learning (DL) is an indispensable solution for maritime surveillance. However, applying DL models trained on one dataset to others having differences in spatial resolution and radiometric features requires many adjustments. To overcome this issue, this paper focused on the DL models trained on datasets that consist of different optical images and a…
▽ More
Ship detection from satellite imagery using Deep Learning (DL) is an indispensable solution for maritime surveillance. However, applying DL models trained on one dataset to others having differences in spatial resolution and radiometric features requires many adjustments. To overcome this issue, this paper focused on the DL models trained on datasets that consist of different optical images and a combination of radar and optical data. When dealing with a limited number of training images, the performance of DL models via this approach was satisfactory. They could improve 5-20% of average precision, depending on the optical images tested. Likewise, DL models trained on the combined optical and radar dataset could be applied to both optical and radar images. Our experiments showed that the models trained on an optical dataset could be used for radar images, while those trained on a radar dataset offered very poor scores when applied to optical images.
△ Less
Submitted 23 May, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Leveraging feature communication in federated learning for remote sensing image classification
Authors:
Anh-Kiet Duong,
Hoàng-Ân Lê,
Minh-Tan Pham
Abstract:
In the realm of Federated Learning (FL) applied to remote sensing image classification, this study introduces and assesses several innovative communication strategies. Our exploration includes feature-centric communication, pseudo-weight amalgamation, and a combined method utilizing both weights and features. Experiments conducted on two public scene classification datasets unveil the effectivenes…
▽ More
In the realm of Federated Learning (FL) applied to remote sensing image classification, this study introduces and assesses several innovative communication strategies. Our exploration includes feature-centric communication, pseudo-weight amalgamation, and a combined method utilizing both weights and features. Experiments conducted on two public scene classification datasets unveil the effectiveness of these strategies, showcasing accelerated convergence, heightened privacy, and reduced network information exchange. This research provides valuable insights into the implications of feature-centric communication in FL, offering potential applications tailored for remote sensing scenarios.
△ Less
Submitted 23 May, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck
Authors:
Thang M. Pham,
Peijie Chen,
Tin Nguyen,
Seunghyun Yoon,
Trung Bui,
Anh Totti Nguyen
Abstract:
CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or the classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB - an explainable and editable classifier to (1) express the class name into a set of text descriptors that des…
▽ More
CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or the classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB - an explainable and editable classifier to (1) express the class name into a set of text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (~10x in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) on the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
△ Less
Submitted 12 April, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Low-light phase retrieval with implicit generative priors
Authors:
Raunak Manekar,
Elisa Negrini,
Minh Pham,
Daniel Jacobs,
Jaideep Srivastava,
Stanley J. Osher,
Jianwei Miao
Abstract:
Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ C…
▽ More
Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ CDI, have shown promise for low-dose imaging, but they rely on a time series of measurements, making them unsuitable for single-image applications. Similarly, data-driven phase retrieval techniques are not easily adaptable to data-scarce situations. Zero-shot deep learning methods based on pre-trained and implicit generative priors have been effective in various imaging tasks but have shown limited success in PR. In this work, we propose low-dose deep image prior (LoDIP), which combines in-situ CDI with the power of implicit generative priors to address single-image low-dose phase retrieval. Quantitative evaluations demonstrate LoDIP's superior performance in this task and its applicability to real experimental scenarios.
△ Less
Submitted 23 August, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
A Language Model for Particle Tracking
Authors:
Andris Huang,
Yash Melkani,
Paolo Calafiura,
Alina Lazar,
Daniel Thomas Murnane,
Minh-Tuan Pham,
Xiangyang Ju
Abstract:
Particle tracking is crucial for almost all physics analysis programs at the Large Hadron Collider. Deep learning models are pervasively used in particle tracking related tasks. However, the current practice is to design and train one deep learning model for one task with supervised learning techniques. The trained models work well for tasks they are trained on but show no or little generalization…
▽ More
Particle tracking is crucial for almost all physics analysis programs at the Large Hadron Collider. Deep learning models are pervasively used in particle tracking related tasks. However, the current practice is to design and train one deep learning model for one task with supervised learning techniques. The trained models work well for tasks they are trained on but show no or little generalization capabilities. We propose to unify these models with a language model. In this paper, we present a tokenized detector representation that allows us to train a BERT model for particle tracking. The trained BERT model, namely TrackingBERT, offers latent detector module embedding that can be used for other tasks. This work represents the first step towards developing a foundational model for particle detector understanding.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.