Search | arXiv e-print repository

Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio… ▽ More Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines. △ Less

Submitted 30 June, 2025; originally announced July 2025.

Comments: 42 pages, 3 composite figures, 4 tables

arXiv:2506.15842 [pdf, ps, other]

doi 10.1145/3719027.3744816

A Sea of Cyber Threats: Maritime Cybersecurity from the Perspective of Mariners

Authors: Anna Raymaker, Akshaya Kumar, Miuyin Yong Wong, Ryan Pickren, Animesh Chhotaray, Frank Li, Saman Zonouz, Raheem Beyah

Abstract: Maritime systems, including ships and ports, are critical components of global infrastructure, essential for transporting over 80% of the world's goods and supporting internet connectivity. However, these systems face growing cybersecurity threats, as shown by recent attacks disrupting Maersk, one of the world's largest shipping companies, causing widespread impacts on international trade. The uni… ▽ More Maritime systems, including ships and ports, are critical components of global infrastructure, essential for transporting over 80% of the world's goods and supporting internet connectivity. However, these systems face growing cybersecurity threats, as shown by recent attacks disrupting Maersk, one of the world's largest shipping companies, causing widespread impacts on international trade. The unique challenges of the maritime environment--such as diverse operational conditions, extensive physical access points, fragmented regulatory frameworks, and its deeply interconnected structure--require maritime-specific cybersecurity research. Despite the sector's importance, maritime cybersecurity remains underexplored, leaving significant gaps in understanding its challenges and risks. To address these gaps, we investigate how maritime system operators perceive and navigate cybersecurity challenges within this complex landscape. We conducted a user study comprising surveys and semi-structured interviews with 21 officer-level mariners. Participants reported direct experiences with shipboard cyber-attacks, including GPS spoofing and logistics-disrupting ransomware, demonstrating the real-world impact of these threats. Our findings reveal systemic and human-centric issues, such as training poorly aligned with maritime needs, insufficient detection and response tools, and serious gaps in mariners' cybersecurity understanding. Our contributions include a categorization of threats identified by mariners and recommendations for improving maritime security, including better training, response protocols, and regulation. These insights aim to guide future research and policy to strengthen the resilience of maritime systems. △ Less

Submitted 18 June, 2025; originally announced June 2025.

Comments: 18 pages, 2 figures, To appear in the Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25)

arXiv:2506.02461 [pdf, ps, other]

XToM: Exploring the Multilingual Theory of Mind for Large Language Models

Authors: Chunkit Chan, Yauwai Yim, Hongchuan Zeng, Zhiying Zou, Xinyuan Cheng, Zhifan Sun, Zheye Deng, Kawai Chung, Yuzhuo Ao, Yixiang Fan, Cheng Jiayang, Ercong Nie, Ginny Y. Wong, Helmut Schmid, Hinrich Schütze, Simon See, Yangqiu Song

Abstract: Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse lin… ▽ More Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts. △ Less

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2505.10261 [pdf]

The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

Authors: Rui Yang, Huitao Li, Matthew Yu Heng Wong, Yuhe Ke, Xin Li, Kunyu Yu, Jingchi Liao, Jonathan Chong Kai Liew, Sabarinath Vinod Nair, Jasmine Chiat Ling Ong, Irene Li, Douglas Teodoro, Chuan Hong, Daniel Shu Wei Ting, Nan Liu

Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extract… ▽ More Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications. △ Less

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2505.08414 [pdf]

An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang , et al. (6 additional authors not shown)

Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati… ▽ More Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation. △ Less

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2505.07835 [pdf, other]

Intelligent Product 3.0: Decentralised AI Agents and Web3 Intelligence Standards

Authors: Alex C. Y. Wong, Duncan McFarlane, C. Ellarby, M. Lee, M. Kuok

Abstract: Twenty-five years ago, the specification of the Intelligent Product was established, envisaging real-time connectivity that not only enables products to gather accurate data about themselves but also allows them to assess and influence their own destiny. Early work by the Auto-ID project focused on creating a single, open-standard repository for storing and retrieving product information, laying a… ▽ More Twenty-five years ago, the specification of the Intelligent Product was established, envisaging real-time connectivity that not only enables products to gather accurate data about themselves but also allows them to assess and influence their own destiny. Early work by the Auto-ID project focused on creating a single, open-standard repository for storing and retrieving product information, laying a foundation for scalable connectivity. A decade later, the approach was revisited in light of low-cost RFID systems that promised a low-cost link between physical goods and networked information environments. Since then, advances in blockchain, Web3, and artificial intelligence have introduced unprecedented levels of resilience, consensus, and autonomy. By leveraging decentralised identity, blockchain-based product information and history, and intelligent AI-to-AI collaboration, this paper examines these developments and outlines a new specification for the Intelligent Product 3.0, illustrating how decentralised and AI-driven capabilities facilitate seamless interaction between physical AI and everyday products. △ Less

Submitted 14 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

Comments: 18 pages, 1 Figure, 3 Tables; Corrected typo in Section 3.4 heading

ACM Class: I.2.11; C.3; C.2.4

arXiv:2504.13462 [pdf, other]

Stratify: Rethinking Federated Learning for Non-IID Data through Balanced Sampling

Authors: Hui Yeok Wong, Chee Kau Lim, Chee Seng Chan

Abstract: Federated Learning (FL) on non-independently and identically distributed (non-IID) data remains a critical challenge, as existing approaches struggle with severe data heterogeneity. Current methods primarily address symptoms of non-IID by applying incremental adjustments to Federated Averaging (FedAvg), rather than directly resolving its inherent design limitations. Consequently, performance signi… ▽ More Federated Learning (FL) on non-independently and identically distributed (non-IID) data remains a critical challenge, as existing approaches struggle with severe data heterogeneity. Current methods primarily address symptoms of non-IID by applying incremental adjustments to Federated Averaging (FedAvg), rather than directly resolving its inherent design limitations. Consequently, performance significantly deteriorates under highly heterogeneous conditions, as the fundamental issue of imbalanced exposure to diverse class and feature distributions remains unresolved. This paper introduces Stratify, a novel FL framework designed to systematically manage class and feature distributions throughout training, effectively tackling the root cause of non-IID challenges. Inspired by classical stratified sampling, our approach employs a Stratified Label Schedule (SLS) to ensure balanced exposure across labels, significantly reducing bias and variance in aggregated gradients. Complementing SLS, we propose a label-aware client selection strategy, restricting participation exclusively to clients possessing data relevant to scheduled labels. Additionally, Stratify incorporates a fine-grained, high-frequency update scheme, accelerating convergence and further mitigating data heterogeneity. To uphold privacy, we implement a secure client selection protocol leveraging homomorphic encryption, enabling precise global label statistics without disclosing sensitive client information. Extensive evaluations on MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet, COVTYPE, PACS, and Digits-DG demonstrate that Stratify attains performance comparable to IID baselines, accelerates convergence, and reduces client-side computation compared to state-of-the-art methods, underscoring its practical effectiveness in realistic federated learning scenarios. △ Less

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.05081 [pdf, other]

The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learni… ▽ More Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning-disrupted by the increased contextual distance of CoT rationales-often compensates, delivering correct answers despite flawed rationales. This duality explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs. △ Less

Submitted 7 April, 2025; originally announced April 2025.

Comments: 30 pages, 12 tables, 6 figures

arXiv:2504.02283 [pdf]

Ga$_2$O$_3$ TCAD Mobility Parameter Calibration using Simulation Augmented Machine Learning with Physics Informed Neural Network

Authors: Le Minh Long Nguyen, Edric Ong, Matthew Eng, Yuhao Zhang, Hiu Yung Wong

Abstract: In this paper, we demonstrate the possibility of performing automatic Technology Computer-Aided-Design (TCAD) parameter calibration using machine learning, verified with experimental data. The machine only needs to be trained by TCAD data. Schottky Barrier Diode (SBD) fabricated with emerging ultra-wide-bandgap material, Gallium Oxide (Ga$_2$O$_3$), is measured and its current-voltage (IV) is used… ▽ More In this paper, we demonstrate the possibility of performing automatic Technology Computer-Aided-Design (TCAD) parameter calibration using machine learning, verified with experimental data. The machine only needs to be trained by TCAD data. Schottky Barrier Diode (SBD) fabricated with emerging ultra-wide-bandgap material, Gallium Oxide (Ga$_2$O$_3$), is measured and its current-voltage (IV) is used for Ga$_2$O$_3$ Philips Unified Mobility (PhuMob) model parameters, effective anode workfunction, and ambient temperature extraction (7 parameters). A machine comprised of an autoencoder (AE) and a neural network (NN) (AE-NN) is used. Ga$_2$O$_3$ PhuMob parameters are extracted from the noisy experimental curves. TCAD simulation with the extracted parameters shows that the quality of the parameters is as good as an expert's calibration at the pre-turned-on regime but not in the on-state regime. By using a simple physics-informed neural network (PINN) (AE-PINN), the machine performs as well as the human expert in all regimes. △ Less

Submitted 3 April, 2025; originally announced April 2025.

Comments: 4 pages, 3 figures

arXiv:2503.18436 [pdf, other]

Distributionally Robust Federated Learning: An ADMM Algorithm

Authors: Wen Bai, Yi Wong, Xiao Qiao, Chin Pang Ho

Abstract: Federated learning (FL) aims to train machine learning (ML) models collaboratively using decentralized data, bypassing the need for centralized data aggregation. Standard FL models often assume that all data come from the same unknown distribution. However, in practical situations, decentralized data frequently exhibit heterogeneity. We propose a novel FL model, Distributionally Robust Federated L… ▽ More Federated learning (FL) aims to train machine learning (ML) models collaboratively using decentralized data, bypassing the need for centralized data aggregation. Standard FL models often assume that all data come from the same unknown distribution. However, in practical situations, decentralized data frequently exhibit heterogeneity. We propose a novel FL model, Distributionally Robust Federated Learning (DRFL), that applies distributionally robust optimization to overcome the challenges posed by data heterogeneity and distributional ambiguity. We derive a tractable reformulation for DRFL and develop a novel solution method based on the alternating direction method of multipliers (ADMM) algorithm to solve this problem. Our experimental results demonstrate that DRFL outperforms standard FL models under data heterogeneity and ambiguity. △ Less

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.12440 [pdf, ps, other]

HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

Authors: Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong

Abstract: The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs)… ▽ More The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at https://github.com/hon9kon9ize/hkeval2025 △ Less

Submitted 6 July, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.11861 [pdf, other]

Banking on Feedback: Text Analysis of Mobile Banking iOS and Google App Reviews

Authors: Yekta Amirkhalili, Ho Yi Wong

Abstract: The rapid growth of mobile banking (m-banking), especially after the COVID-19 pandemic, has reshaped the financial sector. This study analyzes consumer reviews of m-banking apps from five major Canadian banks, collected from Google Play and iOS App stores. Sentiment analysis and topic modeling classify reviews as positive, neutral, or negative, highlighting user preferences and areas for improveme… ▽ More The rapid growth of mobile banking (m-banking), especially after the COVID-19 pandemic, has reshaped the financial sector. This study analyzes consumer reviews of m-banking apps from five major Canadian banks, collected from Google Play and iOS App stores. Sentiment analysis and topic modeling classify reviews as positive, neutral, or negative, highlighting user preferences and areas for improvement. Data pre-processing was performed with NLTK, a Python language processing tool, and topic modeling used Latent Dirichlet Allocation (LDA). Sentiment analysis compared methods, with Long Short-Term Memory (LSTM) achieving 82\% accuracy for iOS reviews and Multinomial Naive Bayes 77\% for Google Play. Positive reviews praised usability, reliability, and features, while negative reviews identified login issues, glitches, and dissatisfaction with updates.This is the first study to analyze both iOS and Google Play m-banking app reviews, offering insights into app strengths and weaknesses. Findings underscore the importance of user-friendly designs, stable updates, and better customer service. Advanced text analytics provide actionable recommendations for improving user satisfaction and experience. △ Less

Submitted 14 March, 2025; originally announced March 2025.

arXiv:2502.11176 [pdf, other]

LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning

Authors: Tianshi Zheng, Jiayang Cheng, Chunyang Li, Haochen Shi, Zihao Wang, Jiaxin Bai, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Modern large language models (LLMs) employ various forms of logical inference, both implicitly and explicitly, when addressing reasoning tasks. Understanding how to optimally leverage these inference paradigms is critical for advancing LLMs' reasoning capabilities. This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning -- a fundamental… ▽ More Modern large language models (LLMs) employ various forms of logical inference, both implicitly and explicitly, when addressing reasoning tasks. Understanding how to optimally leverage these inference paradigms is critical for advancing LLMs' reasoning capabilities. This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning -- a fundamental cognitive task -- that is systematically parameterized across three dimensions: modality (textual, visual, symbolic), difficulty (easy, medium, hard), and task format (multiple-choice or free-text generation). We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines across these dimensions, and demonstrate that our findings generalize to broader in-context learning tasks. Additionally, we investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference in LLM reasoning. This exploratory study provides a foundation for future research in enhancing LLM reasoning through systematic logical inference strategies. Resources are available at https://github.com/HKUST-KnowComp/LogiDynamics. △ Less

Submitted 9 April, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

Comments: 21 pages

arXiv:2501.08710 [pdf, other]

Disentangled Interleaving Variational Encoding

Authors: Noelle Y. L. Wong, Eng Yeow Cheu, Zhonglin Chiam, Dipti Srinivasan

Abstract: Conflicting objectives present a considerable challenge in interleaving multi-task learning, necessitating the need for meticulous design and balance to ensure effective learning of a representative latent data space across all tasks without mutual negative impact. Drawing inspiration from the concept of marginal and conditional probability distributions in probability theory, we design a principl… ▽ More Conflicting objectives present a considerable challenge in interleaving multi-task learning, necessitating the need for meticulous design and balance to ensure effective learning of a representative latent data space across all tasks without mutual negative impact. Drawing inspiration from the concept of marginal and conditional probability distributions in probability theory, we design a principled and well-founded approach to disentangle the original input into marginal and conditional probability distributions in the latent space of a variational autoencoder. Our proposed model, Deep Disentangled Interleaving Variational Encoding (DeepDIVE) learns disentangled features from the original input to form clusters in the embedding space and unifies these features via the cross-attention mechanism in the fusion stage. We theoretically prove that combining the objectives for reconstruction and forecasting fully captures the lower bound and mathematically derive a loss function for disentanglement using Naïve Bayes. Under the assumption that the prior is a mixture of log-concave distributions, we also establish that the Kullback-Leibler divergence between the prior and the posterior is upper bounded by a function minimized by the minimizer of the cross entropy loss, informing our adoption of radial basis functions (RBF) and cross entropy with interleaving training for DeepDIVE to provide a justified basis for convergence. Experiments on two public datasets show that DeepDIVE disentangles the original input and yields forecast accuracies better than the original VAE and comparable to existing state-of-the-art baselines. △ Less

Submitted 16 January, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

arXiv:2412.20418 [pdf, other]

Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

Authors: Shiyun Chen, Li Lin, Pujin Cheng, ZhiCheng Jin, JianJian Chen, HaiDong Zhu, Kenneth K. Y. Wong, Xiaoying Tang

Abstract: Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper,… ▽ More Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality's mask and followed by its use in inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using the latent diffusion model based on multimodal CT features and randomly generated tumor masks; and finally, training the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods. △ Less

Submitted 29 December, 2024; originally announced December 2024.

arXiv:2412.15614 [pdf, other]

Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM

Authors: Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli

Abstract: This technical report introduces our top-ranked solution that employs two approaches, \ie suffix injection and projected gradient descent (PGD) , to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incorrectly labeled option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add… ▽ More This technical report introduces our top-ranked solution that employs two approaches, \ie suffix injection and projected gradient descent (PGD) , to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incorrectly labeled option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add imperceptible perturbations to the image. Combining these two techniques enables successful attacks on the LLaVA 1.5 model. △ Less

Submitted 20 December, 2024; originally announced December 2024.

Comments: ICML TiFA Challenge Technical Report

arXiv:2412.14757 [pdf, other]

doi 10.1145/3727123

Space-time Peer-to-Peer Distribution of Multi-party Entanglement for Any Quantum Network

Authors: Yuexun Huang, Xiangyu Ren, Bikun Li, Yat Wong, Zhiding Liang, Liang Jiang

Abstract: Graph states are a class of important multiparty entangled states, of which bell pairs are the special case. Realizing a robust and fast distribution of arbitrary graph states in the downstream layer of the quantum network can be essential for further large-scale quantum networks. We propose a novel quantum network protocol called P2PGSD inspired by the classical Peer-to-Peer (P2P) network to effi… ▽ More Graph states are a class of important multiparty entangled states, of which bell pairs are the special case. Realizing a robust and fast distribution of arbitrary graph states in the downstream layer of the quantum network can be essential for further large-scale quantum networks. We propose a novel quantum network protocol called P2PGSD inspired by the classical Peer-to-Peer (P2P) network to efficiently implement the general graph state distribution in the network layer, which demonstrates advantages in resource efficiency and scalability over existing methods for sparse graph states. An explicit mathematical model for a general graph state distribution problem has also been constructed, above which the intractability for a wide class of resource minimization problems is proved and the optimality of the existing algorithms is discussed. In addition, we leverage the spacetime quantum network inspired by the symmetry from relativity for memory management in network problems and used it to improve our proposed algorithm. The advantages of our protocols are confirmed by numerical simulations showing an improvement of up to 50% for general sparse graph states, paving the way for a resource-efficient multiparty entanglement distribution across any network topology. △ Less

Submitted 5 April, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

Journal ref: Proc. ACM Meas. Anal. Comput. Syst. 9, 2, Article 31 (June 2025)

arXiv:2412.10726 [pdf, other]

NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries

Authors: Tao Wu, Chuhao Zhou, Yen Heng Wong, Lin Gu, Jianfei Yang

Abstract: The rapid advancement of Vision-Language Models (VLMs) has significantly advanced the development of Embodied Question Answering (EQA), enhancing agents' abilities in language understanding and reasoning within complex and realistic scenarios. However, EQA in real-world scenarios remains challenging, as human-posed questions often contain noise that can interfere with an agent's exploration and re… ▽ More The rapid advancement of Vision-Language Models (VLMs) has significantly advanced the development of Embodied Question Answering (EQA), enhancing agents' abilities in language understanding and reasoning within complex and realistic scenarios. However, EQA in real-world scenarios remains challenging, as human-posed questions often contain noise that can interfere with an agent's exploration and response, bringing challenges especially for language beginners and non-expert users. To address this, we introduce a NoisyEQA benchmark designed to evaluate an agent's ability to recognize and correct noisy questions. This benchmark introduces four common types of noise found in real-world applications: Latent Hallucination Noise, Memory Noise, Perception Noise, and Semantic Noise generated through an automated dataset creation framework. Additionally, we also propose a 'Self-Correction' prompting mechanism and a new evaluation metric to enhance and measure both noise detection capability and answer quality. Our comprehensive evaluation reveals that current EQA agents often struggle to detect noise in questions, leading to responses that frequently contain erroneous information. Through our Self-Correct Prompting mechanism, we can effectively improve the accuracy of agent answers. △ Less

Submitted 14 December, 2024; originally announced December 2024.

arXiv:2412.06820 [pdf]

doi 10.1016/j.neucom.2024.129053

Artificial Intelligence without Restriction Surpassing Human Intelligence with Probability One: Theoretical Insight into Secrets of the Brain with AI Twins of the Brain

Authors: Guang-Bin Huang, M. Brandon Westover, Eng-King Tan, Haibo Wang, Dongshun Cui, Wei-Ying Ma, Tiantong Wang, Qi He, Haikun Wei, Ning Wang, Qiyuan Tian, Kwok-Yan Lam, Xin Yao, Tien Yin Wong

Abstract: Artificial Intelligence (AI) has apparently become one of the most important techniques discovered by humans in history while the human brain is widely recognized as one of the most complex systems in the universe. One fundamental critical question which would affect human sustainability remains open: Will artificial intelligence (AI) evolve to surpass human intelligence in the future? This paper… ▽ More Artificial Intelligence (AI) has apparently become one of the most important techniques discovered by humans in history while the human brain is widely recognized as one of the most complex systems in the universe. One fundamental critical question which would affect human sustainability remains open: Will artificial intelligence (AI) evolve to surpass human intelligence in the future? This paper shows that in theory new AI twins with fresh cellular level of AI techniques for neuroscience could approximate the brain and its functioning systems (e.g. perception and cognition functions) with any expected small error and AI without restrictions could surpass human intelligence with probability one in the end. This paper indirectly proves the validity of the conjecture made by Frank Rosenblatt 70 years ago about the potential capabilities of AI, especially in the realm of artificial neural networks. Intelligence is just one of fortuitous but sophisticated creations of the nature which has not been fully discovered. Like mathematics and physics, with no restrictions artificial intelligence would lead to a new subject with its self-contained systems and principles. We anticipate that this paper opens new doors for 1) AI twins and other AI techniques to be used in cellular level of efficient neuroscience dynamic analysis, functioning analysis of the brain and brain illness solutions; 2) new worldwide collaborative scheme for interdisciplinary teams concurrently working on and modelling different types of neurons and synapses and different level of functioning subsystems of the brain with AI techniques; 3) development of low energy of AI techniques with the aid of fundamental neuroscience properties; and 4) new controllable, explainable and safe AI techniques with reasoning capabilities of discovering principles in nature. △ Less

Submitted 4 December, 2024; originally announced December 2024.

Comments: Accepted by journal Neurocomputing

arXiv:2412.04629 [pdf, ps, other]

Argumentative Experience: Reducing Confirmation Bias on Controversial Issues through LLM-Generated Multi-Persona Debates

Authors: Li Shi, Houjiang Liu, Yian Wong, Utkarsh Mujumdar, Dan Zhang, Jacek Gwizdka, Matthew Lease

Abstract: Large language models (LLMs) are enabling designers to give life to exciting new user experiences for information access. In this work, we present a system that generates LLM personas to debate a topic of interest from different perspectives. How might information seekers use and benefit from such a system? Can centering information access around diverse viewpoints help to mitigate thorny challeng… ▽ More Large language models (LLMs) are enabling designers to give life to exciting new user experiences for information access. In this work, we present a system that generates LLM personas to debate a topic of interest from different perspectives. How might information seekers use and benefit from such a system? Can centering information access around diverse viewpoints help to mitigate thorny challenges like confirmation bias in which information seekers over-trust search results matching existing beliefs? How do potential biases and hallucinations in LLMs play out alongside human users who are also fallible and possibly biased? Our study exposes participants to multiple viewpoints on controversial issues via a mixed-methods, within-subjects study. We use eye-tracking metrics to quantitatively assess cognitive engagement alongside qualitative feedback. Compared to a baseline search system, we see more creative interactions and diverse information-seeking with our multi-persona debate system, which more effectively reduces user confirmation bias and conviction toward their initial beliefs. Overall, our study contributes to the emerging design space of LLM-based information access systems, specifically investigating the potential of simulated personas to promote greater exposure to information diversity, emulate collective intelligence, and mitigate bias in information seeking. △ Less

Submitted 29 May, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

arXiv:2410.20635 [pdf, other]

Generating and Optimizing Topologically Distinct Guesses for Mobile Manipulator Path Planning

Authors: Rufus Cheuk Yin Wong, Mayank Sewlia, Adrian Wiltz, Dimos V. Dimarogonas

Abstract: Optimal path planning often suffers from getting stuck in a local optimum. This is often the case for mobile manipulators due to nonconvexities induced by obstacles and robot kinematics. This paper attempts to circumvent this issue by proposing a pipeline to obtain multiple distinct local optima. By evaluating and selecting the optimum among multiple distinct local optima, it is likely to obtain a… ▽ More Optimal path planning often suffers from getting stuck in a local optimum. This is often the case for mobile manipulators due to nonconvexities induced by obstacles and robot kinematics. This paper attempts to circumvent this issue by proposing a pipeline to obtain multiple distinct local optima. By evaluating and selecting the optimum among multiple distinct local optima, it is likely to obtain a closer approximation of the global optimum. We demonstrate this capability in optimal path planning of nonholonomic mobile manipulators in the presence of obstacles and subject to end effector path constraints. The nonholomicity, obstacles, and end effector path constraints often cause direct optimal path planning approaches to get stuck in local optima. We demonstrate that our pipeline is able to circumvent this issue and produce a final local optimum that is close to the global optimum. △ Less

Submitted 27 October, 2024; originally announced October 2024.

arXiv:2410.20436 [pdf, other]

CoralSCOP-LAT: Labeling and Analyzing Tool for Coral Reef Images with Dense Mask

Authors: Yuk-Kwan Wong, Ziqiang Zheng, Mingzhe Zhang, David Suggett, Sai-Kit Yeung

Abstract: Images of coral reefs provide invaluable information, which is essentially critical for surveying and monitoring the coral reef ecosystems. Robust and precise identification of coral reef regions within surveying imagery is paramount for assessing coral coverage, spatial distribution, and other statistical analyses. However, existing coral reef analytical approaches mainly focus on sparse points s… ▽ More Images of coral reefs provide invaluable information, which is essentially critical for surveying and monitoring the coral reef ecosystems. Robust and precise identification of coral reef regions within surveying imagery is paramount for assessing coral coverage, spatial distribution, and other statistical analyses. However, existing coral reef analytical approaches mainly focus on sparse points sampled from the whole imagery, which are highly subject to the sampling density and cannot accurately express the coral ambulance. Meanwhile, the analysis is both time-consuming and labor-intensive, and it is also limited to coral biologists. In this work, we propose CoralSCOP-LAT, an automatic and semi-automatic coral reef labeling and analysis tool, specially designed to segment coral reef regions (dense pixel masks) in coral reef images, significantly promoting analysis proficiency and accuracy. CoralSCOP-LAT leverages the advanced coral reef foundation model to accurately delineate coral regions, supporting dense coral reef analysis and reducing the dependency on manual annotation. The proposed CoralSCOP-LAT surpasses the existing tools by a large margin from analysis efficiency, accuracy, and flexibility. We perform comprehensive evaluations from various perspectives and the comparison demonstrates that CoralSCOP-LAT not only accelerates the coral reef analysis but also improves accuracy in coral segmentation and analysis. Our CoralSCOP-LAT, as the first dense coral reef analysis tool in the market, facilitates repeated large-scale coral reef monitoring analysis, contributing to more informed conservation efforts and sustainable management of coral reef ecosystems. Our tool will be available at https://coralscop.hkustvgd.com/. △ Less

Submitted 27 October, 2024; originally announced October 2024.

Comments: The coral reef labeling and analysis tool is available at https://coralscop.hkustvgd.com/

arXiv:2410.04239 [pdf, other]

Persona Knowledge-Aligned Prompt Tuning Method for Online Debate

Authors: Chunkit Chan, Cheng Jiayang, Xin Liu, Yauwai Yim, Yuxin Jiang, Zheye Deng, Haoran Li, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of languages, such as linguistic features and discourse structures, but combining argument… ▽ More Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of languages, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to the difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience personae knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures. △ Less

Submitted 5 October, 2024; originally announced October 2024.

Comments: Accepted to ECAI 2024

arXiv:2409.19738 [pdf, other]

doi 10.1145/3613904.3642771

The Future of HCI-Policy Collaboration

Authors: Qian Yang, Richmond Y Wong, Steven J Jackson, Sabine Junginger, Margaret D Hagan, Thomas Gilbert, John Zimmerman

Abstract: Policies significantly shape computation's societal impact, a crucial HCI concern. However, challenges persist when HCI professionals attempt to integrate policy into their work or affect policy outcomes. Prior research considered these challenges at the ``border'' of HCI and policy. This paper asks: What if HCI considers policy integral to its intellectual concerns, placing system-people-policy i… ▽ More Policies significantly shape computation's societal impact, a crucial HCI concern. However, challenges persist when HCI professionals attempt to integrate policy into their work or affect policy outcomes. Prior research considered these challenges at the ``border'' of HCI and policy. This paper asks: What if HCI considers policy integral to its intellectual concerns, placing system-people-policy interaction not at the border but nearer the center of HCI research, practice, and education? What if HCI fosters a mosaic of methods and knowledge contributions that blend system, human, and policy expertise in various ways, just like HCI has done with blending system and human expertise? We present this re-imagined HCI-policy relationship as a provocation and highlight its usefulness: It spotlights previously overlooked system-people-policy interaction work in HCI. It unveils new opportunities for HCI's futuring, empirical, and design projects. It allows HCI to coordinate its diverse policy engagements, enhancing its collective impact on policy outcomes. △ Less

Submitted 29 September, 2024; originally announced September 2024.

Comments: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24)

arXiv:2408.15287 [pdf, other]

Quantum-Powered Personalized Learning

Authors: Yifan Zhou, Chong Cheng Xu, Mingi Song, Yew Kee Wong

Abstract: This paper explores the transformative potential of quantum computing in the realm of personalized learning. Traditional machine learning models and GPU-based approaches have long been utilized to tailor educational experiences to individual student needs. However, these methods face significant challenges in terms of scalability, computational efficiency, and real-time adaptation to the dynamic n… ▽ More This paper explores the transformative potential of quantum computing in the realm of personalized learning. Traditional machine learning models and GPU-based approaches have long been utilized to tailor educational experiences to individual student needs. However, these methods face significant challenges in terms of scalability, computational efficiency, and real-time adaptation to the dynamic nature of educational data. This study proposes leveraging quantum computing to address these limitations. We review existing personalized learning systems, classical machine learning methods, and emerging quantum computing applications in education. We then outline a protocol for data collection, privacy preservation using quantum techniques, and preprocessing, followed by the development and implementation of quantum algorithms specifically designed for personalized learning. Our findings indicate that quantum algorithms offer substantial improvements in efficiency, scalability, and personalization quality compared to classical methods. This paper discusses the implications of integrating quantum computing into educational systems, highlighting the potential for enhanced teaching methodologies, curriculum design, and overall student experiences. We conclude by summarizing the advantages of quantum computing in education and suggesting future research directions. △ Less

Submitted 25 August, 2024; originally announced August 2024.

Comments: 9 pages, 2 figures

arXiv:2408.07921 [pdf]

Physics-Informed Neural Network for Predicting Out-of-Training-Range TCAD Solution with Minimized Domain Expertise

Authors: Albert Lu, Yu Foon Chau, Hiu Yung Wong

Abstract: Machine learning (ML) is promising in assisting technology computer-aided design (TCAD) simulations to alleviate difficulty in convergence and prolonged simulation time. While ML is widely used in TCAD, they either require access to the internal solver, require extensive domain expertise, are only trained by terminal quantities such as currents and voltages, and/or lack out-of-training-range predi… ▽ More Machine learning (ML) is promising in assisting technology computer-aided design (TCAD) simulations to alleviate difficulty in convergence and prolonged simulation time. While ML is widely used in TCAD, they either require access to the internal solver, require extensive domain expertise, are only trained by terminal quantities such as currents and voltages, and/or lack out-of-training-range prediction capability. In this paper, using Si nanowire as an example, we demonstrate that it is possible to use a physics-informed neural network (PINN) to predict out-of-training-range TCAD solutions without accessing the internal solver and with minimal domain expertise. The machine not only can predict a 2.5 times larger range than the training but also can predict the inversion region by only being trained with subthreshold region data. The physics-informed module is also trained with data without the need for human-coded equations making this easier to be extended to more sophisticated systems. △ Less

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.05564 [pdf, other]

Meta-heuristic Optimizer Inspired by the Philosophy of Yi Jing

Authors: Yisheng Yang, Sim Kuan Goh, Qing Cai, Shen Yuong Wong, Ho-Kin Tang

Abstract: Drawing inspiration from the philosophy of Yi Jing, the Yin-Yang pair optimization (YYPO) algorithm has been shown to achieve competitive performance in single objective optimizations, in addition to the advantage of low time complexity when compared to other population-based meta-heuristics. Building upon a reversal concept in Yi Jing, we propose the novel Yi optimization (YI) algorithm. Specific… ▽ More Drawing inspiration from the philosophy of Yi Jing, the Yin-Yang pair optimization (YYPO) algorithm has been shown to achieve competitive performance in single objective optimizations, in addition to the advantage of low time complexity when compared to other population-based meta-heuristics. Building upon a reversal concept in Yi Jing, we propose the novel Yi optimization (YI) algorithm. Specifically, we enhance the Yin-Yang pair in YYPO with a proposed Yi-point, in which we use Cauchy flight to update the solution, by implementing both the harmony and reversal concept of Yi Jing. The proposed Yi-point balances both the effort of exploration and exploitation in the optimization process. To examine YI, we use the IEEE CEC 2017 benchmarks and compare YI against the dynamical YYPO, CV1.0 optimizer, and four classical optimizers, i.e., the differential evolution, the genetic algorithm, the particle swarm optimization, and the simulated annealing. According to the experimental results, YI shows highly competitive performance while keeping the low time complexity. The results of this work have implications for enhancing a meta-heuristic optimizer using the philosophy of Yi Jing. While this work implements only certain aspects of Yi Jing, we envisage enhanced performance by incorporating other aspects. △ Less

Submitted 10 August, 2024; originally announced August 2024.

Comments: This work has been submitted to the IEEE for possible publication. arXiv admin note: substantial text overlap with arXiv:2104.08564

arXiv:2407.07666 [pdf]

A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability

Authors: Ting Fang Tan, Kabilan Elangovan, Jasmine Ong, Nigam Shah, Joseph Sung, Tien Yin Wong, Lan Xue, Nan Liu, Haibo Wang, Chang Fu Kuo, Simon Chesterman, Zee Kin Yeong, Daniel SW Ting

Abstract: A comprehensive qualitative evaluation framework for large language models (LLM) in healthcare that expands beyond traditional accuracy and quantitative metrics needed. We propose 5 key aspects for evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models… ▽ More A comprehensive qualitative evaluation framework for large language models (LLM) in healthcare that expands beyond traditional accuracy and quantitative metrics needed. We propose 5 key aspects for evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications. △ Less

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2406.15527 [pdf, other]

Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling

Authors: Cong Xu, Gayathri Saranathan, Mahammad Parwez Alam, Arpit Shah, James Lim, Soon Yee Wong, Foltin Martin, Suparna Bhattacharya

Abstract: Evaluating LLMs and text-to-image models is a computationally intensive task often overlooked. Efficient evaluation is crucial for understanding the diverse capabilities of these models and enabling comparisons across a growing number of new models and benchmarks. To address this, we introduce SubLIME, a data-efficient evaluation framework that employs adaptive sampling techniques, such as cluster… ▽ More Evaluating LLMs and text-to-image models is a computationally intensive task often overlooked. Efficient evaluation is crucial for understanding the diverse capabilities of these models and enabling comparisons across a growing number of new models and benchmarks. To address this, we introduce SubLIME, a data-efficient evaluation framework that employs adaptive sampling techniques, such as clustering and quality-based methods, to create representative subsets of benchmarks. Our approach ensures statistically aligned model rankings compared to full datasets, evidenced by high Pearson correlation coefficients. Empirical analysis across six NLP benchmarks reveals that: (1) quality-based sampling consistently achieves strong correlations (0.85 to 0.95) with full datasets at a 10\% sampling rate such as Quality SE and Quality CPD (2) clustering methods excel in specific benchmarks such as MMLU (3) no single method universally outperforms others across all metrics. Extending this framework, we leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks. SubLIME dynamically selects the optimal technique for each benchmark, significantly reducing evaluation costs while preserving ranking integrity and score distribution. Notably, a minimal sampling rate of 1% proves effective for benchmarks like MMLU. Additionally, we demonstrate that employing difficulty-based sampling to target more challenging benchmark segments enhances model differentiation with broader score distributions. We also combine semantic search, tool use, and GPT-4 review to identify redundancy across benchmarks within specific LLM categories, such as coding benchmarks. This allows us to further reduce the number of samples needed to maintain targeted rank preservation. Overall, SubLIME offers a versatile and cost-effective solution for the robust evaluation of LLMs and text-to-image models. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.04629 [pdf, other]

STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Mohan Kankanhalli

Abstract: The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Di… ▽ More The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit domain gap and cannot preserve view-consistency using only T2I priors, and (2) For post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to the pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies shows the contributions of each component in STAR. The source code and demos are available at: \href{https://star-avatar.github.io}{https://star-avatar.github.io}. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Tech report

arXiv:2405.16654 [pdf, other]

doi 10.1145/3643834.3660714

Ethics Pathways: A Design Activity for Reflecting on Ethics Engagement in HCI Research

Authors: Inha Cha, Ajit G. Pillai, Richmond Y. Wong

Abstract: This paper introduces Ethics Pathways, a design activity aimed at understanding HCI and design researchers' ethics engagements and flows during their research process. Despite a strong ethical commitment in these fields, challenges persist in grasping the complexity of researchers' engagement with ethics -- practices conducted to operationalize ethics -- in situated institutional contexts. Ethics… ▽ More This paper introduces Ethics Pathways, a design activity aimed at understanding HCI and design researchers' ethics engagements and flows during their research process. Despite a strong ethical commitment in these fields, challenges persist in grasping the complexity of researchers' engagement with ethics -- practices conducted to operationalize ethics -- in situated institutional contexts. Ethics Pathways, developed through six playtesting sessions, offers a design approach to understanding the complexities of researchers' past ethics engagements in their work. This activity involves four main tasks: recalling ethical incidents; describing stakeholders involved in the situation; recounting their actions or speculative alternatives; and reflection and emotion walk-through. The paper reflects on the role of design decisions and facilitation strategies in achieving these goals. The design activity contributes to the discourse on ethical HCI research by conceptualizing ethics engagement as a part of ongoing research processing, highlighting connections between individual affective experiences, social interactions across power differences, and institutional goals. △ Less

Submitted 26 May, 2024; originally announced May 2024.

Comments: Accepted at ACM Designing Interactive Systems (DIS) 2024

arXiv:2405.13911 [pdf, other]

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang

Abstract: Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduc… ▽ More Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representation. Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, Egoschema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5-based video agents. △ Less

Submitted 3 November, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: NeurIPS 2024 (Spotlight)

arXiv:2405.12538 [pdf, other]

Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

Authors: Yi Cheng, Ziwei Xu, Dongyun Lin, Harry Cheng, Yongkang Wong, Ying Sun, Joo Hwee Lim, Mohan Kankanhalli

Abstract: For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leadi… ▽ More For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leading to a mismatch between the desired and generated output. Second, generative models trained on visual-label pairs lack the comprehensive knowledge to accurately represent all aspects of the input data in their generated outputs. To address these challenges, we propose a knowledge-enhanced iterative refinement framework for visual content generation. We begin by analyzing and identifying the key challenges faced by existing generative models. Then, we introduce various knowledge sources, including human insights, pre-trained models, logic rules, and world knowledge, which can be leveraged to address these challenges. Furthermore, we propose a novel visual generation framework that incorporates a knowledge-based feedback module to iteratively refine the generation process. This module gradually improves the alignment between the generated content and user intentions. We demonstrate the efficacy of the proposed framework through preliminary results, highlighting the potential of knowledge-enhanced generative models for intention-aligned content generation. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.10904 [pdf, other]

doi 10.1145/3563657.3596012

Broadening Privacy and Surveillance: Eliciting Interconnected Values with a Scenarios Workbook on Smart Home Cameras

Authors: Richmond Y. Wong, Jason Caleb Valdez, Ashten Alexander, Ariel Chiang, Olivia Quesada, James Pierce

Abstract: We use a design workbook of speculative scenarios as a values elicitation activity with 14 participants. The workbook depicts use case scenarios with smart home camera technologies that involve surveillance and uneven power relations. The scenarios were initially designed by the researchers to explore scenarios of privacy and surveillance within three social relationships involving "primary" and "… ▽ More We use a design workbook of speculative scenarios as a values elicitation activity with 14 participants. The workbook depicts use case scenarios with smart home camera technologies that involve surveillance and uneven power relations. The scenarios were initially designed by the researchers to explore scenarios of privacy and surveillance within three social relationships involving "primary" and "non-primary" users: Parents-Children, Landlords-Tenants, and Residents-Domestic Workers. When the scenarios were utilized as part of a values elicitation activity with participants, we found that they reflected on a broader set of interconnected social values beyond privacy and surveillance, including autonomy and agency, physical safety, property rights, trust and accountability, and fairness. The paper suggests that future research about ethical issues in smart homes should conceptualize privacy as interconnected with a broader set of social values (which can align or be in tension with privacy), and reflects on considerations for doing research with non-primary users. △ Less

Submitted 17 May, 2024; originally announced May 2024.

Comments: Proceedings of the 2023 ACM Designing Interactive Systems Conference (DIS '23)

arXiv:2403.16304 [pdf]

SoK: An Essential Guide For Using Malware Sandboxes In Security Applications: Challenges, Pitfalls, and Lessons Learned

Authors: Omar Alrawi, Miuyin Yong Wong, Athanasios Avgetidis, Kevin Valakuzhy, Boladji Vinny Adjibi, Konstantinos Karakatsanis, Mustaque Ahamad, Doug Blough, Fabian Monrose, Manos Antonakakis

Abstract: Malware sandboxes provide many benefits for security applications, but they are complex. These complexities can overwhelm new users in different research areas and make it difficult to select, configure, and use sandboxes. Even worse, incorrectly using sandboxes can have a negative impact on security applications. In this paper, we address this knowledge gap by systematizing 84 representative pape… ▽ More Malware sandboxes provide many benefits for security applications, but they are complex. These complexities can overwhelm new users in different research areas and make it difficult to select, configure, and use sandboxes. Even worse, incorrectly using sandboxes can have a negative impact on security applications. In this paper, we address this knowledge gap by systematizing 84 representative papers for using x86/64 malware sandboxes in the academic literature. We propose a novel framework to simplify sandbox components and organize the literature to derive practical guidelines for using sandboxes. We evaluate the proposed guidelines systematically using three common security applications and demonstrate that the choice of different sandboxes can significantly impact the results. Specifically, our results show that the proposed guidelines improve the sandbox observable activities by at least 1.6x and up to 11.3x. Furthermore, we observe a roughly 25% improvement in accuracy, precision, and recall when using the guidelines to help with a malware family classification task. We conclude by affirming that there is no "silver bullet" sandbox deployment that generalizes, and we recommend that users apply our framework to define a scope for their analysis, a threat model, and derive context about how the sandbox artifacts will influence their intended use case. Finally, it is important that users document their experiment, limitations, and potential solutions for reproducibility △ Less

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.12381 [pdf, other]

Explainable AutoML (xAutoML) with adaptive modeling for yield enhancement in semiconductor smart manufacturing

Authors: Weihong Zhai, Xiupeng Shi, Yiik Diew Wong, Qing Han, Lisheng Chen

Abstract: Enhancing yield is recognized as a paramount driver to reducing production costs in semiconductor smart manufacturing. However, optimizing and ensuring high yield rates is a highly complex and technical challenge, especially while maintaining reliable yield diagnosis and prognosis, and this shall require understanding all the confounding factors in a complex condition. This study proposes a domain… ▽ More Enhancing yield is recognized as a paramount driver to reducing production costs in semiconductor smart manufacturing. However, optimizing and ensuring high yield rates is a highly complex and technical challenge, especially while maintaining reliable yield diagnosis and prognosis, and this shall require understanding all the confounding factors in a complex condition. This study proposes a domain-specific explainable automated machine learning technique (termed xAutoML), which autonomously self-learns the optimal models for yield prediction, with an extent of explainability, and also provides insights on key diagnosis factors. The xAutoML incorporates tailored problem-solving functionalities in an auto-optimization pipeline to address the intricacies of semiconductor yield enhancement. Firstly, to capture the key diagnosis factors, knowledge-informed feature extraction coupled with model-agnostic key feature selection is designed. Secondly, combined algorithm selection and hyperparameter tuning with adaptive loss are developed to generate optimized classifiers for better defect prediction, and adaptively evolve in response to shifting data patterns. Moreover, a suite of explainability tools is provided throughout the AutoML pipeline, enhancing user understanding and fostering trust in the automated processes. The proposed xAutoML exhibits superior performance, with domain-specific refined countermeasures, adaptive optimization capabilities, and embedded explainability. Findings exhibit that the proposed xAutoML is a compelling solution for semiconductor yield improvement, defect diagnosis, and related applications. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.17502 [pdf, other]

FedLPPA: Learning Personalized Prompt and Aggregation for Federated Weakly-supervised Medical Image Segmentation

Authors: Li Lin, Yixiang Liu, Jiewei Wu, Pujin Cheng, Zhiyuan Cai, Kenneth K. Y. Wong, Xiaoying Tang

Abstract: Federated learning (FL) effectively mitigates the data silo challenge brought about by policies and privacy concerns, implicitly harnessing more data for deep model training. However, traditional centralized FL models grapple with diverse multi-center data, especially in the face of significant data heterogeneity, notably in medical contexts. In the realm of medical image segmentation, the growing… ▽ More Federated learning (FL) effectively mitigates the data silo challenge brought about by policies and privacy concerns, implicitly harnessing more data for deep model training. However, traditional centralized FL models grapple with diverse multi-center data, especially in the face of significant data heterogeneity, notably in medical contexts. In the realm of medical image segmentation, the growing imperative to curtail annotation costs has amplified the importance of weakly-supervised techniques which utilize sparse annotations such as points, scribbles, etc. A pragmatic FL paradigm shall accommodate diverse annotation formats across different sites, which research topic remains under-investigated. In such context, we propose a novel personalized FL framework with learnable prompt and aggregation (FedLPPA) to uniformly leverage heterogeneous weak supervision for medical image segmentation. In FedLPPA, a learnable universal knowledge prompt is maintained, complemented by multiple learnable personalized data distribution prompts and prompts representing the supervision sparsity. Integrated with sample features through a dual-attention mechanism, those prompts empower each local task decoder to adeptly adjust to both the local distribution and the supervision form. Concurrently, a dual-decoder strategy, predicated on prompt similarity, is introduced for enhancing the generation of pseudo-labels in weakly-supervised learning, alleviating overfitting and noise accumulation inherent to local data, while an adaptable aggregation method is employed to customize the task decoder on a parameter-wise basis. Extensive experiments on four distinct medical image segmentation tasks involving different modalities underscore the superiority of FedLPPA, with its efficacy closely parallels that of fully supervised centralized training. Our code and data will be available. △ Less

Submitted 31 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 12 pages, 10 figures

arXiv:2402.10981 [pdf]

doi 10.1109/LAEDC61552.2024.10555838

Stuck-at Faults in ReRAM Neuromorphic Circuit Array and their Correction through Machine Learning

Authors: Vedant Sawal, Hiu Yung Wong

Abstract: In this paper, we study the inference accuracy of the Resistive Random Access Memory (ReRAM) neuromorphic circuit due to stuck-at faults (stuck-on, stuck-off, and stuck at a certain resistive value). A simulation framework using Python is used to perform supervised machine learning (neural network with 3 hidden layers, 1 input layer, and 1 output layer) of handwritten digits and construct a corres… ▽ More In this paper, we study the inference accuracy of the Resistive Random Access Memory (ReRAM) neuromorphic circuit due to stuck-at faults (stuck-on, stuck-off, and stuck at a certain resistive value). A simulation framework using Python is used to perform supervised machine learning (neural network with 3 hidden layers, 1 input layer, and 1 output layer) of handwritten digits and construct a corresponding fully analog neuromorphic circuit (4 synaptic arrays) simulated by Spectre. A generic 45nm Process Development Kit (PDK) was used. We study the difference in the inference accuracy degradation due to stuck-on and stuck-off defects. Various defect patterns are studied including circular, ring, row, column, and circular-complement defects. It is found that stuck-on and stuck-off defects have a similar effect on inference accuracy. However, it is also found that if there is a spatial defect variation across the columns, the inference accuracy may be degraded significantly. We also propose a machine learning (ML) strategy to recover the inference accuracy degradation due to stuck-at faults. The inference accuracy is improved from 48% to 85% in a defective neuromorphic circuit. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.10646 [pdf, other]

AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation

Authors: Zhaowei Wang, Wei Fan, Qing Zong, Hongming Zhang, Sehyun Choi, Tianqing Fang, Xin Liu, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LL… ▽ More Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LLMs in capturing the underlying rationale of abstraction. Meanwhile, we introduce a plausibility estimator to select instructions that are more consistent with the abstraction knowledge of LLMs to be aligned. Then, our framework combines abstraction instructions with general-purpose ones to build a hybrid dataset. Extensive experiments and analyses demonstrate that our framework can considerably enhance LLMs' abstraction ability with strong generalization performance while maintaining their general instruction-following abilities. △ Less

Submitted 17 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024

arXiv:2402.09108 [pdf, other]

Novel Long Distance Free Space Quantum Secure Direct Communication for Web 3.0 Networks

Authors: Yifan Zhou, Xinlin Zhou, Zi Yan Li, Yew Kee Wong, Yan Shing Liang

Abstract: With the advent of Web 3.0, the swift advancement of technology confronts an imminent threat from quantum computing. Security protocols safeguarding the integrity of Web 2.0 and Web 3.0 are growing more susceptible to both quantum attacks and sophisticated classical threats. The article introduces our novel long-distance free-space quantum secure direct communication (LF QSDC) as a method to safeg… ▽ More With the advent of Web 3.0, the swift advancement of technology confronts an imminent threat from quantum computing. Security protocols safeguarding the integrity of Web 2.0 and Web 3.0 are growing more susceptible to both quantum attacks and sophisticated classical threats. The article introduces our novel long-distance free-space quantum secure direct communication (LF QSDC) as a method to safeguard against security breaches in both quantum and classical contexts. Differing from techniques like quantum key distribution (QKD), LF QSDC surpasses constraints by facilitating encrypted data transmission sans key exchanges, thus diminishing the inherent weaknesses of key-based systems. The distinctiveness of this attribute, coupled with its quantum mechanics base, protects against quantum computer assaults and advanced non-quantum dangers, harmonizing seamlessly with the untrustworthy tenets of the Web 3.0 age. The focus of our study is the technical design and incorporation of LF QSDC into web 3.0 network infrastructures, highlighting its efficacy for extended-range communication. LF QSDC is based on the memory DL04 protocol and enhanced with our novel Quantum-Aware Low-Density Parity Check (LDPC), Pointing, Acquisition, and Tracking (PAT) technologies, and Atmospheric Quantum Correction Algorithm (AQCA). Utilizing this method not only bolsters the security of worldwide Web 3.0 networks but also guarantees their endurance in a time when quantum and sophisticated classical threats exist simultaneously. Consequently, LF QSDC stands out as a robust security solution, well-suited for Web 3.0 systems amidst the constantly evolving digital environment. △ Less

Submitted 29 August, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: 17 pages, 6 figures

arXiv:2401.14074 [pdf, other]

ProCNS: Progressive Prototype Calibration and Noise Suppression for Weakly-Supervised Medical Image Segmentation

Authors: Y. Liu, L. Lin, K. K. Y. Wong, X. Tang

Abstract: Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate the conflict between annotation cost and model performance by adopting sparse annotation formats (e.g., point, scribble, block, etc.). Typical approaches attempt to exploit anatomy and topology priors to directly expand sparse annotations into pseudo-labels. However, due to a lack of attention to the ambiguous edges in medi… ▽ More Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate the conflict between annotation cost and model performance by adopting sparse annotation formats (e.g., point, scribble, block, etc.). Typical approaches attempt to exploit anatomy and topology priors to directly expand sparse annotations into pseudo-labels. However, due to a lack of attention to the ambiguous edges in medical images and insufficient exploration of sparse supervision, existing approaches tend to generate erroneous and overconfident pseudo proposals in noisy regions, leading to cumulative model error and performance degradation. In this work, we propose a novel WSS approach, named ProCNS, encompassing two synergistic modules devised with the principles of progressive prototype calibration and noise suppression. Specifically, we design a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the pair-wise affinities between spatial and semantic elements, providing our model of interest with more reliable guidance. The affinities are derived from the input images and the prototype-refined predictions. Meanwhile, we propose an Adaptive Noise Perception and Masking (ANPM) module to obtain more enriched and representative prototype representations, which adaptively identifies and masks noisy regions within the pseudo proposals, reducing potential erroneous interference during prototype computation. Furthermore, we generate specialized soft pseudo-labels for the noisy regions identified by ANPM, providing supplementary supervision. Extensive experiments on six medical image segmentation tasks involving different modalities demonstrate that the proposed framework significantly outperforms representative state-of-the-art methods. △ Less

Submitted 23 December, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

arXiv:2312.03231 [pdf, other]

Deep Multimodal Fusion for Surgical Feedback Classification

Authors: Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An Huang, Jiayun Wang, Anima Anandkumar, Andrew J. Hung

Abstract: Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In th… ▽ More Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that the Staged training strategy, with first pre-training each modality separately and then training them jointly, is more effective than training different modalities altogether. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities. △ Less

Submitted 5 December, 2023; originally announced December 2023.

Journal ref: Published in Proceedings of Machine Learning for Health 2024

arXiv:2311.07604 [pdf, other]

Finetuning Text-to-Image Diffusion Models for Fairness

Authors: Xudong Shen, Chao Du, Tianyu Pang, Min Lin, Yongkang Wong, Mohan Kankanhalli

Abstract: The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss… ▽ More The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss that steers specific characteristics of the generated images towards a user-defined target distribution, and (2) adjusted direct finetuning of diffusion model's sampling process (adjusted DFT), which leverages an adjusted gradient to directly optimize losses defined on the generated images. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias is significantly reduced even when finetuning just five soft tokens. Crucially, our method supports diverse perspectives of fairness beyond absolute equality, which is demonstrated by controlling age to a $75\%$ young and $25\%$ old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once by simply including these prompts in the finetuning data. We share code and various fair diffusion model adaptors at https://sail-sg.github.io/finetune-fair-diffusion/. △ Less

Submitted 15 March, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

Comments: ICLR 2024 oral presentation

arXiv:2311.03032 [pdf, other]

Reconfigurable, Transformable Soft Pneumatic Actuator with Tunable 3D Deformations for Dexterous Soft Robotics Applications

Authors: Dickson Chiu Yu Wong, Mingtan Li, Shijie Kang, Lifan Luo, Hongyu Yu

Abstract: Numerous soft actuators based on PneuNet design have already been proposed and extensively employed across various soft robotics applications in recent years. Despite their widespread use, a common limitation of most existing designs is that their action is pre-determined during the fabrication process, thereby restricting the ability to modify or alter their function during operation. To address… ▽ More Numerous soft actuators based on PneuNet design have already been proposed and extensively employed across various soft robotics applications in recent years. Despite their widespread use, a common limitation of most existing designs is that their action is pre-determined during the fabrication process, thereby restricting the ability to modify or alter their function during operation. To address this shortcoming, in this article the design of a Reconfigurable, Transformable Soft Pneumatic Actuator (RT-SPA) is proposed. The working principle of the RT-SPA is analogous to the conventional PneuNet. The key distinction between the two lies in the ability of the RT-SPA to undergo controlled transformations, allowing for more versatile bending and twisting motions in various directions. Furthermore, the unique reconfigurable design of the RT-SPA enables the selection of actuation units with different sizes to achieve a diverse range of three-dimensional deformations. This versatility enhances the potential of the RT-SPA for adaptation to a multitude of tasks and environments, setting it apart from traditional PneuNet. The paper begins with a detailed description of the design and fabrication of the RT-SPA. Following this, a series of experiments are conducted to evaluate the performance of the RT-SPA. Finally, the abilities of the RT-SPA for locomotion, gripping, and object manipulation are demonstrated to illustrate the versatility of the RT-SPA across different aspects. △ Less

Submitted 6 November, 2023; originally announced November 2023.

Comments: Submitted to Soft Robotics Journal. 12 pages, 10 figures

arXiv:2310.16684 [pdf, other]

Local Statistics for Generative Image Detection

Authors: Yung Jer Wong, Teck Khim Ng

Abstract: Diffusion models (DMs) are generative models that learn to synthesize images from Gaussian noise. DMs can be trained to do a variety of tasks such as image generation and image super-resolution. Researchers have made significant improvements in the capability of synthesizing photorealistic images in the past few years. These successes also hasten the need to address the potential misuse of synthes… ▽ More Diffusion models (DMs) are generative models that learn to synthesize images from Gaussian noise. DMs can be trained to do a variety of tasks such as image generation and image super-resolution. Researchers have made significant improvements in the capability of synthesizing photorealistic images in the past few years. These successes also hasten the need to address the potential misuse of synthesized images. In this paper, we highlighted the effectiveness of Bayer pattern and local statistics in distinguishing digital camera images from DM-generated images. We further hypothesized that local statistics should be used to address the spatial non-stationarity problems in images. We showed that our approach produced promising results for distinguishing real images from synthesized images. This approach is also robust to various perturbations such as image resizing and JPEG compression. △ Less

Submitted 3 March, 2025; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 4 pages

arXiv:2310.05210 [pdf, other]

TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining

Authors: Qing Zong, Zhaowei Wang, Baixuan Xu, Tianshi Zheng, Haochen Shi, Weiqi Wang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argumen… ▽ More A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels at not only understanding text but also detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, the 1st place in the leaderboard of Argumentative Stance Classification subtask in this shared task. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: Accepted to the 10th Workshop on Argument Mining, co-located with EMNLP 2023

arXiv:2310.04992 [pdf, other]

doi 10.1056/AIoa2300221

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv , et al. (17 additional authors not shown)

Abstract: We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassifi… ▽ More We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation. △ Less

Submitted 7 October, 2023; originally announced October 2023.

Journal ref: The latest VisionFM work has been published in NEJM AI, 2024

arXiv:2309.16738 [pdf, other]

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

Authors: Yangyang Guo, Haoyu Zhang, Yongkang Wong, Liqiang Nie, Mohan Kankanhalli

Abstract: Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into the \emph{efficient language-image pre-training}, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method ELIP, to remove less influentia… ▽ More Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into the \emph{efficient language-image pre-training}, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method ELIP, to remove less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressively pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30$\%$ vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance with baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, \emph{etc}. In addition, the spared GPU resources by our ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. △ Less

Submitted 17 November, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

arXiv:2309.03031 [pdf, other]

MCM: Multi-condition Motion Synthesis Framework for Multi-scenario

Authors: Zeyu Ling, Bo Han, Yongkang Wong, Mohan Kangkanhalli, Weidong Geng

Abstract: The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms like text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion and music-to-dance, among others. While existing research has primarily focused on single conditions, the multi-condition… ▽ More The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms like text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion and music-to-dance, among others. While existing research has primarily focused on single conditions, the multi-condition human motion generation remains underexplored. In this paper, we address these challenges by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving its generative capabilities. Specifically, MCM employs two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, effectively maintaining the generation ability of the main branch and supporting multi-condition input. We also introduce a Transformer-based diffusion model MWNet (DDPM-like) as our main branch that can capture the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves SoTA results in both text-to-motion and competitive results in music-to-dance tasks, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methodologies originally designed for text-to-motion tasks to domains like music-to-dance and speech-to-gesture, eliminating the need for extensive network re-configurations but also enables effective multi-condition modal control, realizing "once trained is motion need". △ Less

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2308.08561 [pdf]

doi 10.5281/zenodo.7983561

Implementation of The Future of Drug Discovery: QuantumBased Machine Learning Simulation (QMLS)

Authors: Yifan Zhou, Yan Shing Liang, Yew Kee Wong, Haichuan Qiu, Yu Xi Wu, Bin He

Abstract: The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of… ▽ More The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of the target protein while the Quantum Simulation (QS) filters molecules from the primary essay based on the reaction and binding effectiveness with the target protein. Then, For Lead Optimization, the resultant molecules generated and filtered from MLMG and QS are compared, and molecules that appear as a result of both processes will be made into dozens of molecular variations through Machine Learning Molecule Variation (MLMV), while others will only be made into a few variations. Lastly, all optimized molecules would undergo multiple rounds of QS filtering with a high standard for reaction effectiveness and safety, creating a few dozen pre-clinical-trail-ready drugs. This paper is based on our first paper, where we pitched the concept of machine learning combined with quantum simulations. In this paper we will go over the detailed design and framework of QMLS, including MLMG, MLMV, and QS. △ Less

Submitted 5 September, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

Comments: 13 pages, 6 figures

Journal ref: International Journal of Computer Science and Mobile Applications, Vol 11 Issue 5,May- 2023

Showing 1–50 of 136 results for author: Wong, Y