Search | arXiv e-print repository

HNOSeg-XS: Extremely Small Hartley Neural Operator for Efficient and Resolution-Robust 3D Image Segmentation

Authors: Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood

Abstract: In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can only afford fewer layers than 2D models with… ▽ More In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can only afford fewer layers than 2D models with reduced receptive fields and abstract levels. For transformers, although long-range correlations can be captured by multi-head attention, its quadratic complexity with respect to input size is computationally demanding. Therefore, either model may require input size reduction to allow more filters and layers for better segmentation. Nevertheless, given their discrete nature, models trained with patch-wise training or image downsampling may produce suboptimal results when applied on higher resolutions. To address this issue, here we propose the resolution-robust HNOSeg-XS architecture. We model image segmentation by learnable partial differential equations through the Fourier neural operator which has the zero-shot super-resolution property. By replacing the Fourier transform by the Hartley transform and reformulating the problem in the frequency domain, we created the HNOSeg-XS model, which is resolution robust, fast, memory efficient, and extremely parameter efficient. When tested on the BraTS'23, KiTS'23, and MVSeg'23 datasets with a Tesla V100 GPU, HNOSeg-XS showed its superior resolution robustness with fewer than 34.7k model parameters. It also achieved the overall best inference time (< 0.24 s) and memory efficiency (< 1.8 GiB) compared to the tested CNN and transformer models. △ Less

Submitted 10 July, 2025; originally announced July 2025.

Comments: This paper was accepted by IEEE TMI 2025

arXiv:2507.04903 [pdf, ps, other]

BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning

Authors: Thinh Dao, Dung Thuy Nguyen, Khoa D Doan, Kok-Seng Wong

Abstract: Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in… ▽ More Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at https://github.com/thinh-dao/BackFed. △ Less

Submitted 7 July, 2025; originally announced July 2025.

Comments: Under review at NeurIPS'25

arXiv:2507.03351 [pdf, ps, other]

Specific Absorption Rate-Aware Multiuser MIMO Assisted by Fluid Antenna System

Authors: Yuqi Ye, Li You, Hao Xu, Ahmed Elzanaty, Kai-Kit Wong, Xiqi Gao

Abstract: With the development of the upcoming sixth-generation (6G) wireless networks, there is a pressing need for innovative technologies capable of satisfying heightened performance indicators. Fluid antenna system (FAS) is proposed recently as a promising technique to achieve higher data rates and more diversity gains by dynamically changing the positions of the antennas to form a more desirable channe… ▽ More With the development of the upcoming sixth-generation (6G) wireless networks, there is a pressing need for innovative technologies capable of satisfying heightened performance indicators. Fluid antenna system (FAS) is proposed recently as a promising technique to achieve higher data rates and more diversity gains by dynamically changing the positions of the antennas to form a more desirable channel. However, worries regarding the possibly harmful effects of electromagnetic (EM) radiation emitted by devices have arisen as a result of the rapid evolution of advanced techniques in wireless communication systems. Specific absorption rate (SAR) is a widely adopted metric to quantify EM radiation worldwide. In this paper, we investigate the SAR-aware multiuser multiple-input multiple-output (MIMO) communications assisted by FAS. In particular, a two-layer iterative algorithm is proposed to minimize the SAR value under signal-to-interference-plus-noise ratio (SINR) and FAS constraints. Moreover, the minimum weighted SINR maximization problem under SAR and FAS constraints is studied by finding its relationship with the SAR minimization problem. Simulation results verify that the proposed SAR-aware FAS design outperforms the adaptive backoff and fixed-position antenna designs. △ Less

Submitted 4 July, 2025; originally announced July 2025.

Comments: 12 pages, 9 figures, to appear in IEEE Transactions on Wireless Communications

arXiv:2507.03133 [pdf, ps, other]

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Authors: Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning task… ▽ More Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks. △ Less

Submitted 3 July, 2025; originally announced July 2025.

Comments: under review

arXiv:2506.20235 [pdf, ps, other]

Directed Link Prediction using GNN with Local and Global Feature Fusion

Authors: Yuyang Zhang, Xu Shen, Yu Xie, Ka-Chun Wong, Weidun Xie, Chengbin Peng

Abstract: Link prediction is a classical problem in graph analysis with many practical applications. For directed graphs, recently developed deep learning approaches typically analyze node similarities through contrastive learning and aggregate neighborhood information through graph convolutions. In this work, we propose a novel graph neural network (GNN) framework to fuse feature embedding with community i… ▽ More Link prediction is a classical problem in graph analysis with many practical applications. For directed graphs, recently developed deep learning approaches typically analyze node similarities through contrastive learning and aggregate neighborhood information through graph convolutions. In this work, we propose a novel graph neural network (GNN) framework to fuse feature embedding with community information. We theoretically demonstrate that such hybrid features can improve the performance of directed link prediction. To utilize such features efficiently, we also propose an approach to transform input graphs into directed line graphs so that nodes in the transformed graph can aggregate more information during graph convolutions. Experiments on benchmark datasets show that our approach outperforms the state-of-the-art in most cases when 30%, 40%, 50%, and 60% of the connected links are used as training data, respectively. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20048 [pdf, ps, other]

A Principled Path to Fitted Distributional Evaluation

Authors: Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond Ka Wai Wong

Abstract: In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. We refer to this extension as fitted distributi… ▽ More In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.16578 [pdf, ps, other]

SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage

Authors: Tongan Cai, Haomiao Ni, Wenchao Ma, Yuan Xue, Qian Ma, Rachel Leicht, Kelvin Wong, John Volpi, Stephen T. C. Wong, James Z. Wang, Sharon X. Huang

Abstract: Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across inst… ▽ More Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. This approach retains diagnostically relevant facial dynamics without revealing the patients' identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage. Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: IPMI 2025

arXiv:2506.14488 [pdf, ps, other]

Reimagining Target-Aware Molecular Generation through Retrieval-Enhanced Aligned Diffusion

Authors: Dong Xu, Zhangfan Yang, Ka-chun Wong, Zexuan Zhu, Jiangqiang Li, Junkai Ji

Abstract: Breakthroughs in high-accuracy protein structure prediction, such as AlphaFold, have established receptor-based molecule design as a critical driver for rapid early-phase drug discovery. However, most approaches still struggle to balance pocket-specific geometric fit with strict valence and synthetic constraints. To resolve this trade-off, a Retrieval-Enhanced Aligned Diffusion termed READ is intr… ▽ More Breakthroughs in high-accuracy protein structure prediction, such as AlphaFold, have established receptor-based molecule design as a critical driver for rapid early-phase drug discovery. However, most approaches still struggle to balance pocket-specific geometric fit with strict valence and synthetic constraints. To resolve this trade-off, a Retrieval-Enhanced Aligned Diffusion termed READ is introduced, which is the first to merge molecular Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model. Specifically, a contrastively pre-trained encoder aligns atom-level representations during training, then retrieves graph embeddings of pocket-matched scaffolds to guide each reverse-diffusion step at inference. This single mechanism can inject real-world chemical priors exactly where needed, producing valid, diverse, and shape-complementary ligands. Experimental results demonstrate that READ can achieve very competitive performance in CBGBench, surpassing state-of-the-art generative models and even native ligands. That suggests retrieval and diffusion can be co-optimized for faster, more reliable structure-based drug design. △ Less

Submitted 17 June, 2025; originally announced June 2025.

Comments: 13 pages, 5 figures

arXiv:2506.14288 [pdf, ps, other]

Large Language Model Empowered Design of Fluid Antenna Systems: Challenges, Frameworks, and Case Studies for 6G

Authors: Chao Wang, Kai-Kit Wong, Zan Li, Liang Jin, Chan-Byoung Chae

Abstract: The Fluid Antenna System (FAS), which enables flexible Multiple-Input Multiple-Output (MIMO) communications, introduces new spatial degrees of freedom for next-generation wireless networks. Unlike traditional MIMO, FAS involves joint port selection and precoder design, a combinatorial NP-hard optimization problem. Moreover, fully leveraging FAS requires acquiring Channel State Information (CSI) ac… ▽ More The Fluid Antenna System (FAS), which enables flexible Multiple-Input Multiple-Output (MIMO) communications, introduces new spatial degrees of freedom for next-generation wireless networks. Unlike traditional MIMO, FAS involves joint port selection and precoder design, a combinatorial NP-hard optimization problem. Moreover, fully leveraging FAS requires acquiring Channel State Information (CSI) across its ports, a challenge exacerbated by the system's near-continuous reconfigurability. These factors make traditional system design methods impractical for FAS due to nonconvexity and prohibitive computational complexity. While deep learning (DL)-based approaches have been proposed for MIMO optimization, their limited generalization and fitting capabilities render them suboptimal for FAS. In contrast, Large Language Models (LLMs) extend DL's capabilities by offering general-purpose adaptability, reasoning, and few-shot learning, thereby overcoming the limitations of task-specific, data-intensive models. This article presents a vision for LLM-driven FAS design, proposing a novel flexible communication framework. To demonstrate the potential, we examine LLM-enhanced FAS in multiuser scenarios, showcasing how LLMs can revolutionize FAS optimization. △ Less

Submitted 17 June, 2025; originally announced June 2025.

Comments: 9 pages

MSC Class: fluid antenna system

arXiv:2506.13317 [pdf, ps, other]

A Contemporary Survey on Fluid Antenna Systems: Fundamentals and Networking Perspectives

Authors: Hanjiang Hong, Kai-Kit Wong, Hao Xu, Xinghao Guo, Farshad Rostami Ghadi, Yu Chen, Yin Xu, Chan-Byoung Chae, Baiyang Liu, Kin-Fai Tong, Yangyang Zhang

Abstract: The explosive growth of teletraffic, fueled by the convergence of cyber-physical systems and data-intensive applications, such as the Internet of Things (IoT), autonomous systems, and immersive communications, demands a multidisciplinary suite of innovative solutions across the physical and network layers. Fluid antenna systems (FAS) represent a transformative advancement in antenna design, offeri… ▽ More The explosive growth of teletraffic, fueled by the convergence of cyber-physical systems and data-intensive applications, such as the Internet of Things (IoT), autonomous systems, and immersive communications, demands a multidisciplinary suite of innovative solutions across the physical and network layers. Fluid antenna systems (FAS) represent a transformative advancement in antenna design, offering enhanced spatial degrees of freedom through dynamic reconfigurability. By exploiting spatial flexibility, FAS can adapt to varying channel conditions and optimize wireless performance, making it a highly promising candidate for next-generation communication networks. This paper provides a comprehensive survey of the state of the art in FAS research. We begin by examining key application scenarios in which FAS offers significant advantages. We then present the fundamental principles of FAS, covering channel measurement and modeling, single-user configurations, and the multi-user fluid antenna multiple access (FAMA) framework. Following this, we delve into key network-layer techniques such as quality-of-service (QoS) provisioning, power allocation, and content placement strategies. We conclude by identifying prevailing challenges and outlining future research directions to support the continued development of FAS in next-generation wireless networks. △ Less

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.07986 [pdf, ps, other]

Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong

Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between… ▽ More Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at \href{https://github.com/Vchitect/TACA} △ Less

Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

Comments: Project Page: https://vchitect.github.io/TACA/

arXiv:2506.07362 [pdf, ps, other]

Fluid Antenna-Empowered Receive Spatial Modulation

Authors: Xinghao Guo, Yin Xu, Dazhi He, Cixiao Zhang, Hanjiang Hong, Kai-Kit Wong, Chan-Byoung Chae, Wenjun Zhang, Yiyan Wu

Abstract: Fluid antenna (FA), as an emerging antenna technology, fully exploits spatial diversity. This paper integrates FA with the receive spatial modulation (RSM) scheme and proposes a novel FA-empowered RSM (FA-RSM) system. In this system, the transmitter is equipped with an FA that simultaneously activates multiple ports to transmit precoded signals. We address three key challenges in the FA-RSM system… ▽ More Fluid antenna (FA), as an emerging antenna technology, fully exploits spatial diversity. This paper integrates FA with the receive spatial modulation (RSM) scheme and proposes a novel FA-empowered RSM (FA-RSM) system. In this system, the transmitter is equipped with an FA that simultaneously activates multiple ports to transmit precoded signals. We address three key challenges in the FA-RSM system: port selection, theoretical analysis, and detection. First, for port selection, an optimal algorithm from a capacity maximization perspective are proposed, followed by two low-complexity alternatives. Second, for theoretical analysis, performance evaluation metrics are provided for port selection, which demonstrate that increasing the number of activated ports enhances system performance. Third, regarding detection, two low-complexity detectors are proposed. Simulation results confirm that the FA-RSM system significantly outperforms the conventional RSM system. The proposed low-complexity port selection algorithms facilitate minimal performance degradation. Moreover, while activating additional ports improves performance, the gain gradually saturates due to inherent spatial correlation, highlighting the importance of effective port selection in reducing system complexity and cost. Finally, both proposed detectors achieve near-optimal detection performance with low computational complexity, emphasizing the receiver-friendly nature of the FA-RSM system. △ Less

Submitted 8 June, 2025; originally announced June 2025.

Comments: 12 pages, submitted to IEEE Journal

arXiv:2506.05569 [pdf, ps, other]

Fluid Antenna System-Assisted Self-Interference Cancellation for In-Band Full Duplex Communications

Authors: Hanjiang Hong, Kai-Kit Wong, Hao Xu, Yiyan Wu, Sai Xu, Chan-Byoung Chae, Baiyang Liu, Kin-Fai Tong

Abstract: In-band full-duplex (IBFD) systems are expected to double the spectral efficiency compared to half-duplex systems, provided that loopback self-interference (SI) can be effectively suppressed. The inherent interference mitigation capabilities of the emerging fluid antenna system (FAS) technology make it a promising candidate for addressing the SI challenge in IBFD systems. This paper thus proposes… ▽ More In-band full-duplex (IBFD) systems are expected to double the spectral efficiency compared to half-duplex systems, provided that loopback self-interference (SI) can be effectively suppressed. The inherent interference mitigation capabilities of the emerging fluid antenna system (FAS) technology make it a promising candidate for addressing the SI challenge in IBFD systems. This paper thus proposes a FAS-assisted self-interference cancellation (SIC) framework, which leverages a receiver-side FAS to dynamically select an interference-free port. Analytical results include a lower bound and an approximation of the residual SI (RSI) power, both derived for rich-scattering channels by considering the joint spatial correlation amongst the FAS ports. Simulations of RSI power and forward link rates validate the analysis, showing that the SIC performance improves with the number of FAS ports. Additionally, simulations under practical conditions, such as finite-scattering environments and wideband integrated access and backhaul (IAB) channels, reveal that the proposed approach offers superior SIC capability and significant forward rate gains over conventional IBFD SIC schemes. △ Less

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.03850 [pdf, ps, other]

Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning

Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

Abstract: Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we rev… ▽ More Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines. △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2506.03123 [pdf, ps, other]

DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Authors: Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu

Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by a… ▽ More Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient \textbf{Dual-Expert Consistency Model~(DCM)}, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert.Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at \href{https://github.com/Vchitect/DCM}{https://github.com/Vchitect/DCM}. △ Less

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.00886 [pdf, ps, other]

Toward a Theory of Agents as Tool-Use Decision-Makers

Authors: Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Kam-Fai Wong

Abstract: As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, wha… ▽ More As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently. We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction. Building on this framework, we advocate for aligning an agent's tool use decision-making boundary with its knowledge boundary, thereby minimizing unnecessary tool use and maximizing epistemic efficiency. This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior. △ Less

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.24873 [pdf, ps, other]

MiniMax-Remover: Taming Bad Noise Helps Video Object Removal

Authors: Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, Kam-Fai Wong

Abstract: Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow infer… ▽ More Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that text condition is not best suited for this task, we simplify the pretrained video generation model by removing textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture in the first stage. In the second stage, we distilled our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise ("bad noise") that makes failure removals, while the outer minimization step trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves a state-of-the-art video object removal results with as few as 6 sampling steps and doesn't rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Codes and Videos are available at: https://minimax-remover.github.io. △ Less

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.23680 [pdf, ps, other]

Performance Analysis of Wireless Communication Systems Assisted by Fluid Reconfigurable Intelligent Surfaces

Authors: Farshad Rostami Ghadi, Kai-Kit Wong, F. Javier Lopez-Martinez, George C. Alexandropoulos, Chan-Byoung Chae

Abstract: This letter investigates the performance of emerging wireless communication systems assisted by a fluid reconfigurable intelligent surface (FRIS). Unlike conventional reconfigurable intelligent surfaces (RISs), an FRIS consists of fluid-inspired metamaterials arranged in a densely packed matrix of sub-elements over a surface. It dynamically activates specific elements for signal reflection and mod… ▽ More This letter investigates the performance of emerging wireless communication systems assisted by a fluid reconfigurable intelligent surface (FRIS). Unlike conventional reconfigurable intelligent surfaces (RISs), an FRIS consists of fluid-inspired metamaterials arranged in a densely packed matrix of sub-elements over a surface. It dynamically activates specific elements for signal reflection and modulation based on real-time channel conditions. Considering a downlink scenario where a base station communicates with a user terminal via a FRIS, we first characterize the statistical behavior of the equivalent end-to-end channel by deriving closed-form approximations for its cumulative distribution and probability density functions. Using these expressions, an analytical approximation for the outage probability and a tight upper bound on the ergodic capacity, including their asymptotic behaviors for high signal-to-noise ratio values, are derived. Our findings reveal key performance trends demonstrating that FRIS can substantially improve link reliability and spectral efficiency compared to conventional RISs, owing to its capability to dynamically select optimal elements from a dense preconfigured grid. △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.20231 [pdf, ps, other]

Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

Authors: Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong

Abstract: Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce a MS-TOD dataset, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for eval… ▽ More Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce a MS-TOD dataset, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for evaluating long-term memory in multi-session TOD. Based on this new dataset, we propose a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach. 1) Memory-Guided Dialogue Planning retrieves intent-aligned history, identifies key QA units via a memory judger, refines them by removing redundant questions, and generates responses based on the reconstructed memory. 2) Proactive Response Strategy detects and correct errors or omissions, ensuring efficient and accurate task completion. We evaluate MAP on MS-TOD dataset, focusing on response quality and effectiveness of the proactive strategy. Experiments on MS-TOD demonstrate that MAP significantly improves task success and turn efficiency in multi-session scenarios, while maintaining competitive performance on conventional single-session tasks. △ Less

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.18628 [pdf, ps, other]

Multi-Subarray FD-RIS Enhanced Multi-user Wireless Networks: With Joint Distance-Angle Beamforming

Authors: Han Xiao, Xiaoyan Hu, Wenjie Wang, Kai-Kit Wong, Kun Yang, Shi Jin

Abstract: The concept of the frequency diverse reconfigurable intelligent surface (FD-RIS) technology has been introduced, which can enable simultaneous implementation of distance-angle beamforming in far-field communication scenarios. In order to improve the managing ability on undesired harmonic signals and the diversity of frequency offsets, this paper presents a novel multi-subarray FD-RIS framework. In… ▽ More The concept of the frequency diverse reconfigurable intelligent surface (FD-RIS) technology has been introduced, which can enable simultaneous implementation of distance-angle beamforming in far-field communication scenarios. In order to improve the managing ability on undesired harmonic signals and the diversity of frequency offsets, this paper presents a novel multi-subarray FD-RIS framework. In this framework, the RIS is evenly divided into multiple subarrays, each employing a distinct time-modulation frequency to enable the diversity of frequency offsets. Additionally, to suppress the undesired harmonic signals, a new time-modulation technique is employed to periodically adjust the phase-shift of each element. Based on the proposed multi-subarray FD-RIS, the signal processing model is first analytically derived. To evaluate the effectiveness of the proposed multi-subarray FD-RIS, we integrate it into a multi-user communication scenario and formulate an optimization problem that aims to maximize the weighted sum rate of all users. This is achieved by jointly optimizing the active beamforming, time delays, and modulation frequencies. Subsequently, a novel iterative algorithm is proposed to effectively solve this problem with low computing complexity. Simulation results demonstrate that the proposed multi-subarray FD-RIS can significantly enhance the performance of far-field communication networks by leveraging unique distance-angle beamforming. Furthermore, to achieve same performance gains, the FD-RIS-assisted system can substantially reduce the required number of RIS elements, number of antennas, and power budget, than the conventional RIS-assisted schemes. The proposed algorithm also demonstrates a notably superiority in performance and computational complexity compared with the baseline algorithms such as semi-definite relaxation (SDR) and zero-forcing (ZF). △ Less

Submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.17829 [pdf, other]

Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning

Authors: Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

Abstract: Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of interm… ▽ More Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets. △ Less

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.17448 [pdf, ps, other]

Baitradar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning

Authors: Bhanuka Gamage, Adnan Labib, Aisha Joomun, Chern Hong Lim, KokSheik Wong

Abstract: Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users ended up watching a video that does not have the content as publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning… ▽ More Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users ended up watching a video that does not have the content as publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning technique where six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is attained by computing the average of multiple models to provide a robust and accurate output even in situation where there is missing data. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2s. △ Less

Submitted 23 May, 2025; originally announced May 2025.

Comments: Appear in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'21), Toronto, ON, Canada

arXiv:2505.17433 [pdf, ps, other]

MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models

Authors: Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Yanxi Zhao, Yifan Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu

Abstract: Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its convers… ▽ More Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme's image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive development toward more sophisticated LVLMs of the context-aware understanding. △ Less

Submitted 4 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.17427 [pdf, ps, other]

T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu

Abstract: Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightfo… ▽ More Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. But they add human bias to the reasoning process and fail to leverage models' inherent reasoning capabilities. To address these limitations, we present T$^2$: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T$^2$ leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables to adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T$^2$ works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T$^2$ not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2\%. △ Less

Submitted 4 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2503.22985

arXiv:2505.15200 [pdf, ps, other]

Performance Analysis of Fluid Antenna System under Spatially-Correlated Rician Fading Channels

Authors: Jiangsheng Huangfu, Zhengyu Song, Tianwei Hou, Anna Li, Yuanwei Liu, Arumugam Nallanathan, Kai-Kit Wong

Abstract: Fluid antenna systems (FAS) are among the most promising technologies for the sixth generation (6G) mobile communication networks. Unlike traditional fixed-position multiple-input multiple-output (MIMO) systems, a FAS possesses position reconfigurability to switch on-demand among $N$ predefined ports over a prescribed space. This paper explores the performance of a single-input single-output (SISO… ▽ More Fluid antenna systems (FAS) are among the most promising technologies for the sixth generation (6G) mobile communication networks. Unlike traditional fixed-position multiple-input multiple-output (MIMO) systems, a FAS possesses position reconfigurability to switch on-demand among $N$ predefined ports over a prescribed space. This paper explores the performance of a single-input single-output (SISO) model with a fixed-position antenna transmitter and a single-antenna FAS receiver, referred to as the Rx-SISO-FAS model, under spatially-correlated Rician fading channels. Our contributions include exact expressions and closed-form bounds for the outage probability of the Rx-SISO-FAS model, as well as exact and closed-form lower bounds for the ergodic rate. Importantly, we also analyze the performance considering both uniform linear array (ULA) and uniform planar array (UPA) configurations for the ports of the FAS. To gain insights, we evaluate the diversity order of the proposed model and our analytical results indicate that with a fixed overall system size, increasing the number of ports, $N$, significantly decreases the outage performance of FAS under different Rician fading factors. Our numerical results further demonstrate that: $i)$ the Rx-SISO-FAS model can enhance performance under spatially-correlated Rician fading channels over the fixed-position antenna counterpart; $ii)$ the Rician factor negatively impacts performance in the low signal-to-noise ratio (SNR) regime; $iii$) FAS can outperform an $L$ branches maximum ratio combining (MRC) system under Rician fading channels; and $iv)$ when the number of ports is identical, UPA outperforms ULA. △ Less

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.14116 [pdf, ps, other]

Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong

Abstract: Inference-time scaling has attracted much attention which significantly enhance the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, being difficult to create and acquire. In this work, we i… ▽ More Inference-time scaling has attracted much attention which significantly enhance the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, being difficult to create and acquire. In this work, we introduce \textit{Self-Reasoning Language Model} (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than $+2.5$ points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings more improvements with more times of sampling during inference, such as absolute $+7.89$ average improvement with $64$ sampling times, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline. △ Less

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.13616 [pdf, other]

FIRES: Fluid Integrated Reflecting and Emitting Surfaces

Authors: Farshad Rostami Ghadi, Kai-Kit Wong, Masoud Kaveh, F. Javier Lopez-Martinez, Chan-Byoung Chae, George C. Alexandropoulos

Abstract: This letter introduces the concept of fluid integrated reflecting and emitting surface (FIRES), which constitutes a new paradigm seamlessly integrating the flexibility of fluid-antenna systems (FASs) with the dual functionality of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs). The potential of the proposed metasurface structure is studied though an FIRES-… ▽ More This letter introduces the concept of fluid integrated reflecting and emitting surface (FIRES), which constitutes a new paradigm seamlessly integrating the flexibility of fluid-antenna systems (FASs) with the dual functionality of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs). The potential of the proposed metasurface structure is studied though an FIRES-enabled multicast system based on the energy splitting protocol. In this model, the FIRES is divided into non-overlapping subareas, each functioning as a 'fluid' element capable of concurrent reflection and transmission and changing its position of radiation within the subarea. In particular, we formulate an optimization problem for the design of the triple tunable features of the surface unit elements, which is solved via a tailored particle swarm optimization approach. Our results showcase that the proposed FIRES architecture significantly outperforms its conventional STAR-RIS counterpart. △ Less

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.13328 [pdf, other]

Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

Authors: Hongru Wang, Wenyu Huang, Yufei Wang, Yuanhao Xi, Jianqiao Lu, Huan Zhang, Nan Hu, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong

Abstract: Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fulfill this gap, we propose \texttt{DialogTool}, a multi-turn dialogue dataset with stateful tool i… ▽ More Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fulfill this gap, we propose \texttt{DialogTool}, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) \textit{tool creation}; 2) \textit{tool utilization}: tool awareness, tool selection, tool execution; and 3) \textit{role-consistent response}: response generation and role play. Furthermore, we build \texttt{VirtualMobile} -- an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs\footnote{We will use tools and APIs alternatively, there are no significant differences between them in this paper.}. Taking advantage of these artifacts, we conduct comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that the existing state-of-the-art LLMs still cannot perform well to use tools over long horizons. △ Less

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.08838 [pdf, ps, other]

Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

Authors: Peixuan Ge, Tongkun Su, Faqin Lv, Baoliang Zhao, Peng Zhang, Chi Hong Wong, Liang Yao, Yu Sun, Zenan Wang, Pak Kin Wong, Ying Hu

Abstract: Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveragi… ▽ More Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2\% in BLEU scores, approximately 3\% in ROUGE-L, and about 15\% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows. △ Less

Submitted 19 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

arXiv:2505.03841 [pdf, other]

Contact-Aware Safety in Soft Robots Using High-Order Control Barrier and Lyapunov Functions

Authors: Kiwan Wong, Maximilian Stölzle, Wei Xiao, Cosimo Della Santina, Daniela Rus, Gioele Zardini

Abstract: Robots operating alongside people, particularly in sensitive scenarios such as aiding the elderly with daily tasks or collaborating with workers in manufacturing, must guarantee safety and cultivate user trust. Continuum soft manipulators promise safety through material compliance, but as designs evolve for greater precision, payload capacity, and speed, and increasingly incorporate rigid elements… ▽ More Robots operating alongside people, particularly in sensitive scenarios such as aiding the elderly with daily tasks or collaborating with workers in manufacturing, must guarantee safety and cultivate user trust. Continuum soft manipulators promise safety through material compliance, but as designs evolve for greater precision, payload capacity, and speed, and increasingly incorporate rigid elements, their injury risk resurfaces. In this letter, we introduce a comprehensive High-Order Control Barrier Function (HOCBF) + High-Order Control Lyapunov Function (HOCLF) framework that enforces strict contact force limits across the entire soft-robot body during environmental interactions. Our approach combines a differentiable Piecewise Cosserat-Segment (PCS) dynamics model with a convex-polygon distance approximation metric, named Differentiable Conservative Separating Axis Theorem (DCSAT), based on the soft robot geometry to enable real-time, whole-body collision detection, resolution, and enforcement of the safety constraints. By embedding HOCBFs into our optimization routine, we guarantee safety and actively regulate environmental coupling, allowing, for instance, safe object manipulation under HOCLF-driven motion objectives. Extensive planar simulations demonstrate that our method maintains safety-bounded contacts while achieving precise shape and task-space regulation. This work thus lays a foundation for the deployment of soft robots in human-centric environments with provable safety and performance. △ Less

Submitted 5 May, 2025; originally announced May 2025.

Comments: 8 pages

arXiv:2505.02344 [pdf, other]

An End-to-End Model For Logits Based Large Language Models Watermarking

Authors: Kahim Wong, Jicheng Zhou, Jiantao Zhou, Yain-Whar Si

Abstract: The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suf… ▽ More The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. By jointly optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37-39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. Code is available at https://github.com/KAHIMWONG/E2E_LLM_WM. △ Less

Submitted 22 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.01988 [pdf, ps, other]

Sparse Code Transceiver Design for Unsourced Random Access with Analytical Power Division in Gaussian MAC

Authors: Zhentian Zhang, Mohammad Javad Ahmadi, Jian Dang, Kai-Kit Wong, Zaichen Zhang, Christos Masouros

Abstract: In this work, we discuss the problem of unsourced random access (URA) over a Gaussian multiple access channel (GMAC). To address the challenges posed by emerging massive machine-type connectivity, URA reframes multiple access as a coding-theoretic problem. The sparse code-oriented schemes are highly valued because they are widely used in existing protocols, making their implementation require only… ▽ More In this work, we discuss the problem of unsourced random access (URA) over a Gaussian multiple access channel (GMAC). To address the challenges posed by emerging massive machine-type connectivity, URA reframes multiple access as a coding-theoretic problem. The sparse code-oriented schemes are highly valued because they are widely used in existing protocols, making their implementation require only minimal changes to current networks. However, drawbacks such as the heavy reliance on extrinsic feedback from powerful channel codes and the lack of transmission robustness pose obstacles to the development of sparse codes. To address these drawbacks, a novel sparse code structure based on a universally applicable power division strategy is proposed. Comprehensive numerical results validate the effectiveness of the proposed scheme. Specifically, by employing the proposed power division method, which is derived analytically and does not require extensive simulations, a performance improvement of approximately 2.8 dB is achieved compared to schemes with identical channel code setups. △ Less

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.01458 [pdf, other]

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Authors: Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

Abstract: Navigation and manipulation are core capabilities in Embodied AI, yet training agents with these capabilities in the real world faces high costs and time complexity. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing their properties overlooked in previous surveys. We also analyz… ▽ More Navigation and manipulation are core capabilities in Embodied AI, yet training agents with these capabilities in the real world faces high costs and time complexity. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing their properties overlooked in previous surveys. We also analyze their features for navigation and manipulation tasks, along with hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and cutting-edge methods-such as world models and geometric equivariance-to help researchers select suitable tools while accounting for hardware constraints. △ Less

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2505.00675 [pdf, other]

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan

Abstract: Memory is a fundamental component of AI systems, underpinning large language models (LLMs)-based agents. While prior surveys have focused on memory applications with LLMs (e.g., enabling personalized memory in conversational agents), they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric and contextual for… ▽ More Memory is a fundamental component of AI systems, underpinning large language models (LLMs)-based agents. While prior surveys have focused on memory applications with LLMs (e.g., enabling personalized memory in conversational agents), they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric and contextual forms, and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLMs based agents while outlining promising directions for future research\footnote{The paper list, datasets, methods and tools are available at \href{https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI}{https://github.com/Elvin-Yiming-Du/Survey\_Memory\_in\_AI}.}. △ Less

Submitted 27 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.19162 [pdf, other]

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Authors: Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong

Abstract: Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the nee… ▽ More Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including distilled R1 model. Furthermore, SPC can guide the test-time search of diverse LLMs and significantly improve their mathematical reasoning performance on MATH500 and AIME2024, surpassing those guided by state-of-the-art process reward models. △ Less

Submitted 17 May, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

Comments: Project webpage: https://chen-judge.github.io/SPC/

arXiv:2504.17629 [pdf, ps, other]

Integrated Sensing and Communications for Unsourced Random Access: A Spectrum Sharing Compressive Sensing Approach

Authors: Zhentian Zhang, Jian Dang, Kai-Kit Wong, Zaichen Zhang, Christos Masouros

Abstract: This paper addresses the unsourced/uncoordinated random access problem in an integrated sensing and communications (ISAC) system, with a focus on uplink multiple access code design. Recent theoretical advancements highlight that an ISAC system will be overwhelmed by the increasing number of active devices, driven by the growth of massive machine-type communication (mMTC). To meet the demands of fu… ▽ More This paper addresses the unsourced/uncoordinated random access problem in an integrated sensing and communications (ISAC) system, with a focus on uplink multiple access code design. Recent theoretical advancements highlight that an ISAC system will be overwhelmed by the increasing number of active devices, driven by the growth of massive machine-type communication (mMTC). To meet the demands of future mMTC network, fundamental solutions are required that ensure robust capacity while maintaining favorable energy and spectral efficiency. One promising approach to support emerging massive connectivity is the development of systems based on the unsourced ISAC (UNISAC) framework. This paper proposes a spectrum-sharing compressive sensing-based UNISAC (SSCS-UNISAC) and offers insights into the practical design of UNISAC multiple access codes. In this framework, both communication signals (data transmission) and sensing signals (e.g., radar echoes) overlap within finite channel uses and are transmitted via the proposed UNISAC protocol. The proposed decoder exhibits robust performance, providing 20-30 dB capacity gains compared to conventional protocols such as TDMA and ALOHA. Numerical results validate the promising performance of the proposed scheme. △ Less

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.15093 [pdf, ps, other]

Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models

Authors: K. Wong, B. Wu, S. Bulathwela, M. Cukurova

Abstract: Detecting collaborative and problem-solving behaviours from digital traces to interpret students' collaborative problem solving (CPS) competency is a long-term goal in the Artificial Intelligence in Education (AIEd) field. Although multimodal data and advanced models are argued to have the potential to detect complex CPS behaviours, empirical evidence on their value remains limited with some contr… ▽ More Detecting collaborative and problem-solving behaviours from digital traces to interpret students' collaborative problem solving (CPS) competency is a long-term goal in the Artificial Intelligence in Education (AIEd) field. Although multimodal data and advanced models are argued to have the potential to detect complex CPS behaviours, empirical evidence on their value remains limited with some contrasting evidence. In this study, we investigated the potential of multimodal data to improve model performance in diagnosing 78 secondary school students' CPS subskills and indicators in authentic educational settings. In particular, text embeddings from verbal data and acoustic embeddings from audio data were used in a multimodal classification model for CPS diagnosis. Both unimodal and multimodal transformer-based models outperformed traditional models in detecting CPS classes. Although the inclusion of multimodality did not improve the performance of traditional unimodal models, its integration into transformer-based models demonstrated improved performance for diagnosing social-cognitive CPS classes compared to unimodal transformer-based models. Based on the results, the paper argues that multimodality and the selection of a particular modelling technique should not be taken for granted to achieve the best performance in the automated detection of every CPS subskill and indicator. Rather, their value is limited to certain types of CPS indicators, affected by the complexity of the labels, and dependent on the composition of indicators in the dataset. We conclude the paper by discussing the required nuance when considering the value of LLMs and multimodality in automated CPS diagnosis, highlighting the need for human-AI complementarity, and proposing the exploration of relevant model architectures and techniques to improve CPS diagnosis in authentic educational contexts. △ Less

Submitted 21 April, 2025; originally announced April 2025.

Comments: Accepted for 26th International Conference on Artificial Intelligence in Education (AIED 2025), 22 - 26 July 2025, Palermo, Italy. 17 pages, 1 figure

arXiv:2504.14870 [pdf, ps, other]

Acting Less is Reasoning More! Teaching Model to Act Efficiently

Authors: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji

Abstract: Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools during long-form reasoning, such as search engines and code interpreters, to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most of existing approaches typically optimize only for final correctness w… ▽ More Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools during long-form reasoning, such as search engines and code interpreters, to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most of existing approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. This often leads to excessive tool calling, incurring high computational costs and hindering the development of internal reasoning capabilities - a phenomenon known as \textit{cognitive offloading}. To this end, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers answer correctness and corresponding tool use behavior of model to reach that answer. To validate the effectiveness, we introduce the metric of \textit{tool productivity}, defined as the ratio between the number of correct answers and the total number of tool calls across all test cases. This metric reflects how efficiently tool usage contributes to successful task completion, with higher values indicating smarter and more autonomous reasoning. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Preference Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 68.3\% and improves tool productivity by up to 215.4\%, while maintaining comparable answer accuracy. △ Less

Submitted 31 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

arXiv:2504.14653 [pdf, other]

Wireless Large AI Model: Shaping the AI-Native Future of 6G and Beyond

Authors: Fenghao Zhu, Xinquan Wang, Xinyi Li, Maojun Zhang, Yixuan Chen, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Zhaoyang Zhang, Richeng Jin, Yongming Huang, Wei Feng, Tingting Yang, Baoming Bai, Feifei Gao, Kun Yang, Yuanwei Liu, Sami Muhaidat, Chau Yuen, Kaibin Huang, Kai-Kit Wong, Dusit Niyato, Mérouane Debbah

Abstract: The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and d… ▽ More The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and decision-making. In light of these remarkable capabilities, this paper provides a comprehensive survey of WLAM, elucidating its fundamental principles, diverse applications, critical challenges, and future research opportunities. We begin by introducing the background of WLAM and analyzing the key synergies with wireless networks, emphasizing the mutual benefits. Subsequently, we explore the foundational characteristics of WLAM, delving into their unique relevance in wireless environments. Then, the role of WLAM in optimizing wireless communication systems across various use cases and the reciprocal benefits are systematically investigated. Furthermore, we discuss the integration of WLAM with emerging technologies, highlighting their potential to enable transformative capabilities and breakthroughs in wireless communication. Finally, we thoroughly examine the high-level challenges hindering the practical implementation of WLAM and discuss pivotal future research directions. △ Less

Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

arXiv:2504.06836 [pdf, other]

Determining Fetal Orientations From Blind Sweep Ultrasound Video

Authors: Jakub Maciej Wiśniewski, Anders Nymark Christensen, Mary Le Ngo, Martin Grønnebæk Tolsgaard, Chun Kit Wong

Abstract: Cognitive demands of fetal ultrasound examinations pose unique challenges among clinicians. With the goal of providing an assistive tool, we developed an automated pipeline for predicting fetal orientation from ultrasound videos acquired following a simple blind sweep protocol. Leveraging on a pre-trained head detection and segmentation model, this is achieved by first determining the fetal presen… ▽ More Cognitive demands of fetal ultrasound examinations pose unique challenges among clinicians. With the goal of providing an assistive tool, we developed an automated pipeline for predicting fetal orientation from ultrasound videos acquired following a simple blind sweep protocol. Leveraging on a pre-trained head detection and segmentation model, this is achieved by first determining the fetal presentation (cephalic or breech) with a template matching approach, followed by the fetal lie (facing left or right) by analyzing the spatial distribution of segmented brain anatomies. Evaluation on a dataset of third-trimester ultrasound scans demonstrated the promising accuracy of our pipeline. This work distinguishes itself by introducing automated fetal lie prediction and by proposing an assistive paradigm that augments sonographer expertise rather than replacing it. Future research will focus on enhancing acquisition efficiency, and exploring real-time clinical integration to improve workflow and support for obstetric clinicians. △ Less

Submitted 9 April, 2025; originally announced April 2025.

Comments: 10 pages

arXiv:2504.04098 [pdf, ps, other]

RIS-Empowered Integrated Location Sensing and Communication with Superimposed Pilots

Authors: Wenchao Xia, Ben Zhao, Wankai Tang, Yongxu Zhu, Kai-Kit Wong, Sangarapillai Lambotharan, Hyundong Shin

Abstract: In addition to enhancing wireless communication coverage quality, reconfigurable intelligent surface (RIS) technique can also assist in positioning. In this work, we consider RIS-assisted superimposed pilot and data transmission without the assumption availability of prior channel state information and position information of mobile user equipments (UEs). To tackle this challenge, we design a fram… ▽ More In addition to enhancing wireless communication coverage quality, reconfigurable intelligent surface (RIS) technique can also assist in positioning. In this work, we consider RIS-assisted superimposed pilot and data transmission without the assumption availability of prior channel state information and position information of mobile user equipments (UEs). To tackle this challenge, we design a frame structure of transmission protocol composed of several location coherence intervals, each with pure-pilot and data-pilot transmission durations. The former is used to estimate UE locations, while the latter is time-slotted, duration of which does not exceed the channel coherence time, where the data and pilot signals are transmitted simultaneously. We conduct the Fisher Information matrix (FIM) analysis and derive \text {Cramér-Rao bound} (CRB) for the position estimation error. The inverse fast Fourier transform (IFFT) is adopted to obtain the estimation results of UE positions, which are then exploited for channel estimation. Furthermore, we derive the closed-form lower bound of the ergodic achievable rate of superimposed pilot (SP) transmission, which is used to optimize the phase profile of the RIS to maximize the achievable sum rate using the genetic algorithm. Finally, numerical results validate the accuracy of the UE position estimation using the IFFT algorithm and the superiority of the proposed SP scheme by comparison with the regular pilot scheme. △ Less

Submitted 5 April, 2025; originally announced April 2025.

arXiv:2504.03183 [pdf, other]

On Fundamental Limits for Fluid Antenna-assisted Integrated Sensing and Communications for Unsourced Random Access

Authors: Zhentian Zhang, Kai-Kit Wong, Jian Dang, Zaichen Zhang, Chan-Byoung Chae

Abstract: This paper investigates the unsourced random access (URA) problem for integrated sensing and commutations (ISAC). Recent results reveal that conventional multiple access strategies for ISAC such as treating interference as noise (TIN) and time-division multiple access (TDMA) can be easily overwhelmed and fail to support the increasingly surging number of active users. Hence, the unsourced ISAC (UN… ▽ More This paper investigates the unsourced random access (URA) problem for integrated sensing and commutations (ISAC). Recent results reveal that conventional multiple access strategies for ISAC such as treating interference as noise (TIN) and time-division multiple access (TDMA) can be easily overwhelmed and fail to support the increasingly surging number of active users. Hence, the unsourced ISAC (UNISAC) system model has emerged as a promising enabler for the future ISAC networks. To advance this work, we adopt a more realistic channel model and propose to utilize fluid antenna system (FAS) for UNISAC. The achievable performance bound and floor of the proposed FAS-UNISAC are derived to validate the great potential. Our results demonstrate that promising improvement on the available user volume and the sensing and communication capability can be obtained due to the spatial diversities inherent within fluid antenna. △ Less

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2504.03128 [pdf, other]

FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

Authors: Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, Jiantao Zhou

Abstract: The proliferation of AI-generated content brings significant concerns on the forensic and security issues such as source tracing, copyright protection, etc, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Ex… ▽ More The proliferation of AI-generated content brings significant concerns on the forensic and security issues such as source tracing, copyright protection, etc, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on the pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ an image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at https://github.com/KAHIMWONG/FontGuard. △ Less

Submitted 3 April, 2025; originally announced April 2025.

arXiv:2504.01372 [pdf, ps, other]

doi 10.1109/TVT.2025.3557859

SCNR Maximization for MIMO ISAC Assisted by Fluid Antenna System

Authors: Yuqi Ye, Li You, Hao Xu, Ahmed Elzanaty, Kai-Kit Wong, Xiqi Gao

Abstract: The integrated sensing and communication (ISAC) technology has been extensively researched to enhance communication rates and radar sensing capabilities. Additionally, a new technology known as fluid antenna system (FAS) has recently been proposed to obtain higher communication rates for future wireless networks by dynamically altering the antenna position to obtain a more favorable channel condit… ▽ More The integrated sensing and communication (ISAC) technology has been extensively researched to enhance communication rates and radar sensing capabilities. Additionally, a new technology known as fluid antenna system (FAS) has recently been proposed to obtain higher communication rates for future wireless networks by dynamically altering the antenna position to obtain a more favorable channel condition. The application of the FAS technology in ISAC scenarios holds significant research potential. In this paper, we investigate a FAS-assisted multiple-input multiple-output (MIMO) ISAC system for maximizing the radar sensing signal-clutter-noise ratio (SCNR) under communication signal-to-interference-plus-noise ratio (SINR) and antenna position constraints. We devise an iterative algorithm that tackles the optimization problem by maximizing a lower bound of SCNR with respect to the transmit precoding matrix and the antenna position. By addressing the non-convexity of the problem through this iterative approach, our method significantly improves the SCNR. Our simulation results demonstrate that the proposed scheme achieves a higher SCNR compared to the baselines. △ Less

Submitted 2 April, 2025; originally announced April 2025.

Comments: 6 Pages, 3 figures, to appear in IEEE Transactions on Vehicular Technology

arXiv:2504.00906 [pdf, other]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang

Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performanc… ▽ More Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S. △ Less

Submitted 1 April, 2025; originally announced April 2025.

Comments: 18 pages, 13 figures, 8 tables

arXiv:2503.24377 [pdf, other]

Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Authors: Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong

Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning beh… ▽ More Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field. △ Less

Submitted 31 March, 2025; originally announced March 2025.

Comments: In Progress; Paper list Repo: https://github.com/DevoAllen/Awesome-Reasoning-Economy-Papers

arXiv:2503.23673 [pdf, other]

WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation

Authors: Zhengyi Zhao, Shubo Zhang, Bin Liang, Binyang Li, Kam-Fai Wong

Abstract: In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation poisons large language models to correctly understand relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in poten… ▽ More In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation poisons large language models to correctly understand relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in potential misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method which involves similarity computation followed by word replacement, but counterfactual data are usually generated. As a result, these methods disrupt meaningful word sets or produce sentences with meanings that deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated rationale-based synthetic data augmentation method. Beyond the naive lexicon similarity, specific bio-relation similarity is measured to hold the augmented instance having a strong correlation with bio-relation instead of simply increasing the diversity of augmented data. Moreover, a multi-agents-involved reflection mechanism helps the model iteratively distinguish different usage of similar entities to escape falling into the mis-replace trap. We evaluate our method on the BLURB and BigBIO benchmark, which includes 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models. △ Less

Submitted 30 March, 2025; originally announced March 2025.

arXiv:2503.23078 [pdf, other]

EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems

Authors: Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang, Zhongyang Li, Binyang Li, Kam-Fai Wong

Abstract: Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to \textbf{incomplete context tracking}. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this g… ▽ More Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to \textbf{incomplete context tracking}. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this gap, we present \textbf{EventWeave}, an event-centric framework that identifies and updates both core and supporting events as the conversation unfolds. Specifically, we organize these events into a dynamic event graph, which represents the interplay between \textbf{core events} that shape the primary idea and \textbf{supporting events} that provide critical context during the whole dialogue. By leveraging this dynamic graph, EventWeave helps models focus on the most relevant events when generating responses, thus avoiding repeated visits of the entire dialogue history. Experimental results on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning. △ Less

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.22985 [pdf, other]

FReM: A Flexible Reasoning Mechanism for Balancing Quick and Slow Thinking in Long-Context Question Answering

Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Bin Liang, Binyang Li, Kam-Fai Wong

Abstract: Long-context question-answering (LCQA) systems have greatly benefited from the powerful reasoning capabilities of large language models (LLMs), which can be categorized into slow and quick reasoning modes. However, both modes have their limitations. Slow thinking generally leans to explore every possible reasoning path, which leads to heavy overthinking and wastes time. Quick thinking usually reli… ▽ More Long-context question-answering (LCQA) systems have greatly benefited from the powerful reasoning capabilities of large language models (LLMs), which can be categorized into slow and quick reasoning modes. However, both modes have their limitations. Slow thinking generally leans to explore every possible reasoning path, which leads to heavy overthinking and wastes time. Quick thinking usually relies on pattern matching rather than truly understanding the query logic, which misses proper understanding. To address these issues, we propose FReM: Flexible Reasoning Mechanism, a method that adjusts reasoning depth according to the complexity of each question. Specifically, FReM leverages synthetic reference QA examples to provide an explicit chain of thought, enabling efficient handling of simple queries while allowing deeper reasoning for more complex ones. By doing so, FReM helps quick-thinking models move beyond superficial pattern matching and narrows the reasoning space for slow-thinking models to avoid unnecessary exploration. Experiments on seven QA datasets show that FReM improves reasoning accuracy and scalability, particularly for complex multihop questions, indicating its potential to advance LCQA methodologies. △ Less

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.16751 [pdf, other]

UAV-Relay Assisted RSMA Fluid Antenna System: Outage Probability Analysis

Authors: Farshad Rostami Ghadi, Masoud Kaveh, Francisco Hernando-Gallego, Diego Martin, Kai-Kit Wong, Chan-Byoung Chae

Abstract: This letter studies the impact of fluid antenna system (FAS) technology on the performance of unmanned aerial vehicle (UAV)-assisted multiuser communication networks. Specifically, we consider a scenario where a fixed-position antenna (FPA) base station (BS) serves K FAS-equipped users with the assistance of a UAV acting as an aerial relay. The BS employs rate-splitting multiple access (RSMA), whi… ▽ More This letter studies the impact of fluid antenna system (FAS) technology on the performance of unmanned aerial vehicle (UAV)-assisted multiuser communication networks. Specifically, we consider a scenario where a fixed-position antenna (FPA) base station (BS) serves K FAS-equipped users with the assistance of a UAV acting as an aerial relay. The BS employs rate-splitting multiple access (RSMA), while the UAV operates in half-duplex (HD) mode using the decode-and-forward (DF) strategy. For this system, we derive a compact analytical expression for the outage probability (OP) and its asymptotic behavior in the high signal-to-noise ratio (SNR) regime, leveraging the multivariate t-distribution. Our results show how deploying FAS at ground users (GUs) in UAV-aided communications improves overall system performance compared to using FPA GUs. △ Less

Submitted 20 March, 2025; originally announced March 2025.

Showing 1–50 of 593 results for author: Wong, K