-
Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework
Authors:
Sadia Kamal,
Tim Oates,
Joy Wan
Abstract:
Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated datasets. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.
Submitted 11 June, 2025;
originally announced June 2025.
-
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Authors:
Wei Tao,
Haocheng Lu,
Xiaoyang Qu,
Bin Zhang,
Kai Lu,
Jiguang Wan,
Jianzong Wang
Abstract:
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
Submitted 9 June, 2025;
originally announced June 2025.
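The chunk-by-chunk routing idea in the MoQAE abstract can be illustrated with a small sketch. This is not the authors' implementation: the expert set `BIT_WIDTHS`, the mean-pooling of token features, the linear router, and all function names are assumptions chosen only to show how one routing decision per chunk replaces one decision per token.

```python
from typing import List, Sequence

# Hypothetical set of quantization "experts": one bit-width per expert.
BIT_WIDTHS = [2, 4, 8]


def chunk_tokens(tokens: Sequence[Sequence[float]], chunk_size: int) -> List[List[Sequence[float]]]:
    """Split a sequence of token feature vectors into consecutive chunks."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def route_chunk(chunk: Sequence[Sequence[float]], router_weights: Sequence[Sequence[float]]) -> int:
    """Pick one bit-width for a whole chunk.

    The chunk is mean-pooled into a single vector (an assumption), scored by a
    linear router with one weight row per expert, and the top-scoring expert wins.
    """
    dim = len(chunk[0])
    pooled = [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in router_weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return BIT_WIDTHS[best]


def assign_bit_widths(tokens: Sequence[Sequence[float]],
                      router_weights: Sequence[Sequence[float]],
                      chunk_size: int) -> List[int]:
    """Return one bit-width per chunk, so the router runs once per chunk."""
    return [route_chunk(c, router_weights) for c in chunk_tokens(tokens, chunk_size)]
```

With a chunk size of `c`, the router in this sketch is called roughly `len(tokens) / c` times instead of `len(tokens)` times, which is the efficiency argument the abstract makes for chunk-level routing.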
-
Stability of the centers of group algebras of general affine groups $GA_n(q)$
Authors:
Jinkui Wan,
Lan Zhou
Abstract:
The general affine group $GA_n(q)$, consisting of invertible affine transformations of an affine space of codimension one in the vector space $\mathbb{F}_q^n$ over a finite field $\mathbb{F}_q$, can be viewed as a subgroup of the general linear group $GL_{n}(q)$ over $\mathbb{F}_q$. In this article, we introduce the notion of the type of each matrix in $GA_n(q)$ and give an explicit representative for each conjugacy class. Then the center $\mathscr{A}_n(q)$ of the integral group algebra $\mathbb{Z}[GA_n(q)]$ is proved to be a filtered algebra via the length function defined via the reflections lying in $GA_n(q)$. We show that in the associated graded algebras $\mathscr{G}_n(q)$ the structure constants with respect to the basis consisting of the conjugacy class sums are independent of $n$. The structure constants in $\mathscr{G}_n(q)$ are further shown to contain the structure constants in the graded algebras introduced by the first author and Wang for $GL_n(q)$ as special cases. The stability leads to a universal stable center $\mathscr{G}(q)$ with positive integer structure constants depending only on $q$, which governs the algebras $\mathscr{G}_n(q)$ for all $n$.
Submitted 5 June, 2025;
originally announced June 2025.
-
GEX: Democratizing Dexterity with Fully-Actuated Dexterous Hand and Exoskeleton Glove
Authors:
Yunlong Dong,
Xing Liu,
Jun Wan,
Zelin Deng
Abstract:
This paper introduces GEX, an innovative low-cost dexterous manipulation system that combines the GX11 tri-finger anthropomorphic hand (11 DoF) with the EX12 tri-finger exoskeleton glove (12 DoF), forming a closed-loop teleoperation framework through kinematic retargeting for high-fidelity control. Both components employ modular 3D-printed finger designs, achieving ultra-low manufacturing costs while maintaining full actuation capabilities. Departing from conventional tendon-driven or underactuated approaches, our electromechanical system integrates independent joint motors across all 23 DoF, ensuring complete state observability and accurate kinematic modeling. This full-actuation architecture enables precise bidirectional kinematic calculations, substantially enhancing kinematic retargeting fidelity between the exoskeleton and robotic hand. The proposed system bridges the cost-performance gap in dexterous manipulation research, providing an accessible platform for acquiring high-quality demonstration data to advance embodied AI and dexterous robotic skill transfer learning.
Submitted 5 June, 2025;
originally announced June 2025.
-
GCFL: A Gradient Correction-based Federated Learning Framework for Privacy-preserving CPSS
Authors:
Jiayi Wan,
Xiang Zhu,
Fanzhen Liu,
Wei Fan,
Xiaolong Xu
Abstract:
Federated learning, as a distributed architecture, shows great promise for applications in Cyber-Physical-Social Systems (CPSS). In order to mitigate the privacy risks inherent in CPSS, the integration of differential privacy with federated learning has attracted considerable attention. Existing research mainly focuses on dynamically adjusting the noise added or discarding certain gradients to mitigate the noise introduced by differential privacy. However, these approaches fail to remove the noise that hinders convergence and correct the gradients affected by the noise, which significantly reduces the accuracy of model classification. To overcome these challenges, this paper proposes a novel framework for differentially private federated learning that balances rigorous privacy guarantees with accuracy by introducing a server-side gradient correction mechanism. Specifically, after clients perform gradient clipping and noise perturbation, our framework detects deviations in the noisy local gradients and employs a projection mechanism to correct them, mitigating the negative impact of noise. Simultaneously, gradient projection promotes the alignment of gradients from different clients and guides the model towards convergence to a global optimum. We evaluate our framework on several benchmark datasets, and the experimental results demonstrate that it achieves state-of-the-art performance under the same privacy budget.
Submitted 4 June, 2025;
originally announced June 2025.
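The GCFL abstract describes server-side detection of deviating noisy gradients followed by a projection step. The sketch below shows one generic way such a correction could look. It is an assumption, not the paper's mechanism: the reference direction (the mean of the received gradients), the deviation test (a negative inner product with that reference), and every function name are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def correct_gradient(grad: Sequence[float], ref: Sequence[float]) -> List[float]:
    """Project out the component of `grad` that opposes `ref`.

    A gradient is treated as deviating (assumption) when its inner product
    with the reference direction is negative. In that case the conflicting
    component along `ref` is removed; otherwise the gradient is unchanged.
    """
    d = dot(grad, ref)
    ref_norm_sq = dot(ref, ref)
    if d >= 0 or ref_norm_sq == 0:
        return list(grad)
    scale = d / ref_norm_sq
    return [g - scale * r for g, r in zip(grad, ref)]


def aggregate(noisy_grads: Sequence[Sequence[float]]) -> List[float]:
    """Average client gradients after correcting each against their mean."""
    n = len(noisy_grads)
    dim = len(noisy_grads[0])
    ref = [sum(g[i] for g in noisy_grads) / n for i in range(dim)]
    corrected = [correct_gradient(g, ref) for g in noisy_grads]
    return [sum(g[i] for g in corrected) / n for i in range(dim)]
```

In this sketch the correction operates only on gradients the server has already received after client-side clipping and noising, so it acts as post-processing of the perturbed values.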
-
NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results
Authors:
Xiaohong Liu,
Xiongkuo Min,
Qiang Hu,
Xiaoyun Zhang,
Jie Guo,
Guangtao Zhai,
Shushi Wang,
Yingjie Zhou,
Lu Liu,
Jingxin Li,
Liu Yang,
Farong Wen,
Li Xu,
Yanwei Jiang,
Xilei Zhu,
Chunyi Li,
Zicheng Zhang,
Huiyu Duan,
Xiele Wu,
Yixuan Gao,
Yuqin Cao,
Jun Jia,
Wei Sun,
Jiezhang Cao,
Radu Timofte
, et al. (70 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking head processing. It is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses FineVD-GC, which contains 6,284 user-generated videos. This track has a total of 125 registered participants. A total of 242 submissions were received in the development phase, and 136 submissions were received in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI-generated video track uses Q-Eval-Video, which contains 34,029 AI-generated videos (AIGVs) generated by 11 popular text-to-video (T2V) models. A total of 133 participants registered in this track. A total of 396 submissions were received in the development phase, and 226 submissions were received in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants registered in this track. A total of 225 submissions were received in the development phase, and 118 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track proposed a method that outperforms the baseline, contributing to progress in the fields covered by the three tracks.
Submitted 3 June, 2025;
originally announced June 2025.
-
RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models
Authors:
Junjie Li,
Nan Zhang,
Xiaoyang Qu,
Kai Lu,
Guokuan Li,
Jiguang Wan,
Jianzong Wang
Abstract:
Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although significant progress has been made in semantic map construction and target direction prediction in current research, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration rate calculation. Combining the visual question answering capabilities of visual language models (VLMs) with exploration rates enables efficient termination. RATE-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. On the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.
Submitted 2 June, 2025;
originally announced June 2025.
-
BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation
Authors:
Wei Tao,
Xiaoyang Qu,
Kai Lu,
Jiguang Wan,
Shenglin He,
Jianzong Wang
Abstract:
Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.
Submitted 31 May, 2025;
originally announced June 2025.
-
D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding
Authors:
Yichi Zhang,
Gongwei Chen,
Jun Zhu,
Jia Wan
Abstract:
Visual Grounding is a task that aims to localize a target region in an image based on a free-form natural language description. With the rise of Transformer architectures, there is an increasing need for larger datasets to boost performance. However, the high cost of manual annotation poses a challenge, hindering the scale of data and the ability of large models to enhance their effectiveness. Previous pseudo label generation methods heavily rely on human-labeled captions of the original dataset, limiting scalability and diversity. To address this, we propose D2AF, a robust annotation framework for visual grounding using only input images. This approach overcomes dataset size limitations and enriches both the quantity and diversity of referring expressions. Our approach leverages multimodal large models and object detection models. By implementing dual-driven annotation strategies, we effectively generate detailed region-text pairs using both closed-set and open-set approaches. We further conduct an in-depth analysis of data quantity and data distribution. Our findings demonstrate that increasing data volume enhances model performance. However, the degree of improvement depends on how well the pseudo labels broaden the original data distribution. Based on these insights, we propose a consistency and distribution aware filtering method to further improve data quality by effectively removing erroneous and redundant data. This approach effectively eliminates noisy data, leading to improved performance. Experiments on three visual grounding tasks demonstrate that our method significantly improves the performance of existing models and achieves state-of-the-art results.
Submitted 30 May, 2025;
originally announced May 2025.
-
Compressive Fourier-Domain Intensity Coupling (C-FOCUS) enables near-millimeter deep imaging in the intact mouse brain in vivo
Authors:
Renzhi He,
Yucheng Li,
Brianna Urbina,
Jiandi Wan,
Yi Xue
Abstract:
Two-photon microscopy is a powerful tool for in vivo imaging, but its imaging depth is typically limited to a few hundred microns due to tissue scattering, even with existing scattering correction techniques. Moreover, most active scattering correction methods are restricted to small regions by the optical memory effect. Here, we introduce compressive Fourier-domain intensity coupling for scattering correction (C-FOCUS), an active scattering correction approach that integrates Fourier-domain intensity modulation with compressive sensing for two-photon microscopy. Using C-FOCUS, we demonstrate high-resolution imaging of YFP-labeled neurons and FITC-labeled blood vessels at depths exceeding 900 µm in the intact mouse brain in vivo. Furthermore, we achieve transcranial imaging of YFP-labeled dendritic structures through the intact adult mouse skull. C-FOCUS enables high-contrast fluorescence imaging at depths previously inaccessible using two-photon microscopy with 1035 nm excitation, enhancing fluorescence intensity by over 20-fold compared to uncorrected imaging. C-FOCUS provides a broadly applicable strategy for rapid, deep-tissue optical imaging in vivo.
Submitted 27 May, 2025;
originally announced May 2025.
-
Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation
Authors:
Li Zhong,
Ahmed Ghazal,
Jun-Jun Wan,
Frederik Zilly,
Patrick Mackens,
Joachim E. Vollrath,
Bogdan Sorin Coseriu
Abstract:
Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
Submitted 23 May, 2025;
originally announced May 2025.
-
Large Language Models as Computable Approximations to Solomonoff Induction
Authors:
Jun Wan,
Lingrui Mei
Abstract:
The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates the Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
Submitted 21 May, 2025;
originally announced May 2025.
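The few-shot selection rule mentioned in the abstract above (prefer samples on which the model is less confident) can be sketched in a few lines. The abstract does not state how confidence is measured, so this sketch assumes it is the maximum predicted class probability; the function names and the `(text, probabilities)` candidate format are hypothetical.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (example text, predicted class probabilities)


def confidence(probs: Sequence[float]) -> float:
    """Predictive confidence, taken here as the top class probability (assumption)."""
    return max(probs)


def select_few_shot(candidates: Sequence[Candidate], k: int) -> List[str]:
    """Return the k candidate texts with the lowest predictive confidence.

    Python's sort is stable, so ties keep their original order.
    """
    ranked = sorted(candidates, key=lambda c: confidence(c[1]))
    return [text for text, _ in ranked[:k]]
```

Sorting in the opposite order gives the high-confidence baseline that the abstract compares against.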
-
Scaling Laws for State Dynamics in Large Language Models
Authors:
Jacob X Li,
Shreyas S Raman,
Jessica Wan,
Fahad Samman,
Jazlyn Lin
Abstract:
Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.
Submitted 20 May, 2025;
originally announced May 2025.
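The tasks in the abstract above are each formalizable as a finite-state system, so next-state prediction accuracy can be scored against a ground-truth simulator. The sketch below shows one minimal way to do that for a DFA. It is only an illustration: the transition-table encoding and the function names are assumptions, and the paper's actual evaluation harness is not described in the abstract.

```python
from typing import Dict, List, Sequence, Tuple

Transitions = Dict[Tuple[int, str], int]  # (state, action) -> next state


def run_dfa(transitions: Transitions, start: int, actions: Sequence[str]) -> List[int]:
    """Return the ground-truth state reached after each action."""
    states = []
    state = start
    for action in actions:
        state = transitions[(state, action)]
        states.append(state)
    return states


def next_state_accuracy(predicted: Sequence[int], transitions: Transitions,
                        start: int, actions: Sequence[str]) -> float:
    """Fraction of steps where the predicted state matches the simulator."""
    truth = run_dfa(transitions, start, actions)
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)
```

Here `predicted` would hold a model's per-step state guesses; enlarging the state set or removing entries from `transitions` corresponds to the larger state spaces and sparser transitions the abstract associates with lower accuracy.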
-
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning
Authors:
Ajian Liu,
Haocheng Yuan,
Xiao Guo,
Hui Ma,
Wanyi Zhuang,
Changtao Miao,
Yan Hong,
Chuanbiao Song,
Jun Lan,
Qi Chu,
Tao Gong,
Yanyan Liang,
Weiqiang Wang,
Jun Wan,
Xiaoming Liu,
Zhen Lei
Abstract:
Presentation Attack Detection and Face Forgery Detection are designed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes respectively. But separate training of these two models makes them vulnerable to unknown attacks and burdens deployment environments. The lack of a Unified Face Attack Detection model to handle both types of attacks is mainly due to two factors. First, there's a lack of adequate benchmarks for models to explore. Existing UAD datasets have limited attack types and samples, restricting the model's ability to address advanced threats. To address this, we propose UniAttackDataPlus (UniAttackData+), the most extensive and sophisticated collection of forgery techniques to date. It includes 2,875 identities and their 54 kinds of falsified samples, totaling 697,347 videos. Second, there's a lack of a reliable classification criterion. Current methods try to find an arbitrary criterion within the same semantic space, which fails when encountering diverse attacks. So, we present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework (HiPTune) that adaptively explores multiple classification criteria from different semantic spaces. We build a Visual Prompt Tree to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model can select the most suitable prompts to guide the encoder to extract discriminative features at different levels in a coarse-to-fine way. Finally, to help the model understand the classification criteria in visual space, we propose a Dynamically Prompt Integration module to project the visual prompts to the text encoder for more accurate semantics. Experiments on 12 datasets have shown the potential to inspire further innovations in the UAD field.
Submitted 19 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Dynamics and leapfrogging phenomena of multiple helical vortices for 3D incompressible Euler equations
Authors:
Daomin Cao,
Junhong Fan,
Guolin Qin,
Jie Wan
Abstract:
In this paper, we investigate the time evolution of multiple interacting helical vortices without swirl for the incompressible Euler equations in $\mathbb R^3$. Assuming that the initial helically symmetric vorticity is concentrated within an $\ep$-neighborhood of $N$ distinct helices with vanishing mutual distance of order $O(\frac{1}{|\ln \ep|})$, and that each vortex core possesses a vorticity mass of order $O(\frac{1}{|\ln \ep|^2})$, we show that as $\ep\to 0$, the motion of the helical vortices converges to a dynamical system for positive times. Notably, in the case of two interacting helical vortices with initial mutual separation $\frac{\rho_0}{|\ln \ep|}$, by selecting sufficiently small $\rho_0$, our analysis extends to longer timescales encompassing several periods. This result provides the first mathematical justification for the numerically observed phenomenon termed "leapfrogging of Kelvin waves" reported in e.g. [N. Hietala et al., Phys. Rev. Fluids, 2016].
Submitted 5 June, 2025; v1 submitted 18 May, 2025;
originally announced May 2025.
-
Performance Analysis of Cooperative Integrated Sensing and Communications for 6G Networks
Authors:
Dongsheng Sui,
Cunhua Pan,
Hong Ren,
Jiahua Wan,
Liuchang Zhuo,
Jing Jin,
Qixing Wang,
Jiangzhou Wang
Abstract:
In this work, we aim to effectively characterize the performance of cooperative integrated sensing and communication (ISAC) networks and to reveal how performance metrics relate to network parameters. To this end, we introduce a generalized stochastic geometry framework to model the cooperative ISAC networks, which approximates the spatial randomness of the network deployment. Based on this framework, we derive analytical expressions for key performance metrics in both communication and sensing domains, with a particular focus on communication coverage probability and radar information rate. The analytical expressions derived explicitly highlight how performance metrics depend on network parameters, thereby offering valuable insights into the deployment and design of cooperative ISAC networks. In the end, we validate the theoretical performance analysis through Monte Carlo simulation results. Our results demonstrate that increasing the number of cooperative base stations (BSs) significantly improves both metrics, while increasing the BS deployment density has a limited impact on communication coverage probability but substantially enhances the radar information rate. Additionally, increasing the number of transmit antennas is effective when the total number of transmit antennas is relatively small. The incremental performance gain reduces with the increase of the number of transmit antennas, suggesting that indiscriminately increasing antennas is not an efficient strategy to improve the performance of the system in cooperative ISAC networks.
Submitted 13 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Authors:
Ruichu Cai,
Junjie Wan,
Weilin Chen,
Zeqin Yang,
Zijian Li,
Peng Zhen,
Jiecheng Guo
Abstract:
Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. In existing methods, several ideal assumptions, e.g. latent unconfoundedness assumption or additive equi-confounding bias assumption, are proposed to address the latent confounder problem raised by the observational data. However, in real-world applications, these assumptions are typically violated which limits their practical effectiveness. In this paper, we tackle the problem of estimating the long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby significantly avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
Submitted 8 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition
Authors:
Hussain Ahmad,
Qingyang Zeng,
Jing Wan
Abstract:
The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.
Submitted 8 May, 2025;
originally announced May 2025.
-
The unidirectional Seebeck detection of the Néel vector in the two-dimensional tetragonal $\mathcal{PT}$-symmetric antiferromagnetic materials
Authors:
Ya-Ting Xiao,
Ying-Li Wu,
Jia-Liang Wan,
Xiao-Qin Yu
Abstract:
The efficient detection of the reversal (180$^{\circ}$ rotation) of the Néel vector is one of the crucial tasks in antiferromagnetic spintronics. Here, we propose a thermal approach to detect the reversal of the Néel vector in tetragonal $\mathcal{PT}$ antiferromagnetic materials through the unidirectional Seebeck effect (USE). Unlike previous works, in which the USE stems from the global Rashba spin-orbit coupling (SOC) or asymmetric magnon scattering, we find that the USE originates from the coupling of the hidden Rashba SOC and the Néel vector in tetragonal $\mathcal{PT}$ antiferromagnetic materials in the absence of the global Rashba SOC. Using a generic minimal model, we analyse the behavior of the USE for the two-dimensional tetragonal lattice $\mathcal{PT}$ antiferromagnet. Importantly, we find that when the Néel vector is reversed, the sign of the USE changes, which can be utilized to detect the reversal of the Néel vector.
Submitted 5 May, 2025;
originally announced May 2025.
-
Formation of trapped surfaces for the Einstein--Maxwell--charged scalar field system
Authors:
Dawei Shen,
Jingbo Wan
Abstract:
In this paper, we prove a scale-critical trapped surface formation result for the Einstein--Maxwell--charged scalar field (EMCSF) system, without any symmetry assumptions. Specifically, we establish a scale-critical semi-global existence theorem from past null infinity and show that the focusing of gravitational waves, the concentration of electromagnetic fields, or the condensation of complex scalar fields, each individually, can lead to the formation of a trapped surface. In addition, we capture a nontrivial charging process along past null infinity, which introduces new difficulties due to the abnormal behavior of the matter fields. Nevertheless, the semi-global existence result and the formation of a trapped surface remain valid.
Submitted 28 April, 2025;
originally announced April 2025.
-
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
Authors:
Xiaofan Li,
Chenming Wu,
Zhao Yang,
Zhihao Xu,
Dingkang Liang,
Yumeng Zhang,
Ji Wan,
Jun Wang
Abstract:
This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.
Submitted 22 April, 2025;
originally announced April 2025.
-
Physics-informed Transformer Model for the Design of Wavelength-filtering Ring Resonator
Authors:
Yu Dian Lim,
Feng Shuo Wan,
Ren Jie Wan,
Chuan Seng Tan
Abstract:
We have developed a physics-informed transformer model that suggests design parameters for a wavelength-filtering ring resonator to match a given pair of resonant wavelengths with <6 nm error. The model provides a versatile method for rapid and accurate design of resonators corresponding to various resonant wavelengths.
Submitted 23 April, 2025;
originally announced April 2025.
-
Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
Authors:
Chang Zong,
Bin Li,
Shoujun Zhou,
Jian Wan,
Lei Zhang
Abstract:
Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
Submitted 22 April, 2025; v1 submitted 22 April, 2025;
originally announced April 2025.
-
SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking
Authors:
Yunfeng Li,
Bo Wang,
Jiahao Wan,
Xueyi Wu,
Ye Li
Abstract:
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
Submitted 22 April, 2025;
originally announced April 2025.
-
Few-Shot Vision-Language Action-Incremental Policy Learning
Authors:
Mingchen Song,
Xiang Deng,
Guoqiang Zhong,
Qi Lv,
Jia Wan,
Yinchuan Li,
Jianye Hao,
Weili Guan
Abstract:
Recently, Transformer-based robotic manipulation methods utilize multi-view spatial representations and language instructions to learn robot motion trajectories by leveraging numerous robot demonstrations. However, the collection of robot data is extremely challenging, and existing methods lack the capability for continuous learning on new tasks with only a few demonstrations. In this paper, we formulate these challenges as the Few-Shot Action-Incremental Learning (FSAIL) task, and accordingly design a Task-prOmpt graPh evolutIon poliCy (TOPIC) to address these issues. Specifically, to address the data scarcity issue in robotic imitation learning, TOPIC learns Task-Specific Prompts (TSP) through the deep interaction of multi-modal information within few-shot demonstrations, thereby effectively extracting the task-specific discriminative information. On the other hand, to enhance the capability for continual learning on new tasks and mitigate the issue of catastrophic forgetting, TOPIC adopts a Continuous Evolution Strategy (CES). CES leverages the intrinsic relationships between tasks to construct a task relation graph, which effectively facilitates the adaptation of new tasks by reusing skills learned from previous tasks. TOPIC pioneers few-shot continual learning in the robotic manipulation task, and extensive experimental results demonstrate that TOPIC outperforms state-of-the-art baselines by over 26$\%$ in success rate, significantly enhancing the continual learning capabilities of existing Transformer-based policies.
Submitted 21 April, 2025;
originally announced April 2025.
-
a1: Steep Test-time Scaling Law via Environment Augmented Generation
Authors:
Lingrui Mei,
Shenghua Liu,
Yiwei Wang,
Baolong Bi,
Yuyao Ge,
Jun Wan,
Yurong Wu,
Xueqi Cheng
Abstract:
Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.
Submitted 20 April, 2025;
originally announced April 2025.
-
Density-based Object Detection in Crowded Scenes
Authors:
Chenyang Zhao,
Jia Wan,
Antoni B. Chan
Abstract:
Compared with generic scenes, crowded scenes contain highly overlapped instances, which result in: 1) more ambiguous anchors during training of object detectors, and 2) more predictions being mistakenly suppressed in post-processing during inference. To address these problems, we propose two new strategies, density-guided anchors (DGA) and density-guided NMS (DG-NMS), which use object density maps to jointly compute optimal anchor assignments and re-weighting, as well as an adaptive NMS. Concretely, based on an unbalanced optimal transport (UOT) problem, the density owned by each ground-truth object is transported to each anchor position at a minimal transport cost. The density on anchors comprises an instance-specific density distribution, from which DGA decodes the optimal anchor assignment and re-weighting strategy. Meanwhile, DG-NMS utilizes the predicted density map to adaptively adjust the NMS threshold to reduce mistaken suppressions. In the UOT, a novel overlap-aware transport cost is specifically designed for ambiguous anchors caused by overlapped neighboring objects. Extensive experiments on the challenging CrowdHuman and Citypersons datasets demonstrate that our proposed density-guided detector is effective and robust to crowdedness. The code and pre-trained models will be made available later.
Submitted 13 April, 2025;
originally announced April 2025.
-
MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs
Authors:
Wei Tao,
Xiaoyang Qu,
Kai Lu,
Jiguang Wan,
Guokuan Li,
Jianzong Wang
Abstract:
When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.
Submitted 13 April, 2025;
originally announced April 2025.
-
Decoding the variability in the star-formation histories of z ~ 0.8 galaxies
Authors:
Jenny T. Wan,
Sandro Tacchella,
Francesco D'Eugenio,
Benjamin D. Johnson,
Arjen van der Wel
Abstract:
The scatter of the star-forming main sequence (SFMS) holds a wealth of information about how galaxies evolve. The timescales encoded in this scatter can provide valuable insight into the relative importance of the physical processes regulating star formation. In this paper, we present a detailed observational analysis of the timescales imprinted in galaxy star-formation history (SFH) fluctuations by using the stochastic SFH model to fit 1928 massive, z ~ 0.8 galaxies in the LEGA-C survey. We find that the total intrinsic scatter of the SFMS is ~0.3 dex in galaxies with stellar masses $\gtrsim 10^{10}~\mathrm{M}_\odot$. This scatter decreases as the timescale over which SFRs are averaged increases, declining to a non-negligible ~0.15 - 0.25 dex at 2 Gyr, underscoring the importance of long-timescale SFH diversity to the SFMS scatter. Furthermore, galaxies currently above (below) the SFMS tend to have been above (below) the SFMS for at least ~1 Gyr, providing evidence that individual galaxies may follow different median tracks through SFR$-\mathrm{M}_*$ space. On shorter timescales (~30 - 100 Myr), galaxies' SFRs also vary on the order of ~0.1 - 0.2 dex. Our work supports the idea that the SFMS emerges from a population average of the pathways that individual galaxies trace through the SFR$-\mathrm{M}_*$ plane. The scatter reflects the long-term heterogeneity of these paths likely set by the evolutionary timescales of halo growth and cooling, accentuated by short-term variations reflecting the dynamical timescale of the galaxy and its interstellar medium. Our results emphasize the dynamic nature of the SFMS and the importance of understanding the diverse processes governing star formation.
Submitted 7 April, 2025;
originally announced April 2025.
-
Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection
Authors:
Shunxin Chen,
Ajian Liu,
Junze Zheng,
Jun Wan,
Kailai Peng,
Sergio Escalera,
Zhen Lei
Abstract:
Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
Submitted 1 April, 2025;
originally announced April 2025.
-
FA^{3}-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection
Authors:
Yongze Li,
Ning Li,
Ajian Liu,
Hui Ma,
Liying Yang,
Xihong Chen,
Zhiyao Liang,
Yanyan Liang,
Jun Wan,
Zhen Lei
Abstract:
Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA\textsuperscript{3}-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to facilitate unified face attack detection effectiveness. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
Submitted 1 April, 2025;
originally announced April 2025.
-
A Channel-Triggered Backdoor Attack on Wireless Semantic Image Reconstruction
Authors:
Jialin Wan,
Jinglong Shen,
Nan Cheng,
Zhisheng Yin,
Yiliang Liu,
Wenchao Xu,
Xuemin Shen
Abstract:
This paper investigates backdoor attacks in image-oriented semantic communications. The threat of backdoor attacks on symbol reconstruction in semantic communication (SemCom) systems has received limited attention. Previous research on backdoor attacks targeting SemCom symbol reconstruction primarily focuses on input-level triggers, which are impractical in scenarios with strict input constraints. In this paper, we propose a novel channel-triggered backdoor attack (CT-BA) framework that exploits inherent wireless channel characteristics as activation triggers. Our key innovation involves utilizing fundamental channel statistics parameters, specifically channel gain with different fading distributions or channel noise with different power, as potential triggers. This approach enhances stealth by eliminating explicit input manipulation, provides flexibility through trigger selection from diverse channel conditions, and enables automatic activation via natural channel variations without adversary intervention. We extensively evaluate CT-BA across four joint source-channel coding (JSCC) communication system architectures and three benchmark datasets. Simulation results demonstrate that our attack achieves near-perfect attack success rate (ASR) while maintaining effective stealth. Finally, we discuss potential defense mechanisms against such attacks.
Submitted 20 May, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
Authors:
Wei Tao,
Bin Zhang,
Xiaoyang Qu,
Jiguang Wan,
Jianzong Wang
Abstract:
Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which is time-consuming in the search process and hardware-inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the optimal bitwidth configuration of the KV cache chunks based on the similarity scores between the corresponding context chunks and the query, maintaining model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
Submitted 29 March, 2025;
originally announced March 2025.
-
VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection
Authors:
Bin Zhang,
Xiaoyang Qu,
Guokuan Li,
Jiguang Wan,
Jianzong Wang
Abstract:
As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
Submitted 28 March, 2025;
originally announced March 2025.
-
RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations
Authors:
Bin Zhang,
Jinggang Chen,
Xiaoyang Qu,
Guokuan Li,
Kai Lu,
Jiguang Wan,
Jing Xiao,
Jianzong Wang
Abstract:
Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
Submitted 28 March, 2025;
originally announced March 2025.
-
Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
Authors:
Hao Tan,
Zichang Tan,
Jun Li,
Ajian Liu,
Jun Wan,
Zhen Lei
Abstract:
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
Submitted 19 March, 2025;
originally announced March 2025.
-
Video Individual Counting for Moving Drones
Authors:
Yaowu Fan,
Jia Wan,
Tao Han,
Antoni B. Chan,
Andy J. Ma
Abstract:
Video Individual Counting (VIC) has received increasing attention recently due to its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous crowd counting datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation for a highly varying view and time in crowded scenes. While VIC methods have been proposed based on localization-then-association or localization-then-classification, they may not perform well due to the difficulty of accurately localizing crowded and small targets under challenging scenarios. To address these issues, we collect the MovingDroneCrowd dataset and propose a density-map-based VIC method. Different from existing datasets, our dataset consists of videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. Rather than localizing individuals, we propose a Depth-wise Cross-Frame Attention (DCFA) module, which directly estimates inflow and outflow density maps through learning shared density maps between consecutive frames. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our dataset and publicly available ones show the superiority of our method over the state of the art for VIC in highly dynamic and complex crowded scenes. Our dataset and code will be released publicly.
Submitted 12 March, 2025;
originally announced March 2025.
-
Embodied Crowd Counting
Authors:
Runling Long,
Yunlong Wang,
Jia Wan,
Xiang Deng,
Xinting Zhu,
Weili Guan,
Antoni B. Chan,
Liqiang Nie
Abstract:
Occlusion is one of the fundamental challenges in crowd counting. In the community, various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which the methods are trained are based on passive cameras, restricting their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential in precise object detection in interactive scenes. These methods incorporate active camera settings, holding promise in addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, and their performance in analyzing complex object distributions in large-scale scenes, such as crowds, is unknown. Besides, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, preventing them from being introduced into dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build an interactive simulator, the Embodied Crowd Counting Dataset (ECCD), which enables large-scale scenes and large object quantities. A prior probability distribution that approximates realistic crowd distribution is introduced to generate crowds. Then, a zero-shot navigation method (ZECC) is proposed. This method contains an MLLM-driven coarse-to-fine navigation mechanism, enabling active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental results against baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.
Submitted 11 March, 2025;
originally announced March 2025.
-
Error estimates of asymptotic-preserving neural networks in approximating stochastic linearized Boltzmann equation
Authors:
Jiayu Wan,
Liu Liu
Abstract:
In this paper, we construct asymptotic-preserving neural networks (APNNs) [21] for the linearized Boltzmann equation in the acoustic scaling and with uncertain parameters. Utilizing the micro-macro decomposition, we design the loss function based on the stochastic-Galerkin system derived from the micro-macro equations. Rigorous analysis is provided to show the capability of neural networks in approximating solutions near the global Maxwellian. By employing hypocoercivity techniques, we demonstrate two key results: the existence of APNNs when the loss function approaches zero, and the convergence of the APNN-approximated solution as the loss tends to zero, with the error exhibiting an exponential decay in time.
Submitted 23 March, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
PinLanding: Content-First Keyword Landing Page Generation via Multi-Modal AI for Web-Scale Discovery
Authors:
Faye Zhang,
Jasmine Wan,
Qianyu Cheng,
Jinfeng Rao
Abstract:
Online platforms like Pinterest hosting vast content collections traditionally rely on manual curation or user-generated search logs to create keyword landing pages (KLPs) -- topic-centered collection pages that serve as entry points for content discovery. While manual curation ensures quality, it doesn't scale to millions of collections, and search log approaches result in limited topic coverage and imprecise content matching. In this paper, we present PinLanding, a novel content-first architecture that transforms the way platforms create topical collections. Instead of deriving topics from user behavior, our system employs a multi-stage pipeline combining a vision-language model (VLM) for attribute extraction, a large language model (LLM) for topic generation, and a CLIP-based dual-encoder architecture for precise content matching. Our model achieves 99.7% Recall@10 on the Fashion200K benchmark, demonstrating strong attribute understanding capabilities. In production deployment for search engine optimization with 4.2 million shopping landing pages, the system achieves a 4X increase in topic coverage and a 14.29% improvement in collection attribute precision over the traditional search log-based approach, as measured by human evaluation. The architecture can be generalized beyond search traffic to power various user experiences, including content discovery and recommendations, providing a scalable solution to transform unstructured content into curated topical collections across any content domain.
Submitted 1 March, 2025;
originally announced March 2025.
-
Nonparametric Heterogeneous Long-term Causal Effect Estimation via Data Combination
Authors:
Weilin Chen,
Ruichu Cai,
Junjie Wan,
Zeqin Yang,
José Miguel Hernández-Lobato
Abstract:
Long-term causal inference has drawn increasing attention in many scientific domains. Existing methods mainly focus on estimating average long-term causal effects by combining long-term observational data and short-term experimental data. However, it is still understudied how to robustly and effectively estimate heterogeneous long-term causal effects, significantly limiting practical applications. In this paper, we propose several two-stage style nonparametric estimators for heterogeneous long-term causal effect estimation, including propensity-based, regression-based, and multiple robust estimators. We conduct a comprehensive theoretical analysis of their asymptotic properties under mild assumptions, with the ultimate goal of building a better understanding of the conditions under which some estimators can be expected to perform better. Extensive experiments across several semi-synthetic and real-world datasets validate the theoretical results and demonstrate the effectiveness of the proposed estimators.
Submitted 2 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Authors:
Wenwen Yu,
Zhibo Yang,
Jianqiang Wan,
Sibo Song,
Jun Tang,
Wenqing Cheng,
Yuliang Liu,
Xiang Bai
Abstract:
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a single framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schema, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input & output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of the SPOT prompting technique. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
Submitted 22 February, 2025;
originally announced February 2025.
-
Qwen2.5-VL Technical Report
Authors:
Shuai Bai,
Keqin Chen,
Xuejing Liu,
Jialin Wang,
Wenbin Ge,
Sibo Song,
Kai Dang,
Peng Wang,
Shijie Wang,
Jun Tang,
Humen Zhong,
Yuanzhi Zhu,
Mingkun Yang,
Zhaohai Li,
Jianqiang Wan,
Pengfei Wang,
Wei Ding,
Zheren Fu,
Yiheng Xu,
Jiabo Ye,
Xi Zhang,
Tianbao Xie,
Zesen Cheng,
Hang Zhang,
Zhibo Yang,
et al. (2 additional authors not shown)
Abstract:
We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
Submitted 19 February, 2025;
originally announced February 2025.
-
COFFE: A Code Efficiency Benchmark for Code Generation
Authors:
Yun Peng,
Jun Wan,
Yichen Li,
Xiaoxue Ren
Abstract:
Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks are proposed to evaluate the correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot well distinguish the time efficiency of different code solutions. Besides, the current execution time measurement is not stable and comprehensive, threatening the validity of the time efficiency evaluation.
To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve the distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw some implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.
Submitted 4 February, 2025;
originally announced February 2025.
-
EvidenceMap: Learning Evidence Analysis to Unleash the Power of Small Language Models for Biomedical Question Answering
Authors:
Chang Zong,
Jian Wan,
Siliang Tang,
Lei Zhang
Abstract:
When addressing professional questions in the biomedical domain, humans typically acquire multiple pieces of information as evidence and engage in multifaceted analysis to provide high-quality answers. Current LLM-based question answering methods lack a detailed definition and learning process for evidence analysis, leading to the risk of error propagation and hallucinations while using evidence. Although increasing the parameter size of LLMs can alleviate these issues, it also presents challenges in training and deployment with limited resources. In this study, we propose EvidenceMap, which aims to enable a tiny pre-trained language model to explicitly learn multiple aspects of biomedical evidence, including supportive evaluation, logical correlation and content summarization, thereby latently guiding a small generative model (around 3B parameters) to provide textual responses. Experimental results demonstrate that our method, learning evidence analysis by fine-tuning a model with only 66M parameters, exceeds the RAG method with an 8B LLM by 19.9% and 5.7% in reference-based quality and accuracy, respectively.
Submitted 13 February, 2025; v1 submitted 22 January, 2025;
originally announced January 2025.
-
Unifying Two Types of Scaling Laws from the Perspective of Conditional Kolmogorov Complexity
Authors:
Jun Wan
Abstract:
In 2020, OpenAI proposed the first type of Scaling Laws, describing the relationships between model loss and the scale of parameters, data, and training computation. In 2024, OpenAI proposed the second type of Scaling Laws, describing the relationship between model inference performance and inference computation. In this paper, we analyze the training and inference processes of LLMs from the perspective of lossless compression using conditional Kolmogorov complexity, and unify these two types of Scaling Laws. We find that both types of Scaling Laws improve the approximation of conditional Kolmogorov complexity by increasing the execution steps of the Turing machine. The first type of Scaling Laws increases execution steps by increasing the number of model parameters. The second type of Scaling Laws increases execution steps by increasing the number of intermediate tokens.
Submitted 10 February, 2025; v1 submitted 12 January, 2025;
originally announced January 2025.
-
On representation theory of cyclotomic Hecke-Clifford algebras
Authors:
Lei Shi,
Jinkui Wan
Abstract:
In this article, we give an explicit construction of the simple modules for both non-degenerate and degenerate cyclotomic Hecke-Clifford superalgebras over an algebraically closed field of characteristic not equal to $2$, under a certain condition on the parameters defining these algebras. As an application, we obtain a sufficient condition for the semisimplicity of these cyclotomic Hecke-Clifford superalgebras via a dimension comparison. As a byproduct, both generic non-degenerate and degenerate cyclotomic Hecke-Clifford superalgebras are shown to be semisimple.
Submitted 26 March, 2025; v1 submitted 12 January, 2025;
originally announced January 2025.
-
Extrinsic nonlinear acoustic valley Hall effect in the massive Dirac materials
Authors:
Jia-Liang Wan,
Ying-Li Wu,
Ke-Qiu Chen,
Xiao-Qin Yu
Abstract:
The nonlinear acoustic valley Hall effect (AVHE), a recently discovered acoustically driven phenomenon, has sparked extensive interest in valleytronics. So far, only the intrinsic contributions from band structure (Berry curvature or asymmetric energy dispersions) to the nonlinear AVHE have been investigated. Here, we theoretically investigate the nonlinear AVHE from both intrinsic and extrinsic contributions in two-dimensional (2D) hexagonal massive Dirac materials with disorder based on the Boltzmann formalism, and also concretely analyse the behaviour of the nonlinear AVHE in disordered monolayer MoS2. It is found that the extrinsic contributions (side jump and skew scattering) can also give rise to a pure nonlinear AVHE in 2D hexagonal massive Dirac materials. Remarkably, the extrinsic mechanisms dominate the nonlinear AVHE in disordered monolayer MoS2.
Submitted 9 January, 2025;
originally announced January 2025.
-
A canonical foliation on null infinity in perturbations of Kerr
Authors:
Sergiu Klainerman,
Dawei Shen,
Jingbo Wan
Abstract:
Kerr stability for small angular momentum has been proved in the series of works by Klainerman-Szeftel, Giorgi-Klainerman-Szeftel and Shen. Some of the most basic conclusions of the result, concerning various physical quantities on future null infinity, are derived in the work of Klainerman-Szeftel. Further important conclusions were later derived in An-He-Shen and Chen-Klainerman. In this paper, based on the existence and uniqueness results for GCM spheres by Klainerman-Szeftel, we establish the existence of a canonical foliation on future null infinity for which the null energy, linear momentum, center of mass and angular momentum are well defined and satisfy the expected physical laws of gravitational radiation. The rigid character of this foliation eliminates the usual ambiguities related to these quantities in the physics literature. We also show that under the initial assumption of Klainerman-Szeftel, the center of mass of the black hole has a large deformation (recoil) after the perturbation.
Submitted 28 December, 2024;
originally announced December 2024.
-
A New Method to Capturing Compositional Knowledge in Linguistic Space
Authors:
Jiahe Wan
Abstract:
Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We introduce a "no" logical regularization to address the issue of token interaction in inversion. Additionally, we use knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.
Submitted 20 December, 2024;
originally announced December 2024.