Search | arXiv e-print repository

Enhancing the Performance of Global Model by Improving the Adaptability of Local Models in Federated Learning

Authors: Wujun Zhou, Shu Ding, ZeLin Li, Wei Wang

Abstract: Federated learning enables clients to collaboratively train a global model, which is aggregated from local models. Due to the heterogeneous data distributions over clients and data privacy in federated learning, it is difficult to train local models to achieve a well-performing global model. In this paper, we introduce the adaptability of local models, i.e., the average performance of local models on data distributions over clients, and enhance the performance of the global model by improving the adaptability of local models. Since each client does not know the data distributions over other clients, the adaptability of the local model cannot be directly optimized. First, we provide the property of an appropriate local model which has good adaptability on the data distributions over clients. Then, we formalize the property into the local training objective with a constraint and propose a feasible solution to train the local model. Extensive experiments on federated learning benchmarks demonstrate that our method significantly improves the adaptability of local models and achieves a well-performing global model that consistently outperforms the baseline methods.

Submitted 18 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.
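
The two quantities named in the abstract above, a global model aggregated from local models and the adaptability of a local model (its average performance across client data distributions), can be illustrated with a minimal sketch. The sample-count-weighted averaging below is the common FedAvg-style rule and is an assumption here, since the abstract does not state the aggregation rule; the names `aggregate` and `adaptability` and the score matrix are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flat list of values


def aggregate(local_models: List[Params], num_samples: List[int]) -> Params:
    """Average local parameters, weighting each client by its sample count.

    Assumption: FedAvg-style weighting; the paper may aggregate differently.
    """
    total = sum(num_samples)
    global_model: Params = {}
    for name in local_models[0]:
        size = len(local_models[0][name])
        global_model[name] = [
            sum(m[name][i] * n for m, n in zip(local_models, num_samples)) / total
            for i in range(size)
        ]
    return global_model


def adaptability(scores: List[List[float]]) -> List[float]:
    """scores[i][j] is the performance of local model i on client j's data.

    Returns the mean score of each local model across all clients.
    """
    return [sum(row) / len(row) for row in scores]


# Two hypothetical clients holding 1 and 3 samples respectively.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
global_model = aggregate(clients, num_samples=[1, 3])
per_model = adaptability([[0.9, 0.5], [0.6, 0.8]])
```

In a real deployment no single party could evaluate the full score matrix, because each client sees only its own data; that is the obstacle the abstract describes, and the sketch only makes the definition concrete.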

arXiv:2505.09590 [pdf, ps, other]

Distance-aware Self-adaptive Graph Convolution for Fine-grained Hierarchical Recommendation

Authors: Tao Huang, Yihong Chen, Wei Fan, Wei Zhou, Junhao Wen

Abstract: Graph Convolutional Networks (GCNs) are widely used to improve recommendation accuracy and performance by effectively learning the representations of user and item nodes. However, two major challenges remain: (1) the lack of further optimization in the graph representation structure and (2) insufficient attention given to the varying contributions of different convolutional layers. This paper proposes SAGCN, a distance-based adaptive hierarchical aggregation method that refines the aggregation process through differentiated representation metrics. SAGCN introduces a detailed approach to multilayer information aggregation and representation space optimization, enabling the model to learn hierarchical embedding weights based on the distance between hierarchical representations. This innovation allows for more precise cross-layer information aggregation, improves the model's ability to capture hierarchical embeddings, and optimizes the representation space structure. Additionally, the objective loss function is refined to better align with recommendation tasks. Extensive experiments conducted on four real-world datasets demonstrate significant improvements, including over a 5% increase on Yelp and a 5.58% increase in Recall@10 on the ML_1M dataset.

Submitted 14 May, 2025; originally announced May 2025.
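
For readers unfamiliar with the Recall@10 metric quoted above, the following is a minimal sketch of the standard definition: the fraction of a user's relevant items that appear in the top-K recommendations. The function name and the toy item IDs are hypothetical, and the abstract does not say whether the reported gains are relative or absolute.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of relevant items found among the first k ranked items.

    Assumes ranked_items contains no duplicates.
    """
    relevant = set(relevant_items)
    if not relevant:
        return 0.0  # convention chosen here for users with no relevant items
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


# Toy example: one of the three relevant items appears in the top 3.
score = recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=3)
```

Benchmark results are usually the mean of this per-user value over all test users.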

arXiv:2505.07611 [pdf]

Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions

Authors: Yi Zhang, Wenye Zhou, Ruonan Lin, Xin Yang, Hao Zheng

Abstract: Traffic accident prediction and detection are critical for enhancing road safety, and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep learning. This paper reviews 147 recent studies, focusing on the application of supervised, unsupervised, and hybrid deep learning models for accident prediction, alongside the use of real-world and synthetic datasets. Current methodologies are categorized into four key approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding, and multimodal data fusion. While these methods demonstrate significant potential, challenges such as data scarcity, limited generalization to complex scenarios, and real-time performance constraints remain prevalent. This review highlights opportunities for future research, including the integration of multimodal data fusion, self-supervised learning, and Transformer-based architectures to enhance prediction accuracy and scalability. By synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems, contributing to road safety and traffic management.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.06679 [pdf, other]

Jailbreaking the Text-to-Video Generative Models

Authors: Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rongcheng Tu, Wenbo Zhou, Xiaochun Cao, Dacheng Tao, Siew Kei Lam

Abstract: Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attacks, i.e., being induced to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the first optimization-based jailbreak attack designed specifically for text-to-video models. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video.
We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.04028 [pdf]

Appeal and Scope of Misinformation Spread by AI Agents and Humans

Authors: Lynnette Hui Xian Ng, Wenqi Zhou, Kathleen M. Carley

Abstract: This work examines the influence of misinformation and the role of AI agents, called bots, on social network platforms. To quantify the impact of misinformation, it proposes two new metrics based on attributes of tweet engagement and user network position: Appeal, which measures the popularity of the tweet, and Scope, which measures the potential reach of the tweet. In addition, it analyzes 5.8 million misinformation tweets on the COVID-19 vaccine discourse over three time periods: Pre-Vaccine, Vaccine Launch, and Post-Vaccine. Results show that misinformation was more prevalent during the first two periods. Human-generated misinformation tweets tend to have higher appeal and scope compared to bot-generated ones. Tweedie regression analysis reveals that human-generated misinformation tweets were most concerning during Vaccine Launch week, whereas bot-generated misinformation reached its highest appeal and scope during the Pre-Vaccine period.

Submitted 6 May, 2025; originally announced May 2025.

Comments: Accepted to AMCIS 2025

arXiv:2505.03912 [pdf, other]

OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation

Authors: Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, Donglin Wang

Abstract: Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper summarizes and compares the structural designs of existing dual-system architectures and conducts systematic empirical evaluations of their core design elements. Ultimately, it provides a low-cost open-source model for further exploration. The project will continue to be updated with more experimental conclusions and open-source models with improved performance. Project page: https://openhelix-robot.github.io/.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.03574 [pdf, other]

LlamaFirewall: An open source guardrail system for building secure AI agents

Authors: Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe

Abstract: Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense and to support system-level, use-case-specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source, security-focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state-of-the-art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents.
Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.03494 [pdf]

UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion

Authors: Zhanyuan Jia, Ni Yao, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Fubao Zhu, Chen Zhao, Weihua Zhou

Abstract: Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to the tumors' irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model's robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture.
By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved. The code for the proposed method is available at https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.

Submitted 6 May, 2025; originally announced May 2025.

Comments: 21 pages, 7 figures
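
The Dice scores reported above follow the standard overlap measure, Dice = 2|P ∩ T| / (|P| + |T|), for a predicted mask P and a ground-truth mask T. The sketch below computes it on flat binary masks; the function name and the choice to return 1.0 when both masks are empty are assumptions, and the authors' evaluation code may handle that edge case differently.

```python
def dice_score(pred, target):
    """Dice coefficient for two equal-length sequences of 0/1 labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Count voxels labelled 1 in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: each mask has two foreground voxels, one of them shared.
value = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

For BraTS-style evaluation the same formula is applied separately to each region (ET, WT, TC) after flattening the 3D volume.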

arXiv:2505.01950 [pdf, other]

Segment Any RGB-Thermal Model with Language-aided Distillation

Authors: Dong Xing, Xianxun Zhu, Wei Zhou, Qika Lin, Hang Yang, Yuqing Wang

Abstract: The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first involves fine-tuning the original SAM by adding extra LoRA layers, aiming at preserving SAM's strong generalization and segmentation capabilities for downstream tasks. Secondly, we introduce language information as guidance for training our SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion.
Extensive experiments are conducted across three multi-modal RGB-T semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.

Submitted 3 May, 2025; originally announced May 2025.

Comments: arXiv admin note: text overlap with arXiv:2412.04220 by other authors

arXiv:2505.01189 [pdf, ps, other]

Principal Non-singularity of Fourier Matrices on $\mathbb Z_p \times \mathbb Z_q$ and $\mathbb Z_2^k \times \mathbb Z_q$

Authors: Weiqi Zhou

Abstract: Let $F_n$ be the $n\times n$ Fourier matrix (on cyclic groups $\mathbb Z_n$). A renowned theorem of Chebotarëv asserts that all minors in $F_n$ for prime $n$ are non-zero. In this short note it is shown that (i) all principal minors in the Kronecker product $F_p\otimes F_q$ are non-vanishing (principal non-singularity) for distinct odd primes $p,q$ if $q$ is large enough and generates the multiplicative group $\mathbb Z_p^*$; (ii) the Fourier matrix on $\mathbb Z_2^k \times \mathbb Z_q$ is principally non-singular upon permutation (in particular, for $k=1$ the identity permutation suffices) for odd prime $q$ and $k=1,2,3$. The proof is just an exposition of existing techniques re-organized in a unified way. The result will have implications in combining Riesz bases of exponentials.

Submitted 2 May, 2025; originally announced May 2025.

MSC Class: 42A99; 15A15
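
The Chebotarëv property cited above can be checked numerically for small sizes. The sketch below is a brute-force illustration using only the standard library, not the note's proof technique: it enumerates every minor (or only the principal ones) of $F_n$ and returns the smallest absolute value. The function names, the sign convention $\omega = e^{-2\pi i/n}$, and the pivot tolerance are all choices made for this example.

```python
import cmath
from itertools import combinations


def fourier_matrix(n):
    """n x n Fourier matrix with entries w^(j*k), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1 + 0j
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:  # numerically singular
            return 0j
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def min_abs_minor(n, principal_only=False):
    """Smallest |minor| of F_n over all (or only principal) submatrices."""
    f = fourier_matrix(n)
    best = float("inf")
    for size in range(1, n + 1):
        for rows in combinations(range(n), size):
            col_sets = [rows] if principal_only else combinations(range(n), size)
            for cols in col_sets:
                sub = [[f[r][c] for c in cols] for r in rows]
                best = min(best, abs(det(sub)))
    return best


prime_case = min_abs_minor(5)                           # n prime
composite_case = min_abs_minor(4, principal_only=True)  # n composite
```

For $n=5$ every minor stays bounded away from zero, while for $n=4$ the principal minor on rows and columns $\{0,2\}$ vanishes, which is why composite orders need the extra hypotheses stated in the abstract. The search is exponential in $n$, so it is only practical for very small matrices.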

arXiv:2505.00304 [pdf, other]

Reinforcement Learning with Continuous Actions Under Unmeasured Confounding

Authors: Yuhan Li, Eugene Han, Yifan Hu, Wenzhuo Zhou, Zhengling Qi, Yifan Cui, Ruoqing Zhu

Abstract: This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we advance this field by establishing a novel identification result to enable the nonparametric estimation of policy value for a given target policy under an infinite-horizon framework. Leveraging this identification, we develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy that maximizes the estimated policy value. Furthermore, we provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy. Extensive simulations and a real-world application using the German Family Panel data demonstrate the effectiveness of our proposed methodology.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.19638 [pdf, other]

LODAP: On-Device Incremental Learning Via Lightweight Operations and Data Pruning

Authors: Biqing Duan, Qing Wang, Di Liu, Wei Zhou, Zhenli He, Shengfa Miao

Abstract: Incremental learning that learns new classes over time after the model's deployment is becoming increasingly crucial, particularly for industrial edge systems, where it is difficult to communicate with a remote server to conduct computation-intensive learning. As edge devices are expected to learn more classes after deployment, we propose LODAP, a new on-device incremental learning framework for edge systems. The key part of LODAP is a new module, namely Efficient Incremental Module (EIM). EIM is composed of normal convolutions and lightweight operations. During incremental learning, EIM exploits some lightweight operations, called adapters, to effectively and efficiently learn features for new classes so that it can improve the accuracy of incremental learning while reducing model complexity as well as training overhead. The efficiency of LODAP is further enhanced by a data pruning strategy that significantly reduces the training data, thereby lowering the training overhead. We conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets. Experimental results show that LODAP improves the accuracy by up to 4.32% over existing methods while reducing around 50% of model complexity. In addition, evaluations on real edge systems demonstrate its applicability for on-device machine learning. The code is available at https://github.com/duanbiqing/LODAP.

Submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.19478 [pdf, other]

CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

Authors: Weitao Feng, Hang Zhou, Jing Liao, Li Cheng, Wenbo Zhou

Abstract: We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.

Submitted 28 April, 2025; originally announced April 2025.
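
The rejection step described in the CasaGPT abstract, discarding sampled scenes in which objects collide, can be sketched as a simple filter. Everything below is an illustrative assumption rather than the authors' pipeline: each object is reduced to a single axis-aligned box given by its minimum and maximum corners, whereas the paper composes objects from several cuboids that may also be rotated, and the names `boxes_intersect`, `scene_has_collision`, and `reject_colliding` are hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner), axis-aligned


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes overlap with positive volume.

    Boxes that merely touch along a face or edge do not count as colliding.
    """
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] < bmax[i] and bmin[i] < amax[i] for i in range(3))


def scene_has_collision(scene: List[Box]) -> bool:
    """Check every unordered pair of boxes in the scene."""
    return any(
        boxes_intersect(scene[i], scene[j])
        for i in range(len(scene))
        for j in range(i + 1, len(scene))
    )


def reject_colliding(scenes: List[List[Box]]) -> List[List[Box]]:
    """Keep only the sampled scenes that are collision-free."""
    return [s for s in scenes if not scene_has_collision(s)]


bed = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
lamp_ok = ((2.0, 0.0, 0.0), (2.5, 0.5, 1.5))   # touches the bed, no overlap
lamp_bad = ((1.5, 0.5, 0.0), (2.5, 1.5, 1.5))  # overlaps the bed
kept = reject_colliding([[bed, lamp_ok], [bed, lamp_bad]])
```

The surviving scenes would then serve as fine-tuning data, which is the role the abstract assigns to rejection sampling.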

arXiv:2504.19300 [pdf]

Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography

Authors: Ni Yao, Xiangyu Liu, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Weihua Zhou, Chen Zhao

Abstract: Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net (MGFA-Net), a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net.
The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine. Our code is publicly available at http://github.com/chenzhao2023/MGFA_CCTA

Submitted 27 April, 2025; originally announced April 2025.

Comments: 31 pages, 12 figures

arXiv:2504.18600 [pdf, other]

QuantBench: Benchmarking AI Methods for Quantitative Investment

Authors: Saizhuo Wang, Hao Kong, Jiadong Guo, Fengrui Hua, Yiyan Qi, Wanyun Zhou, Jiahao Zheng, Xinyu Wang, Lionel M. Ni, Jian Guo

Abstract: The field of artificial intelligence (AI) in quantitative investment has seen significant advancements, yet it lacks a standardized benchmark aligned with industry practices. This gap hinders research progress and limits the practical application of academic innovations. We present QuantBench, an industrial-grade benchmark platform designed to address this critical need. QuantBench offers three key strengths: (1) standardization that aligns with quantitative investment industry practices, (2) flexibility to integrate various AI algorithms, and (3) full-pipeline coverage of the entire quantitative investment process. Our empirical studies using QuantBench reveal some critical research directions, including the need for continual learning to address distribution shifts, improved methods for modeling relational financial data, and more robust approaches to mitigate overfitting in low signal-to-noise environments. By providing a common ground for evaluation and fostering collaboration between researchers and practitioners, QuantBench aims to accelerate progress in AI for quantitative investment, similar to the impact of benchmark platforms in computer vision and natural language processing.

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.17463 [pdf, other]

doi 10.3847/1538-4365/adcd60

Hemispheric Distribution of Solar Active Regions During Solar Cycles 23-25

Authors: Yuxia Liu, Tingting Xu, Miao Wan, Linhua Deng, Xinhua Zhao, Shiyang Qi, Nanbin Xiang, Weihong Zhou

Abstract: Solar active regions (ARs) are crucial for understanding the long-term evolution of solar activity and predicting eruptive phenomena, including solar flares and coronal mass ejections. However, the cycle-dependent properties in the north-south asymmetry of ARs have not been fully understood. In this study, we investigate the hemispheric distribution of ARs from Carrington Rotation 1909 to 2278 (between 1996 May and 2023 November) by using three parameters that describe the magnetic field distribution of ARs: number, area, and flux. The main findings are as follows: (1) The three AR parameters show significant hemispheric asymmetry in cycles 23-25. The strong correlation between AR area and flux indicates that they can better reflect the intrinsic properties of the solar magnetic field. (2) The correlation between sunspot activity and AR parameters varies in the two hemispheres across the different cycles. The AR parameters provide additional information for the variations in sunspot activity, which can better predict the intensity and cyclical changes of solar activity. (3) The variation in the fitting slope sign of the asymmetry index for AR parameters reflects periodic changes in hemispheric ARs, providing valuable insights into the activity of other stars. (4) Both the dominant hemisphere and the cumulative trend of AR parameters display a cycle-dependent behavior. Moreover, the trend variations of AR area and flux are similar, reflecting the long-term evolutionary characteristics of the solar magnetic field.
Our analysis results are relevant for understanding the hemispheric coupling of solar magnetic activity and its cyclic evolutionary patterns.

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.17263 [pdf, other]

Precision Neural Network Quantization via Learnable Adaptive Modules

Authors: Wenqiang Zhou, Zhendong Yu, Xinyu Liu, Jiaming Yang, Rong Xiao, Tao Wang, Chenwei Tang, Jiancheng Lv

Abstract: Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency while effectively maintaining model performance. The paradigm of QAT is to introduce fake quantization operators during the training process, allowing the model to autonomously compensate for information loss caused by quantization. Making quantization parameters trainable can significantly improve the performance of QAT, but at the cost of compromising the flexibility during inference, especially when dealing with activation values with substantially different distributions. In this paper, we propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ), to resolve this conflict. Specifically, the proposed ASQ method first dynamically adjusts quantization scaling factors through a trained module capable of accommodating different activations. Then, to address the rigid resolution issue inherent in Power of Two (POT) quantization, we propose an efficient non-uniform quantization scheme. We utilize the Power Of Square root of Two (POST) as the basis for exponential quantization, effectively handling the bell-shaped distribution of neural network weights across various bit-widths while maintaining computational efficiency through a Look-Up Table method (LUT). Extensive experimental results demonstrate that the proposed ASQ method is superior to the state-of-the-art QAT approaches. Notably, ASQ is even competitive with full precision baselines, with its 4-bit quantized ResNet34 model improving accuracy by 1.2% on ImageNet.

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.16601 [pdf, other]

Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

Authors: Andy Li, Wei Zhou, Rashina Hoda, Chris Bain, Peter Poon

Abstract: This study evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.

Submitted 23 April, 2025; originally announced April 2025.

Comments: 8 pages, 2 tables and 1 figure

arXiv:2504.16552 [pdf, ps, other]

DTVM: Revolutionizing Smart Contract Execution with Determinism and Compatibility

Authors: Wei Zhou, Xiong Xu, Changzheng Wei, Ying Yan, Wei Tang, Zhihao Chen, Xuebing Huang, Wengang Chen, Jie Zhang, Yang Chen, Xiaofu Zheng, Hanghang Wu, Shenglong Chen, Ermei Wang, Xiangfei Chen, Yang Yu, Meng Wu, Tao Zhu, Liwei Yuan, Feng Yu, Alex Zhang, Wei Wang, Ji Luo, Zhengyu He, Wenbiao Zhao

Abstract: We introduce the DeTerministic Virtual Machine (DTVM) Stack, a next-generation smart contract execution framework designed to address critical performance, determinism, and ecosystem compatibility challenges in blockchain networks. Building upon WebAssembly (Wasm) while maintaining full Ethereum Virtual Machine (EVM) ABI compatibility, DTVM introduces a Deterministic Middle Intermediate Representation (dMIR) and a hybrid lazy-JIT compilation engine to balance compilation speed and execution efficiency. DTVM further accommodates diverse instruction set architectures (e.g., EVM, RISC-V) through modular adaptation layers. This enables seamless integration with DTVM's hybrid lazy-JIT compilation engine, which dynamically optimizes performance while preserving deterministic execution guarantees across heterogeneous environments. The key contributions include: 1) The framework achieves up to 2$\times$ acceleration over evmone in dominant Ethereum contract (e.g., ERC20/721/1155) execution and reduces Fibonacci computation latency by 11.8$\sim$40.5% compared to Wasm-based VMs. 2) A novel trampoline hot-switch mechanism enables sub-millisecond (0.95ms) post-deployment invocation times, outperforming up to about 23$\times$ in compilation and invocation efficiency. 3) It supports multi-language development (Solidity, C++, Rust, Java, Go, and AssemblyScript) through unified bytecode conversion while maintaining EVM ABI compatibility for seamless invocation. It reduces machine code object sizes by 30.0$\sim$72.6%, coupled with a minimized Trusted Computing Base. 4) It offers SmartCogent, an AI-driven full-stack development experience, leveraging fine-tuned LLMs and retrieval-augmented generation to automate tasks across the smart contract lifecycle: development, debugging, security auditing, and deployment. DTVM Stack has been open-sourced (https://github.com/DTVMStack).

Submitted 9 June, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.16511 [pdf, other]

QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Authors: Fengze Liu, Weidong Zhou, Binbin Liu, Zhimiao Yu, Yifan Zhang, Haobin Lin, Yifeng Yu, Bingni Zhang, Xiaohuan Zhou, Taifeng Wang, Yong Cao

Abstract: Quality and diversity are two critical metrics for the training data of large language models (LLMs), positively impacting performance. Existing studies often optimize these metrics separately, typically by first applying quality filtering and then adjusting data proportions. However, these approaches overlook the inherent trade-off between quality and diversity, necessitating their joint consideration. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for LLM pretraining while balancing both quality and diversity. Specifically, we first propose multiple criteria to measure data quality and employ domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then employs a unified parameterized data sampling function that determines the sampling probability of each data point based on these quality and diversity related labels. To accelerate the search for the optimal parameters involved in the QuaDMix framework, we conduct simulated experiments on smaller models and use LightGBM for parameter search, inspired by the RegMix method. Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. These results outperform the independent strategies for quality and diversity, highlighting the necessity and ability to balance data quality and diversity.

Submitted 25 April, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.16405 [pdf, other]

EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment

Authors: Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, Xiongkuo Min

Abstract: The furnishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.

Submitted 7 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.15710 [pdf]

Prediction of CO2 reduction reaction intermediates and products on transition metal-doped r-GeSe monolayers: A combined DFT and machine learning approach

Authors: Xuxin Kang, Wenjing Zhou, Ziyuan Li, Zhaoqin Chu, Hanqin Yin, Shan Gao, Aijun Du, Xiangmei Duan

Abstract: The electrocatalytic CO2 reduction reaction (CO2RR) is a complex multi-proton-electron transfer process that generates a vast network of reaction intermediates. Accurate prediction of free energy changes (ΔG) of these intermediates and products is essential for evaluating catalytic performance. We combined density functional theory (DFT) and machine learning (ML) to screen 25 single-atom catalysts (SACs) on defective r-GeSe monolayers for CO2 reduction to methanol, methane, and formic acid. Among nine ML models evaluated with 14 intrinsic and DFT-based features, XGBoost performed best (R2 = 0.92 and MAE = 0.24 eV), aligning closely with DFT calculations and identifying Ni, Ru, and Rh@GeSe as prospective catalysts. Feature importance analysis in free energy and product predictions highlighted the significance of CO2 activation with O-C-O and IPC-O1 as the key attributes. Furthermore, by incorporating non-DFT-based features, rapid predictions became possible, and the XGBoost model retained its predictive performance with R2 = 0.89 and MAE = 0.29 eV. This accuracy was further validated using Ir@GeSe. Our work highlights effective SACs for CO2RR and provides valuable insights for efficient catalyst design.

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.15384 [pdf]

ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images

Authors: Chen Zhao, Anjum Shaik, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou

Abstract: Hip fractures represent a major health concern, particularly among the elderly, often leading to decreased mobility and increased mortality. Early and accurate detection of at-risk individuals is crucial for effective intervention. In this study, we propose Iterative Cross Graph Matching for Hip Fracture Risk Assessment (ICGM-FRAX), a novel approach for predicting hip fractures using Dual-energy X-ray Absorptiometry (DXA) images. ICGM-FRAX involves iteratively comparing a test (subject) graph with multiple template graphs representing the characteristics of hip fracture subjects to assess their similarity and accurately predict hip fracture risk. These graphs are obtained as follows. The DXA images are separated into multiple regions of interest (RoIs), such as the femoral head, shaft, and lesser trochanter. Radiomic features are then calculated for each RoI, with the central coordinates used as nodes in a graph. The connectivity between nodes is established according to the Euclidean distance between these coordinates. This process transforms each DXA image into a graph, where each node represents an RoI, and edges derived from the centroids of RoIs capture the spatial relationships between them. If the test graph closely matches a set of template graphs representing subjects with incident hip fractures, it is classified as indicating high hip fracture risk. We evaluated our method using 547 subjects from the UK Biobank dataset, and experimental results show that ICGM-FRAX achieved a sensitivity of 0.9869, demonstrating high accuracy in predicting hip fractures.

Submitted 21 April, 2025; originally announced April 2025.

Comments: 23 pages, 4 figures

arXiv:2504.15279 [pdf, other]

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu

Abstract: Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.

Submitted 21 April, 2025; originally announced April 2025.

Comments: Code, data, and baselines are available at https://visulogic-benchmark.github.io/VisuLogic

arXiv:2504.15146 [pdf, other]

Behavioral Universe Network (BUN): A Behavioral Information-Based Framework for Complex Systems

Authors: Wei Zhou, Ailiya Borjigin, Cong He

Abstract: Modern digital ecosystems feature complex, dynamic interactions among autonomous entities across diverse domains. Traditional models often separate agents and objects, lacking a unified foundation to capture their interactive behaviors. This paper introduces the Behavioral Universe Network (BUN), a theoretical framework grounded in the Agent-Interaction-Behavior (AIB) formalism. BUN treats subjects (active agents), objects (resources), and behaviors (operations) as first-class entities, all governed by a shared Behavioral Information Base (BIB). We detail the AIB core concepts and demonstrate how BUN leverages information-driven triggers, semantic enrichment, and adaptive rules to coordinate multi-agent systems. We highlight key benefits: enhanced behavior analysis, strong adaptability, and cross-domain interoperability. We conclude by positioning BUN as a promising foundation for next-generation digital governance and intelligent applications.

Submitted 21 April, 2025; originally announced April 2025.

Comments: 17 pages, 1 figure

arXiv:2504.14267 [pdf, other]

Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

Authors: Li Yu, Xuanzhe Sun, Wei Zhou, Moncef Gabbouj

Abstract: Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames, and a saliency-oriented image-text response (SITR) mechanism is introduced to generate image-text response maps. These maps are used as conditional information to guide the model to localize the visual regions that are semantically related to the textual description. The auditory modality is used as additional conditional information to direct the model to focus on salient regions indicated by sounds. At the same time, the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving 1.03%, 2.35%, 2.71% and 0.33% on SIM, CC, NSS and AUC-J metrics, respectively.

Submitted 19 April, 2025; originally announced April 2025.

arXiv:2504.12711 [pdf, other]

NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, et al. (112 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which include day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There were a total of 361 participants in the competition, and 32 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

arXiv:2504.12328 [pdf, other]

A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

Authors: Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou

Abstract: Reward Model (RM) has demonstrated impressive potential for enhancing Large Language Models (LLM), as RM can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub (https://github.com/JLZhong23/awesome-reward-models).

Submitted 12 April, 2025; originally announced April 2025.

arXiv:2504.12276 [pdf, other]

The Tenth NTIRE 2025 Image Denoising Challenge Report

Authors: Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, Xiangyu Kong, Hyunhee Park, Xiaoxuan Yu, Suejin Han, Hakjae Jeon, Jia Li, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Jingyu Ma, Zhijuan Huang, Huiyuan Fu, Hongyuan Yu, Boqi Zhang, Jiawei Shi, Heng Zhang, Huadong Ma, Deepak Kumar Tyagi, et al. (69 additional authors not shown)

Abstract: This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.

Submitted 16 April, 2025; originally announced April 2025.

arXiv:2504.11854 [pdf, ps, other]

Less-excludable Mechanism for DAOs in Public Good Auctions

Authors: Jing Chen, Wentao Zhou

Abstract: With the rise of smart contracts, decentralized autonomous organizations (DAOs) have emerged in public good auctions, allowing "small" bidders to gather together and enlarge their influence in high-valued auctions. However, models and mechanisms in the existing research literature do not guarantee non-excludability, which is a main property of public goods. As such, some members of the winning DAO may be explicitly prevented from accessing the public good. This side effect leads to regrouping of small bidders within the DAO to have a larger say in the final outcome. In particular, we provide a polynomial-time algorithm to compute the best regrouping of bidders that maximizes the total bidding power of a DAO. We also prove that such a regrouping is less-excludable, better aligning the needs of the entire DAO and the nature of public goods. Next, notice that members of a DAO in public good auctions often have a positive externality among themselves. Thus we introduce a collective factor into the members' utility functions. We further extend the mechanism's allocation for each member to allow for partial access to the public good. Under the new model, we propose a mechanism that is incentive compatible in generic games and achieves higher social welfare as well as less-excludable allocations.

Submitted 18 April, 2025; v1 submitted 16 April, 2025; originally announced April 2025.

arXiv:2504.11733 [pdf, other]

DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment

Authors: Li Yu, Situo Wang, Wei Zhou, Moncef Gabbouj

Abstract: Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework have been proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis in the ventral stream, as well as spatial relationship analysis in the dorsal stream. However, CLIP was originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. To address this limitation, this paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components and integrates them into different stages of the NR-VQA pipeline. Specifically, a Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. Additionally, a Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. On the ventral stream side, a Basic Visual Feature Extraction Module is employed to strengthen detail analysis. Finally, a text-guided adaptive fusion strategy is proposed to enable dynamic weighting of features, facilitating more effective integration of spatial and temporal information.

Submitted 19 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10078 [pdf, other]

Unleashing Expert Opinion from Social Media for Stock Prediction

Authors: Wanyun Zhou, Saizhuo Wang, Xiang Li, Yiyan Qi, Jian Guo, Xiaowen Chu

Abstract: While the stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces a common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies. The code is available at https://github.com/wanyunzh/DualGAT.

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.09361 [pdf, other]

PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Authors: Jiahuan Long, Tingsong Jiang, Wen Yao, Shuai Jia, Weijia Zhang, Weien Zhou, Chao Ma, Xiaoqian Chen

Abstract: Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks are digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.

Submitted 12 April, 2025; originally announced April 2025.

Comments: Accepted by ECCV 2024

arXiv:2504.08857 [pdf, other]

Structural robustness of the international food supply network under external shocks and its determinants

Authors: Han-Yu Zhu, Yin-Ting Zhang, Wen-Jie Xie, Wei-Xing Zhou

Abstract: The stability of the global food supply network is critical for ensuring food security. This study constructs an aggregated international food supply network based on the trade data of four staple crops and evaluates its structural robustness through network integrity under accumulating external shocks. Network integrity is typically quantified in network science by the relative size of the largest connected component, and we propose a new robustness metric that incorporates both the broadness p and severity q of external shocks. Our findings reveal that the robustness of the network has gradually increased over the past decades, punctuated by temporary declines that can be explained by major historical events. While the aggregated network remains robust under moderate disruptions, extreme shocks targeting key suppliers such as the United States and India can trigger systemic collapse. When the shock broadness p is less than about 0.3 and the shock severity q is close to 1, the structural robustness curves S(p,q) decrease linearly with respect to the shock broadness p, suggesting that the most critical economies have relatively even influence on network integrity. Comparing the robustness curves of the four individual staple foods, we find that the soybean supply network is the least robust. Furthermore, regression and machine learning analyses show that increasing food (particularly rice and soybean) production enhances network robustness, while rising food prices significantly weaken it.

Submitted 11 April, 2025; originally announced April 2025.

arXiv:2504.05535 [pdf, other]

COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, et al. (7 additional authors not shown)

Abstract: Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench \citep{liu2024alignbenchbenchmarkingchinesealignment} show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.01612 [pdf, other]

doi 10.1088/1674-4527/adc788

The Mini-SiTian Array: first-two-year operation

Authors: Min He, Hong Wu, Liang Ge, Jian-feng Tian, Zheng Wang, Hai-yang Mu, Yu Zhang, Yang Huang, Jie Zheng, Zhou Fan, Zheng-yang Li, Hong-hui Gu, Heng-geng Han, Kai Xiao, Zhi-rui Li, Jun-jie Jin, Bei-chuan Wang, Jun Ma, Jin-hang Zou, Ying Wu, Jiu-peng Guo, Li-guo Fang, Zhi-gang Hou, Bo-wen Zhang, Yun-fei Xu, et al. (48 additional authors not shown)

Abstract: The SiTian project, designed to utilize 60 telescopes distributed across multiple sites in China, is a next-generation time-domain survey initiative. As a pathfinder for the SiTian project, the Mini-SiTian (MST) has been proposed and implemented to test the SiTian's brain and data pipeline, and to evaluate the feasibility of its technology and science cases. Mounted at the Xinglong Observatory, the MST project comprises three 30 cm telescopes and has been in operation since Nov. 2022. Each telescope of the MST possesses a large field of view, covering $2.29^{\circ} \times 1.53^{\circ}$, and is equipped with $g'$, $r'$ and $i'$ filters, respectively. Acting as the pioneer of the forthcoming SiTian project, the MST is dedicated to the discovery of variable stars, transients, and outburst events, and has already obtained some interesting scientific results. In this paper, we summarize the first-two-year operation of the MST project.

Submitted 2 April, 2025; originally announced April 2025.

Comments: 10 pages, 11 figures, Accepted for publication in a special issue of Research in Astronomy and Astrophysics on the Mini-SiTian Array

arXiv:2504.01025 [pdf]

Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network

Authors: Fubao Zhu, Yang Zhang, Gengmin Liang, Jiaofen Nan, Yanting Li, Chuang Han, Danyang Sun, Zhiguo Wang, Chen Zhao, Wenxuan Zhou, Jian He, Yi Xu, Iokfai Cheang, Xu Zhu, Yanli Zhou, Weihua Zhou

Abstract: Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved an area under the receiver operating characteristic curve (AUC) of 0.81 ± 0.06 (standard deviation) and an accuracy (ACC) of 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). It has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.

Submitted 27 March, 2025; originally announced April 2025.

Comments: 23 pages, 8 figures, 4 tables

arXiv:2504.00882 [pdf, other]

CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models

Authors: Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li

Abstract: Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., the LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule-based and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.

Submitted 1 April, 2025; originally announced April 2025.

Comments: Extension of our SIGMOD 2025 paper. Please refer to source code available at: https://github.com/weAIDB/CrackSQL

arXiv:2504.00786 [pdf, other]

FeatInsight: An Online ML Feature Management System on 4Paradigm Sage-Studio Platform

Authors: Xin Tong, Xuanhe Zhou, Bingsheng He, Guoliang Li, Zirui Tang, Wei Zhou, Fan Wu, Mian Lu, Yuqiang Chen

Abstract: Feature management is essential for many online machine learning applications and can often become the performance bottleneck (e.g., taking up to 70% of the overall latency in a sales prediction service). Improper feature configurations (e.g., introducing too many irrelevant features) can severely undermine the model's generalization capabilities. However, managing online ML features is challenging due to (1) large-scale, complex raw data (e.g., the 2018 PHM dataset contains 17 tables and dozens to hundreds of columns), (2) the need for high-performance, consistent computation of interdependent features with complex patterns, and (3) the requirement for rapid updates and deployments to accommodate real-time data changes. In this demo, we present FeatInsight, a system that supports the entire feature lifecycle, including feature design, storage, visualization, computation, verification, and lineage management. FeatInsight (with OpenMLDB as the execution engine) has been deployed in over 100 real-world scenarios on 4Paradigm's Sage Studio platform, handling up to a trillion-dimensional feature space and enabling millisecond-level feature updates. We demonstrate how FeatInsight enhances feature design efficiency (e.g., for online product recommendation) and improves feature computation performance (e.g., for online fraud detection). The code is available at https://github.com/4paradigm/FeatInsight.

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.21837 [pdf]

Impact of Oxygen on DNA Damage Distribution in 3D Genome and Its Correlation to Oxygen Enhancement Ratio under High LET Irradiation

Authors: Ankang Hu, Wanyi Zhou, Xiyu Luo, Rui Qiu, Junli Li

Abstract: The variation of the oxygen enhancement ratio (OER) across different values of Linear Energy Transfer (LET) currently lacks a comprehensive mechanistic interpretation and a mechanistic model. Our earlier research revealed a significant correlation between the distribution of double-strand breaks (DSBs) within the 3D genome and radiation-induced cell death, which offers valuable insights into the oxygen effect. In this study, we formulate a model where the reaction of oxygen is represented as the probability of inducing DNA strand breaks. Then it is integrated into a track-structure Monte Carlo simulation to investigate the impact of oxygen on the spatial distribution of DSBs within the 3D genome. Results show that the incidence ratios of clustered DSBs in a single topologically associating domain (TAD) (case 2) and DSBs in frequently-interacting TADs (case 3) under aerobic and hypoxic conditions closely align with the trend of the OER of cell survival across various LET values. By utilizing the parameters derived from our previous study, we calculate the OER values related to cell survival. Our OER curves exhibit good correspondence with experimental data. This study provides a potentially mechanistic explanation for the changes in OER across different LET levels. High-LET irradiation leads to dense ionization events, resulting in an overabundance of lesions that readily induce case 2 and case 3. The probabilities of cell death associated with case 2 and case 3 are substantially higher than other damage patterns. This may contribute to the main mechanism governing the variation of OER for high LET. Our study further underscores the importance of the DSB distribution within the 3D genome in the context of radiation-induced cell death. This study also provides valuable reference points for establishing a mechanistic model of OER.

Submitted 27 March, 2025; originally announced March 2025.

Comments: 14 pages, 6 figures

arXiv:2503.21446 [pdf]

No-drift phase-change memory alloy for neuromorphic computing

Authors: Xiaozhe Wang, Ruobing Wang, Suyang Sun, Ding Xu, Chao Nie, Zhou Zhou, Chenyu Wen, Junying Zhang, Ruixuan Chu, Xueyang Shen, Wen Zhou, Zhitang Song, Jiang-Jing Wang, En Ma, Wei Zhang

Abstract: Spontaneous structural relaxation is intrinsic to glassy materials due to their metastable nature. For phase-change materials (PCMs), the resultant temporal change in electrical resistance seriously hampers in-memory computing (IMC) applications. Here, we report an ab-initio-calculation-informed design of amorphous PCM composed of robust "molecule-like" motifs with minimal Peierls distortion, depriving the amorphous alloy of structural ingredients that would gradually evolve upon aging to entail resistance drift. We demonstrate amorphous CrTe3 thin films that display practically no resistance drift at any working temperature from -200 to 165 °C. We achieve multilevel programming of CrTe3 through both step-wise crystallization and step-wise amorphization using a hybrid opto-electronic device at various temperatures. Moreover, the application potential of CrTe3 in neuromorphic computing is demonstrated by its incorporation in a vehicle with automatic path-tracking function. Our work opens a new avenue to achieving IMC-requisite properties via judicious design of the composition and atomic-level structure of disordered PCM alloys.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.21284 [pdf, other]

Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression

Authors: Hanyue Tu, Siqi Wu, Li Li, Wengang Zhou, Houqiang Li

Abstract: Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on an invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.

Submitted 27 March, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

Comments: Accepted for publication in IEEE Transactions on Multimedia 2025

arXiv:2503.20628 [pdf, ps, other]

Carleman estimate for full-discrete approximations of the complex Ginzburg-Landau equation with dynamic boundary conditions and applications to controllability

Authors: Xu Zhu, Wenwen Zhou, Bin Wu

Abstract: In this paper, we investigate a Carleman estimate and a controllability result for the fully-discrete approximations of a one-dimensional Ginzburg-Landau equation with dynamic boundary conditions. We first establish a new discrete Carleman estimate for the corresponding adjoint system. Based on this Carleman estimate, we obtain a relaxed observability inequality for the adjoint system, and then a controllability result for the fully-discrete Ginzburg-Landau equation with dynamic boundary conditions.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.20314 [pdf, other]

Wan: Open and Advanced Large-Scale Video Generative Models

Authors: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, et al. (37 additional authors not shown)

Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

Submitted 18 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

Comments: 60 pages, 33 figures

arXiv:2503.18843 [pdf, other]

Experimental Evidence of Vortex $γ$ Photons in All-Optical Inverse Compton Scattering

Authors: Mingxuan Wei, Siyu Chen, Yu Wang, Xichen Hu, Mingyang Zhu, Hao Hu, Pei-Lun He, Weijun Zhou, Jiao Jia, Li Lu, Boyuan Li, Feng Liu, Min Chen, Liming Chen, Jian-Xing Li, Wenchao Yan, Jie Zhang

Abstract: Vortex $γ$ photons carrying orbital angular momenta (OAM) hold great potential for various applications. However, their generation remains a great challenge. Here, we successfully generate sub-MeV vortex $γ$ photons via all-optical inverse Compton scattering of relativistic electrons colliding with a sub-relativistic Laguerre-Gaussian laser. In principle, directly measuring the OAM of $γ$ photons is challenging due to their incoherence and extremely short wavelength. Therein, we put forward a novel method to determine the OAM properties by revealing the quantum opening angle of vortex $γ$ photons, since vortex particles exhibit not only a spiral phase but also transverse momentum according to the quantum electrodynamics theory. Thus, $γ$ photons carrying OAM manifest a much larger angular distribution than those without OAM, which has been clearly observed in our experiments. This angular expansion is considered as an overall effect lying beyond classical theory. Our method provides the first experimental evidence for detecting vortex $γ$ photons and opens a new perspective for investigating OAM-induced quantum phenomena in broad fields.

Submitted 24 March, 2025; originally announced March 2025.

Comments: 8 pages, 4 figures

arXiv:2503.18672 [pdf, other]

Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning

Authors: Juncen Guo, Yang Liu, Xiaoguang Zhu, Lianlong Sun, Liangyu Teng, Jingyi Wu, Di Li, Wei Zhou, Liang Song

Abstract: Class-Incremental Learning (CIL) enables models to continuously learn new class knowledge while retaining previous classes, facilitating adaptation and evolution in dynamic, real-world environments. Traditional CIL methods primarily rely on visual features, which limits their effectiveness in complex, multimodal scenarios. In contrast, vision-language models (VLMs) show promising potential for enhancing CIL by leveraging pre-trained knowledge and integrating multi-modal semantic cues such as text and vision. However, existing approaches struggle to mitigate catastrophic forgetting while preserving the generalization strengths of VLMs across diverse modalities. To address these challenges, we propose a Feature Calibration Enhanced Parameter Synthesis (FCPS) framework. Specifically, FCPS introduces a dynamic parameter adjustment mechanism that iteratively calibrates the contribution of original visual features to the final class decision, thus preserving the model's intrinsic generalization capability across modalities. Simultaneously, parameter integration enables effective knowledge transfer, maintaining a balance between acquiring new class representations and preserving old knowledge. Experimental results on popular benchmarks (e.g., CIFAR100 and ImageNet100) validate the superiority of the proposed method.

Submitted 17 April, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.18034 [pdf, ps, other]

Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models

Authors: Qiao Liang, Yanjiang Liu, Weixiang Zhou, Ben He, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, Yingfei Sun

Abstract: Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of prior knowledge of the vision encoder on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.

Submitted 30 May, 2025; v1 submitted 23 March, 2025; originally announced March 2025.

arXiv:2503.18004 [pdf, other]

Dynamic structural resilience of international staple food trade networks

Authors: Si-Yao Wei, Wei-Xing Zhou

Abstract: It is important to maintain the resilient international food trade network for food security. We have constructed the international trade networks of maize, rice, soybean, and wheat based on bilateral flows data between economies. Drawing on information theory, we have measured their dynamic resilience based on efficiency and redundancy during 1986 to 2022. We have also investigated the impact of economies and relationships on their resilience. Overall, we argue that rice and soybean trade networks deserve more attention while resilience in maize and wheat shows a steady upward trend. Meanwhile, our findings emphasize the importance of diversity of trade flows and partners for enhancing resilience. Currently, for example, excessively high monopolization of soybean trade may not be beneficial for its resilience. Also, we have found that major exporters and relationships between geographically bordering economies have greater impact on the resilience. Moreover, we have confirmed the existence of different network structures with the optimal resilience as relationships are removed cumulatively, which may be an informative guide for the international food trade.

Submitted 23 March, 2025; originally announced March 2025.

arXiv:2503.17407 [pdf, other]

A Comprehensive Survey on Long Context Language Modeling

Authors: Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li , et al. (12 additional authors not shown)

Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers.
An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling (LCLM-Horizon).

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.16937 [pdf, other]

External tides: an important driver of velocity dispersion in molecular clouds

Authors: J. W. Zhou

Abstract: Using the 3D density distribution derived from the 3D dust map of the solar neighborhood, the gravitational potential is obtained by solving the Poisson equation, from which the tidal tensor is computed. In the optimal decomposition, the external tidal tensor follows the same formalism as that of a point mass. The average tidal strength of the clouds, derived from both tidal tensor analysis and pixel-by-pixel computation, shows consistent results. The equivalent velocity dispersion of the clouds, estimated from the average tidal strength, is comparable in magnitude to the velocity dispersion measured from CO (1-0) line emission. This suggests that tidal effects from surrounding material may play a significant role in driving velocity dispersion within the clouds. Future studies should carefully consider these tidal effects in star-forming regions.

Submitted 21 March, 2025; originally announced March 2025.

Comments: Accepted for publication in A&A Letter

Showing 51–100 of 1,927 results for author: Zhou, W