
Showing 1–50 of 7,746 results for author: Wang, X

Searching in archive cs. Search in all archives.
  1. arXiv:2505.10562  [pdf, ps, other]

    cs.CV

    End-to-End Vision Tokenizer Tuning

    Authors: Wenxuan Wang, Fan Zhang, Yufeng Cui, Haiwen Diao, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang

    Abstract: Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.10522  [pdf]

    cs.RO cs.AI cs.LG

    Knowledge capture, adaptation and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation

    Authors: Xinrui Wang, Yan Jin

    Abstract: Reinforcement learning (RL) has demonstrated remarkable potential in robotic manipulation but faces challenges in sample inefficiency and lack of interpretability, limiting its applicability in real world scenarios. Enabling the agent to gain a deeper understanding and adapt more efficiently to diverse working scenarios is crucial, and strategic knowledge utilization is a key factor in this proces…

    Submitted 15 May, 2025; originally announced May 2025.

  3. arXiv:2505.10191  [pdf]

    physics.ao-ph cs.AI cs.LG nlin.CD

    LanTu: Dynamics-Enhanced Deep Learning for Eddy-Resolving Ocean Forecasting

    Authors: Qingyu Zheng, Qi Shao, Guijun Han, Wei Li, Hong Li, Xuan Wang

    Abstract: Mesoscale eddies dominate the spatiotemporal multiscale variability of the ocean, and their impact on the energy cascade of the global ocean cannot be ignored. Eddy-resolving ocean forecasting is providing more reliable protection for fisheries and navigational safety, but also presents significant scientific challenges and high computational costs for traditional numerical models. Artificial inte…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 22 pages, 6 figures

  4. arXiv:2505.10117  [pdf, other]

    cs.LG cs.CL

    Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

    Authors: JieHao Wu, Ziwei Wang, Junjie Sheng, Wenhao Li, Xiangfei Wang, Jun Luo

    Abstract: In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability…

    Submitted 15 May, 2025; originally announced May 2025.

  5. arXiv:2505.09388  [pdf, other]

    cs.CL

    Qwen3 Technical Report

    Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, et al. (35 additional authors not shown)

    Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…

    Submitted 14 May, 2025; originally announced May 2025.

  6. arXiv:2505.09385  [pdf, ps, other]

    cs.CV cs.AI

    FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization

    Authors: Xiaoyang Yu, Xiaoming Wu, Xin Wang, Dongrun Li, Ming Yang, Peng Cheng

    Abstract: Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representation. To overcome this…

    Submitted 14 May, 2025; originally announced May 2025.

  7. arXiv:2505.09259  [pdf, ps, other]

    cs.NI

    Interplay Between AI and Space-Air-Ground Integrated Network: The Road Ahead

    Authors: Chenyu Wu, Xi Wang, Yi Hu, Shuai Han, Dusit Niyato

    Abstract: Space-air-ground integrated network (SAGIN) is envisioned as a key network architecture for achieving ubiquitous coverage in the next-generation communication system. Concurrently, artificial intelligence (AI) plays a pivotal role in managing the complex control of SAGIN, thereby enhancing its automation and flexibility. Despite this, there remains a significant research gap concerning the interac…

    Submitted 14 May, 2025; originally announced May 2025.

  8. arXiv:2505.08601  [pdf, other]

    cs.CV cond-mat.mtrl-sci

    Rejoining fragmented ancient bamboo slips with physics-driven deep learning

    Authors: Jinchi Zhu, Zhou Zhao, Hailong Lei, Xiaoguang Wang, Jialiang Lu, Jing Li, Qianqian Tang, Jiachen Shen, Gui-Song Xia, Bo Du, Yongchao Xu

    Abstract: Bamboo slips are a crucial medium for recording ancient civilizations in East Asia, and offer invaluable archaeological insights for reconstructing the Silk Road, studying material culture exchanges, and global history. However, many excavated bamboo slips have been fragmented into thousands of irregular pieces, making their rejoining a vital yet challenging step for understanding their content.…

    Submitted 13 May, 2025; originally announced May 2025.

  9. arXiv:2505.08414  [pdf]

    eess.IV cs.CV

    An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

    Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang, et al. (6 additional authors not shown)

    Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…

    Submitted 13 May, 2025; originally announced May 2025.

  10. arXiv:2505.08367  [pdf, ps, other]

    cs.RO

    MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos

    Authors: Xianghui Wang, Xinming Zhang, Yanjun Chen, Xiaoyu Shen, Wei Zhang

    Abstract: Vision-language models (VLMs) have demonstrated excellent high-level planning capabilities, enabling locomotion skill learning from video demonstrations without the need for meticulous human-level reward design. However, the improper frame sampling method and low training efficiency of current methods remain a critical bottleneck, resulting in substantial computational overhead and time costs. To…

    Submitted 13 May, 2025; originally announced May 2025.

  11. arXiv:2505.08361  [pdf, ps, other]

    cs.AI

    Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning

    Authors: Xinyue Wang, Biwei Huang

    Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning -- where known components are reconfigured to handle new situations -- we introduce World Modeling with Compositional Causal Components (WM3C). This novel framework enhances RL generalization…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Published as a conference paper at ICLR 2025

  12. arXiv:2505.08266  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Open the Eyes of MPNN: Vision Enhances MPNN in Link Prediction

    Authors: Yanbin Wei, Xuehao Wang, Zhan Zhuang, Yang Chen, Shuhao Chen, Yulong Zhang, Yu Zhang, James Kwok

    Abstract: Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), al…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: ICML 2025

  13. arXiv:2505.08235  [pdf, other]

    cs.CV

    EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

    Authors: Hanle Zheng, Xujie Han, Zegang Peng, Shangbin Zhang, Guangxun Du, Zhuo Zou, Xilin Wang, Jibin Wu, Hao Guo, Lei Deng

    Abstract: Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafte…

    Submitted 13 May, 2025; originally announced May 2025.

  14. arXiv:2505.08215  [pdf, ps, other]

    cs.AI cs.SD eess.AS

    Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People

    Authors: Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang

    Abstract: Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder…

    Submitted 13 May, 2025; originally announced May 2025.

  15. arXiv:2505.08037  [pdf, other]

    cs.CL cs.LG

    TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

    Authors: Yutong Liu, Feng Xiao, Ziyue Zhang, Yongbin Yu, Cheng Huang, Fan Gao, Xiangxiang Wang, Ma-bao Ban, Manping Fan, Thupten Tsering, Cheng Huang, Gadeng Luosang, Renzeng Duojie, Nyima Tashi

    Abstract: Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabele…

    Submitted 14 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

    Comments: 14 pages, 7 figures

  16. arXiv:2505.07839  [pdf]

    eess.IV cs.AI

    Sub-diffraction terahertz backpropagation compressive imaging

    Authors: Yongsheng Zhu, Shaojing Liu, Ximiao Wang, Runli Li, Haili Yang, Jiali Wang, Hongjia Zhu, Yanlin Ke, Ningsheng Xu, Huanjun Chen, Shaozhi Deng

    Abstract: Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backp…

    Submitted 5 May, 2025; originally announced May 2025.

  17. arXiv:2505.07819  [pdf, other]

    cs.RO cs.AI cs.CV

    H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

    Authors: Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu

    Abstract: Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}…

    Submitted 12 May, 2025; originally announced May 2025.

  18. arXiv:2505.07796  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning Dynamics in Continual Pre-Training for Large Language Models

    Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng

    Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We hav…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML2025 (spotlight)

  19. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: Xiaomi LLM-Core Team, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 12 May, 2025; originally announced May 2025.

  20. arXiv:2505.07263  [pdf, other]

    cs.CV

    Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

    Authors: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM rea…

    Submitted 12 May, 2025; originally announced May 2025.

  21. arXiv:2505.07245  [pdf]

    cs.LG cs.AI

    REMEDI: Relative Feature Enhanced Meta-Learning with Distillation for Imbalanced Prediction

    Authors: Fei Liu, Huanhuan Ren, Yu Guan, Xiuxu Wang, Wang Lv, Zhiqiang Hu, Yaxi Chen

    Abstract: Predicting future vehicle purchases among existing owners presents a critical challenge due to extreme class imbalance (<0.5% positive rate) and complex behavioral patterns. We propose REMEDI (Relative feature Enhanced Meta-learning with Distillation for Imbalanced prediction), a novel multi-stage framework addressing these challenges. REMEDI first trains diverse base models to capture complementa…

    Submitted 12 May, 2025; originally announced May 2025.

  22. arXiv:2505.07198  [pdf, other]

    cs.CV

    Ranking-aware Continual Learning for LiDAR Place Recognition

    Authors: Xufei Wang, Gengxuan Tian, Junqiao Zhao, Siyue Tao, Qiwen Gu, Qiankun Yu, Tiantian Feng

    Abstract: Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new en…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 8 pages, 4 figures

  23. arXiv:2505.06987  [pdf, other]

    cs.CL cs.AI

    Convert Language Model into a Value-based Strategic Planner

    Authors: Xiaoyu Wang, Yue Zhao, Qingqing Gu, Zhonglin Jiang, Xiaokai Chen, Yong Chen, Luo Ji

    Abstract: Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage t…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 11 pages, 5 figures, Accepted by ACL 2025 Industry Track

  24. arXiv:2505.06804  [pdf, other]

    cs.LG stat.ML

    Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology

    Authors: Xiaohan Wang, Matthew Berger

    Abstract: For domains that involve numerical simulation, it can be computationally expensive to run an ensemble of simulations spanning a parameter space of interest to a user. To this end, an attractive surrogate for simulation is the generative modeling of fields produced by an ensemble, allowing one to synthesize fields in a computationally cheap, yet accurate, manner. However, for the purposes of visual…

    Submitted 10 May, 2025; originally announced May 2025.

  25. arXiv:2505.06690  [pdf]

    cs.LG

    E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention

    Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang, Weinan Huang

    Abstract: Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of differ…

    Submitted 10 May, 2025; originally announced May 2025.

  26. arXiv:2505.06688  [pdf]

    cs.LG

    A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network

    Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang

    Abstract: Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models has demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the…

    Submitted 10 May, 2025; originally announced May 2025.

  27. arXiv:2505.06685  [pdf, ps, other]

    cs.MM cs.CV

    Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding

    Authors: Dawei Huang, Qing Li, Chuan Yan, Zebang Cheng, Yurong Huang, Xiang Li, Bin Li, Xiaohui Wang, Zheng Lian, Xiaojiang Peng

    Abstract: Emotion understanding in videos aims to accurately recognize and interpret individuals' emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks…

    Submitted 10 May, 2025; originally announced May 2025.

  28. arXiv:2505.06651  [pdf, other]

    cs.LG cs.AI

    Dyn-D$^2$P: Dynamic Differentially Private Decentralized Learning with Provable Utility Guarantee

    Authors: Zehan Zhu, Yan Huang, Xin Wang, Shouling Ji, Jinming Xu

    Abstract: Most existing decentralized learning methods with differential privacy (DP) guarantee rely on constant gradient clipping bounds and fixed-level DP Gaussian noises for each node throughout the training process, leading to a significant accuracy degradation compared to non-private counterparts. In this paper, we propose a new Dynamic Differentially Private Decentralized learning approach (termed Dyn…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

  29. arXiv:2505.06553  [pdf, ps, other]

    cs.SE

    ActRef: Enhancing the Understanding of Python Code Refactoring with Action-Based Analysis

    Authors: Siqi Wang, Xing Hu, Xin Xia, Xinyu Wang

    Abstract: Refactoring, the process of improving the code structure of a software system without altering its behavior, is crucial for managing code evolution in software development. Identifying refactoring actions in source code is essential for understanding software evolution and guiding developers in maintaining and improving the code quality. This study presents an action-based Refactoring Analysis Fra…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: 21 pages, 5 figures

  30. arXiv:2505.06527  [pdf, other]

    cs.CV cs.AI

    Improving Generalization of Medical Image Registration Foundation Model

    Authors: Jing Hu, Kaiwei Yu, Hongjiang Xian, Shu Hu, Xin Wang

    Abstract: Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibilit…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: IJCNN

  31. arXiv:2505.06501  [pdf, ps, other]

    cs.DB

    Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data

    Authors: Yanjun Lin, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang

    Abstract: Filtered approximate nearest neighbor search (FANNS), an extension of approximate nearest neighbor search (ANNS) that incorporates scalar filters, has been widely applied to constrained retrieval of vector data. Despite its growing importance, no dedicated survey on FANNS over the vector-scalar hybrid data currently exists, and the field has several problems, including inconsistent definitions of…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: This manuscript was submitted to The VLDB Journal for review

  32. arXiv:2505.06330  [pdf, ps, other]

    cs.LG cs.AI eess.SP

    Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring

    Authors: Junyu Xue, Xudong Wang, Xiaoling He, Shicheng Liu, Yi Wang, Guoming Tang

    Abstract: Non-intrusive Load Monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage, enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of interpretability. In this paper, we introduce the first prompt-based NILM framework that l…

    Submitted 9 May, 2025; originally announced May 2025.

  33. arXiv:2505.05956  [pdf, other]

    eess.SP cs.LG cs.NI

    Multi-User Beamforming with Deep Reinforcement Learning in Sensing-Aided Communication

    Authors: Xiyu Wang, Gilberto Berardinelli, Hei Victor Cheng, Petar Popovski, Ramoni Adeogun

    Abstract: Mobile users are prone to experience beam failure due to beam drifting in millimeter wave (mmWave) communications. Sensing can help alleviate beam drifting with timely beam changes and low overhead since it does not need user feedback. This work studies the problem of optimizing sensing-aided communication by dynamically managing beams allocated to mobile users. A multi-beam scheme is introduced,…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted for Presentation at IEEE EuCNC & 6G Summit 2025

  34. arXiv:2505.05804  [pdf, other]

    cs.CV

    Describe Anything in Medical Images

    Authors: Xi Xiao, Yunbei Zhang, Thanh-Huy Nguyen, Ba-Thinh Lam, Janet Wang, Jihun Hamm, Tianyang Wang, Xingjian Li, Xiao Wang, Hao Xu, Tianming Liu, Min Xu

    Abstract: Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global unde…

    Submitted 9 May, 2025; originally announced May 2025.

  35. arXiv:2505.05795  [pdf, other]

    eess.SY cs.RO

    Formation Maneuver Control Based on the Augmented Laplacian Method

    Authors: Xinzhe Zhou, Xuyang Wang, Xiaoming Duan, Yuzhu Bai, Jianping He

    Abstract: This paper proposes a novel formation maneuver control method for both 2-D and 3-D space, which enables the formation to translate, scale, and rotate with arbitrary orientation. The core innovation is the novel design of weights in the proposed augmented Laplacian matrix. Instead of using scalars, we represent weights as matrices, which are designed based on a specified rotation axis and allow the…

    Submitted 9 May, 2025; originally announced May 2025.

  36. arXiv:2505.05768  [pdf, other]

    eess.IV cs.AI cs.CV

    Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition

    Authors: Weiyi Zhang, Peranut Chotcomwongse, Yinwen Li, Pusheng Xu, Ruijie Yao, Lianhao Zhou, Yuxuan Zhou, Hui Feng, Qiping Zhou, Xinyue Wang, Shoujin Huang, Zihao Jin, Florence H. T. Chung, Shujun Wang, Yalin Zheng, Mingguang He, Danli Shi, Paisan Ruamviboonsuk

    Abstract: Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: 42 pages, 5 tables, 12 figures, challenge report

  37. arXiv:2505.05744  [pdf, other]

    cs.LG cs.CL

    Harnessing LLMs Explanations to Boost Surrogate Models in Tabular Data Classification

    Authors: Ruxue Shi, Hengrui Gu, Xu Shen, Xin Wang

    Abstract: Large Language Models (LLMs) have shown remarkable ability in solving complex tasks, making them a promising tool for enhancing tabular learning. However, existing LLM-based methods suffer from high resource requirements, suboptimal demonstration selection, and limited interpretability, which largely hinder their prediction performance and application in the real world. To overcome these problems,…

    Submitted 8 May, 2025; originally announced May 2025.

  38. arXiv:2505.05589  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation

    Authors: Jingzhong Lin, Yuanyuan Qi, Xinru Li, Wenxuan Huang, Xiangfeng Xu, Bangyan Li, Xuejiao Wang, Gaoqi He

    Abstract: Reactive dance generation (RDG) produces follower movements conditioned on guiding dancer and music while ensuring spatial coordination and temporal coherence. However, existing methods overemphasize global constraints and optimization, overlooking local information, such as fine-grained spatial interactions and localized temporal context. Therefore, we present ReactDance, a novel diffusion-based…

    Submitted 8 May, 2025; originally announced May 2025.

  39. arXiv:2505.05568  [pdf, ps, other]

    cs.LG cs.AI cs.DB

    Griffin: Towards a Graph-Centric Relational Database Foundation Model

    Authors: Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang

    Abstract: We introduce Griffin, the first attempt at a foundation model designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the data encoder and task decoder to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both sin…

    Submitted 8 May, 2025; originally announced May 2025.

  40. arXiv:2505.05472  [pdf, other]

    cs.CV

    Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    Authors: Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang

    Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements…

    Submitted 11 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: Mogao Technical Report

  41. arXiv:2505.05470  [pdf, other]

    cs.CV cs.AI

    Flow-GRPO: Training Flow Matching Models via Online RL

    Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang

    Abstract: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistica…

    Submitted 11 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: Code: https://github.com/yifan123/flow_grpo

  42. arXiv:2505.05237  [pdf, other]

    cs.LG

    Latte: Transfering LLMs` Latent-level Knowledge for Few-shot Tabular Learning

    Authors: Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang

    Abstract: Few-shot tabular learning, in which machine learning models are trained with a limited amount of labeled data, provides a cost-effective approach to addressing real-world challenges. The advent of Large Language Models (LLMs) has sparked interest in leveraging their pre-trained knowledge for few-shot tabular learning. Despite promising results, existing approaches either rely on test-time knowledg…

    Submitted 8 May, 2025; originally announced May 2025.

  43. arXiv:2505.04947  [pdf, other]

    cs.DC

    DFPL: Decentralized Federated Prototype Learning Across Heterogeneous Data Distributions

    Authors: Hongliang Zhang, Fenghua Xu, Zhongyuan Yu, Chunqiang Hu, Shanchen Pang, Xiaofen Wang, Jiguo Yu

    Abstract: Federated learning is a distributed machine learning paradigm that enables the collaborative training of multiple clients through centralized model aggregation. However, standard federated learning relies on a centralized server, making it vulnerable to server failures. While existing solutions utilize blockchain technology to implement Decentralized Federated Learning (DFL), the statistical heter…

    Submitted 8 May, 2025; originally announced May 2025.

  44. arXiv:2505.04921  [pdf, other]

    cs.CV cs.CL

    Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

    Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang

    Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integra…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 75 pages, 10 figures; Project: https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models

  45. arXiv:2505.04802  [pdf, other]

    cs.LG astro-ph.EP cs.AI cs.DC physics.ao-ph

    ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling

    Authors: Xiao Wang, Jong-Youl Choi, Takuya Kurihaya, Isaac Lyngaas, Hong-Jun Yoon, Ming Fan, Nasik Muhammad Nafi, Aristeidis Tsaris, Ashwin M. Aji, Maliha Hossain, Mohamed Wahib, Dali Wang, Peter Thornton, Prasanna Balaprakash, Moetasim Ashfaq, Dan Lu

    Abstract: Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-reso…

    Submitted 7 May, 2025; originally announced May 2025.

  46. arXiv:2505.04797  [pdf, ps, other]

    cs.SE

    Quantum Artificial Intelligence for Software Engineering: the Road Ahead

    Authors: Xinyi Wang, Shaukat Ali, Paolo Arcaini

    Abstract: Artificial Intelligence (AI) has been applied to various areas of software engineering, including requirements engineering, coding, testing, and debugging. This has led to the emergence of AI for Software Engineering as a distinct research area within software engineering. With the development of quantum computing, the field of Quantum AI (QAI) is arising, enhancing the performance of classical AI…

    Submitted 7 May, 2025; originally announced May 2025.

  47. arXiv:2505.04424  [pdf, ps, other]

    cs.CV

    RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation

    Authors: Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang

    Abstract: Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose a novel reinforcement learning-based framework for arbitrary style transfer RLMiniStyler. This framework leverages a unified reinforcement…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: IJCAI2025

  48. arXiv:2505.04396  [pdf, other]

    cs.LG physics.ao-ph

    Supporting renewable energy planning and operation with data-driven high-resolution ensemble weather forecast

    Authors: Jingnan Wang, Jie Chao, Shangshang Yang, Congyi Nai, Kaijun Ren, Kefeng Deng, Xi Chen, Yaxin Liu, Hanqiuzi Wen, Ziniu Xiao, Lifeng Zhang, Xiaodong Wang, Jiping Guan, Baoxiang Pan

    Abstract: The planning and operation of renewable energy, especially wind power, depend crucially on accurate, timely, and high-resolution weather information. Coarse-grid global numerical weather forecasts are typically downscaled to meet these requirements, introducing challenges of scale inconsistency, process representation error, computation cost, and entanglement of distinct uncertainty sources from c…

    Submitted 7 May, 2025; originally announced May 2025.

  49. arXiv:2505.04376  [pdf, other]

    eess.IV cs.CV

    Label-efficient Single Photon Images Classification via Active Learning

    Authors: Zili Zhang, Ziting Wen, Yiheng Qiang, Hongzhou Dong, Wenle Dong, Xinyang Li, Xiaofan Wang, Xiaoqiang Ren

    Abstract: Single-photon LiDAR achieves high-precision 3D imaging in extreme environments through quantum-level photon detection technology. Current research primarily focuses on reconstructing 3D scenes from sparse photon events, whereas the semantic interpretation of single-photon images remains underexplored, due to high annotation costs and inefficient labeling strategies. This paper presents the first a…

    Submitted 7 May, 2025; originally announced May 2025.

  50. arXiv:2505.04354  [pdf, other]

    math.OC cs.AI

    Optimization Problem Solving Can Transition to Evolutionary Agentic Workflows

    Authors: Wenhao Li, Bo Jin, Mingyi Hong, Changhong Lu, Xiangfeng Wang

    Abstract: This position paper argues that optimization problem solving can transition from expert-dependent to evolutionary agentic workflows. Traditional optimization practices rely on human specialists for problem formulation, algorithm selection, and hyperparameter tuning, creating bottlenecks that impede industrial adoption of cutting-edge methods. We contend that an evolutionary agentic workflow, power…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 27 pages, 5 figures