Showing 1–50 of 890 results for author: Wei, Y

Searching in archive cs.
  1. arXiv:2505.10152  [pdf, ps, other]

    cs.CV

    Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

    Authors: Yikang Wei

    Abstract: Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deploying on the unseen target domain. The style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within isolated source domain or interpolate the style information across existi…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: IJCAI 2025

  2. arXiv:2505.09343  [pdf, ps, other]

    cs.DC cs.AI cs.AR

    Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

    Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei

    Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version will appear as part of the Industry Track in Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)

  3. arXiv:2505.08588  [pdf, ps, other]

    cs.CL cs.AI cs.CY cs.HC

    Small but Significant: On the Promise of Small Language Models for Accessible AIED

    Authors: Yumou Wei, Paulo Carvalho, John Stamper

    Abstract: GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting o…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: This vision paper advocates using small language models (e.g., Phi-2) in AI for education (AIED)

  4. arXiv:2505.08266  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Open the Eyes of MPNN: Vision Enhances MPNN in Link Prediction

    Authors: Yanbin Wei, Xuehao Wang, Zhan Zhuang, Yang Chen, Shuhao Chen, Yulong Zhang, Yu Zhang, James Kwok

    Abstract: Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), al…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: ICML 2025

  5. arXiv:2505.08238  [pdf, other]

    cs.RO

    Motion Control of High-Dimensional Musculoskeletal Systems with Hierarchical Model-Based Planning

    Authors: Yunyue Wei, Shanning Zhuang, Vincent Zhuang, Yanan Sui

    Abstract: Controlling high-dimensional nonlinear systems, such as those found in biological and robotic applications, is challenging due to large state and action spaces. While deep reinforcement learning has achieved a number of successes in these domains, it is computationally intensive and time consuming, and therefore not suitable for solving large collections of tasks that require significant manual tu…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted by ICLR 2025

  6. arXiv:2505.08231  [pdf, other]

    cs.CV

    HMPNet: A Feature Aggregation Architecture for Maritime Object Detection from a Shipborne Perspective

    Authors: Yu Zhang, Fengyuan Liu, Juan Lyu, Yi Wei, Changdong Yu

    Abstract: In the realm of intelligent maritime navigation, object detection from a shipborne perspective is paramount. Despite the criticality, the paucity of maritime-specific data impedes the deployment of sophisticated visual perception techniques, akin to those utilized in autonomous vehicular systems, within the maritime context. To bridge this gap, we introduce Navigation12, a novel dataset annotated…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted to ICME 2025

  7. arXiv:2505.07184  [pdf, other]

    cs.CL

    Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs

    Authors: Yifan Wei, Xiaoyan Yu, Tengfei Pan, Angsheng Li, Li Du

    Abstract: Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant sample…

    Submitted 11 May, 2025; originally announced May 2025.

  8. arXiv:2505.06469  [pdf, ps, other]

    cs.AI cs.HC

    KCluster: An LLM-based Clustering Approach to Knowledge Component Discovery

    Authors: Yumou Wei, Paulo Carvalho, John Stamper

    Abstract: Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic deficiency of expert-designed KC models, as course…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted to the Educational Data Mining (EDM) 2025 conference

  9. arXiv:2505.06378  [pdf, other]

    cs.GT cs.AI

    Bi-LSTM based Multi-Agent DRL with Computation-aware Pruning for Agent Twins Migration in Vehicular Embodied AI Networks

    Authors: Yuxiang Wei, Zhuoqi Zeng, Yue Zhong, Jiawen Kang, Ryan Wen Liu, M. Shamim Hossain

    Abstract: With the advancement of large language models and embodied Artificial Intelligence (AI) in the intelligent transportation scenarios, the combination of them in intelligent transportation spawns the Vehicular Embodied AI Network (VEANs). In VEANs, Autonomous Vehicles (AVs) are typical agents whose local advanced AI applications are defined as vehicular embodied AI agents, enabling capabilities such…

    Submitted 9 May, 2025; originally announced May 2025.

  10. arXiv:2505.05422  [pdf, other]

    cs.CV cs.AI cs.CL

    TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

    Authors: Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan

    Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLI…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Technical Report

  11. arXiv:2505.03710  [pdf, other]

    stat.ML cs.AI cs.LG

    Actor-Critics Can Achieve Optimal Sample Efficiency

    Authors: Kevin Tan, Wei Fan, Yuting Wei

    Abstract: Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $ε$-optimal policy with a sample complexity of $O(1/ε^2)$ trajectories with general function approximation when strategic explorati…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML 2025

  12. arXiv:2505.03469  [pdf, other]

    cs.CL

    Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

    Authors: Bin Yu, Hang Yuan, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen

    Abstract: Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 11 pages, 2 figures

  13. arXiv:2505.00526  [pdf, other]

    econ.EM cs.LG stat.CO

    Pre-Training Estimators for Structural Models: Application to Consumer Search

    Authors: Yanhao 'Max' Wei, Zhenling Jiang

    Abstract: We explore pretraining estimators for structural econometric models. The estimator is "pretrained" in the sense that the bulk of the computational cost and researcher effort occur during the construction of the estimator. Subsequent applications of the estimator to different datasets require little computational cost or researcher effort. The estimation leverages a neural net to recognize the stru…

    Submitted 1 May, 2025; originally announced May 2025.

    ACM Class: G.3; J.4; I.2

  14. arXiv:2505.00055  [pdf, other]

    cs.MA cs.GT

    TinyMA-IEI-PPO: Exploration Incentive-Driven Multi-Agent DRL with Self-Adaptive Pruning for Vehicular Embodied AI Agent Twins Migration

    Authors: Zhuoqi Zeng, Yuxiang Wei, Jiawen Kang

    Abstract: Embodied Artificial Intelligence (EAI) addresses autonomous driving challenges in Vehicular Embodied AI Networks (VEANETs) through multi-modal perception, adaptive decision-making, and hardware-software co-scheduling. However, the computational demands of virtual services and the inherent mobility of autonomous vehicles (AVs) necessitate real-time migration of Vehicular Embodied Agent AI Twins (VE…

    Submitted 30 April, 2025; originally announced May 2025.

  15. arXiv:2504.20972  [pdf, other]

    cs.CL

    SetKE: Knowledge Editing for Knowledge Elements Overlap

    Authors: Yifan Wei, Xiaoyan Yu, Ran Song, Hao Peng, Angsheng Li

    Abstract: Large Language Models (LLMs) excel in tasks such as retrieval and question answering but require updates to incorporate new knowledge and reduce inaccuracies and hallucinations. Traditional updating methods, like fine-tuning and incremental learning, face challenges such as overfitting and high computational costs. Knowledge Editing (KE) provides a promising alternative but often overlooks the Kno…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: The CR version will be updated subsequently

    Journal ref: IJCAI 2025

  16. arXiv:2504.19486  [pdf, other]

    cs.CR

    The Cost of Performance: Breaking ThreadX with Kernel Object Masquerading Attacks

    Authors: Xinhui Shao, Zhen Ling, Yue Zhang, Huaiyu Yan, Yumeng Wei, Lan Luo, Zixia Liu, Junzhou Luo, Xinwen Fu

    Abstract: Microcontroller-based IoT devices often use embedded real-time operating systems (RTOSs). Vulnerabilities in these embedded RTOSs can lead to compromises of those IoT devices. Despite the significance of security protections, the absence of standardized security guidelines results in various levels of security risk across RTOS implementations. Our initial analysis reveals that popular RTOSs such a…

    Submitted 28 April, 2025; originally announced April 2025.

  17. arXiv:2504.19276  [pdf, other]

    cs.LG cs.AI cs.CL

    Anyprefer: An Agentic Framework for Preference Data Synthesis

    Authors: Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao

    Abstract: High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with t…

    Submitted 27 April, 2025; originally announced April 2025.

  18. arXiv:2504.17901  [pdf, other]

    cs.RO cs.AI

    Beyond Task and Motion Planning: Hierarchical Robot Planning with General-Purpose Policies

    Authors: Benned Hedegaard, Ziyi Yang, Yichen Wei, Ahmed Jaafar, Stefanie Tellex, George Konidaris, Naman Shah

    Abstract: Task and motion planning is a well-established approach for solving long-horizon robot planning problems. However, traditional methods assume that each task-level robot action, or skill, can be reduced to kinematic motion planning. In this work, we address the challenge of planning with both kinematic skills and closed-loop motor controllers that go beyond kinematic considerations. We propose a no…

    Submitted 24 April, 2025; originally announced April 2025.

  19. arXiv:2504.17353  [pdf, other]

    cs.CL cs.CV cs.MM

    M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

    Authors: Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori

    Abstract: Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability t…

    Submitted 24 April, 2025; originally announced April 2025.

  20. arXiv:2504.17343  [pdf, other]

    cs.CV

    TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

    Authors: Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun

    Abstract: The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they fa…

    Submitted 24 April, 2025; originally announced April 2025.

  21. arXiv:2504.16798  [pdf, other]

    cs.MM cs.CV cs.LG

    4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

    Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun

    Abstract: Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with behavioral cognitive scores tabular data biomarkers. However, the intrinsic heterogeneit…

    Submitted 23 April, 2025; originally announced April 2025.

  22. arXiv:2504.16656  [pdf, other]

    cs.CV

    Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

    Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages the Mixed Preference Optimization (MPO) and the Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing…

    Submitted 25 April, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  23. arXiv:2504.14378  [pdf]

    cond-mat.mtrl-sci cs.LG

    Machine learning enhanced atom probe tomography analysis: a snapshot review

    Authors: Yue Li, Ye Wei, Alaukik Saxena, Markus Kühbach, Christoph Freysoldt, Baptiste Gault

    Abstract: Atom probe tomography (APT) is a burgeoning characterization technique that provides compositional mapping of materials in three-dimensions at near-atomic scale. Since its significant expansion in the past 30 years, we estimate that one million APT datasets have been collected, each containing millions to billions of individual ions. Their analysis and the extraction of microstructural information…

    Submitted 19 April, 2025; originally announced April 2025.

  24. arXiv:2504.13207  [pdf, other]

    cs.GR cs.RO

    BEV-GS: Feed-forward Gaussian Splatting in Bird's-Eye-View for Road Reconstruction

    Authors: Wenhua Wu, Tong Zhao, Chensheng Peng, Lei Yang, Yintao Wei, Zhe Liu, Hesheng Wang

    Abstract: Road surface is the sole contact medium for wheels or robot feet. Reconstructing road surface is crucial for unmanned vehicles and mobile robots. Recent studies on Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have achieved remarkable results in scene reconstruction. However, they typically rely on multi-view image inputs and require prolonged optimization times. In this paper, we prop…

    Submitted 15 April, 2025; originally announced April 2025.

  25. arXiv:2504.12599  [pdf, other]

    cs.CV

    3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

    Authors: Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei

    Abstract: 3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL uses a teacher-student paradigm where teacher generates high-confidence-filtered pseudo-la…

    Submitted 16 April, 2025; originally announced April 2025.

  26. arXiv:2504.12276  [pdf, other]

    cs.CV

    The Tenth NTIRE 2025 Image Denoising Challenge Report

    Authors: Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, Xiangyu Kong, Hyunhee Park, Xiaoxuan Yu, Suejin Han, Hakjae Jeon, Jia Li, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Jingyu Ma, Zhijuan Huang, Huiyuan Fu, Hongyuan Yu, Boqi Zhang, Jiawei Shi, Heng Zhang, Huadong Ma, Deepak Kumar Tyagi, et al. (69 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent ad…

    Submitted 16 April, 2025; originally announced April 2025.

  27. arXiv:2504.11326  [pdf, other]

    cs.CV

    PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

    Authors: Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, et al. (11 additional authors not shown)

    Abstract: This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, languag…

    Submitted 21 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: Workshop Page: https://pvuw.github.io/. arXiv admin note: text overlap with arXiv:2504.00476, arXiv:2504.05178

  28. arXiv:2504.11143  [pdf, other]

    cs.CV

    Taming Consistency Distillation for Accelerated Human Image Animation

    Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang

    Abstract: Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image ani…

    Submitted 15 April, 2025; originally announced April 2025.

  29. MSCRS: Multi-modal Semantic Graph Prompt Learning Framework for Conversational Recommender Systems

    Authors: Yibiao Wei, Jie Zou, Weikang Guo, Guoqing Wang, Xing Xu, Yang Yang

    Abstract: Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by interacting with users through conversations. Most existing studies of CRS focus on extracting user preferences from conversational contexts. However, due to the short and sparse nature of conversational contexts, it is difficult to fully capture user preferences by conversational contexts only. We argue that…

    Submitted 25 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  30. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  31. arXiv:2504.09544  [pdf, other]

    cs.LG cs.CE cs.CV

    Causal integration of chemical structures improves representations of microscopy images for morphological profiling

    Authors: Yemin Yu, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu

    Abstract: Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We h…

    Submitted 16 April, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

    Comments: 24 pages

  32. arXiv:2504.08937  [pdf, other]

    cs.GR cs.CV cs.LG eess.IV stat.ML

    Rethinking Few-Shot Image Fusion: Granular Ball Priors Enable General-Purpose Deep Fusion

    Authors: Minjie Deng, Yan Wei, Hao Zhai, An Wu, Yuncan Ouyang, Qianyao Peng

    Abstract: In image fusion tasks, the absence of real fused images as priors presents a fundamental challenge. Most deep learning-based fusion methods rely on large-scale paired datasets to extract global weighting features from raw images, thereby generating fused outputs that approximate real fused images. In contrast to previous studies, this paper explores few-shot training of neural networks under the c…

    Submitted 25 April, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

  33. arXiv:2504.07954  [pdf, other]

    cs.CV cs.CL

    Perception-R1: Pioneering Perception Policy with Reinforcement Learning

    Authors: En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao

    Abstract: Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in th…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Github page: https://github.com/linkangheng/PR1

  34. arXiv:2504.07165  [pdf, other]

    cs.CV

    Perception in Reflection

    Authors: Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel

    Abstract: We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enables iterative refinement of visu…

    Submitted 9 April, 2025; originally announced April 2025.

  35. arXiv:2504.06666  [pdf, other]

    cs.CV

    Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

    Authors: Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu

    Abstract: High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer…

    Submitted 9 April, 2025; originally announced April 2025.

  36. arXiv:2504.05599  [pdf, other]

    cs.CV cs.CL

    Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

    Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We introduce Skywork R1V, a multimodal reasoning model extending the R1-series large language models (LLMs) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text al…

    Submitted 7 April, 2025; originally announced April 2025.

  37. arXiv:2504.05300  [pdf, ps, other]

    cs.LG math.NA math.ST stat.ML

    Dimension-Free Convergence of Diffusion Models for Approximate Gaussian Mixtures

    Authors: Gen Li, Changxiao Cai, Yuting Wei

    Abstract: Diffusion models are distinguished by their exceptional generative performance, particularly in producing high-quality samples through iterative denoising. While current theory suggests that the number of denoising steps required for accurate sample generation should scale linearly with data dimension, this does not reflect the practical efficiency of widely used algorithms like Denoising Diffusio…

    Submitted 7 April, 2025; originally announced April 2025.

  38. arXiv:2504.04445  [pdf, other]

    cs.RO

    A Convex and Global Solution for the P$n$P Problem in 2D Forward-Looking Sonar

    Authors: Jiayi Su, Jingyu Qian, Liuqing Yang, Yufan Yuan, Yanbing Fu, Jie Wu, Yan Wei, Fengzhong Qu

    Abstract: The perspective-$n$-point (P$n$P) problem is important for robotic pose estimation. It is well studied for optical cameras, but research is lacking for 2D forward-looking sonar (FLS) in underwater scenarios due to the vastly different imaging principles. In this paper, we demonstrate that, despite the nonlinearity inherent in sonar image formation, the P$n$P problem for 2D FLS can still be effecti…

    Submitted 10 April, 2025; v1 submitted 6 April, 2025; originally announced April 2025.

  39. arXiv:2504.04443  [pdf, other]

    cs.IR

    Squeeze and Excitation: A Weighted Graph Contrastive Learning for Collaborative Filtering

    Authors: Zheyu Chen, Jinfeng Xu, Yutong Wei, Ziyue Peng

    Abstract: Contrastive Learning (CL) has recently emerged as a powerful technique in recommendation systems, particularly for its capability to harness self-supervised signals from perturbed views to mitigate the persistent challenge of data sparsity. The process of constructing perturbed views of the user-item bipartite graph and performing contrastive learning between perturbed views in a graph convolution…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: Accepted by SIGIR 2025

  40. arXiv:2504.04156  [pdf, other]

    cs.CV

    CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation

    Authors: Kai Fang, Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

    Abstract: Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel a…

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025

  41. arXiv:2504.03877  [pdf, other]

    cs.LG

    Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

    Authors: Yuchen Wei, Dennis Pearl, Matthew Beckman, Rebecca J. Passonneau

    Abstract: Formative assessment in STEM topics aims to promote student learning by identifying students' current understanding, thus targeting how to promote further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-qua…

    Submitted 4 April, 2025; originally announced April 2025.

    Comments: 13 pages excluding references. 9 tables and 4 figures

    ACM Class: I.2.7; K.3.1

  42. arXiv:2504.03723  [pdf, other]

    cs.AR cs.MA

    VFlow: Discovering Optimal Agentic Workflows for Verilog Generation

    Authors: Yangbo Wei, Zhen Huang, Huang Li, Wei W. Xing, Ting-Jung Lin, Lei He

    Abstract: Hardware design automation faces challenges in generating high-quality Verilog code efficiently. This paper introduces VFlow, an automated framework that optimizes agentic workflows for Verilog code generation. Unlike existing approaches that rely on pre-defined prompting strategies, VFlow leverages Monte Carlo Tree Search (MCTS) to discover effective sequences of Large Language Models invocations…

    Submitted 30 March, 2025; originally announced April 2025.

    Comments: 6 pages

  43. arXiv:2504.03254  [pdf, other]

    cs.CV

    SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

    Authors: Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

    Abstract: Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Languag…

    Submitted 4 April, 2025; originally announced April 2025.

  44. arXiv:2504.01038  [pdf, other]

    eess.IV cs.CV cs.HC

    An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

    Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

    Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One C…

    Submitted 31 March, 2025; originally announced April 2025.

    Comments: 26 pages, 4 figures, 6 tables

  45. arXiv:2503.23292  [pdf, other]

    cs.CR

    FedCAPrivacy: Privacy-Preserving Heterogeneous Federated Learning with Anonymous Adaptive Clustering

    Authors: Yunan Wei, Shengnan Zhao, Chuan Zhao, Zhe Liu, Zhenxiang Chen, Minghao Zhao

    Abstract: Federated learning (FL) is a distributed machine learning paradigm enabling multiple clients to train a model collaboratively without exposing their local data. Among FL schemes, clustering is an effective technique addressing the heterogeneity issue (i.e., differences in data distribution and computational ability affect training performance and effectiveness) via grouping participants with simil…

    Submitted 29 March, 2025; originally announced March 2025.

  46. arXiv:2503.21807  [pdf, other]

    cs.LG cs.AI cs.MA

    LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning

    Authors: Yuan Wei, Xiaohan Shan, Jianmin Li

    Abstract: Multi-agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single-agent RL: credit assignment in cooperative tasks and partial observability of environmental states. We propose LERO, a framework integrating Large language models (LLMs) with evolutionary optimization to address these MARL-specific challenges. The solution centers on two LLM-generated components: a hybrid…

    Submitted 25 March, 2025; originally announced March 2025.

  47. arXiv:2503.21013  [pdf, other]

    cs.NI cs.DC

    AllReduce Scheduling with Hierarchical Deep Reinforcement Learning

    Authors: Yufan Wei, Mickel Liu, Wenfei Wu

    Abstract: AllReduce is a technique in distributed computing which saw use in many critical applications of deep learning. Existing methods of AllReduce scheduling oftentimes lack flexibility due to being topology-specific or relying on extensive handcrafted designs that require domain-specific knowledge. In this work, we aim to alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-ba…

    Submitted 26 March, 2025; originally announced March 2025.

  48. arXiv:2503.19369  [pdf, other]

    cs.CV

    EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

    Authors: Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen

    Abstract: The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategy, resulting in high computational burd…

    Submitted 25 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

  49. arXiv:2503.18595  [pdf, other]

    cs.LG cs.AI

    Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition

    Authors: Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu

    Abstract: Sensory training during the early ages is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observation, this prime learning window in multimodal learning is often…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 10 pages, 16 figures, CVPR 2025

  50. arXiv:2503.18578  [pdf, other]

    cs.LG cs.AI cs.CV

    Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding

    Authors: Tianyu Chen, Xingcheng Fu, Yisen Gao, Haodong Qian, Yuecen Wei, Kun Yan, Haoyi Zhou, Jianxin Li

    Abstract: Modern vision-language models (VLMs) develop patch embedding and convolution backbone within vector space, especially Euclidean ones, at the very founding. When expanding VLMs to a galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges. a) The current pre-training model is…

    Submitted 30 April, 2025; v1 submitted 24 March, 2025; originally announced March 2025.