
Showing 1–50 of 758 results for author: Wei, X

Searching in archive cs.
  1. arXiv:2505.09882  [pdf, ps, other]

    cs.HC

    SnapNCode: An Integrated Development Environment for Programming Physical Objects Interactions

    Authors: Xiaoyan Wei, Zijian Yue, Hsiang-Ting Chen

    Abstract: Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 18 pages, HCII 2025

  2. arXiv:2505.09872  [pdf, ps, other]

    cs.HC

    Context-AI Tunes: Context-Aware AI-Generated Music for Stress Reduction

    Authors: Xiaoyan Wei, Zebang Zhang, Zijian Yue, Hsiang-Ting Chen

    Abstract: Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that genera…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 17 pages, HCII 2025

  3. arXiv:2505.09768  [pdf, other]

    cs.LG

    Self-Consuming Generative Models with Adversarially Curated Data

    Authors: Xiukun Wei, Xueru Zhang

    Abstract: Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their pre…

    Submitted 14 May, 2025; originally announced May 2025.

  4. arXiv:2505.09393  [pdf, ps, other]

    cs.GR cs.AI cs.CV

    UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units

    Authors: Huakun Liu, Hiroki Ota, Xin Wei, Yutaro Hirao, Monica Perusquia-Hernandez, Hideaki Uchiyama, Kiyoshi Kiyokawa

    Abstract: Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR 2025

  5. arXiv:2505.09343  [pdf, ps, other]

    cs.DC cs.AI cs.AR

    Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

    Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei

    Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version will appear as part of the Industry Track in Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)

  6. arXiv:2505.09118  [pdf, other]

    cs.CV

    Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

    Authors: Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li

    Abstract: Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizin…

    Submitted 14 May, 2025; originally announced May 2025.

  7. arXiv:2505.07035  [pdf, ps, other]

    cs.IT eess.SP

    Robust Movable-Antenna Position Optimization with Imperfect CSI for MISO Systems

    Authors: Haifeng Ma, Weidong Mei, Xin Wei, Boyu Ning, Zhi Chen

    Abstract: Movable antenna (MA) technology has emerged as a promising solution for reconfiguring wireless channel conditions through local antenna movement within confined regions. Unlike previous works assuming perfect channel state information (CSI), this letter addresses the robust MA position optimization problem under imperfect CSI conditions for a multiple-input single-output (MISO) MA system. Specific…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: Accepted to IEEE Communications Letters

  8. arXiv:2505.06699  [pdf, other]

    cs.LG stat.ML

    Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

    Authors: Xiyuan Wei, Ming Lin, Fanjiang Ye, Fengguang Song, Liangliang Cao, My T. Thai, Tianbao Yang

    Abstract: This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading…

    Submitted 13 May, 2025; v1 submitted 10 May, 2025; originally announced May 2025.

    Comments: 18 pages, 6 figures

  9. arXiv:2505.06290  [pdf, other]

    cs.LG cs.DM

    UniCO: Towards a Unified Model for Combinatorial Optimization Problems

    Authors: Zefang Zong, Xiaochen Wei, Guozhen Zhang, Chen Gao, Huandong Wang, Yong Li

    Abstract: Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and conven…

    Submitted 7 May, 2025; originally announced May 2025.

  10. arXiv:2505.05914  [pdf, ps, other]

    cs.IT eess.SP

    Mechanical Power Modeling and Energy Efficiency Maximization for Movable Antenna Systems

    Authors: Xin Wei, Weidong Mei, Xuan Huang, Zhi Chen, Boyu Ning

    Abstract: Movable antennas (MAs) have recently garnered significant attention in wireless communications due to their capability to reshape wireless channels via local antenna movement within a confined region. However, to achieve accurate antenna movement, MA drivers introduce non-negligible mechanical power consumption, rendering energy efficiency (EE) optimization more critical compared to conventional f…

    Submitted 9 May, 2025; originally announced May 2025.

  11. arXiv:2505.03484  [pdf, other]

    cs.IR

    STAR-Rec: Making Peace with Length Variance and Pattern Diversity in Sequential Recommendation

    Authors: Maolin Wang, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Xuetao Wei, Zitao Liu, Hongzhi Yin, Yi Chang, Xiangyu Zhao

    Abstract: Recent deep sequential recommendation models often struggle to effectively model key characteristics of user behaviors, particularly in handling sequence length variations and capturing diverse interaction patterns. We propose STAR-Rec, a novel architecture that synergistically combines preference-aware attention and state-space modeling through a sequence-level mixture-of-experts framework. STAR-…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted by SIGIR 2025

  12. arXiv:2505.03186  [pdf, other]

    cs.SD cs.CV eess.AS

    CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

    Authors: Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

    Abstract: The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a…

    Submitted 15 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  13. arXiv:2505.01113  [pdf, other]

    cs.RO cs.CV cs.NE

    NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization

    Authors: Xun Li, Jian Yang, Fenli Jia, Muyu Wang, Qi Wu, Jun Wu, Jinpeng Mi, Jilin Hu, Peidong Liang, Xuan Tang, Ke Li, Xiong You, Xian Wei

    Abstract: Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cell…

    Submitted 2 May, 2025; originally announced May 2025.

  14. arXiv:2504.21577  [pdf, other]

    cs.MM

    Latent Feature-Guided Conditional Diffusion for High-Fidelity Generative Image Semantic Communication

    Authors: Zehao Chen, Xinfeng Wei, Haonan Tong, Zhaohui Yang, Changchuan Yin

    Abstract: Semantic communication is proposed and expected to improve the efficiency and effectiveness of massive data transmission over sixth generation (6G) networks. However, existing deep learning-based joint source and channel coding (DeepJSCC) image semantic communication schemes predominantly focus on optimizing pixel-level metrics and neglect human perceptual requirements, which results in degrade…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 6 pages, 6 figures

  15. arXiv:2504.20094  [pdf, ps, other]

    cs.IR cs.CL cs.HC

    MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

    Authors: Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang, Frank Ong, Se-eun Yoon, Rachit Pareek, Michelle Gong

    Abstract: In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation systems, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent a…

    Submitted 25 April, 2025; originally announced April 2025.

  16. arXiv:2504.18152  [pdf, other]

    cs.CV

    ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

    Authors: Yi-Xing Peng, Qize Yang, Yu-Ming Tang, Shenghao Fu, Kun-Yu Lin, Xihan Wei, Wei-Shi Zheng

    Abstract: Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each…

    Submitted 25 April, 2025; originally announced April 2025.

  17. arXiv:2504.18148  [pdf, other]

    cs.LG

    A Generative Graph Contrastive Learning Model with Global Signal

    Authors: Xiaofan Wei, Binyan Zhang

    Abstract: Graph contrastive learning (GCL) has garnered significant attention recently since it learns complex structural information from graphs in a self-supervised manner. However, prevalent GCL models may suffer from performance degradation due to inappropriate contrastive signals. Concretely, they commonly generate augmented views based on random perturbation, which leads to biased essentia…

    Submitted 25 April, 2025; originally announced April 2025.

  18. arXiv:2504.17365  [pdf, other]

    cs.CV cs.CL

    TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

    Authors: Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang

    Abstract: Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding. Soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, exis…

    Submitted 28 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

  19. arXiv:2504.16464  [pdf, other]

    cs.RO cs.AI

    ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

    Authors: Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang

    Abstract: While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional in…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 9 pages, 3 figures

  20. arXiv:2504.15472  [pdf, other]

    cs.RO cs.LG eess.SY

    LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

    Authors: Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen

    Abstract: We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) t…

    Submitted 21 April, 2025; originally announced April 2025.

  21. arXiv:2504.14988  [pdf, other]

    cs.CV

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    Authors: Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie

    Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine…

    Submitted 13 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

  22. arXiv:2504.14290  [pdf, other]

    cs.CV

    Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

    Authors: Shouwei Ruan, Zhenyu Wu, Yao Huang, Ruochen Zhang, Yitong Sun, Caixin Kang, Xingxing Wei

    Abstract: Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I m…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: 10 pages, 6 figures

  23. arXiv:2504.14194  [pdf, other]

    cs.CL

    Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

    Authors: Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He

    Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-f…

    Submitted 30 April, 2025; v1 submitted 19 April, 2025; originally announced April 2025.

    Comments: Under review

  24. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  25. arXiv:2504.13237  [pdf, other]

    cs.CL

    ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs

    Authors: Yan Yang, Yixia Li, Hongru Wang, Xuetao Wei, Jianqiao Yu, Yun Chen, Guanhua Chen

    Abstract: With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these met…

    Submitted 17 April, 2025; originally announced April 2025.

  26. arXiv:2504.13226  [pdf, other]

    cs.GR

    Image Editing with Diffusion Models: A Survey

    Authors: Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Xiaoming Wei, Enhua Wu

    Abstract: With deeper exploration of diffusion models, developments in the field of image generation have triggered a boom in image creation. As the quality of base-model generated images continues to improve, so does the demand for further applications such as image editing. In recent years, many remarkable works have realized a wide variety of editing effects. However, the wide variety of editing types and div…

    Submitted 17 April, 2025; originally announced April 2025.

  27. arXiv:2504.10829  [pdf, other]

    cs.CV

    LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

    Authors: Hengyu Shi, Junhao Su, Huansheng Ning, Xiaoming Wei, Jialin Gao

    Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free a…

    Submitted 14 April, 2025; originally announced April 2025.

  28. arXiv:2504.10479  [pdf, other]

    cs.CV

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, et al. (26 additional authors not shown)

    Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…

    Submitted 18 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: Technical Report

  29. arXiv:2504.09948  [pdf, other]

    cs.CV cs.AI cs.MM

    Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

    Authors: Huijie Liu, Bingcan Wang, Jie Hu, Xiaoming Wei, Guoliang Kang

    Abstract: Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particula…

    Submitted 30 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: 10 pages, 10 figures, 3 tables

  30. arXiv:2504.09540  [pdf, other]

    cs.CV

    EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

    Authors: Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, Shanghang Zhang

    Abstract: Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the ori…

    Submitted 13 April, 2025; originally announced April 2025.

  31. arXiv:2504.07590  [pdf, other]

    cs.CR

    DWFS-Obfuscation: Dynamic Weighted Feature Selection for Robust Malware Familial Classification under Obfuscation

    Authors: Xingyuan Wei, Zijun Cheng, Ning Li, Qiujian Lv, Ziyang Yu, Degang Sun

    Abstract: Due to its open-source nature, the Android operating system has consistently been a primary target for attackers. Learning-based methods have made significant progress in the field of Android malware detection. However, traditional detection methods based on static features struggle to identify obfuscated malicious code, while methods relying on dynamic analysis suffer from low efficiency. To addr…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 15 pages, 1 figure

    ACM Class: I.2.7

  32. arXiv:2504.06544  [pdf, ps, other]

    cs.CV

    LCGC: Learning from Consistency Gradient Conflicting for Class-Imbalanced Semi-Supervised Debiasing

    Authors: Weiwei Xing, Yue Cheng, Hongzhu Yi, Xiaohui Gao, Xiang Wei, Xiaoyu Guo, Yuming Zhang, Xinyu Pang

    Abstract: Classifiers often become biased when trained on class-imbalanced datasets, especially in the semi-supervised learning (SSL) setting. Previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image's logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the bla…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted by AAAI 2025

  33. arXiv:2504.06027  [pdf, other]

    cs.CV eess.IV

    OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

    Authors: Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li

    Abstract: Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issue, we propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation to eliminate t…

    Submitted 8 April, 2025; originally announced April 2025.

  34. arXiv:2504.05118  [pdf, other]

    cs.AI

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Authors: Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, et al. (2 additional authors not shown)

    Abstract: We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the pr…

    Submitted 10 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  35. arXiv:2504.02137  [pdf, other]

    cs.IR cs.AI

    Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID

    Authors: Carolina Zheng, Minhui Huang, Dmitrii Pedchenko, Kaushik Rangadurai, Siyu Wang, Gaby Nahum, Jie Lei, Yang Yang, Tao Liu, Zutian Luo, Xiaohan Wei, Dinesh Ramasamy, Jiyan Yang, Yiping Han, Lin Yang, Hangjun Xu, Rong Jin, Shuang Yang

    Abstract: The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many sys…

    Submitted 2 April, 2025; originally announced April 2025.

  36. arXiv:2503.23668  [pdf, other]

    cs.AI

    MolGround: A Benchmark for Molecular Grounding

    Authors: Jiaxin Wu, Ting Zhang, Rubing Chen, Wengyu Zhang, Chen Jason Zhang, Xiao-Yong Wei, Li Qing

    Abstract: Current molecular understanding approaches predominantly focus on the descriptive aspect of human perception, providing broad, topic-level insights. However, the referential aspect -- linking molecular concepts to specific structural components -- remains largely unexplored. To address this gap, we propose a molecular grounding benchmark designed to evaluate a model's referential abilities. We ali…

    Submitted 30 April, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

  37. arXiv:2503.23350  [pdf, other]

    cs.AI

    A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

    Authors: Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li

    Abstract: The advancement of web techniques has significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intel…

    Submitted 10 May, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

    Comments: Accepted by KDD 2025

  38. arXiv:2503.22346  [pdf, other]

    cs.CV

    ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting

    Authors: Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao

    Abstract: Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400…

    Submitted 2 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  39. arXiv:2503.21500  [pdf, other]

    cs.CL

    OpenHuEval: Evaluating Large Language Model on Hungarian Specifics

    Authors: Haote Yang, Xingjian Wei, Jiang Wu, Noémi Ligeti-Nagy, Jiaxing Sun, Yinfan Wang, Zijian Győző Yang, Junyuan Gao, Jingchao Wang, Bowen Jiang, Shasha Wang, Nanjun Yu, Zihao Zhang, Shixin Hong, Hongwei Liu, Wei Li, Songyang Zhang, Dahua Lin, Lijun Wu, Gábor Prószéky, Conghui He

    Abstract: We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative…

    Submitted 27 March, 2025; originally announced March 2025.

  40. arXiv:2503.20454  [pdf, other]

    cs.LG cs.CV

    Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks

    Authors: Yangqi Feng, Shing-Ho J. Lin, Baoyuan Gao, Xian Wei

    Abstract: Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends t…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: 13 pages, 6 figures

  41. arXiv:2503.19367  [pdf, other]

    cs.CV

    VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

    Authors: Zizhi Chen, Minghao Han, Xukun Zhang, Shuwei Ma, Tao Liu, Xing Wei, Lihua Zhang

    Abstract: Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answerin…

    Submitted 29 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by ICME2025

  42. arXiv:2503.18484  [pdf, other]

    cs.CV cs.CL

    PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

    Authors: Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He

    Abstract: Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enab…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Equal contribution: Junyuan Gao, Jiahe Song, Jiang Wu; Corresponding author: Conghui He

  43. arXiv:2503.18328  [pdf, other]

    cs.CV

    TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering

    Authors: Chun Gu, Xiaofei Wei, Li Zhang, Xiatian Zhu

    Abstract: Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use pre-defined non-learnable importance…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: CVPR 2025. Code: https://github.com/fudan-zvg/tensoflow

  44. arXiv:2503.18159  [pdf, other]

    cs.CV cs.AI cs.SD

    DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

    Authors: Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian

    Abstract: Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the nondeterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. Existing diffusion-based methods can improve the diversity of facial anim…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: Accepted by ICME2025

  45. arXiv:2503.14476  [pdf, other]

    cs.LG cs.CL

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, et al. (10 additional authors not shown)

    Abstract: Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecouple…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: Project Page: https://dapo-sia.github.io/

  46. arXiv:2503.12769  [pdf, other]

    cs.CV

    ViSpeak: Visual Instruction Feedback in Streaming Videos

    Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, Wei-Shi Zheng

    Abstract: Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedbac…

    Submitted 16 March, 2025; originally announced March 2025.

  47. arXiv:2503.11780  [pdf, other]

    cs.CV

    Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

    Authors: Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei

    Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem that the decreased feature e…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 10 pages, 6 figures

  48. arXiv:2503.11368  [pdf, other]

    cs.CV

    PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture

    Authors: Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon

    Abstract: Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Homepage: https://pbr3dgen1218.github.io/

  49. arXiv:2503.10950  [pdf, other]

    physics.chem-ph cs.CV

    DNA Origami Nanostructures Observed in Transmission Electron Microscopy Images can be Characterized through Convolutional Neural Networks

    Authors: Xingfei Wei, Qiankun Mo, Chi Chen, Mark Bathe, Rigoberto Hernandez

    Abstract: Artificial intelligence (AI) models remain an emerging strategy to accelerate materials design and development. We demonstrate that convolutional neural network (CNN) models can characterize DNA origami nanostructures employed in programmable self-assembling, which is important in many applications such as in biomedicine. Specifically, we benchmark the performance of 9 CNN models -- viz. AlexNet,…

    Submitted 13 March, 2025; originally announced March 2025.

  50. arXiv:2503.10152  [pdf, other]

    cs.CV

    A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

    Authors: Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

    Abstract: Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that detectors can recognize new or novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectiv…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Accepted to TMM 2025