Showing 1–50 of 857 results for author: Yu, W

Searching in archive cs.
  1. arXiv:2507.01006  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Authors: GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

    Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the fi…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  2. arXiv:2507.00761  [pdf, ps, other]

    cs.LG

    A Probabilistic Approach to Wildfire Spread Prediction Using a Denoising Diffusion Surrogate Model

    Authors: Wenbo Yu, Anirbit Ghosh, Tobias Sebastian Finn, Rossella Arcucci, Marc Bocquet, Sibo Cheng

    Abstract: Thanks to recent advances in generative AI, computers can now simulate realistic and complex natural processes. We apply this capability to predict how wildfires spread, a task made difficult by the unpredictable nature of fire and the variety of environmental conditions it depends on. In this study, we present the first denoising diffusion model for predicting wildfire spread, a new kind of AI fr…

    Submitted 1 July, 2025; originally announced July 2025.

  3. arXiv:2506.21811  [pdf, other]

    cs.DB cs.GR

    Revisiting Graph Analytics Benchmark

    Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Xue Li, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

    Abstract: The rise of graph analytics platforms has led to the development of various benchmarks for evaluating and comparing platform performance. However, existing benchmarks often fall short of fully assessing performance due to limitations in core algorithm selection, data generation processes (and the corresponding synthetic datasets), as well as the neglect of API usability evaluation. To address thes…

    Submitted 4 March, 2025; originally announced June 2025.

  4. arXiv:2506.20495  [pdf, ps, other]

    cs.CL cs.AI cs.IR cs.LG cs.SE

    ReCode: Updating Code API Knowledge with Reinforcement Learning

    Authors: Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang

    Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-base…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Work in progress

  5. Near Data Processing in Taurus Database

    Authors: Shu Lin, Arunprasad P. Marathe, Per-Åke Larson, Chong Chen, Calvin Sun, Paul Lee, Weidong Yu

    Abstract: Huawei's cloud-native database system GaussDB for MySQL (also known as Taurus) stores data in a separate storage layer consisting of a pool of storage servers. Each server has considerable compute power making it possible to push data reduction operations (selection, projection, and aggregation) close to storage. This paper describes the design and implementation of near data processing (NDP) in T…

    Submitted 24 June, 2025; originally announced June 2025.

    ACM Class: H.2.4

    Journal ref: 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 2022, pp. 1662-1674

  6. arXiv:2506.19807  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.MA

    KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

    Authors: Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang

    Abstract: Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerb…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Work in progress

  7. arXiv:2506.19266  [pdf]

    q-bio.NC cs.CV eess.IV

    Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans

    Authors: Jiahao Huang, Ruifeng Li, Wenwen Yu, Anan Li, Xiangning Li, Mingchao Yan, Lei Xie, Qingrun Zeng, Xueyan Jia, Shuxin Wang, Ronghui Ju, Feng Chen, Qingming Luo, Hui Gong, Andrew Zalesky, Xiaoquan Yang, Yuanjing Feng, Zheng Wang

    Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T dif…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 34 pages, 6 figures

  8. arXiv:2506.18348  [pdf, ps, other]

    cs.AI

    Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team

    Authors: Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang

    Abstract: Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCI…

    Submitted 27 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  9. arXiv:2506.16475  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining

    Authors: Yaru Niu, Yunzhe Zhang, Mingyang Yu, Changyi Lin, Chenhao Li, Yikai Wang, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Bingqing Chen, Jonathan Francis, Zhenzhen Li, Jie Tan, Ding Zhao

    Abstract: Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a cross-embodiment imitation learning system for quadrupedal manipulation, leveraging data collected from both humans and LocoMan, a quadruped equipped with mul…

    Submitted 19 June, 2025; originally announced June 2025.

  10. arXiv:2506.14437  [pdf, ps, other]

    cs.IR

    Similarity = Value? Consultation Value Assessment and Alignment for Personalized Search

    Authors: Weicong Qin, Yi Xu, Weijie Yu, Teng Shi, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu

    Abstract: Personalized search systems in e-commerce platforms increasingly involve user interactions with AI assistants, where users consult about products, usage scenarios, and more. Leveraging consultation to personalize search services is trending. Existing methods typically rely on semantic similarity to align historical consultations with current queries due to the absence of 'value' labels, but we obs…

    Submitted 17 June, 2025; originally announced June 2025.

  11. arXiv:2506.12924  [pdf, ps, other]

    cs.IT math.CO

    Optimal Reconstruction Codes with Given Reads in Multiple Burst-Substitutions Channels

    Authors: Wenjun Yu, Yubo Sun, Zixiang Xu, Gennian Ge, Moshe Schwartz

    Abstract: We study optimal reconstruction codes over the multiple-burst substitution channel. Our main contribution is establishing a trade-off between the error-correction capability of the code, the number of reads used in the reconstruction process, and the decoding list size. We show that over a channel that introduces at most $t$ bursts, we can use a length-$n$ code capable of correcting $ε$ errors, wi…

    Submitted 15 June, 2025; originally announced June 2025.

  12. arXiv:2506.12738  [pdf, ps, other]

    cs.CV cs.AI

    Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution

    Authors: Hang Xu, Wei Yu, Jiangtong Tan, Zhen Zou, Feng Zhao

    Abstract: Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability with unknown degradation, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and over…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: 8 pages, 8 figures, CVPR2025

  13. arXiv:2506.08691  [pdf, ps, other]

    cs.CV

    VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

    Authors: Congzhi Zhang, Jiawei Peng, Zhenglin Wang, Yilong Lai, Haowen Sun, Heng Chang, Fei Ma, Weijiang Yu

    Abstract: Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST metic…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025 main

  14. arXiv:2506.06637  [pdf]

    cs.LG cs.AI cs.CV eess.SP

    Non-Intrusive Load Monitoring Based on Image Load Signatures and Continual Learning

    Authors: Olimjon Toirov, Wei Yu

    Abstract: Non-Intrusive Load Monitoring (NILM) identifies the operating status and energy consumption of each electrical device in the circuit by analyzing the electrical signals at the bus, which is of great significance for smart power management. However, the complex and changeable load combinations and application environments lead to the challenges of poor feature robustness and insufficient model gene…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 10 pages, 3 figures, 2025 2nd International Conference on Digital Society and Artificial Intelligence (DSAI 2025), Conference dates: May 23-25, 2025

  15. arXiv:2506.03972  [pdf, ps, other]

    cs.CV

    MS-YOLO: A Multi-Scale Model for Accurate and Efficient Blood Cell Detection

    Authors: Guohua Wu, Shengqi Chen, Pengchao Deng, Wenting Yu

    Abstract: Complete blood cell detection holds significant value in clinical diagnostics. Conventional manual microscopy methods suffer from time inefficiency and diagnostic inaccuracies. Existing automated detection approaches remain constrained by high deployment costs and suboptimal accuracy. While deep learning has introduced powerful paradigms to this field, persistent challenges in detecting overlappin…

    Submitted 4 June, 2025; originally announced June 2025.

  16. arXiv:2506.03660  [pdf, ps, other]

    cs.CV

    INP-Former++: Advancing Universal Anomaly Detection via Intrinsic Normal Prototypes and Residual Learning

    Authors: Wei Luo, Haiming Yao, Yunkang Cao, Qiyu Chen, Ang Gao, Weiming Shen, Wenyong Yu

    Abstract: Anomaly detection (AD) is essential for industrial inspection and medical diagnosis, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variatio…

    Submitted 30 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

    Comments: 15 pages, 11 figures, 13 tables. arXiv admin note: substantial text overlap with arXiv:2503.02424

  17. arXiv:2506.03147  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan

    Abstract: Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipul…

    Submitted 18 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  18. arXiv:2506.02634  [pdf, ps, other]

    cs.DC cs.AI

    KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

    Authors: Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, Haibo Chen

    Abstract: Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV\$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV\$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present t…

    Submitted 18 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted by USENIX ATC'25

  19. arXiv:2506.01947  [pdf, ps, other]

    eess.IV cs.CV

    RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

    Authors: Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao, et al. (8 additional authors not shown)

    Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

  20. arXiv:2506.00677  [pdf, ps, other]

    cs.CR cs.ET physics.app-ph

    Review of Blockchain-Based Approaches to Spent Fuel Management in Nuclear Power Plants

    Authors: Yuxiang Xu, Wenjuan Yu, Yuqian Wan, Zhongming Zhang

    Abstract: This study addresses critical challenges in managing the transportation of spent nuclear fuel, including inadequate data transparency, stringent confidentiality requirements, and a lack of trust among collaborating parties, issues prevalent in traditional centralized management systems. Given the high risks involved, balancing data confidentiality with regulatory transparency is imperative. To ove…

    Submitted 31 May, 2025; originally announced June 2025.

  21. arXiv:2505.24528  [pdf, ps, other]

    cs.CV cs.LG

    Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

    Authors: Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls

    Abstract: Foundation Models (FMs) are large-scale, pre-trained AI systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world…

    Submitted 30 May, 2025; originally announced May 2025.

  22. arXiv:2505.24254  [pdf, ps, other]

    cs.LG

    Rethinking Continual Learning with Progressive Neural Collapse

    Authors: Zheng Wang, Wanhao Yu, Li Yang, Sen Lin

    Abstract: Continual Learning (CL) seeks to build an agent that can continuously learn a sequence of tasks, where a key challenge, namely Catastrophic Forgetting, persists due to the potential knowledge interference among different tasks. On the other hand, deep neural networks (DNNs) are shown to converge to a terminal state termed Neural Collapse during training, where all class prototypes geometrically fo…

    Submitted 30 May, 2025; originally announced May 2025.

  23. arXiv:2505.24113  [pdf, ps, other]

    cs.MA

    Distributed Neural Policy Gradient Algorithm for Global Convergence of Networked Multi-Agent Reinforcement Learning

    Authors: Pengcheng Dai, Yuanqiu Mo, Wenwu Yu, Wei Ren

    Abstract: This paper studies the networked multi-agent reinforcement learning (NMARL) problem, where the objective of agents is to collaboratively maximize the discounted average cumulative rewards. Different from the existing methods that suffer from poor expression due to linear function approximation, we propose a distributed neural policy gradient algorithm that features two innovatively designed neural…

    Submitted 29 May, 2025; originally announced May 2025.

  24. arXiv:2505.24003  [pdf, ps, other]

    cs.LG cs.AI

    Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting

    Authors: ChengAo Shen, Wenchao Yu, Ziming Zhao, Dongjin Song, Wei Cheng, Haifeng Chen, Jingchao Ni

    Abstract: Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, apply…

    Submitted 29 May, 2025; originally announced May 2025.

  25. arXiv:2505.23239  [pdf, ps, other]

    cs.SE cs.AI

    OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software

    Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

    Abstract: Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework emplo…

    Submitted 29 May, 2025; originally announced May 2025.

  26. arXiv:2505.23175  [pdf, ps, other]

    cs.RO

    LocoTouch: Learning Dexterous Quadrupedal Transport with Tactile Sensing

    Authors: Changyi Lin, Yuxin Ray Song, Boda Huo, Mingyang Yu, Yikai Wang, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Yiyue Luo, Ding Zhao

    Abstract: Quadrupedal robots have demonstrated remarkable agility and robustness in traversing complex terrains. However, they remain limited in performing object interactions that require sustained contact. In this work, we present LocoTouch, a system that equips quadrupedal robots with tactile sensing to address a challenging task in this category: long-distance transport of unsecured cylindrical objects,…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Project page: https://linchangyi1.github.io/LocoTouch

  27. arXiv:2505.22654  [pdf, ps, other]

    cs.CV cs.CL

    VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

    Authors: Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu

    Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of t…

    Submitted 12 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: Changes from v1: Uploaded code link and fixed minor typos

  28. Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs

    Authors: Jiawen Chen, Qi Shao, Duxin Chen, Wenwu Yu

    Abstract: Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a…

    Submitted 26 May, 2025; originally announced May 2025.

  29. arXiv:2505.19151  [pdf, ps, other]

    cs.GR cs.AI cs.CV

    SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

    Authors: Shenggan Cheng, Yuanxin Wei, Lansong Diao, Yong Liu, Bujiao Chen, Lianghua Huang, Yu Liu, Wenyuan Yu, Jiangsu Du, Wei Lin, Yang You

    Abstract: Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usuall…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 9 pages, 6 figures

  30. arXiv:2505.18631  [pdf, other]

    cs.RO

    S2R-Bench: A Sim-to-Real Evaluation Benchmark for Autonomous Driving

    Authors: Li Wang, Guangqi Yang, Lei Yang, Ziying Song, Xinyu Zhang, Ying Chen, Lin Liu, Junjie Gao, Zhiwei Li, Qingshan Yang, Jun Li, Liangliang Wang, Wenhao Yu, Bin Xu, Weida Wang, Huaping Liu

    Abstract: Safety is a long-standing and the final pursuit in the development of autonomous driving systems, with a significant portion of safety challenge arising from perception. How to effectively evaluate the safety as well as the reliability of perception algorithms is becoming an emerging issue. Despite its critical importance, existing perception methods exhibit a limitation in their robustness, prima…

    Submitted 24 May, 2025; originally announced May 2025.

  31. arXiv:2505.17060  [pdf, ps, other]

    cs.CL cs.AI

    SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

    Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

    Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent…

    Submitted 17 May, 2025; originally announced May 2025.

  32. arXiv:2505.16980  [pdf, ps, other]

    cs.CV cs.MM

    Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

    Authors: Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei

    Abstract: Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal incons…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: CVPR 2025

  33. arXiv:2505.16314  [pdf, ps, other]

    cs.CV cs.AI

    NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

    Authors: Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Jianhui Sun, Xinli Yue, Tianyi Wang, Huan Hou, Junda Lu, Xinyang Huang, Zitang Zhou, Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao, et al. (90 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspe…

    Submitted 22 May, 2025; originally announced May 2025.

  34. arXiv:2505.16042  [pdf, other]

    cs.RO

    Reference Free Platform Adaptive Locomotion for Quadrupedal Robots using a Dynamics Conditioned Policy

    Authors: David Rytz, Suyoung Choi, Wanming Yu, Wolfgang Merkt, Jemin Hwangbo, Ioannis Havoutis

    Abstract: This article presents Platform Adaptive Locomotion (PAL), a unified control method for quadrupedal robots with different morphologies and dynamics. We leverage deep reinforcement learning to train a single locomotion policy on procedurally generated robots. The policy maps proprioceptive robot state information and base velocity commands into desired joint actuation targets, which are conditioned…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 8 pages, 6 tables, 5 figures

  35. arXiv:2505.16008  [pdf, other]

    cs.CL cs.AI cs.CR

    LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization

    Authors: Wenrui Yu, Yiyi Chen, Johannes Bjerva, Sokol Kosta, Qiongxiu Li

    Abstract: We propose LAGO - Language Similarity-Aware Graph Optimization - a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work in embedding inversion attacks that treat languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimizati…

    Submitted 21 May, 2025; originally announced May 2025.

  36. arXiv:2505.15287  [pdf, ps, other]

    cs.CV

    GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation

    Authors: Yuchen Li, Chaoran Feng, Zhenyu Tang, Kaiyuan Deng, Wangbo Yu, Yonghong Tian, Li Yuan

    Abstract: We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these li…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 21 pages, 7 figures. More details at http://intothemild.github.io/GS2E.github.io

  37. arXiv:2505.15235  [pdf, ps, other]

    eess.IV cs.CV

    X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography

    Authors: Yifan Liu, Wuyang Li, Weihao Yu, Chenxin Li, Alexandre Alahi, Max Meng, Yixuan Yuan

    Abstract: Computed Tomography serves as an indispensable tool in clinical workflows, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited to small-capacity model architecture and inflexible volume representation. In this work, we present X-GRM (X-ray Gaussian Reconstruction Model), a large feedforward model for reconstructing 3D CT volumes from…

    Submitted 26 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  38. arXiv:2505.15185  [pdf, ps, other]

    cs.CV

    MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

    Authors: Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan

    Abstract: Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual pri…

    Submitted 21 May, 2025; originally announced May 2025.

  39. arXiv:2505.14683  [pdf, ps, other]

    cs.CV

    Emerging Properties in Unified Multimodal Pretraining

    Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan

    Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When…

    Submitted 23 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: 37 pages, 17 figures

  40. arXiv:2505.14135  [pdf, other]

    cs.CV

    Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

    Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, et al. (33 additional authors not shown)

    Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simult…

    Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  41. arXiv:2505.12306  [pdf, other]

    cs.CL

    Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection

    Authors: Yuwei Zhang, Wenhao Yu, Shangbin Feng, Yifan Zhu, Letian Peng, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

    Abstract: Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of standardized and high-quality test ground. In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leve…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: Dataset is available at https://huggingface.co/datasets/YWZBrandon/wikidyk

  42. arXiv:2505.11945  [pdf, other]

    cs.CV

    Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

    Authors: Bonan li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang

    Abstract: Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a nove…

    Submitted 22 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

    Comments: Under Review

  43. arXiv:2505.10610  [pdf, other]

    cs.CV cs.CL

    MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

    Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman

    Abstract: The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thor…

    Submitted 26 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

    Comments: Work in progress

  44. arXiv:2505.09999  [pdf, ps, other]

    cs.DC

    ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

    Authors: Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin

    Abstract: With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limite…

    Submitted 5 June, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

    Comments: Released URL

  45. arXiv:2505.07802  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    Improving Trajectory Stitching with Flow Models

    Authors: Reece O'Mahoney, Wanming Yu, Ioannis Havoutis

    Abstract: Generative models have shown great promise as trajectory planners, given their affinity to modeling complex distributions and guidable inference process. Previous works have successfully applied these in the context of robotic manipulation but perform poorly when the required solution does not exist as a complete trajectory within the training set. We identify that this is a result of being unable…

    Submitted 3 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  46. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  47. arXiv:2505.01396  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

    Authors: Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, Cewu Lu

    Abstract: Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions…

    Submitted 2 May, 2025; originally announced May 2025.

  48. arXiv:2505.01050  [pdf, other]

    cs.CV cs.LG

    Transferable Adversarial Attacks on Black-Box Vision-Language Models

    Authors: Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson

    Abstract: Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehe…

    Submitted 2 May, 2025; originally announced May 2025.

  49. arXiv:2504.21650  [pdf, other]

    cs.CV

    HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

    Authors: Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan

    Abstract: The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue,…

    Submitted 13 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: Project Homepage: https://zhouhyocean.github.io/holotime/ Code: https://github.com/PKU-YuanGroup/HoloTime

  50. arXiv:2504.21024  [pdf, other]

    cs.CL

    WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model

    Authors: Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, Dong Yu

    Abstract: Agent self-improvement, where the backbone Large Language Model (LLM) of the agent are trained on trajectories sampled autonomously based on their own policies, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: 19 pages