Showing 1–50 of 813 results for author: Yu, W

Searching in archive cs.
  1. arXiv:2505.09999  [pdf, ps, other]

    cs.DC

    ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

    Authors: Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin

    Abstract: With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limite…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.07802  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    Improving Trajectory Stitching with Flow Models

    Authors: Reece O'Mahoney, Wanming Yu, Ioannis Havoutis

    Abstract: Generative models have shown great promise as trajectory planners, given their affinity to modeling complex distributions and guidable inference process. Previous works have successfully applied these in the context of robotic manipulation but perform poorly when the required solution does not exist as a complete trajectory within the training set. We identify that this is a result of being unable…

    Submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  4. arXiv:2505.01396  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

    Authors: Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, Cewu Lu

    Abstract: Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions…

    Submitted 2 May, 2025; originally announced May 2025.

  5. arXiv:2505.01050  [pdf, other]

    cs.CV cs.LG

    Transferable Adversarial Attacks on Black-Box Vision-Language Models

    Authors: Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson

    Abstract: Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehe…

    Submitted 2 May, 2025; originally announced May 2025.

  6. arXiv:2504.21650  [pdf, other]

    cs.CV

    HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

    Authors: Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan

    Abstract: The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue,…

    Submitted 13 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: Project Homepage: https://zhouhyocean.github.io/holotime/ Code: https://github.com/PKU-YuanGroup/HoloTime

  7. arXiv:2504.21024  [pdf, other]

    cs.CL

    WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model

    Authors: Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, Dong Yu

    Abstract: Agent self-improvement, where the backbone Large Language Model (LLM) of the agent is trained on trajectories sampled autonomously based on its own policies, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: 19 pages

  8. arXiv:2504.20536  [pdf, other]

    cs.CR

    Starfish: Rebalancing Multi-Party Off-Chain Payment Channels

    Authors: Minghui Xu, Wenxuan Yu, Guangyong Shang, Guangpeng Qi, Dongliang Duan, Shan Wang, Kun Li, Yue Zhang, Xiuzhen Cheng

    Abstract: Blockchain technology has revolutionized the way transactions are executed, but scalability remains a major challenge. Payment Channel Network (PCN), as a Layer-2 scaling solution, has been proposed to address this issue. However, skewed payments can deplete the balance of one party within a channel, restricting the ability of PCNs to transact through a path and subsequently reducing the transacti…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 17 pages, 10 figures

  9. arXiv:2504.20460  [pdf, ps, other]

    cs.IT

    Sequence Reconstruction under Channels with Multiple Bursts of Insertions or Deletions

    Authors: Zhaojun Lan, Yubo Sun, Wenjun Yu, Gennian Ge

    Abstract: The sequence reconstruction problem involves a model where a sequence is transmitted over several identical channels. This model investigates the minimum number of channels required for the unique reconstruction of the transmitted sequence. Levenshtein established that this number exceeds the maximum size of the intersection between the error balls of any two distinct transmitted sequences by one.…

    Submitted 29 April, 2025; originally announced April 2025.

  10. arXiv:2504.18792  [pdf, other]

    cs.RO

    STDArm: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation

    Authors: Yifan Duan, Heng Li, Yilong Wu, Wenhao Yu, Xinran Zhang, Yedong Shen, Jianmin Ji, Yanyong Zhang

    Abstract: Recent advances in mobile robotic platforms like quadruped robots and drones have spurred a demand for deploying visuomotor policies in increasingly dynamic environments. However, the collection of high-quality training data, the impact of platform motion and processing delays, and limited onboard computing resources pose significant barriers to existing solutions. In this work, we present STDArm,…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: 10 pages, 8 figures, accepted by RSS 2025

  11. arXiv:2504.15302  [pdf, other]

    cs.DC cs.OS

    RAGDoll: Efficient Offloading-based Online RAG System on a Single GPU

    Authors: Weiping Yu, Ningyi Liao, Siqiang Luo, Junfeng Liu

    Abstract: Retrieval-Augmented Generation (RAG) enhances large language model (LLM) generation quality by incorporating relevant external knowledge. However, deploying RAG on consumer-grade platforms is challenging due to limited memory and the increasing scale of both models and knowledge bases. In this work, we introduce RAGDoll, a resource-efficient, self-adaptive RAG serving system integrated with LLMs,…

    Submitted 17 April, 2025; originally announced April 2025.

  12. arXiv:2504.14204  [pdf, other]

    cs.LG cs.AI

    DConAD: A Differencing-based Contrastive Representation Learning Framework for Time Series Anomaly Detection

    Authors: Wenxin Zhang, Xiaojian Lin, Wenjun Yu, Guangzhen Yao, Jingxiang Zhong, Yu Li, Renda Han, Songcheng Xu, Hao Shi, Cuicui Luo

    Abstract: Time series anomaly detection holds notable importance for risk identification and fault detection across diverse application domains. Unsupervised learning methods have become popular because they have no requirement for labels. However, due to the challenges posed by the multiplicity of abnormal patterns, the sparsity of anomalies, and the growth of data scale and complexity, these methods often…

    Submitted 2 May, 2025; v1 submitted 19 April, 2025; originally announced April 2025.

  13. arXiv:2504.13351  [pdf, other]

    cs.RO cs.AI cs.HC cs.LG cs.MM

    Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

    Authors: Chen Wang, Fei Xia, Wenhao Yu, Tingnan Zhang, Ruohan Zhang, C. Karen Liu, Li Fei-Fei, Jie Tan, Jacky Liang

    Abstract: Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the detai…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: ICRA 2025

  14. arXiv:2504.13175  [pdf, other]

    cs.RO

    Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation

    Authors: Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, Jiangmiao Pang

    Abstract: Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Published at Robotics: Science and Systems (RSS) 2025

  15. arXiv:2504.12401  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

    Authors: Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, Qinglin Liu, Wei Yu, Xiaoqian Lv, Lu Yang, Shuigen Wang, Shengping Zhang, Xiangyang Ji, Long Bao, Yuqiang Yang, Jinao Song, Ziyi Wang, Shuang Wen, Heng Sun, Kean Liu, Mingchen Zhong, Senyan Xu, et al. (63 additional authors not shown)

    Abstract: This paper presents an overview of NTIRE 2025, the First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on com…

    Submitted 16 April, 2025; originally announced April 2025.

  16. arXiv:2504.11788  [pdf, other]

    cs.CL cs.AI

    Enhancing Web Agents with Explicit Rollback Mechanisms

    Authors: Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu

    Abstract: With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabli…

    Submitted 16 April, 2025; originally announced April 2025.

  17. arXiv:2504.06027  [pdf, other]

    cs.CV eess.IV

    OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

    Authors: Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li

    Abstract: Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issue, we propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation to eliminate t…

    Submitted 8 April, 2025; originally announced April 2025.

  18. arXiv:2504.04386  [pdf, other]

    cs.IR

    Decoding Recommendation Behaviors of In-Context Learning LLMs Through Gradient Descent

    Authors: Yi Xu, Weicong Qin, Weijie Yu, Ming He, Jianping Fan, Jun Xu

    Abstract: Recently, there has been a growing trend in utilizing large language models (LLMs) for recommender systems, referred to as LLMRec. A notable approach within this trend is not to fine-tune these models directly but instead to leverage In-Context Learning (ICL) methods tailored for LLMRec, denoted as LLM-ICL Rec. Many contemporary techniques focus on harnessing ICL content to enhance LLMRec performa…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 12 pages, 9 figures

  19. arXiv:2504.04062  [pdf, other]

    cs.IR

    QE-RAG: A Robust Retrieval-Augmented Generation Benchmark for Query Entry Errors

    Authors: Kepu Zhang, Zhongxiang Sun, Weijie Yu, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li, Jun Xu

    Abstract: Retrieval-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). While current benchmarks evaluate the performance of RAG methods from various perspectives, they share a common assumption that user queries used for retrieval are error-free. However, in real-world interactions between users and LLMs, query entry errors suc…

    Submitted 5 April, 2025; originally announced April 2025.

  20. arXiv:2504.04042  [pdf, other]

    cs.CL

    SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models

    Authors: Kepu Zhang, Weijie Yu, Zhongxiang Sun, Jun Xu

    Abstract: Syllogistic reasoning is a fundamental aspect of legal decision-making, enabling logical conclusions by connecting general legal principles with specific case facts. Although existing large language models (LLMs) can generate responses to legal questions, they fail to perform explicit syllogistic reasoning, often producing implicit and unstructured answers that lack explainability and trustworthin…

    Submitted 4 April, 2025; originally announced April 2025.

  21. arXiv:2504.02590  [pdf, other]

    cs.CL

    LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning

    Authors: Kepu Zhang, Guofu Xie, Weijie Yu, Mingyue Xu, Xu Tang, Yaxin Li, Jun Xu

    Abstract: The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the…

    Submitted 3 April, 2025; originally announced April 2025.

  22. arXiv:2503.23888  [pdf, other]

    cs.CV cs.AI

    MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

    Authors: Xin Zhang, Siting Huang, Xiangyang Luo, Yifan Xie, Weijiang Yu, Heng Chang, Fei Ma, Fei Yu

    Abstract: Face editing modifies the appearance of a face, which plays a key role in customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, they still face significant challenges as none of them simultaneously fulfills the characteristics of diversity, controllability and flexibility. To address this challenge, we propose MuseFace, a te…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: 6 pages, 5 figures, IEEE International Conference on Multimedia & Expo 2025

  23. arXiv:2503.23434  [pdf, other]

    cs.LG

    Towards Trustworthy GUI Agents: A Survey

    Authors: Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, Ninghao Liu

    Abstract: GUI agents, powered by large foundation models, can interact with digital interfaces, enabling various applications in web automation, mobile navigation, and software testing. However, their increasing autonomy has raised critical concerns about their security, privacy, and safety. This survey examines the trustworthiness of GUI agents in five critical dimensions: security vulnerabilities, reliabi…

    Submitted 30 March, 2025; originally announced March 2025.

    Comments: 10 pages, work in progress

  24. arXiv:2503.23162  [pdf, other]

    cs.CV

    NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations

    Authors: Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, Yuan Liu, Xiaoxiao Long, Wenping Wang, Li Yuan

    Abstract: 3D Gaussian Splatting (3DGS) demonstrates superior quality and rendering speed, but with millions of 3D Gaussians and significant storage and transmission costs. Recent 3DGS compression methods mainly concentrate on compressing Scaffold-GS, achieving impressive performance but with an additional voxel structure and a complex encoding and quantization strategy. In this paper, we aim to develop a si…

    Submitted 29 March, 2025; originally announced March 2025.

    Comments: Project page: https://pku-yuangroup.github.io/NeuralGS/

  25. arXiv:2503.22091  [pdf, other]

    cs.DB

    A Graph-native Optimization Framework for Complex Graph Queries

    Authors: Bingqing Lyu, Xiaoli Zhou, Longbin Lai, Yufan Yang, Yunkai Lou, Wenyuan Yu, Jingren Zhou

    Abstract: This technical report extends the SIGMOD 2025 paper "A Modular Graph-Native Query Optimization Framework" by providing a comprehensive exposition of GOpt's advanced technical mechanisms, implementation strategies, and extended evaluations. While the original paper introduced GOpt's unified intermediate representation (GIR) and demonstrated its performance benefits, this report delves into the fram…

    Submitted 27 March, 2025; originally announced March 2025.

  26. arXiv:2503.21779  [pdf, other]

    cs.CV

    X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

    Authors: Weihao Yu, Yuanhao Cai, Ruyi Zha, Zhiwen Fan, Chenxin Li, Yixuan Yuan

    Abstract: Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a n…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Project Page: https://x2-gaussian.github.io/

  27. arXiv:2503.20314  [pdf, other]

    cs.CV

    Wan: Open and Advanced Large-Scale Video Generative Models

    Authors: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, et al. (37 additional authors not shown)

    Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluat…

    Submitted 18 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 60 pages, 33 figures

  28. arXiv:2503.20290  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

    Authors: Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores a novel perspective to speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduc…

    Submitted 1 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 23 pages, 16 figures

  29. arXiv:2503.20020  [pdf, other]

    cs.RO

    Gemini Robotics: Bringing AI into the Physical World

    Authors: Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose Enrique Chen, Xi Chen, Hao-Tien Lewis Chiang, Krzysztof Choromanski, David D'Ambrosio, Sudeep Dasari, et al. (93 additional authors not shown)

    Abstract: Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Lang…

    Submitted 25 March, 2025; originally announced March 2025.

  30. MM-LINS: a Multi-Map LiDAR-Inertial System for Over-Degenerate Environments

    Authors: Yongxin Ma, Jie Xu, Shenghai Yuan, Tian Zhi, Wenlu Yu, Jun Zhou, Lihua Xie

    Abstract: SLAM plays a crucial role in automation tasks, such as warehouse logistics, healthcare robotics, and restaurant delivery. These scenes come with various challenges, including navigating around crowds of people, dealing with flying plastic bags that can temporarily blind sensors, and addressing reduced LiDAR density caused by cooking smoke. Such scenarios can result in over-degeneracy, causing the…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by IEEE Transactions on Intelligent Vehicles

  31. arXiv:2503.19046  [pdf, other]

    eess.SP cs.IT cs.LG

    Learning Beamforming Codebooks for Active Sensing with Reconfigurable Intelligent Surface

    Authors: Zhongze Zhang, Wei Yu

    Abstract: This paper explores the design of beamforming codebooks for the base station (BS) and for the reconfigurable intelligent surfaces (RISs) in an active sensing scheme for uplink localization, in which the mobile user transmits a sequence of pilots to the BS through reflection at the RISs, and the BS and the RISs are adaptively configured by carefully choosing BS beamforming codeword and RIS codeword…

    Submitted 31 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted in IEEE Transactions on Wireless Communications

  32. arXiv:2503.18923  [pdf, other]

    cs.CV

    Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

    Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang

    Abstract: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work distinguishes from existing video benchmark…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 24 pages

  33. arXiv:2503.16976  [pdf, other]

    cs.CV cs.AI

    GeoT: Geometry-guided Instance-dependent Transition Matrix for Semi-supervised Tooth Point Cloud Segmentation

    Authors: Weihao Yu, Xiaoqing Guo, Chenxin Li, Yifan Liu, Yixuan Yuan

    Abstract: Achieving meticulous segmentation of tooth point clouds from intra-oral scans stands as an indispensable prerequisite for various orthodontic applications. Given the labor-intensive nature of dental annotation, a significant amount of data remains unlabeled, driving increasing interest in semi-supervised approaches. One primary challenge of existing semi-supervised medical segmentation methods lie…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: IPMI2025

  34. arXiv:2503.15234  [pdf, other]

    cs.CV cs.AI

    CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

    Authors: Wenlong Yu, Qilong Wang, Chuang Liu, Dong Li, Qinghua Hu

    Abstract: Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field face challenges in that they are inflexible to automatically construct accurate and sufficient linguistic explanations for global concepts and local circui…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  35. arXiv:2503.14573  [pdf]

    eess.IV cs.CV cs.GR

    Three-dimensional Reconstruction of the Lumbar Spine with Submillimeter Accuracy Using Biplanar X-ray Images

    Authors: Wanxin Yu, Zhemin Zhu, Cong Wang, Yihang Bao, Chunjie Xia, Rongshan Cheng, Yan Yu, Tsung-Yuan Tsai

    Abstract: Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods have low accuracy and fail to meet the clinical application standards. This study developed and validated a fully automated method for high-accuracy 3D reconstructi…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 21 pages, 10 figures, 4 tables

  36. arXiv:2503.13709  [pdf, other]

    cs.LG

    Multi-modal Time Series Analysis: A Tutorial and Survey

    Authors: Yushan Jiang, Kanghui Ning, Zijie Pan, Xuyang Shen, Jingchao Ni, Wenchao Yu, Anderson Schneider, Haifeng Chen, Yuriy Nevmyvaka, Dongjin Song

    Abstract: Multi-modal time series analysis has recently emerged as a prominent research area in data mining, driven by the increasing availability of diverse data modalities, such as text, images, and structured tabular data from real-world sources. However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise. Recent advancements in…

    Submitted 17 March, 2025; originally announced March 2025.

  37. arXiv:2503.12404  [pdf, other]

    cs.CV

    SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation

    Authors: Jianhao Yang, Wenshuo Yu, Yuanchao Lv, Jiance Sun, Bokang Sun, Mingyang Liu

    Abstract: Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation,…

    Submitted 16 March, 2025; originally announced March 2025.

  38. arXiv:2503.10792  [pdf, other]

    cs.LG cs.AI

    Byzantine-Resilient Federated Learning via Distributed Optimization

    Authors: Yufei Xia, Wenrui Yu, Qiongxiu Li

    Abstract: Byzantine attacks present a critical challenge to Federated Learning (FL), where malicious participants can disrupt the training process, degrade model accuracy, and compromise system reliability. Traditional FL frameworks typically rely on aggregation-based protocols for model updates, leaving them vulnerable to sophisticated adversarial strategies. In this paper, we demonstrate that distributed…

    Submitted 13 March, 2025; originally announced March 2025.

  39. Quadratic Transform for Fractional Programming in Signal Processing and Machine Learning

    Authors: Kaiming Shen, Wei Yu

    Abstract: Fractional programming (FP) is a branch of mathematical optimization that deals with the optimization of ratios. It is an invaluable tool for signal processing and machine learning, because many key metrics in these fields are fractionally structured, e.g., the signal-to-interference-plus-noise ratio (SINR) in wireless communications, the Cramér-Rao bound (CRB) in radar sensing, the normalized cut…

    Submitted 14 May, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: 20 pages

    Journal ref: IEEE Signal Processing Magazine 2025

  40. arXiv:2503.07505  [pdf, other]

    cs.LG cs.AI cs.DC

    From Centralized to Decentralized Federated Learning: Theoretical Insights, Privacy Preservation, and Robustness Challenges

    Authors: Qiongxiu Li, Wenrui Yu, Yufei Xia, Jun Pang

    Abstract: Federated Learning (FL) enables collaborative learning without directly sharing individuals' raw data. FL can be implemented in either a centralized (server-based) or decentralized (peer-to-peer) manner. In this survey, we present a novel perspective: the fundamental difference between centralized FL (CFL) and decentralized FL (DFL) is not merely the network topology, but the underlying training p…

    Submitted 10 March, 2025; originally announced March 2025.

  41. arXiv:2503.06312  [pdf, other]

    cs.CV

    GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models

    Authors: Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu

    Abstract: Earth observation (EO) data, collected from diverse sensors with varying imaging principles, present significant challenges in creating unified analytical frameworks. We present GeoLangBind, a novel agglomerative vision-language foundation model that bridges the gap between heterogeneous EO data modalities using language as a unifying medium. Our approach aligns different EO data types into a sha…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: code & weights: https://github.com/xiong-zhitong/GeoLB-SigLIP

  42. arXiv:2503.05035  [pdf, other]

    cs.RO cs.LG

    QuietPaw: Learning Quadrupedal Locomotion with Versatile Noise Preference Alignment

    Authors: Yuyou Zhang, Yihang Yao, Shiqi Liu, Yaru Niu, Changyi Lin, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding Zhao

    Abstract: When operating at their full capacity, quadrupedal robots can produce loud footstep noise, which can be disruptive in human-centered environments like homes, offices, and hospitals. As a result, balancing locomotion performance with noise constraints is crucial for the successful real-world deployment of quadrupedal robots. However, achieving adaptive noise control is challenging due to (a) the tr…

    Submitted 6 March, 2025; originally announced March 2025.

  43. arXiv:2503.02424  [pdf, other]

    cs.CV

    Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection

    Authors: Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, Wenyong Yu

    Abstract: Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  44. arXiv:2503.02375  [pdf, other]

    cs.CV

    mmDEAR: mmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction

    Authors: Jiarui Yang, Songpengcheng Xia, Zengyuan Lai, Lan Sun, Qi Wu, Wenxian Yu, Ling Pei

    Abstract: Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mmWave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mmWave po…

    Submitted 4 March, 2025; originally announced March 2025.

  45. arXiv:2503.01711  [pdf, other]

    cs.IR cs.CL

    MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment

    Authors: Weicong Qin, Yi Xu, Weijie Yu, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu

    Abstract: Personalized product search aims to retrieve and rank items that match users' preferences and search intent. Despite their effectiveness, existing approaches typically assume that users' query fully captures their real motivation. However, our analysis of a real-world e-commerce platform reveals that users often engage in relevant consultations before searching, indicating they refine intents thro…

    Submitted 5 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: added project repository & dataset URL

  46. arXiv:2503.01013  [pdf, other]

    cs.LG

    Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop

    Authors: Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen

    Abstract: Time series analysis provides essential insights for real-world system dynamics and informs downstream decision-making, yet most existing methods often overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (…

    Submitted 2 March, 2025; originally announced March 2025.

  47. arXiv:2502.19779  [pdf, other]

    cs.CL cs.AI cs.LG

    Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

    Authors: Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen

    Abstract: Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cas…

    Submitted 27 February, 2025; originally announced February 2025.

  48. arXiv:2502.18482  [pdf, other]

    cs.CL cs.AI cs.DB cs.IR

    MixLLM: Dynamic Routing in Mixed Large Language Models

    Authors: Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen

    Abstract: Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-o…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: 11 pages, 7 figures, accepted by NAACL 2025 main conference

    MSC Class: N/A

  49. arXiv:2502.16161  [pdf, other]

    cs.CV cs.CL

    OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

    Authors: Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai

    Abstract: Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individu…

    Submitted 22 February, 2025; originally announced February 2025.

  50. arXiv:2502.14133  [pdf, other]

    cs.CL

    Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification

    Authors: Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu

    Abstract: Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to g…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: Pre-print, 15 pages, 4 figures