Showing 1–50 of 2,105 results for author: Zhao, Z

Searching in archive cs.
  1. arXiv:2505.10565  [pdf, other]

    cs.CV

    Depth Anything with Any Prior

    Authors: Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, Zhou Zhao

    Abstract: This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we intr…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Home page: https://prior-depth-anything.github.io/

  2. arXiv:2505.10561  [pdf, other]

    cs.SD eess.AS

    T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback

    Authors: Zehan Wang, Ke Lei, Chen Zhu, Jiawei Huang, Sashuai Zhou, Luping Liu, Xize Cheng, Shengpeng Ji, Zhenhui Ye, Tao Jin, Zhou Zhao

    Abstract: Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance th…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: ACL 2025

  3. arXiv:2505.09558  [pdf, other]

    eess.AS cs.AI cs.LG cs.MM cs.SD

    WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

    Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao

    Abstract: End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT.…

    Submitted 14 May, 2025; originally announced May 2025.

  4. arXiv:2505.09178  [pdf, ps, other]

    cs.CV

    UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System

    Authors: Yitao Zhu, Yuan Yin, Zhenrong Shen, Zihao Zhao, Haiyu Song, Sheng Wang, Dinggang Shen, Qian Wang

    Abstract: The growing complexity and scale of visual model pre-training have made developing and deploying multi-task computer-aided diagnosis (CAD) systems increasingly challenging and resource-intensive. Furthermore, the medical imaging community lacks an open-source CAD platform to enable the rapid creation of efficient and extendable diagnostic models. To address these issues, we propose UniCAD, a unifi…

    Submitted 15 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

    Comments: 14 pages

  5. arXiv:2505.08808  [pdf, ps, other]

    cs.CV cs.AI

    SparseMeXT Unlocking the Potential of Sparse Representations for HD Map Construction

    Authors: Anqing Jiang, Jinhao Chai, Yu Gao, Yiru Wang, Yuwen Heng, Zhigang Sun, Hao Sun, Zezhong Zhao, Li Sun, Jian Zhou, Lijuan Zhu, Shugong Xu, Hao Zhao

    Abstract: Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird's-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations…

    Submitted 11 May, 2025; originally announced May 2025.

  6. arXiv:2505.08725  [pdf, other]

    cs.CV

    Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

    Authors: Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai

    Abstract: Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer fro…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey

  7. arXiv:2505.08601  [pdf, other]

    cs.CV cond-mat.mtrl-sci

    Rejoining fragmented ancient bamboo slips with physics-driven deep learning

    Authors: Jinchi Zhu, Zhou Zhao, Hailong Lei, Xiaoguang Wang, Jialiang Lu, Jing Li, Qianqian Tang, Jiachen Shen, Gui-Song Xia, Bo Du, Yongchao Xu

    Abstract: Bamboo slips are a crucial medium for recording ancient civilizations in East Asia, and offer invaluable archaeological insights for reconstructing the Silk Road, studying material culture exchanges, and global history. However, many excavated bamboo slips have been fragmented into thousands of irregular pieces, making their rejoining a vital yet challenging step for understanding their content.…

    Submitted 13 May, 2025; originally announced May 2025.

  8. arXiv:2505.08199  [pdf, ps, other]

    cs.LG

    A Multi-scale Representation Learning Framework for Long-Term Time Series Forecasting

    Authors: Boshi Gao, Qingjian Ni, Fanbo Ju, Yu Chen, Ziqi Zhao

    Abstract: Long-term time series forecasting (LTSF) offers broad utility in practical settings like energy consumption and weather prediction. Accurately predicting long-term changes, however, is demanding due to the intricate temporal patterns and inherent multi-scale variations within time series. This work confronts key issues in LTSF, including the suboptimal use of multi-granularity information, the neg…

    Submitted 12 May, 2025; originally announced May 2025.

  9. arXiv:2505.08167  [pdf]

    cs.CL cs.AI

    Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

    Authors: Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang

    Abstract: The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method…

    Submitted 13 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

    Comments: 22 pages, 5 figures

  10. arXiv:2505.08148  [pdf, ps, other]

    cs.CR cs.AI cs.CL cs.LG

    A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem

    Authors: Sunday Oyinlola Ogundoyin, Muhammad Ikram, Hassan Jameel Asghar, Benjamin Zi Hao Zhao, Dali Kaafar

    Abstract: Millions of users leverage generative pretrained transformer (GPT)-based language models developed by leading model providers for a wide range of tasks. To support enhanced user interaction and customization, many platforms, such as OpenAI, now enable developers to create and publish tailored model instances, known as custom GPTs, via dedicated repositories or application stores. These custom GPTs e…

    Submitted 12 May, 2025; originally announced May 2025.

  11. arXiv:2505.07920  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Re$^2$: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions

    Authors: Daoze Zhang, Zhijian Bao, Sihang Du, Zhiyi Zhao, Kuangling Zhang, Dezheng Bao, Yang Yang

    Abstract: Peer review is a critical component of scientific progress in fields like AI, but the rapid increase in submission volume has strained the reviewing system, which inevitably leads to reviewer shortages and declining review quality. Besides the growing research popularity, another key factor in this overload is the repeated resubmission of substandard manuscripts, largely due to the lack of effe…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 2 figures, 5 tables

  12. arXiv:2505.07431  [pdf, ps, other]

    cs.IR

    Diffusion-driven SpatioTemporal Graph KANsformer for Medical Examination Recommendation

    Authors: Jianan Li, Yangtao Zhou, Zhifu Zhao, Qinglan Huang, Jian Qi, Xiao He, Hua Chu, Fu Li

    Abstract: Recommendation systems in AI-based medical diagnostics and treatment constitute a critical component of AI in healthcare. Although some studies have explored this area and made notable progress, healthcare recommendation systems remain in their nascent stage. These studies mainly target the treatment process, such as drug or disease recommendations. In addition to the treatment process, the…

    Submitted 12 May, 2025; originally announced May 2025.

  13. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  14. arXiv:2505.06832  [pdf, other]

    cs.RO

    UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms

    Authors: Xueyang Guo, Hongwei Hu, Chengye Song, Jiale Chen, Zilin Zhao, Yu Fu, Bowen Guan, Zhenze Liu

    Abstract: Open-vocabulary, task-oriented grasping of specific functional parts, particularly with dual arms, remains a key challenge, as current Vision-Language Models (VLMs), while enhancing task understanding, often struggle with precise grasp generation within defined constraints and effective dual-arm coordination. We innovatively propose UniDiffGrasp, a unified framework integrating VLM reasoning with…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 8 pages, 5 figures

  15. arXiv:2505.06682  [pdf, other]

    eess.SP cs.AI

    A Short Overview of Multi-Modal Wi-Fi Sensing

    Authors: Zijian Zhao

    Abstract: Wi-Fi sensing has emerged as a significant technology in wireless sensing and Integrated Sensing and Communication (ISAC), offering benefits such as low cost, high penetration, and enhanced privacy. Currently, it is widely utilized in various applications, including action recognition, human localization, and crowd counting. However, Wi-Fi sensing also faces challenges, such as low robustness and…

    Submitted 10 May, 2025; originally announced May 2025.

  16. arXiv:2505.06665  [pdf, other]

    cs.CV

    MultiTaskVIF: Segmentation-oriented visible and infrared image fusion via multi-task learning

    Authors: Zixian Zhao, Andrew Howes, Xingchen Zhang

    Abstract: Visible and infrared image fusion (VIF) has attracted significant attention in recent years. Traditional VIF methods primarily focus on generating fused images with high visual quality, while recent advancements increasingly emphasize incorporating semantic information into the fusion model during training. However, most existing segmentation-oriented VIF methods adopt a cascade structure comprisi…

    Submitted 10 May, 2025; originally announced May 2025.

  17. arXiv:2505.06575  [pdf, other]

    cs.CV

    GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images

    Authors: Chengfeng Wang, Wei Zhai, Yuhang Yang, Yang Cao, Zhengjun Zha

    Abstract: Estimating the geometry level of human-scene contact aims to ground specific contact surface points at 3D human geometries, which provides a spatial prior and bridges the interaction between human and scene, supporting applications such as human behavior analysis, embodied AI, and AR/VR. To complete the task, existing approaches predominantly rely on parametric human models (e.g., SMPL), which est…

    Submitted 10 May, 2025; originally announced May 2025.

  18. arXiv:2505.05119  [pdf, ps, other]

    cs.LG cs.MA

    USPR: Learning a Unified Solver for Profiled Routing

    Authors: Chuanbo Hua, Federico Berto, Zhikai Zhao, Jiwoo Son, Changhyun Kwon, Jinkyoo Park

    Abstract: The Profiled Vehicle Routing Problem (PVRP) extends the classical VRP by incorporating vehicle-client-specific preferences and constraints, reflecting real-world requirements such as zone restrictions and service-level preferences. While recent reinforcement learning (RL) solvers have shown promise, they require retraining for each new profile distribution, suffer from poor representation ability,…

    Submitted 8 May, 2025; originally announced May 2025.

  19. arXiv:2505.04788  [pdf, ps, other]

    cs.CV

    Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

    Authors: Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, Peidong Liu

    Abstract: Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve t…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: Accepted to CVPR 2025 as Award Candidate & Oral Presentation. The first two authors contributed equally to this work. Code: https://github.com/WU-CVGL/GlobustVP

  20. arXiv:2505.04480  [pdf, ps, other]

    cs.AI cs.NE cs.RO

    TrajEvo: Designing Trajectory Prediction Heuristics via LLM-driven Evolution

    Authors: Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park

    Abstract: Trajectory prediction is a crucial task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, lack of explainability, and generalization issues that limit their practical adoption. In this paper, we…

    Submitted 7 May, 2025; originally announced May 2025.

  21. arXiv:2505.04461  [pdf, other]

    cs.LG cs.AI cs.SI

    A Survey on Temporal Interaction Graph Representation Learning: Progress, Challenges, and Opportunities

    Authors: Pengfei Jiao, Hongjiang Chen, Xuan Guo, Zhidong Zhao, Dongxiao He, Di Jin

    Abstract: Temporal interaction graphs (TIGs), defined by sequences of timestamped interaction events, have become ubiquitous in real-world applications due to their capability to model complex dynamic system behaviors. As a result, temporal interaction graph representation learning (TIGRL) has garnered significant attention in recent years. TIGRL aims to embed nodes in TIGs into low-dimensional representati…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: IJCAI 2025 Survey Track

  22. arXiv:2505.03847  [pdf, other]

    cs.SI

    Event-aware analysis of cross-city visitor flows using large language models and social media data

    Authors: Xiaohan Wang, Zhan Zhao, Ruiyu Wang, Yang Xu

    Abstract: Public events, such as music concerts and fireworks displays, can cause irregular surges in cross-city travel demand, leading to potential overcrowding, travel delays, and public safety concerns. To better anticipate and accommodate such demand surges, it is essential to estimate cross-city visitor flows with awareness of public events. Although prior studies typically focused on the effects of a…

    Submitted 5 May, 2025; originally announced May 2025.

  23. arXiv:2505.03797  [pdf, other]

    cs.LG stat.ML

    Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks

    Authors: Andrew Millard, Joshua Murphy, Simon Maskell, Zheng Zhao

    Abstract: Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In thi…

    Submitted 1 May, 2025; originally announced May 2025.

  24. arXiv:2505.03543  [pdf, other]

    cs.IR

    1$^{st}$ Place Solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge

    Authors: Junwei Xu, Zehao Zhao, Xiaoyu Hu, Zhenjie Song

    Abstract: The WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge focuses on effectively applying multimodal embedding features to improve click-through rate (CTR) prediction in recommender systems. This technical report presents our 1$^{st}$ place winning solution for Task 2, combining sequential modeling and feature interaction learning to effectively capture user-item interactions. For multimo…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Technical report for the 1$^{st}$ place solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge

    ACM Class: H.3.1; I.2

  25. arXiv:2505.03422  [pdf, other]

    cs.CV cs.RO

    LiftFeat: 3D Geometry-Aware Local Feature Matching

    Authors: Yepeng Liu, Wenpeng Lai, Zhou Zhao, Yuxuan Xiong, Jinchi Zhu, Jun Cheng, Yongchao Xu

    Abstract: Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted at ICRA 2025

  26. arXiv:2505.02331  [pdf, other]

    cs.CV cs.SD

    VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

    Authors: Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong

    Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced s…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Source code and pre-trained models will be available at https://github.com/MSA-LMC/VAEmo

  27. Tricolore: Multi-Behavior User Profiling for Enhanced Candidate Generation in Recommender Systems

    Authors: Xiao Zhou, Zhongxiang Zhao, Hanze Guo

    Abstract: Online platforms aggregate extensive user feedback across diverse behaviors, providing a rich source for enhancing user engagement. Traditional recommender systems, however, typically optimize for a single target behavior and represent user preferences with a single vector, limiting their ability to handle multiple important behaviors or optimization objectives. This conventional approach also str…

    Submitted 4 May, 2025; originally announced May 2025.

    Journal ref: IEEE Transactions on Knowledge and Data Engineering (TKDE 2025)

  28. arXiv:2505.01934  [pdf, other]

    cs.CV

    GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

    Authors: Yongxin Su, Lin Chen, Kaiting Zhang, Zhongliang Zhao, Chenfeng Hou, Ziping Yu

    Abstract: We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth mod…

    Submitted 3 May, 2025; originally announced May 2025.

  29. arXiv:2505.00848  [pdf, other]

    cs.NI eess.SP eess.SY

    SeLR: Sparsity-enhanced Lagrangian Relaxation for Computation Offloading at the Edge

    Authors: Negar Erfaniantaghvayi, Zhongyuan Zhao, Kevin Chan, Ananthram Swami, Santiago Segarra

    Abstract: This paper introduces a novel computational approach for offloading sensor data processing tasks to servers in edge networks for better accuracy and makespan. A task is assigned one of several offloading options, each comprising a server, a route for uploading data to the server, and a service profile that specifies the performance and resource consumption at the server and in the network. Thi…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 10 pages, 6 figures, submitted to ACM Mobihoc'25

    ACM Class: C.2.1

  30. arXiv:2505.00690  [pdf, other]

    cs.CV cs.AI cs.RO

    Towards Autonomous Micromobility through Scalable Urban Simulation

    Authors: Wayne Wu, Honglin He, Chaoyuan Zhang, Jack He, Seth Z. Zhao, Ran Gong, Quanyi Li, Bolei Zhou

    Abstract: Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstac…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 Highlight. Project page: https://metadriverse.github.io/urban-sim/

  31. arXiv:2504.21721  [pdf, other]

    cs.NI eess.SP eess.SY

    Generalizing Biased Backpressure Routing and Scheduling to Wireless Multi-hop Networks with Advanced Air-interfaces

    Authors: Zhongyuan Zhao, Yujun Ming, Ananthram Swami, Kevin Chan, Fikadu Dagefu, Santiago Segarra

    Abstract: Backpressure (BP) routing and scheduling is a well-established resource allocation method for wireless multi-hop networks, known for its fully distributed operations and proven maximum queue stability. Recent advances in shortest path-biased BP routing (SP-BP) mitigate shortcomings such as slow startup and random walk, but exclusive link-level commodity selection still suffers from the last-packet…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 10 pages, 11 figures, submitted to ACM Mobihoc'25

    MSC Class: 05C12 (Primary) 05-08 (Secondary)

    ACM Class: C.2.2; C.2.1; I.2.11; I.2.6

  32. arXiv:2504.21582  [pdf, other]

    cs.MA cs.AI

    MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework

    Authors: Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, Jun Wang

    Abstract: Simulating collective decision-making involves more than aggregating individual behaviors; it arises from dynamic interactions among individuals. While large language models (LLMs) show promise for social simulation, existing approaches often exhibit deviations from real-world data. To address this gap, we propose the Mean-Field LLM (MF-LLM) framework, which explicitly models the feedback loop bet…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 27 pages, 8 figures, 4 tables

  33. arXiv:2504.21530  [pdf, other]

    cs.RO cs.CV

    RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

    Authors: Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, Zhou Zhao

    Abstract: Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and siz…

    Submitted 30 April, 2025; originally announced April 2025.

  34. arXiv:2504.21266  [pdf, other]

    cs.CV

    CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion

    Authors: Zhifu Zhao, Hanyang Hua, Jianan Li, Shaoxin Wu, Fu Li, Yangtao Zhou, Yang Li

    Abstract: In action recognition tasks, feature diversity is essential for enhancing model generalization and performance. Existing methods typically promote feature diversity by expanding the training data in the sample space, which often leads to inefficiencies and semantic inconsistencies. To overcome these problems, we propose a novel Coarse-fine text co-guidance Diffusion model (CoCoDiff). CoCoDiff gene…

    Submitted 29 April, 2025; originally announced April 2025.

  35. arXiv:2504.21214  [pdf, other]

    cs.CL cs.AI eess.AS

    Pretraining Large Brain Language Model for Active BCI: Silent Speech

    Authors: Jinzhao Zhou, Zehong Cao, Yiqun Duan, Connor Barkley, Daniel Leong, Xiaowei Jiang, Quoc-Toan Nguyen, Ziyi Zhao, Thomas Do, Yu-Cheng Chang, Sheng-Fu Liang, Chin-teng Lin

    Abstract: This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the re…

    Submitted 3 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

  36. arXiv:2504.20630  [pdf, other]

    eess.AS cs.MM cs.SD

    ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

    Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao

    Abstract: Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the…

    Submitted 29 April, 2025; originally announced April 2025.

  37. arXiv:2504.19062  [pdf, other]

    eess.AS cs.CL cs.SD

    Versatile Framework for Song Generation with Prompt-based Control

    Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao

    Abstract: Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, ali…

    Submitted 29 April, 2025; v1 submitted 26 April, 2025; originally announced April 2025.

  38. arXiv:2504.18608  [pdf, other]

    cs.CR

    ECG Identity Authentication in Open-set with Multi-model Pretraining and Self-constraint Center & Irrelevant Sample Repulsion Learning

    Authors: Mingyu Dong, Zhidong Zhao, Hao Wang, Yefei Zhang, Yanjun Deng

    Abstract: The electrocardiogram (ECG) signal exhibits inherent uniqueness, making it a promising biometric modality for identity authentication. As a result, ECG authentication has gained increasing attention in recent years. However, most existing methods focus primarily on improving authentication accuracy within closed-set settings, with limited research addressing the challenges posed by open-set scenarios.…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: 10 pages

  39. arXiv:2504.18249  [pdf, other]

    cs.CV cs.AI cs.LG

    Event-Based Eye Tracking. 2025 Event-based Vision Workshop

    Authors: Qinyu Chen, Chang Gao, Min Liu, Daniele Perrone, Yan Ru Pei, Zuowen Wang, Zhuo Zou, Shihang Tan, Tao Han, Guorui Lu, Zhen Xu, Junyuan Ding, Ziteng Wang, Zongwei Wu, Han Han, Yuliang Wu, Jinze Chen, Wei Zhai, Yang Cao, Zheng-jun Zha, Nuwan Bandara, Thivya Kandappu, Archan Misra, Xiaopeng Lin, Hongxiang Huang, et al. (7 additional authors not shown)

    Abstract: This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing eye movement recorded with an event camera. We review and summarize the innovative methods from the top-ranked teams in the challenge to advance future event-based eye tracking research.…

    Submitted 25 April, 2025; originally announced April 2025.

  40. arXiv:2504.17782  [pdf, other]

    cs.SD cs.LG

    Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

    Authors: Xize Cheng, Slytherin Wang, Zehan Wang, Rongjie Huang, Tao Jin, Zhou Zhao

    Abstract: Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep,…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: Work in Progress

  41. arXiv:2504.16786  [pdf, other]

    cs.CL cs.LG

    MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

    Authors: Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, Bailin Na

    Abstract: Recent advances in large language models have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performanc…

    Submitted 23 April, 2025; originally announced April 2025.

  42. arXiv:2504.16516  [pdf, other]

    cs.CV cs.AI

    Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation

    Authors: Junrong Yue, Yifan Zhang, Chuan Qin, Bo Li, Xiaomin Lie, Xinlei Yu, Wenxin Zhang, Zhendong Zhao

    Abstract: Vision-and-Language Navigation (VLN) aims to enable embodied agents to follow natural language instructions and reach target locations in real-world environments. While prior methods often rely on either global scene representations or object-level features, these approaches are insufficient for capturing the complex interactions across modalities required for accurate navigation. In this paper, w…

    Submitted 24 April, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

    Comments: 11 pages, 4 figures, Submitted to ACM MM 2025

  43. arXiv:2504.16464  [pdf, other]

    cs.RO cs.AI

    ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

    Authors: Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang

    Abstract: While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional in…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 9 pages, 3 figures

  44. arXiv:2504.14906  [pdf, other]

    eess.AS cs.CV cs.SD

    OmniAudio: Generating Spatial Audio from 360-Degree Video

    Authors: Huadai Liu, Tianyi Luo, Qikai Jiang, Kaicheng Luo, Peiwen Sun, Jialei Wan, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

    Abstract: Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a stan…

    Submitted 11 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: ICML 2025

  45. arXiv:2504.14820  [pdf, other]

    cs.RO

    Accelerating Visual Reinforcement Learning with Separate Primitive Policy for Peg-in-Hole Tasks

    Authors: Zichun Xu, Zhaomin Wang, Yuntao Li, Lei Zhuang, Zhiyuan Zhao, Guocai Yang, Jingdong Zhao

    Abstract: For peg-in-hole tasks, humans rely on binocular visual perception to locate the peg above the hole surface and then proceed with insertion. This paper draws insights from this behavior to enable agents to learn efficient assembly strategies through visual reinforcement learning. Hence, we propose a Separate Primitive Policy (S2P) to simultaneously learn how to derive location and insertion actions…

    Submitted 20 April, 2025; originally announced April 2025.

  46. arXiv:2504.14466  [pdf, other]

    cs.ET

    A Bio-inspired Asymmetric Double-Gate Ferroelectric FET for Emulating Astrocyte and Dendrite Dynamics in Neuromorphic Systems

    Authors: Zhouhang Jiang, A N M Nafiul Islam, Zhuangyu Han, Zijian Zhao, Franz Müller, Jiahui Duan, Halid Mulaosmanovic, Stefan Dünkel, Sven Beyer, Sourav Dutta, Vijaykrishnan Narayanan, Thomas Kämpfe, Suma George Cardwell, Frances Chance, Abhronil Sengupta, Kai Ni

    Abstract: Neuromorphic systems seek to replicate the functionalities of biological neural networks to attain significant improvements in performance and efficiency of AI computing platforms. However, these systems have generally remained limited to emulation of simple neurons and synapses; and ignored higher order functionalities enabled by other components of the brain like astrocytes and dendrites. In thi…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: 37 pages, 6 figures, 2 tables

  47. arXiv:2504.14282  [pdf, other]

    cs.AI cs.LG

    CHAINSFORMER: Numerical Reasoning on Knowledge Graphs from a Chain Perspective

    Authors: Ze Zhao, Bin Lu, Xiaoying Gan, Gu Tang, Luoyi Fu, Xinbing Wang

    Abstract: Reasoning over Knowledge Graphs (KGs) plays a pivotal role in knowledge graph completion or question answering systems, providing richer and more accurate triples and attributes. As numerical attributes become increasingly essential in characterizing entities and relations in KGs, the ability to reason over these attributes has gained significant importance. Existing graph-based methods such as Gr…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: Accepted to ICDE 2025

  48. arXiv:2504.13788  [pdf, other]

    cs.CV

    RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud Completion

    Authors: Yixuan Yang, Jinyu Yang, Zixiang Zhao, Victor Sanchez, Feng Zheng

    Abstract: The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clo…

    Submitted 18 April, 2025; originally announced April 2025.

  49. arXiv:2504.13482  [pdf, other]

    cs.IR

    Improving Sequential Recommenders through Counterfactual Augmentation of System Exposure

    Authors: Ziqi Zhao, Zhaochun Ren, Jiyuan Yang, Zuming Yan, Zihan Wang, Liu Yang, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Xin Xin

    Abstract: In sequential recommendation (SR), system exposure refers to items that are exposed to the user. Typically, only a few of the exposed items would be interacted with by the user. Although SR has achieved great success in predicting future user interests, existing SR methods still fail to fully exploit system exposure data. Most methods only model items that have been interacted with, while the larg…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: Accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)

  50. arXiv:2504.12908  [pdf, other]

    cs.RO cs.CV

    Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation

    Authors: Yuyang Li, Wenxin Du, Chang Yu, Puhao Li, Zihang Zhao, Tengyu Liu, Chenfanfu Jiang, Yixin Zhu, Siyuan Huang

    Abstract: Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. VBTSs have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors' complex physical characteristics and visual signal processing requirements present unique chal…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: 17 pages, 7 figures