Showing 1–50 of 4,148 results for author: Zhang, S

  1. arXiv:2507.06138  [pdf, ps, other]

    cs.CL cs.AI

    Coding Triangle: How Does Large Language Model Understand Code?

    Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

    Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we re…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.06000  [pdf, ps, other]

    cs.HC

    Exploring Collaboration Patterns and Strategies in Human-AI Co-creation through the Lens of Agency: A Scoping Review of the Top-tier HCI Literature

    Authors: Shuning Zhang, Hui Wang, Xin Yi

    Abstract: As Artificial Intelligence (AI) increasingly becomes an active collaborator in co-creation, understanding the distribution and dynamic of agency is paramount. The Human-Computer Interaction (HCI) perspective is crucial for this analysis, as it uniquely reveals the interaction dynamics and specific control mechanisms that dictate how agency manifests in practice. Despite this importance, a systemat…

    Submitted 8 July, 2025; originally announced July 2025.

  3. arXiv:2507.05911  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Differentiable Reward Optimization for LLM based TTS system

    Authors: Changfeng Gao, Zhihao Du, Shiliang Zhang

    Abstract: This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language model-based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furth…

    Submitted 8 July, 2025; originally announced July 2025.

  4. arXiv:2507.05685  [pdf, ps, other]

    cs.LG cs.AI

    Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach

    Authors: Xiaobing Chen, Boyang Zhang, Xiangwei Zhou, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Geoffrey Ye Li

    Abstract: The integration of Federated Learning (FL) and Mixture-of-Experts (MoE) presents a compelling pathway for training more powerful, large-scale artificial intelligence models (LAMs) on decentralized data while preserving privacy. However, efficient federated training of these complex MoE-structured LAMs is hindered by significant system-level challenges, particularly in managing the interplay betwee…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 7 pages

  5. arXiv:2507.05661  [pdf]

    cs.RO cs.CV

    3DGS_LSR:Large_Scale Relocation for Autonomous Driving Based on 3D Gaussian Splatting

    Authors: Haitao Lu, Haijier Chen, Haoze Liu, Shoujian Zhang, Bo Xu, Ziao Liu

    Abstract: In autonomous robotic systems, precise localization is a prerequisite for safe navigation. However, in complex urban environments, GNSS positioning often suffers from signal occlusion and multipath effects, leading to unreliable absolute positioning. Traditional mapping approaches are constrained by storage requirements and computational inefficiency, limiting their applicability to resource-const…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 13 pages, 7 figures, 4 tables

  6. arXiv:2507.05621  [pdf, ps, other]

    cs.CV cs.MM

    AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework

    Authors: Suoxiang Zhang, Xiaxi Li, Hongrui Chang, Zhuoyan Hou, Guoxin Wu, Ronghua Ji

    Abstract: Domain-specific image generation aims to produce high-quality visual content for specialized fields while ensuring semantic accuracy and detail fidelity. However, existing methods exhibit two critical limitations: First, current approaches address prompt engineering and model adaptation separately, overlooking the inherent dependence between semantic understanding and visual representation in spec…

    Submitted 7 July, 2025; originally announced July 2025.

  7. arXiv:2507.05403  [pdf, ps, other]

    cs.DB

    PBE Meets LLM: When Few Examples Aren't Few-Shot Enough

    Authors: Shuning Zhang, Yongjoo Park

    Abstract: Large language models (LLMs) can generate code from natural language descriptions. Their performance is typically evaluated using programming benchmarks that simulate real-world tasks. These benchmarks provide specifications in the form of docstrings, function signatures, or bug reports. The model then generates a program, which is tested against predefined test cases. In contrast, Programming by…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 7 pages, 5 figures, accepted by VLDB QDB'25 workshop

  8. arXiv:2507.05216  [pdf, ps, other]

    cs.LG cs.CY stat.AP stat.ML

    Bridging Prediction and Intervention Problems in Social Systems

    Authors: Lydia T. Liu, Inioluwa Deborah Raji, Angela Zhou, Luke Guerdan, Jessica Hullman, Daniel Malinsky, Bryan Wilder, Simone Zhang, Hammaad Adam, Amanda Coston, Ben Laufer, Ezinne Nwankwo, Michael Zanger-Tishler, Eli Ben-Michael, Solon Barocas, Avi Feller, Marissa Gerchick, Talia Gillis, Shion Guha, Daniel Ho, Lily Hu, Kosuke Imai, Sayash Kapoor, Joshua Loftus, Razieh Nabi, et al. (10 additional authors not shown)

    Abstract: Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in…

    Submitted 7 July, 2025; originally announced July 2025.

  9. arXiv:2507.04671  [pdf, ps, other]

    cs.LG cs.CV

    DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation

    Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao

    Abstract: Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Ar…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Accepted by IJCAI 2025

  10. arXiv:2507.04651  [pdf, ps, other]

    cs.IR

    FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation

    Authors: Maolin Wang, Yutian Xiao, Binhao Wang, Sheng Zhang, Shanshan Ye, Wanyu Wang, Hongzhi Yin, Ruocheng Guo, Zenglin Xu

    Abstract: Modern recommendation systems face significant challenges in processing multimodal sequential data, particularly in temporal dynamics modeling and information flow coordination. Traditional approaches struggle with distribution discrepancies between heterogeneous features and noise interference in multimodal signals. We propose FindRec (Flexible unified information …

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Accepted by KDD 2025

  11. arXiv:2507.04239  [pdf, ps, other]

    cs.LG cs.AI

    Scaling Context Requires Rethinking Attention

    Authors: Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach

    Abstract: We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we…

    Submitted 6 July, 2025; originally announced July 2025.

  12. arXiv:2507.04171  [pdf, ps, other]

    cs.ET cs.CY

    2024 NSF CSSI-Cybertraining-SCIPE PI Meeting August 12 to 13, 2024, Charlotte, NC

    Authors: Abani Patra, Mary Thomas, Elias Bou-Harb, Jeffrey Carver, Yuebin Guo, Ratnesh Kumar, Julien Langou, Guoyu Lu, Vivak Patel, Marianna Safronova, Isla Simpson, Dhruva Chakravorty, Jane Combs, Hantao Cui, Sushil Prasad, Adnan Rajib, Susan Rathbun, Erik Saule, Isla Simpson, Alan Sussman, Shaowen Wang, Sarina Zhe Zhang, Ben Brown, Varun Chandola, Daniel Crawford, et al. (10 additional authors not shown)

    Abstract: The second annual NSF, OAC CSSI, CyberTraining and related programs PI meeting was held August 12 to 13 in Charlotte, NC, with participation from PIs or representatives of all major awards. Keynotes, panels, breakouts, and poster sessions allowed PIs to engage with each other, NSF staff, and invited experts. The 286 attendees represented 292 awards across CSSI, CyberTraining, OAC Core, CIP, SCIPE…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Annual NSF PI meeting; contains summaries of meetings and breakout sessions, lists of participants, links to presented posters on figshare

  13. arXiv:2507.04147  [pdf, ps, other]

    cs.GR cs.CV cs.DC

    A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality

    Authors: Shuo Xin, Haiyu Wang, Sai Qian Zhang

    Abstract: Virtual reality (VR) significantly transforms immersive digital interfaces, greatly enhancing education, professional practices, and entertainment by increasing user engagement and opening up new possibilities in various industries. Among its numerous applications, image rendering is crucial. Nevertheless, rendering methodologies like 3D Gaussian Splatting impose high computational demands, driven…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: ACM International Conference on Supercomputing 2025

  14. arXiv:2507.03938  [pdf, ps, other]

    cs.CV cs.RO

    VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision

    Authors: Kezhong Liu, Yiwen Zhou, Mozi Chen, Jianhua He, Jingao Xu, Zheng Yang, Chris Xiaoxuan Lu, Shengkai Zhang

    Abstract: This work proposes a mmWave radar's scene flow estimation framework supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are mor…

    Submitted 5 July, 2025; originally announced July 2025.

  15. arXiv:2507.03930  [pdf, ps, other]

    cs.RO

    RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot

    Authors: Liang Heng, Xiaoqi Li, Shangqing Mao, Jiaming Liu, Ruolin Liu, Jingli Wei, Yu-Kai Wang, Yueru Jia, Chenyang Gu, Rui Zhao, Shanghang Zhang, Hao Dong

    Abstract: Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robot…

    Submitted 7 July, 2025; v1 submitted 5 July, 2025; originally announced July 2025.

  16. OpenSN: An Open Source Library for Emulating LEO Satellite Networks

    Authors: Wenhao Lu, Zhiyuan Wang, Hefan Zhang, Shan Zhang, Hongbin Luo

    Abstract: Low-earth-orbit (LEO) satellite constellations (e.g., Starlink) are becoming a necessary component of the future Internet. There have been increasing studies on LEO satellite networking. How to evaluate these studies in a systematic and reproducible manner is a crucial problem. In this paper, we present OpenSN, i.e., an open source library for emulating large-scale satellite networks (SNs). Different…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 17 pages

    Journal ref: IEEE Transactions on Parallel and Distributed Systems (TPDS), 2025

  17. arXiv:2507.02345  [pdf, ps, other]

    q-bio.BM cs.AI

    HelixDesign-Antibody: A Scalable Production-Grade Platform for Antibody Design Built on HelixFold3

    Authors: Jie Gao, Jing Hu, Shanzhuo Zhang, Kunrui Zhu, Sheng Qian, Yueyang Huang, Xiaonan Zhang, Xiaomin Fang

    Abstract: Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce a production-grade, high-throughput platform built on HelixFold3, HelixDesign-Antibody, which utilizes the high-accuracy structure prediction mo…

    Submitted 3 July, 2025; originally announced July 2025.

  18. arXiv:2507.02029  [pdf, ps, other]

    cs.RO

    RoboBrain 2.0 Technical Report

    Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Mengfei Du, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Junkai Zhao, Xiaojie Zhang, Shanyu Rong, Huaihai Lyu, Zhengliang Cai, et al. (26 additional authors not shown)

    Abstract: We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain…

    Submitted 5 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

  19. arXiv:2507.01961  [pdf, ps, other]

    cs.RO cs.AI

    AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

    Authors: Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

    Abstract: Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumu…

    Submitted 5 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: Project website: https://ac-dit.github.io/

  20. arXiv:2507.01951  [pdf, ps, other]

    cs.LG cs.CL

    Test-Time Scaling with Reflective Generative Model

    Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

    Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra pr…

    Submitted 2 July, 2025; originally announced July 2025.

  21. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  22. arXiv:2507.01701  [pdf, ps, other]

    cs.MA cs.AI

    Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

    Authors: Bochen Han, Songmao Zhang

    Abstract: In this paper, we propose to incorporate the blackboard architecture into LLM multi-agent systems (MASs) so that (1) agents with various roles can share all the information and others' messages during the whole problem-solving process, (2) agents that will take actions are selected based on the current content of the blackboard, and (3) the selection and execution round is repeated until a consens…

    Submitted 2 July, 2025; originally announced July 2025.

  23. arXiv:2507.01663  [pdf, ps, other]

    cs.LG cs.AI

    AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

    Authors: Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, Jianping Wu

    Abstract: Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled w…

    Submitted 2 July, 2025; originally announced July 2025.

  24. arXiv:2507.01603  [pdf, ps, other]

    cs.CV

    DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

    Authors: Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang

    Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods r…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  25. arXiv:2507.01467  [pdf, ps, other]

    cs.CV

    Representation Entanglement for Generation:Training Diffusion Transformers Is Much Easier Than You Think

    Authors: Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li

    Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully…

    Submitted 2 July, 2025; originally announced July 2025.

  26. arXiv:2507.01111  [pdf, ps, other]

    cs.RO cs.HC

    Environment-Aware and Human-Cooperative Swing Control for Lower-Limb Prostheses in Diverse Obstacle Scenarios

    Authors: Haosen Xing, Haoran Ma, Sijin Zhang, Hartmut Geyer

    Abstract: Current control strategies for powered lower limb prostheses often lack awareness of the environment and the user's intended interactions with it. This limitation becomes particularly apparent in complex terrains. Obstacle negotiation, a critical scenario exemplifying such challenges, requires both real-time perception of obstacle geometry and responsiveness to user intention about when and where…

    Submitted 1 July, 2025; originally announced July 2025.

  27. arXiv:2507.01041  [pdf, ps, other]

    cs.LG cs.AI

    Fast AI Model Splitting over Edge Networks

    Authors: Zuguang Li, Wen Wu, Shaohua Wu, Songge Zhang, Ye Wang, Xuemin Shen

    Abstract: Split learning (SL) has emerged as a computationally efficient approach for artificial intelligence (AI) model training, which can alleviate device-side computational workloads. However, complex AI model architectures pose high computational complexity to obtain the optimal model splitting. In this paper, we represent an arbitrary AI model as a directed acyclic graph (DAG), and then reformulate th…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced July 2025.

    Comments: 13 pages, 14 figures

  28. arXiv:2507.00902  [pdf, ps, other]

    eess.SY cs.AI eess.SP

    Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks

    Authors: Feng Wang, Shengyu Zhang, Een-Kee Hong, Tony Q. S. Quek

    Abstract: Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, e…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: To appear in IEEE Communications Magazine

  29. arXiv:2507.00880  [pdf, ps, other]

    cs.LG cs.AI

    NN-Former: Rethinking Graph Structure in Neural Architecture Representation

    Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

    Abstract: The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capabilities to represent compli…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to CVPR 2025. Code is available at https://github.com/XuRuihan/NNFormer

  30. arXiv:2507.00606  [pdf, ps, other]

    cs.CL cs.AI

    Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

    Authors: Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang

    Abstract: Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning with…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  31. arXiv:2507.00566  [pdf, ps, other]

    cs.CV

    Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment

    Authors: Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu

    Abstract: Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and the…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: This paper is accepted by IEEE TIP 2025. Code is publicly available at https://github.com/kaai520/PGFA

  32. arXiv:2507.00479  [pdf, ps, other]

    cs.IR

    On Mitigating Data Sparsity in Conversational Recommender Systems

    Authors: Sixiao Zhang, Mingrui Liu, Cheng Long, Wei Yuan, Hongxu Chen, Xiangyu Zhao, Hongzhi Yin

    Abstract: Conversational recommender systems (CRSs) capture user preference through textual information in dialogues. However, they suffer from data sparsity on two fronts: the dialogue space is vast and linguistically diverse, while the item space exhibits long-tail and sparse distributions. Existing methods struggle with (1) generalizing to varied dialogue expressions due to underutilization of rich textu…

    Submitted 1 July, 2025; originally announced July 2025.

  33. arXiv:2507.00166  [pdf, ps, other]

    cs.RO

    Novel Design of 3D Printed Tumbling Microrobots for in vivo Targeted Drug Delivery

    Authors: Aaron C. Davis, Siting Zhang, Adalyn Meeks, Diya Sakhrani, Luis Carlos Sanjuan Acosta, D. Ethan Kelley, Emma Caldwell, Luis Solorio, Craig J. Goergen, David J. Cappelleri

    Abstract: This paper presents innovative designs for 3D-printed tumbling microrobots, specifically engineered for targeted in vivo drug delivery applications. The microrobot designs, created using stereolithography 3D printing technologies, incorporate permanent micro-magnets to enable actuation via a rotating magnetic field actuator system. The experimental framework encompasses a series of locomotion char…

    Submitted 30 June, 2025; originally announced July 2025.

  34. arXiv:2506.23690  [pdf, ps, other]

    cs.CV

    SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

    Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen

    Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce vis…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Project page: https://lucaria-academy.github.io/SynMotion/

  35. arXiv:2506.23667  [pdf, ps, other]

    cs.CL

    L0: Reinforcement Learning to Become General Agents

    Authors: Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie

    Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying rei…

    Submitted 30 June, 2025; originally announced June 2025.

  36. arXiv:2506.23623  [pdf, ps, other]

    cs.CV

    Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

    Authors: Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang

    Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened de…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by CVPR 2025; Code: https://github.com/spyflying/VCT_AVS; Models: https://huggingface.co/nowherespyfly/VCT_AVS

  37. arXiv:2506.23618  [pdf, ps, other]

    cs.CV

    TurboVSR: Fantastic Video Upscalers and Where to Find Them

    Authors: Zhongdao Wang, Guodongfang Zhao, Jingjing Ren, Bailan Feng, Shifeng Zhang, Wenbo Li

    Abstract: Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper,…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: ICCV, 2025

  38. arXiv:2506.23607  [pdf, ps, other]

    cs.CV

    PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum

    Authors: Shiqi Zhang, Sha Zhang, Jiajun Deng, Yedong Shen, Mingxiao MA, Yanyong Zhang

    Abstract: Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits…

    Submitted 30 June, 2025; originally announced June 2025.

  39. arXiv:2506.23601  [pdf, ps, other]

    cs.CL cs.AI

    Semantic-guided Diverse Decoding for Large Language Model

    Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

    Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or…

    Submitted 30 June, 2025; originally announced June 2025.

  40. arXiv:2506.23565  [pdf, ps, other]

    cs.CV

    OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving

    Authors: Mingqian Ji, Jian Yang, Shanshan Zhang

    Abstract: Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV2025

  41. arXiv:2506.23351  [pdf, ps, other]

    cs.RO cs.AI cs.LG cs.MA

    Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

    Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, et al. (74 additional authors not shown)

    Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad…

    Submitted 2 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

    Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

  42. arXiv:2506.23236  [pdf, ps, other]

    cs.CV cs.AI

    VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

    Authors: Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, Siyu Tang

    Abstract: Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point cloud…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: [ICCV 2025] https://markomih.github.io/VolumetricSMPL

  43. arXiv:2506.23077  [pdf, ps, other]

    cs.CV

    Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization

    Authors: Suofei Zhang, Xinxin Wang, Xiaofu Wu, Quan Zhou, Haifeng Hu

    Abstract: Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Dist…

    Submitted 28 June, 2025; originally announced June 2025.

  44. arXiv:2506.23062  [pdf, ps, other]

    math.PR cs.DS math.AP math.NA math.ST

    Shifted Composition IV: Underdamped Langevin and Numerical Discretizations with Partial Acceleration

    Authors: Jason M. Altschuler, Sinho Chewi, Matthew S. Zhang

    Abstract: Quantifying the convergence rate of the underdamped Langevin dynamics (ULD) is a classical topic, in large part due to the possibility for diffusive-to-ballistic speedups -- as was recently established for the continuous-time dynamics via space-time Poincare inequalities. A central challenge for analyzing ULD is that its degeneracy necessitates the development of new analysis approaches, e.g., the…

    Submitted 28 June, 2025; originally announced June 2025.

  45. arXiv:2506.22788  [pdf, ps, other]

    cs.RO

    SPI-BoTER: Error Compensation for Industrial Robots via Sparse Attention Masking and Hybrid Loss with Spatial-Physical Information

    Authors: Xuao Hou, Yongquan Jia, Shijin Zhang, Yuqiang Wu

    Abstract: The widespread application of industrial robots in fields such as cutting and welding has imposed increasingly stringent requirements on the trajectory accuracy of end-effectors. However, current error compensation methods face several critical challenges, including overly simplified mechanism modeling, a lack of physical consistency in data-driven approaches, and substantial data requirements. Th…

    Submitted 28 June, 2025; originally announced June 2025.

  46. arXiv:2506.22716  [pdf, ps, other]

    cs.LG cs.AI cs.CL cs.DB

    BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute

    Authors: Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, Victor Rühle

    Abstract: Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired trade-off. Prior query routing approaches generate only one response from the selected model and a single response from a small (inexpensive) model was often not good enough to beat a…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: Accepted to ICML 2025 (main conference)

  47. arXiv:2506.22139  [pdf, ps, other]

    cs.CV

    Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

    Authors: Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this pap…

    Submitted 7 July, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

    Comments: Accepted at ICCV 2025

  48. arXiv:2506.21669  [pdf, ps, other]

    cs.AI

    SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

    Authors: Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Yulin Luo, Junyu Lu, Chunkai Fan, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang

    Abstract: Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexp…

    Submitted 26 June, 2025; originally announced June 2025.

  49. arXiv:2506.21611  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Does Multimodality Lead to Better Time Series Forecasting?

    Authors: Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang

    Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and…

    Submitted 20 June, 2025; originally announced June 2025.

  50. arXiv:2506.20981  [pdf, ps, other]

    cs.CR

    PrivacyGo: Privacy-Preserving Ad Measurement with Multidimensional Intersection

    Authors: Jian Du, Haohao Qian, Shikun Zhang, Wen-jie Lu, Donghang Lu, Yongchuan Niu, Bo Jiang, Yongjun Zhao, Qiang Yan

    Abstract: This paper tackles the challenging and practical problem of multi-identifier private user profile matching for privacy-preserving ad measurement, a cornerstone of modern advertising analytics. We introduce a comprehensive cryptographic framework leveraging reversed Oblivious Pseudorandom Functions (OPRF) and novel blind key rotation techniques to support secure matching across multiple identifiers…

    Submitted 25 June, 2025; originally announced June 2025.