Showing 1–50 of 6,682 results for author: Zhang, Z

Searching in archive cs.
  1. arXiv:2507.01957  [pdf, ps, other]

    cs.CV cs.AI

    Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

    Authors: Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

    Abstract: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To a…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: The first two authors contributed equally to this work

  2. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  3. arXiv:2507.01926  [pdf, ps, other]

    cs.CV

    IC-Custom: Diverse Image Customization via In-Context Learning

    Authors: Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan

    Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome t…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Project page: https://liyaowei-stu.github.io/project/IC_Custom

  4. arXiv:2507.01721  [pdf, ps, other]

    cs.CV

    Soft Self-labeling and Potts Relaxations for Weakly-Supervised Segmentation

    Authors: Zhongwen Zhang, Yuri Boykov

    Abstract: We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While WSSS methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by intr…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: published at CVPR 2025

  5. arXiv:2507.01389  [pdf, ps, other]

    cs.LG quant-ph

    Surrogate Modeling via Factorization Machine and Ising Model with Enhanced Higher-Order Interaction Learning

    Authors: Anbang Wang, Dunbo Cai, Yu Zhang, Yangqing Huang, Xiangyang Feng, Zhihong Zhang

    Abstract: Recently, a surrogate model was proposed that employs a factorization machine to approximate the underlying input-output mapping of the original system, with quantum annealing used to optimize the resulting surrogate function. Inspired by this approach, we propose an enhanced surrogate model that incorporates additional slack variables into both the factorization machine and its associated Ising r…

    Submitted 2 July, 2025; originally announced July 2025.

  6. arXiv:2507.01337  [pdf, ps, other]

    cs.IT eess.SP

    Dynamical Multimodal Fusion with Mixture-of-Experts for Localizations

    Authors: Bohao Wang, Zitao Shuai, Fenghao Zhu, Chongwen Huang, Yongliang Shen, Zhaoyang Zhang, Qianqian Yang, Sami Muhaidat, Merouane Debbah

    Abstract: Multimodal fingerprinting is a crucial technique to sub-meter 6G integrated sensing and communications (ISAC) localization, but two hurdles block deployment: (i) the contribution each modality makes to the target position varies with the operating conditions such as carrier frequency, and (ii) spatial and fingerprint ambiguities markedly undermine localization accuracy, especially in non-line-of-s…

    Submitted 1 July, 2025; originally announced July 2025.

  7. arXiv:2507.01161  [pdf, ps, other]

    eess.SY cs.RO

    Imitation Learning for Satellite Attitude Control under Unknown Perturbations

    Authors: Zhizhuo Zhang, Hao Peng, Xiaoli Bai

    Abstract: This paper presents a novel satellite attitude control framework that integrates Soft Actor-Critic (SAC) reinforcement learning with Generative Adversarial Imitation Learning (GAIL) to achieve robust performance under various unknown perturbations. Traditional control techniques often rely on precise system models and are sensitive to parameter uncertainties and external perturbations. To overcome…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 2025 AAS/AIAA Astrodynamics Specialist Conference

  8. arXiv:2507.01066  [pdf]

    cs.IR cs.CV cs.LG

    Embedding-based Retrieval in Multimodal Content Moderation

    Authors: Hanzhong Liang, Jinghao Shi, Xiang Shen, Zixuan Wang, Vera Wen, Ardalan Mehrani, Zhiqian Chen, Yifan Wu, Zhixin Zhang

    Abstract: Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: Camera ready for SIGIR 2025

  9. arXiv:2507.01050  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

    Authors: Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li

    Abstract: The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and rob…

    Submitted 23 June, 2025; originally announced July 2025.

  10. arXiv:2507.00917  [pdf, ps, other]

    cs.RO

    A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

    Authors: Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai

    Abstract: The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interacti…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey

  11. arXiv:2507.00755  [pdf]

    eess.AS cs.AI cs.SD

    LearnAFE: Circuit-Algorithm Co-design Framework for Learnable Audio Analog Front-End

    Authors: Jinhai Hu, Zhongyi Zhang, Cong Sheng Leow, Wang Ling Goh, Yuan Gao

    Abstract: This paper presents a circuit-algorithm co-design framework for learnable analog front-end (AFE) in audio signal classification. Designing AFE and backend classifiers separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE's transfer function to achieve system-level optimum. More specifically, t…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 11 pages, 15 figures, accepted for publication on IEEE Transactions on Circuits and Systems I: Regular Papers

  12. arXiv:2507.00490  [pdf, ps, other]

    cs.CV eess.IV

    Just Noticeable Difference for Large Multimodal Models

    Authors: Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang

    Abstract: Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large mu…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: 19 pages, 19 figures

  13. arXiv:2507.00458  [pdf, ps, other]

    eess.AS cs.SD

    Mitigating Language Mismatch in SSL-Based Speaker Anonymization

    Authors: Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Speaker anonymization aims to protect speaker identity while preserving content information and the intelligibility of speech. However, most speaker anonymization systems (SASs) are developed and evaluated using only English, resulting in degraded utility for other languages. This paper investigates language mismatch in SASs for Japanese and Mandarin speech. First, we fine-tune a self-supervised l…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to Interspeech 2025

  14. arXiv:2507.00377  [pdf, ps, other]

    cs.CV

    MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis

    Authors: Jianhao Xie, Ziang Zhang, Zhenyu Weng, Yuesheng Zhu, Guibo Luo

    Abstract: Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data. While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 11 pages, 3 figures

  15. arXiv:2507.00075  [pdf, ps, other]

    cs.LG cs.AI

    Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

    Authors: Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng

    Abstract: Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of s…

    Submitted 29 June, 2025; originally announced July 2025.

    Comments: 24 pages

  16. arXiv:2506.24019  [pdf, ps, other]

    cs.CV cs.CL

    Ella: Embodied Social Agents with Lifelong Memory

    Authors: Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan

    Abstract: We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a nam…

    Submitted 30 June, 2025; originally announced June 2025.

  17. arXiv:2506.24009  [pdf, ps, other]

    cs.IT cs.AI

    Bridging Physical and Digital Worlds: Embodied Large AI for Future Wireless Systems

    Authors: Xinquan Wang, Fenghao Zhu, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang, Sami Muhaidat, Mérouane Debbah

    Abstract: Large artificial intelligence (AI) models offer revolutionary potential for future wireless systems, promising unprecedented capabilities in network optimization and performance. However, current paradigms largely overlook crucial physical interactions. This oversight means they primarily rely on offline datasets, leading to difficulties in handling real-time wireless dynamics and non-stationary e…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 7 pages, 4 figures

  18. arXiv:2506.23508  [pdf, ps, other]

    cs.CL cs.AI

    Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

    Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang

    Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 18 pages (Preprint. Work in progress)

  19. arXiv:2506.23453  [pdf, ps, other]

    stat.ML cs.LG

    Minimax Optimal Two-Stage Algorithm For Moment Estimation Under Covariate Shift

    Authors: Zhen Zhang, Xin Liu, Shaoli Wang, Jiaye Teng

    Abstract: Covariate shift occurs when the distribution of input features differs between the training and testing phases. In covariate shift, estimating an unknown function's moment is a classical problem that remains under-explored, despite its common occurrence in real-world scenarios. In this paper, we investigate the minimax lower bound of the problem when the source and target distributions are known.…

    Submitted 29 June, 2025; originally announced June 2025.

  20. arXiv:2506.23361  [pdf, ps, other]

    cs.CV

    OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

    Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

    Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data cons…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: A data construction pipeline and a diffusion Transformer framework for controllable subject-driven video customization

  21. arXiv:2506.23351  [pdf, ps, other]

    cs.RO cs.AI cs.LG cs.MA

    Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

    Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, et al. (74 additional authors not shown)

    Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

  22. arXiv:2506.23329  [pdf, ps, other]

    cs.CV

    IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

    Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng

    Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using pr…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Project Page: https://ir3d-bench.github.io/

  23. arXiv:2506.23078  [pdf, ps, other]

    cs.RO

    Event-based Stereo Visual-Inertial Odometry with Voxel Map

    Authors: Zhaoxing Zhang, Xiaoxiang Wang, Chengliang Zhang, Yangyang Guo, Zikang Yuan, Xin Yang

    Abstract: The event camera, renowned for its high dynamic range and exceptional temporal resolution, is recognized as an important sensor for visual odometry. However, the inherent noise in event streams complicates the selection of high-quality map points, which critically determine the precision of state estimation. To address this challenge, we propose Voxel-ESVIO, an event-based stereo visual-inertial o…

    Submitted 29 June, 2025; originally announced June 2025.

  24. arXiv:2506.22722  [pdf, ps, other]

    cs.CR cs.AI

    Kill Two Birds with One Stone! Trajectory enabled Unified Online Detection of Adversarial Examples and Backdoor Attacks

    Authors: Anmin Fu, Fanyu Meng, Huaibing Peng, Hua Ma, Zhi Zhang, Yifeng Zheng, Willy Susilo, Yansong Gao

    Abstract: The proposed UniGuard is the first unified online detection framework capable of simultaneously addressing adversarial examples and backdoor attacks. UniGuard builds upon two key insights: first, both AE and backdoor attacks have to compromise the inference phase, making it possible to tackle them simultaneously during run-time via online detection. Second, an adversarial input, whether a perturbe…

    Submitted 27 June, 2025; originally announced June 2025.

  25. arXiv:2506.22632  [pdf, ps, other]

    cs.OS

    Using SBPF to Accelerate Kernel Memory Access From Userspace

    Authors: Boming Kong, Zhizhou Zhang, Jonathan Balkind

    Abstract: The cost of communication between the operating system kernel and user applications has long blocked improvements in software performance. Traditionally, operating systems encourage software developers to use the system call interface to transfer (or initiate transfer of) data between user applications and the kernel. This approach not only hurts performance at the software level due to memory cop…

    Submitted 27 June, 2025; originally announced June 2025.

  26. arXiv:2506.22459  [pdf, ps, other]

    eess.SP cs.LG

    Physics-Embedded Neural Networks for sEMG-based Continuous Motion Estimation

    Authors: Wending Heng, Chaoyuan Liang, Yihui Zhao, Zhiqiang Zhang, Glen Cooper, Zhenhong Li

    Abstract: Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  27. arXiv:2506.22303  [pdf, ps, other]

    cs.IR

    Education-Oriented Graph Retrieval-Augmented Generation for Learning Path Recommendation

    Authors: Xinghe Cheng, Zihan Zhang, Jiapu Wang, Liangda Fang, Chaobo He, Quanlong Guan, Shirui Pan, Weiqi Luo

    Abstract: Learning path recommendation seeks to provide learners with a structured sequence of learning items (e.g., knowledge concepts or exercises) to optimize their learning efficiency. Despite significant efforts in this area, most existing methods primarily rely on prerequisite relationships, which present two major limitations: 1) Many educational datasets do not explicitly provide prerequisite relati…

    Submitted 27 June, 2025; originally announced June 2025.

  28. arXiv:2506.22291  [pdf, ps, other]

    cs.CV cs.AI

    RoomCraft: Controllable and Complete 3D Indoor Scene Generation

    Authors: Mengqi Zhou, Xipeng Wang, Yuxi Wang, Zhaoxiang Zhang

    Abstract: Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but s…

    Submitted 27 June, 2025; originally announced June 2025.

  29. arXiv:2506.22216  [pdf, ps, other]

    cs.CV eess.IV

    ReF-LLE: Personalized Low-Light Enhancement via Reference-Guided Deep Reinforcement Learning

    Authors: Ming Zhao, Pingping Liu, Tongshun Zhang, Zhe Zhang

    Abstract: Low-light image enhancement presents two primary challenges: 1) Significant variations in low-light images across different conditions, and 2) Enhancement levels influenced by subjective preferences and user intent. To address these issues, we propose ReF-LLE, a novel personalized low-light image enhancement method that operates in the Fourier frequency domain and incorporates deep reinforcement l…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 6 pages, 8 figures, accepted by ICME2025

  30. MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

    Authors: Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng

    Abstract: Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques to scale pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 11 pages, accepted at IPDPS 2023

    Journal ref: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 167-177. IEEE, 2023

  31. MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators

    Authors: Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng

    Abstract: Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the dynamicity of tensor dimension sizes could potentially lead to these operators becoming memory-bound, necessitating the generation of fused kernels, a task hindered by…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 12 pages, accepted at SC 2024

    Journal ref: SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024

  32. arXiv:2506.21885  [pdf, ps, other]

    cs.CV cs.MM cs.RO

    Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles

    Authors: Chuheng Wei, Ziye Qin, Ziyan Zhang, Guoyuan Wu, Matthew J. Barth

    Abstract: Multi-sensor fusion plays a critical role in enhancing perception for autonomous driving, overcoming individual sensor limitations, and enabling comprehensive environmental understanding. This paper first formalizes multi-sensor fusion strategies into data-level, feature-level, and decision-level categories and then provides a systematic review of deep learning-based methods corresponding to each…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted by IEEE IV 2025

  33. arXiv:2506.21765  [pdf, ps, other]

    eess.IV cs.CV

    TUS-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

    Authors: Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson, et al. (2 additional authors not shown)

    Abstract: Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequence…

    Submitted 26 June, 2025; originally announced June 2025.

  34. arXiv:2506.21655  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

    Authors: Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao

    Abstract: Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While Reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and o…

    Submitted 26 June, 2025; originally announced June 2025.

  35. arXiv:2506.21618  [pdf, ps, other]

    cs.CL cs.AI

    TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

    Authors: Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan

    Abstract: In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with…

    Submitted 23 June, 2025; originally announced June 2025.

  36. arXiv:2506.21605  [pdf, ps, other]

    cs.CL cs.AI

    MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

    Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong

    Abstract: Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities still remains challenges. Previous evaluations are commonly limited by the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the m…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 17 pages, 5 figures. Accepted by ACL 2025 findings

  37. arXiv:2506.21559  [pdf, ps, other]

    cs.CL

    GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations

    Authors: Junze Chen, Cheng Yang, Shujie Li, Zhiqiang Zhang, Yawen Li, Junping Du, Chuan Shi

    Abstract: Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have been recently integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of…

    Submitted 11 June, 2025; originally announced June 2025.

  38. arXiv:2506.21536  [pdf, ps, other]

    cs.AI cs.HC

    PsyLite Technical Report

    Authors: Fangjun Ding, Renyu Zhang, Xinyu Feng, Chengye Xie, Zheng Zhang, Yanting Zhang

    Abstract: With the rapid development of digital technology, AI-driven psychological counseling has gradually become an important research direction in the field of mental health. However, existing models still have deficiencies in dialogue safety, detailed scenario handling, and lightweight deployment. To address these issues, this study proposes PsyLite, a lightweight psychological counseling large languag…

    Submitted 26 June, 2025; originally announced June 2025.

  39. arXiv:2506.21137  [pdf, ps, other]

    cs.LG

    NaLaFormer: Norm-Aware Linear Attention for Transformer Models

    Authors: Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang

    Abstract: Linear attention has emerged as a viable alternative to softmax attention by reducing complexity from quadratic to linear in sequence length. To preserve two fundamental properties of softmax, non-negativity and entropy reduction, current works employ various linearly separatable kernel functions with $L1$ normalization instead of softmax operator. However, query norms are neglected by the normali…

    Submitted 26 June, 2025; originally announced June 2025.

  40. arXiv:2506.21071  [pdf, ps, other]

    cs.LG cs.CL

    Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph

    Authors: Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong

    Abstract: Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. I…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 20 pages, 12 figures

  41. arXiv:2506.20990  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes

    Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

    Abstract: Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making them unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-var…

    Submitted 26 June, 2025; originally announced June 2025.

  42. arXiv:2506.20644  [pdf, ps, other]

    cs.LG

    Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices

    Authors: Hangyu Li, Hongyue Wu, Guodong Fan, Zhen Zhang, Shizhan Chen, Zhiyong Feng

    Abstract: As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these is…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted by ICWS 2025

  43. arXiv:2506.20624  [pdf, ps, other]

    cs.PL quant-ph

    PhasePoly: An Optimization Framework for Phase Polynomials in Quantum Circuits

    Authors: Zihan Chen, Henry Chen, Yuwei Jin, Minghao Guo, Enhyeok Jang, Jiakang Li, Caitlin Chan, Won Woo Ro, Eddy Z. Zhang

    Abstract: Quantum computing has transformative computational power to make classically intractable computing feasible. As the algorithms that achieve practical quantum advantage are beyond manual tuning, quantum circuit optimization has become extremely important and integrated into today's quantum software stack. This paper focuses on a critical type of quantum circuit optimization -- phase-polynomial opti…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 14 pages, 12 figures

  44. arXiv:2506.20583  [pdf, ps, other]

    cs.CV cs.AI

    Dense Video Captioning using Graph-based Sentence Summarization

    Authors: Zhiwang Zhang, Dong Xu, Wanli Ouyang, Luping Zhou

    Abstract: Recently, dense video captioning has made attractive progress in detecting and captioning all events in a long untrimmed video. Although promising results have been achieved, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. T…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 12 pages

  45. arXiv:2506.20567  [pdf, ps, other]

    cs.CV cs.AI

    Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization

    Authors: Zhiwang Zhang, Dong Xu, Wanli Ouyang, Chuanqi Tan

    Abstract: In this work, we propose a division-and-summarization (DaS) framework for dense video captioning. After partitioning each untrimmed long video into multiple event proposals, where each event proposal consists of a set of short video segments, we extract visual features (e.g., C3D features) from each segment and use an existing image/video captioning approach to generate one sentence description for t…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 10 pages

  46. arXiv:2506.20168  [pdf, ps, other]

    cs.CV

    Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

    Authors: Zhentao He, Can Zhang, Ziheng Wu, Zhenghao Chen, Yufei Zhan, Yifan Li, Zhao Zhang, Xian Wang, Minghui Qiu

    Abstract: Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading…

    Submitted 25 June, 2025; originally announced June 2025.

  47. arXiv:2506.20159  [pdf, ps, other]

    cs.SE cs.AI

    AI and Agile Software Development: From Frustration to Success -- XP2025 Workshop Summary

    Authors: Tomas Herda, Victoria Pichler, Zheying Zhang, Pekka Abrahamsson, Geir K. Hanssen

    Abstract: The full-day workshop on AI and Agile at XP 2025 convened a diverse group of researchers and industry practitioners to address the practical challenges and opportunities of integrating Artificial Intelligence into Agile software development. Through interactive sessions, participants identified shared frustrations related to integrating AI into Agile Software Development practices, including chall…

    Submitted 25 June, 2025; originally announced June 2025.

  48. arXiv:2506.20062  [pdf, ps, other]

    cs.HC cs.AI

    Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents

    Authors: Runlong Ye, Zeling Zhang, Boushra Almazroua, Michael Liut

    Abstract: AI-powered code assistants are widely used to generate code completions, significantly boosting developer productivity. However, these tools typically present suggestions without explaining their rationale, leaving their decision-making process inscrutable. This opacity hinders developers' ability to critically evaluate the output, form accurate mental models, and build calibrated trust in the sys…

    Submitted 24 June, 2025; originally announced June 2025.

  49. arXiv:2506.20061  [pdf, ps, other]

    cs.LG cs.CL

    Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

    Authors: Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

    Abstract: Developing effective instruction-following policies in reinforcement learning remains challenging due to the reliance on extensive human-labeled instruction datasets and the difficulty of learning from sparse rewards. In this paper, we propose a novel approach that leverages the capabilities of large language models (LLMs) to automatically generate open-ended instructions retrospectively from prev…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Under Review

  50. arXiv:2506.19890  [pdf, ps, other]

    cs.LG cs.AI

    Causal-Aware Intelligent QoE Optimization for VR Interaction with Adaptive Keyframe Extraction

    Authors: Ziru Zhang, Jiadong Yu, Danny H. K. Tsang

    Abstract: The optimization of quality of experience (QoE) in multi-user virtual reality (VR) interactions demands a delicate balance between ultra-low latency, high-fidelity motion synchronization, and equitable resource allocation. While adaptive keyframe extraction mitigates transmission overhead, existing approaches often overlook the causal relationships among allocated bandwidth, CPU frequency, and use…

    Submitted 24 June, 2025; originally announced June 2025.