Showing 1–50 of 7,679 results for author: Zhang, J

Searching in archive cs.
  1. arXiv:2505.10442  [pdf, ps, other]

    cs.RO cs.AI

    IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

    Authors: Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang

    Abstract: Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability an… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.10143  [pdf, ps, other]

    cs.CL

    GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs

    Authors: Longchao Da, Parth Mitesh Shah, Kuan-Ru Liou, Jiaxing Zhang, Hua Wei

    Abstract: Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanation… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 5 pages, 4 figures, accepted to IJCAI2025 demo track

    MSC Class: 68T50; 68T30 ACM Class: I.2.7; I.2.4; H.3.3

  3. arXiv:2505.10049  [pdf, ps, other]

    cs.CV

    Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

    Authors: Jinlong Fan, Xuepu Zeng, Jing Zhang, Mingming Gong, Yuxiang Yang, Dacheng Tao

    Abstract: Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with inn… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

  4. arXiv:2505.09388  [pdf, other]

    cs.CL

    Qwen3 Technical Report

    Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, et al. (35 additional authors not shown)

    Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration… ▽ More

    Submitted 14 May, 2025; originally announced May 2025.

  5. arXiv:2505.09325  [pdf, ps, other]

    cs.SD eess.AS

    SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset

    Authors: Yicheng Gu, Chaoren Wang, Junan Zhang, Xueyao Zhang, Zihao Fang, Haorui He, Zhizheng Wu

    Abstract: The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training dat… ▽ More

    Submitted 14 May, 2025; originally announced May 2025.

  6. arXiv:2505.08903  [pdf, ps, other]

    cs.SE

    Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

    Authors: Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

    Abstract: Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating t… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

  7. arXiv:2505.08854  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    Generative AI for Autonomous Driving: Frontiers and Opportunities

    Authors: Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, et al. (22 additional authors not shown)

    Abstract: Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, partic… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

  8. arXiv:2505.08844  [pdf, other]

    q-bio.GN cs.AI

    CellTypeAgent: Trustworthy cell type annotation with Large Language Models

    Authors: Jiawen Chen, Jianghao Zhang, Huaxiu Yao, Yun Li

    Abstract: Cell type annotation is a critical yet laborious step in single-cell RNA sequencing analysis. We present a trustworthy large language model (LLM)-agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell t… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

    MSC Class: 68T20 ACM Class: I.2.1

  9. arXiv:2505.08723  [pdf, other]

    cs.CV

    TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series

    Authors: Xiaolei Qin, Di Wang, Jing Zhang, Fengxiang Wang, Xin Su, Bo Du, Liangpei Zhang

    Abstract: Satellite image time series (SITS) provide continuous observations of the Earth's surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objec… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

  10. arXiv:2505.08617  [pdf, ps, other]

    cs.CV

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Authors: Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng

    Abstract: While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effective… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Work in progress

  11. arXiv:2505.07841  [pdf, other]

    cs.NI cs.LG

    Token Communication-Driven Multimodal Large Models in Resource-Constrained Multiuser Networks

    Authors: Junhe Zhang, Wanli Ni, Pengwei Wang, Dongyu Wang

    Abstract: The proliferation of intelligent applications at the wireless edge, alongside the exponential growth of multimodal data, poses challenges for deploying multimodal large models (MLMs) in resource-constrained networks. These constraints manifest as limited bandwidth, computational capacity, and stringent latency requirements, particularly under low signal-to-noise ratio (SNR) conditions. To overcome… ▽ More

    Submitted 6 May, 2025; originally announced May 2025.

  12. arXiv:2505.07734  [pdf, ps, other]

    cs.CV

    LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

    Authors: Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen

    Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vis… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

  13. arXiv:2505.07538  [pdf, ps, other]

    cs.CV

    Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

    Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

    Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally dist… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

  14. arXiv:2505.07208  [pdf, other]

    cs.SE

    An Empirical Study: MEMS as a Static Performance Metric

    Authors: Liwei Zhang, Baoquan Cui, Xutong Ma, Jian Zhang

    Abstract: Static performance estimation is essential during compile-time analysis, yet traditional runtime-based methods are costly and platform-dependent. We investigate mems, the number of memory accesses, as a static and architecture-independent performance metric. We develop a Clang-based automated instrumentation tool that rewrites source code to insert path tracing and mems counting logic. Th… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

  15. arXiv:2505.06997  [pdf, ps, other]

    cs.AI

    A Multi-Agent Reinforcement Learning Approach for Cooperative Air-Ground-Human Crowdsensing in Emergency Rescue

    Authors: Wenhao Lu, Zhengqiu Zhu, Yong Zhao, Yonglin Tian, Junjie Zeng, Jun Zhang, Zhong Liu, Fei-Yue Wang

    Abstract: Mobile crowdsensing is evolving beyond traditional human-centric models by integrating heterogeneous entities like unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Optimizing task allocation among these diverse agents is critical, particularly in challenging emergency rescue scenarios characterized by complex environments, limited communication, and partial observability. This… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

  16. arXiv:2505.06993  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Towards the Three-Phase Dynamics of Generalization Power of a DNN

    Authors: Yuxuan He, Junpeng Zhang, Hongyuan Zhang, Quanshi Zhang

    Abstract: This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interaction encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainable AI, which proves that the detailed inference logic o… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

  17. arXiv:2505.06901  [pdf, ps, other]

    cs.AR

    Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression

    Authors: Feng Cheng, Cong Guo, Chiyue Wei, Junyao Zhang, Changchun Zhou, Edward Hanson, Jiaqi Zhang, Xiaoxiao Liu, Hai "Helen" Li, Yiran Chen

    Abstract: Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained environments. Quantization techniques have emerged as a critical solution, reducing data precision to enhance memory and computational efficiency. However,… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: ISCA 2025

  18. arXiv:2505.06856  [pdf, other]

    cs.AI cs.RO

    Beyond Patterns: Harnessing Causal Logic for Autonomous Driving Trajectory Prediction

    Authors: Bonan Wang, Haicheng Liao, Chengyue Wang, Bin Rao, Yanchen Guan, Guyang Yu, Jiaxun Zhang, Songning Lai, Chengzhong Xu, Zhenning Li

    Abstract: Accurate trajectory prediction has long been a major challenge for autonomous driving (AD). Traditional data-driven models predominantly rely on statistical correlations, often overlooking the causal relationships that govern traffic behavior. In this paper, we introduce a novel trajectory prediction framework that leverages causal inference to enhance predictive robustness, generalization, and ac… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

    Journal ref: IJCAI 2025

  19. arXiv:2505.06729  [pdf, ps, other]

    cs.RO

    STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation

    Authors: Haokun Zhu, Zongtai Li, Zhixuan Liu, Wenshan Wang, Ji Zhang, Jonathan Francis, Jean Oh

    Abstract: Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation poses two key challenges: effectively representing complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g.… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

  20. arXiv:2505.06690  [pdf]

    cs.LG

    E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention

    Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang, Weinan Huang

    Abstract: Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of differ… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

  21. arXiv:2505.06688  [pdf]

    cs.LG

    A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network

    Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang

    Abstract: Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models has demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

  22. arXiv:2505.06524  [pdf, ps, other]

    cs.CV

    Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation

    Authors: Jingyao Wang, Jianqi Zhang, Wenwen Qiang, Changwen Zheng

    Abstract: Despite the strength of the Segment Anything Model (SAM), it struggles with generalization issues in open-vocabulary multi-entity segmentation (OVMS). Through empirical and causal analyses, we find that (i) the prompt bias is the primary cause of the generalization issues; (ii) this bias is closely tied to the task-irrelevant generating factors within the prompts, which act as confounders and affe… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

  23. arXiv:2505.06258  [pdf, ps, other]

    cs.LG cs.AI

    ABE: A Unified Framework for Robust and Faithful Attribution-Based Explainability

    Authors: Zhiyu Zhu, Jiayu Zhang, Zhibo Jin, Fang Chen, Jianlong Zhou

    Abstract: Attribution algorithms are essential for enhancing the interpretability and trustworthiness of deep learning models by identifying key features driving model decisions. Existing frameworks, such as InterpretDL and OmniXAI, integrate multiple attribution methods but suffer from scalability limitations, high coupling, theoretical constraints, and lack of user-friendly implementations, hindering neur… ▽ More

    Submitted 3 May, 2025; originally announced May 2025.

  24. arXiv:2505.06091  [pdf, other]

    cs.LG cs.AI cs.SC

    UniSymNet: A Unified Symbolic Network Guided by Transformer

    Authors: Xinxin Li, Juan Zhang, Da Li, Xingyu Liu, Jin Xu, Junping Yin

    Abstract: Symbolic Regression (SR) is a powerful technique for automatically discovering mathematical expressions from input data. Mainstream SR algorithms search for the optimal symbolic tree in a vast function space, but the increasing complexity of the tree structure limits their performance. Inspired by neural networks, symbolic networks have emerged as a promising new paradigm. However, most existing s… ▽ More

    Submitted 9 May, 2025; originally announced May 2025.

  25. arXiv:2505.05950  [pdf, ps, other]

    cs.LG

    FloE: On-the-Fly MoE Inference on Memory-constrained GPU

    Authors: Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou

    Abstract: With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive sce… ▽ More

    Submitted 11 May, 2025; v1 submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  26. arXiv:2505.05831  [pdf, ps, other]

    cs.RO

    Oh F**k! How Do People Feel about Robots that Leverage Profanity?

    Authors: Madison R. Shippy, Brian J. Zhang, Naomi T. Fitter

    Abstract: Profanity is nearly as old as language itself, and cursing has become particularly ubiquitous within the last century. At the same time, robots in personal and service applications are often overly polite, even though past work demonstrates the potential benefits of robot norm-breaking. Thus, we became curious about robots using curse words in error scenarios as a means for improving social percep… ▽ More

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Under review for the 2025 IEEE RO-MAN Conference

  27. arXiv:2505.05763  [pdf]

    cs.LG cs.CL

    BMMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection

    Authors: Yize Zhou, Jie Zhang, Meijie Wang, Lun Yu

    Abstract: Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for hol… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  28. arXiv:2505.05715  [pdf, ps, other]

    cs.SE cs.PL

    JustinANN: Realistic Test Generation for Java Programs Driven by Annotations

    Authors: Baoquan Cui, Rong Qu, Jian Zhang

    Abstract: Automated test case generation is important. However, the automatically generated test input does not always make sense, and the automated assertion is difficult to validate against the program under test. In this paper, we propose JustinANN, a flexible and scalable tool to generate test cases for Java programs, providing realistic test inputs and assertions. We have observed that, in practice, Ja… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  29. arXiv:2505.05473  [pdf, ps, other]

    cs.CV

    DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

    Authors: Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

    Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pi… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: CVPR 2025. Project website: https://qitaozhao.github.io/DiffusionSfM

  30. arXiv:2505.05464  [pdf, other]

    cs.CL

    Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

    Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He

    Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: ICML 2025. Our code is publicly available at https://github.com/shiqichen17/VLM_Merging

  31. arXiv:2505.05366  [pdf, ps, other]

    cs.NI

    SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication

    Authors: Mikhail Khalilov, Siyuan Shen, Marcin Chrapek, Tiancheng Chen, Kenji Nakano, Peter-Jan Gootzen, Salvatore Di Girolamo, Rami Nudelman, Gil Bloch, Sreevatsa Anantharamu, Mahmoud Elhaddad, Jithin Jose, Abdul Kabbani, Scott Moe, Konstantin Taranov, Zhuolong Yu, Jie Zhang, Nicola Mazzoletti, Torsten Hoefler

    Abstract: RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing ha… ▽ More

    Submitted 10 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

  32. arXiv:2505.05225  [pdf, ps, other]

    cs.CL

    QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

    Authors: Mengze Hong, Wailing Ng, Di Jiang, Chen Jason Zhang

    Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first mu… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  33. arXiv:2505.05098  [pdf, ps, other]

    cs.RO cs.CL cs.CV cs.ET

    X-Driver: Explainable Autonomous Driving with Vision-Language Models

    Authors: Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, Zengfeng Zeng

    Abstract: End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a uni… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  34. arXiv:2505.04996  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

    Authors: Jinhe Huang, Yongkang Cheng, Yuming Hang, Gaoge Han, Jinewei Li, Jing Zhang, Xingjian Gu

    Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Gener… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: accepted by ICMR 2025

  35. arXiv:2505.04946  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

    Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao

    Abstract: Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mat… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  36. arXiv:2505.04936  [pdf, other]

    cs.IT eess.SP

    Fluid Antenna-Assisted MU-MIMO Systems with Decentralized Baseband Processing

    Authors: Tianyi Liao, Wei Guo, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

    Abstract: The fluid antenna system (FAS) has emerged as a disruptive technology, offering unprecedented degrees of freedom (DoF) for wireless communication systems. However, optimizing fluid antenna (FA) positions entails significant computational costs, especially when the number of FAs is large. To address this challenge, we introduce a decentralized baseband processing (DBP) architecture to FAS, which pa… ▽ More

    Submitted 12 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: 7 pages, 5 figures, submitted to an IEEE conference

  37. arXiv:2505.04930  [pdf, ps, other]

    cs.IT eess.SP

    Accurate and Fast Channel Estimation for Fluid Antenna Systems with Diffusion Models

    Authors: Erqiang Tang, Wei Guo, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

    Abstract: Fluid antenna systems (FAS) offer enhanced spatial diversity for next-generation wireless systems. However, acquiring accurate channel state information (CSI) remains challenging due to the large number of reconfigurable ports and the limited availability of radio-frequency (RF) chains -- particularly in high-dimensional FAS scenarios. To address this challenge, we propose an efficient posterior s… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 6 pages, 5 figures, submitted to an IEEE conference

  38. arXiv:2505.04891  [pdf]

    cs.LG cs.AI stat.ML

    Clustering with Communication: A Variational Framework for Single Cell Representation Learning

    Authors: Cong Qi, Yeqing Chen, Jie Zhang, Wei Zhi

    Abstract: Single-cell RNA sequencing (scRNA-seq) has revealed complex cellular heterogeneity, but recent studies emphasize that understanding biological function also requires modeling cell-cell communication (CCC), the signaling interactions mediated by ligand-receptor pairs that coordinate cellular behavior. Tools like CellChat have demonstrated that CCC plays a critical role in processes such as cell dif… ▽ More

    Submitted 7 May, 2025; originally announced May 2025.

  39. arXiv:2505.04836  [pdf, ps, other]

    eess.SP cs.CV

    Integrated Image Reconstruction and Target Recognition based on Deep Learning Technique

    Authors: Cien Zhang, Jiaming Zhang, Jiajun He, Okan Yurduseven

    Abstract: Computational microwave imaging (CMI) has gained attention as an alternative technique for conventional microwave imaging techniques, addressing their limitations such as hardware-intensive physical layer and slow data collection acquisition speed to name a few. Despite these advantages, CMI still encounters notable computational bottlenecks, especially during the image reconstruction stage. In th… ▽ More

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: Submitted to The 2025 15th IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC 2025)

  40. arXiv:2505.04089  [pdf]

    cs.NE

    A New Scope and Domain Measure Comparison Method for Global Convergence Analysis in Evolutionary Computation

    Authors: Liu-Yue Luo, Zhi-Hui Zhan, Kay Chen Tan, Jun Zhang

    Abstract: Convergence analysis is a fundamental research topic in evolutionary computation (EC). The commonly used analysis method models the EC algorithm as a homogeneous Markov chain for analysis, which is not always suitable for different EC variants, and also sometimes causes misuse and confusion due to their complex process. In this article, we categorize the existing researches on convergence analysis… ▽ More

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 14 pages, 8 figures

  41. arXiv:2505.03729  [pdf, other]

    cs.RO cs.CV

    Visual Imitation Enables Contextual Humanoid Control

    Authors: Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, Angjoo Kanazawa

    Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies fo… ▽ More

    Submitted 13 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

    Comments: Project website: https://www.videomimic.net/

  42. arXiv:2505.03603  [pdf, other]

    cs.CV cs.MM

    PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

    Authors: S. Z. Zhou, Y. B. Wang, J. F. Wu, T. Hu, J. N. Zhang, Z. J. Li, Y. Liu

    Abstract: Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings a… ▽ More

    Submitted 11 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  43. arXiv:2505.03539  [pdf, other]

    cs.CV cs.RO eess.IV

    Panoramic Out-of-Distribution Segmentation

    Authors: Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li

    Abstract: Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introdu…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Code and datasets will be available at https://github.com/MengfeiD/PanOoS

  44. arXiv:2505.03418  [pdf, other]

    cs.LG

    Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey

    Authors: Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, Huajun Chen

    Abstract: Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate s…

    Submitted 6 May, 2025; originally announced May 2025.

  45. arXiv:2505.03314  [pdf, other]

    cs.SD cs.AI eess.AS

    Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation

    Authors: Jincheng Zhang, György Fazekas, Charalampos Saitis

    Abstract: The recent surge in the popularity of diffusion models for image synthesis has attracted new attention to their potential for generation tasks in other domains. However, their applications to symbolic music generation remain largely under-explored because symbolic music is typically represented as sequences of discrete events and standard diffusion models are not well-suited for discrete data. We…

    Submitted 6 May, 2025; originally announced May 2025.

  46. arXiv:2505.03171  [pdf, other]

    cs.AI

    CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

    Authors: Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, Zhengfeng Yang, Jujian Zhang, Lihong Zhi, Jia Li, Zhengying Liu

    Abstract: Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive be…

    Submitted 6 May, 2025; originally announced May 2025.

  47. arXiv:2505.03135  [pdf, other]

    cs.AI

    Holmes: Automated Fact Check with Large Language Models

    Authors: Haoran Ou, Gelei Deng, Xingshuo Han, Jie Zhang, Xinlei He, Han Qiu, Shangwei Guo, Tianwei Zhang

    Abstract: The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by adv…

    Submitted 5 May, 2025; originally announced May 2025.

  48. arXiv:2505.03113  [pdf, other]

    cs.CV

    Image Recognition with Online Lightweight Vision Transformer: A Survey

    Authors: Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu

    Abstract: The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its…

    Submitted 10 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

  49. arXiv:2505.02922  [pdf, ps, other]

    cs.LG

    RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

    Authors: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang

    Abstract: The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index,…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: 16 pages

  50. arXiv:2505.02820  [pdf, other]

    cs.AI cs.CL cs.LG

    AutoLibra: Agent Metric Induction from Open-Ended Feedback

    Authors: Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang

    Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide wh…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: https://opensocial.world/