
Showing 1–50 of 5,228 results for author: Zhang, H

Searching in archive cs.
  1. arXiv:2505.10348  [pdf, ps, other]

    cs.HC cs.SD eess.AS

    ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection

    Authors: Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv

    Abstract: Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spa…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.10228  [pdf, ps, other]

    cs.RO eess.SY

    Quad-LCD: Layered Control Decomposition Enables Actuator-Feasible Quadrotor Trajectory Planning

    Authors: Anusha Srikanthan, Hanli Zhang, Spencer Folk, Vijay Kumar, Nikolai Matni

    Abstract: In this work, we specialize contributions from prior work on data-driven trajectory generation for a quadrotor system with motor saturation constraints. When motors saturate in quadrotor systems, there is an "uncontrolled drift" of the vehicle that results in a crash. To tackle saturation, we apply a control decomposition and learn a tracking penalty from simulation data consisting of low, medium…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 4 pages, 4 figures

    Journal ref: ICRA 2025 Workshop on 25 Years of Aerial Robotics: Challenges and Opportunities

  3. UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs

    Authors: Yi Gui, Yao Wan, Zhen Li, Zhongyi Zhang, Dongping Chen, Hongyu Zhang, Yi Su, Bohua Chen, Xing Zhou, Wenbin Jiang, Xiangliang Zhang

    Abstract: Automating the synthesis of User Interfaces (UIs) plays a crucial role in enhancing productivity and accelerating the development lifecycle, reducing both development time and manual effort. Recently, the rapid development of Multimodal Large Language Models (MLLMs) has made it possible to generate front-end Hypertext Markup Language (HTML) code directly from webpage designs. However, real-world w…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: WWW' 2025

  4. arXiv:2505.09698  [pdf, ps, other]

    cs.RO cs.AI

    ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

    Authors: Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, Daniel Seita

    Abstract: Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 47 pages, 29 figures. Under review

  5. arXiv:2505.09451  [pdf, ps, other]

    cs.AR

    SEGA-DCIM: Design Space Exploration-Guided Automatic Digital CIM Compiler with Multiple Precision Support

    Authors: Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang, Xiyuan Tang

    Abstract: Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages limit the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided auto…

    Submitted 14 May, 2025; originally announced May 2025.

  6. arXiv:2505.08687  [pdf, ps, other]

    cs.LG cs.AI

    AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov-Arnold Networks

    Authors: Hangwei Zhang, Zhimu Huang, Yan Wang

    Abstract: Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer…

    Submitted 13 May, 2025; originally announced May 2025.

  7. arXiv:2505.08366  [pdf]

    eess.SP cs.AI

    Non-contact Vital Signs Detection in Dynamic Environments

    Authors: Shuai Sun, Chong-Xi Liang, Chengwei Ye, Huanzhen Zhang, Kangsheng Wang

    Abstract: Accurate phase demodulation is critical for vital sign detection using millimeter-wave radar. However, in complex environments, time-varying DC offsets and phase imbalances can severely degrade demodulation performance. To address this, we propose a novel DC offset calibration method alongside a Hilbert and Differential Cross-Multiply (HADCM) demodulation algorithm. The approach estimates time-var…

    Submitted 13 May, 2025; originally announced May 2025.

  8. arXiv:2505.08260  [pdf, ps, other]

    cs.CV

    Few-shot Novel Category Discovery

    Authors: Chunming Li, Shidong Wang, Haofeng Zhang

    Abstract: The transductive learning paradigm adopted by the recently proposed Novel Category Discovery (NCD) hinders its application in more real-world scenarios. In fact, a few labeled samples from some of the new categories can alleviate this burden, which coincides with the ease with which people can label a small amount of new-category data. Therefore, this paper presents a new setting in which a trained agent is able to flexibly s…

    Submitted 13 May, 2025; originally announced May 2025.

  9. arXiv:2505.08239  [pdf, ps, other]

    cs.GR cs.CV

    ACT-R: Adaptive Camera Trajectories for 3D Reconstruction from Single Image

    Authors: Yizhi Wang, Mingrui Zhao, Ali Mahdavi-Amiri, Hao Zhang

    Abstract: We introduce adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of generating an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. Most importantly, our view sequence is not determined by a pre-determi…

    Submitted 13 May, 2025; originally announced May 2025.

  10. arXiv:2505.07916  [pdf, ps, other]

    eess.AS cs.SD

    MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

    Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

    Abstract: We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, w…

    Submitted 12 May, 2025; originally announced May 2025.

  11. arXiv:2505.07854  [pdf]

    cs.AI cs.MA

    CCL: Collaborative Curriculum Learning for Sparse-Reward Multi-Agent Reinforcement Learning via Co-evolutionary Task Evolution

    Authors: Yufei Lin, Chengwei Ye, Huanzhen Zhang, Kangsheng Wang, Linuo Xu, Shuyan Liu, Zeyu Zhang

    Abstract: Sparse reward environments pose significant challenges in reinforcement learning, especially within multi-agent systems (MAS) where feedback is delayed and shared across agents, leading to suboptimal learning. We propose Collaborative Multi-dimensional Course Learning (CCL), a novel curriculum learning framework that addresses this by (1) refining intermediate tasks for individual agents, (2) usin…

    Submitted 8 May, 2025; originally announced May 2025.

  12. arXiv:2505.07634  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    Neural Brain: A Neuroscience-inspired Framework for Embodied Agents

    Authors: Jian Liu, Xiongtao Shi, Thai Duy Nguyen, Haitian Zhang, Tianxiang Zhang, Wei Sun, Yanjie Li, Athanasios V. Vasilakos, Giovanni Iacca, Arshad Ali Khan, Arvind Kumar, Jae Won Cho, Ajmal Mian, Lihua Xie, Erik Cambria, Lin Wang

    Abstract: The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the ris…

    Submitted 14 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

    Comments: 51 pages, 17 figures, 9 tables

  13. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 12 May, 2025; originally announced May 2025.

  14. arXiv:2505.07538  [pdf, ps, other]

    cs.CV

    Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

    Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

    Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally dist…

    Submitted 12 May, 2025; originally announced May 2025.

  15. arXiv:2505.07446  [pdf, other]

    cs.RO

    TPT-Bench: A Large-Scale, Long-Term and Robot-Egocentric Dataset for Benchmarking Target Person Tracking

    Authors: Hanjing Ye, Yu Zhan, Weixi Situ, Guangcheng Chen, Jingwen Yu, Kuanqi Cai, Hong Zhang

    Abstract: Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractions, clean backgrounds, and short-term occlusions. In thi…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Under review. web: https://medlartea.github.io/tpt-bench/

  16. arXiv:2505.07395  [pdf, other]

    cs.RO

    ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

    Authors: Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, Donglin Wang

    Abstract: Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (Reinb…

    Submitted 12 May, 2025; originally announced May 2025.

  17. arXiv:2505.07209  [pdf, other]

    cs.CV

    Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models

    Authors: Yan Xie, Zequn Zeng, Hao Zhang, Yucheng Ding, Yi Wang, Zhengjue Wang, Bo Chen, Hongwei Liu

    Abstract: Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs learn only coarse-grained relations between the whole image and the concepts and give little consideration to local image information, leading to two main drawbacks: i) they often produce spurious visual-concept relations,…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: CVPR 2025

  18. arXiv:2505.07147  [pdf, other]

    cs.CG math.MG

    All Polyhedral Manifolds are Connected by a 2-Step Refolding

    Authors: Lily Chung, Erik D. Demaine, Jenny Diomidova, Tonan Kamata, Jayson Lynch, Ryuhei Uehara, Hanyu Alice Zhang

    Abstract: We prove that, for any two polyhedral manifolds $\mathcal P, \mathcal Q$, there is a polyhedral manifold $\mathcal I$ such that $\mathcal P, \mathcal I$ share a common unfolding and $\mathcal I,\mathcal Q$ share a common unfolding. In other words, we can unfold $\mathcal P$, refold (glue) that unfolding into $\mathcal I$, unfold $\mathcal I$, and then refold into $\mathcal Q$. Furthermore, if…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 14 pages, 10 figures. Presented at JCDCG^3 2024. arXiv admin note: substantial text overlap with arXiv:2412.02174

  19. arXiv:2505.06993  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Towards the Three-Phase Dynamics of Generalization Power of a DNN

    Authors: Yuxuan He, Junpeng Zhang, Hongyuan Zhang, Quanshi Zhang

    Abstract: This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interaction encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainable AI, which proves that the detailed inference logic o…

    Submitted 11 May, 2025; originally announced May 2025.

  20. arXiv:2505.06912  [pdf, ps, other]

    cs.CV

    Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

    Authors: Chao Ding, Mouxiao Bian, Pengcheng Chen, Hongliang Zhang, Tianbin Li, Lihao Liu, Jiayuan Chen, Zhuoran Li, Yabei Zhong, Yongqi Liu, Haiqing Huang, Dongming Shan, Junjun He, Jie Xu

    Abstract: Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clini…

    Submitted 11 May, 2025; originally announced May 2025.

  21. arXiv:2505.06557  [pdf, ps, other]

    cs.CV

    Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

    Authors: Lu Dong, Haiyu Zhang, Hongjie Zhang, Yifei Huang, Zhen-Hua Ling, Yu Qiao, Limin Wang, Yali Wang

    Abstract: The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are h…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: TCSVT 2025, doi at https://ieeexplore.ieee.org/document/10970001

  22. arXiv:2505.06461  [pdf, ps, other]

    cs.DC cs.LG

    Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference

    Authors: Haolin Zhang, Jeff Huang

    Abstract: The common assumption in on-device AI is that GPUs, with their superior parallel processing, always provide the best performance for large language model (LLM) inference. In this work, we challenge this notion by empirically demonstrating that, under certain conditions, CPUs can outperform GPUs for LLM inference on mobile devices. Using a 1-billion-parameter LLM deployed via llama.cpp on the iPhon…

    Submitted 9 May, 2025; originally announced May 2025.

  23. arXiv:2505.06256  [pdf, other]

    eess.SP cs.AI

    SpectrumFM: A Foundation Model for Intelligent Spectrum Management

    Authors: Fuhui Zhou, Chunyu Liu, Hao Zhang, Wei Wu, Qihui Wu, Derrick Wing Kwan Ng, Tony Q. S. Quek, Chan-Byoung Chae

    Abstract: Intelligent spectrum management is crucial for improving spectrum efficiency and achieving secure utilization of spectrum resources. However, existing intelligent spectrum management methods, typically based on small-scale models, suffer from notable limitations in recognition accuracy, convergence speed, and generalization, particularly in the complex and dynamic spectrum environments. To address…

    Submitted 2 May, 2025; originally announced May 2025.

  24. arXiv:2505.06117  [pdf, other]

    cs.CV

    Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

    Authors: Dongying Li, Binyi Su, Hua Zhang, Yong Li, Haiyong Chen

    Abstract: Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To addr…

    Submitted 9 May, 2025; originally announced May 2025.

  25. arXiv:2505.05874  [pdf, ps, other]

    cs.LG physics.chem-ph q-bio.BM

    A 3D pocket-aware and evolutionary conserved interaction guided diffusion model for molecular optimization

    Authors: Anjie Qiao, Hao Zhang, Qianmu Yuan, Qirui Deng, Jingtian Su, Weifeng Huang, Huihao Zhou, Guo-Bo Li, Zhen Wang, Jinping Lei

    Abstract: Generating molecules that bind to specific protein targets via diffusion models has shown good promise for structure-based drug design and molecule optimization. In particular, diffusion models with binding interaction guidance enable molecule generation with high affinity by forming favorable interactions within the protein pocket. However, the generated molecules may not form interactions with…

    Submitted 9 May, 2025; originally announced May 2025.

  26. arXiv:2505.05856  [pdf, ps, other]

    cs.DC

    DawnPiper: A Memory-scablable Pipeline Parallel Training Framework

    Authors: Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

    Abstract: Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively support. In this paper, we introduce DawnPiper, a memory-scalable pipeline parallel training framework. Firstly, we develop a DL compilation-based profiling met…

    Submitted 9 May, 2025; originally announced May 2025.

  27. arXiv:2505.05520  [pdf]

    cs.CV cs.AI

    GaMNet: A Hybrid Network with Gabor Fusion and NMamba for Efficient 3D Glioma Segmentation

    Authors: Chengwei Ye, Huanzhen Zhang, Yufei Lin, Kangsheng Wang, Linuo Xu, Shuyan Liu

    Abstract: Gliomas are aggressive brain tumors that pose serious health risks. Deep learning aids in lesion segmentation, but CNN and Transformer-based models often lack context modeling or demand heavy computation, limiting real-time use on mobile medical devices. We propose GaMNet, integrating the NMamba module for global modeling and a multi-scale CNN for efficient local feature extraction. To improve int…

    Submitted 8 May, 2025; originally announced May 2025.

  28. arXiv:2505.05130  [pdf, ps, other]

    cs.DC

    CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models

    Authors: Mengjun Yi, Hanwen Zhang, Hui Dou, Jian Zhao, Furao Shen

    Abstract: Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data…

    Submitted 8 May, 2025; originally announced May 2025.

  29. arXiv:2505.04993  [pdf, other]

    cs.CL

    Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

    Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao

    Abstract: Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To addr…

    Submitted 8 May, 2025; originally announced May 2025.

  30. arXiv:2505.04947  [pdf, other]

    cs.DC

    DFPL: Decentralized Federated Prototype Learning Across Heterogeneous Data Distributions

    Authors: Hongliang Zhang, Fenghua Xu, Zhongyuan Yu, Chunqiang Hu, Shanchen Pang, Xiaofen Wang, Jiguo Yu

    Abstract: Federated learning is a distributed machine learning paradigm that enables the collaborative training of multiple clients through centralized model aggregation. However, standard federated learning relies on a centralized server, making it vulnerable to server failures. While existing solutions utilize blockchain technology to implement Decentralized Federated Learning (DFL), the statistical heter…

    Submitted 8 May, 2025; originally announced May 2025.

  31. arXiv:2505.04620  [pdf, other]

    cs.CV

    On Path to Multimodal Generalist: General-Level and General-Bench

    Authors: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, et al. (7 additional authors not shown)

    Abstract: The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expande…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: ICML'25, 305 pages, 115 tables, 177 figures, project page: https://generalist.top/

  32. arXiv:2505.04606  [pdf, other]

    cs.SE

    OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

    Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

    Abstract: The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks are proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limi…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: To appear at ISSTA'25

  33. arXiv:2505.04270  [pdf, ps, other]

    cs.CV cs.AI

    Object-Shot Enhanced Grounding Network for Egocentric Video

    Authors: Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, Liqiang Nie

    Abstract: Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitatio…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR 2025

  34. arXiv:2505.04088  [pdf]

    cs.CV

    SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking

    Authors: Shang Zhang, Huanbin Zhang, Dali Feng, Yujie Cui, Ruoyan Xiong, Cen He

    Abstract: Thermal infrared (TIR) object tracking often suffers from challenges such as target occlusion, motion blur, and background clutter, which significantly degrade the performance of trackers. To address these issues, this paper proposes a novel Siamese Motion Mamba Tracker (SMMT), which integrates a bidirectional state-space model and a self-attention mechanism. Specifically, we introduce the Motion…

    Submitted 10 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  35. Call for Action: towards the next generation of symbolic regression benchmark

    Authors: Guilherme S. Imai Aldeia, Hengzhe Zhang, Geoffrey Bomarito, Miles Cranmer, Alcides Fonseca, Bogdan Burlacu, William G. La Cava, Fabrício Olivetti de França

    Abstract: Symbolic Regression (SR) is a powerful technique for discovering interpretable mathematical expressions. However, benchmarking SR methods remains challenging due to the diversity of algorithms, datasets, and evaluation criteria. In this work, we present an updated version of SRBench. Our benchmark expands the previous one by nearly doubling the number of evaluated methods, refining evaluation metr…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 10 pages, 4 figures, 3 tables, accepted in Genetic and Evolutionary Computation Conference (GECCO '25) Symbolic Regression Workshop

  36. arXiv:2505.03846  [pdf]

    cs.CV cs.AI

    GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation

    Authors: Kangsheng Wang, Yuhang Li, Chengwei Ye, Yufei Lin, Huanzhen Zhang, Bohan Hu, Linuo Xu, Shuyan Liu

    Abstract: Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-St…

    Submitted 5 May, 2025; originally announced May 2025.

  37. arXiv:2505.03827  [pdf, other]

    cs.LG cs.AI

    MISE: Meta-knowledge Inheritance for Social Media-Based Stressor Estimation

    Authors: Xin Wang, Ling Feng, Huijun Zhang, Lei Cao, Kaisheng Zeng, Qi Li, Yang Ding, Yi Dai, David Clifton

    Abstract: Stress haunts people in modern society, which may cause severe health issues if left unattended. With social media becoming an integral part of daily life, leveraging social media to detect stress has gained increasing attention. While the majority of the work focuses on classifying stress states and stress categories, this study introduces a new task aimed at estimating more specific stressors (li…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: WWW2025, Oral Presentation

  38. arXiv:2505.03756  [pdf, other]

    cs.AR cs.AI cs.LG cs.PF

    Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

    Authors: Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo

    Abstract: Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance like Time-To-First-Token (TTFT), neglecting usage dep…

    Submitted 19 April, 2025; originally announced May 2025.

  39. arXiv:2505.03426  [pdf, other]

    cs.CV cs.AI

    Phenotype-Guided Generative Model for High-Fidelity Cardiac MRI Synthesis: Advancing Pretraining and Clinical Applications

    Authors: Ziyu Li, Yujian Hu, Zhengyao Ding, Yiheng Mao, Haitao Li, Fan Yi, Hongkun Zhang, Zhengxing Huang

    Abstract: Cardiac Magnetic Resonance (CMR) imaging is a vital non-invasive tool for diagnosing heart diseases and evaluating cardiac health. However, the limited availability of large-scale, high-quality CMR datasets poses a major challenge to the effective application of artificial intelligence (AI) in this domain. Even the amount of unlabeled data and the health status it covers are difficult to meet the…

    Submitted 6 May, 2025; originally announced May 2025.

  40. arXiv:2505.03334  [pdf, other]

    cs.CV cs.DB

    From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

    Authors: Guoting Wei, Yu Liu, Xia Yuan, Xizhe Xue, Linlin Guo, Yifan Yang, Chunxia Zhao, Zongwen Bai, Haokui Zhang, Rong Xiao

    Abstract: In recent years, language-guided open-world aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we propose constructing a larg…

    Submitted 6 May, 2025; originally announced May 2025.

  41. arXiv:2505.03320  [pdf, other]

    cs.CL

    Recall with Reasoning: Chain-of-Thought Distillation for Mamba's Long-Context Memory and Extrapolation

    Authors: Junyu Ma, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu

    Abstract: Mamba's theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba's long-context memory ability by a simple-yet-effective method, Recall with Reasoning (RwR), by distilling chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarizations as CoT prompts during fine-tuning, tea…

    Submitted 6 May, 2025; originally announced May 2025.

  42. arXiv:2505.02867  [pdf, other]

    cs.CV

    RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

    Authors: Ruiqi Wang, Hao Zhang

    Abstract: We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our m…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: 42 pages, 31 figures. For more details: https://suikei-wang.github.io/RESAnything/

  43. arXiv:2505.02573  [pdf, other]

    cs.LG cs.AI cs.DB cs.SI

    Rethinking Federated Graph Learning: A Data Condensation Perspective

    Authors: Hao Zhang, Xunkai Li, Yinlin Zhu, Lianglin Hu

    Abstract: Federated graph learning is a widely recognized technique that promotes collaborative training of graph neural networks (GNNs) by multi-client graphs. However, existing approaches heavily rely on the communication of model parameters or gradients for federated optimization and fail to adequately address the data heterogeneity introduced by intricate and diverse graph distributions. Although some me…

    Submitted 5 May, 2025; originally announced May 2025.

  44. arXiv:2505.02391  [pdf, other]

    cs.LG cs.AI cs.CL

    Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

    Authors: Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang

    Abstract: Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty…

    Submitted 5 May, 2025; originally announced May 2025.

  45. arXiv:2505.02383  [pdf, ps, other]

    cs.LG

    Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

    Authors: Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde

    Abstract: We address differentially private stochastic bandit problems from the angles of exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables to trade off privacy and regret. DP-TS-UCB satisfies $\tilde{O}\left(T^{0.25(1-\alpha)}\right)$-GDP and…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  46. arXiv:2505.02126  [pdf, other]

    cs.CV

    GarmentGS: Point-Cloud Guided Gaussian Splatting for High-Fidelity Non-Watertight 3D Garment Reconstruction

    Authors: Zhihao Tang, Shenghao Yang, Hongtao Zhang, Mingbo Zhao

    Abstract: Traditional 3D garment creation requires extensive manual operations, resulting in time and labor costs. Recently, 3D Gaussian Splatting has achieved breakthrough progress in 3D scene reconstruction and rendering, attracting widespread attention and opening new pathways for 3D garment reconstruction. However, due to the unstructured and irregular nature of Gaussian primitives, it is difficult to r…

    Submitted 14 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

    Comments: Accepted by ICMR 2025

  47. arXiv:2505.02064  [pdf, other]

    cs.CV

    RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

    Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

    Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timesta…

    Submitted 5 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

    Comments: 13 pages, 4 figures, 5 tables

  48. arXiv:2505.02013  [pdf, ps, other]

    cs.CV

    MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

    Authors: Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei

    Abstract: Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results,…

    Submitted 4 May, 2025; originally announced May 2025.

  49. arXiv:2505.01974  [pdf, other]

    cs.RO

    KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation

    Authors: Di Zhang, Chengbo Yuan, Chuan Wen, Hai Zhang, Junqiao Zhao, Yang Gao

    Abstract: Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisi…

    Submitted 3 May, 2025; originally announced May 2025.

  50. arXiv:2505.01746  [pdf, other]

    cs.CV

    Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

    Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

    Abstract: Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent gesture modeling with two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling t…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Accepted as ICLR 2025 (Spotlight)