Showing 201–250 of 3,625 results for author: Hang

  1. arXiv:2503.23367  [pdf, other]

    cs.CV

    FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

    Authors: Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, Luca Benini

    Abstract: Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our ke…

    Submitted 6 April, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

    Comments: Technical Report

  2. arXiv:2503.23365  [pdf, other]

    cs.CV cs.RO

    OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users

    Authors: Zhangcun Yan, Jianqing Li, Peng Hang, Jian Sun

    Abstract: With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing th…

    Submitted 30 March, 2025; originally announced March 2025.

  3. arXiv:2503.22976  [pdf, other]

    cs.CV

    From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

    Authors: Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang

    Abstract: Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a no…

    Submitted 27 May, 2025; v1 submitted 29 March, 2025; originally announced March 2025.

    Comments: Project page: https://fudan-zvg.github.io/spar

  4. arXiv:2503.22875  [pdf, other]

    cs.DC cs.PF

    A Pilot Study on Tunable Precision Emulation via Automatic BLAS Offloading

    Authors: Hang Liu, Junjie Li, Yinzhi Wang

    Abstract: This study explores the use of automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Through the use of low-bitwidth integer units and cache-coherent Unified Memory Architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both arithmetic precis…

    Submitted 2 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  5. arXiv:2503.22021  [pdf, other]

    stat.ME

    Discussion of "Robust Distance Covariance" by S. Leyder, J. Raymaekers, and P.J. Rousseeuw

    Authors: Hallin Marc, Davide La Vecchia, Hang Liu, Xinyi Xu

    Abstract: Distance covariance and distance correlation have long been regarded as natural measures of dependence between two random vectors, and have been used in a variety of situations for testing independence. Despite their popularity, the robustness of their empirical versions remains largely unexplored. The paper named "Robust Distance Covariance" by S. Leyder, J. Raymaekers, and P.J. Rousseeuw (below…

    Submitted 27 March, 2025; originally announced March 2025.

  6. arXiv:2503.21991  [pdf, other]

    cs.CV cs.AI cs.GR

    BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

    Authors: Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

    Abstract: In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: CVPR 2025. Project page: https://ryanhangzhou.github.io/bootplace/ , code: https://github.com/RyanHangZhou/BOOTPLACE

  7. arXiv:2503.21843  [pdf, other]

    cs.CV cs.AI

    CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition

    Authors: Hanyu Liu, Siyao Li, Ying Yu, Yixuan Jiang, Hang Xiao, Jingxi Long, Haotian Tang

    Abstract: Human Activity Recognition (HAR) is a fundamental technology for numerous human-centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. The aim of this paper is to address issues such as multimodal data mixing, activ…

    Submitted 27 March, 2025; originally announced March 2025.

  8. arXiv:2503.21823  [pdf, other]

    cs.CV

    Low-Rank Adaptation of Pre-Trained Stable Diffusion for Rigid-Body Target ISAR Imaging

    Authors: Boan Zhang, Hang Dong, Jiongge Zhang, Long Tian, Rongrong Wang, Zhenhua Wu, Xiyang Liu, Hongwei Liu

    Abstract: Traditional range-instantaneous Doppler (RID) methods for rigid-body target imaging often suffer from low resolution due to the limitations of time-frequency analysis (TFA). To address this challenge, our primary focus is on obtaining high resolution time-frequency representations (TFRs) from their low resolution counterparts. Recognizing that the curve features of TFRs are a specific type of text…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: 4 pages, IGARSS 2025

  9. arXiv:2503.21805  [pdf, other]

    cs.CL cs.AI

    ImF: Implicit Fingerprint for Large Language Models

    Authors: Wu Jiaxuan, Peng Wanli, Fu Hang, Xue Yiming, Wen Juan

    Abstract: Training large language models (LLMs) is resource-intensive and expensive, making protecting intellectual property (IP) for LLMs crucial. Recently, embedding fingerprints into LLMs has emerged as a prevalent method for establishing model ownership. However, existing fingerprinting techniques typically embed identifiable patterns with weak semantic coherence, resulting in fingerprints that signific…

    Submitted 17 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: 13 pages, 6 figures

  10. arXiv:2503.21696  [pdf, other]

    cs.CL cs.CV

    Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

    Authors: Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang

    Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1 style reasoning to interactive embodied s…

    Submitted 14 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

    Comments: Code: https://github.com/zwq2018/embodied_reasoner Dataset: https://huggingface.co/datasets/zwq2018/embodied_reasoner

  11. arXiv:2503.21246  [pdf, other]

    cs.CV

    DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

    Authors: Haoyu Zhao, Zhongang Qi, Cong Wang, Qingping Zheng, Guansong Lu, Fei Chen, Hang Xu, Zuxuan Wu

    Abstract: With diffusion transformer (DiT) excelling in video generation, its use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degrada…

    Submitted 18 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

    Comments: 16 pages, 11 figures

  12. arXiv:2503.21099  [pdf, ps, other]

    cs.CV

    Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection

    Authors: Yun Zhu, Le Hui, Hang Yang, Jianjun Qian, Jin Xie, Jian Yang

    Abstract: Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objec…

    Submitted 13 June, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  13. arXiv:2503.19824  [pdf, other]

    cs.CV cs.GR cs.MM

    AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

    Authors: Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, Errui Ding, Jingdong Wang, Youjian Zhao, Hang Zhou, Ziwei Liu

    Abstract: Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. given audio. In this work, we propose AudCast, a generalized audio-driven human vi…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Project page: https://guanjz20.github.io/projects/AudCast

  14. arXiv:2503.19656  [pdf, other]

    cs.LG cs.AI

    Towards Reliable Time Series Forecasting under Future Uncertainty: Ambiguity and Novelty Rejection Mechanisms

    Authors: Ninghui Feng, Songning Lai, Xin Zhou, Jiayu Yang, Kunlong Feng, Zhenxiao Yin, Fobao Zhou, Zhangyi Hu, Yutao Yue, Yuxuan Liang, Boyu Wang, Hang Zhao

    Abstract: In real-world time series forecasting, uncertainty and lack of reliable evaluation pose significant challenges. Notably, forecasting errors often arise from underfitting in-distribution data and failing to handle out-of-distribution inputs. To enhance model reliability, we introduce a dual rejection mechanism combining ambiguity and novelty rejection. Ambiguity rejection, using prediction error va…

    Submitted 25 March, 2025; originally announced March 2025.

  15. arXiv:2503.19114  [pdf, other]

    cs.CL cs.IR cs.LG

    Understanding and Improving Information Preservation in Prompt Compression for LLMs

    Authors: Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez

    Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced t…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 21 pages, 6 figures, 23 tables

  16. arXiv:2503.18738  [pdf, other]

    cs.RO

    RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation

    Authors: Chengbo Yuan, Suraj Joshi, Shaoting Zhu, Hang Su, Hang Zhao, Yang Gao

    Abstract: Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Project Page: https://roboengine.github.io/

  17. arXiv:2503.18114  [pdf, other]

    cs.LG cs.NE q-bio.NC

    Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

    Authors: Chi-Ning Chou, Hang Le, Yichen Wang, SueYeon Chung

    Abstract: The ability to integrate task-relevant information into neural representations is a fundamental aspect of both biological and artificial intelligence. To enable theoretical analysis, recent work has examined whether a network learns task-relevant features (rich learning) or resembles a random feature model (or a kernel machine, i.e., lazy learning). However, this simple lazy-versus-rich dichotomy…

    Submitted 23 March, 2025; originally announced March 2025.

  18. arXiv:2503.18065  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

    Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang

    Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires exte…

    Submitted 23 March, 2025; originally announced March 2025.

  19. arXiv:2503.17793  [pdf, other]

    cs.LG cs.AI cs.CL

    Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

    Authors: Codefuse, Ling Team, Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, et al. (8 additional authors not shown)

    Abstract: Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the Deep…

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: 20 pages, 6 figures

    ACM Class: I.2.7

  20. arXiv:2503.16975  [pdf, other]

    cs.CV

    EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision

    Authors: Xiaofeng Mao, Yuefeng Chen, Rong Zhang, Hui Xue, Zhao Li, Hang Su

    Abstract: Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs cannot be as robust as human perception. Adversarial attacks and data distribution shifts have been known as two major scenarios which degrade machine performance and hinder the wide deployment of machines "in the wild". In order to break these obstructions and facilitate the re…

    Submitted 21 March, 2025; originally announced March 2025.

  21. arXiv:2503.16942  [pdf, other]

    cs.CV

    Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model

    Authors: Yingying Fan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Errui Ding, Yu Wu, Jingdong Wang

    Abstract: Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their inter…

    Submitted 25 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  22. arXiv:2503.16552  [pdf, other]

    cs.RO cs.MA

    A Vehicle-Infrastructure Multi-layer Cooperative Decision-making Framework

    Authors: Yiming Cui, Shiyu Fang, Peng Hang, Jian Sun

    Abstract: Autonomous driving has entered the testing phase, but due to the limited decision-making capabilities of individual vehicle algorithms, safety and efficiency issues have become more apparent in complex scenarios. With the advancement of connected communication technologies, autonomous vehicles equipped with connectivity can leverage vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) comm…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 7 pages, 6 figures

  23. arXiv:2503.15965  [pdf]

    q-fin.PM cs.CE

    Practical Portfolio Optimization with Metaheuristics: Pre-assignment Constraint and Margin Trading

    Authors: Hang Kin Poon

    Abstract: Portfolio optimization is a critical area in finance, aiming to maximize returns while minimizing risk. Metaheuristic algorithms were shown to solve complex optimization problems efficiently, with Genetic Algorithms and Particle Swarm Optimization being among the most popular methods. This paper introduces an innovative approach to portfolio optimization that incorporates pre-assignment to limit t…

    Submitted 20 March, 2025; originally announced March 2025.

  24. arXiv:2503.15831  [pdf, other]

    cs.CV

    EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

    Authors: Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, Zuxuan Wu

    Abstract: Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-…

    Submitted 9 May, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  25. arXiv:2503.15454  [pdf, other]

    cs.CL

    Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems

    Authors: Yuelyu Ji, Hang Zhang, Yanshan Wang

    Abstract: Medical Question Answering systems based on Retrieval Augmented Generation are promising for clinical decision support because they can integrate external knowledge, thus reducing inaccuracies inherent in standalone large language models (LLMs). However, these systems may unintentionally propagate or amplify biases associated with sensitive demographic attributes like race, gender, and socioeconomi…

    Submitted 26 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

  26. arXiv:2503.14887  [pdf, ps, other]

    cs.IR cs.LG

    Pseudo Relevance Feedback is Enough to Close the Gap Between Small and Large Dense Retrieval Models

    Authors: Hang Li, Xiao Wang, Bevan Koopman, Guido Zuccon

    Abstract: Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. However, this has substantial cost implications: larger backbones require more expensive hardware (e.g. GPUs with more memory) and lead to higher indexing and querying costs (latency, energy consumption). In this paper, we challenge this paradigm by intr…

    Submitted 5 June, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

  27. arXiv:2503.14701  [pdf, other]

    cs.RO cs.CV

    ARC-Calib: Autonomous Markerless Camera-to-Robot Calibration via Exploratory Robot Motions

    Authors: Podshara Chanrungmaneekul, Yiting Chen, Joshua T. Grace, Aaron M. Dollar, Kaiyu Hang

    Abstract: Camera-to-robot (also known as eye-to-hand) calibration is a critical component of vision-based robot manipulation. Traditional marker-based methods often require human intervention for system setup. Furthermore, existing autonomous markerless calibration methods typically rely on pre-trained robot tracking models that impede their application on edge devices and require fine-tuning for novel robo…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 8 pages, 9 figures

  28. arXiv:2503.14489  [pdf, other]

    cs.CV

    Stable Virtual Camera: Generative View Synthesis with Diffusion Models

    Authors: Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani

    Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe,…

    Submitted 1 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

  29. arXiv:2503.14476  [pdf, other]

    cs.LG cs.CL

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, et al. (10 additional authors not shown)

    Abstract: Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecouple…

    Submitted 19 May, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: Project Page: https://dapo-sia.github.io/

  30. arXiv:2503.14155  [pdf]

    physics.optics physics.class-ph

    Electromagnetic Duality Symmetry-Protected Dirac-Like Cones

    Authors: Muxuan Yang, Dongyang Yan, Lei Gao, Wei Liu, Yun Lai, Yadong Xu, Zhi Hong Hang, Jie Luo

    Abstract: Dirac-like cones, featuring conical linear dispersions intersecting with flat bands, typically arise from accidental degeneracy of multiple modes that requires precise tuning of material and structural parameters, inherently limiting their robustness and applications. In this work, by introducing electromagnetic duality symmetry into photonic crystals, we demonstrate the emergence of intrinsically…

    Submitted 18 March, 2025; originally announced March 2025.

  31. arXiv:2503.13940  [pdf, other]

    cs.CV eess.SP

    Multi-Modal Self-Supervised Semantic Communication

    Authors: Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief

    Abstract: Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge,…

    Submitted 18 March, 2025; originally announced March 2025.

  32. arXiv:2503.13288  [pdf, other]

    cs.LG cs.AI cs.CL

    $φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

    Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu

    Abstract: Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sa…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: 13 pages, 6 figures

  33. arXiv:2503.13178  [pdf, other]

    cs.AI cs.LG

    Rapfi: Distilling Efficient Neural Network for the Game of Gomoku

    Authors: Zhanggen Jin, Haobin Duan, Zhiyang Hang

    Abstract: Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a…

    Submitted 17 March, 2025; originally announced March 2025.

  34. arXiv:2503.12908  [pdf, other]

    cs.CL cs.AI

    HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

    Authors: Xinyan Jiang, Hang Ye, Yongxin Zhu, Xiaoying Zheng, Zikang Chen, Jun Gong

    Abstract: Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hal…

    Submitted 23 May, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: Accepted by ACL 2025 Findings

  35. arXiv:2503.12668  [pdf, other]

    cs.LG cs.PF

    ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

    Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

    Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, el…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: 14 pages, 7 figures

  36. arXiv:2503.12149  [pdf, other]

    cs.CL cs.MM cs.SI

    Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

    Authors: Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu

    Abstract: With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datase…

    Submitted 15 March, 2025; originally announced March 2025.

  37. arXiv:2503.12037  [pdf, other]

    cs.LG cs.AI

    Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning

    Authors: Hang Ni, Jindong Han, Nengjun Zhu, Hao Liu

    Abstract: Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approaches have demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals…

    Submitted 15 March, 2025; originally announced March 2025.

  38. arXiv:2503.11929  [pdf, ps, other]

    math.OC math.AP

    Local controllability of a free-boundary problem for 1D degenerate parabolic equations

    Authors: Lingyang Liu, Hang Gao

    Abstract: This paper deals with the local controllability of a free-boundary problem for the 1D boundary-degenerate parabolic equation with distributed controls, locally supported in space. We prove that, if the final time T is fixed and the initial state is sufficiently small, there exist controls that drive the state exactly to rest at time t = T. The proof is based on Schauder's fixed point theorem, comb…

    Submitted 3 May, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  39. arXiv:2503.11837  [pdf, other]

    gr-qc astro-ph.HE

    Resonance locking: radian-level phase shifts due to nonlinear hydrodynamics of $g$-modes in merging neutron star binaries

    Authors: K. J. Kwon, Hang Yu, Tejaswi Venumadhav

    Abstract: A neutron star (NS) in a binary system deforms due to the companion's tidal gravitational field. As the binary inspirals due to gravitational wave (GW) emission, the NS's deformation evolves; this evolution is typically modeled as the star's linear response to the companion's time-evolving tidal potential. In principle, the fluid elements' displacements can be excited and evolve nonlinearly since…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 22 pages, 9 figures

  40. arXiv:2503.11465  [pdf, other]

    cs.CV

    Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios

    Authors: Hang Shao, Lei Luo, Jianjun Qian, Mengkai Yan, Shuo Chen, Jian Yang

    Abstract: Physiological activities can be manifested by subtle changes in facial imaging. While these changes are barely observable to our eyes, computer vision methods can detect them, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under…

    Submitted 14 March, 2025; originally announced March 2025.

  41. arXiv:2503.11047  [pdf, other]

    quant-ph

    Quantum ensemble learning with a programmable superconducting processor

    Authors: Jiachen Chen, Yaozu Wu, Zhen Yang, Shibo Xu, Xuan Ye, Daili Li, Ke Wang, Chuanyu Zhang, Feitong Jin, Xuhao Zhu, Yu Gao, Ziqi Tan, Zhengyi Cui, Aosai Zhang, Ning Wang, Yiren Zou, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Zixuan Song, Jinfeng Deng, Hang Dong, Pengfei Zhang, et al. (8 additional authors not shown)

    Abstract: Quantum machine learning is among the most exciting potential applications of quantum computing. However, the vulnerability of quantum information to environmental noises and the consequent high cost for realizing fault tolerance has impeded the quantum models from learning complex datasets. Here, we introduce AdaBoost.Q, a quantum adaptation of the classical adaptive boosting (AdaBoost) algorithm…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 9 pages, 4 figures

  42. arXiv:2503.10630  [pdf, other]

    cs.CV cs.RO

    UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

    Authors: Hang Yin, Xiuwei Xu, Lingqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

    Abstract: In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference framework upon large language models (LLM) for specific tasks, which differs a lot in overall pipeline and fails to generalize across different types of goal. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify…

    Submitted 18 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025. Project page: https://bagh2178.github.io/UniGoal/

  43. arXiv:2503.10434  [pdf, other]

    cs.RO cs.CV cs.LG

    Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback

    Authors: Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Ningyi Xu, Hang Zhao

    Abstract: Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of human driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework fo…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 10 pages, 5 figures

  44. arXiv:2503.10196  [pdf, other]

    math.NA

    A filtered Lie splitting method for the Zakharov system with low regularity estimates

    Authors: Lun Ji, Hang Li, Chunmei Su

    Abstract: In this paper, we present an error estimate for the filtered Lie splitting scheme applied to the Zakharov system, characterized by solutions exhibiting very low regularity across all dimensions. Our findings are derived from the application of multilinear estimates established within the framework of discrete Bourgain spaces. Specifically, we demonstrate that when the solution…

    Submitted 13 March, 2025; originally announced March 2025.

    MSC Class: 65M12; 65M15; 65T50

  45. arXiv:2503.09942  [pdf, other]

    cs.CV

    Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

    Authors: Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki

    Abstract: Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and co…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Project Page: https://sunyasheng.github.io/projects/COSH-DIT

  46. arXiv:2503.09642  [pdf, other]

    cs.GR cs.AI

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Authors: Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, et al. (7 additional authors not shown)

    Abstract: Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-pe…

    Submitted 23 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  47. arXiv:2503.08564  [pdf, other]

    cs.RO cs.AI

    MoE-Loco: Mixture of Experts for Multitask Locomotion

    Authors: Runhan Huang, Shaoting Zhu, Yilun Du, Hang Zhao

    Abstract: We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficien…

    Submitted 20 May, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

    Comments: 9 pages, 10 figures

  48. arXiv:2503.08471  [pdf, other]

    cs.CV

    TrackOcc: Camera-based 4D Panoptic Occupancy Tracking

    Authors: Zhuoguang Chen, Kenan Li, Xiuyu Yang, Tao Jiang, Yiming Li, Hang Zhao

    Abstract: Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addre…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted at ICRA 2025

  49. arXiv:2503.08461  [pdf, other]

    cs.MM cs.DC

    FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework

    Authors: Jianian Zhu, Hang Wu, Haojie Wang, Yinghui Li, Biao Hou, Ruixuan Li, Jidong Zhai

    Abstract: Multi-modal Large Language Models (MLLMs) serving systems commonly employ KV-cache compression to reduce memory footprint. However, existing compression methods introduce significant processing overhead and queuing delays, particularly in concurrent serving scenarios. We present FastCache, a novel serving framework that effectively addresses these challenges through two key innovations: (…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 14 pages, 14 figures

  50. arXiv:2503.07485  [pdf, other]

    cs.CV

    Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction

    Authors: Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao

    Abstract: Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existi…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: ICRA 2025, Project Page: https://github.com/XR-Lee/neural-symbolic