Showing 1–50 of 1,286 results for author: Zhang, P

Searching in archive cs.
  1. arXiv:2505.10558  [pdf, ps, other]

    cs.GR cs.CV

    Style Customization of Text-to-Vector Generation with Image Diffusion Priors

    Authors: Peiying Zhang, Nanxuan Zhao, Jing Liao

    Abstract: Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Accepted by SIGGRAPH 2025 (Conference Paper). Project page: https://customsvg.github.io

  2. arXiv:2505.09388  [pdf, other]

    cs.CL

    Qwen3 Technical Report

    Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, et al. (35 additional authors not shown)

    Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…

    Submitted 14 May, 2025; originally announced May 2025.

  3. arXiv:2505.08838  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

    Authors: Peixuan Ge, Tongkun Su, Faqin Lv, Baoliang Zhao, Peng Zhang, Chi Hong Wong, Liang Yao, Yu Sun, Zenan Wang, Pak Kin Wong, Ying Hu

    Abstract: Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveragi…

    Submitted 13 May, 2025; originally announced May 2025.

  4. arXiv:2505.06520  [pdf, other]

    cs.LG cs.AI cs.CR

    PRUNE: A Patching Based Repair Framework for Certifiable Unlearning of Neural Networks

    Authors: Xuran Li, Jingyi Wang, Xiaohan Yuan, Peixin Zhang, Zhan Qin, Zhibo Wang, Kui Ren

    Abstract: It is often desirable to remove (a.k.a. unlearn) a specific part of the training data from a trained neural network model. A typical application scenario is to protect the data holder's right to be forgotten, which has been promoted by many recent regulation rules. Existing unlearning methods involve training alternative models with remaining data, which may be costly and challenging to verify fro…

    Submitted 10 May, 2025; originally announced May 2025.

  5. arXiv:2505.02823  [pdf, other]

    cs.CV

    MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing

    Authors: Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He

    Abstract: Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR - a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data l…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Project page at https://github.com/guozinan126/MUSAR

  6. arXiv:2505.02753  [pdf, other]

    cs.CV

    Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models

    Authors: Yankai Jiang, Peng Zhang, Donglin Yang, Yuan Tian, Hai Lin, Xiaosong Wang

    Abstract: We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as hig…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: This paper is accepted to CVPR 2025

  7. arXiv:2505.02186  [pdf]

    cs.CE

    Probabilistic Method for Optimizing Submarine Search and Rescue Strategy Under Environmental Uncertainty

    Authors: Runhao Liu, Ziming Chen, Peng Zhang

    Abstract: When coping with the urgent challenge of locating and rescuing a deep-sea submersible in the event of communication or power failure, environmental uncertainty in the ocean cannot be ignored. However, classic physical models are limited to deterministic scenarios. Therefore, we present a hybrid algorithm framework combined with dynamic analysis for target submarine, Monte Carlo and Bayesian metho…

    Submitted 4 May, 2025; originally announced May 2025.

  8. arXiv:2505.02096  [pdf, other]

    cs.MM

    TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing

    Authors: Yaru Chen, Peiliang Zhang, Fei Li, Faegheh Sardari, Ruohao Guo, Zhenbo Li, Wenwu Wang

    Abstract: Audio-Visual Video Parsing (AVVP) task aims to parse the event categories and occurrence times from audio and visual modalities in a given video. Existing methods usually focus on implicitly modeling audio and visual features through weak labels, without mining semantic relationships for different modalities and explicit modeling of event temporal dependencies. This makes it difficult for the mode…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Accepted by ICMR 2025

  9. arXiv:2505.01286  [pdf, other]

    cs.LG cs.AI

    2DXformer: Dual Transformers for Wind Power Forecasting with Dual Exogenous Variables

    Authors: Yajuan Zhang, Jiahai Jiang, Yule Yan, Liang Yang, Ping Zhang

    Abstract: Accurate wind power forecasting can help formulate scientific dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. Howev…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: Accepted by ICDM 2024

  10. arXiv:2504.21826  [pdf, other]

    cs.RO

    An Underwater, Fault-Tolerant, Laser-Aided Robotic Multi-Modal Dense SLAM System for Continuous Underwater In-Situ Observation

    Authors: Yaming Ou, Junfeng Fan, Chao Zhou, Pengju Zhang, Zongyuan Shen, Yichen Fu, Xiaoyan Liu, Zengguang Hou

    Abstract: Existing underwater SLAM systems struggle to work effectively in texture-sparse and geometrically degraded underwater environments, resulting in intermittent tracking and sparse mapping. Therefore, we present Water-DSLAM, a novel laser-aided multi-sensor fusion system that can achieve uninterrupted, fault-tolerant dense SLAM capable of continuous in-situ observation in diverse complex underwa…

    Submitted 30 April, 2025; originally announced April 2025.

  11. arXiv:2504.21466  [pdf, other]

    cs.IT

    Semantic-aided Parallel Image Transmission Compatible with Practical System

    Authors: Mingkai Xu, Yongpeng Wu, Yuxuan Shi, Xiang-Gen Xia, Merouane Debbah, Wenjun Zhang, Ping Zhang

    Abstract: In this paper, we propose a novel semantic-aided image communication framework for supporting the compatibility with practical separation-based coding architectures. Particularly, the deep learning (DL)-based joint source-channel coding (JSCC) is integrated into the classical separate source-channel coding (SSCC) to transmit the images via the combination of semantic stream and image stream from D…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted by IEEE Transactions on Wireless Communications

  12. arXiv:2504.20073  [pdf, other]

    cs.LG cs.AI cs.CL

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

    Abstract: Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for t…

    Submitted 24 April, 2025; originally announced April 2025.

  13. arXiv:2504.19600  [pdf]

    cs.CV cs.AI

    Image Generation Method Based on Heat Diffusion Models

    Authors: Pengfei Zhang, Shouqing Jia

    Abstract: Denoising Diffusion Probabilistic Models (DDPMs) achieve high-quality image generation without adversarial training, but they process images as a whole. Since adjacent pixels are highly likely to belong to the same object, we propose the Heat Diffusion Model (HDM) to further preserve image details and generate more realistic images. HDM is a model that incorporates pixel-level operations while mai…

    Submitted 28 April, 2025; originally announced April 2025.

  14. arXiv:2504.18458  [pdf, other]

    cs.CL cs.AI cs.CV

    Fast-Slow Thinking for Large Vision-Language Model Reasoning

    Authors: Wenyi Xiao, Leilei Gan, Weilong Dai, Wanggui He, Ziwei Huang, Haoyuan Li, Fangxun Shu, Zhelun Yu, Peng Zhang, Hao Jiang, Fei Wu

    Abstract: Recent advances in large vision-language models (LVLMs) have revealed an overthinking phenomenon, where models generate verbose reasoning across all tasks regardless of questions. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analy…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: 16 pages, 5 figures, and 12 tables

  15. arXiv:2504.18428  [pdf, other]

    cs.CL

    PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

    Authors: Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou

    Abstract: In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced L…

    Submitted 30 April, 2025; v1 submitted 25 April, 2025; originally announced April 2025.

    Comments: Work in Progress

  16. arXiv:2504.17789  [pdf, other]

    cs.CV

    Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

    Authors: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu

    Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a n…

    Submitted 27 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

    Comments: Project Page: https://ma-xu.github.io/token-shuffle/ Add related works

  17. arXiv:2504.17457  [pdf, other]

    cs.CV

    Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

    Authors: Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong

    Abstract: Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 14 pages, 7 figures

  18. arXiv:2504.17224  [pdf, other]

    cs.CV

    Visual and textual prompts for enhancing emotion recognition in video

    Authors: Zhifeng Wang, Qixuan Zhang, Peter Zhang, Wenjia Niu, Kaihao Zhang, Ramesh Sankaranarayana, Sabrina Caldwell, Tom Gedeon

    Abstract: Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, lead…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 12 pages, 10 figures

  19. arXiv:2504.16915  [pdf, other]

    cs.CV

    DreamO: A Unified Framework for Image Customization

    Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu

    Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of condition. Developing a unified framework for image customization remains an open challenge.…

    Submitted 13 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  20. arXiv:2504.16411  [pdf, other]

    cs.CL

    Out-of-the-Box Conditional Text Embeddings from Large Language Models

    Authors: Kosuke Yamada, Peinan Zhang

    Abstract: Conditional text embedding is a proposed representation that captures the shift in perspective on texts when conditioned on a specific aspect. Previous methods have relied on extensive training data for fine-tuning models, leading to challenges in terms of labor and resource costs. We propose PonTE, a novel unsupervised conditional text embedding method that leverages a causal large language model…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: work in progress

  21. arXiv:2504.14509  [pdf, other]

    cs.CV cs.AI

    DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

    Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu

    Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing…

    Submitted 24 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: Project: https://superhero-7.github.io/DreamID/

  22. arXiv:2504.14208  [pdf, other]

    cs.IR

    FedCIA: Federated Collaborative Information Aggregation for Privacy-Preserving Recommendation

    Authors: Mingzhe Han, Dongsheng Li, Jiafeng Xia, Jiahao Liu, Hansu Gu, Peng Zhang, Ning Gu, Tun Lu

    Abstract: Recommendation algorithms rely on user historical interactions to deliver personalized suggestions, which raises significant privacy concerns. Federated recommendation algorithms tackle this issue by combining local model training with server-side model aggregation, where most existing algorithms use a uniform weighted summation to aggregate item embeddings from different client models. This appro…

    Submitted 19 April, 2025; originally announced April 2025.

  23. arXiv:2504.14174  [pdf, other]

    cs.LG cs.AI

    A Physics-guided Multimodal Transformer Path to Weather and Climate Sciences

    Authors: Jing Han, Hanting Chen, Kai Han, Xiaomeng Huang, Yongyun Hu, Wenjun Xu, Dacheng Tao, Ping Zhang

    Abstract: With the rapid development of machine learning in recent years, many problems in meteorology can now be addressed using AI models. In particular, data-driven algorithms have significantly improved accuracy compared to traditional methods. Meteorological data is often transformed into 2D images or 3D videos, which are then fed into AI models for learning. Additionally, these models often incorporat…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: Perspective article

  24. arXiv:2504.12331  [pdf, other]

    cs.CL cs.AI

    Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

    Authors: Xiangju Li, Dong Yang, Xiaogang Zhu, Faliang Huang, Peng Zhang, Zhongying Zhao

    Abstract: Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause dete…

    Submitted 13 April, 2025; originally announced April 2025.

  25. arXiv:2504.10974  [pdf, other]

    cs.CV eess.IV

    Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

    Authors: Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun

    Abstract: Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect t…

    Submitted 16 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  26. arXiv:2504.10918  [pdf, other]

    cs.HC

    Adaptive Human-Agent Teaming: A Review of Empirical Studies from the Process Dynamics Perspective

    Authors: Mengyao Wang, Jiayun Wu, Shuai Ma, Nuo Li, Peng Zhang, Ning Gu, Tun Lu

    Abstract: The rapid advancement of AI, including Large Language Models, has propelled autonomous agents forward, accelerating the human-agent teaming (HAT) paradigm to leverage complementary strengths. However, HAT research remains fragmented, often focusing on isolated team development phases or specific challenges like trust calibration while overlooking the real-world need for adaptability. Addressing th…

    Submitted 15 April, 2025; originally announced April 2025.

  27. arXiv:2504.09549  [pdf, other]

    cs.CV

    SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

    Authors: Xiang Hu, Pingping Zhang, Yuhao Wang, Bin Yan, Huchuan Lu

    Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they ove…

    Submitted 13 April, 2025; originally announced April 2025.

  28. arXiv:2504.09213  [pdf, ps, other]

    cs.HC cs.LG cs.NE

    Spiking Neural Network for Intra-cortical Brain Signal Decoding

    Authors: Song Yang, Haotian Fu, Herui Zhang, Peng Zhang, Wei Li, Dongrui Wu

    Abstract: Decoding brain signals accurately and efficiently is crucial for intra-cortical brain-computer interfaces. Traditional decoding approaches based on neural activity vector features suffer from low accuracy, whereas deep learning based approaches have high computational cost. To improve both the decoding accuracy and efficiency, this paper proposes a spiking neural network (SNN) for effective and en…

    Submitted 12 April, 2025; originally announced April 2025.

  29. arXiv:2504.09060  [pdf, other]

    cs.LG cs.AI q-bio.GN

    Multimodal 3D Genome Pre-training

    Authors: Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei Zhang, Zhi-An Huang, Hui Xiong

    Abstract: Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semant…

    Submitted 11 April, 2025; originally announced April 2025.

  30. arXiv:2504.08255  [pdf]

    cs.NI cs.ET

    CICV5G: A 5G Communication Delay Dataset for PnC in Cloud-based Intelligent Connected Vehicles

    Authors: Xinrui Zhang, Peizhi Zhang, Junpeng Huang, Haojie Feng, Yining Ma, Feng Shen, Lu Xiong

    Abstract: Cloud-based intelligent connected vehicles (CICVs) leverage cloud computing and vehicle-to-everything (V2X) to enable efficient information exchange and cooperative control. However, communication delay is a critical factor in vehicle-cloud interactions, potentially deteriorating the planning and control (PnC) performance of CICVs. To explore whether the new generation of communication technology,…

    Submitted 11 April, 2025; originally announced April 2025.

  31. arXiv:2504.07957  [pdf, other]

    cs.CV

    MM-IFEngine: Towards Multimodal Instruction Following

    Authors: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To addre…

    Submitted 27 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  32. arXiv:2504.07524  [pdf, other]

    cs.CV

    DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction

    Authors: Xu Zhao, Pengju Zhang, Bo Liu, Yihong Wu

    Abstract: Monocular 3D occupancy prediction, aiming to predict the occupancy and semantics within interesting regions of 3D scenes from only 2D images, has garnered increasing attention recently for its vital role in 3D scene understanding. Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive. In this paper, we present DGOcc, a Depth-…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: under review

  33. arXiv:2504.06232  [pdf, other]

    cs.CV

    HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

    Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential…

    Submitted 8 April, 2025; originally announced April 2025.

  34. arXiv:2504.05613  [pdf, other]

    cs.CV

    Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation

    Authors: Xiao Zhang, Xiangyu Han, Xiwen Lai, Yao Sun, Pei Zhang, Konrad Kording

    Abstract: Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both ma…

    Submitted 7 April, 2025; originally announced April 2025.

  35. arXiv:2504.03939  [pdf, other]

    cs.RO

    Deep Learning-Enhanced Robotic Subretinal Injection with Real-Time Retinal Motion Compensation

    Authors: Tianle Wu, Mojtaba Esfandiari, Peiyao Zhang, Russell H. Taylor, Peter Gehlbach, Iulian Iordachita

    Abstract: Subretinal injection is a critical procedure for delivering therapeutic agents to treat retinal diseases such as age-related macular degeneration (AMD). However, retinal motion caused by physiological factors such as respiration and heartbeat significantly impacts precise needle positioning, increasing the risk of retinal pigment epithelium (RPE) damage. This paper presents a fully autonomous robo…

    Submitted 4 April, 2025; originally announced April 2025.

  36. arXiv:2504.02826  [pdf, other]

    cs.CV

    Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

    Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan

    Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing…

    Submitted 8 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: 27 pages, 23 figures, 1 table. Technical Report

  37. arXiv:2504.02433  [pdf, other]

    cs.CV

    OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

    Authors: Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang, Liefeng Bo

    Abstract: Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conve…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: Project Page https://humanaigc.github.io/omnitalker

  38. arXiv:2504.01990  [pdf, other]

    cs.AI

    Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

    Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, et al. (22 additional authors not shown)

    Abstract: The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate…

    Submitted 31 March, 2025; originally announced April 2025.

  39. arXiv:2503.23886  [pdf, other]

    cs.DB cs.AI

    SchemaAgent: A Multi-Agents Framework for Generating Relational Database Schema

    Authors: Qin Wang, Youhuan Li, Yansong Feng, Si Chen, Ziming Li, Pan Zhang, Zhichao Shi, Yuequn Dou, Chuchu Gao, Zebin Huang, Zihui Si, Yixuan Chen, Zhaohai Sun, Ke Tang, Wenqiang Jin

    Abstract: The relational database design would output a schema based on user's requirements, which defines table structures and their interrelated relations. Translating requirements into accurate schema involves several non-trivial subtasks demanding both database expertise and domain-specific knowledge. This poses unique challenges for automated design of relational databases. Existing efforts are mostly…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: 19 pages, 16 figures

  40. arXiv:2503.23722  [pdf, other]

    cs.CV

    LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

    Authors: Xiang Hu, Yuhao Wang, Pingping Zhang, Huchuan Lu

    Abstract: Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views. Previous methods usually adopt large-scale models, focusing on view-invariant features. However, they overlook the semantic information in person attributes. Additionally, existing training strategies often rely on full fine-tuning large-scale models, which significan…

    Submitted 31 March, 2025; originally announced March 2025.

  41. arXiv:2503.22777  [pdf, other]

    cs.RO

    A reduced-scale autonomous morphing vehicle prototype with enhanced aerodynamic efficiency

    Authors: Peng Zhang, Branson Blaylock

    Abstract: Road vehicles contribute to significant levels of greenhouse gas (GHG) emissions. A potential strategy for improving their aerodynamic efficiency and reducing emissions is through active adaptation of their exterior shapes to the aerodynamic environment. In this study, we present a reduced-scale morphing vehicle prototype capable of actively interacting with the aerodynamic environment to enhance…

    Submitted 28 March, 2025; originally announced March 2025.

  42. arXiv:2503.22444  [pdf, other]

    cs.CL cs.RO

    Scaling Laws in Scientific Discovery with AI and Robot Scientists

    Authors: Pengsong Zhang, Heng Zhang, Huazhe Xu, Renjun Xu, Zhenting Wang, Cong Wang, Animesh Garg, Zhibin Li, Arash Ajoudani, Xinyu Liu

    Abstract: Scientific discovery is poised for rapid advancement through advanced robotics and artificial intelligence. Current scientific practices face substantial limitations as manual experimentation remains time-consuming and resource-intensive, while multidisciplinary research demands knowledge integration beyond individual researchers' expertise boundaries. Here, we envision an autonomous generalist sc…

    Submitted 3 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  43. arXiv:2503.22115  [pdf, other]

    cs.CL cs.AI cs.CY

    Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories

    Authors: Yazhou Zhang, Qimeng Liu, Qiuchi Li, Peng Zhang, Jing Qin

    Abstract: Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts, which directly probe models with ethically sensitive or controversial questions. However, with the rapid advancements in AI safety techniques, models have become increasingly adept at circumventing these straightforward tests, limiting their effectiveness in revealing und…

    Submitted 27 March, 2025; originally announced March 2025.

  44. arXiv:2503.21144  [pdf, other]

    cs.CV

    ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

    Authors: Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo

    Abstract: Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Project Page: https://humanaigc.github.io/chat-anyone/

  45. arXiv:2503.20950  [pdf, other]

    cs.AI

    DEMENTIA-PLAN: An Agent-Based Framework for Multi-Knowledge Graph Retrieval-Augmented Generation in Dementia Care

    Authors: Yutong Song, Chenhan Lyu, Pengfei Zhang, Sabine Brunswicker, Nikil Dutt, Amir Rahmani

    Abstract: Mild-stage dementia patients primarily experience two critical symptoms: severe memory loss and emotional instability. To address these challenges, we propose DEMENTIA-PLAN, an innovative retrieval-augmented generation framework that leverages large language models to enhance conversational support. Our model employs a multiple knowledge graph architecture, integrating various dimensional knowledg…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Accepted by AAAI 2025 Workshop on Knowledge Graphs for Personalized Public Health

  46. arXiv:2503.18063  [pdf, other]

    cs.CL cs.AI

    Dynamic Task Vector Grouping for Efficient Multi-Task Prompt Tuning

    Authors: Pieyi Zhang, Richong Zhang, Zhijie Nie

    Abstract: Multi-task prompt tuning utilizes multiple high-resource source tasks to improve performance on low-source target tasks. Existing approaches transfer the soft prompt trained by combining all source tasks or a single "high-similar" source task one-time-only. However, we find that the optimal transfer performance often comes from a combination of source tasks, which is neither one nor all. Further…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: Work in progress

  47. arXiv:2503.17646  [pdf, other]

    cs.SD cs.CV

    Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

    Authors: Yen Cheng Chang, Jesse Codling, Yiwen Dong, Jiale Zhang, Jiasi Chen, Hae Young Noh, Pei Zhang

    Abstract: Crowd monitoring in sports stadiums is important to enhance public safety and improve the audience experience. Existing approaches mainly rely on cameras and microphones, which can cause significant disturbances and often raise privacy concerns. In this paper, we sense floor vibration, which provides a less disruptive and more non-intrusive way of crowd sensing, to predict crowd behavior. However,…

    Submitted 22 March, 2025; originally announced March 2025.

  48. arXiv:2503.16867  [pdf, other]

    cs.CV

    ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

    Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song

    Abstract: Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Ali…

    Submitted 21 March, 2025; originally announced March 2025.

  49. arXiv:2503.16862  [pdf, other]

    cs.SD cs.CV eess.AS

    City2Scene: Improving Acoustic Scene Classification with City Features

    Authors: Yiqiang Cai, Yizhou Tan, Peihong Zhang, Yuxuan Liu, Shengchen Li, Xi Shao, Mark D. Plumbley

    Abstract: Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. In contrast, we hypothesize that city-specific environmental and cultural differences in acoustic features are beneficial for the ASC task. In this paper, we introduce…

    Submitted 21 March, 2025; originally announced March 2025.

  50. arXiv:2503.16528  [pdf, other]

    cs.CL cs.AI

    HDLCoRe: A Training-Free Framework for Mitigating Hallucinations in LLM-Generated HDL

    Authors: Heng Ping, Shixuan Li, Peiyu Zhang, Anzhe Cheng, Shukai Duan, Nikos Kanakaris, Xiongye Xiao, Wei Yang, Shahin Nazarian, Andrei Irimia, Paul Bogdan

    Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, when applied to hardware description languages (HDL), these models exhibit significant limitations due to data scarcity, resulting in hallucinations and incorrect code generation. To address these challenges, we propose HDLCoRe, a training-free framework that enhances LLMs'…

    Submitted 18 March, 2025; originally announced March 2025.