Showing 1–9 of 9 results for author: Zu, W

Searching in archive cs.
  1. arXiv:2506.05425

    cs.CV cs.AI

    SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

    Authors: Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng

    Abstract: The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Socia…

    Submitted 5 June, 2025; originally announced June 2025.

  2. arXiv:2503.07958

    cs.CV

    Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation

    Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Lei Ma

    Abstract: This paper investigates the critical problem of representation similarity evolution during cross-domain transfer learning, with particular focus on understanding why pre-trained models maintain effectiveness when adapted to medical imaging tasks despite significant domain gaps. The study establishes a rigorous problem definition centered on quantifying and analyzing representation similarity traje…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 10 pages, 5 figures

  3. arXiv:2503.07399

    cs.CV

    Keeping Representation Similarity in Finetuning for Medical Image Analysis

    Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Yiming Liang, Lei Ma

    Abstract: Foundation models pretrained on large-scale natural images have been widely used to adapt to medical image analysis through finetuning. This is largely attributed to pretrained representations capturing universal, robust, and generalizable features, which can be reutilized by downstream tasks. However, these representations are later found to gradually vanish during finetuning, accompanied by a de…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 12 pages, 6 figures

  4. arXiv:2501.09368

    cs.AI

    Aligning Instruction Tuning with Pre-training

    Authors: Yiming Liang, Tianyu Zheng, Xinrun Du, Ge Zhang, Jiaheng Liu, Xingwei Qu, Wenqiang Zu, Xingrun Xing, Chujie Zheng, Lei Ma, Wenhu Chen, Guoyin Wang, Zhaoxiang Zhang, Wenhao Huang, Xiang Yue, Jiajun Zhang

    Abstract: Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained…

    Submitted 20 January, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

    Comments: arXiv admin note: text overlap with arXiv:hep-ph/9811436 by other authors

  5. arXiv:2410.22217

    cs.CV

    Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

    Authors: Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma

    Abstract: Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation models. In this survey, we review the recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for ne…

    Submitted 30 October, 2024; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: 17 pages, 1 table, 2 figures

  6. Embedded Visual Prompt Tuning

    Authors: Wenqiang Zu, Shenghao Xie, Qing Zhao, Guoqi Li, Lei Ma

    Abstract: Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain…

    Submitted 21 March, 2025; v1 submitted 1 July, 2024; originally announced July 2024.

  7. arXiv:2311.08244

    cs.RO

    Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

    Authors: Weiqin Zu, Wenbin Song, Ruiqing Chen, Ze Guo, Fanglei Sun, Zheng Tian, Wei Pan, Jun Wang

    Abstract: The socially-aware navigation system has evolved to adeptly avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and -guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), the procedure of communicating commands to robots demands intricate mathematical formulations. Furthermore, the transition between tasks does not qu…

    Submitted 21 March, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

  8. arXiv:2309.04156

    cs.SD cs.CL eess.AS

    Cross-Utterance Conditioned VAE for Speech Generation

    Authors: Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun

    Abstract: Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representat…

    Submitted 19 September, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: 13 pages

  9. arXiv:2205.04120

    cs.SD cs.CL eess.AS

    Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

    Authors: Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang

    Abstract: Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: ACL 2022 camera ready