Skip to main content

Showing 1–9 of 9 results for author: Fei, H

Searching in archive eess. Search in all archives.
.
  1. arXiv:2505.09616  [pdf, other

    cs.SD cs.AI eess.AS

    SpecWav-Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech

    Authors: Yuqi Li, Yuanzhong Zheng, Zhongtian Guo, Yaoxuan Wang, Jianjun Yin, Haojun Fei

    Abstract: This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and empha… ▽ More

    Submitted 10 January, 2025; originally announced May 2025.

    Comments: 2 pages,3 figures,1 chart

    MSC Class: I.2.0

  2. arXiv:2503.23377  [pdf, other

    cs.CV cs.AI cs.SD eess.AS

    JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

    Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua

    Abstract: This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignme… ▽ More

    Submitted 30 March, 2025; originally announced March 2025.

    Comments: Work in progress. Homepage: https://javisdit.github.io/

  3. arXiv:2503.22687  [pdf, other

    eess.AS cs.AI

    Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations

    Authors: Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

    Abstract: Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, thereby improving the recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal appr… ▽ More

    Submitted 5 March, 2025; originally announced March 2025.

  4. arXiv:2503.04258  [pdf, other

    cs.SD cs.AI cs.CV eess.AS

    TAIL: Text-Audio Incremental Learning

    Authors: Yingfei Sun, Xu Gu, Wei Ji, Hanbin Zhao, Hao Fei, Yifang Yin, Roger Zimmermann

    Abstract: Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called… ▽ More

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: 4 figures, 5 tables

    ACM Class: I.2

  5. arXiv:2407.03026  [pdf, other

    cs.SD cs.AI eess.AS

    Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition

    Authors: Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

    Abstract: Currently, end-to-end (E2E) speech recognition methods have achieved promising performance. However, auto speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on dynamic chunk strategy, our approach enables str… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: accpeted by interspeech 2014, 5 pages, 1 figure

  6. arXiv:2312.11974  [pdf, other

    cs.SD cs.HC eess.AS

    Ms-senet: Enhancing Speech Emotion Recognition Through Multi-scale Feature Fusion With Squeeze-and-excitation Blocks

    Authors: Mengbo Li, Yuanzhong Zheng, Dichucheng Li, Yulun Wu, Yaoxuan Wang, Haojun Fei

    Abstract: Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction. Spatiotemporal features play a crucial role in SER, yet current research lacks comprehensive spatiotemporal feature learning. This paper focuses on addressing this gap by proposing a novel approach. In this paper, we employ Convolutional Neural Network (CNN) with varying kernel sizes for spatial… ▽ More

    Submitted 24 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

  7. arXiv:2207.06057  [pdf, other

    cs.SD cs.MM eess.AS

    Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

    Authors: Jian Ma, Zhedong Zheng, Hao Fei, Feng Zheng, Tat-seng Chua, Yi Yang

    Abstract: Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference speeches and can be applied to arbitrary voice tran… ▽ More

    Submitted 27 July, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

  8. arXiv:2105.08267  [pdf, other

    eess.IV cs.CV

    EchoCP: An Echocardiography Dataset in Contrast Transthoracic Echocardiography for Patent Foramen Ovale Diagnosis

    Authors: Tianchen Wang, Zhihe Li, Meiping Huang, Jian Zhuang, Shanshan Bi, Jiawei Zhang, Yiyu Shi, Hongwen Fei, Xiaowei Xu

    Abstract: Patent foramen ovale (PFO) is a potential separation between the septum, primum and septum secundum located in the anterosuperior portion of the atrial septum. PFO is one of the main factors causing cryptogenic stroke which is the fifth leading cause of death in the United States. For PFO diagnosis, contrast transthoracic echocardiography (cTTE) is preferred as being a more robust method compared… ▽ More

    Submitted 15 September, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: MICCAI2021

  9. arXiv:1912.07367  [pdf, other

    stat.AP cs.LG eess.SP stat.ML

    A Model-driven and Data-driven Fusion Framework for Accurate Air Quality Prediction

    Authors: Haolin Fei, Xiaofeng Wu, Chunbo Luo

    Abstract: Air quality is closely related to public health. Health issues such as cardiovascular diseases and respiratory diseases, may have connection with long exposure to highly polluted environment. Therefore, accurate air quality forecasts are extremely important to those who are vulnerable. To estimate the variation of several air pollution concentrations, previous researchers used various approaches,… ▽ More

    Submitted 6 December, 2019; originally announced December 2019.