-
SpecWav-Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech
Authors:
Yuqi Li,
Yuanzhong Zheng,
Zhongtian Guo,
Yaoxuan Wang,
Jianjun Yin,
Haojun Fei
Abstract:
This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and empha…
▽ More
This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and emphasizing the need for stronger defenses, benchmarked against the ICASSP 2025 Attacker Challenge.
△ Less
Submitted 10 January, 2025;
originally announced May 2025.
-
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Authors:
Kai Liu,
Wei Li,
Lai Chen,
Shengqiong Wu,
Yanhao Zheng,
Jiayi Ji,
Fan Zhou,
Rongxin Jiang,
Jiebo Luo,
Hao Fei,
Tat-Seng Chua
Abstract:
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignme…
▽ More
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.
-
Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations
Authors:
Jinming Chen,
Jingyi Fang,
Yuanzhong Zheng,
Yaoxuan Wang,
Haojun Fei
Abstract:
Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, thereby improving the recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal appr…
▽ More
Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, thereby improving the recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal approaches. In this paper, the proposed Qieemo framework effectively utilizes the pretrained automatic speech recognition (ASR) model backbone which contains naturally frame aligned textual and emotional features, to achieve precise emotion classification solely based on the audio modality. Furthermore, we design the multimodal fusion (MMF) module and cross-modal attention (CMA) module in order to fuse the phonetic posteriorgram (PPG) and emotional features extracted by the ASR encoder for improving recognition accuracy. The experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms the benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9% respectively.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
TAIL: Text-Audio Incremental Learning
Authors:
Yingfei Sun,
Xu Gu,
Wei Ji,
Hanbin Zhao,
Hao Fei,
Yifang Yin,
Roger Zimmermann
Abstract:
Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called…
▽ More
Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called Text-Audio Incremental Learning (TAIL) task for text-audio retrieval, and propose a new method, PTAT, Prompt Tuning for Audio-Text incremental learning. This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on AudioCaps, Clotho, BBC Sound Effects and Audioset datasets, and our method outperforms previous methods significantly, particularly demonstrating stronger resistance to forgetting on older datasets. Compared to the full-parameters Finetune (Sequential) method, our model only requires 2.42\% of its parameters, achieving 4.46\% higher performance.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition
Authors:
Jinming Chen,
Jingyi Fang,
Yuanzhong Zheng,
Yaoxuan Wang,
Haojun Fei
Abstract:
Currently, end-to-end (E2E) speech recognition methods have achieved promising performance. However, auto speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on dynamic chunk strategy, our approach enables str…
▽ More
Currently, end-to-end (E2E) speech recognition methods have achieved promising performance. However, auto speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic feature, facilitating fine-grained information fusion. Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1$\%$ and 17.2$\%$ in character error rate (CER) across multi accent test datasets on KeSpeech and MagicData-RMAC.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Ms-senet: Enhancing Speech Emotion Recognition Through Multi-scale Feature Fusion With Squeeze-and-excitation Blocks
Authors:
Mengbo Li,
Yuanzhong Zheng,
Dichucheng Li,
Yulun Wu,
Yaoxuan Wang,
Haojun Fei
Abstract:
Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction. Spatiotemporal features play a crucial role in SER, yet current research lacks comprehensive spatiotemporal feature learning. This paper focuses on addressing this gap by proposing a novel approach. In this paper, we employ Convolutional Neural Network (CNN) with varying kernel sizes for spatial…
▽ More
Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction. Spatiotemporal features play a crucial role in SER, yet current research lacks comprehensive spatiotemporal feature learning. This paper focuses on addressing this gap by proposing a novel approach. In this paper, we employ Convolutional Neural Network (CNN) with varying kernel sizes for spatial and temporal feature extraction. Additionally, we introduce Squeeze-and-Excitation (SE) modules to capture and fuse multi-scale features, facilitating effective information fusion for improved emotion recognition and a deeper understanding of the temporal evolution of speech emotion. Moreover, we employ skip connections and Spatial Dropout (SD) layers to prevent overfitting and increase the model's depth. Our method outperforms the previous state-of-the-art method, achieving an average UAR and WAR improvement of 1.62% and 1.32%, respectively, across six benchmark SER datasets. Further experiments demonstrated that our method can fully extract spatiotemporal features in low-resource conditions.
△ Less
Submitted 24 December, 2023; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion
Authors:
Jian Ma,
Zhedong Zheng,
Hao Fei,
Feng Zheng,
Tat-seng Chua,
Yi Yang
Abstract:
Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference speeches and can be applied to arbitrary voice tran…
▽ More
Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference speeches and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the sophistication of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of mel-spectrogram, we propose a new voice conversion framework, i.e., Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. In particular, the style encoder network is designed to learn style codes for different subbands of the target speaker. The content encoder network can capture the content information on the source speech. Finally, the decoder generates particular subband content. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on VCTK Corpus and AISHELL3 datasets both qualitatively and quantitatively, whether on seen or unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
△ Less
Submitted 27 July, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
EchoCP: An Echocardiography Dataset in Contrast Transthoracic Echocardiography for Patent Foramen Ovale Diagnosis
Authors:
Tianchen Wang,
Zhihe Li,
Meiping Huang,
Jian Zhuang,
Shanshan Bi,
Jiawei Zhang,
Yiyu Shi,
Hongwen Fei,
Xiaowei Xu
Abstract:
Patent foramen ovale (PFO) is a potential separation between the septum, primum and septum secundum located in the anterosuperior portion of the atrial septum. PFO is one of the main factors causing cryptogenic stroke which is the fifth leading cause of death in the United States. For PFO diagnosis, contrast transthoracic echocardiography (cTTE) is preferred as being a more robust method compared…
▽ More
Patent foramen ovale (PFO) is a potential separation between the septum, primum and septum secundum located in the anterosuperior portion of the atrial septum. PFO is one of the main factors causing cryptogenic stroke which is the fifth leading cause of death in the United States. For PFO diagnosis, contrast transthoracic echocardiography (cTTE) is preferred as being a more robust method compared with others. However, the current PFO diagnosis through cTTE is extremely slow as it is proceeded manually by sonographers on echocardiography videos. Currently there is no publicly available dataset for this important topic in the community. In this paper, we present EchoCP, as the first echocardiography dataset in cTTE targeting PFO diagnosis.
EchoCP consists of 30 patients with both rest and Valsalva maneuver videos which covers various PFO grades. We further establish an automated baseline method for PFO diagnosis based on the state-of-the-art cardiac chamber segmentation technique, which achieves 0.89 average mean Dice score, but only 0.60/0.67 mean accuracies for PFO diagnosis, leaving large room for improvement. We hope that the challenging EchoCP dataset can stimulate further research and lead to innovative and generic solutions that would have an impact in multiple domains. Our dataset is released.
△ Less
Submitted 15 September, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
A Model-driven and Data-driven Fusion Framework for Accurate Air Quality Prediction
Authors:
Haolin Fei,
Xiaofeng Wu,
Chunbo Luo
Abstract:
Air quality is closely related to public health. Health issues such as cardiovascular diseases and respiratory diseases, may have connection with long exposure to highly polluted environment. Therefore, accurate air quality forecasts are extremely important to those who are vulnerable. To estimate the variation of several air pollution concentrations, previous researchers used various approaches,…
▽ More
Air quality is closely related to public health. Health issues such as cardiovascular diseases and respiratory diseases, may have connection with long exposure to highly polluted environment. Therefore, accurate air quality forecasts are extremely important to those who are vulnerable. To estimate the variation of several air pollution concentrations, previous researchers used various approaches, such as the Community Multiscale Air Quality model (CMAQ) or neural networks. Although CMAQ model considers a coverage of the historic air pollution data and meteorological variables, extra bias is introduced due to additional adjustment. In this paper, a combination of model-based strategy and data-driven method namely the physical-temporal collection(PTC) model is proposed, aiming to fix the systematic error that traditional models deliver. In the data-driven part, the first components are the temporal pattern and the weather pattern to measure important features that contribute to the prediction performance. The less relevant input variables will be removed to eliminate negative weights in network training. Then, we deploy a long-short-term-memory (LSTM) to fetch the preliminary results, which will be further corrected by a neural network (NN) involving the meteorological index as well as other pollutants concentrations. The data-set we applied for forecasting is from January 1st, 2016 to December 31st, 2016. According to the results, our PTC achieves an excellent performance compared with the baseline model (CMAQ prediction, GRU, DNN and etc.). This joint model-based data-driven method for air quality prediction can be easily deployed on stations without extra adjustment, providing results with high-time-resolution information for vulnerable members to prevent heavy air pollution ahead.
△ Less
Submitted 6 December, 2019;
originally announced December 2019.