-
Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features
Authors:
Hsin-Hao Chen,
Yung-Lun Chien,
Ming-Chi Yen,
Shu-Wei Tsai,
Yu Tsao,
Tai-shih Chi,
Hsin-Min Wang
Abstract:
Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (E…
▽ More
Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (ELVC) is a method used to improve the intelligibility and quality of EL speech. In this paper, we propose a novel ELVC system that incorporates cross-domain features, specifically spectral features and self-supervised learning (SSL) embeddings. The experimental results show that applying cross-domain features can notably improve the conversion performance for the ELVC task compared with utilizing only traditional spectral features.
△ Less
Submitted 11 June, 2023;
originally announced June 2023.
-
Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion
Authors:
Yung-Lun Chien,
Hsin-Hao Chen,
Ming-Chi Yen,
Shu-Wei Tsai,
Hsin-Min Wang,
Yu Tsao,
Tai-Shih Chi
Abstract:
Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal spee…
▽ More
Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) for converting EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC.
△ Less
Submitted 11 June, 2023;
originally announced June 2023.
-
Statistical Verification of Traffic Systems with Expected Differential Privacy
Authors:
Mark Yen,
Geir E. Dullerud,
Yu Wang
Abstract:
Traffic systems are multi-agent cyber-physical systems whose performance is closely related to human welfare. They work in open environments and are subject to uncertainties from various sources, making their performance hard to verify by traditional model-based approaches. Alternatively, statistical model checking (SMC) can verify their performance by sequentially drawing sample data until the co…
▽ More
Traffic systems are multi-agent cyber-physical systems whose performance is closely related to human welfare. They work in open environments and are subject to uncertainties from various sources, making their performance hard to verify by traditional model-based approaches. Alternatively, statistical model checking (SMC) can verify their performance by sequentially drawing sample data until the correctness of a performance specification can be inferred with desired statistical accuracy. This work aims to verify traffic systems with privacy, motivated by the fact that the data used may include personal information (e.g., daily itinerary) and get leaked unintendedly by observing the execution of the SMC algorithm. To formally capture data privacy in SMC, we introduce the concept of expected differential privacy (EDP), which constrains how much the algorithm execution can change in the expectation sense when data change. Accordingly, we introduce an exponential randomization mechanism for the SMC algorithm to achieve the EDP. Our case study on traffic intersections by Vissim simulation shows the high accuracy of SMC in traffic model verification without significantly sacrificing computing efficiency. The case study also shows EDP successfully bounding the algorithm outputs to guarantee privacy.
△ Less
Submitted 28 February, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
Authors:
Yi-Syuan Liou,
Wen-Chin Huang,
Ming-Chi Yen,
Shu-Wei Tsai,
Yu-Huai Peng,
Tomoki Toda,
Yu Tsao,
Hsin-Min Wang
Abstract:
Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair…
▽ More
Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity is based on the assumption that the same phonemes of the speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of the EL speech can break the assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomee remain normal compared to healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method can significantly outperform the audio-only alignment in terms of objective and subjective evaluations.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.