-
Weakly Supervised Convolutional Dictionary Learning for Multi-Label Classification
Authors:
Hao Chen,
Dayuan Tan
Abstract:
Convolutional Dictionary Learning (CDL) has emerged as a powerful approach for signal representation by learning translation-invariant features through convolution operations. While existing CDL methods are predominantly designed and used for fully supervised settings, many real-world classification tasks often rely on weakly labeled data, where only bag-level annotations are available. In this paper, we propose a novel weakly supervised convolutional dictionary learning framework that jointly learns shared and class-specific components for multi-instance multi-label (MIML) classification, where each example consists of multiple instances and may be associated with multiple labels. Our approach decomposes signals into background patterns captured by a shared dictionary and discriminative features encoded in class-specific dictionaries, with nuclear norm constraints preventing feature dilution. A Block Proximal Gradient method with Majorization (BPG-M) is developed to alternately update dictionary atoms and sparse coefficients, ensuring convergence to local minima. Furthermore, we incorporate a projection mechanism that aggregates instance-level predictions to bag-level labels through learnable pooling operators. Experimental results on both synthetic and real-world datasets demonstrate that our framework outperforms existing MIML methods in terms of classification performance, particularly in low-label regimes. The learned dictionaries provide interpretable representations while effectively handling background noise and variable-length instances, making the method suitable for applications such as environmental sound classification and RF signal analysis.
Submitted 21 May, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
High-precision visual navigation device calibration method based on collimator
Authors:
Shunkun Liang,
Dongcai Tan,
Banglei Guan,
Zhang Li,
Guangcheng Dai,
Nianpeng Pan,
Liang Shen,
Yang Shang,
Qifeng Yu
Abstract:
Visual navigation devices require precise calibration to achieve high-precision localization and navigation, which includes camera and attitude calibration. To address the limitations of time-consuming camera calibration and complex attitude adjustment processes, this study presents a collimator-based calibration method and system. Based on the optical characteristics of the collimator, a single-image camera calibration algorithm is introduced. In addition, integrated with the precision adjustment mechanism of the calibration frame, a rotation transfer model between coordinate systems enables efficient attitude calibration. Experimental results demonstrate that the proposed method achieves accuracy and stability comparable to traditional multi-image calibration techniques. Specifically, the re-projection errors are less than 0.1463 pixels, and average attitude angle errors are less than 0.0586 degrees with a standard deviation less than 0.0257 degrees, demonstrating high precision and robustness.
Submitted 25 February, 2025;
originally announced February 2025.
-
Multi-scale Cascaded Large-Model for Whole-body ROI Segmentation
Authors:
Rui Hao,
Dayu Tan,
Yansen Su,
Chunhou Zheng
Abstract:
Organs-at-risk segmentation is critical for ensuring the safety and precision of radiotherapy and surgical procedures. However, existing methods for organs-at-risk image segmentation often suffer from uncertainties and biases in target selection, as well as insufficient model validation experiments, limiting their generality and reliability in practical applications. To address these issues, we propose an innovative cascaded network architecture called the Multi-scale Cascaded Fusing Network (MCFNet), which effectively captures complex multi-scale and multi-resolution features. MCFNet includes a Sharp Extraction Backbone and a Flexible Connection Backbone, which respectively enhance feature extraction in the downsampling and skip-connection stages. This design not only improves segmentation accuracy but also ensures computational efficiency, enabling precise detail capture even in low-resolution images. We conduct experiments using the A6000 GPU on diverse datasets from 671 patients, including 36,131 image-mask pairs across 10 different datasets. MCFNet demonstrates strong robustness, performing consistently well across all 10 datasets. Additionally, MCFNet exhibits excellent generalizability, maintaining high accuracy in different clinical scenarios. We also introduce an adaptive loss aggregation strategy to further optimize the model training process, improving both segmentation accuracy and efficiency. Through extensive validation, MCFNet demonstrates superior performance compared to existing methods, providing more reliable image-guided support. Our solution aims to significantly improve the precision and safety of radiotherapy and surgical procedures, advancing personalized treatment. The code has been made available on GitHub: https://github.com/Henry991115/MCFNet.
Submitted 23 November, 2024;
originally announced November 2024.
-
A novel pedestrian road crossing simulator for dynamic traffic light scheduling systems
Authors:
Dayuan Tan,
Mohamed Younis,
Wassila Lalouani,
Shuyao Fan,
Guozhi Song
Abstract:
The major advances in intelligent transportation systems are pushing societal services toward autonomy where road management is to be more agile in order to cope with changes and continue to yield optimal performance. However, the pedestrian experience is not sufficiently considered. Particularly, signalized intersections are expected to be popular if not dominant in urban settings where pedestrian density is high. This paper presents the design of a novel environment for simulating human motion on signalized crosswalks at a fine-grained level. Such a simulation not only captures typical behavior, but also handles cases where large pedestrian groups cross from both directions. The proposed simulator is instrumental for optimized road configuration management where the pedestrians' quality of experience, for example, waiting time, is factored in. The validation results using field data show that an accuracy of 98.37 percent can be obtained for the estimated crossing time. Other results using synthetic data show that our simulator enables optimized traffic light scheduling that diminishes pedestrians' waiting time without sacrificing vehicular throughput.
Submitted 17 September, 2024;
originally announced September 2024.
-
Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data
Authors:
Jing Xu,
Daxin Tan,
Jiaqi Wang,
Xiao Chen
Abstract:
While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines with a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
Submitted 17 September, 2024;
originally announced September 2024.
-
Exploring SSL Discrete Tokens for Multilingual ASR
Authors:
Mingyu Cui,
Daxin Tan,
Yifan Yang,
Dingdong Wang,
Huimeng Wang,
Xiao Chen,
Xie Chen,
Xunying Liu
Abstract:
With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete tokens for multilingual ASR scenarios. This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. We aim to explore the performance and efficiency of speech discrete tokens across multiple language domains for both monolingual and multilingual ASR scenarios. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on Fbank features in ASR tasks across seven language domains, with an average word error rate (WER) reduction of 0.31% and 1.76% absolute (2.80% and 15.70% relative) on dev and test sets respectively, and a particularly large WER reduction of 6.82% absolute (41.48% relative) on the Polish test set.
Submitted 13 September, 2024;
originally announced September 2024.
-
ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
Authors:
Dehua Tao,
Daxin Tan,
Yu Ting Yeung,
Xiao Chen,
Tan Lee
Abstract:
Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data.
Submitted 3 September, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue
Authors:
Daxin Tan,
Nikos Kargas,
David McHardy,
Constantinos Papayiannis,
Antonio Bonafonte,
Marek Strelec,
Jonas Rohnke,
Agis Oikonomou Filandras,
Trevor Wood
Abstract:
Entrainment is the phenomenon by which an interlocutor adapts their speaking style to align with their partner in conversations. It has been found in different dimensions, such as acoustic, prosodic, lexical or syntactic. In this work, we explore and utilize the entrainment phenomenon to improve spoken dialogue systems for voice assistants. We first examine the existence of the entrainment phenomenon in human-to-human dialogues with respect to acoustic features and then extend the analysis to emotion features. The analysis results show strong evidence of entrainment in terms of both acoustic and emotion features. Based on these findings, we implement two entrainment policies and assess whether integrating the entrainment principle into a Text-to-Speech (TTS) system improves the synthesis performance and the user experience. It is found that integrating the entrainment principle into a TTS system brings performance improvement when considering acoustic features, while no obvious improvement is observed when considering emotion features.
Submitted 6 December, 2022;
originally announced December 2022.
-
Automated Sex Classification of Children's Voices and Changes in Differentiating Factors with Age
Authors:
Fuling Chen,
Roberto Togneri,
Murray Maybery,
Diana Weiting Tan
Abstract:
Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set and our modelling achieved an average F1 score (the harmonic mean of the precision and recall) of 0.84 across all ages. Our results suggest that the sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues in sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were found to be the most critical predictors generally. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness and unvoiced features.
Submitted 26 September, 2022;
originally announced September 2022.
-
CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction
Authors:
Daxin Tan,
Liqun Deng,
Nianzu Zheng,
Yu Ting Yeung,
Xin Jiang,
Xiao Chen,
Tan Lee
Abstract:
This study proposes a fully automated system for speech correction and accent reduction. Consider the application scenario in which a recorded speech audio contains certain errors, e.g., inappropriate words or mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into a time-stamped symbol sequence, aligning the recognized symbol sequence with the target text to determine the locations and types of required edit operations, and generating the corrected speech. Experiments show that the quality and naturalness of corrected speech depend on the performance of the speech recognition and alignment modules, as well as the granularity level of editing operations. The proposed system is evaluated on two corpora: a manually perturbed version of VCTK and L2-ARCTIC. The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/CorrectSpeech/ .
Submitted 13 October, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Authors:
Guangyan Zhang,
Kaitao Song,
Xu Tan,
Daxin Tan,
Yuzi Yan,
Yanqing Liu,
Gang Wang,
Wei Zhou,
Tao Qin,
Tan Lee,
Sheng Zhao
Abstract:
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, these works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge the adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline. The Mixed-Phoneme BERT achieves 3x inference speedup and similar voice quality to the previous TTS pre-trained model PnG BERT.
Submitted 19 July, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough
Authors:
Esin Darici Haritaoglu,
Nicholas Rasmussen,
Daniel C. H. Tan,
Jennifer Ranjani J.,
Jaclyn Xiao,
Gunvant Chaudhari,
Akanksha Rajput,
Praveen Govindan,
Christian Canham,
Wei Chen,
Minami Yamaura,
Laura Gomezjurado,
Aaron Broukhim,
Amil Khanzada,
Mert Pilanci
Abstract:
The COVID-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the world that do not have access to Western medicine. Artificial Intelligence can provide a solution utilizing cough sounds as a primary screening mode for COVID-19 diagnosis. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Through investigation of a self-supervised learning model (Area under the ROC curve, AUC = 0.807) and a convolutional neural network (CNN) model (AUC = 0.802), we observe the possibility of model bias with limited datasets. Moreover, we observe that performance increases with training data size, showing the need for the worldwide collection of data to help combat the COVID-19 pandemic with non-traditional means.
Submitted 29 March, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
Environment Aware Text-to-Speech Synthesis
Authors:
Daxin Tan,
Guangyan Zhang,
Tan Lee
Abstract:
This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condition in the process of neural network based speech synthesis. Two embedding extractors are trained with two purposely constructed datasets for characterization and disentanglement of speaker and environment factors in speech. A neural network model is trained to generate speech from extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system is able to effectively disentangle speaker and environment factors and synthesize speech audio that carries designated speaker characteristics and environment attributes. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/Environment-Aware-TTS/ .
Submitted 6 August, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
A study on the efficacy of model pre-training in developing neural text-to-speech system
Authors:
Guangyan Zhang,
Yichong Leng,
Daxin Tan,
Ying Qin,
Kaitao Song,
Xu Tan,
Sheng Zhao,
Tan Lee
Abstract:
In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
Submitted 7 October, 2021;
originally announced October 2021.
-
Applying the Information Bottleneck Principle to Prosodic Representation Learning
Authors:
Guangyan Zhang,
Ying Qin,
Daxin Tan,
Tan Lee
Abstract:
This paper describes a novel design of a neural network-based speech generation model for learning prosodic representation. The problem of representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantized layer is incorporated in the speech generation model to control the IB capacity and adjust the balance between the reconstruction power and the disentanglement capability of the learned representation. The proposed model is able to learn word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations not only are adequate to reconstruct the original speech but also can be used to transfer the prosody onto different textual content. Extensive results of the objective and subjective evaluation are presented to demonstrate the effect of IB capacity control and the effectiveness and potential usage of the learned prosodic representation in controllable neural speech generation.
Submitted 5 August, 2021;
originally announced August 2021.
-
EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion
Authors:
Daxin Tan,
Liqun Deng,
Yu Ting Yeung,
Xin Jiang,
Xiao Chen,
Tan Lee
Abstract:
This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/EditSpeech/ .
Submitted 7 October, 2021; v1 submitted 4 July, 2021;
originally announced July 2021.
-
CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge
Authors:
Daxin Tan,
Hingpang Huang,
Guangyan Zhang,
Tan Lee
Abstract:
This paper presents the CUHK-EE voice cloning system for the ICASSP 2021 M2VoC challenge. The challenge provides two Mandarin speech corpora: the AIShell-3 corpus of 218 speakers with noise and reverberation, and the MST corpus including high-quality speech of one male and one female speaker. 100 and 5 utterances of 3 target speakers in different voices and styles are provided in track 1 and track 2 respectively, and the participants are required to synthesize speech in the target speaker's voice and style. We take part in track 1 and carry out voice cloning based on 100 utterances of the target speakers. An end-to-end voice cloning system is developed to accomplish the task, which includes: 1. a text and speech front-end module with the help of forced alignment, 2. an acoustic model combining Tacotron2 and DurIAN to predict the mel-spectrogram, 3. a HiFi-GAN vocoder for waveform generation. Our system comprises three stages: a multi-speaker training stage, a target speaker adaptation stage and a target speaker synthesis stage. Our team is identified as T17. The subjective evaluation results provided by the challenge organizer demonstrate the effectiveness of our system. Audio samples are available at our demo page: https://daxintan-cuhk.github.io/CUHK-EE-system-M2VoC-challenge/ .
Submitted 5 July, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Voice Gender Scoring and Independent Acoustic Characterization of Perceived Masculinity and Femininity
Authors:
Fuling Chen,
Roberto Togneri,
Murray Maybery,
Diana Tan
Abstract:
Previous research has found that voices can provide reliable information to be used for gender classification with a high level of accuracy. In social psychology, perceived masculinity and femininity (masculinity and femininity rated by humans) has often been considered an important feature when investigating the influence of vocal features on social behaviours. While previous studies have characterised the acoustic features that contributed to perceivers' judgements of speakers' masculinity or femininity, there is limited research on developing a machine masculinity/femininity scoring model and characterizing the independent acoustic factors that contribute to perceivers' masculinity and femininity judgements. In this work, we first propose a machine scoring model of perceived masculinity/femininity based on the Extreme Random Forest and then characterize the independent and meaningful acoustic factors that contribute to perceivers' judgements by using a correlation matrix based hierarchical clustering method. Our results show that the machine ratings of masculinity and femininity strongly correlated with the human ratings of masculinity and femininity when we used an optimal speech duration of 7 seconds, with a correlation coefficient of up to .63 for females and .77 for males. Nine independent clusters of acoustic measures were generated from our modelling of femininity judgements for female voices and eight clusters were found for masculinity judgements for male voices. The results revealed that, for both genders, the F0 mean is the most important acoustic measure affecting the judgement of acoustic-related masculinity and femininity. The F3 mean, F4 mean and VTL estimators were found to be highly inter-correlated and appeared in the same cluster, forming the second most significant factor in influencing the assessment of acoustic-related masculinity and femininity.
Submitted 4 August, 2022; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
Authors:
Daxin Tan,
Tan Lee
Abstract:
This paper presents a novel neural network design for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration at https://daxintan-cuhk.github.io/pl-csd-speech .
Submitted 7 October, 2021; v1 submitted 8 November, 2020;
originally announced November 2020.
-
SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification
Authors:
Yida Wang,
David Joseph Tan,
Nassir Navab,
Federico Tombari
Abstract:
Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature -- points are stored in an unordered way -- makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy.
Submitted 17 August, 2020;
originally announced August 2020.
-
ForkNet: Multi-branch Volumetric Semantic Completion from a Single Depth Image
Authors:
Yida Wang,
David Joseph Tan,
Nassir Navab,
Federico Tombari
Abstract:
We propose a novel model for 3D semantic completion from a single depth image, based on a single encoder and three separate generators used to reconstruct different geometric and semantic representations of the original and completed scene, all sharing the same latent space. To transfer information between the geometric and semantic branches of the network, we introduce paths between them concatenating features at corresponding network layers. Motivated by the limited amount of training samples from real scenes, an interesting attribute of our architecture is the capacity to supplement the existing dataset by generating a new training dataset with high quality, realistic scenes that even includes occlusion and real noise. We build the new dataset by sampling the features directly from latent space which generates a pair of partial volumetric surface and completed volumetric semantic surface. Moreover, we utilize multiple discriminators to increase the accuracy and realism of the reconstructions. We demonstrate the benefits of our approach on standard benchmarks for the two most common completion tasks: semantic 3D scene completion and 3D object completion.
Submitted 3 September, 2019;
originally announced September 2019.