-
Speak While You Think: Streaming Speech Synthesis During Text Generation
Authors:
Avihu Dekel,
Slava Shechtman,
Raul Fernandez,
David Haws,
Zvi Kons,
Ron Hoory
Abstract:
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture that synthesizes speech while text is being generated by an LLM, yielding a significant latency reduction. LLM2Speech mimics the predictions of a non-streaming teacher model while limiting the exposure to future context in order to enable streaming. It exploits the hidden embeddings of the LLM, a by-product of the text generation that contains informative semantic context. Experimental results show that LLM2Speech maintains the teacher's quality while reducing the latency to enable natural conversations.
Submitted 20 September, 2023;
originally announced September 2023.
-
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Authors:
Jiatong Shi,
George Saon,
David Haws,
Shinji Watanabe,
Brian Kingsbury
Abstract:
Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. Yet the full context in recurrent networks is not compatible with hypothesis merging. We propose to use vector-quantized long short-term memory units (VQ-LSTM) in the prediction network of RNN transducers. By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation. Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks while also producing denser lattices with a very low oracle word error rate (WER) for the same beam size. Additional language model rescoring experiments also demonstrate the effectiveness of the proposed lattice generation scheme.
Submitted 2 August, 2022;
originally announced August 2022.
-
Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
Authors:
Raul Fernandez,
David Haws,
Guy Lorberbom,
Slava Shechtman,
Alexander Sorin
Abstract:
Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate amounts of training data. Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer Interaction systems. In this work we explore one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections. We elaborate on the corpus design and explore the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation. In a set of subjective listening experiments, this approach resulted in high-fidelity style transfer with no quality degradation. However, a certain voice persona shift was observed, requiring further improvements in voice conversion.
Submitted 25 July, 2022;
originally announced July 2022.
-
Reducing Exposure Bias in Training Recurrent Neural Network Transducers
Authors:
Xiaodong Cui,
Brian Kingsbury,
George Saon,
David Haws,
Zoltan Tuske
Abstract:
When recurrent neural network transducers (RNNTs) are trained using the typical maximum likelihood criterion, the prediction network is trained only on ground truth label sequences. This leads to a mismatch during inference, known as exposure bias, when the model must deal with label sequences containing errors. In this paper we investigate approaches to reducing exposure bias in training to improve the generalization of RNNT models for automatic speech recognition (ASR). A label-preserving input perturbation to the prediction network is introduced. The input token sequences are perturbed using SwitchOut and scheduled sampling based on an additional token language model. Experiments conducted on the 300-hour Switchboard dataset demonstrate their effectiveness. By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
Submitted 24 August, 2021;
originally announced August 2021.
-
MINT: Mutual Information based Transductive Feature Selection for Genetic Trait Prediction
Authors:
Dan He,
Irina Rish,
David Haws,
Simon Teyssedre,
Zivan Karaman,
Laxmi Parida
Abstract:
Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a great deal of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology. As the number of genotypes is generally much bigger than the number of samples, predictive models suffer from the curse of dimensionality. The curse of dimensionality not only affects the computational efficiency of a particular genomic selection method, but can also lead to poor performance, mainly due to correlation among markers. In this work we propose the first transductive feature selection method based on the mRMR (Max-Relevance and Min-Redundancy) criterion, which we call MINT. We apply MINT to genetic trait prediction problems and show that in general MINT is a better feature selection method than the state-of-the-art inductive method mRMR.
Submitted 6 October, 2013;
originally announced October 2013.
-
QuickLexSort: An efficient algorithm for lexicographically sorting nested restrictions of a database
Authors:
David Haws
Abstract:
Lexicographical sorting is a fundamental problem with applications to contingency tables, databases, Bayesian networks, and more. A standard method to lexicographically sort general data is to iteratively use a stable sort, that is, a sort which preserves existing orders. Here we present a new method of lexicographical sorting called QuickLexSort. Whereas a stable-sort-based lexicographical sorting algorithm operates from the least important to the most important features, QuickLexSort sorts from the most important to the least important features, refining the sort as it goes. QuickLexSort first requires a one-time modest pre-processing step where each feature of the data set is sorted independently. When lexicographically sorting a database, QuickLexSort (including pre-processing) has comparable running time to using a stable-sort-based approach. For a database with $m$ rows and $n$ columns, and a sorting algorithm running in time $O(m\log(m))$, a stable-sort-based lexicographical sort and QuickLexSort will both take time $O(nm\log(m))$. However, in many applications one has the need to lexicographically sort nested data, e.g. all possible sub-matrices up to a certain cardinality of columns. In such cases we show QuickLexSort gives a performance improvement of a log factor of the database length (rows in matrix) over using a standard stable-sort-based approach. E.g. to sort all sub-matrices up to cardinality $k$, QuickLexSort has running time $O(mn^k)$ whereas a stable-sort-based lexicographical sort will take time $O(m\log(m)n^k)$. After the pre-processing step that is run only once for the entire matrix, QuickLexSort has a running time linear in the number of nested sub-matrices to sort. We conclude with an application to Bayesian network scoring to detect epistasis using SNP marker data.
Submitted 6 October, 2013;
originally announced October 2013.
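The standard baseline that this abstract contrasts with QuickLexSort (repeated stable sorting, from the least important column to the most important one) can be sketched as follows. This is only an illustration of that baseline, not of QuickLexSort itself; the function name `stable_lex_sort` and the sample rows are assumptions made for the example.

```python
from operator import itemgetter


def stable_lex_sort(rows, columns):
    """Lexicographically sort rows by `columns`, listed most important first.

    Hypothetical helper for illustration: it applies one stable sort per
    column, starting with the least important column.
    """
    ordered = list(rows)
    for col in reversed(columns):
        # list.sort is stable, so rows that tie on `col` keep the order
        # established by the less important columns sorted earlier.
        ordered.sort(key=itemgetter(col))
    return ordered


# Made-up sample data: sort by column 0, breaking ties with column 1.
rows = [(2, "b", 0), (1, "b", 1), (2, "a", 1), (1, "a", 0)]
result = stable_lex_sort(rows, columns=[0, 1])
print(result)
```

With $n$ columns and an $O(m\log(m))$ sort, this loop performs $n$ sorts of $m$ rows, which matches the $O(nm\log(m))$ bound quoted in the abstract.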
-
Bayes estimators for phylogenetic reconstruction
Authors:
Peter Huggins,
Wenbin Li,
David Haws,
Thomas Friedrich,
Jinze Liu,
Ruriko Yoshida
Abstract:
Tree reconstruction methods are often judged by their accuracy, measured by how close they get to the true tree. Yet most reconstruction methods, such as maximum likelihood (ML), do not explicitly maximize this accuracy. To address this problem, we propose a Bayesian solution. Given tree samples, we propose finding the tree estimate which is closest on average to the samples. This "median" tree is known as the Bayes estimator (BE). The BE literally maximizes posterior expected accuracy, measured in terms of closeness (distance) to the true tree. We discuss a unified framework of BE trees, focusing especially on tree distances which are expressible as squared Euclidean distances. Notable examples include Robinson-Foulds distance, quartet distance, and squared path difference. Using simulated data, we show Bayes estimators can be efficiently computed in practice by hill climbing. We also show that Bayes estimators achieve higher accuracy, compared to maximum likelihood and neighbor joining.
Submitted 21 November, 2009; v1 submitted 3 November, 2009;
originally announced November 2009.
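The idea of picking the tree that is closest on average to a set of sampled trees can be illustrated with a small sketch. It assumes each tree is represented as a set of its nontrivial splits, so that the Robinson-Foulds distance is the size of the symmetric difference of two split sets. It also searches exhaustively over a given candidate list rather than using the hill climbing mentioned in the abstract. The names `rf_distance` and `empirical_bayes_estimate`, the split labels, and the sample trees are all hypothetical.

```python
def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance between two trees given as sets of splits."""
    return len(tree_a ^ tree_b)


def empirical_bayes_estimate(candidates, samples):
    """Return the candidate tree with the smallest mean distance to the samples.

    Exhaustive search over `candidates`; a simplification for illustration,
    not the hill-climbing procedure described in the abstract.
    """
    def mean_distance(tree):
        return sum(rf_distance(tree, s) for s in samples) / len(samples)

    return min(candidates, key=mean_distance)


# Made-up trees on taxa A..E, each described by its two nontrivial splits.
t1 = frozenset({"AB|CDE", "ABC|DE"})
t2 = frozenset({"AB|CDE", "ABD|CE"})
t3 = frozenset({"AC|BDE", "ACD|BE"})

samples = [t1, t1, t2, t3]  # stand-in for posterior tree samples
estimate = empirical_bayes_estimate(candidates=[t1, t2, t3], samples=samples)
print(sorted(estimate))
```

Restricting the candidates to the sampled trees keeps the example short; a search over the full tree space would need a neighbourhood move such as the hill climbing the abstract refers to.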