-
Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting
Authors:
Guillaume Wisniewski,
Séverine Guillaume,
Clara Rosina Fernández
Abstract:
Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. While widely observed, the impact of this property on downstream tasks remains unclear. This work evaluates anisotropy in keyword spotting for computational documentary linguistics. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription. Our results highlight the robustness of these representations, which capture phonetic structures and generalize across speakers. They also underscore the importance of pretraining in learning rich and invariant speech representations.
Submitted 6 June, 2025;
originally announced June 2025.
-
New scenarios and trends in non-traditional laboratories from 2000 to 2020
Authors:
Ricardo M. Fernandez,
Felix Garcia-Loro,
Gustavo Alves,
Africa Lopez-Rey,
Russ Meier,
Manuel Castro
Abstract:
For educational institutions in STEM areas, providing practical learning scenarios has traditionally been a major concern. In the 21st century, the explosion of ICTs and the universal availability of low-cost hardware have allowed technical solutions to proliferate in every field; in the case of experimentation, they have encouraged the emergence and spread of non-traditional experimentation platforms. This movement has resulted in enriched practical environments, with wider adaptability for both students and teachers. In this paper, the evolution of scholarly production is analyzed at the global level from 2000 to 2020. Current and emerging experimentation scenarios are identified, and the scope and boundaries between them are specified.
Submitted 24 January, 2025;
originally announced January 2025.
-
Reverb: Open-Source ASR and Diarization from Rev
Authors:
Nishchal Bhandari,
Danny Chen,
Miguel Ángel del Río Fernández,
Natalie Delworth,
Jennifer Drexler Fox,
Migüel Jetté,
Quinten McNamara,
Corey Miller,
Ondřej Novotný,
Ján Profant,
Nan Qin,
Martin Ratajczak,
Jean-Philippe Robichaud
Abstract:
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers and pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open-source speech recognition models across a variety of long-form speech recognition domains.
Submitted 24 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation
Authors:
Esam Ghaleb,
Bulat Khaertdinov,
Wim Pouw,
Marlou Rasenberg,
Judith Holler,
Aslı Özyürek,
Raquel Fernández
Abstract:
In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures' variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
Submitted 31 August, 2024;
originally announced September 2024.
-
Exploring the Benefits of Tokenization of Discrete Acoustic Units
Authors:
Avihu Dekel,
Raul Fernandez
Abstract:
Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
Submitted 8 June, 2024;
originally announced June 2024.
-
A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers
Authors:
Slava Shechtman,
Raul Fernandez
Abstract:
Modern neural TTS systems are capable of generating natural and expressive speech when provided with sufficient amounts of training data. Such systems can be equipped with prosody-control functionality, allowing for more direct shaping of the speech output at inference time. In some TTS applications, it may be desirable to have an option that guides the TTS system with an ad-hoc speech recording exemplar to impose an implicit fine-grained, user-preferred prosodic realization for certain input prompts. In this work we present a first-of-its-kind neural TTS system equipped with such functionality to transfer the prosody from a parallel text recording from an unseen speaker. We demonstrate that the proposed system can precisely transfer the speech prosody from novel speakers to various trained TTS voices with no quality degradation, while preserving the target TTS speakers' identity, as evaluated by a set of subjective listening experiments.
Submitted 20 September, 2023;
originally announced September 2023.
-
Speak While You Think: Streaming Speech Synthesis During Text Generation
Authors:
Avihu Dekel,
Slava Shechtman,
Raul Fernandez,
David Haws,
Zvi Kons,
Ron Hoory
Abstract:
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture that synthesizes speech while text is being generated by an LLM, yielding a significant latency reduction. LLM2Speech mimics the predictions of a non-streaming teacher model while limiting the exposure to future context in order to enable streaming. It exploits the hidden embeddings of the LLM, a by-product of the text generation that contains informative semantic context. Experimental results show that LLM2Speech maintains the teacher's quality while reducing the latency to enable natural conversations.
Submitted 20 September, 2023;
originally announced September 2023.
-
Short-Term Aggregated Residential Load Forecasting using BiLSTM and CNN-BiLSTM
Authors:
Bharat Bohara,
Raymond I. Fernandez,
Vysali Gollapudi,
Xingpeng Li
Abstract:
Higher penetration of renewable and smart home technologies at the residential level challenges grid stability as utility-customer interactions add complexity to power system operations. In response, short-term residential load forecasting has become a growing area of focus. However, forecasting at the residential level is challenging due to the higher uncertainties involved. Recently, deep neural networks have been leveraged to address this issue. This paper investigates the capabilities of a bidirectional long short-term memory (BiLSTM) and a convolutional neural network-based BiLSTM (CNN-BiLSTM) to provide day-ahead (24 hr.) forecasts at an hourly resolution while minimizing the root mean squared error (RMSE) between the actual and predicted load demand. Using a publicly available dataset consisting of 38 homes, the BiLSTM and CNN-BiLSTM models are trained to forecast the aggregated active power demand for each hour within a 24 hr. span, given the previous 24 hr. load data. The BiLSTM model achieved the lowest RMSE of 1.4842 for the overall daily forecast. In addition, standard LSTM and CNN-LSTM models are trained and compared with the BiLSTM architecture. The RMSE of BiLSTM is 5.60%, 2.85% and 2.60% lower than that of the LSTM, CNN-LSTM and CNN-BiLSTM models, respectively. The source code of this work is available at https://github.com/Varat7v2/STLF-BiLSTM-CNNBiLSTM.git.
Submitted 9 February, 2023;
originally announced February 2023.
-
Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
Authors:
Raul Fernandez,
David Haws,
Guy Lorberbom,
Slava Shechtman,
Alexander Sorin
Abstract:
Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate amounts of training data. Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer Interaction systems. In this work we explore one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections. We elaborate on the corpus design and explore the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation. In a set of subjective listening experiments, this approach resulted in high-fidelity style transfer with no quality degradation. However, a certain voice persona shift was observed, requiring further improvements in voice conversion.
Submitted 25 July, 2022;
originally announced July 2022.
-
Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis
Authors:
Slava Shechtman,
Raul Fernandez,
David Haws
Abstract:
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests that demonstrate we are able to successfully realize controllable focus while maintaining the same, or higher, naturalness over an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.
Submitted 25 January, 2021;
originally announced January 2021.
-
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Authors:
Hassan Akbari,
Hamid Palangi,
Jianwei Yang,
Sudha Rao,
Asli Celikyilmaz,
Roland Fernandez,
Paul Smolensky,
Jianfeng Gao,
Shih-Fu Chang
Abstract:
Neuro-symbolic representations have proved effective in learning structural information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structures, which results in higher-quality captions for videos. Our experiments on two established video captioning datasets verify the effectiveness of the proposed approach based on automatic metrics. We further conduct a human evaluation to measure the grounding and relevance of the generated captions and observe consistent improvement for the proposed model. The code and trained models can be found at https://github.com/hassanhub/R3Transformer
Submitted 18 November, 2020;
originally announced November 2020.
-
Robust Optimal Planning and Control of Non-Periodic Bipedal Locomotion with A Centroidal Momentum Model
Authors:
Ye Zhao,
Benito R. Fernandez,
Luis Sentis
Abstract:
This study presents a theoretical method for planning and controlling agile bipedal locomotion based on robustly tracking a set of non-periodic keyframe states. Based on centroidal momentum dynamics, we formulate a hybrid phase-space planning and control method which includes the following key components: (i) a step transition solver that enables dynamically tracking non-periodic keyframe states over various types of terrains, (ii) a robust hybrid automaton to effectively formulate planning and control algorithms, (iii) a steering direction model to control the robot's heading, (iv) a phase-space metric to measure distance to the planned locomotion manifolds, and (v) a hybrid control method based on the previous distance metric to produce robust dynamic locomotion under external disturbances. Compared to other locomotion methodologies, we have a large focus on non-periodic gait generation and robustness metrics to deal with disturbances. Such focus enables the proposed control method to robustly track non-periodic keyframe states over various challenging terrains and under external disturbances as illustrated through several simulations.
Submitted 19 August, 2017;
originally announced August 2017.
-
A Framework for Planning and Controlling Non-Periodic Bipedal Locomotion
Authors:
Ye Zhao,
Benito R. Fernandez,
Luis Sentis
Abstract:
This study presents a theoretical framework for planning and controlling agile bipedal locomotion based on robustly tracking a set of non-periodic apex states. Based on the prismatic inverted pendulum model, we formulate a hybrid phase-space planning and control framework which includes the following key components: (1) a step transition solver that enables dynamically tracking non-periodic apex or keyframe states over various types of terrains, (2) a robust hybrid automaton to effectively formulate planning and control algorithms, (3) a phase-space metric to measure distance to the planned locomotion manifolds, and (4) a hybrid control method based on the previous distance metric to produce robust dynamic locomotion under external disturbances. Compared to other locomotion frameworks, we have a larger focus on non-periodic gait generation and robustness metrics to deal with disturbances. Such focus enables the proposed control framework to robustly track non-periodic apex states over various challenging terrains and under external disturbances as illustrated through several simulations. Additionally, it allows a bipedal robot to perform non-periodic bouncing maneuvers over disjointed terrains.
Submitted 14 November, 2015;
originally announced November 2015.