Showing 1–24 of 24 results for author: Eskimez, S E

  1. arXiv:2407.12229  [pdf, other]

    eess.AS cs.AI eess.SP

    Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

    Authors: Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. Em…

    Submitted 17 September, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by SLT2024. See https://aka.ms/emoctrl-tts for demo samples

  2. Target conversation extraction: Source separation using turn-taking dynamics

    Authors: Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in h…

    Submitted 29 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  3. arXiv:2407.11055  [pdf, other]

    cs.LG cs.SD eess.AS

    Knowledge boosting during low-latency inference

    Authors: Vidya Srinivas, Malek Itani, Tuochao Chen, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not gu…

    Submitted 25 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2406.18009  [pdf, other]

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the…

    Submitted 12 September, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted to SLT 2024. Added evaluation data, see https://github.com/microsoft/e2tts-test-suite for more details

  5. arXiv:2406.05699  [pdf, ps, other]

    eess.AS cs.AI eess.SP

    An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

    Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audi…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH2024

  6. arXiv:2402.07383  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

    Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

    Abstract: Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing an…

    Submitted 4 March, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: See https://aka.ms/elate/ for demo samples, v2: subjective evaluation has been added

  7. arXiv:2308.06873  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

    Abstract: Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile…

    Submitted 25 June, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

    Comments: To appear in TASLP. See https://aka.ms/speechx for demo samples

  8. arXiv:2211.05172  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech separation with large-scale self-supervised learning

    Authors: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

    Abstract: Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained mo…

    Submitted 25 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

  9. arXiv:2211.02944  [pdf, other]

    eess.AS cs.SD

    Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. The PSE models also tend to leak interfering speech when the target speaker is…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2211.02773  [pdf, other]

    eess.AS cs.SD

    Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang

    Abstract: Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices. To deploy a PSE model for full duplex communications, the model must be combined with acoustic echo cancellation (AEC), although such a combination has been less explored. This paper proposes a series of methods that ar…

    Submitted 25 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  11. arXiv:2204.03232  [pdf, other]

    eess.AS cs.AI eess.SP

    Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

    Authors: Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Existing multi-channel continuous speech separation (CSS) models are heavily dependent on supervised data - either simulated data which causes data mismatch between the training and real-data testing, or the real transcribed overlapping data, which is difficult to be acquired, hindering further improvements in the conversational/meeting transcription tasks. In this paper, we propose a three-stage…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  12. arXiv:2204.00771  [pdf, other]

    eess.AS cs.SD eess.SP

    Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

    Authors: Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang

    Abstract: This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. Besides, we use KD techniques to develop compresse…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech conference 2022 https://interspeech2022.org/

  13. arXiv:2202.13288  [pdf, other]

    eess.AS cs.SD

    ICASSP 2022 Deep Noise Suppression Challenge

    Authors: Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner

    Abstract: The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. This is the 4th DNS challenge, with the previous editions held at INTERSPEECH 2020, ICASSP 2021, and INTERSPEECH 2021. We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective e…

    Submitted 26 February, 2022; originally announced February 2022.

  14. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, page 3775-3779

  15. arXiv:2110.14142  [pdf, other]

    eess.AS cs.SD

    Separating Long-Form Speech with Group-Wise Permutation Invariant Training

    Authors: Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

    Abstract: Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it l…

    Submitted 17 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022

  16. arXiv:2110.10330  [pdf, other]

    eess.AS cs.SD

    One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Zhuo Chen, Xuedong Huang

    Abstract: With the recent surge of video conferencing tools usage, providing high-quality speech signals and accurate captions have become essential to conduct day-to-day business or connect with friends and families. Single-channel personalized speech enhancement (PSE) methods show promising results compared with the unconditional speech enhancement (SE) methods in these scenarios due to their ability to r…

    Submitted 19 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  17. arXiv:2110.09625  [pdf, other]

    eess.AS cs.LG cs.SD

    Personalized Speech Enhancement: New Models and Comprehensive Evaluation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang

    Abstract: Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed Voice…

    Submitted 18 October, 2021; originally announced October 2021.

  18. arXiv:2110.06428  [pdf, other]

    eess.AS cs.SD

    All-neural beamformer for continuous speech separation

    Authors: Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

    Abstract: Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followe…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 2 tables

  19. arXiv:2106.07578  [pdf, other]

    cs.LG cs.DC

    Dynamic Gradient Aggregation for Federated Domain Adaptation

    Authors: Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

    Abstract: In this paper, a new learning algorithm for Federated Learning (FL) is introduced. The proposed scheme is based on a weighted gradient aggregation using two-step optimization to offer a flexible training pipeline. Herein, two different flavors of the aggregation method are presented, leading to an order of magnitude improvement in convergence speed compared to other distributed or FL training algo…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2008.02452

  20. arXiv:2008.03592  [pdf, other]

    eess.AS cs.CV cs.LG cs.MM

    Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

    Authors: Sefik Emre Eskimez, You Zhang, Zhiyao Duan

    Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face vid…

    Submitted 21 July, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Accepted to IEEE Transactions on Multimedia

  21. arXiv:2008.02452  [pdf, other]

    cs.LG cs.DC stat.ML

    Federated Transfer Learning with Dynamic Gradient Aggregation

    Authors: Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

    Abstract: In this paper, a Federated Learning (FL) simulation platform is introduced. The target scenario is Acoustic Model training based on this platform. To our knowledge, this is the first attempt to apply FL techniques to Speech Recognition tasks due to the inherent complexity. The proposed FL platform can support different tasks based on the adopted modular design. As part of the platform, a novel hie…

    Submitted 6 August, 2020; originally announced August 2020.

  22. arXiv:2004.04438  [pdf, other]

    cs.CL

    Improving Readability for Automatic Speech Recognition Transcription

    Authors: Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi, Ming Gong, Linjun Shou, Hong Qu, Michael Zeng

    Abstract: Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to grammatical errors, disfluency, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and…

    Submitted 9 April, 2020; originally announced April 2020.

  23. arXiv:1803.09803  [pdf, other]

    cs.CV

    Generating Talking Face Landmarks from Speech

    Authors: Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, Zhiyao Duan

    Abstract: The presence of a corresponding talking face has been shown to significantly improve speech intelligibility in noisy conditions and for hearing impaired population. In this paper, we present a system that can generate landmark points of a talking face from an acoustic speech in real time. The system uses a long short-term memory (LSTM) network and is trained on frontal videos of 27 different speak…

    Submitted 23 April, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: To Appear in LVA ICA 2018. Please see the following link: http://www2.ece.rochester.edu/projects/air/projects/talkingface.html

  24. arXiv:1510.06769  [pdf, other]

    cs.HC

    Emotion Classification: How Does an Automated System Compare to Naive Human Coders?

    Authors: Sefik Emre Eskimez, Kenneth Imade, Na Yang, Melissa Sturge-Apple, Zhiyao Duan, Wendi Heinzelman

    Abstract: The fact that emotions play a vital role in social interactions, along with the demand for novel human-computer interaction applications, have led to the development of a number of automatic emotion classification systems. However, it is still debatable whether the performance of such systems can compare with human coders. To address this issue, in this study, we present a comprehensive comparison…

    Submitted 21 January, 2016; v1 submitted 22 October, 2015; originally announced October 2015.

    Comments: Accepted to ICASSP 2016