Showing 1–8 of 8 results for author: Dimitriadis, D

Searching in archive eess.
  1. arXiv:2406.08800  [pdf, other]

    cs.SD cs.LG eess.AS

    Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

    Authors: Tiantian Feng, Dimitrios Dimitriadis, Shrikanth Narayanan

    Abstract: Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the qua…

    Submitted 29 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to 2024 INTERSPEECH; corrections to ActivityNet labels

  2. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, page 3775-3779

  3. arXiv:2101.09624  [pdf, other]

    eess.AS cs.CL cs.SD

    A Review of Speaker Diarization: Recent Advances with Deep Learning

    Authors: Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

    Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application o…

    Submitted 26 November, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

    Comments: This article is a preprint version of the article published in Computer Speech & Language, Volume 72, March 2022, 101317

  4. A Memory Augmented Architecture for Continuous Speaker Identification in Meetings

    Authors: Nikolaos Flemotomos, Dimitrios Dimitriadis

    Abstract: We introduce and analyze a novel approach to the problem of speaker identification in multi-party recorded meetings. Given a speech segment and a set of available candidate profiles, we propose a novel data-driven way to model the distance relations between them, aiming at identifying the speaker label corresponding to that segment. To achieve this we employ a recurrent, memory-based architecture,…

    Submitted 14 January, 2020; originally announced January 2020.

    Comments: Submitted to ICASSP 2020

  5. arXiv:1912.04979  [pdf, other]

    eess.AS cs.CL cs.CV cs.SD eess.IV

    Advances in Online Audio-Visual Meeting Transcription

    Authors: Takuya Yoshioka, Igor Abramovski, Cem Aksoylar, Zhuo Chen, Moshe David, Dimitrios Dimitriadis, Yifan Gong, Ilya Gurvich, Xuedong Huang, Yan Huang, Aviv Hurvitz, Li Jiang, Sharon Koubi, Eyal Krupka, Ido Leichter, Changliang Liu, Partha Parthasarathy, Alon Vinnikov, Lingfeng Wu, Xiong Xiao, Wayne Xiong, Huaming Wang, Zhenghao Wang, Jun Zhang, Yong Zhao, et al. (1 additional author not shown)

    Abstract: This paper describes a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera. The hallmark of the system is its ability to handle overlapped speech, which has been an unsolved problem in realistic settings for over a decade. We show that this problem can be addressed by using a continuous speech separation approach. In addition, we desc…

    Submitted 10 December, 2019; originally announced December 2019.

    Comments: To appear in Proc. IEEE ASRU Workshop 2019

  6. arXiv:1909.00082  [pdf, ps, other]

    eess.AS cs.SD

    Enhancements for Audio-only Diarization Systems

    Authors: Dimitrios Dimitriadis

    Abstract: In this paper two different approaches to enhance the performance of the most challenging component of a Speaker Diarization system are presented, i.e. the speaker clustering part. A processing step is proposed enhancing the input features with a temporal smoothing process combined with nonlinear filtering. We, also, propose improvements on the Deep Embedded Clustering (DEC) algorithm -- a nonline…

    Submitted 30 August, 2019; originally announced September 2019.

  7. arXiv:1905.02545  [pdf, other]

    eess.AS cs.CL cs.SD

    Meeting Transcription Using Virtual Microphone Arrays

    Authors: Takuya Yoshioka, Zhuo Chen, Dimitrios Dimitriadis, William Hinthorn, Xuedong Huang, Andreas Stolcke, Michael Zeng

    Abstract: We describe a system that generates speaker-annotated transcripts of meetings by using a virtual microphone array, a set of spatially distributed asynchronous recording devices such as laptops and mobile phones. The system is composed of continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization using prior speaker information, and system combination. When utiliz…

    Submitted 7 July, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

    Report number: MSR-TR-2019-11

  8. arXiv:1904.06478  [pdf, other]

    eess.AS cs.CL cs.SD

    Low-Latency Speaker-Independent Continuous Speech Separation

    Authors: Takuya Yoshioka, Zhuo Chen, Changliang Liu, Xiong Xiao, Hakan Erdogan, Dimitrios Dimitriadis

    Abstract: Speaker independent continuous speech separation (SI-CSS) is a task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals each of which contains no overlapping speech segment. A separated, or cleaned, version of each utterance is generated from one of SI-CSS's output channels nondeterministically without being s…

    Submitted 13 April, 2019; originally announced April 2019.