Showing 1–23 of 23 results for author: Skoglund, J

  1. arXiv:2505.11915  [pdf, other]

    eess.AS cs.SD

    BINAQUAL: A Full-Reference Objective Localization Similarity Metric for Binaural Audio

    Authors: Davoud Shariat Panah, Dan Barry, Alessandro Ragano, Jan Skoglund, Andrew Hines

    Abstract: Spatial audio enhances immersion in applications such as virtual reality, augmented reality, gaming, and cinema by creating a three-dimensional auditory experience. Ensuring the spatial fidelity of binaural audio is crucial, given that processes such as compression, encoding, or transmission can alter localization cues. While subjective listening tests like MUSHRA remain the gold standard for eval…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: Submitted to the Journal of Audio Engineering Society (JAES)

  2. arXiv:2505.01369  [pdf, ps, other]

    cs.SD eess.AS

    Binamix -- A Python Library for Generating Binaural Audio Datasets

    Authors: Dan Barry, Davoud Shariat Panah, Alessandro Ragano, Jan Skoglund, Andrew Hines

    Abstract: The increasing demand for spatial audio in applications such as virtual reality, immersive media, and spatial audio research necessitates robust solutions to generate binaural audio data sets for use in testing and validation. Binamix is an open-source Python library designed to facilitate programmatic binaural mixing using the extensive SADIE II Database, which provides Head Related Impulse Respo…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: Accepted to the 158th Audio Engineering Society Convention, 2025

  3. Perceptual Audio Coding: A 40-Year Historical Perspective

    Authors: Jürgen Herre, Schuyler Quackenbush, Minje Kim, Jan Skoglund

    Abstract: In the history of audio and acoustic signal processing, perceptual audio coding has certainly excelled as a bright success story by its ubiquitous deployment in virtually all digital media devices, such as computers, tablets, mobile phones, set-top-boxes, and digital radios. From a technology perspective, perceptual audio coding has undergone tremendous development from the first very basic percep…

    Submitted 22 April, 2025; originally announced April 2025.

    Journal ref: Published in the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

  4. arXiv:2410.06675  [pdf, other]

    cs.SD eess.AS

    SCOREQ: Speech Quality Assessment with Contrastive Regression

    Authors: Alessandro Ragano, Jan Skoglund, Andrew Hines

    Abstract: In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state of the art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing at capturing the continuous nature of the mean opinion score (MOS…

    Submitted 15 January, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: Accepted NeurIPS 2024

  5. arXiv:2408.06954  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    Neural Speech and Audio Coding: Modern AI Technology Meets Traditional Codecs

    Authors: Minje Kim, Jan Skoglund

    Abstract: This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based met…

    Submitted 6 January, 2025; v1 submitted 13 August, 2024; originally announced August 2024.

    Comments: Published in IEEE Signal Processing Magazine

    Journal ref: in IEEE Signal Processing Magazine, vol. 41, no. 6, pp. 85-93, Nov. 2024

  6. arXiv:2309.16284  [pdf, other]

    cs.SD eess.AS

    NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment

    Authors: Alessandro Ragano, Jan Skoglund, Andrew Hines

    Abstract: This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score bet…

    Submitted 19 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted for ICASSP 2024

  7. arXiv:2303.12984  [pdf, other]

    cs.SD eess.AS

    LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

    Authors: Teerapat Jenrungrot, Michael Chinen, W. Bastiaan Kleijn, Jan Skoglund, Zalán Borsos, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the tran…

    Submitted 22 March, 2023; originally announced March 2023.

    Comments: 5 pages, accepted to ICASSP 2023, project page: https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec

  8. arXiv:2209.06358  [pdf, other]

    cs.SD cs.LG eess.AS

    Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

    Authors: Michael Chinen, Jan Skoglund, Chandan K A Reddy, Alessandro Ragano, Andrew Hines

    Abstract: Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality mo…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: Preprint; accepted for Interspeech 2022

  9. arXiv:2207.02262  [pdf, other]

    cs.SD cs.LG eess.AS

    Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

    Authors: Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

    Abstract: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective rec…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: Proceedings of INTERSPEECH 2022

  10. arXiv:2204.02249  [pdf, other]

    eess.AS

    A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality

    Authors: Alessandro Ragano, Emmanouil Benetos, Michael Chinen, Helard B. Martinez, Chandan K. A. Reddy, Jan Skoglund, Andrew Hines

    Abstract: Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influ…

    Submitted 24 November, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted ISSC 2023

  11. arXiv:2107.03312  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundStream: An End-to-End Neural Audio Codec

    Authors: Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

    Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and s…

    Submitted 7 July, 2021; originally announced July 2021.

  12. arXiv:2102.11906  [pdf, other]

    eess.AS cs.SD

    Handling Background Noise in Neural Speech Generation

    Authors: Tom Denton, Alejandro Luebs, Felicia S. C. Lim, Andrew Storus, Hengchin Yeh, W. Bastiaan Kleijn, Jan Skoglund

    Abstract: Recent advances in neural-network based generative modeling of speech has shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing its use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 5 pages, 3 figures, presented at the Asilomar Conference on Signals, Systems, and Computers 2020

  13. arXiv:2102.10449  [pdf, other]

    eess.AS eess.SP

    WARP-Q: Quality Prediction For Generative Neural Speech Codecs

    Authors: Wissam A. Jassim, Jan Skoglund, Michael Chinen, Andrew Hines

    Abstract: Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high quality wideband speech with bit streams less than 3 kb/s. These codecs use a DNN with parametric input to synthesise high quality speech outputs. Existing objective speech quality models (e.g., POLQA, ViSQOL) do not accurat…

    Submitted 20 February, 2021; originally announced February 2021.

    Comments: Accepted for presentation at IEEE ICASSP 2021. Source code and data can be found on https://github.com/wjassim/WARP-Q.git

  14. arXiv:2102.09660  [pdf, other]

    eess.AS cs.SD

    Generative Speech Coding with Predictive Variance Regularization

    Authors: W. Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Hengchin Yeh

    Abstract: The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the in…

    Submitted 18 February, 2021; originally announced February 2021.

    MSC Class: 94; ACM Class: I.m

  15. arXiv:2004.09584  [pdf, other]

    eess.AS cs.SD eess.SP

    ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

    Authors: Michael Chinen, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus O'Gorman, Andrew Hines

    Abstract: Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively,) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production…

    Submitted 20 April, 2020; originally announced April 2020.

    Comments: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX)

  16. arXiv:2003.11882  [pdf, other]

    eess.AS cs.SD

    Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders

    Authors: Wissam A. Jassim, Jan Skoglund, Michael Chinen, Andrew Hines

    Abstract: This study compares the performances of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. Performance analysis of the coded speech is evaluated for different quality aspects: accuracy of pitch periods estimation, the word error rates for…

    Submitted 26 March, 2020; originally announced March 2020.

    Comments: 6 pages, 11 figures, conference

  17. arXiv:1909.04776  [pdf, other]

    eess.AS cs.SD

    Generative Speech Enhancement Based on Cloned Networks

    Authors: Michael Chinen, W. Bastiaan Kleijn, Felicia S. C. Lim, Jan Skoglund

    Abstract: We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the cl…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: Accepted WASPAA 2019

  18. arXiv:1908.07045  [pdf, other]

    eess.AS cs.SD

    Salient Speech Representations Based on Cloned Networks

    Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Michael Chinen, Jan Skoglund

    Abstract: We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receiv…

    Submitted 19 August, 2019; originally announced August 2019.

    Comments: Interspeech 2019

  19. arXiv:1905.04628  [pdf, other]

    eess.AS cs.SD

    Improving Opus Low Bit Rate Quality with Neural Speech Synthesis

    Authors: Jan Skoglund, Jean-Marc Valin

    Abstract: The voice mode of the Opus audio coder can compress wideband speech at bit rates ranging from 6 kb/s to 40 kb/s. However, Opus is at its core a waveform matching coder, and as the rate drops below 10 kb/s, quality degrades quickly. As the rate reduces even further, parametric coders tend to perform better than waveform coders. In this paper we propose a backward-compatible way of improving low bit…

    Submitted 10 August, 2020; v1 submitted 11 May, 2019; originally announced May 2019.

    Comments: Proc. Interspeech 2020, 5 pages

  20. arXiv:1903.12087  [pdf, other]

    eess.AS cs.LG cs.SD

    A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet

    Authors: Jean-Marc Valin, Jan Skoglund

    Abstract: Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we present a low-bitrate neural vocoder based on the LPCNet model. The use of linear prediction and sparse recurrent networks makes it possible to achieve real-time oper…

    Submitted 27 June, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

    Comments: Accepted for Interspeech 2019, 5 pages

  21. arXiv:1811.07030  [pdf, other]

    cs.SD eess.AS

    Exploring Tradeoffs in Models for Low-latency Speech Enhancement

    Authors: Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A. Saurous, Jan Skoglund, Richard F. Lyon

    Abstract: We explore a variety of neural networks configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and…

    Submitted 16 November, 2018; originally announced November 2018.

  22. arXiv:1810.11846  [pdf, other]

    eess.AS cs.LG cs.SD

    LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

    Authors: Jean-Marc Valin, Jan Skoglund

    Abstract: Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent…

    Submitted 19 February, 2019; v1 submitted 28 October, 2018; originally announced October 2018.

    Comments: ICASSP 2019, 5 pages

  23. arXiv:1712.01120  [pdf, other]

    eess.AS cs.SD eess.SP

    Wavenet based low rate speech coding

    Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, Thomas C. Walters

    Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative m…

    Submitted 1 December, 2017; originally announced December 2017.

    Comments: 5 pages, 2 figures