Showing 1–23 of 23 results for author: Juvela, L

  1. arXiv:2505.15368  [pdf, ps, other]

    cs.SD eess.AS

    Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GAN

    Authors: Yicheng Gu, Chaoren Wang, Zhizheng Wu, Lauri Juvela

    Abstract: Pitch manipulation is the process of producers adjusting the pitch of an audio segment to a specific key and intonation, which is essential in music production. Neural-network-based pitch-manipulation systems have been popular in recent years due to their superior synthesis quality compared to classical DSP methods. However, their performance is still limited due to their inaccurate feature disent…

    Submitted 28 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  2. arXiv:2504.04751  [pdf, other]

    eess.AS cs.AI

    Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

    Authors: Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic, Vesa Välimäki

    Abstract: Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: Submitted to the 28th International Conference on Digital Audio Effects (DAFx25)

  3. arXiv:2504.04589  [pdf, other]

    cs.SD eess.AS eess.SP

    Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling

    Authors: Yicheng Gu, Runsong Zhang, Lauri Juvela, Zhizheng Wu

    Abstract: Virtual Analog (VA) modeling aims to simulate the behavior of hardware circuits via algorithms to replicate their tone digitally. Dynamic Range Compressor (DRC) is an audio processing module that controls the dynamics of a track by reducing and amplifying the volumes of loud and quiet sounds, which is essential in music production. In recent years, neural-network-based VA modeling has shown great…

    Submitted 28 May, 2025; v1 submitted 6 April, 2025; originally announced April 2025.

  4. arXiv:2501.05959  [pdf, other]

    eess.AS

    Estimation and Restoration of Unknown Nonlinear Distortion using Diffusion

    Authors: Michal Švento, Eloi Moliner, Lauri Juvela, Alec Wright, Vesa Välimäki

    Abstract: The restoration of nonlinearly distorted audio signals, alongside the identification of the applied memoryless nonlinear operation, is studied. The paper focuses on the difficult but practically important case in which both the nonlinearity and the original input signal are unknown. The proposed method uses a generative diffusion model trained unconditionally on guitar or speech signals to jointly…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: Submitted to the Journal of Audio Engineering Society, special issue "The Sound of Digital Audio Effects"

  5. arXiv:2411.14972  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Open-Amp: Synthetic Data Framework for Audio Effect Foundation Models

    Authors: Alec Wright, Alistair Carson, Lauri Juvela

    Abstract: This paper introduces Open-Amp, a synthetic data framework for generating large-scale and diverse audio effects data. Audio effects are relevant to many musical audio processing and Music Information Retrieval (MIR) tasks, such as modelling of analog audio effects, automatic mixing, tone matching and transcription. Existing audio effects datasets are limited in scope, usually including relatively…

    Submitted 22 November, 2024; originally announced November 2024.

  6. arXiv:2409.14823  [pdf, other]

    cs.SD eess.AS

    HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters

    Authors: Lauri Juvela, Pablo Pérez Zarazaga, Gustav Eje Henter, Zofia Malisz

    Abstract: We introduce an end-to-end neural speech synthesis system that uses the source-filter model of speech production. Specifically, we apply differentiable resonant filters to a glottal waveform generated by a neural vocoder. The aim is to obtain a controllable synthesiser, similar to classic formant synthesis, but with much higher perceptual quality - filling a research gap in current neural waveform…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  7. arXiv:2409.13382  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis

    Authors: Lauri Juvela, Xin Wang

    Abstract: Automatic detection of synthetic speech is becoming increasingly important as current synthesis methods are both near indistinguishable from human speech and widely accessible to the public. Audio watermarking and other active disclosure methods are attracting research activity, as they can complement traditional deepfake defenses based on passive detection. In both active and passive detection…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  8. arXiv:2403.08559  [pdf, other]

    cs.SD eess.AS

    End-to-End Amp Modeling: From Data to Controllable Guitar Amplifier Models

    Authors: Lauri Juvela, Eero-Pekka Damskägg, Aleksi Peussa, Jaakko Mäkinen, Thomas Sherson, Stylianos I. Mimilakis, Athanasios Gotsopoulos

    Abstract: This paper describes a data-driven approach to creating real-time neural network models of guitar amplifiers, recreating the amplifiers' sonic response to arbitrary inputs at the full range of controls present on the physical device. While the focus of the paper is on the data collection pipeline, we demonstrate the effectiveness of this conditioned black-box approach by training an LSTM model to…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Presented at ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  9. arXiv:2309.15224  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Collaborative Watermarking for Adversarial Speech Synthesis

    Authors: Lauri Juvela, Xin Wang

    Abstract: Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic…

    Submitted 2 January, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  10. arXiv:2309.07658  [pdf, other]

    cs.SD eess.AS

    DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

    Authors: Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi

    Abstract: We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness…

    Submitted 14 September, 2023; originally announced September 2023.

  11. arXiv:2306.01957  [pdf, other]

    eess.AS

    Speaker-independent neural formant synthesis

    Authors: Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

    Abstract: We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spect…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 5 pages, 4 figures. Article accepted at INTERSPEECH 2023

  12. arXiv:2211.00943  [pdf, other]

    eess.AS cs.SD

    Adversarial Guitar Amplifier Modelling With Unpaired Data

    Authors: Alec Wright, Vesa Välimäki, Lauri Juvela

    Abstract: We propose an audio effects processing framework that learns to emulate a target electric guitar tone from a recording. We train a deep neural network using an adversarial approach, with the goal of transforming the timbre of a guitar into the timbre of another guitar after audio effects processing has been applied, for example, by a guitar amplifier. The model training requires no paired data, a…

    Submitted 20 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  13. arXiv:2004.13764  [pdf, other]

    eess.AS cs.SD

    Conditional Spoken Digit Generation with StyleGAN

    Authors: Kasperi Palkama, Lauri Juvela, Alexander Ilin

    Abstract: This paper adapts a StyleGAN model for speech generation with minimal or no conditioning on text. StyleGAN is a multi-scale convolutional GAN capable of hierarchically capturing data structure and latent variation on multiple spatial (or temporal) levels. The model has previously achieved impressive results on facial image generation, and it is appealing to audio applications due to similar multi-…

    Submitted 15 September, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

    Comments: Interspeech2020 accepted version

  14. arXiv:1911.01601  [pdf, other]

    eess.AS cs.CR cs.SD eess.SP

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

    Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to imperso…

    Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

  15. arXiv:1910.12381  [pdf, other]

    eess.AS cs.SD stat.ML

    Transferring neural speech waveform synthesizers to musical instrument sounds generation

    Authors: Yi Zhao, Xin Wang, Lauri Juvela, Junichi Yamagishi

    Abstract: Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work…

    Submitted 18 November, 2019; v1 submitted 27 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  16. arXiv:1904.03976  [pdf, other]

    eess.AS cs.LG cs.SD

    GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram

    Authors: Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

    Abstract: Recent advances in neural-network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neur…

    Submitted 26 June, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019 accepted version

  17. arXiv:1903.05955  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

    Authors: Bajibabu Bollepalli, Lauri Juvela, Paavo Alku

    Abstract: Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-base…

    Submitted 14 March, 2019; originally announced March 2019.

    Comments: Accepted in Interspeech

    Journal ref: Interspeech-2017

  18. arXiv:1811.00334  [pdf, other]

    eess.AS cs.SD

    Deep Learning for Tube Amplifier Emulation

    Authors: Eero-Pekka Damskägg, Lauri Juvela, Etienne Thuillier, Vesa Välimäki

    Abstract: Analog audio effects and synthesizers often owe their distinct sound to circuit nonlinearities. Faithfully modeling such a significant aspect of the original sound in virtual analog software can prove challenging. The current work proposes a generic data-driven approach to virtual analog modeling and applies it to the Fender Bassman 56F-A vacuum-tube amplifier. Specifically, a feedforward variant of…

    Submitted 20 February, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: Accepted to ICASSP 2019

  19. arXiv:1810.12598  [pdf, other]

    eess.AS cs.SD stat.ML

    Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

    Authors: Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

    Abstract: The state of the art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more expensive computationally. Meanwhile, generative adversarial networks (GANs) have achieved impressive resul…

    Submitted 30 October, 2018; originally announced October 2018.

    Comments: Submitted to ICASSP 2019

  20. arXiv:1810.12051  [pdf, other]

    cs.SD cs.CL eess.AS

    Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

    Authors: Bajibabu Bollepalli, Lauri Juvela, Paavo Alku

    Abstract: Currently, there is increasing interest in using sequence-to-sequence models with attention for text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech with good quality. However, in challeng…

    Submitted 29 October, 2018; originally announced October 2018.

    Comments: 5 pages, 5 figures. Submitted to ICASSP 2019

  21. arXiv:1804.09593  [pdf, other]

    eess.AS cs.SD stat.ML

    Speaker-independent raw waveform model for glottal excitation

    Authors: Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

    Abstract: Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing t…

    Submitted 25 April, 2018; originally announced April 2018.

    Comments: Submitted to Interspeech 2018

  22. arXiv:1804.02549  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

    Authors: Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi

    Abstract: Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches…

    Submitted 7 April, 2018; originally announced April 2018.

    Comments: To appear in ICASSP 2018

  23. arXiv:1804.00920  [pdf, ps, other]

    eess.AS cs.CL cs.SD stat.ML

    Speech waveform synthesis from MFCC sequences with generative adversarial networks

    Authors: Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

    Abstract: This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information containe…

    Submitted 3 April, 2018; originally announced April 2018.