Skip to main content

Showing 1–22 of 22 results for author: Korzekwa, D

Searching in archive cs. Search in all archives.
.
  1. arXiv:2508.14444  [pdf, ps, other

    cs.CL cs.AI cs.LG

    NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

    Authors: NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan , et al. (192 additional authors not shown)

    Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achi… ▽ More

    Submitted 2 September, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

  2. arXiv:2507.09310  [pdf, ps, other

    cs.SD cs.CL eess.AS

    Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning

    Authors: Dominika Woszczyk, Manuel Sam Ribeiro, Thomas Merritt, Daniel Korzekwa

    Abstract: Text-to-Speech (TTS) systems in Lombard speaking style can improve the overall intelligibility of speech, useful for hearing loss and noisy conditions. However, training those models requires a large amount of data and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation te… ▽ More

    Submitted 12 July, 2025; originally announced July 2025.

    Comments: Presented at Clarity Challenge 2023

  3. arXiv:2506.02890  [pdf, other

    cs.LG cs.AI cs.CL

    Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

    Authors: Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa

    Abstract: Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches - utilizing more numerous, smaller experts - have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its… ▽ More

    Submitted 3 June, 2025; originally announced June 2025.

  4. arXiv:2504.11409  [pdf, other

    cs.CL

    Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

    Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

    Abstract: Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce… ▽ More

    Submitted 15 April, 2025; originally announced April 2025.

  5. arXiv:2504.03624  [pdf, ps, other

    cs.CL cs.AI cs.LG

    Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

    Authors: NVIDIA, :, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo , et al. (176 additional authors not shown)

    Abstract: As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transf… ▽ More

    Submitted 5 September, 2025; v1 submitted 4 April, 2025; originally announced April 2025.

  6. arXiv:2408.11796  [pdf, other

    cs.CL cs.AI cs.LG

    LLM Pruning and Distillation in Practice: The Minitron Approach

    Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev, Daniel Korzekwa, Pavlo Molchanov, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro

    Abstract: We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Align… ▽ More

    Submitted 9 December, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: v4: Update author order

  7. arXiv:2312.16552  [pdf, other

    cs.SD cs.LG eess.AS

    AE-Flow: AutoEncoder Normalizing Flow

    Authors: Jakub Mosiński, Piotr Biliński, Thomas Merritt, Abdelhamid Ezzerg, Daniel Korzekwa

    Abstract: Recently normalizing flows have been gaining traction in text-to-speech (TTS) and voice conversion (VC) due to their state-of-the-art (SOTA) performance. Normalizing flows are unsupervised generative models. In this paper, we introduce supervision to the training process of normalizing flows, without the need for parallel data. We call this training paradigm AutoEncoder Normalizing Flow (AE-Flow).… ▽ More

    Submitted 27 December, 2023; originally announced December 2023.

    Comments: ICASSP 2023

  8. Creating New Voices using Normalizing Flows

    Authors: Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

    Abstract: Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. First… ▽ More

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: Interspeech 2022

    Journal ref: Interspeech 2022, 2958-2962

  9. arXiv:2307.16679  [pdf, other

    eess.AS cs.CL cs.LG

    Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

    Authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba

    Abstract: Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosod… ▽ More

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, 5 tables. Interspeech 2023

  10. On granularity of prosodic representations in expressive text-to-speech

    Authors: Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with variability of the data during training. Same text may correspond to various acoustic realizations, which is known as a one-to-many mapping problem in text-to-speech. Utterance, word, or phoneme-level representations are extracted from target signal in an auto-encoding setup, to complement phonet… ▽ More

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Accepted to IEEE SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 892-899

  11. arXiv:2209.06265  [pdf, other

    eess.AS cs.SD q-bio.OT

    Automated detection of pronunciation errors in non-native English speech employing deep learning

    Authors: Daniel Korzekwa

    Abstract: Despite significant advances in recent years, the existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with a relatively low accuracy (precision of 60% at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech, outperforming the state-of-the-art method in AUC metric (Area und… ▽ More

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: PhD Thesis, in English + extended summary in Polish

  12. Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

    Abstract: The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect… ▽ More

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Published in Speech Communication Journal

  13. arXiv:2203.08009  [pdf, other

    eess.AS cs.SD

    Text-free non-parallel many-to-many voice conversion using normalising flows

    Authors: Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

    Abstract: Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mit… ▽ More

    Submitted 15 March, 2022; originally announced March 2022.

  14. arXiv:2108.06270  [pdf, other

    eess.AS cs.AI

    Enhancing audio quality for expressive Neural Text-to-Speech

    Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

    Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio a… ▽ More

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 6 pages, 4 figures, 2 tables, SSW 2021

  15. arXiv:2106.12896  [pdf, other

    cs.SD cs.AI cs.LG

    Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

    Authors: Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa, Thomas Merritt

    Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly express… ▽ More

    Submitted 25 June, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

    Comments: 6 pages, 5 figures. Accepted to Speech Synthesis Workshop (SSW) 2021

  16. Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

    Authors: Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

    Abstract: This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality… ▽ More

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021, 5 pages,3 figures

  17. arXiv:2106.03494  [pdf, other

    eess.AS cs.LG

    Weakly-supervised word-level pronunciation error detection in non-native English speech

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

    Abstract: We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to… ▽ More

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  18. arXiv:2102.01106  [pdf, other

    eess.AS cs.CL cs.SD

    Universal Neural Vocoding with Parallel WaveNet

    Authors: Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We… ▽ More

    Submitted 15 February, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

    Comments: 5 pages, 2 figures. Accepted to ICASSP 2021

  19. arXiv:2101.06396  [pdf, other

    eess.AS cs.LG cs.SD

    Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

    Abstract: A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare it to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do… ▽ More

    Submitted 8 February, 2021; v1 submitted 16 January, 2021; originally announced January 2021.

    Comments: Accepted to ICASSP 2021

  20. arXiv:2012.14788  [pdf, other

    eess.AS cs.SD

    Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

    Authors: Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

    Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learni… ▽ More

    Submitted 7 June, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  21. arXiv:1907.04743  [pdf, other

    eess.AS cs.CL cs.SD

    Interpretable Deep Learning Model for the Detection and Reconstruction of Dysarthric Speech

    Authors: Daniel Korzekwa, Roberto Barra-Chicote, Bozena Kostek, Thomas Drugman, Mateusz Lajszczak

    Abstract: This paper proposed a novel approach for the detection and reconstruction of dysarthric speech. The encoder-decoder model factorizes speech into a low-dimensional latent space and encoding of the input text. We showed that the latent space conveys interpretable characteristics of dysarthria, such as intelligibility and fluency of speech. MUSHRA perceptual test demonstrated that the adaptation of t… ▽ More

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: 5 pages, 5 figures, Accepted for Interspeech 2019

  22. arXiv:1811.06296  [pdf, other

    eess.AS cs.SD

    Comprehensive evaluation of statistical speech waveform synthesis

    Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote

    Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the pro… ▽ More

    Submitted 11 December, 2018; v1 submitted 15 November, 2018; originally announced November 2018.