Showing 1–7 of 7 results for author: Varadhan, P S

  1. arXiv:2505.20693  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages

    Authors: Praveen Srinivasa Varadhan, Srija Anand, Soma Siddhartha, Mitesh M. Khapra

    Abstract: What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with o…

    Submitted 27 May, 2025; originally announced May 2025.

  2. arXiv:2505.18609  [pdf, other]

    cs.CL

    RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations

    Authors: Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra

    Abstract: We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we d…

    Submitted 27 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  3. arXiv:2411.12719  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

    Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M. Khapra

    Abstract: Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference sp…

    Submitted 26 May, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

    Comments: Accepted in TMLR

  4. arXiv:2410.17901  [pdf, other]

    cs.CL eess.AS

    ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams

    Authors: Srija Anand, Praveen Srinivasa Varadhan, Mehak Singal, Mitesh M. Khapra

    Abstract: Recent advancements in Text-to-Speech (TTS) technology have led to natural-sounding speech for English, primarily due to the availability of large-scale, high-quality web data. However, many other languages lack access to such resources, relying instead on limited studio-quality data. This scarcity results in synthesized speech that often suffers from intelligibility issues, particularly with low-…

    Submitted 23 October, 2024; originally announced October 2024.

    Comments: 11 pages, 1 figure, 3 tables

  5. arXiv:2409.05356  [pdf, other]

    cs.CL cs.LG cs.SD eess.SP

    IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS

    Authors: Ashwin Sankar, Srija Anand, Praveen Srinivasa Varadhan, Sherry Thomas, Mehak Singal, Shridhar Kumar, Deovrat Mehendale, Aditi Krishana, Giri Raju, Mitesh Khapra

    Abstract: Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations…

    Submitted 7 October, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted to NeurIPS 2024 Datasets and Benchmarks track

  6. arXiv:2407.14056  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings

    Authors: Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra

    Abstract: We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increas…

    Submitted 30 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Accepted at INTERSPEECH 2024. First two authors listed contributed equally

  7. arXiv:2407.13435  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

    Authors: Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra

    Abstract: Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from se…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted at INTERSPEECH 2024