Showing 1–20 of 20 results for author: Prasanna, S R M

  1. arXiv:2506.03606  [pdf, ps, other]

    eess.AS cs.AI cs.CL eess.SP

    Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

    Authors: Parismita Gogoi, Sishir Kalita, Wendy Lalhminghlui, Viyazonuo Terhiija, Moakala Tzudir, Priyankoo Sarmah, S. R. M. Prasanna

    Abstract: This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show tha…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted in Interspeech2025

  2. arXiv:2506.00861  [pdf, ps, other]

    eess.AS cs.SD

    Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment

    Authors: Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna

    Abstract: This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach, integrating prop…

    Submitted 14 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted in Interspeech, All codes are available in GitHub repo https://github.com/seemark11/DhiNirnayaAMFM

  3. arXiv:2411.10489  [pdf, other]

    cs.CR cs.AI cs.CV

    Biometrics in Extended Reality: A Review

    Authors: Ayush Agarwal, Raghavendra Ramachandra, Sushma Venkatesh, S. R. Mahadeva Prasanna

    Abstract: In the domain of Extended Reality (XR), particularly Virtual Reality (VR), extensive research has been devoted to harnessing this transformative technology in various real-world applications. However, a critical challenge that must be addressed before unleashing the full potential of XR in practical scenarios is to ensure robust security and safeguard user privacy. This paper presents a systematic…

    Submitted 14 November, 2024; originally announced November 2024.

  4. arXiv:2409.15767  [pdf, other]

    eess.AS cs.SD

    Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection

    Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Nitin Choudhury, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: The adaptation of foundation models has significantly advanced environmental audio deepfake detection (EADD), a rapidly growing area of research. These models are typically fine-tuned or utilized in their frozen states for downstream tasks. However, the dimensionality of their representations can substantially lead to a high parameter count of downstream models, leading to higher computational dem…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45 ACM Class: I.2.7

  5. arXiv:2409.14312  [pdf, other]

    eess.AS cs.SD

    Avengers Assemble: Amalgamation of Non-Semantic Features for Depression Detection

    Authors: Orchid Chetia Phukan, Swarup Ranjan Behera, Shubham Singh, Muskaan Singh, Vandana Rajan, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna

    Abstract: In this study, we address the challenge of depression detection from speech, focusing on the potential of non-semantic features (NSFs) to capture subtle markers of depression. While prior research has leveraged various features for this task, NSFs-extracted from pre-trained models (PTMs) designed for non-semantic tasks such as paralinguistic speech processing (TRILLsson), speaker recognition (x-ve…

    Submitted 22 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45 ACM Class: I.2.7

  6. arXiv:2409.14221  [pdf, other]

    eess.AS cs.SD

    Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

    Authors: Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To v…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45 ACM Class: I.2.7

  7. arXiv:2409.14131  [pdf, other]

    eess.AS cs.LG cs.SD

    Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

    Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained fo…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45 ACM Class: I.2.7

  8. arXiv:2406.09494  [pdf, other]

    eess.AS cs.LG

    The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments

    Authors: Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

    Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this datas…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, Interspeech 2024

  9. arXiv:2308.10470  [pdf, other]

    eess.AS cs.CL cs.SD

    Implicit Self-supervised Language Representation for Spoken Language Diarization

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmen…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Planning to Submit in IEEE-JSTSP

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing 2024

  10. arXiv:2306.12913  [pdf, other]

    eess.AS cs.CL cs.SD

    Implicit spoken language diarization

    Authors: Jagabandhu Mishra, Amartya Chowdhury, S. R. Mahadeva Prasanna

    Abstract: Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use an explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedd…

    Submitted 22 June, 2023; originally announced June 2023.

  11. arXiv:2302.13209  [pdf, other]

    eess.AS cs.SD

    I-MSV 2022: Indic-Multilingual and Multi-sensor Speaker Verification Challenge

    Authors: Jagabandhu Mishra, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna

    Abstract: Speaker Verification (SV) is a task to verify the claimed identity of the claimant using his/her voice sample. Though there exists an ample amount of research in SV technologies, the development concerning a multilingual conversation is limited. In a country like India, almost all the speakers are polyglot in nature. Consequently, the development of a Multilingual SV (MSV) system on the data colle…

    Submitted 25 February, 2023; originally announced February 2023.

  12. Spoken language change detection inspired by speaker change detection

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since task-wise both are similar, the architecture/framework developed for the SCD task may be suitable for the LCD task. Hence, the aim of the present work is to d…

    Submitted 10 February, 2023; originally announced February 2023.

  13. arXiv:2203.02680   

    eess.AS cs.SD eess.SP

    Language vs Speaker Change: A Comparative Study

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective of this work is to understand the challenges in LCD task by comparing it with SCD task. Human subjective study for change detection is performed for LCD and SC…

    Submitted 6 October, 2023; v1 submitted 5 March, 2022; originally announced March 2022.

    Comments: The work is substantially modified. The new version of the same will be submitted soon

  14. arXiv:2110.00797  [pdf, other]

    eess.AS cs.SD

    Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

    Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna

    Abstract: The automatic recognition of pathological speech, particularly from children with any articulatory impairment, is a challenging task due to various reasons. The lack of available domain specific data is one such obstacle that hinders its usage for different speech-based applications targeting pathological speakers. In line with the challenge, in this work, we investigate a few data augmentation te…

    Submitted 2 October, 2021; originally announced October 2021.

  15. arXiv:2110.00794  [pdf, other]

    cs.SD eess.AS q-bio.QM

    Processing Phoneme Specific Segments for Cleft Lip and Palate Speech Enhancement

    Authors: Protima Nomo Sudro, Rohit Sinha, S. R. Mahadeva Prasanna

    Abstract: The cleft lip and palate (CLP) speech intelligibility is distorted due to the deformation in their articulatory system. For addressing the same, a few previous works perform phoneme specific modification in CLP speech. In CLP speech, both the articulation error and the nasalization distort the intelligibility of a word. Consequently, modification of a specific phoneme may not always yield in enha…

    Submitted 2 October, 2021; originally announced October 2021.

  16. arXiv:2109.04138  [pdf, other]

    cs.CR cs.CV

    Multilingual Audio-Visual Smartphone Dataset And Evaluation

    Authors: Hareesh Mandalapu, Aravinda Reddy P N, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch

    Abstract: Smartphones have been employed with biometric-based verification systems to provide security in highly sensitive applications. Audio-visual biometrics are getting popular due to their usability, and also it will be challenging to spoof because of their multimodal nature. In this work, we present an audio-visual smartphone dataset captured in five different recent smartphones. This new dataset cont…

    Submitted 15 November, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

  17. Sonority Measurement Using System, Source, and Suprasegmental Information

    Authors: Bidisha Sharma, S. R. Mahadeva Prasanna

    Abstract: Sonorant sounds are characterized by regions with prominent formant structure, high energy and high degree of periodicity. In this work, the vocal-tract system, excitation source and suprasegmental features derived from the speech signal are analyzed to measure the sonority information present in each of them. Vocal-tract system information is extracted from the Hilbert envelope of numerator of gr…

    Submitted 1 July, 2021; originally announced July 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 25, Issue: 3, March 2017)

  18. Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey

    Authors: Hareesh Mandalapu, P N Aravinda Reddy, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch

    Abstract: Biometric recognition is a trending technology that uses unique characteristics data to identify or verify/authenticate security applications. Amidst the classically used biometrics, voice and face attributes are the most propitious for prevalent applications in day-to-day life because they are easy to obtain through restrained and user-friendly procedures. The pervasiveness of low-cost audio and…

    Submitted 12 March, 2021; v1 submitted 24 January, 2021; originally announced January 2021.

    Journal ref: in IEEE Access, vol. 9, pp. 37431-37455, 2021

  19. arXiv:2101.05806  [pdf, other]

    cs.CV

    Exploration of Visual Features and their weighted-additive fusion for Video Captioning

    Authors: Praveen S V, Akhilesh Bharadwaj, Harsh Raj, Janhavi Dadhania, Ganesh Samarth C. A, Nikhil Pareek, S R M Prasanna

    Abstract: Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: 6 pages

  20. arXiv:1811.01222  [pdf, ps, other]

    eess.AS cs.SD

    Time-Frequency Audio Features for Speech-Music Classification

    Authors: Mrinmoy Bhattacharjee, S. R. M. Prasanna, Prithwijit Guha

    Abstract: Distinct striation patterns are observed in the spectrograms of speech and music. This motivated us to propose three novel time-frequency features for speech-music classification. These features are extracted in two stages. First, a preset number of prominent spectral peak locations are identified from the spectra of each frame. These important peak locations obtained from each frame are used to f…

    Submitted 3 November, 2018; originally announced November 2018.

    Comments: 4 pages, 16 figures