Showing 1–28 of 28 results for author: Prasanna, S

Searching in archive cs.
  1. arXiv:2506.03606  [pdf, ps, other]

    eess.AS cs.AI cs.CL eess.SP

    Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

    Authors: Parismita Gogoi, Sishir Kalita, Wendy Lalhminghlui, Viyazonuo Terhiija, Moakala Tzudir, Priyankoo Sarmah, S. R. M. Prasanna

    Abstract: This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show tha…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted in Interspeech2025

  2. arXiv:2506.00861  [pdf, ps, other]

    eess.AS cs.SD

    Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment

    Authors: Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna

    Abstract: This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach, integrating prop…

    Submitted 14 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted in Interspeech, All codes are available in GitHub repo https://github.com/seemark11/DhiNirnayaAMFM

  3. arXiv:2411.10489  [pdf, other]

    cs.CR cs.AI cs.CV

    Biometrics in Extended Reality: A Review

    Authors: Ayush Agarwal, Raghavendra Ramachandra, Sushma Venkatesh, S. R. Mahadeva Prasanna

    Abstract: In the domain of Extended Reality (XR), particularly Virtual Reality (VR), extensive research has been devoted to harnessing this transformative technology in various real-world applications. However, a critical challenge that must be addressed before unleashing the full potential of XR in practical scenarios is to ensure robust security and safeguard user privacy. This paper presents a systematic…

    Submitted 14 November, 2024; originally announced November 2024.

  4. arXiv:2409.15767  [pdf, other]

    eess.AS cs.SD

    Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection

    Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Nitin Choudhury, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: The adaptation of foundation models has significantly advanced environmental audio deepfake detection (EADD), a rapidly growing area of research. These models are typically fine-tuned or utilized in their frozen states for downstream tasks. However, the dimensionality of their representations can substantially lead to a high parameter count of downstream models, leading to higher computational dem…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45; ACM Class: I.2.7

  5. arXiv:2409.14312  [pdf, other]

    eess.AS cs.SD

    Avengers Assemble: Amalgamation of Non-Semantic Features for Depression Detection

    Authors: Orchid Chetia Phukan, Swarup Ranjan Behera, Shubham Singh, Muskaan Singh, Vandana Rajan, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna

    Abstract: In this study, we address the challenge of depression detection from speech, focusing on the potential of non-semantic features (NSFs) to capture subtle markers of depression. While prior research has leveraged various features for this task, NSFs-extracted from pre-trained models (PTMs) designed for non-semantic tasks such as paralinguistic speech processing (TRILLsson), speaker recognition (x-ve…

    Submitted 22 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45; ACM Class: I.2.7

  6. arXiv:2409.14221  [pdf, other]

    eess.AS cs.SD

    Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

    Authors: Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To v…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45; ACM Class: I.2.7

  7. arXiv:2409.14131  [pdf, other]

    eess.AS cs.LG cs.SD

    Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

    Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

    Abstract: In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained fo…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

    MSC Class: 68T45; ACM Class: I.2.7

  8. arXiv:2408.02297  [pdf, other]

    cs.RO cs.CV

    Perception Matters: Enhancing Embodied AI with Uncertainty-Aware Semantic Segmentation

    Authors: Sai Prasanna, Daniel Honerkamp, Kshitij Sirohi, Tim Welschehold, Wolfram Burgard, Abhinav Valada

    Abstract: Embodied AI has made significant progress acting in unexplored environments. However, tasks such as object search have largely focused on efficient policy learning. In this work, we identify several gaps in current search methods: They largely focus on dated perception models, neglect temporal aggregation, and transfer from ground truth directly to noisy perception at test time, without accounting…

    Submitted 14 January, 2025; v1 submitted 5 August, 2024; originally announced August 2024.

    Journal ref: Proceedings of the International Symposium on Robotics Research (ISRR), 2024

  9. arXiv:2407.20879  [pdf, other]

    cs.AI q-bio.QM

    A Scalable Tool For Analyzing Genomic Variants Of Humans Using Knowledge Graphs and Machine Learning

    Authors: Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo Simoes, Praveen Rao

    Abstract: The integration of knowledge graphs and graph machine learning (GML) in genomic data analysis offers several opportunities for understanding complex genetic relationships, especially at the RNA level. We present a comprehensive approach for leveraging these technologies to analyze genomic variants, specifically in the context of RNA sequencing (RNA-seq) data from COVID-19 patient samples. The prop…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2312.04423

  10. arXiv:2406.09494  [pdf, other]

    eess.AS cs.LG

    The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments

    Authors: Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

    Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this datas…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, Interspeech 2024

  11. arXiv:2403.10967  [pdf, other]

    cs.LG cs.AI

    Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization

    Authors: Sai Prasanna, Karim Farid, Raghu Rajan, André Biedenkapp

    Abstract: Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system's dynamics, such as the mass or dimensions of a robot, without making further…

    Submitted 3 August, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

    Comments: In Reinforcement Learning Conference, 2024. 33 pages

  12. arXiv:2312.04423  [pdf, other]

    cs.AI cs.DB q-bio.QM

    Scalable Knowledge Graph Construction and Inference on Human Genome Variants

    Authors: Shivika Prasanna, Deepthi Rao, Eduardo Simoes, Praveen Rao

    Abstract: Real-world knowledge can be represented as a graph consisting of entities and relationships between the entities. The need for efficient and scalable solutions arises when dealing with vast genomic data, like RNA-sequencing. Knowledge graphs offer a powerful approach for various tasks in such large-scale genomic data, such as analysis and inference. In this work, variant-level information extracte…

    Submitted 7 December, 2023; originally announced December 2023.

  13. arXiv:2308.10470  [pdf, other]

    eess.AS cs.CL cs.SD

    Implicit Self-supervised Language Representation for Spoken Language Diarization

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmen…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Planning to Submit in IEEE-JSTSP

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing 2024

  14. arXiv:2306.12913  [pdf, other]

    eess.AS cs.CL cs.SD

    Implicit spoken language diarization

    Authors: Jagabandhu Mishra, Amartya Chowdhury, S. R. Mahadeva Prasanna

    Abstract: Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use an explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedd…

    Submitted 22 June, 2023; originally announced June 2023.

  15. arXiv:2302.13209  [pdf, other]

    eess.AS cs.SD

    I-MSV 2022: Indic-Multilingual and Multi-sensor Speaker Verification Challenge

    Authors: Jagabandhu Mishra, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna

    Abstract: Speaker Verification (SV) is a task to verify the claimed identity of the claimant using his/her voice sample. Though there exists an ample amount of research in SV technologies, the development concerning a multilingual conversation is limited. In a country like India, almost all the speakers are polyglot in nature. Consequently, the development of a Multilingual SV (MSV) system on the data colle…

    Submitted 25 February, 2023; originally announced February 2023.

  16. Spoken language change detection inspired by speaker change detection

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since task-wise both are similar, the architecture/framework developed for the SCD task may be suitable for the LCD task. Hence, the aim of the present work is to d…

    Submitted 10 February, 2023; originally announced February 2023.

  17. arXiv:2203.02680   

    eess.AS cs.SD eess.SP

    Language vs Speaker Change: A Comparative Study

    Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

    Abstract: Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective of this work is to understand the challenges in the LCD task by comparing it with the SCD task. Human subjective study for change detection is performed for LCD and SC…

    Submitted 6 October, 2023; v1 submitted 5 March, 2022; originally announced March 2022.

    Comments: The work is substantially modified. The new version of the same will be submitted soon

  18. arXiv:2110.00797  [pdf, other]

    eess.AS cs.SD

    Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

    Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna

    Abstract: The automatic recognition of pathological speech, particularly from children with any articulatory impairment, is a challenging task due to various reasons. The lack of available domain specific data is one such obstacle that hinders its usage for different speech-based applications targeting pathological speakers. In line with the challenge, in this work, we investigate a few data augmentation te…

    Submitted 2 October, 2021; originally announced October 2021.

  19. arXiv:2110.00794  [pdf, other]

    cs.SD eess.AS q-bio.QM

    Processing Phoneme Specific Segments for Cleft Lip and Palate Speech Enhancement

    Authors: Protima Nomo Sudro, Rohit Sinha, S. R. Mahadeva Prasanna

    Abstract: The cleft lip and palate (CLP) speech intelligibility is distorted due to the deformation in their articulatory system. For addressing the same, a few previous works perform phoneme specific modification in CLP speech. In CLP speech, both the articulation error and the nasalization distort the intelligibility of a word. Consequently, modification of a specific phoneme may not always yield in enha…

    Submitted 2 October, 2021; originally announced October 2021.

  20. arXiv:2109.04138  [pdf, other]

    cs.CR cs.CV

    Multilingual Audio-Visual Smartphone Dataset And Evaluation

    Authors: Hareesh Mandalapu, Aravinda Reddy P N, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch

    Abstract: Smartphones have been employed with biometric-based verification systems to provide security in highly sensitive applications. Audio-visual biometrics are getting popular due to their usability, and also it will be challenging to spoof because of their multimodal nature. In this work, we present an audio-visual smartphone dataset captured in five different recent smartphones. This new dataset cont…

    Submitted 15 November, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

  21. Sonority Measurement Using System, Source, and Suprasegmental Information

    Authors: Bidisha Sharma, S. R. Mahadeva Prasanna

    Abstract: Sonorant sounds are characterized by regions with prominent formant structure, high energy and high degree of periodicity. In this work, the vocal-tract system, excitation source and suprasegmental features derived from the speech signal are analyzed to measure the sonority information present in each of them. Vocal-tract system information is extracted from the Hilbert envelope of numerator of gr…

    Submitted 1 July, 2021; originally announced July 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing ( Volume: 25, Issue: 3, March 2017)

  22. Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey

    Authors: Hareesh Mandalapu, P N Aravinda Reddy, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch

    Abstract: Biometric recognition is a trending technology that uses unique characteristics data to identify or verify/authenticate security applications. Amidst the classically used biometrics, voice and face attributes are the most propitious for prevalent applications in day-to-day life because they are easy to obtain through restrained and user-friendly procedures. The pervasiveness of low-cost audio and…

    Submitted 12 March, 2021; v1 submitted 24 January, 2021; originally announced January 2021.

    Journal ref: in IEEE Access, vol. 9, pp. 37431-37455, 2021

  23. arXiv:2101.05806  [pdf, other]

    cs.CV

    Exploration of Visual Features and their weighted-additive fusion for Video Captioning

    Authors: Praveen S V, Akhilesh Bharadwaj, Harsh Raj, Janhavi Dadhania, Ganesh Samarth C. A, Nikhil Pareek, S R M Prasanna

    Abstract: Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: 6 pages

  24. arXiv:2005.00561  [pdf, other]

    cs.CL cs.LG

    When BERT Plays the Lottery, All Tickets Are Winning

    Authors: Sai Prasanna, Anna Rogers, Anna Rumshisky

    Abstract: Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similar…

    Submitted 24 October, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: EMNLP 2020 camera-ready

  25. arXiv:1909.12734  [pdf, other]

    cs.LG cs.CV stat.ML

    Maximal adversarial perturbations for obfuscation: Hiding certain attributes while preserving rest

    Authors: Indu Ilanchezian, Praneeth Vepakomma, Abhishek Singh, Otkrist Gupta, G. N. Srinivasa Prasanna, Ramesh Raskar

    Abstract: In this paper we investigate the usage of adversarial perturbations for the purpose of privacy from human perception and model (machine) based detection. We employ adversarial perturbations for obfuscating certain variables in raw data while preserving the rest. Current adversarial perturbation methods are used for data poisoning with minimal perturbations of the raw data such that the machine lea…

    Submitted 27 September, 2019; originally announced September 2019.

  26. arXiv:1902.10623  [pdf, other]

    cs.CL

    Zoho at SemEval-2019 Task 9: Semi-supervised Domain Adaptation using Tri-training for Suggestion Mining

    Authors: Sai Prasanna, Sri Ananda Seelan

    Abstract: This paper describes our submission for the SemEval-2019 Suggestion Mining task. A simple Convolutional Neural Network (CNN) classifier with contextual word representations from a pre-trained language model was used for sentence classification. The model is trained using tri-training, a semi-supervised bootstrapping mechanism for labelling unseen data. Tri-training proved to be an effective techni…

    Submitted 6 April, 2019; v1 submitted 27 February, 2019; originally announced February 2019.

    Comments: NAACL 2019

  27. arXiv:1811.01222  [pdf, ps, other]

    eess.AS cs.SD

    Time-Frequency Audio Features for Speech-Music Classification

    Authors: Mrinmoy Bhattacharjee, S. R. M. Prasanna, Prithwijit Guha

    Abstract: Distinct striation patterns are observed in the spectrograms of speech and music. This motivated us to propose three novel time-frequency features for speech-music classification. These features are extracted in two stages. First, a preset number of prominent spectral peak locations are identified from the spectra of each frame. These important peak locations obtained from each frame are used to f…

    Submitted 3 November, 2018; originally announced November 2018.

    Comments: 4 pages, 16 figures

  28. arXiv:1407.2390  [pdf]

    cs.CV

    Online Stroke and Akshara Recognition GUI in Assamese Language Using Hidden Markov Model

    Authors: SRM Prasanna, Rituparna Devi, Deepjoy Das, Subhankar Ghosh, Krishna Naik

    Abstract: The work describes the development of Online Assamese Stroke & Akshara Recognizer based on a set of language rules. In handwriting literature, strokes are composed of two-coordinate traces between pen-down and pen-up labels. The Assamese aksharas are combinations of a number of strokes, the maximum number of strokes taken to make a combination being eight. Based on these combinations eight languag…

    Submitted 9 July, 2014; originally announced July 2014.

    Comments: 6 pages, 9 figures, International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014