Showing 1–5 of 5 results for author: Reddy, P B

  1. arXiv:2506.03403

    eess.AS

    HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition

    Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    Abstract: Compression-based representations (CBRs) from neural audio codecs such as EnCodec capture intricate acoustic features like pitch and timbre, while representation-learning-based representations (RLRs) from pre-trained models trained for speech representation learning such as WavLM encode high-level semantic and prosodic information. Previous research on Speech Emotion Recognition (SER) has explored…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  2. arXiv:2506.03378

    eess.AS cs.CV cs.MM

    SNIFR : Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer

    Authors: Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    Abstract: As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features re…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  3. arXiv:2506.03364

    eess.AS cs.MM cs.SD

    Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models

    Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    Abstract: In this work, we introduce the task of singing voice deepfake source attribution (SVDSA). We hypothesize that multimodal foundation models (MMFMs) such as ImageBind and LanguageBind will be most effective for SVDSA, as they are better equipped for capturing subtle source-specific characteristics, such as unique timbre, pitch manipulation, or synthesis artifacts, of each singing voice deepfake source due…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  4. arXiv:2506.02232

    eess.AS cs.SD

    Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction

    Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    Abstract: In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefit of using state-of-the-art (SOTA) pre-trained models (PTMs). However, these studies have not explored speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA, and we hypothesize that they will be the most effective for SingMOS prediction. We believe that due to th…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  5. arXiv:2506.01157

    eess.AS cs.SD

    Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations

    Authors: Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    Abstract: In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features--such as pitch, tone, rhythm, and intonation--into their synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), the use of representations f…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted to EUSIPCO 2025