Showing 1–7 of 7 results for author: Sudarsanam, P

Searching in archive eess.
  1. arXiv:2505.14562  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

    Authors: Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

    Abstract: This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align vi…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted to European Signal Processing Conference (EUSIPCO 2025)

  2. arXiv:2306.09126  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS eess.IV

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

    Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information…

    Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

  3. arXiv:2305.19769  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Attention-Based Methods For Audio Question Answering

    Authors: Parthasaarathy Sudarsanam, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is the task of producing natural language answers when a system is provided with audio and natural language questions. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract powerful audio and textual representations. The cross-attention maps audio features that are releva…

    Submitted 31 May, 2023; originally announced May 2023.

  4. arXiv:2206.01948  [pdf, other]

    eess.AS cs.SD

    STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

    Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

    Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone arr…

    Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

  5. arXiv:2204.09634  [pdf, other]

    cs.SD cs.LG eess.AS

    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

    Authors: Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we coll…

    Submitted 17 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  6. arXiv:2107.09388  [pdf, other]

    cs.SD eess.AS

    Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection

    Authors: Parthasaarathy Sudarsanam, Archontis Politis, Konstantinos Drossos

    Abstract: Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from th…

    Submitted 27 September, 2021; v1 submitted 20 July, 2021; originally announced July 2021.

  7. arXiv:1911.12928  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Voice Separation by Incorporating End-to-end Speech Recognition

    Authors: Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

    Abstract: Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data. In this work, we propose to explicitly incorporate the phonetic and linguistic nature of speech by taking a transfer learning approach using an end-to-end automatic speech recognition (E2EASR) system. The voice separation is conditioned on dee…

    Submitted 3 May, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted in ICASSP 2020