
Showing 1–50 of 95 results for author: Virtanen, T

Searching in archive cs.
  1. arXiv:2507.02562  [pdf, ps, other]

    eess.AS cs.SD

    Multi-Utterance Speech Separation and Association Trained on Short Segments

    Authors: Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Current deep neural network (DNN) based speech separation faces a fundamental challenge -- while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings with multiple utterances per speaker than seen during training. In this paper, we investigate how existing approaches perform in this ch…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 5 pages, accepted by WASPAA 2025

  2. arXiv:2506.01483  [pdf, ps, other]

    eess.AS cs.SD

    Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction

    Authors: Wang Dai, Archontis Politis, Tuomas Virtanen

    Abstract: We propose a novel approach that utilizes inter-speaker relative cues to distinguish target speakers and extract their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categorical distinctions. Compared to fixed speech attribute classification, inter-speaker relativ…

    Submitted 8 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025

  3. arXiv:2505.20956  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Hybrid Disagreement-Diversity Active Learning for Bioacoustic Sound Event Detection

    Authors: Shiqi Zhang, Tuomas Virtanen

    Abstract: Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method int…

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, accepted by EUSIPCO 2025. v2: added our GitHub repo

  4. arXiv:2505.16607  [pdf, ps, other]

    eess.AS cs.SD

    Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers

    Authors: Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existin…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 5 pages, 4 figures, accepted by Interspeech 2025

  5. arXiv:2505.14562  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

    Authors: Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

    Abstract: This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align vi…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted to European Signal Processing Conference (EUSIPCO 2025)

  6. arXiv:2505.03442  [pdf, other]

    cs.SD cs.LG eess.AS

    Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

    Authors: Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Speech denoising is a widely adopted and impactful task, appearing in many common, everyday use cases. Although very powerful methods have been published, most of them are too complex for deployment in everyday and low-resource computational environments, like hand-held devices, intelligent glasses, hearing aids, etc. Knowledge distillation (KD) is a prominent way of alleviating this…

    Submitted 6 May, 2025; originally announced May 2025.

  7. arXiv:2503.07352  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Score-informed Music Source Separation: Improving Synthetic-to-real Generalization in Classical Music

    Authors: Eetu Tunturi, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

    Abstract: Music source separation is the task of separating a mixture of instruments into constituent tracks. Music source separation models are typically trained using only audio data, although additional information can be used to improve the model's separation capability. In this paper, we propose two ways of using musical scores to aid music source separation: a score-informed model where the score is c…

    Submitted 3 June, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: 5 pages, 2 figures, accepted to EUSIPCO 2025

  8. arXiv:2502.09363  [pdf, other]

    cs.LG

    The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time

    Authors: John Martinsson, Olof Mogren, Tuomas Virtanen, Maria Sandsten

    Abstract: Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as "pres…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: Submitted to TMLR

  9. Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

    Authors: Aapo Hakala, Trevor Kincy, Tuomas Virtanen

    Abstract: This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multi-level deep sequences to measur…

    Submitted 14 January, 2025; originally announced January 2025.

  10. arXiv:2501.08047  [pdf, other]

    eess.AS cs.LG cs.SD

    Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

    Authors: Mikko Heikkinen, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The…

    Submitted 14 January, 2025; originally announced January 2025.

    Comments: Accepted for publication in Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing

  11. arXiv:2411.06892  [pdf]

    cs.SD physics.comp-ph physics.pop-ph

    Timing and Dynamics of the Rosanna Shuffle

    Authors: Esa Räsänen, Niko Gullsten, Otto Pulkkinen, Tuomas Virtanen

    Abstract: The Rosanna shuffle, the drum pattern from Toto's 1982 hit "Rosanna", is one of the most recognized drum beats in popular music. Recorded by Jeff Porcaro, this drum beat features a half-time shuffle with rapid triplets on the hi-hat and snare drum. In this analysis, we examine the timing and dynamics of the original drum track, focusing on rhythmic variations such as swing factor, microtiming devi…

    Submitted 12 November, 2024; v1 submitted 11 November, 2024; originally announced November 2024.

    Comments: 22 pages, 12 figures

  12. arXiv:2410.04951  [pdf, other]

    eess.AS cs.SD

    A decade of DCASE: Achievements, practices, evaluations and future challenges

    Authors: Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley

    Abstract: This paper briefly introduces the history and growth of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, workshop, research area and research community. Created in 2013 as a data evaluation challenge, DCASE has become a major research topic in the Audio and Acoustic Signal Processing area. Its success comes from a combination of factors: the challenge offers a larg…

    Submitted 7 October, 2024; originally announced October 2024.

    Comments: Submitted to ICASSP 2025

  13. arXiv:2409.10995  [pdf]

    eess.AS cs.LG cs.SD

    SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

    Authors: Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

    Abstract: Music source separation has recently progressed significantly, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively e…

    Submitted 17 February, 2025; v1 submitted 17 September, 2024; originally announced September 2024.

    Comments: The SynthSOD dataset can be downloaded from https://doi.org/10.5281/zenodo.13759492

    Journal ref: IEEE Open Journal of Signal Processing, vol. 6, pp. 129-137, 2025

  14. arXiv:2409.00408  [pdf, other]

    cs.SD cs.LG eess.AS

    Multi-label Zero-Shot Audio Classification with Temporal Attention

    Authors: Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

    Abstract: Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most of the existing zero-shot learning methods focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while…

    Submitted 31 August, 2024; originally announced September 2024.

    Comments: Accepted to International Workshop on Acoustic Signal Enhancement (IWAENC) 2024

  15. Noise-to-mask Ratio Loss for Deep Neural Network based Audio Watermarking

    Authors: Martin Moritz, Toni Olán, Tuomas Virtanen

    Abstract: Digital audio watermarking consists in inserting a message into audio signals in a transparent way and can be used to allow automatic recognition of audio material and management of the copyrights. We propose a perceptual loss function to be used in deep neural network based audio watermarking systems. The loss is based on the noise-to-mask ratio (NMR), which is a model of the psychoacoustic maski…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: 6 pages, 7 figures

  16. arXiv:2407.15672  [pdf, other]

    cs.SD eess.AS

    Computer Audition: From Task-Specific Machine Learning to Foundation Models

    Authors: Andreas Triantafyllopoulos, Iosif Tsangko, Alexander Gebhard, Annamaria Mesaros, Tuomas Virtanen, Björn Schuller

    Abstract: Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-availab…

    Submitted 22 July, 2024; originally announced July 2024.

  17. Speaker Distance Estimation in Enclosures from Single-Channel Audio

    Authors: Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen

    Abstract: Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  18. arXiv:2403.08525  [pdf, other]

    cs.SD cs.LG eess.AS

    From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning

    Authors: John Martinsson, Olof Mogren, Maria Sandsten, Tuomas Virtanen

    Abstract: We propose an adaptive change point detection method (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-tra…

    Submitted 26 August, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted at EUSIPCO 2024 (nominated best student paper)

  19. arXiv:2401.05916  [pdf, other]

    eess.AS cs.SD

    Neural Ambisonics encoding for compact irregular microphone arrays

    Authors: Mikko Heikkinen, Archontis Politis, Tuomas Virtanen

    Abstract: Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays. This paper proposes a method for Ambisonics encoding that uses a deep neural network (DNN) to estimate a signal transform from microphone inputs to Ambisonics signals. The approach uses…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted for publication in Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing

  20. arXiv:2312.10756  [pdf, other]

    eess.AS cs.LG eess.SP

    Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios

    Authors: Yuzhu Wang, Archontis Politis, Tuomas Virtanen

    Abstract: Current multichannel speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios. This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings. Specifically, we study the application of linear and nonlinear attention-based methods for estimating time-varying spati…

    Submitted 17 December, 2023; originally announced December 2023.

  21. arXiv:2310.16550  [pdf, other]

    cs.SD eess.AS

    Dynamic Processing Neural Network Architecture For Hearing Loss Compensation

    Authors: Szymon Drgas, Lars Bramsløw, Archontis Politis, Gaurav Naithani, Tuomas Virtanen

    Abstract: This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar t…

    Submitted 25 October, 2023; originally announced October 2023.

  22. arXiv:2308.04960  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning

    Authors: Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation o…

    Submitted 9 August, 2023; originally announced August 2023.

  23. arXiv:2306.09820  [pdf, other]

    eess.AS cs.SD

    Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances

    Authors: Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen

    Abstract: This paper explores grading text-based audio retrieval relevances with crowdsourcing assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips using numeric scores (between 0 and 100) to indicate their judgements of how much the sound content of an audio clip matches the text, where 0 indicates no content match at all and 100 indicates perfect co…

    Submitted 15 August, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted at DCASE 2023 Workshop

  24. arXiv:2306.09126  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS eess.IV

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

    Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information…

    Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

  25. arXiv:2306.08510  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications

    Authors: David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen

    Abstract: Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as the long short-term memories (LSTMs) or the gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach re…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted for publication at Forum Acusticum 2023

  26. Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

    Authors: Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen

    Abstract: Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted by EUSIPCO 2023

  27. arXiv:2305.19769  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Attention-Based Methods For Audio Question Answering

    Authors: Parthasaarathy Sudarsanam, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is the task of producing natural language answers when a system is provided with audio and natural language questions. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract powerful audio and textual representations. The cross-attention maps audio features that are releva…

    Submitted 31 May, 2023; originally announced May 2023.

  28. arXiv:2305.18045  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes

    Authors: Wei Xie, Yanxiong Li, Qianhua He, Wenchang Cao, Tuomas Virtanen

    Abstract: New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned on…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 5 pages, 2 figures, accepted by Interspeech 2023

  29. arXiv:2305.00011  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Adversarial Representation Learning for Robust Privacy Preservation in Audio

    Authors: Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial train…

    Submitted 3 January, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

    Comments: Published in IEEE Open Journal of Signal Processing

  30. arXiv:2303.07816  [pdf, other]

    eess.AS cs.SD

    Multi-Channel Masking with Learnable Filterbank for Sound Source Separation

    Authors: Wang Dai, Archontis Politis, Tuomas Virtanen

    Abstract: This work proposes a learnable filterbank based on a multi-channel masking framework for multi-channel source separation. The learnable filterbank is a 1D Conv layer, which transforms the raw waveform into a 2D representation. In contrast to the conventional single-channel masking method, we estimate a mask for each individual microphone channel. The estimated masks are then applied to the transfo…

    Submitted 14 March, 2023; originally announced March 2023.

  31. arXiv:2303.01864  [pdf, ps, other]

    cs.SD eess.AS

    Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints

    Authors: Paul Magron, Tuomas Virtanen

    Abstract: Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and…

    Submitted 30 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

  32. arXiv:2211.04070  [pdf, other]

    eess.AS cs.SD

    On Negative Sampling for Contrastive Audio-Text Retrieval

    Authors: Huang Xie, Okko Räsänen, Tuomas Virtanen

    Abstract: This paper investigates negative sampling for contrastive learning in the context of audio-text retrieval. The strategy for negative sampling refers to selecting negatives (either audio clips or textual descriptions) from a pool of candidates for a positive audio-text pair. We explore sampling strategies via model-estimated within-modality and cross-modality relevance scores for audio and text sam…

    Submitted 17 February, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP 2023

  33. arXiv:2210.14536  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP

    Position tracking of a varying number of sound sources with sliding permutation invariant training

    Authors: David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

    Abstract: Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to consistently track multiple sources appearing and disappearing, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward impleme…

    Submitted 5 June, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Accepted for publication at the 31st European Signal Processing Conference (EUSIPCO 2023)

  34. arXiv:2209.09967   

    eess.AS cs.SD

    Language-based Audio Retrieval Task in DCASE 2022 Challenge

    Authors: Huang Xie, Samuel Lipping, Tuomas Virtanen

    Abstract: Language-based audio retrieval is a task where natural language textual captions are used as queries to retrieve audio signals from a dataset. It was first introduced in the DCASE 2022 Challenge as Subtask 6B of Task 6, which aims at developing computational systems to model relationships between audio signals and free-form textual descriptions. Compared with audio captioning (Subtask 6A), whi…

    Submitted 4 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Update for arXiv:2206.06108 mistakenly submitted as a new article

  35. arXiv:2208.05057  [pdf, other]

    cs.SD cs.MM eess.AS

    Subjective Evaluation of Deep Neural Network Based Speech Enhancement Systems in Real-World Conditions

    Authors: Gaurav Naithani, Kirsi Pietilä, Riitta Niemistö, Erkki Paajanen, Tero Takala, Tuomas Virtanen

    Abstract: Subjective evaluation results for two low-latency deep neural networks (DNN) are compared to a matured version of a traditional Wiener-filter based noise suppressor. The target use-case is real-world single-channel speech enhancement applications, e.g., communications. Real-world recordings consisting of additive stationary and non-stationary noise types are included. The evaluation is divided int…

    Submitted 14 August, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Accepted for publication in IEEE MMSP 2022

  36. arXiv:2208.02406  [pdf]

    eess.AS cs.SD

    Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network

    Authors: Yanxiong Li, Wenchang Cao, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost of nursing elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: 6 pages, 5 figures, 4 tables. Accepted by IEEE MMSP 2022

  37. arXiv:2206.04984  [pdf, other]

    cs.SD cs.LG eess.AS

    Zero-Shot Audio Classification using Image Embeddings

    Authors: Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

    Abstract: Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and time-consuming. Zero-shot learning models are capable of classifying the unseen concepts by utilizing their semantic information. The present study introduces image emb…

    Submitted 10 June, 2022; originally announced June 2022.

    Comments: Accepted to the European Signal Processing Conference (EUSIPCO) 2022

  38. arXiv:2206.01948  [pdf, other]

    eess.AS cs.SD

    STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

    Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

    Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone arr…

    Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

  39. Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

    Authors: Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen

    Abstract: Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC…

    Submitted 2 June, 2022; originally announced June 2022.

  40. arXiv:2204.09634  [pdf, other]

    cs.SD cs.LG eess.AS

    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

    Authors: Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we coll…

    Submitted 17 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  41. arXiv:2111.00030  [pdf, other]

    eess.AS cs.SD

    Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressor…

    Submitted 29 October, 2021; originally announced November 2021.

    Comments: Submitted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2021)

  42. arXiv:2106.11794  [pdf, other]

    eess.AS cs.SD

    Deep neural network Based Low-latency Speech Separation with Asymmetric analysis-Synthesis Window Pair

    Authors: Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen

    Abstract: Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the usage of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low latency during inference suitable for real-time spee…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to EUSIPCO-2021

  43. arXiv:2106.06999  [pdf, other]

    eess.AS cs.SD

    A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen

    Abstract: This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical sy…

    Submitted 4 July, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

  44. arXiv:2105.13675  [pdf, other]

    eess.AS cs.SD

    Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

    Authors: Shanshan Wang, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

    Abstract: This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have bette…

    Submitted 20 July, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

  45. arXiv:2010.14171  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS stat.ML

    Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learni…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure

  46. arXiv:2010.11716  [pdf, other]

    cs.SD cs.LG eess.AS

    Robust Audio-Based Vehicle Counting in Low-to-Moderate Traffic Flow

    Authors: Slobodan Djukanović, Jiři Matas, Tuomas Virtanen

    Abstract: The paper presents a method for audio-based vehicle counting (VC) in low-to-moderate traffic using one-channel sound. We formulate VC as a regression problem, i.e., we predict the distance between a vehicle and the microphone. Minima of the proposed distance function correspond to vehicles passing by the microphone. VC is carried out via local minima detection in the predicted distance. We propose…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: The paper has been accepted for the IV2020 conference

  47. arXiv:2010.11659  [pdf, other]

    cs.SD cs.LG eess.AS

    Neural Network-based Acoustic Vehicle Counting

    Authors: Slobodan Djukanović, Yash Patel, Jiři Matas, Tuomas Virtanen

    Abstract: This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of clipped vehicle-to-microphone distance. This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously…

    Submitted 27 March, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  48. arXiv:2010.11098  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

    Authors: An Tran, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Submitted for review at ICASSP2021

  49. Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

    Authors: Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen

    Abstract: Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic datase…

    Submitted 11 January, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

  50. arXiv:2007.05183  [pdf, other]

    cs.SD cs.LG eess.AS

    Conditioned Time-Dilated Convolutions for Sound Event Detection

    Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Tuomas Virtanen

    Abstract: Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent SED method based on convolutional neural networks proposed the usage of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED, with a considerably small number of parameters. In this work we propose the expa…

    Submitted 10 July, 2020; originally announced July 2020.