Showing 1–44 of 44 results for author: Drossos, K

  1. arXiv:2505.16607  [pdf, ps, other]

    eess.AS cs.SD

    Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers

    Authors: Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existin… ▽ More

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 5 pages, 4 figures, accepted by Interspeech 2025

  2. arXiv:2505.03442  [pdf, other]

    cs.SD cs.LG eess.AS

    Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

    Authors: Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Speech denoising is a generally adopted and impactful task, appearing in many common and everyday-life use cases. Although there are very powerful methods published, most of those are too complex for deployment in everyday and low-resources computational environments, like hand-held devices, intelligent glasses, hearing aids, etc. Knowledge distillation (KD) is a prominent way for alleviating this… ▽ More

    Submitted 6 May, 2025; originally announced May 2025.

  3. arXiv:2501.08047  [pdf, other]

    eess.AS cs.LG cs.SD

    Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

    Authors: Mikko Heikkinen, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The… ▽ More

    Submitted 14 January, 2025; originally announced January 2025.

    Comments: Accepted for publication in Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing

  4. arXiv:2308.04960  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning

    Authors: Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation o… ▽ More

    Submitted 9 August, 2023; originally announced August 2023.

  5. arXiv:2305.00011  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Adversarial Representation Learning for Robust Privacy Preservation in Audio

    Authors: Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial train… ▽ More

    Submitted 3 January, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

    Comments: Published in IEEE Open Journal of Signal Processing

  6. arXiv:2208.02406  [pdf]

    eess.AS cs.SD

    Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network

    Authors: Yanxiong Li, Wenchang Cao, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost for nursing elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised… ▽ More

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: 6 pages, 5 figures, 4 tables. Accepted by IEEE MMSP 2022

  7. arXiv:2204.09634  [pdf, other]

    cs.SD cs.LG eess.AS

    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

    Authors: Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question, to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we coll… ▽ More

    Submitted 17 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  8. arXiv:2110.07410  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

    Authors: Benno Weck, Xavier Favory, Konstantinos Drossos, Xavier Serra

    Abstract: Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recent… ▽ More

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: 5 pages, 4 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2021 (DCASE2021)

  9. arXiv:2110.02939  [pdf, other]

    eess.AS eess.SP

    Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

    Authors: Huang Xie, Okko Räsänen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss cri… ▽ More

    Submitted 21 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted at ICASSP 2022

  10. arXiv:2110.01506  [pdf, other]

    cs.LG cs.CY

    Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations

    Authors: Andreas Triantafyllopoulos, Manuel Milling, Konstantinos Drossos, Björn W. Schuller

    Abstract: Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but are now becoming increasingly used in real-world systems where fairness and reliability are critical aspects. In this work, we argue for the need of a more holist… ▽ More

    Submitted 4 October, 2021; originally announced October 2021.

  11. arXiv:2107.09388  [pdf, other]

    cs.SD eess.AS

    Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection

    Authors: Parthasaarathy Sudarsanam, Archontis Politis, Konstantinos Drossos

    Abstract: Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from th… ▽ More

    Submitted 27 September, 2021; v1 submitted 20 July, 2021; originally announced July 2021.

  12. arXiv:2107.08028  [pdf, other]

    cs.SD cs.LG eess.AS

    Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach

    Authors: Jan Berg, Konstantinos Drossos

    Abstract: Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods are using existing datasets to optimize and/or evaluate upon. Given the limited information held by the AAC datasets, it is very likely that AAC methods learn only the information contained in the utilized datasets. In this paper w… ▽ More

    Submitted 16 July, 2021; originally announced July 2021.

  13. arXiv:2106.09539  [pdf, other]

    eess.AS cs.LG cs.SD

    Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit

    Authors: Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen

    Abstract: Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a… ▽ More

    Submitted 14 June, 2021; originally announced June 2021.

  14. arXiv:2104.00437  [pdf, other]

    cs.SD cs.IR cs.MM eess.AS

    Enriched Music Representations with Multiple Cross-modal Contrastive Learning

    Authors: Andres Ferraro, Xavier Favory, Konstantinos Drossos, Yuntae Kim, Dmitry Bogdanov

    Abstract: Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize bet… ▽ More

    Submitted 1 April, 2021; originally announced April 2021.

    Comments: Accepted for publication to IEEE Signal Processing Letters

    Report number: SPL-30069-2021

  15. arXiv:2103.16988  [pdf, other]

    cs.SD

    Towards Citizen Science for Smart Cities: A Framework for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices

    Authors: Emmanuel Rovithis, Nikolaos Moustakas, Konstantinos Vogklis, Konstantinos Drossos, Andreas Floros

    Abstract: Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve the quality of their life. Both concepts have seen significant growth in the last years, and can be further enhanced by combining their purposes with IoT technologies that allow for dynamic and large-scale communication and intera… ▽ More

    Submitted 31 March, 2021; originally announced March 2021.

  16. arXiv:2010.14171  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS stat.ML

    Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable of being employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learni… ▽ More

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure

  17. arXiv:2010.11098  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

    Authors: An Tran, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We… ▽ More

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Submitted for review at ICASSP2021

  18. arXiv:2007.05183  [pdf, other]

    cs.SD cs.LG eess.AS

    Conditioned Time-Dilated Convolutions for Sound Event Detection

    Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Tuomas Virtanen

    Abstract: Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent SED method based on convolutional neural networks proposed the usage of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED, with a considerably small number of parameters. In this work we propose the expa… ▽ More

    Submitted 10 July, 2020; originally announced July 2020.

  19. arXiv:2007.04660  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Multi-task Regularization Based on Infrequent Classes for Audio Captioning

    Authors: Emre Çakır, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the c… ▽ More

    Submitted 9 July, 2020; originally announced July 2020.

  20. arXiv:2007.02780  [pdf, other]

    cs.SD eess.AS

    Revisiting Representation Learning for Singing Voice Separation with Sinkhorn Distances

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, Gerald Schuller

    Abstract: In this work we present a method for unsupervised learning of audio representations, focused on the task of singing voice separation. We build upon a previously proposed method for learning representations of time-domain music signals with a re-parameterized denoising autoencoder, extending it by using the family of Sinkhorn distances with entropic regularization. We evaluate our method on the fre… ▽ More

    Submitted 8 January, 2021; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: Update including additional results justifying hyper-parameter choices, clarifications for the supervision debate, notes on interpretability

  21. arXiv:2007.02683  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

    Authors: Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music… ▽ More

    Submitted 6 July, 2020; originally announced July 2020.

  22. arXiv:2007.02676  [pdf, other]

    eess.AS cs.LG cs.SD

    Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

    Authors: Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio… ▽ More

    Submitted 6 July, 2020; originally announced July 2020.

  23. arXiv:2006.08386  [pdf, other]

    cs.LG cs.IR eess.AS stat.ML

    COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. A… ▽ More

    Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

  24. arXiv:2003.01567  [pdf, other]

    eess.AS cs.SD

    Unsupervised Interpretable Representation Learning for Singing Voice Separation

    Authors: Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller

    Abstract: In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a simple sinusoidal model as decoding functions to reconstruct the singing voice. To demonstrate the benefits of our method, we employ the obtained representations t… ▽ More

    Submitted 1 July, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

    Comments: Camera-ready version for EUSIPCO 2020

  25. arXiv:2003.01162  [pdf, ps, other]

    eess.AS cs.SD

    Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF

    Authors: Antonio J. Muñoz-Montoro, Julio J. Carabias-Orti, Archontis Politis, Konstantinos Drossos

    Abstract: This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic sourc… ▽ More

    Submitted 2 March, 2020; originally announced March 2020.

  26. arXiv:2002.00476  [pdf, other]

    cs.SD cs.LG eess.AS

    Sound Event Detection with Depthwise Separable and Dilated Convolutions

    Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen

    Abstract: State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount o… ▽ More

    Submitted 2 February, 2020; originally announced February 2020.

  27. arXiv:1911.10888  [pdf]

    eess.AS

    Sound event detection via dilated convolutional recurrent neural networks

    Authors: Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal cont… ▽ More

    Submitted 20 July, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: 5 pages, 3 tables and 3 figures

  28. arXiv:1911.07098  [pdf, other]

    cs.SD eess.AS

    VOICe: A Sound Event Detection Dataset For Generalizable Domain Adaptation

    Authors: Shayan Gharib, Konstantinos Drossos, Eemi Fagerlund, Tuomas Virtanen

    Abstract: The performance of sound event detection methods can significantly degrade when they are used in unseen conditions (e.g. recording devices, ambient noise). Domain adaptation is a promising way to tackle this problem. In this paper, we present VOICe, the first dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three dif… ▽ More

    Submitted 25 November, 2019; v1 submitted 16 November, 2019; originally announced November 2019.

    Comments: Fixed the footnote at the abstract

  29. arXiv:1911.00527  [pdf, other]

    eess.AS cs.LG cs.PF cs.SD

    Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters

    Authors: Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti

    Abstract: Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the prop… ▽ More

    Submitted 1 November, 2019; originally announced November 2019.

  30. arXiv:1910.09387  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Clotho: An Audio Captioning Dataset

    Authors: Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

    Abstract: Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and… ▽ More

    Submitted 21 October, 2019; originally announced October 2019.

  31. arXiv:1907.09238  [pdf, other]

    cs.SD eess.AS

    Crowdsourcing a Dataset of Audio Captions

    Authors: Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering crowdsourcing a very attractive option. In this paper we present a three-step framework for crowdsourcing an aud… ▽ More

    Submitted 22 July, 2019; originally announced July 2019.

  32. arXiv:1907.08506  [pdf, other]

    cs.SD cs.LG eess.AS

    Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling

    Authors: Konstantinos Drossos, Shayan Gharib, Paul Magron, Tuomas Virtanen

    Abstract: A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine tra… ▽ More

    Submitted 6 November, 2019; v1 submitted 19 July, 2019; originally announced July 2019.

    Comments: Fixed the display of URLs at footnote, updated the results

  33. arXiv:1904.10678  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification

    Authors: Konstantinos Drossos, Paul Magron, Tuomas Virtanen

    Abstract: A challenging problem in the deep learning-based machine listening field is the degradation of performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a diff… ▽ More

    Submitted 6 November, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

    Comments: Updated indices at Eq 6

  34. arXiv:1904.06157  [pdf, other]

    eess.AS cs.LG cs.SD

    Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefanía Cano, Gerald Schuller

    Abstract: The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by knowledge distillation, denoted the neura… ▽ More

    Submitted 20 October, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

  35. arXiv:1808.05777  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised adversarial domain adaptation for acoustic scene classification

    Authors: Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitriy Serdyuk, Tuomas Virtanen

    Abstract: A general problem in the acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of… ▽ More

    Submitted 17 August, 2018; originally announced August 2018.

  36. arXiv:1807.11298  [pdf, other]

    cs.SD eess.AS

    Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery

    Authors: Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, Tuomas Virtanen

    Abstract: Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, w… ▽ More

    Submitted 30 July, 2018; originally announced July 2018.

  37. arXiv:1802.05132  [pdf, ps, other]

    eess.AS cs.SD

    Close Miking Empirical Practice Verification: A Source Separation Approach

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, Gerald Schuller

    Abstract: Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules that evolved during the recording practice itsel… ▽ More

    Submitted 13 February, 2018; originally announced February 2018.

    Journal ref: In Proceedings of the 142nd Audio Engineering Society Convention, Berlin, Germany, 2017

  38. arXiv:1802.00300  [pdf, other]

    cs.SD eess.AS

    MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: The monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel deep learning based method that learns long-term temporal patterns and structures of a musical piece. We build upo… ▽ More

    Submitted 1 February, 2018; originally announced February 2018.

  39. arXiv:1711.01437  [pdf, other]

    cs.SD eess.AS

    Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during trainin… ▽ More

    Submitted 13 February, 2018; v1 submitted 4 November, 2017; originally announced November 2017.

  40. arXiv:1709.00611  [pdf, other]

    cs.SD

    A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, Gerald Schuller

    Abstract: The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude… ▽ More

    Submitted 24 April, 2018; v1 submitted 2 September, 2017; originally announced September 2017.

  41. arXiv:1706.10006  [pdf, other]

    cs.SD cs.CL cs.LG

    Automated Audio Captioning with Recurrent Neural Networks

    Authors: Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

    Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with… ▽ More

    Submitted 24 October, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

    Comments: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017

  42. arXiv:1706.02292  [pdf, other]

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition

    Authors: Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina

    Abstract: This paper studies emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and… ▽ More

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for Sound and Music Computing (SMC 2017)

  43. arXiv:1706.02047  [pdf, other]

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

    Authors: Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen

    Abstract: This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and t… ▽ More

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for European Signal Processing Conference 2017

  44. arXiv:1703.02317  [pdf, other]

    cs.SD cs.LG stat.ML

    Convolutional Recurrent Neural Networks for Bird Audio Detection

    Authors: Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invarian… ▽ More

    Submitted 7 March, 2017; originally announced March 2017.

    Comments: Submitted to EUSIPCO 2017 Special Session on Bird Audio Signal Processing