-
The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis
Authors:
Bernardo Torres,
Geoffroy Peeters,
Gael Richard
Abstract:
We introduce the Inverse Drum Machine (IDM), a novel approach to drum source separation that combines analysis-by-synthesis with deep learning. Unlike recent supervised methods that rely on isolated stems, IDM requires only transcription annotations. It jointly optimizes automatic drum transcription and one-shot drum sample synthesis in an end-to-end framework. By convolving synthesized one-shot s…
▽ More
We introduce the Inverse Drum Machine (IDM), a novel approach to drum source separation that combines analysis-by-synthesis with deep learning. Unlike recent supervised methods that rely on isolated stems, IDM requires only transcription annotations. It jointly optimizes automatic drum transcription and one-shot drum sample synthesis in an end-to-end framework. By convolving synthesized one-shot samples with estimated onsets-mimicking a drum machine-IDM reconstructs individual drum stems and trains a neural network to match the original mixture. Evaluations on the StemGMD dataset show that IDM achieves separation performance on par with state-of-the-art supervised methods, while substantially outperforming matrix decomposition baselines.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
QINCODEC: Neural Audio Compression with Implicit Neural Codebooks
Authors:
Zineb Lahrichi,
Gaëtan Hadjeres,
Gael Richard,
Geoffroy Peeters
Abstract:
Neural audio codecs, neural networks which compress a waveform into discrete tokens, play a crucial role in the recent development of audio generative models. State-of-the-art codecs rely on the end-to-end training of an autoencoder and a quantization bottleneck. However, this approach restricts the choice of the quantization methods as it requires to define how gradients propagate through the qua…
▽ More
Neural audio codecs, neural networks which compress a waveform into discrete tokens, play a crucial role in the recent development of audio generative models. State-of-the-art codecs rely on the end-to-end training of an autoencoder and a quantization bottleneck. However, this approach restricts the choice of the quantization methods as it requires to define how gradients propagate through the quantizer and how to update the quantization parameters online. In this work, we revisit the common practice of joint training and propose to quantize the latent representations of a pre-trained autoencoder offline, followed by an optional finetuning of the decoder to mitigate degradation from quantization. This strategy allows to consider any off-the-shelf quantizer, especially state-of-the-art trainable quantizers with implicit neural codebooks such as QINCO2. We demonstrate that with the latter, our proposed codec termed QINCODEC, is competitive with baseline codecs while being notably simpler to train. Finally, our approach provides a general framework that amortizes the cost of autoencoder pretraining, and enables more flexible codec design.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning
Authors:
Aurian Quelennec,
Pierre Chouteau,
Geoffroy Peeters,
Slim Essid
Abstract:
Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (M…
▽ More
Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablations studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K and outperforms comparable supervised methods results for musical auto-tagging on Magna-tag-a-tune.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
Authors:
Alain Riou,
Antonin Gagneré,
Gaëtan Hadjeres,
Stefan Lattner,
Geoffroy Peeters
Abstract:
In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent rep…
▽ More
In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we discover that pretraining the encoder using contrastive learning drastically improves the model's performance.
We validate the retrieval performances of our model using the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
△ Less
Submitted 24 February, 2025; v1 submitted 29 November, 2024;
originally announced November 2024.
-
A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning
Authors:
Antonin Gagnere,
Geoffroy Peeters,
Slim Essid
Abstract:
In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the k…
▽ More
In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the knowledge of ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e. with just a few annotated examples to get a competitive beat-tracking performance.
△ Less
Submitted 6 November, 2024;
originally announced November 2024.
-
Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification
Authors:
Xuanyu Zhuang,
Geoffroy Peeters,
Gaël Richard
Abstract:
The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of the test episode of a C-way-K-shot test episode (without using the query set which is only used…
▽ More
The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of the test episode of a C-way-K-shot test episode (without using the query set which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with such a fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to fast adaption to only a few samples, we utilize ProtoNet as the target model to enhance its fine-tuning performance with the help of a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. We note that although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.
△ Less
Submitted 4 October, 2024;
originally announced October 2024.
-
Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation
Authors:
Alain Riou,
Stefan Lattner,
Gaëtan Hadjeres,
Michael Anslow,
Geoffroy Peeters
Abstract:
This paper explores the automated process of determining stem compatibility by identifying audio recordings of single instruments that blend well with a given musical context. To tackle this challenge, we present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset using a self-supervised learning approach.
Our model comprises two networks: an encode…
▽ More
This paper explores the automated process of determining stem compatibility by identifying audio recordings of single instruments that blend well with a given musical context. To tackle this challenge, we present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset using a self-supervised learning approach.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems from the embeddings of a given context, typically a mix of several instruments. Training a model in this manner allows its use in estimating stem compatibility - retrieving, aligning, or generating a stem to match a given mix - or for downstream tasks such as genre or key estimation, as the training paradigm requires the model to learn information related to timbre, harmony, and rhythm.
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix and through a subjective user study. We also show that the learned embeddings capture temporal alignment information and, finally, evaluate the representations learned by our model on several downstream tasks, highlighting that they effectively capture meaningful musical features.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning
Authors:
Alain Riou,
Stefan Lattner,
Gaëtan Hadjeres,
Geoffroy Peeters
Abstract:
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the cont…
▽ More
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport
Authors:
Bernardo Torres,
Geoffroy Peeters,
Gaël Richard
Abstract:
In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory th…
▽ More
In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.
△ Less
Submitted 15 January, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
On the choice of the optimal temporal support for audio classification with Pre-trained embeddings
Authors:
Aurian Quelennec,
Michel Olvera,
Geoffroy Peeters,
Slim Essid
Abstract:
Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract an embedding, which we refer to as Temporal Suppor…
▽ More
Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract an embedding, which we refer to as Temporal Support (TS). In this work, we study the influence of the TS for well-established or emerging pre-trained embeddings, chosen to represent different types of architectures and learning paradigms. We conduct this evaluation using both musical instrument and environmental sound datasets, namely OpenMIC, TAU Urban Acoustic Scenes 2020 Mobile, and ESC-50. We especially highlight that Audio Spectrogram Transformer-based systems (PaSST and BEATs) remain effective with smaller TS, which therefore allows for a drastic reduction in memory and computational cost. Moreover, we show that by choosing the optimal TS we reach competitive results across all tasks. In particular, we improve the state-of-the-art results on OpenMIC, using BEATs and PaSST without any fine-tuning.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
Blind estimation of audio effects using an auto-encoder approach and differentiable digital signal processing
Authors:
Côme Peladeau,
Geoffroy Peeters
Abstract:
Blind Estimation of Audio Effects (BE-AFX) aims at estimating the Audio Effects (AFXs) applied to an original, unprocessed audio sample solely based on the processed audio sample. To train such a system traditional approaches optimize a loss between ground truth and estimated AFX parameters. This involves knowing the exact implementation of the AFXs used for the process. In this work, we propose a…
▽ More
Blind Estimation of Audio Effects (BE-AFX) aims at estimating the Audio Effects (AFXs) applied to an original, unprocessed audio sample solely based on the processed audio sample. To train such a system traditional approaches optimize a loss between ground truth and estimated AFX parameters. This involves knowing the exact implementation of the AFXs used for the process. In this work, we propose an alternative solution that eliminates the requirement for knowing this implementation. Instead, we introduce an auto-encoder approach, which optimizes an audio quality metric. We explore, suggest, and compare various implementations of commonly used mastering AFXs, using differential signal processing or neural approximations. Our findings demonstrate that our auto-encoder approach yields superior estimates of the audio quality produced by a chain of AFXs, compared to the traditional parameter-based approach, even if the latter provides a more accurate parameter estimation.
△ Less
Submitted 9 February, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
Authors:
Alain Riou,
Stefan Lattner,
Gaëtan Hadjeres,
Geoffroy Peeters
Abstract:
In this paper, we address the problem of pitch estimation using Self Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight ($<$ 30k parameters) Siamese neural network that takes as inputs two different pi…
▽ More
In this paper, we address the problem of pitch estimation using Self Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight ($<$ 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices.
We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Self-Similarity-Based and Novelty-based loss for music structure analysis
Authors:
Geoffroy Peeters
Abstract:
Music Structure Analysis (MSA) is the task aiming at identifying musical segments that compose a music track and possibly label them based on their similarity. In this paper we propose a supervised approach for the task of music boundary detection. In our approach we simultaneously learn features and convolution kernels. For this we jointly optimize -- a loss based on the Self-Similarity-Matrix (S…
▽ More
Music Structure Analysis (MSA) is the task aiming at identifying musical segments that compose a music track and possibly label them based on their similarity. In this paper we propose a supervised approach for the task of music boundary detection. In our approach we simultaneously learn features and convolution kernels. For this we jointly optimize -- a loss based on the Self-Similarity-Matrix (SSM) obtained with the learned features, denoted by SSM-loss, and -- a loss based on the novelty score obtained applying the learned kernels to the estimated SSM, denoted by novelty-loss. We also demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA. Finally, we compare the performances of our approach to previously proposed approaches on the standard RWC-Pop, and various subsets of SALAMI.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Video-to-Music Recommendation using Temporal Alignment of Segments
Authors:
Laure Prétet,
Gaël Richard,
Clément Souchier,
Geoffroy Peeters
Abstract:
We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to…
▽ More
We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to significantly improve the system's performance using structure-aware recommendation. The core idea is to consider not only the full audio-video clips, but rather shorter segments for training and inference. We find that using semantic segments and ranking the tracks according to sequence alignment costs significantly improves the results. We investigate the impact of different ranking metrics and segmentation methods.
△ Less
Submitted 12 June, 2023;
originally announced June 2023.
-
SSM-Net: feature learning for Music Structure Analysis using a Self-Similarity-Matrix based loss
Authors:
Geoffroy Peeters,
Florian Angulo
Abstract:
In this paper, we propose a new paradigm to learn audio features for Music Structure Analysis (MSA). We train a deep encoder to learn features such that the Self-Similarity-Matrix (SSM) resulting from those approximates a ground-truth SSM. This is done by minimizing a loss between both SSMs. Since this loss is differentiable w.r.t. its input features we can train the encoder in a straightforward w…
▽ More
In this paper, we propose a new paradigm to learn audio features for Music Structure Analysis (MSA). We train a deep encoder to learn features such that the Self-Similarity-Matrix (SSM) resulting from those approximates a ground-truth SSM. This is done by minimizing a loss between both SSMs. Since this loss is differentiable w.r.t. its input features we can train the encoder in a straightforward way. We successfully demonstrate the use of this training paradigm using the Area Under the Curve ROC (AUC) on the RWC-Pop dataset.
△ Less
Submitted 15 November, 2022;
originally announced November 2022.
-
Exploiting Device and Audio Data to Tag Music with User-Aware Listening Contexts
Authors:
Karim M. Ibrahim,
Elena V. Epure,
Geoffroy Peeters,
Gaël Richard
Abstract:
As music has become more available especially on music streaming platforms, people have started to have distinct preferences to fit to their varying listening situations, also known as context. Hence, there has been a growing interest in considering the user's situation when recommending music to users. Previous works have proposed user-aware autotaggers to infer situation-related tags from music…
▽ More
As music has become more available especially on music streaming platforms, people have started to have distinct preferences to fit to their varying listening situations, also known as context. Hence, there has been a growing interest in considering the user's situation when recommending music to users. Previous works have proposed user-aware autotaggers to infer situation-related tags from music content and user's global listening preferences. However, in a practical music retrieval system, the autotagger could be only used by assuming that the context class is explicitly provided by the user. In this work, for designing a fully automatised music retrieval system, we propose to disambiguate the user's listening information from their stream data. Namely, we propose a system which can generate a situational playlist for a user at a certain time 1) by leveraging user-aware music autotaggers, and 2) by automatically inferring the user's situation from stream data (e.g. device, network) and user's general profile information (e.g. age). Experiments show that such a context-aware personalized music retrieval system is feasible, but the performance decreases in the case of new users, new tracks or when the number of context classes increases.
△ Less
Submitted 14 November, 2022;
originally announced November 2022.
-
Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation
Authors:
Christof Weiß,
Geoffroy Peeters
Abstract:
Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims for detecting the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we…
▽ More
Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims for detecting the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures in different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which questions the claim of superiority for particular architectures given only small improvements. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.
△ Less
Submitted 18 February, 2022;
originally announced February 2022.
-
Is there a "language of music-video clips" ? A qualitative and quantitative study
Authors:
Laure Prétet,
Gaël Richard,
Geoffroy Peeters
Abstract:
Recommending automatically a video given a music or a music given a video has become an important asset for the audiovisual industry - with user-generated or professional content. While both music and video have specific temporal organizations, most current works do not consider those and only focus on globally recommending a media. As a first step toward the improvement of these recommendation sy…
▽ More
Recommending automatically a video given a music or a music given a video has become an important asset for the audiovisual industry - with user-generated or professional content. While both music and video have specific temporal organizations, most current works do not consider those and only focus on globally recommending a media. As a first step toward the improvement of these recommendation systems, we study in this paper the relationship between music and video temporal organization. We do this for the case of official music videos, with a quantitative and a qualitative approach. Our assumption is that the movement in the music are correlated to the ones in the video. To validate this, we first interview a set of internationally recognized music video experts. We then perform a large-scale analysis of official music-video clips (which we manually annotated into video genres) using MIR description tools (downbeats and functional segments estimation) and Computer Vision tools (shot detection). Our study confirms that a "language of music-video clips" exists; i.e. editors favor the co-occurrence of music and video events using strategies such as anticipation. It also highlights that the amount of co-occurrence depends on the music and video genres.
△ Less
Submitted 2 August, 2021;
originally announced August 2021.
-
Cross-Modal Music-Video Recommendation: A Study of Design Choices
Authors:
Laure Pretet,
Gael Richard,
Geoffroy Peeters
Abstract:
In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn audio and video embeddings by using their co-occurren…
▽ More
In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn audio and video embeddings by using their co-occurrence in music-video clips. In this work, we build upon a recent video-music retrieval system (the VM-NET), which originally relies on an audio representation obtained by a set of statistics computed over handcrafted features. We demonstrate here that using audio representation learning such as the audio embeddings provided by the pre-trained MuSimNet, OpenL3, MusicCNN or by AudioSet, largely improves recommendations. We also validate the use of the cross-modal triplet loss originally proposed in the VM-NET compared to the binary cross-entropy loss commonly used in self-supervised learning. We perform all our experiments using the Music Video Dataset (MVD).
△ Less
Submitted 30 April, 2021;
originally announced April 2021.
-
Content based singing voice source separation via strong conditioning using aligned phonemes
Authors:
Gabriel Meseguer-Brocal,
Geoffroy Peeters
Abstract:
Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source se…
▽ More
Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source separation, but more recent approaches explore lyrics-informed source separation. However, because of the lack of multitrack datasets with time-aligned lyrics, models use weak conditioning with non-aligned lyrics. In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information as well as explore strong conditioning using the aligned phonemes. Our model follows a U-Net architecture and takes as input both the magnitude spectrogram of a musical mixture and a matrix with aligned phonetic information. The phoneme matrix is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. These layers condition the U-Net feature maps to adapt the separation process to the presence of different phonemes via affine transformations. We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
△ Less
Submitted 5 August, 2020;
originally announced August 2020.
-
Learning to rank music tracks using triplet loss
Authors:
Laure Prétet,
Gaël Richard,
Geoffroy Peeters
Abstract:
Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks. To that aim, we propose several strat…
▽ More
Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks. To that aim, we propose several strategies to perform triplet mining from ranked lists. We train a Convolutional Neural Network to learn the similarity via triplet loss. These different strategies are compared and validated on a large-scale experiment against an auto-tagging based approach. The results obtained highlight the efficiency of our system, especially when associated with an Auto-pooling layer.
△ Less
Submitted 18 May, 2020;
originally announced May 2020.
-
A Prototypical Triplet Loss for Cover Detection
Authors:
Guillaume Doras,
Geoffroy Peeters
Abstract:
Automatic cover detection -- the task of finding in a audio dataset all covers of a query track -- has long been a challenging theoretical problem in MIR community. It also became a practical need for music composers societies requiring to detect automatically if an audio excerpt embeds musical content belonging to their catalog.
In a recent work, we addressed this problem with a convolutional n…
▽ More
Automatic cover detection -- the task of finding in a audio dataset all covers of a query track -- has long been a challenging theoretical problem in MIR community. It also became a practical need for music composers societies requiring to detect automatically if an audio excerpt embeds musical content belonging to their catalog.
In a recent work, we addressed this problem with a convolutional neural network mapping each track's dominant melody to an embedding vector, and trained to minimize cover pairs distance in the embeddings space, while maximizing it for non-covers. We showed in particular that training this model with enough works having five or more covers yields state-of-the-art results.
This however does not reflect the realistic use case, where music catalogs typically contain works with zero or at most one or two covers. We thus introduce here a new test set incorporating these constraints, and propose two contributions to improve our model's accuracy under these stricter conditions: we replace dominant melody with multi-pitch representation as input data, and describe a novel prototypical triplet loss designed to improve covers clustering. We show that these changes improve results significantly for two concrete use cases, large dataset lookup and live songs identification.
△ Less
Submitted 9 April, 2020; v1 submitted 22 October, 2019;
originally announced October 2019.
-
Cover Detection using Dominant Melody Embeddings
Authors:
Guillaume Doras,
Geoffroy Peeters
Abstract:
Automatic cover detection -- the task of finding in an audio database all the covers of one or several query tracks -- has long been seen as a challenging theoretical problem in the MIR community and as an acute practical problem for authors and composers societies. Original algorithms proposed for this task have proven their accuracy on small datasets, but are unable to scale up to modern real-li…
▽ More
Automatic cover detection -- the task of finding in an audio database all the covers of one or several query tracks -- has long been seen as a challenging theoretical problem in the MIR community and as an acute practical problem for authors and composers societies. Original algorithms proposed for this task have proven their accuracy on small datasets, but are unable to scale up to modern real-life audio corpora. On the other hand, faster approaches designed to process thousands of pairwise comparisons resulted in lower accuracy, making them unsuitable for practical use.
In this work, we propose a neural network architecture that is trained to represent each track as a single embedding vector. The computation burden is therefore left to the embedding extraction -- that can be conducted offline and stored, while the pairwise comparison task reduces to a simple Euclidean distance computation. We further propose to extract each track's embedding out of its dominant melody representation, obtained by another neural network trained for this task. We then show that this architecture improves state-of-the-art accuracy both on small and large datasets, and is able to scale to query databases of thousands of tracks in a few seconds.
△ Less
Submitted 3 July, 2019;
originally announced July 2019.
-
Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations
Authors:
Gabriel Meseguer-Brocal,
Geoffroy Peeters
Abstract:
Data-driven models for audio source separation such as U-Net or Wave-U-Net are usually models dedicated to and specifically trained for a single task, e.g. a particular instrument isolation. Training them for various tasks at once commonly results in worse performances than training them for a single specialized task. In this work, we introduce the Conditioned-U-Net (C-U-Net) which adds a control…
▽ More
Data-driven models for audio source separation such as U-Net or Wave-U-Net are usually models dedicated to and specifically trained for a single task, e.g. a particular instrument isolation. Training them for various tasks at once commonly results in worse performances than training them for a single specialized task. In this work, we introduce the Conditioned-U-Net (C-U-Net) which adds a control mechanism to the standard U-Net. The control mechanism allows us to train a unique and generic U-Net to perform the separation of various instruments. The C-U-Net decides the instrument to isolate according to a one-hot-encoding input vector. The input vector is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. FiLM layers modify the U-Net feature maps in order to separate the desired instrument via affine transformations. The C-U-Net performs different instrument separations, all with a single model achieving the same performances as the dedicated ones at a lower cost.
△ Less
Submitted 21 November, 2019; v1 submitted 2 July, 2019;
originally announced July 2019.
-
DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm
Authors:
Gabriel Meseguer-Brocal,
Alice Cohen-Hadria,
Geoffroy Peeters
Abstract:
The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. The second goal is to explain our methodology where dataset creation and learning models interact using a teacher-student machine learning paradigm that benefits each other. We start with a…
▽ More
The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. The second goal is to explain our methodology where dataset creation and learning models interact using a teacher-student machine learning paradigm that benefits each other. We start with a set of manual annotations of draft time-aligned lyrics and notes made by non-expert users of Karaoke games. This set comes without audio. Therefore, we need to find the corresponding audio and adapt the annotations to it. To that end, we retrieve audio candidates from the Web. Each candidate is then turned into a singing-voice probability over time using a teacher, a deep convolutional neural network singing-voice detection system (SVD), trained on cleaned data. Comparing the time-aligned lyrics and the singing-voice probability, we detect matches and update the time-alignment lyrics accordingly. From this, we obtain new audio sets. They are then used to train new SVD students used to perform again the above comparison. The process could be repeated iteratively. We show that this allows to progressively improve the performances of our SVD and get better audio-matching and alignment.
△ Less
Submitted 25 June, 2019;
originally announced June 2019.
-
Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation
Authors:
Alice Cohen-Hadria,
Axel Roebel,
Geoffroy Peeters
Abstract:
State-of-the-art singing voice separation is based on deep learning making use of CNN structures with skip connections (like U-net model, Wave-U-Net model, or MSDENSELSTM). A key to the success of these models is the availability of a large amount of training data. In the following study, we are interested in singing voice separation for mono signals and will investigate into comparing the U-Net a…
▽ More
State-of-the-art singing voice separation is based on deep learning making use of CNN structures with skip connections (like U-net model, Wave-U-Net model, or MSDENSELSTM). A key to the success of these models is the availability of a large amount of training data. In the following study, we are interested in singing voice separation for mono signals and will investigate into comparing the U-Net and the Wave-U-Net that are structurally similar, but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we will discuss the potential of state of the art speech and music transformation algorithms for augmentation of existing data sets and demonstrate that the effect of these augmentations depends on the signal representations used by the model. The results demonstrate a considerable improvement due to the augmentation for both models. But pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.
△ Less
Submitted 4 March, 2019;
originally announced March 2019.
-
Single-Channel Blind Source Separation for Singing Voice Detection: A Comparative Study
Authors:
Dominique Fourer,
Geoffroy Peeters
Abstract:
We propose a novel unsupervised singing voice detection method which use single-channel Blind Audio Source Separation (BASS) algorithm as a preliminary step. To reach this goal, we investigate three promising BASS approaches which operate through a morphological filtering of the analyzed mixture spectrogram. The contributions of this paper are manyfold. First, the investigated BASS methods are rew…
▽ More
We propose a novel unsupervised singing voice detection method which use single-channel Blind Audio Source Separation (BASS) algorithm as a preliminary step. To reach this goal, we investigate three promising BASS approaches which operate through a morphological filtering of the analyzed mixture spectrogram. The contributions of this paper are manyfold. First, the investigated BASS methods are reworded with the same formalism and we investigate their respective hyperparameters by numerical simulations. Second, we propose an extension of the KAM method for which we propose a novel training algorithm used to compute a source-specific kernel from a given isolated source signal. Second, the BASS methods are compared together in terms of source separation accuracy and in terms of singing voice detection accuracy when they are used in our new singing voice detection framework. Finally, we do an exhaustive singing voice detection evaluation for which we compare both supervised and unsupervised singing voice detection methods. Our comparison explores different combination of the proposed BASS methods with new features such as the new proposed KAM features and the scattering transform through a machine learning framework and also considers convolutional neural networks methods.
△ Less
Submitted 3 May, 2018;
originally announced May 2018.