Search | arXiv e-print repository

Uncertainty in Repeated Implicit Feedback as a Measure of Reliability

Authors: Bruno Sguerra, Viet-Anh Tran, Romain Hennequin, Manuel Moussallam

Abstract: Recommender systems rely heavily on user feedback to learn effective user and item representations. Despite their widespread adoption, limited attention has been given to the uncertainty inherent in the feedback used to train these systems. Both implicit and explicit feedback are prone to noise due to the variability in human interactions, with implicit feedback being particularly challenging. In… ▽ More Recommender systems rely heavily on user feedback to learn effective user and item representations. Despite their widespread adoption, limited attention has been given to the uncertainty inherent in the feedback used to train these systems. Both implicit and explicit feedback are prone to noise due to the variability in human interactions, with implicit feedback being particularly challenging. In collaborative filtering, the reliability of interaction signals is critical, as these signals determine user and item similarities. Thus, deriving accurate confidence measures from implicit feedback is essential for ensuring the reliability of these signals. A common assumption in academia and industry is that repeated interactions indicate stronger user interest, increasing confidence in preference estimates. However, in domains such as music streaming, repeated consumption can shift user preferences over time due to factors like satiation and exposure. While literature on repeated consumption acknowledges these dynamics, they are often overlooked when deriving confidence scores for implicit feedback. This paper addresses this gap by focusing on music streaming, where repeated interactions are frequent and quantifiable. We analyze how repetition patterns intersect with key factors influencing user interest and develop methods to quantify the associated uncertainty. These uncertainty measures are then integrated as consistency metrics in a recommendation task. Our empirical results show that incorporating uncertainty into user preference models yields more accurate and relevant recommendations. Key contributions include a comprehensive analysis of uncertainty in repeated consumption patterns, the release of a novel dataset, and a Bayesian model for implicit listening feedback. △ Less

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2501.12907 [pdf, other]

S-KEY: Self-supervised Learning of Major and Minor Keys from Audio

Authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin

Abstract: STONE, the current method in self-supervised learning for tonality estimation in music signals, cannot distinguish relative keys, such as C major versus A minor. In this article, we extend the neural network architecture and learning objective of STONE to perform self-supervised learning of major and minor keys (S-KEY). Our main contribution is an auxiliary pretext task to STONE, formulated using… ▽ More STONE, the current method in self-supervised learning for tonality estimation in music signals, cannot distinguish relative keys, such as C major versus A minor. In this article, we extend the neural network architecture and learning objective of STONE to perform self-supervised learning of major and minor keys (S-KEY). Our main contribution is an auxiliary pretext task to STONE, formulated using transposition-invariant chroma features as a source of pseudo-labels. S-KEY matches the supervised state of the art in tonality estimation on FMAKv2 and GTZAN datasets while requiring no human annotation and having the same parameter budget as STONE. We build upon this result and expand the training set of S-KEY to a million songs, thus showing the potential of large-scale self-supervised learning in music information retrieval. △ Less

Submitted 1 April, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.10111 [pdf, other]

AI-Generated Music Detection and its Challenges

Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin

Abstract: In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute-long synthetic music in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprisi… ▽ More In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute-long synthetic music in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a AI-music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that getting a good test score is not the end of the story. We expose and discuss several facets that could be problematic with such a deployed detector: robustness to audio manipulation, generalisation to unseen models. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of artificial content checkers. △ Less

Submitted 17 January, 2025; originally announced January 2025.

Comments: Accepted for IEEE ICASSP 2025. arXiv admin note: substantial text overlap with arXiv:2405.04181

arXiv:2411.05649 [pdf, other]

Harnessing High-Level Song Descriptors towards Natural Language-Based Music Recommendation

Authors: Elena V. Epure, Gabriel Meseguer-Brocal, Darius Afchar, Romain Hennequin

Abstract: Recommender systems relying on Language Models (LMs) have gained popularity in assisting users to navigate large catalogs. LMs often exploit item high-level descriptors, i.e. categories or consumption contexts, from training data or user preferences. This has been proven effective in domains like movies or products. However, in the music domain, understanding how effectively LMs utilize song descr… ▽ More Recommender systems relying on Language Models (LMs) have gained popularity in assisting users to navigate large catalogs. LMs often exploit item high-level descriptors, i.e. categories or consumption contexts, from training data or user preferences. This has been proven effective in domains like movies or products. However, in the music domain, understanding how effectively LMs utilize song descriptors for natural language-based music recommendation is relatively limited. In this paper, we assess LMs effectiveness in recommending songs based on user natural language descriptions and items with descriptors like genres, moods, and listening contexts. We formulate the recommendation task as a dense retrieval problem and assess LMs as they become increasingly familiar with data pertinent to the task and domain. Our findings reveal improved performance as LMs are fine-tuned for general language similarity, information retrieval, and mapping longer descriptions to shorter, high-level descriptors in music. △ Less

Submitted 18 November, 2024; v1 submitted 8 November, 2024; originally announced November 2024.

Journal ref: 3rd Workshop on NLP for Music and Audio collocated with ISMIR 2024

arXiv:2408.16578 [pdf, other]

Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation

Authors: Viet-Anh Tran, Guillaume Salha-Galvan, Bruno Sguerra, Romain Hennequin

Abstract: Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenom… ▽ More Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson's ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from Last.fm and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use. △ Less

Submitted 29 August, 2024; originally announced August 2024.

Comments: 11 pages. Accepted by RecSys'2024, full paper

arXiv:2407.08647 [pdf, other]

From Real to Cloned Singer Identification

Authors: Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam

Abstract: Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. They however pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding mode… ▽ More Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. They however pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as an input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: To be published at ISMIR 2024

arXiv:2407.07408 [pdf, other]

STONE: Self-supervised Tonality Estimator

Authors: Yuexuan Kong, Vincent Lostanlen, Gabriel Meseguer-Brocal, Stella Wong, Mathieu Lagrange, Romain Hennequin

Abstract: Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to re… ▽ More Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured as cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads KSP to correlate with tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators: i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision. △ Less

Submitted 1 April, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2406.11380 [pdf, other]

Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract: Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional characters. In this work, we evaluate the ability of Llama-3 at attributing utterances of direct-speech to their speaker in novels. The LLM shows impressive results on a corpus of 28 novels, surpassing published results with ChatGPT and encoder-… ▽ More Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional characters. In this work, we evaluate the ability of Llama-3 at attributing utterances of direct-speech to their speaker in novels. The LLM shows impressive results on a corpus of 28 novels, surpassing published results with ChatGPT and encoder-based baselines by a large margin. We then validate these results by assessing the impact of book memorization and annotation contamination. We found that these types of memorization do not explain the large performance gain, making Llama-3 the new state-of-the-art for quotation attribution in English literature. We release publicly our code and data. △ Less

Submitted 26 January, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: NAACL 2025 Main Conference -- short paper

arXiv:2406.11368 [pdf, other]

doi 10.18653/v1/2024.findings-emnlp.744

Improving Quotation Attribution with Fictional Character Embeddings

Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract: Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit r… ▽ More Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack \textit{character} representations, which often leads to errors in more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global stylistic information of characters derived from an off-the-shelf stylometric model, Universal Authorship Representation (UAR). We create DramaCV (Code and data can be found at https://github.com/deezer/character_embeddings_qa ), a corpus of English drama plays from the 15th to 20th century that we automatically annotate for Authorship Verification of fictional characters utterances, and release two versions of UAR trained on DramaCV, that are tailored for literary characters analysis. Then, through an extensive evaluation on 28 novels, we show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. △ Less

Submitted 4 October, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: EMNLP 2024 (Findings)

arXiv:2406.04140 [pdf, other]

STraDa: A Singer Traits Dataset

Authors: Yuexuan Kong, Viet-Anh Tran, Romain Hennequin

Abstract: There is a limited amount of large-scale public datasets that contain downloadable music audio files and rich lead singer metadata. To provide such a dataset to benefit research in singing voices, we created Singer Traits Dataset (STraDa) with two subsets: automatic-strada and annotated-strada. The automatic-strada contains twenty-five thousand tracks across numerous genres and languages of more t… ▽ More There is a limited amount of large-scale public datasets that contain downloadable music audio files and rich lead singer metadata. To provide such a dataset to benefit research in singing voices, we created Singer Traits Dataset (STraDa) with two subsets: automatic-strada and annotated-strada. The automatic-strada contains twenty-five thousand tracks across numerous genres and languages of more than five thousand unique lead singers, which includes cross-validated lead singer metadata as well as other track metadata. The annotated-strada consists of two hundred tracks that are balanced in terms of 2 genders, 5 languages, and 4 age groups. To show its use for model training and bias analysis thanks to its metadata's richness and downloadable audio files, we benchmarked singer sex classification (SSC) and conducted bias analysis. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.04181 [pdf, other]

Detecting music deepfakes is easy but actually hard

Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin

Abstract: In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of tra… ▽ More In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of fake content checkers. △ Less

Submitted 22 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: Under review

arXiv:2404.09177 [pdf, other]

An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging

Authors: Gabriel Meseguer-Brocal, Dorian Desblancs, Romain Hennequin

Abstract: Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informa… ▽ More Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods result in similar downstream results, contrastive learning consistently results in better downstream performance compared to other self-supervised pre-training methods. This holds true in a limited-data downstream context. △ Less

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2401.16968 [pdf, other]

Distinguishing Fictional Voices: a Study of Authorship Verification Models for Quotation Attribution

Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract: Recent approaches to automatically detect the speaker of an utterance of direct speech often disregard general information about characters in favor of local information found in the context, such as surrounding mentions of entities. In this work, we explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models in a larg… ▽ More Recent approaches to automatically detect the speaker of an utterance of direct speech often disregard general information about characters in favor of local information found in the context, such as surrounding mentions of entities. In this work, we explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models in a large corpus of English novels (the Project Dialogism Novel Corpus). Results suggest that the combination of stylistic and topical information captured in some of these models accurately distinguish characters among each other, but does not necessarily improve over semantic-only models when attributing quotes. However, these results vary across novels and more investigation of stylometric models particularly tailored for literary texts and the study of characters should be conducted. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted at EACL 2024's workshop LaTeCH-CLfL

arXiv:2311.10635 [pdf, other]

doi 10.1145/3604915.3608856

Ex2Vec: Characterizing Users and Items from the Mere Exposure Effect

Authors: Bruno Sguerra, Viet-Anh Tran, Romain Hennequin

Abstract: The traditional recommendation framework seeks to connect user and content, by finding the best match possible based on users past interaction. However, a good content recommendation is not necessarily similar to what the user has chosen in the past. As humans, users naturally evolve, learn, forget, get bored, they change their perspective of the world and in consequence, of the recommendable cont… ▽ More The traditional recommendation framework seeks to connect user and content, by finding the best match possible based on users past interaction. However, a good content recommendation is not necessarily similar to what the user has chosen in the past. As humans, users naturally evolve, learn, forget, get bored, they change their perspective of the world and in consequence, of the recommendable content. One well known mechanism that affects user interest is the Mere Exposure Effect: when repeatedly exposed to stimuli, users' interest tends to rise with the initial exposures, reaching a peak, and gradually decreasing thereafter, resulting in an inverted-U shape. Since previous research has shown that the magnitude of the effect depends on a number of interesting factors such as stimulus complexity and familiarity, leveraging this effect is a way to not only improve repeated recommendation but to gain a more in-depth understanding of both users and stimuli. In this work we present (Mere) Exposure2Vec (Ex2Vec) our model that leverages the Mere Exposure Effect in repeat consumption to derive user and item characterization and track user interest evolution. We validate our model through predicting future music consumption based on repetition and discuss its implications for recommendation scenarios where repetition is common. △ Less

Submitted 17 November, 2023; originally announced November 2023.

Journal ref: In Seventeenth ACM Conference on Recommender Systems (RecSys 2023)

arXiv:2308.12767 [pdf, other]

On the Consistency of Average Embeddings for Item Recommendation

Authors: Walid Bendada, Guillaume Salha-Galvan, Romain Hennequin, Thomas Bouabça, Tristan Cazenave

Abstract: A prevalent practice in recommender systems consists in averaging item embeddings to represent users or higher-level concepts in the same embedding space. This paper investigates the relevance of such a practice. For this purpose, we propose an expected precision score, designed to measure the consistency of an average embedding relative to the items used for its construction. We subsequently anal… ▽ More A prevalent practice in recommender systems consists in averaging item embeddings to represent users or higher-level concepts in the same embedding space. This paper investigates the relevance of such a practice. For this purpose, we propose an expected precision score, designed to measure the consistency of an average embedding relative to the items used for its construction. We subsequently analyze the mathematical expression of this score in a theoretical setting with specific assumptions, as well as its empirical behavior on real-world data from music streaming services. Our results emphasize that real-world averages are less consistent for recommendation, which paves the way for future research to better align real-world embeddings with assumptions from our theoretical setting. △ Less

Submitted 30 August, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

Comments: 17th ACM Conference on Recommender Systems (RecSys 2023)

arXiv:2307.01212 [pdf, other]

Of Spiky SVDs and Music Recommendation

Authors: Darius Afchar, Romain Hennequin, Vincent Guigue

Abstract: The truncated singular value decomposition is a widely used methodology in music recommendation for direct similar-item retrieval or embedding musical items for downstream tasks. This paper investigates a curious effect that we show naturally occurring on many recommendation datasets: spiking formations in the embedding space. We first propose a metric to quantify this spiking organization's stren… ▽ More The truncated singular value decomposition is a widely used methodology in music recommendation for direct similar-item retrieval or embedding musical items for downstream tasks. This paper investigates a curious effect that we show naturally occurring on many recommendation datasets: spiking formations in the embedding space. We first propose a metric to quantify this spiking organization's strength, then mathematically prove its origin tied to underlying communities of items of varying internal popularity. With this new-found theoretical understanding, we finally open the topic with an industrial use case of estimating how music embeddings' top-k similar items will change over time under the addition of data. △ Less

Submitted 30 June, 2023; originally announced July 2023.

Comments: Accepted for RecSys 2023 (Singapour, 18-22 September)

arXiv:2304.08158 [pdf, other]

Attention Mixtures for Time-Aware Sequential Recommendation

Authors: Viet-Anh Tran, Guillaume Salha-Galvan, Bruno Sguerra, Romain Hennequin

Abstract: Transformers emerged as powerful methods for sequential recommendation. However, existing architectures often overlook the complex dependencies between user preferences and the temporal context. In this short paper, we introduce MOJITO, an improved Transformer sequential recommender system that addresses this limitation. MOJITO leverages Gaussian mixtures of attention-based temporal context and it… ▽ More Transformers emerged as powerful methods for sequential recommendation. However, existing architectures often overlook the complex dependencies between user preferences and the temporal context. In this short paper, we introduce MOJITO, an improved Transformer sequential recommender system that addresses this limitation. MOJITO leverages Gaussian mixtures of attention-based temporal context and item embedding representations for sequential modeling. Such an approach permits to accurately predict which items should be recommended next to users depending on past actions and the temporal context. We demonstrate the relevance of our approach, by empirically outperforming existing Transformers for sequential recommendation on several real-world datasets. △ Less

Submitted 3 July, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: SIGIR 2023

arXiv:2303.06944 [pdf, other]

A Human Subject Study of Named Entity Recognition (NER) in Conversational Music Recommendation Queries

Authors: Elena V. Epure, Romain Hennequin

Abstract: We conducted a human subject study of named entity recognition on a noisy corpus of conversational music recommendation queries, with many irregular and novel named entities. We evaluated the human NER linguistic behaviour in these challenging conditions and compared it with the most common NER systems nowadays, fine-tuned transformers. Our goal was to learn about the task to guide the design of b… ▽ More We conducted a human subject study of named entity recognition on a noisy corpus of conversational music recommendation queries, with many irregular and novel named entities. We evaluated the human NER linguistic behaviour in these challenging conditions and compared it with the most common NER systems nowadays, fine-tuned transformers. Our goal was to learn about the task to guide the design of better evaluation methods and NER algorithms. The results showed that NER in our context was quite hard for both human and algorithms under a strict evaluation schema; humans had higher precision, while the model higher recall because of entity exposure especially during pre-training; and entity types had different error patterns (e.g. frequent typing errors for artists). The released corpus goes beyond predefined frames of interaction and can support future work in conversational music recommendation. △ Less

Submitted 13 March, 2023; originally announced March 2023.

Journal ref: The 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)

arXiv:2211.08972 [pdf, other]

New Frontiers in Graph Autoencoders: Joint Community Detection and Link Prediction

Authors: Guillaume Salha-Galvan, Johannes F. Lutzeyer, George Dasoulas, Romain Hennequin, Michalis Vazirgiannis

Abstract: Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction (LP). Their performances are less impressive on community detection (CD), where they are often outperformed by simpler alternatives such as the Louvain method. It is still unclear to what extent one can improve CD with GAE and VGAE, especially in the absence of node features. It is mo… ▽ More Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction (LP). Their performances are less impressive on community detection (CD), where they are often outperformed by simpler alternatives such as the Louvain method. It is still unclear to what extent one can improve CD with GAE and VGAE, especially in the absence of node features. It is moreover uncertain whether one could do so while simultaneously preserving good performances on LP in a multi-task setting. In this workshop paper, summarizing results from our journal publication (Salha-Galvan et al. 2022), we show that jointly addressing these two tasks with high accuracy is possible. For this purpose, we introduce a community-preserving message passing scheme, doping our GAE and VGAE encoders by considering both the initial graph and Louvain-based prior communities when computing embedding spaces. Inspired by modularity-based clustering, we further propose novel training and optimization strategies specifically designed for joint LP and CD. We demonstrate the empirical effectiveness of our approach, referred to as Modularity-Aware GAE and VGAE, on various real-world graphs. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: This NeurIPS 2022 GLFrontiers workshop paper summarizes results from the following journal article: arXiv:2202.00961. arXiv admin note: text overlap with arXiv:2205.14651

arXiv:2210.16226 [pdf, other]

doi 10.1145/3523227.3551474

Discovery Dynamics: Leveraging Repeated Exposure for User and Music Characterization

Authors: Bruno Sguerra, Viet-Anh Tran, Romain Hennequin

Abstract: Repetition in music consumption is a common phenomenon. It is notably more frequent when compared to the consumption of other media, such as books and movies. In this paper, we show that one particularly interesting repetitive behavior arises when users are consuming new items. Users' interest tends to rise with the first repetitions and attains a peak after which interest will decrease with subse… ▽ More Repetition in music consumption is a common phenomenon. It is notably more frequent when compared to the consumption of other media, such as books and movies. In this paper, we show that one particularly interesting repetitive behavior arises when users are consuming new items. Users' interest tends to rise with the first repetitions and attains a peak after which interest will decrease with subsequent exposures, resulting in an inverted-U shape. This behavior, which has been extensively studied in psychology, is called the mere exposure effect. In this paper, we show how a number of factors, both content and user-based, well documented in the literature on the mere exposure effect, modulate the magnitude of the effect. Due to the vast availability of data of users discovering new songs everyday in music streaming platforms, these findings enable new ways to characterize both the music, users and their relationships. Ultimately, it opens up the possibility of developing new recommender systems paradigms based on these characterizations. △ Less

Submitted 28 October, 2022; originally announced October 2022.

Journal ref: In Sixteenth ACM Conference on Recommender Systems (RecSys 2022)

arXiv:2207.11231 [pdf, other]

Learning Unsupervised Hierarchies of Audio Concepts

Authors: Darius Afchar, Romain Hennequin, Vincent Guigue

Abstract: Music signals are difficult to interpret from their low-level features, perhaps even more than images: e.g. highlighting part of a spectrogram or an image is often insufficient to convey high-level ideas that are genuinely relevant to humans. In computer vision, concept learning was therein proposed to adjust explanations to the right abstraction level (e.g. detect clinical concepts from radiograp… ▽ More Music signals are difficult to interpret from their low-level features, perhaps even more than images: e.g. highlighting part of a spectrogram or an image is often insufficient to convey high-level ideas that are genuinely relevant to humans. In computer vision, concept learning was therein proposed to adjust explanations to the right abstraction level (e.g. detect clinical concepts from radiographs). These methods have yet to be used for MIR. In this paper, we adapt concept learning to the realm of music, with its particularities. For instance, music concepts are typically non-independent and of mixed nature (e.g. genre, instruments, mood), unlike previous work that assumed disentangled concepts. We propose a method to learn numerous music concepts from audio and then automatically hierarchise them to expose their mutual relationships. We conduct experiments on datasets of playlists from a music streaming service, serving as a few annotated examples for diverse concepts. Evaluations show that the mined hierarchies are aligned with both ground-truth hierarchies of concepts -- when available -- and with proxy sources of concept similarity in the general case. △ Less

Submitted 21 July, 2022; originally announced July 2022.

Comments: ISMIR 2022

arXiv:2202.00961 [pdf, other]

Modularity-Aware Graph Autoencoders for Joint Community Detection and Link Prediction

Authors: Guillaume Salha-Galvan, Johannes F. Lutzeyer, George Dasoulas, Romain Hennequin, Michalis Vazirgiannis

Abstract: Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction. Their performances are less impressive on community detection problems where, according to recent and concurring experimental evaluations, they are often outperformed by simpler alternatives such as the Louvain method. It is currently still unclear to which extent one can improve com… ▽ More Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction. Their performances are less impressive on community detection problems where, according to recent and concurring experimental evaluations, they are often outperformed by simpler alternatives such as the Louvain method. It is currently still unclear to which extent one can improve community detection with GAE and VGAE, especially in the absence of node features. It is moreover uncertain whether one could do so while simultaneously preserving good performances on link prediction. In this paper, we show that jointly addressing these two tasks with high accuracy is possible. For this purpose, we introduce and theoretically study a community-preserving message passing scheme, doping our GAE and VGAE encoders by considering both the initial graph structure and modularity-based prior communities when computing embedding spaces. We also propose novel training and optimization strategies, including the introduction of a modularity-inspired regularizer complementing the existing reconstruction losses for joint link prediction and community detection. We demonstrate the empirical effectiveness of our approach, referred to as Modularity-Aware GAE and VGAE, through in-depth experimental validation on various real-world graphs. △ Less

Submitted 20 June, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

Comments: Accepted for publication in Elsevier's Neural Networks journal in 2022

arXiv:2201.10528 [pdf, other]

doi 10.1002/aaai.12056

Explainability in Music Recommender Systems

Authors: Darius Afchar, Alessandro B. Melchiorre, Markus Schedl, Romain Hennequin, Elena V. Epure, Manuel Moussallam

Abstract: The most common way to listen to recorded music nowadays is via streaming platforms which provide access to tens of millions of tracks. To assist users in effectively browsing these large catalogs, the integration of Music Recommender Systems (MRSs) has become essential. Current real-world MRSs are often quite complex and optimized for recommendation accuracy. They combine several building blocks… ▽ More The most common way to listen to recorded music nowadays is via streaming platforms which provide access to tens of millions of tracks. To assist users in effectively browsing these large catalogs, the integration of Music Recommender Systems (MRSs) has become essential. Current real-world MRSs are often quite complex and optimized for recommendation accuracy. They combine several building blocks based on collaborative filtering and content-based recommendation. This complexity can hinder the ability to explain recommendations to end users, which is particularly important for recommendations perceived as unexpected or inappropriate. While pure recommendation performance often correlates with user satisfaction, explainability has a positive impact on other factors such as trust and forgiveness, which are ultimately essential to maintain user loyalty. In this article, we discuss how explainability can be addressed in the context of MRSs. We provide perspectives on how explainability could improve music recommendation algorithms and enhance user experience. First, we review common dimensions and goals of recommenders' explainability and in general of eXplainable Artificial Intelligence (XAI), and elaborate on the extent to which these apply -- or need to be adapted -- to the specific characteristics of music consumption and recommendation. Then, we show how explainability components can be integrated within a MRS and in what form explanations can be provided. Since the evaluation of explanation quality is decoupled from pure accuracy-based evaluation criteria, we also discuss requirements and strategies for evaluating explanations of music recommendations. Finally, we describe the current challenges for introducing explainability within a large-scale industrial music recommender system and provide research perspectives. △ Less

Submitted 25 January, 2022; originally announced January 2022.

Comments: To appear in AI Magazine, Special Topic on Recommender Systems 2022

Journal ref: AI Magazine 43(2), 190-208, 2022

arXiv:2108.11857 [pdf, other]

Probing Pre-trained Auto-regressive Language Models for Named Entity Typing and Recognition

Authors: Elena V. Epure, Romain Hennequin

Abstract: Despite impressive results of language models for named entity recognition (NER), their generalization to varied textual genres, a growing entity type set, and new entities remains a challenge. Collecting thousands of annotations in each new case for training or fine-tuning is expensive and time-consuming. In contrast, humans can easily identify named entities given some simple instructions. Inspi… ▽ More Despite impressive results of language models for named entity recognition (NER), their generalization to varied textual genres, a growing entity type set, and new entities remains a challenge. Collecting thousands of annotations in each new case for training or fine-tuning is expensive and time-consuming. In contrast, humans can easily identify named entities given some simple instructions. Inspired by this, we challenge the reliance on large datasets and study pre-trained language models for NER in a meta-learning setup. First, we test named entity typing (NET) in a zero-shot transfer scenario. Then, we perform NER by giving few examples at inference. We propose a method to select seen and rare / unseen names when having access only to the pre-trained model and report results on these groups. The results show: auto-regressive language models as meta-learners can perform NET and NER fairly well especially for regular or seen names; name irregularity when often present for a certain entity type can become an effective exploitable cue; names with words foreign to the model have the most negative impact on results; the model seems to rely more on name than context cues in few-shot NER. △ Less

Submitted 27 April, 2022; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: Accepted for publication in LREC2022

arXiv:2108.04655 [pdf, other]

Hierarchical Latent Relation Modeling for Collaborative Metric Learning

Authors: Viet-Anh Tran, Guillaume Salha-Galvan, Romain Hennequin, Manuel Moussallam

Abstract: Collaborative Metric Learning (CML) recently emerged as a powerful paradigm for recommendation based on implicit feedback collaborative filtering. However, standard CML methods learn fixed user and item representations, which fails to capture the complex interests of users. Existing extensions of CML also either ignore the heterogeneity of user-item relations, i.e. that a user can simultaneously l… ▽ More Collaborative Metric Learning (CML) recently emerged as a powerful paradigm for recommendation based on implicit feedback collaborative filtering. However, standard CML methods learn fixed user and item representations, which fails to capture the complex interests of users. Existing extensions of CML also either ignore the heterogeneity of user-item relations, i.e. that a user can simultaneously like very different items, or the latent item-item relations, i.e. that a user's preference for an item depends, not only on its intrinsic characteristics, but also on items they previously interacted with. In this paper, we present a hierarchical CML model that jointly captures latent user-item and item-item relations from implicit data. Our approach is inspired by translation mechanisms from knowledge graph embedding and leverages memory-based attention networks. We empirically show the relevance of this joint relational modeling, by outperforming existing CML models on recommendation tasks on several real-world datasets. Our experiments also emphasize the limits of current CML relational models on very sparse datasets. △ Less

Submitted 26 July, 2021; originally announced August 2021.

Comments: 15th ACM Conference on Recommender Systems (RecSys 2021)

arXiv:2108.01053 [pdf, other]

Cold Start Similar Artists Ranking with Gravity-Inspired Graph Autoencoders

Authors: Guillaume Salha-Galvan, Romain Hennequin, Benjamin Chapus, Viet-Anh Tran, Michalis Vazirgiannis

Abstract: On an artist's profile page, music streaming services frequently recommend a ranked list of "similar artists" that fans also liked. However, implementing such a feature is challenging for new artists, for which usage data on the service (e.g. streams or likes) is not yet available. In this paper, we model this cold start similar artists ranking problem as a link prediction task in a directed and a… ▽ More On an artist's profile page, music streaming services frequently recommend a ranked list of "similar artists" that fans also liked. However, implementing such a feature is challenging for new artists, for which usage data on the service (e.g. streams or likes) is not yet available. In this paper, we model this cold start similar artists ranking problem as a link prediction task in a directed and attributed graph, connecting artists to their top-k most similar neighbors and incorporating side musical information. Then, we leverage a graph autoencoder architecture to learn node embedding representations from this graph, and to automatically rank the top-k most similar neighbors of new artists using a gravity-inspired mechanism. We empirically show the flexibility and the effectiveness of our framework, by addressing a real-world cold start similar artists ranking problem on a global music streaming service. Along with this paper, we also publicly release our source code as well as the industrial graph data from our experiments. △ Less

Submitted 2 August, 2021; originally announced August 2021.

Comments: 15th ACM Conference on Recommender Systems (RecSys 2021)

arXiv:2105.15014 [pdf, ps, other]

doi 10.1109/ICASSP39728.2021.9414203

Singing Language Identification using a Deep Phonotactic Approach

Authors: Lenny Renault, Andrea Vaglio, Romain Hennequin

Abstract: Extensive works have tackled Language Identification (LID) in the speech domain, however their application to the singing voice trails and performances on Singing Language Identification (SLID) can be improved leveraging recent progresses made in other singing related tasks. This work presents a modernized phonotactic system for SLID on polyphonic music: phoneme recognition is performed with a Con… ▽ More Extensive works have tackled Language Identification (LID) in the speech domain, however their application to the singing voice trails and performances on Singing Language Identification (SLID) can be improved leveraging recent progresses made in other singing related tasks. This work presents a modernized phonotactic system for SLID on polyphonic music: phoneme recognition is performed with a Connectionist Temporal Classification (CTC)-based acoustic model trained with multilingual data, before language classification with a recurrent model based on the phonemes estimation. The full pipeline is trained and evaluated with a large and publicly available dataset, with unprecedented performances. First results of SLID with out-of-set languages are also presented. △ Less

Submitted 31 May, 2021; originally announced May 2021.

Comments: 5 pages, 1 figure, ICASSP 2021

Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 271-275

arXiv:2104.12437 [pdf, other]

Towards Rigorous Interpretations: a Formalisation of Feature Attribution

Authors: Darius Afchar, Romain Hennequin, Vincent Guigue

Abstract: Feature attribution is often loosely presented as the process of selecting a subset of relevant features as a rationale of a prediction. Task-dependent by nature, precise definitions of "relevance" encountered in the literature are however not always consistent. This lack of clarity stems from the fact that we usually do not have access to any notion of ground-truth attribution and from a more gen… ▽ More Feature attribution is often loosely presented as the process of selecting a subset of relevant features as a rationale of a prediction. Task-dependent by nature, precise definitions of "relevance" encountered in the literature are however not always consistent. This lack of clarity stems from the fact that we usually do not have access to any notion of ground-truth attribution and from a more general debate on what good interpretations are. In this paper we propose to formalise feature selection/attribution based on the concept of relaxed functional dependence. In particular, we extend our notions to the instance-wise setting and derive necessary properties for candidate selection solutions, while leaving room for task-dependence. By computing ground-truth attributions on synthetic datasets, we evaluate many state-of-the-art attribution methods and show that, even when optimised, some fail to verify the proposed properties and provide wrong solutions. △ Less

Submitted 5 July, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: 38th International Conference on Machine Learning (ICML 2021)

Journal ref: PMLR 139:76-86, 2021

arXiv:2010.06325 [pdf, other]

Modeling the Music Genre Perception across Language-Bound Cultures

Authors: Elena V. Epure, Guillaume Salha, Manuel Moussallam, Romain Hennequin

Abstract: The music genre perception expressed through human annotations of artists or albums varies significantly across language-bound cultures. These variations cannot be modeled as mere translations since we also need to account for cultural differences in the music genre perception. In this work, we study the feasibility of obtaining relevant cross-lingual, culture-specific music genre annotations base… ▽ More The music genre perception expressed through human annotations of artists or albums varies significantly across language-bound cultures. These variations cannot be modeled as mere translations since we also need to account for cultural differences in the music genre perception. In this work, we study the feasibility of obtaining relevant cross-lingual, culture-specific music genre annotations based only on language-specific semantic representations, namely distributed concept embeddings and ontologies. Our study, focused on six languages, shows that unsupervised cross-lingual music genre annotation is feasible with high accuracy, especially when combining both types of representations. This approach of studying music genres is the most extensive to date and has many implications in musicology and music information retrieval. Besides, we introduce a new, domain-dependent cross-lingual corpus to benchmark state of the art multilingual pre-trained embedding models. △ Less

Submitted 16 November, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)

arXiv:2009.07755 [pdf, other]

Multilingual Music Genre Embeddings for Effective Cross-Lingual Music Item Annotation

Authors: Elena V. Epure, Guillaume Salha, Romain Hennequin

Abstract: Annotating music items with music genres is crucial for music recommendation and information retrieval, yet challenging given that music genres are subjective concepts. Recently, in order to explicitly consider this subjectivity, the annotation of music items was modeled as a translation task: predict for a music item its music genres within a target vocabulary or taxonomy (tag system) from a set… ▽ More Annotating music items with music genres is crucial for music recommendation and information retrieval, yet challenging given that music genres are subjective concepts. Recently, in order to explicitly consider this subjectivity, the annotation of music items was modeled as a translation task: predict for a music item its music genres within a target vocabulary or taxonomy (tag system) from a set of music genre tags originating from other tag systems. However, without a parallel corpus, previous solutions could not handle tag systems in other languages, being limited to the English-language only. Here, by learning multilingual music genre embeddings, we enable cross-lingual music genre translation without relying on a parallel corpus. First, we apply compositionality functions on pre-trained word embeddings to represent multi-word tags.Second, we adapt the tag representations to the music domain by leveraging multilingual music genres graphs with a modified retrofitting algorithm. Experiments show that our method: 1) is effective in translating music genres across tag systems in multiple languages (English, French and Spanish); 2) outperforms the previous baseline in an English-language multi-source translation task. We publicly release the new multilingual data and code. △ Less

Submitted 16 September, 2020; originally announced September 2020.

Comments: 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

arXiv:2008.11406 [pdf, other]

doi 10.1145/3383313.3412253

Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction

Authors: Darius Afchar, Romain Hennequin

Abstract: Explaining recommendations enables users to understand whether recommended items are relevant to their needs and has been shown to increase their trust in the system. More generally, if designing explainable machine learning models is key to check the sanity and robustness of a decision process and improve their efficiency, it however remains a challenge for complex architectures, especially deep… ▽ More Explaining recommendations enables users to understand whether recommended items are relevant to their needs and has been shown to increase their trust in the system. More generally, if designing explainable machine learning models is key to check the sanity and robustness of a decision process and improve their efficiency, it however remains a challenge for complex architectures, especially deep neural networks that are often deemed "black-box". In this paper, we propose a novel formulation of interpretable deep neural networks for the attribution task. Differently to popular post-hoc methods, our approach is interpretable by design. Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts. Experimental results on synthetic data and real-world recommendation tasks demonstrate that our method enables to build models achieving close predictive performances to their non-interpretable counterparts, while providing informative attribution interpretations. △ Less

Submitted 26 August, 2020; originally announced August 2020.

Comments: 14th ACM Conference on Recommender Systems (RecSys '20)

arXiv:2002.01910 [pdf, other]

FastGAE: Scalable Graph Autoencoders with Stochastic Subgraph Decoding

Authors: Guillaume Salha, Romain Hennequin, Jean-Baptiste Remy, Manuel Moussallam, Michalis Vazirgiannis

Abstract: Graph autoencoders (AE) and variational autoencoders (VAE) are powerful node embedding methods, but suffer from scalability issues. In this paper, we introduce FastGAE, a general framework to scale graph AE and VAE to large graphs with millions of nodes and edges. Our strategy, based on an effective stochastic subgraph decoding scheme, significantly speeds up the training of graph AE and VAE while… ▽ More Graph autoencoders (AE) and variational autoencoders (VAE) are powerful node embedding methods, but suffer from scalability issues. In this paper, we introduce FastGAE, a general framework to scale graph AE and VAE to large graphs with millions of nodes and edges. Our strategy, based on an effective stochastic subgraph decoding scheme, significantly speeds up the training of graph AE and VAE while preserving or even improving performances. We demonstrate the effectiveness of FastGAE on various real-world graphs, outperforming the few existing approaches to scale graph AE and VAE by a wide margin. △ Less

Submitted 13 April, 2021; v1 submitted 5 February, 2020; originally announced February 2020.

Comments: Accepted for publication in Elsevier's Neural Networks journal

arXiv:2001.07614 [pdf, other]

Simple and Effective Graph Autoencoders with One-Hop Linear Models

Authors: Guillaume Salha, Romain Hennequin, Michalis Vazirgiannis

Abstract: Over the last few years, graph autoencoders (AE) and variational autoencoders (VAE) emerged as powerful node embedding methods, with promising performances on challenging tasks such as link prediction and node clustering. Graph AE, VAE and most of their extensions rely on multi-layer graph convolutional networks (GCN) encoders to learn vector space representations of nodes. In this paper, we show… ▽ More Over the last few years, graph autoencoders (AE) and variational autoencoders (VAE) emerged as powerful node embedding methods, with promising performances on challenging tasks such as link prediction and node clustering. Graph AE, VAE and most of their extensions rely on multi-layer graph convolutional networks (GCN) encoders to learn vector space representations of nodes. In this paper, we show that GCN encoders are actually unnecessarily complex for many applications. We propose to replace them by significantly simpler and more interpretable linear models w.r.t. the direct neighborhood (one-hop) adjacency matrix of the graph, involving fewer operations, fewer parameters and no activation function. For the two aforementioned tasks, we show that this simpler approach consistently reaches competitive performances w.r.t. GCN-based graph AE and VAE for numerous real-world graphs, including all benchmark datasets commonly used to evaluate graph AE and VAE. Based on these results, we also question the relevance of repeatedly using these datasets to compare complex graph AE and VAE. △ Less

Submitted 17 June, 2020; v1 submitted 21 January, 2020; originally announced January 2020.

Comments: Accepted at ECML-PKDD 2020. A preliminary version of this work has previously been presented at the NeurIPS 2019 workshop on Graph Representation Learning: arXiv:1910.00942

arXiv:1910.00942 [pdf, ps, other]

Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks

Authors: Guillaume Salha, Romain Hennequin, Michalis Vazirgiannis

Abstract: Graph autoencoders (AE) and variational autoencoders (VAE) recently emerged as powerful node embedding methods, with promising performances on challenging tasks such as link prediction and node clustering. Graph AE, VAE and most of their extensions rely on graph convolutional networks (GCN) to learn vector space representations of nodes. In this paper, we propose to replace the GCN encoder by a si… ▽ More Graph autoencoders (AE) and variational autoencoders (VAE) recently emerged as powerful node embedding methods, with promising performances on challenging tasks such as link prediction and node clustering. Graph AE, VAE and most of their extensions rely on graph convolutional networks (GCN) to learn vector space representations of nodes. In this paper, we propose to replace the GCN encoder by a simple linear model w.r.t. the adjacency matrix of the graph. For the two aforementioned tasks, we empirically show that this approach consistently reaches competitive performances w.r.t. GCN-based models for numerous real-world graphs, including the widely used Cora, Citeseer and Pubmed citation networks that became the de facto benchmark datasets for evaluating graph AE and VAE. This result questions the relevance of repeatedly using these three datasets to compare complex graph AE and VAE models. It also emphasizes the effectiveness of simple node encoding schemes for many real-world applications. △ Less

Submitted 2 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019 Graph Representation Learning Workshop

arXiv:1909.10912 [pdf, ps, other]

doi 10.1145/3331184.3331337

Improving Collaborative Metric Learning with Efficient Negative Sampling

Authors: Viet-Anh Tran, Romain Hennequin, Jimena Royo-Letelier, Manuel Moussallam

Abstract: Distance metric learning based on triplet loss has been applied with success in a wide range of applications such as face recognition, image retrieval, speaker change detection and recently recommendation with the CML model. However, as we show in this article, CML requires large batches to work reasonably well because of a too simplistic uniform negative sampling strategy for selecting triplets.… ▽ More Distance metric learning based on triplet loss has been applied with success in a wide range of applications such as face recognition, image retrieval, speaker change detection and recently recommendation with the CML model. However, as we show in this article, CML requires large batches to work reasonably well because of a too simplistic uniform negative sampling strategy for selecting triplets. Due to memory limitations, this makes it difficult to scale in high-dimensional scenarios. To alleviate this problem, we propose here a 2-stage negative sampling strategy which finds triplets that are highly informative for learning. Our strategy allows CML to work effectively in terms of accuracy and popularity bias, even when the batch size is an order of magnitude smaller than what would be needed with the default uniform sampling. We demonstrate the suitability of the proposed strategy for recommendation and exhibit consistent positive results across various datasets. △ Less

Submitted 24 September, 2019; originally announced September 2019.

Comments: SIGIR 2019

arXiv:1907.08698 [pdf, other]

Leveraging Knowledge Bases And Parallel Annotations For Music Genre Translation

Authors: Elena V. Epure, Anis Khlif, Romain Hennequin

Abstract: Prevalent efforts have been put in automatically inferring genres of musical items. Yet, the propose solutions often rely on simplifications and fail to address the diversity and subjectivity of music genres. Accounting for these has, though, many benefits for aligning knowledge sources, integrating data and enriching musical items with tags. Here, we choose a new angle for the genre study by seek… ▽ More Prevalent efforts have been put in automatically inferring genres of musical items. Yet, the propose solutions often rely on simplifications and fail to address the diversity and subjectivity of music genres. Accounting for these has, though, many benefits for aligning knowledge sources, integrating data and enriching musical items with tags. Here, we choose a new angle for the genre study by seeking to predict what would be the genres of musical items in a target tag system, knowing the genres assigned to them within source tag systems. We call this a translation task and identify three cases: 1) no common annotated corpus between source and target tag systems exists, 2) such a large corpus exists, 3) only few common annotations exist. We propose the related solutions: a knowledge-based translation modeled as taxonomy mapping, a statistical translation modeled with maximum likelihood logistic regression; a hybrid translation modeled with maximum a posteriori logistic regression with priors given by the knowledge-based translation. During evaluation, the solutions fit well the identified cases and the hybrid translation is systematically the most effective w.r.t. multilabel classification metrics. This is a first attempt to unify genre tag systems by leveraging both representation and interpretation diversity. △ Less

Submitted 27 July, 2019; v1 submitted 18 July, 2019; originally announced July 2019.

Comments: Published in ISMIR 2019

arXiv:1906.02618 [pdf, other]

doi 10.1109/ICASSP.2019.8683555

Singing voice separation: a study on training data

Authors: Laure Prétet, Romain Hennequin, Jimena Royo-Letelier, Andrea Vaglio

Abstract: In the recent years, singing voice separation systems showed increased performance due to the use of supervised training. The design of training datasets is known as a crucial factor in the performance of such systems. We investigate on how the characteristics of the training dataset impacts the separation performances of state-of-the-art singing voice separation algorithms. We show that the separ… ▽ More In the recent years, singing voice separation systems showed increased performance due to the use of supervised training. The design of training datasets is known as a crucial factor in the performance of such systems. We investigate on how the characteristics of the training dataset impacts the separation performances of state-of-the-art singing voice separation algorithms. We show that the separation quality and diversity are two important and complementary assets of a good training dataset. We also provide insights on possible transforms to perform data augmentation for this task. △ Less

Submitted 6 June, 2019; originally announced June 2019.

Journal ref: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1905.09570 [pdf, other]

Gravity-Inspired Graph Autoencoders for Directed Link Prediction

Authors: Guillaume Salha, Stratis Limnios, Romain Hennequin, Viet Anh Tran, Michalis Vazirgiannis

Abstract: Graph autoencoders (AE) and variational autoencoders (VAE) recently emerged as powerful node embedding methods. In particular, graph AE and VAE were successfully leveraged to tackle the challenging link prediction problem, aiming at figuring out whether some pairs of nodes from a graph are connected by unobserved edges. However, these models focus on undirected graphs and therefore ignore the pote… ▽ More Graph autoencoders (AE) and variational autoencoders (VAE) recently emerged as powerful node embedding methods. In particular, graph AE and VAE were successfully leveraged to tackle the challenging link prediction problem, aiming at figuring out whether some pairs of nodes from a graph are connected by unobserved edges. However, these models focus on undirected graphs and therefore ignore the potential direction of the link, which is limiting for numerous real-life applications. In this paper, we extend the graph AE and VAE frameworks to address link prediction in directed graphs. We present a new gravity-inspired decoder scheme that can effectively reconstruct directed graphs from a node embedding. We empirically evaluate our method on three different directed link prediction tasks, for which standard graph AE and VAE perform poorly. We achieve competitive results on three real-world graphs, outperforming several popular baselines. △ Less

Submitted 6 June, 2022; v1 submitted 23 May, 2019; originally announced May 2019.

Comments: ACM International Conference on Information and Knowledge Management (CIKM 2019)

arXiv:1902.08813 [pdf, other]

A Degeneracy Framework for Scalable Graph Autoencoders

Authors: Guillaume Salha, Romain Hennequin, Viet Anh Tran, Michalis Vazirgiannis

Abstract: In this paper, we present a general framework to scale graph autoencoders (AE) and graph variational autoencoders (VAE). This framework leverages graph degeneracy concepts to train models only from a dense subset of nodes instead of using the entire graph. Together with a simple yet effective propagation mechanism, our approach significantly improves scalability and training speed while preserving… ▽ More In this paper, we present a general framework to scale graph autoencoders (AE) and graph variational autoencoders (VAE). This framework leverages graph degeneracy concepts to train models only from a dense subset of nodes instead of using the entire graph. Together with a simple yet effective propagation mechanism, our approach significantly improves scalability and training speed while preserving performance. We evaluate and discuss our method on several variants of existing graph AE and VAE, providing the first application of these models to large graphs with up to millions of nodes and edges. We achieve empirically competitive results w.r.t. several popular scalable node embedding methods, which emphasizes the relevance of pursuing further research towards more scalable graph AE and VAE. △ Less

Submitted 21 June, 2022; v1 submitted 23 February, 2019; originally announced February 2019.

Comments: International Joint Conference on Artificial Intelligence (IJCAI 2019)

arXiv:1810.01807 [pdf, other]

Disambiguating Music Artists at Scale with Audio Metric Learning

Authors: Jimena Royo-Letelier, Romain Hennequin, Viet-Anh Tran, Manuel Moussallam

Abstract: We address the problem of disambiguating large scale catalogs through the definition of an unknown artist clustering task. We explore the use of metric learning techniques to learn artist embeddings directly from audio, and using a dedicated homonym artists dataset, we compare our method with a recent approach that learn similar embeddings using artist classifiers. While both systems have the abil… ▽ More We address the problem of disambiguating large scale catalogs through the definition of an unknown artist clustering task. We explore the use of metric learning techniques to learn artist embeddings directly from audio, and using a dedicated homonym artists dataset, we compare our method with a recent approach that learn similar embeddings using artist classifiers. While both systems have the ability to disambiguate unknown artists relying exclusively on audio, we show that our system is more suitable in the case when enough audio data is available for each artist in the train dataset. We also propose a new negative sampling method for metric learning that takes advantage of side information such as music genre during the learning phase and shows promising results for the artist clustering task. △ Less

Submitted 3 October, 2018; originally announced October 2018.

Comments: published in ISMIR 2018

arXiv:1809.07276 [pdf, other]

Music Mood Detection Based On Audio And Lyrics With Deep Neural Net

Authors: Rémi Delbouys, Romain Hennequin, Francesco Piccoli, Jimena Royo-Letelier, Manuel Moussallam

Abstract: We consider the task of multimodal music mood prediction based on the audio signal and the lyrics of a track. We reproduce the implementation of traditional feature engineering based approaches and propose a new model based on deep learning. We compare the performance of both approaches on a database containing 18,000 tracks with associated valence and arousal values and show that our approach out… ▽ More We consider the task of multimodal music mood prediction based on the audio signal and the lyrics of a track. We reproduce the implementation of traditional feature engineering based approaches and propose a new model based on deep learning. We compare the performance of both approaches on a database containing 18,000 tracks with associated valence and arousal values and show that our approach outperforms classical models on the arousal detection task, and that both approaches perform equally on the valence prediction task. We also compare the a posteriori fusion with fusion of modalities optimized simultaneously with each unimodal model, and observe a significant improvement of valence prediction. We release part of our database for comparison purposes. △ Less

Submitted 19 September, 2018; originally announced September 2018.

Comments: Published in ISMIR 2018

arXiv:1809.07256 [pdf, other]

Audio Based Disambiguation Of Music Genre Tags

Authors: Romain Hennequin, Jimena Royo-Letelier, Manuel Moussallam

Abstract: In this paper, we propose to infer music genre embeddings from audio datasets carrying semantic information about genres. We show that such embeddings can be used for disambiguating genre tags (identification of different labels for the same genre, tag translation from a tag system to another, inference of hierarchical taxonomies on these genre tags). These embeddings are built by training a deep… ▽ More In this paper, we propose to infer music genre embeddings from audio datasets carrying semantic information about genres. We show that such embeddings can be used for disambiguating genre tags (identification of different labels for the same genre, tag translation from a tag system to another, inference of hierarchical taxonomies on these genre tags). These embeddings are built by training a deep convolutional neural network genre classifier with large audio datasets annotated with a flat tag system. We show empirically that they makes it possible to retrieve the original taxonomy of a tag system, spot duplicates tags and translate tags from a tag system to another. △ Less

Submitted 19 September, 2018; originally announced September 2018.

Comments: published in ISMIR 2018

arXiv:1808.10351 [pdf, other]

Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features

Authors: Albin Andrew Correya, Romain Hennequin, Mickaël Arcos

Abstract: Cover song detection is a very relevant task in Music Information Retrieval (MIR) studies and has been mainly addressed using audio-based systems. Despite its potential impact in industrial contexts, low performances and lack of scalability have prevented such systems from being adopted in practice for large applications. In this work, we investigate whether textual music information (such as meta… ▽ More Cover song detection is a very relevant task in Music Information Retrieval (MIR) studies and has been mainly addressed using audio-based systems. Despite its potential impact in industrial contexts, low performances and lack of scalability have prevented such systems from being adopted in practice for large applications. In this work, we investigate whether textual music information (such as metadata and lyrics) can be used along with audio for large-scale cover identification problem in a wide digital music library. We benchmark this problem using standard text and state of the art audio similarity measures. Our studies shows that these methods can significantly increase the accuracy and scalability of cover detection systems on Million Song Dataset (MSD) and Second Hand Song (SHS) datasets. By only leveraging standard tf-idf based text similarity measures on song titles and lyrics, we achieved 35.5% of absolute increase in mean average precision compared to the current scalable audio content-based state of the art methods on MSD. These experimental results suggests that new methodologies can be encouraged among researchers to leverage and identify more sophisticated NLP-based techniques to improve current cover song identification systems in digital music libraries with metadata. △ Less

Submitted 30 August, 2018; originally announced August 2018.

Comments: Music Information Retrieval, Cover Song Identification, Million Song Dataset, Natural Language Processing

Showing 1–43 of 43 results for author: Hennequin, R