
Showing 1–50 of 67 results for author: herremans, D

  1. arXiv:2506.15154  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

    Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans

    Abstract: Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic de…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 14 pages, 2 figures, Accepted to AIMC 2025

    MSC Class: 68T10 (Primary); 68T50 (Secondary) ACM Class: H.5.5; H.5.1; I.2.7

    Journal ref: Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025), Brussels, Belgium, September 10th - 12th, 2025

  2. arXiv:2506.02514  [pdf, ps, other]

    cs.HC

    To Embody or Not: The Effect Of Embodiment On User Perception Of LLM-based Conversational Agents

    Authors: Kyra Wang, Boon-Kiat Quek, Jessica Goh, Dorien Herremans

    Abstract: Embodiment in conversational agents (CAs) refers to the physical or visual representation of these agents, which can significantly influence user perception and interaction. Limited work has been done examining the effect of embodiment on the perception of CAs utilizing modern large language models (LLMs) in non-hierarchical cooperative tasks, a common use case of CAs as more powerful models becom…

    Submitted 3 June, 2025; originally announced June 2025.

  3. arXiv:2505.20979  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    MelodySim: Measuring Melody-aware Music Similarity for Plagiarism Detection

    Authors: Tongyu Lu, Charlotta-Marlena Geist, Jan Melechovsky, Abhinaba Roy, Dorien Herremans

    Abstract: We propose MelodySim, a melody-aware music similarity model and dataset for plagiarism detection. First, we introduce a novel method to construct a dataset with a focus on melodic similarity. By augmenting Slakh2100, an existing MIDI dataset, we generate variations of each piece while preserving the melody through modifications such as note splitting, arpeggiation, minor track dropout (excluding bas…

    Submitted 27 May, 2025; originally announced May 2025.

  4. arXiv:2505.12669  [pdf, ps, other]

    cs.SD cs.AI cs.MM eess.AS

    Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

    Authors: Abhinaba Roy, Geeta Puri, Dorien Herremans

    Abstract: We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objective scores: a text-audio consistency score that measures rhythmic alignment b…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: 7 pages, 1 figure, 5 tables

    MSC Class: 68T07 ACM Class: I.2.1

  5. arXiv:2502.07461  [pdf, other]

    cs.SD cs.AI

    JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata

    Authors: Abhinaba Roy, Renhang Liu, Tongyu Lu, Dorien Herremans

    Abstract: We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 362,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then…

    Submitted 16 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: 8 pages, 5 figures

  6. arXiv:2502.04522  [pdf, other]

    cs.SD cs.AI eess.AS

    ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement

    Authors: Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R. Enus, Louis B. Bradshaw, Dorien Herremans, Simon Colton

    Abstract: Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks.…

    Submitted 16 May, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: 10 pages, 6 figures, IJCNN 2025 conference

  7. arXiv:2502.03979  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Unified Music Emotion Recognition across Dimensional and Categorical Models

    Authors: Jaeyong Kang, Dorien Herremans

    Abstract: One of the most significant challenges in Music Emotion Recognition (MER) comes from the fact that emotion labels can be heterogeneous across datasets with regard to the emotion representation, including categorical (e.g., happy, sad) versus dimensional labels (e.g., valence-arousal). In this paper, we present a unified multitask learning framework that combines these two types of labels and is th…

    Submitted 11 April, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  8. arXiv:2412.16526  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Text2midi: Generating Symbolic Music from Captions

    Authors: Keshav Bhandari, Abhinaba Roy, Kyra Wang, Geeta Puri, Simon Colton, Dorien Herremans

    Abstract: This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Speci…

    Submitted 31 December, 2024; v1 submitted 21 December, 2024; originally announced December 2024.

    Comments: 9 pages, 3 figures, Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

    Journal ref: Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

  9. arXiv:2411.00469  [pdf, other]

    cs.SD cs.AI cs.IR eess.AS

    MIRFLEX: Music Information Retrieval Feature Library for Extraction

    Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans

    Abstract: This paper introduces an extendable modular system that compiles a range of music feature extraction models to aid music information retrieval research. The features include musical elements like key, downbeats, and genre, as well as audio characteristics like instrument recognition, vocals/instrumental classification, and vocals gender detection. The integrated models are state-of-the-art or late…

    Submitted 1 November, 2024; originally announced November 2024.

    Comments: 2 pages, 4 tables, submitted to Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024

    ACM Class: I.2.7

  10. arXiv:2410.13342  [pdf, other]

    eess.AS cs.AI cs.SD

    DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input. Accented TTS aims to enhance user experience by making the synthesized speech more relatable to minority group listeners, and useful across various applications and contexts. Speech synthesis can further be made more flexible by allowing users to choose any combinatio…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: Accepted in Audio Imagination workshop of NeurIPS 2024

  11. arXiv:2410.11522  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Leveraging LLM Embeddings for Cross Dataset Label Alignment and Zero Shot Music Emotion Prediction

    Authors: Renhang Liu, Abhinaba Roy, Dorien Herremans

    Abstract: In this work, we present a novel method for music emotion recognition that leverages Large Language Model (LLM) embeddings for label alignment across multiple datasets and zero-shot prediction on novel categories. First, we compute LLM embeddings for emotion labels and apply non-parametric clustering to group similar labels, across multiple datasets containing disjoint labels. We use these cluster…

    Submitted 17 October, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

  12. arXiv:2409.09378  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Prevailing Research Areas for Music AI in the Era of Foundation Models

    Authors: Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans

    Abstract: In tandem with the recent advancements in foundation model research, there has been a surge of generative music AI applications within the past few years. As the idea of AI-generated or AI-augmented music becomes more mainstream, many researchers in the music AI community may be wondering what avenues of research are left. With regard to music generative models, we outline the current areas of re…

    Submitted 14 September, 2024; originally announced September 2024.

    MSC Class: 68T05; 68T20 ACM Class: I.2; I.5.4; I.2.6; I.2.7; H.5.5

  13. arXiv:2408.06827  [pdf, other]

    eess.AS cs.LG

    PRESENT: Zero-Shot Text-to-Prosody Control

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

    Abstract: Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modi…

    Submitted 13 August, 2024; originally announced August 2024.

    Journal ref: IEEE Signal Processing Letters 2025

  14. arXiv:2407.10462  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features

    Authors: Jing Luo, Xinyu Yang, Dorien Herremans

    Abstract: Controllable music generation promotes the interaction between humans and composition systems by projecting the users' intent on their desired music. The challenge of introducing controllability is an increasingly important issue in the symbolic music generation field. When building controllable generative popular multi-instrument music systems, two main challenges typically present themselves, na…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Demo page: https://chinglohsiu.github.io/files/bandcontrolnet.html

  15. arXiv:2406.08820  [pdf, other]

    eess.AS cs.CL

    DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

    Authors: Kyra Wang, Dorien Herremans

    Abstract: Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not includ…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 4 pages, 1 figure, submitted to IEEE TENCON 2024

  16. arXiv:2406.08809  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Are We There Yet? A Brief Survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges

    Authors: Jaeyong Kang, Dorien Herremans

    Abstract: Deep learning models for music have advanced drastically in recent years, but how good are machine learning models at capturing emotion, and what challenges are researchers facing? In this paper, we provide a comprehensive overview of the available music-emotion datasets and discuss evaluation standards as well as competitions in the field. We also offer a brief overview of various types of music…

    Submitted 24 June, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

    Journal ref: IEEE Transactions on Affective Computing (2025)

  17. arXiv:2406.02255  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    MidiCaps: A large-scale MIDI dataset with text captions

    Authors: Jan Melechovsky, Abhinaba Roy, Dorien Herremans

    Abstract: Generative models guided by text prompts are becoming increasingly popular. However, no text-to-MIDI models currently exist due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions. MIDI (Musical Instrument Digital Interface) files are widely used…

    Submitted 22 July, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted in ISMIR2024

    Journal ref: Proceedings of ISMIR 2024

  18. arXiv:2406.01018  [pdf, other]

    eess.AS cs.LG cs.SD

    Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent. We note that state-of-the-art Text-to-Speech (T…

    Submitted 29 September, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted in IEEE TENCON 2024

  19. arXiv:2402.17467  [pdf, other]

    cs.IR cs.AI cs.SD eess.AS

    Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

    Authors: Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller, Dorien Herremans

    Abstract: Several adaptations of Transformer models have been developed in various domains since their breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has been frequently compared to language, as…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 36 pages, 5 figures, 4 tables

    Journal ref: ACM Computing Surveys 2025, Volume 57, Issue 7

  20. arXiv:2311.00968  [pdf, other]

    cs.SD cs.AI eess.AS

    Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

    Authors: Jaeyong Kang, Soujanya Poria, Dorien Herremans

    Abstract: Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene…

    Submitted 4 March, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Journal ref: Expert Systems with Applications 249 (2024): 123640

  21. arXiv:2306.13661  [pdf, other]

    q-fin.CP cs.LG q-fin.PM

    Constructing Time-Series Momentum Portfolios with Deep Multi-Task Learning

    Authors: Joel Ong, Dorien Herremans

    Abstract: A diversified risk-adjusted time-series momentum (TSMOM) portfolio can deliver substantial abnormal returns and offer some degree of tail risk protection during extreme market events. The performance of existing TSMOM strategies, however, relies not only on the quality of the momentum signal but also on the efficacy of the volatility estimator. Yet many of the existing studies have always consider…

    Submitted 8 June, 2023; originally announced June 2023.

    Journal ref: Expert Systems with Applications Volume 230, 15 November 2023, 120587

  22. arXiv:2302.00286  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilize…

    Submitted 1 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: arXiv admin note: text overlap with arXiv:2206.10805

  23. arXiv:2212.00973  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    A Domain-Knowledge-Inspired Music Embedding Space and a Novel Attention Mechanism for Symbolic Music Modeling

    Authors: Z. Guo, J. Kang, D. Herremans

    Abstract: Following the success of the transformer architecture in the natural language domain, transformer-like architectures have been widely applied to the domain of symbolic music recently. Symbolic music and text, however, are two different modalities. Symbolic music contains multiple attributes, both absolute attributes (e.g., pitch) and relative attributes (e.g., pitch interval). These relative attri…

    Submitted 2 December, 2022; originally announced December 2022.

    Comments: This paper is accepted at AAAI 2023

    Report number: Article No.: 566, Pages 5070 - 5077

    Journal ref: Proceedings of AAAI 2023

  24. arXiv:2211.08281  [pdf, other]

    q-fin.TR cs.AI cs.LG q-fin.CP q-fin.PM

    Forecasting Bitcoin volatility spikes from whale transactions and CryptoQuant data using Synthesizer Transformer models

    Authors: Dorien Herremans, Kah Wee Low

    Abstract: The cryptocurrency market is highly volatile compared to traditional financial markets. Hence, forecasting its volatility is crucial for risk management. In this paper, we investigate CryptoQuant data (e.g. on-chain analytics, exchange and miner data) and whale-alert tweets, and explore their relationship to Bitcoin's next-day volatility, with a focus on extreme volatility spikes. We propose a dee…

    Submitted 6 October, 2022; originally announced November 2022.

    Comments: Co-first authors

  25. arXiv:2211.07283  [pdf, other]

    eess.AS cs.SD

    SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

    Abstract: Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to acc…

    Submitted 1 June, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

  26. arXiv:2211.03316  [pdf, other]

    eess.AS cs.LG cs.SD

    Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: Accent plays a significant role in speech communication, influencing one's capability to understand as well as conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It has the ability to synthesize a selected speaker's voice, and convert this to any desired target accent. Our…

    Submitted 29 September, 2024; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Accepted at IEEE TENCON 2024

  27. arXiv:2210.05148  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

    Authors: Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

    Abstract: In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic looking piano rolls from pure Gaussian noise conditioned on spectrograms.…

    Submitted 20 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Journal ref: Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023

  28. arXiv:2206.10805  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utiliz…

    Submitted 28 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Submitted to ISMIR

  29. arXiv:2206.00648  [pdf, other]

    q-fin.ST cs.CL cs.LG q-fin.CP q-fin.TR

    PreBit -- A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin

    Authors: Yanzhao Zou, Dorien Herremans

    Abstract: Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, makes Bitcoin highly subject to speculative trading as compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets,…

    Submitted 21 October, 2023; v1 submitted 30 May, 2022; originally announced June 2022.

    Comments: 21 pages, submitted preprint to Elsevier Expert Systems with Applications

    Journal ref: Expert Systems with Applications, 233, 120838 (2023)

  30. arXiv:2204.11437  [pdf, other]

    cs.SD eess.AS eess.SP

    Understanding Audio Features via Trainable Basis Functions

    Authors: Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans

    Abstract: In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments. Input features…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: under review in Interspeech 2022

  31. arXiv:2203.03022  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in…

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

    Journal ref: Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  32. arXiv:2202.10453  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses

    Authors: Phoebe Chua, Dimos Makris, Dorien Herremans, Gemma Roig, Kat Agres

    Abstract: Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived e…

    Submitted 19 February, 2022; originally announced February 2022.

    Comments: 16 pages with 9 figures

  33. arXiv:2202.05528  [pdf, other]

    cs.AI cs.MM

    MusIAC: An extensible generative framework for Music Infilling Applications with multi-level Control

    Authors: Rui Guo, Ivor Simpson, Chris Kiefer, Thor Magnusson, Dorien Herremans

    Abstract: We present a novel music generation framework for music infilling, with a user-friendly interface. Infilling refers to the task of generating musical sections given the surrounding multi-track music. The proposed transformer-based framework is extensible for new control tokens as the added music control tokens such as tonal tension per bar and track polyphony level in this work. We explore the eff…

    Submitted 11 February, 2022; originally announced February 2022.

    Comments: preprint for The 11th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART) 2022

  34. arXiv:2202.04464  [pdf, other]

    cs.SD cs.LG eess.AS

    Conditional Drums Generation using Compound Word Representations

    Authors: Dimos Makris, Guo Zixun, Maximos Kaliakatsos-Papakostas, Dorien Herremans

    Abstract: The field of automatic music composition has seen great progress in recent years, specifically with the invention of transformer-based architectures. When using any deep learning model which considers music as a sequence of events with multiple complex dependencies, the selection of a proper data representation is crucial. In this paper, we tackle the task of conditional drums generation using a n…

    Submitted 21 February, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: Accepted for the 11th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART), 2022

  35. aiSTROM -- A roadmap for developing a successful AI strategy

    Authors: Dorien Herremans

    Abstract: A total of 34% of AI research and development projects fail or are abandoned, according to a recent survey by Rackspace Technology of 1,870 companies. We propose a new strategic framework, aiSTROM, that empowers managers to create a successful AI strategy based on a thorough literature review. This provides a unique and integrated approach that guides managers and lead developers through the vari…

    Submitted 15 November, 2021; v1 submitted 25 June, 2021; originally announced July 2021.

    MSC Class: 68Txx; 97Pxx ACM Class: K.5; K.6; C.5; D.m; H.2; K.7

    Journal ref: IEEE Access, 2021

  36. arXiv:2107.04954  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

    Authors: Kin Wai Cheuk, Dorien Herremans, Li Su

    Abstract: Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not presented in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlab…

    Submitted 29 July, 2021; v1 submitted 10 July, 2021; originally announced July 2021.

    Comments: Accepted in ACMMM 21. Camera ready version

  37. arXiv:2106.12174  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    Deep Neural Network Based Respiratory Pathology Classification Using Cough Sounds

    Authors: Balamurali B T, Hwan Ing Hee, Saumitra Kapoor, Oon Hoe Teoh, Sung Shin Teng, Khai Pin Lee, Dorien Herremans, Jer Ming Chen

    Abstract: Intelligent systems are transforming the world, as well as our healthcare system. We propose a deep learning-based cough sound classification model that can distinguish between children with healthy versus pathological coughs such as asthma, upper respiratory tract infection (URTI), and lower respiratory tract infection (LRTI). In order to train a deep neural network model, we collected a new data…

    Submitted 23 June, 2021; originally announced June 2021.

    MSC Class: 62-XX; 92-XX; 68Txx; ACM Class: J.3; I.2

  38. arXiv:2104.13056  [pdf, other]

    cs.SD cs.LG eess.AS

    Generating Lead Sheets with Affect: A Novel Conditional seq2seq Framework

    Authors: Dimos Makris, Kat R. Agres, Dorien Herremans

    Abstract: The field of automatic music composition has seen great progress in the last few years, much of which can be attributed to advances in deep neural networks. There are numerous studies that present different strategies for generating sheet music from scratch. The inclusion of high-level musical characteristics (e.g., perceived emotional qualities), however, as conditions for controlling the generat…

    Submitted 27 April, 2021; originally announced April 2021.

    Comments: Accepted for the International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18-22 July 2021 (virtual)

  39. arXiv:2104.06607  [pdf, other]

    cs.SD eess.AS

    Revisiting the Onsets and Frames Model with Additive Attention

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on the leverage of deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted in IJCNN 2021 Special Session S04. https://dr-costas.github.io/rlasmp2021-website/

  40. arXiv:2102.13397  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Underwater Acoustic Communication Receiver Using Deep Belief Network

    Authors: Abigail Lee-Leon, Chau Yuen, Dorien Herremans

    Abstract: Underwater environments create a challenging channel for communications. In this paper, we design a novel receiver system by exploring a machine learning technique, the Deep Belief Network (DBN), to combat the signal distortion caused by the Doppler effect and multi-path propagation. We evaluate the performance of the proposed receiver system in both simulation experiments and sea trials. Our propo…

    Submitted 26 February, 2021; originally announced February 2021.

  41. arXiv:2010.11188  [pdf]

    cs.SD cs.CV eess.AS

    AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

    Authors: Ha Thi Phuong Thao, Balamurali B. T., Dorien Herremans, Gemma Roig

    Abstract: In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying a self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare it to the typically temporal integrati…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: 8 pages, 6 figures

    Journal ref: Proceedings of the International Conference on Pattern Recognition (ICPR2020)

  42. arXiv:2010.09969  [pdf, other]

    cs.SD cs.LG eess.AS

    The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitc…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: Accepted in ICPR

  43. arXiv:2010.09489  [pdf, other]

    cs.SD cs.LG cs.MM

    Hit Song Prediction Based on Early Adopter Data and Audio Features

    Authors: Dorien Herremans, Tom Bergmans

    Abstract: Billions of USD are invested in new artists and songs by the music industry every year. This research provides a new strategy for assessing the hit potential of songs, which can help record companies support their investment decisions. A number of models were developed that use both audio data, and a novel feature based on social media listening behaviour. The results show that models based on ear…

    Submitted 16 October, 2020; originally announced October 2020.

    Journal ref: The 18th International Society for Music Information Retrieval Conference (ISMIR) 2018 - LBD

  44. arXiv:2010.06230  [pdf, ps, other]

    cs.SD cs.SC eess.AS

    A variational autoencoder for music generation controlled by tonal tension

    Authors: Rui Guo, Ivor Simpson, Thor Magnusson, Chris Kiefer, Dorien Herremans

    Abstract: Many of the music generation systems based on neural networks are fully autonomous and do not offer control over the generation process. In this research, we present a controllable music generation system in terms of tonal tension. We incorporate two tonal tension measures based on the Spiral Array Tension theory into a variational autoencoder model. This allows us to control the direction of the…

    Submitted 14 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: 2020 Joint Conference on AI Music Creativity

  45. arXiv:2009.04459  [pdf, other]

    cs.SD cs.LG eess.AS

    A dataset and classification model for Malay, Hindi, Tamil and Chinese music

    Authors: Fajilatun Nahar, Kat Agres, Balamurali BT, Dorien Herremans

    Abstract: In this paper we present a new dataset, with musical excerpts from the three main ethnic groups in Singapore: Chinese, Malay and Indian (both Hindi and Tamil). We use this new dataset to train different classification models to distinguish the origin of the music in terms of these ethnic groups. The classification models were optimized by exploring the use of different musical features as the input…

    Submitted 15 September, 2020; v1 submitted 9 September, 2020; originally announced September 2020.

    Comments: 4 pages

  46. arXiv:2007.15474  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling

    Authors: Hao Hao Tan, Dorien Herremans

    Abstract: High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-le…

    Submitted 29 July, 2020; originally announced July 2020.

    Journal ref: Proc. of 21st International Society of Music Information Retrieval Conference, ISMIR 2020

  47. arXiv:2007.00977  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding

    Authors: Kanish Garg, Ajeet kumar Singh, Dorien Herremans, Brejesh Lall

    Abstract: Generating an image from a provided descriptive text is quite a challenging task because of the difficulty in incorporating perceptual information (object shapes, colors, and their interactions) along with providing high relevancy related to the provided text. Current methods first generate an initial low-resolution image, which typically has irregular object shapes, colors, and interaction betwee…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: Proceedings of IEEE International Conference on Imaging, Vision & Pattern Recognition, (IVPR 2020, Japan)

    MSC Class: 68Txx; 68-XX ACM Class: I.4; I.5; I.3; I.2

    Journal ref: Proceedings of IEEE International Conference on Imaging, Vision & Pattern Recognition, (IVPR 2020, Japan)

  48. arXiv:2006.09833  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance

    Authors: Hao Hao Tan, Yin-Jyun Luo, Dorien Herremans

    Abstract: We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow temporal conditions of two essential style features for piano performances: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of syn…

    Submitted 12 July, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Journal ref: Published at ICML Workshop on Machine Learning for Media Discovery Workshop (ML4MD) 2020

  49. arXiv:2006.09016  [pdf, other]

    physics.comp-ph cs.LG stat.AP

    Acoustic prediction of flowrate: varying liquid jet stream onto a free surface

    Authors: Balamurali B T, Edwin Jonathan Aslim, Yun Shu Lynn Ng, Tricia Li, Chuen Kuo, Jacob Shihang Chen, Dorien Herremans, Lay Guat Ng, Jer-Ming Chen

    Abstract: Information on liquid jet stream flow is crucial in many real world applications. In a large number of cases, these flows fall directly onto free surfaces (e.g. pools), creating a splash with accompanying splashing sounds. The sound produced is supplied by energy interactions between the liquid jet stream and the passive free surface. In this investigation, we collect the sound of a water jet of v…

    Submitted 16 June, 2020; originally announced June 2020.

    MSC Class: 76-XX; 92C55; 92-XX ACM Class: J.2

    Journal ref: Proceedings of the IEEE International Conference on Signal Processing and Communications (SPCOM), 2020

  50. arXiv:2001.09989  [pdf, other]

    cs.SD eess.AS

    The impact of Audio input representations on neural network based music transcription

    Authors: Kin Wai Cheuk, Kat Agres, Dorien Herremans

    Abstract: This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, log-frequency spectrogram, Mel spectrogram, and constant-Q transform (CQT). Our results show that an 8.33% increase in transcription accu…

    Submitted 21 July, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: Paper accepted in IJCNN 2020

    Journal ref: IJCNN 2020