Showing 1–50 of 54 results for author: Dixon, S

Searching in archive cs.
  1. arXiv:2509.24853  [pdf, ps, other]

    cs.SD eess.AS

    Enhanced Automatic Drum Transcription via Drum Stem Source Separation

    Authors: Xavier Riley, Simon Dixon

    Abstract: Automatic Drum Transcription (ADT) remains a challenging task in MIR but recent advances allow accurate transcription of drum kits with up to 5 classes - kick, snare, hi-hats, toms and cymbals - via the ADTOF package. In addition, several drum kit stem separation models in the open source community support separation for more than 6 stem classes, including distinct crash and ride cymbals. In t…

    Submitted 29 September, 2025; originally announced September 2025.

  2. arXiv:2507.12175  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection

    Authors: Sungkyun Chang, Simon Dixon, Emmanouil Benetos

    Abstract: This study introduces RUMAA, a transformer-based framework for music performance analysis that unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner. Unlike prior methods addressing these tasks separately, RUMAA integrates them using pre-trained score and audio encoders and a novel tri-stream decoder capturing task interdependencies…

    Submitted 16 July, 2025; originally announced July 2025.

    Comments: Accepted to WASPAA 2025

  3. arXiv:2505.15559  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes

    Authors: Zixun Guo, Simon Dixon

    Abstract: Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain-knowledge-inspired tokenization method and Multidimensional Relati…

    Submitted 21 May, 2025; originally announced May 2025.

  4. arXiv:2502.07711  [pdf, other]

    eess.AS cs.MM

    RenderBox: Expressive Performance Rendering with Text Control

    Authors: Huan Zhang, Akira Maezawa, Simon Dixon

    Abstract: Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level controls thro…

    Submitted 11 February, 2025; originally announced February 2025.

  5. arXiv:2410.03139  [pdf, other]

    eess.AS cs.SD

    How does the teacher rate? Observations from the NeuroPiano dataset

    Authors: Huan Zhang, Vincent Cheung, Hayato Nishioka, Simon Dixon, Shinichi Furuya

    Abstract: This paper provides a detailed analysis of the NeuroPiano dataset, which comprises 104 audio recordings of student piano performances accompanied by 2255 textual feedback and ratings given by professional pianists. We offer a statistical overview of the dataset, focusing on the standardization of annotations and inter-annotator agreement across 12 evaluative questions concerning performance quali…

    Submitted 4 October, 2024; originally announced October 2024.

  6. arXiv:2409.08795  [pdf, other]

    eess.AS cs.MM

    LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment

    Authors: Huan Zhang, Vincent Cheung, Hayato Nishioka, Simon Dixon, Shinichi Furuya

    Abstract: Research in music understanding has extensively explored composition-level attributes such as key, genre, and instrumentation through advanced representations, leading to cross-modal applications using large language models. However, aspects of musical performance such as stylistic expression and technique remain underexplored, along with the potential of using large language models to enhance edu…

    Submitted 16 September, 2024; v1 submitted 13 September, 2024; originally announced September 2024.

  7. arXiv:2408.14340  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Foundation Models for Music: A Survey

    Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, et al. (17 additional authors not shown)

    Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the signifi…

    Submitted 3 September, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

  8. arXiv:2408.10807  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

    Authors: Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

    Abstract: Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a se…

    Submitted 20 August, 2024; originally announced August 2024.

  9. arXiv:2408.08653  [pdf, other]

    cs.SD eess.AS

    GAPS: A Large and Diverse Classical Guitar Dataset and Benchmark Transcription Model

    Authors: Xavier Riley, Zixun Guo, Drew Edwards, Simon Dixon

    Abstract: We introduce GAPS (Guitar-Aligned Performance Scores), a new dataset of classical guitar performances, and a benchmark guitar transcription model that achieves state-of-the-art performance on GuitarSet in both supervised and zero-shot settings. GAPS is the largest dataset of real guitar audio, containing 14 hours of freely available audio-score aligned pairs, recorded in diverse conditions by over…

    Submitted 30 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

    Comments: ISMIR 2024

  10. arXiv:2408.05024  [pdf, other]

    cs.SD cs.CL cs.IR

    MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling

    Authors: Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon

    Abstract: Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the gui…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: Reviewed pre-print accepted for publication at ISMIR 2024

  11. arXiv:2407.04822  [pdf, other]

    eess.AS cs.LG cs.SD

    YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

    Authors: Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

    Abstract: Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of model…

    Submitted 1 August, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted at IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2024, London

  12. arXiv:2405.18386  [pdf, ps, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

    Authors: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

    Abstract: Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; o…

    Submitted 17 July, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted at ISMIR 2025 Conference. Code and demo are available at: https://github.com/ldzhangyx/instruct-musicgen

  13. arXiv:2405.16687  [pdf, other]

    cs.SD eess.AS

    Reconstructing the Charlie Parker Omnibook using an audio-to-score automatic transcription pipeline

    Authors: Xavier Riley, Simon Dixon

    Abstract: The Charlie Parker Omnibook is a cornerstone of jazz music education, described by pianist Ethan Iverson as "the most important jazz education text ever published". In this work we propose a new transcription pipeline and explore the extent to which state of the art music technology is able to reconstruct these scores directly from the audio without human intervention. Our pipeline includes: a new…

    Submitted 26 May, 2024; originally announced May 2024.

  14. arXiv:2402.15258  [pdf, other]

    eess.AS cs.LG cs.SD

    High Resolution Guitar Transcription via Domain Adaptation

    Authors: Xavier Riley, Drew Edwards, Simon Dixon

    Abstract: Automatic music transcription (AMT) has achieved high accuracy for piano due to the availability of large, high-quality datasets such as MAESTRO and MAPS, but comparable datasets are not yet available for other instruments. In recent work, however, it has been demonstrated that aligning scores to transcription model activations can produce high quality AMT training data for instruments other than…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  15. arXiv:2402.06178  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

    Authors: Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

    Abstract: Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and inst…

    Submitted 28 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to IJCAI 2024

  16. arXiv:2402.01424  [pdf, other]

    cs.SD cs.LG eess.AS

    A Data-Driven Analysis of Robust Automatic Piano Transcription

    Authors: Drew Edwards, Simon Dixon, Emmanouil Benetos, Akira Maezawa, Yuta Kusaka

    Abstract: Algorithms for automatic piano transcription have improved dramatically in recent years due to new datasets and modeling techniques. Recent developments have focused primarily on adapting new neural network architectures, such as the Transformer and Perceiver, in order to yield more accurate systems. In this work, we study transcription systems from the perspective of their training data. By measu…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted for publication in IEEE Signal Processing Letters on 31 January, 2024

  17. arXiv:2311.08884  [pdf, other]

    cs.SD cs.MM eess.AS

    CREPE Notes: A new method for segmenting pitch contours into discrete notes

    Authors: Xavier Riley, Simon Dixon

    Abstract: Tracking the fundamental frequency (f0) of a monophonic instrumental performance is effectively a solved problem with several solutions achieving 99% accuracy. However, the related task of automatic music transcription requires a further processing step to segment an f0 contour into discrete notes. This sub-task of note segmentation is necessary to enable a range of applications including musicolo…

    Submitted 15 November, 2023; originally announced November 2023.

    Journal ref: Proceedings of the 20th Sound and Music Computing Conference. June 15-17, 2023. Stockholm, Sweden

  18. arXiv:2311.02023  [pdf, other]

    cs.SD cs.MM eess.AS

    FiloBass: A Dataset and Corpus Based Study of Jazz Basslines

    Authors: Xavier Riley, Simon Dixon

    Abstract: We present FiloBass: a novel corpus of music scores and annotations which focuses on the important but often overlooked role of the double bass in jazz accompaniment. Inspired by recent work that sheds light on the role of the soloist, we offer a collection of 48 manually verified transcriptions of professional jazz bassists, comprising over 50,000 note events, which are based on the backing track…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: ISMIR 2023

    Journal ref: Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy

  19. arXiv:2310.12404  [pdf, other]

    cs.SD cs.CL cs.HC cs.LG eess.AS

    Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

    Authors: Yixiao Zhang, Akira Maezawa, Gus Xia, Kazuhiko Yamamoto, Simon Dixon

    Abstract: Creating music is iterative, requiring varied methods at each stage. However, existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To address this gap, we introduce Loop Copilot, a novel system that enables users to generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpre…

    Submitted 29 August, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: Source code and demo video are available at https://sites.google.com/view/loop-copilot

  20. arXiv:2309.02567  [pdf, other]

    eess.AS cs.MM cs.SD

    Symbolic Music Representations for Classification Tasks: A Systematic Evaluation

    Authors: Huan Zhang, Emmanouil Karystinaios, Simon Dixon, Gerhard Widmer, Carlos Eduardo Cancino-Chacón

    Abstract: Music Information Retrieval (MIR) has seen a recent surge in deep learning-based approaches, which often involve encoding symbolic music (i.e., music represented in terms of discrete note events) in an image-like or language-like fashion. However, symbolic music is neither an image nor a sentence, and research in the symbolic domain lacks a comprehensive overview of the different available represe…

    Submitted 10 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: To be published in the Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

    Journal ref: Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

  21. arXiv:2306.17473  [pdf]

    astro-ph.EP cs.CY physics.soc-ph

    An Orbital Solution for WASP-12 b: Updated Ephemeris and Evidence for Decay Leveraging Citizen Science Data

    Authors: Avinash S. Nediyedath, Martin J. Fowler, A. Norris, Shivaraj R. Maidur, Kyle A. Pearson, S. Dixon, P. Lewin, Andre O. Kovacs, A. Odasso, K. Davis, M. Primm, P. Das, Bryan E. Martin, D. Lalla

    Abstract: NASA Citizen Scientists have used Exoplanet Transit Interpretation Code (EXOTIC) to reduce 40 sets of time-series images of WASP-12 taken by privately owned telescopes and a 6-inch telescope operated by the Center for Astrophysics | Harvard & Smithsonian MicroObservatory (MOBs). Of these sets, 24 result in clean transit light curves of WASP-12 b which are included in the NASA Exoplanet Watch websi…

    Submitted 10 November, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: https://app.aavso.org/jaavso/article/3901/

    Journal ref: JAAVSO Volume 51 number 2 (2023)

  22. arXiv:2302.13678  [pdf, other]

    cs.SD cs.AI eess.AS

    A Comparative Analysis Of Latent Regressor Losses For Singing Voice Conversion

    Authors: Brendan O'Connor, Simon Dixon

    Abstract: Previous research has shown that established techniques for spoken voice conversion (VC) do not perform as well when applied to singing voice conversion (SVC). We propose an alternative loss component in a loss function that is otherwise well-established among VC tasks, which has been shown to improve our model's SVC performance. We first trained a singer identity embedding (SIE) network on mel-sp…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Submitted to the Sound and Music Computing Conference 2023

  23. arXiv:2208.11671  [pdf, other]

    cs.SD cs.CL eess.AS

    Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model

    Authors: Yixiao Zhang, Junyan Jiang, Gus Xia, Simon Dixon

    Abstract: Lyric interpretations can help people understand songs and their lyrics quickly, and can also make it easier to manage, retrieve and discover songs efficiently from the growing mass of music archives. In this paper we propose BART-fusion, a novel model for generating lyric interpretations from lyrics and music audio that combines a large-scale pre-trained language model with an audio encoder. We e…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: Accepted to ISMIR 2022

  24. arXiv:2207.07645  [pdf, other]

    astro-ph.CO cs.LG

    A Probabilistic Autoencoder for Type Ia Supernovae Spectral Time Series

    Authors: George Stein, Uros Seljak, Vanessa Bohm, G. Aldering, P. Antilogus, C. Aragon, S. Bailey, C. Baltay, S. Bongard, K. Boone, C. Buton, Y. Copin, S. Dixon, D. Fouchez, E. Gangler, R. Gupta, B. Hayden, W. Hillebrandt, M. Karmen, A. G. Kim, M. Kowalski, D. Kusters, P. F. Leget, F. Mondon, J. Nordin, et al. (15 additional authors not shown)

    Abstract: We construct a physically-parameterized probabilistic autoencoder (PAE) to learn the intrinsic diversity of type Ia supernovae (SNe Ia) from a sparse set of spectral time series. The PAE is a two-stage generative model, composed of an Auto-Encoder (AE) which is interpreted probabilistically after training using a Normalizing Flow (NF). We demonstrate that the PAE learns a low-dimensional latent sp…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: 23 pages, 8 Figures, 1 Table. Accepted to ApJ

  25. arXiv:2205.05871  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio

    Authors: Yin-Jyun Luo, Sebastian Ewert, Simon Dixon

    Abstract: Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentangleme…

    Submitted 14 June, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: The paper is accepted to IJCAI 2022

  26. arXiv:2204.08822  [pdf, other]

    cs.SD cs.AI eess.AS

    A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization

    Authors: Ruchit Agrawal, Daniel Wolff, Simon Dixon

    Abstract: Performance-score synchronization is an integral task in signal processing, which entails generating an accurate mapping between an audio recording of a performance and the corresponding musical score. Traditional synchronization methods compute alignment using knowledge-driven and stochastic approaches, and are typically unable to generalize well to different domains and modalities. We present a…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: Published in IEEE Signal Processing Letters, Volume 29, December 2021

  27. arXiv:2112.00410  [pdf, other]

    cs.CV cs.LG

    Rethink, Revisit, Revise: A Spiral Reinforced Self-Revised Network for Zero-Shot Learning

    Authors: Zhe Liu, Yun Li, Lina Yao, Julian McAuley, Sam Dixon

    Abstract: Current approaches to Zero-Shot Learning (ZSL) struggle to learn generalizable semantic knowledge capable of capturing complex correlations. Inspired by Spiral Curriculum, which enhances learning processes by revisiting knowledge, we propose a form of spiral learning which revisits visual representations based on a sequence of attribute groups (e.g., a combined group of color and …

    Submitted 1 December, 2021; originally announced December 2021.

  28. arXiv:2111.08839  [pdf, other]

    cs.SD eess.AS

    Zero-shot Singing Technique Conversion

    Authors: Brendan O'Connor, Simon Dixon, George Fazekas

    Abstract: In this paper we propose modifications to the neural network framework AutoVC for the task of singing technique conversion. This includes utilising a pretrained singing technique encoder which extracts technique information, upon which a decoder is conditioned during training. By swapping out a source singer's technique information for that of the target's during conversion, the input spectrogram…

    Submitted 16 November, 2021; originally announced November 2021.

    Comments: In Proceedings of the 15th International Symposium on Computer Music Multidisciplinary Research (CMMR 2021), Tokyo, Japan, November 15-16, 2021

  29. An Exploratory Study on Perceptual Spaces of the Singing Voice

    Authors: Brendan O'Connor, Simon Dixon, George Fazekas

    Abstract: Sixty participants provided dissimilarity ratings between various singing techniques. Multidimensional scaling, class averaging and clustering techniques were used to analyse timbral spaces and how they change between different singers, genders and registers. Clustering analysis showed that ground-truth similarity and silhouette scores were not significantly different between gender or regist…

    Submitted 15 November, 2021; originally announced November 2021.

    Comments: In Proceedings of the 2020 Joint Conference on AI Music Creativity (CSMC-MuMe 2020), Stockholm, Sweden, October 15-19, 2020

  30. arXiv:2108.02625  [pdf, other]

    cs.SD cs.CL cs.IR eess.AS

    MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription

    Authors: Emir Demirel, Sven Ahlbäck, Simon Dixon

    Abstract: This paper makes several contributions to automatic lyrics transcription (ALT) research. Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net, which processes the temporal information using multiple streams in parallel with varying resolutions keeping the network more compact, and thus with a faster inference and an improve…

    Submitted 5 August, 2021; originally announced August 2021.

  31. arXiv:2107.13617  [pdf, other]

    cs.SD cs.IR cs.LG cs.NE eess.AS

    Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck

    Abstract: This paper proposes a deep convolutional neural network for performing note-level instrument assignment. Given a polyphonic multi-instrumental music signal along with its ground truth or predicted notes, the objective is to assign an instrumental source for each note. This problem is addressed as a pitch-informed classification task where each note is analysed individually. We also propose to util…

    Submitted 28 July, 2021; originally announced July 2021.

    Comments: 4 figures, 4 tables and 7 pages. Accepted for publication at ISMIR Conference 2021

  32. arXiv:2106.10977  [pdf, other]

    cs.IR cs.SD eess.AS

    Computational Pronunciation Analysis in Sung Utterances

    Authors: Emir Demirel, Sven Ahlbäck, Simon Dixon

    Abstract: Recent automatic lyrics transcription (ALT) approaches focus on building stronger acoustic models or in-domain language models, while the pronunciation aspect is seldom touched upon. This paper applies a novel computational analysis on the pronunciation variances in sung utterances and further proposes a new pronunciation model adapted for singing. The singing-adapted model is tested on multiple p…

    Submitted 21 June, 2021; originally announced June 2021.

  33. arXiv:2102.09202  [pdf, other]

    cs.SD eess.AS

    Low Resource Audio-to-Lyrics Alignment From Polyphonic Music Recordings

    Authors: Emir Demirel, Sven Ahlbäck, Simon Dixon

    Abstract: Lyrics alignment in long music recordings can be memory exhaustive when performed in a single pass. In this study, we present a novel method that performs audio-to-lyrics alignment with a low memory consumption footprint regardless of the duration of the music recording. The proposed system first spots the anchoring words within the audio signal. With respect to these anchors, the recording is the…

    Submitted 18 February, 2021; originally announced February 2021.

  34. arXiv:2102.00382  [pdf, other]

    cs.SD cs.LG eess.AS

    Structure-Aware Audio-to-Score Alignment using Progressively Dilated Convolutional Neural Networks

    Authors: Ruchit Agrawal, Daniel Wolff, Simon Dixon

    Abstract: The identification of structural differences between a music performance and the score is a challenging yet integral step of audio-to-score alignment, an important subtask of music information retrieval. We present a novel method to detect such differences between the score and performance for a given piece of music using progressively dilated convolutional neural networks. Our method incorporates…

    Submitted 13 February, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

    Comments: ICASSP 2021 camera-ready version. Copyrights belong to IEEE

  35. Adversarial Unsupervised Domain Adaptation for Harmonic-Percussive Source Separation

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck, Patrik Ohlsson

    Abstract: This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-training in one domain with fine-tuning in another. We pr…

    Submitted 3 January, 2021; originally announced January 2021.

    Comments: 5 pages, 2 figures and 1 table. Accepted for publication in IEEE Signal Processing Letters

  36. arXiv:2011.07546  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment

    Authors: Ruchit Agrawal, Simon Dixon

    Abstract: Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features, which cannot be adapted to different acoustic conditions. We propose a method to overcome this limitation using learned frame similarity for audio-to-score alignment. We focus…

    Submitted 15 November, 2020; originally announced November 2020.

    Comments: Accepted at EUSIPCO 2020

  37. arXiv:2007.14333  [pdf, other]

    eess.AS cs.LG cs.SD

    A Hybrid Approach to Audio-to-Score Alignment

    Authors: Ruchit Agrawal, Simon Dixon

    Abstract: Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features. We explore the usage of neural networks as a preprocessing step for DTW-based automatic alignment methods. Experiments on music data from different acoustic conditions demonstr…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: ML4MD at ICML 2019

  38. arXiv:2007.06486  [pdf, other]

    eess.AS cs.CL cs.LG

    Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention

    Authors: Emir Demirel, Sven Ahlbäck, Simon Dixon

    Abstract: Speech recognition is a well-developed research field, so that current state-of-the-art systems are being used in many applications in the software industry, yet as of today there still does not exist such a robust system for the recognition of words and sentences from singing voice. This paper proposes a complete pipeline for this task, which may commonly be referred to as automatic lyrics transcri…

    Submitted 24 July, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

  39. arXiv:2005.07788  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Reliable Local Explanations for Machine Listening

    Authors: Saumitra Mishra, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

    Abstract: One way to analyse the behaviour of machine learning models is through local explanations that highlight input features that maximally influence model predictions. Sensitivity analysis, which involves analysing the effect of input perturbations on model predictions, is one of the methods to generate local explanations. Meaningful input perturbations are essential for generating reliable explanatio…

    Submitted 15 May, 2020; originally announced May 2020.

    Comments: 8 pages plus references. Accepted at the IJCNN 2020 Special Session on Explainable Computational/Artificial Intelligence. Camera-ready version

  40. Embedded Large-Scale Handwritten Chinese Character Recognition

    Authors: Youssouf Chherawala, Hans J. G. A. Dolfing, Ryan S. Dixon, Jerome R. Bellegarda

    Abstract: As handwriting input becomes more prevalent, the large symbol inventory required to support Chinese handwriting recognition poses unique challenges. This paper describes how the Apple deep learning recognition system can accurately handle up to 30,000 Chinese characters while running in real-time across a range of mobile devices. To achieve acceptable accuracy, we paid particular attention to data…

    Submitted 13 April, 2020; originally announced April 2020.

    Comments: 5 pages, 7 figures

  41. arXiv:1911.06393  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling

    Authors: Daniel Stoller, Mi Tian, Sebastian Ewert, Simon Dixon

    Abstract: Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolution…

    Submitted 14 November, 2019; originally announced November 2019.

    Comments: Code available at https://github.com/f90/Seq-U-Net

  42. arXiv:1905.12660  [pdf, other

    cs.LG stat.ML

    Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators

    Authors: Daniel Stoller, Sebastian Ewert, Simon Dixon

    Abstract: Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting. However, they typically require large datasets, which are often not available, especially in the context of prediction tasks such as image segmentation that require labels. Therefore, methods such as the CycleGAN use more easily available unlabelled data, but do not offer a way…

    Submitted 30 January, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: 10 pages plus 14 pages appendix. Accepted at the International Conference on Learning Representations (ICLR) 2020. Camera-ready submission. Implementation available at https://github.com/f90/FactorGAN

  43. arXiv:1905.01899  [pdf, other

    cs.SD eess.AS

    Investigating kernel shapes and skip connections for deep learning-based harmonic-percussive separation

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck

    Abstract: In this paper we propose an efficient deep learning encoder-decoder network for performing Harmonic-Percussive Source Separation (HPSS). It is shown that we are able to greatly reduce the number of model trainable parameters by using a dense arrangement of skip connections between the model layers. We also explore the utilisation of different kernel sizes for the 2D filters of the convolutional la…

    Submitted 30 July, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

    Comments: Accepted for publication at WASPAA 2019, 5 pages, 5 figures

  44. arXiv:1904.09533  [pdf, other

    cs.LG cs.SD eess.AS stat.ML

    GAN-based Generation and Automatic Selection of Explanations for Neural Networks

    Authors: Saumitra Mishra, Daniel Stoller, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

    Abstract: One way to interpret trained deep neural networks (DNNs) is by inspecting characteristics that neurons in the model respond to, such as by iteratively optimising the model input (e.g., an image) to maximally activate specific neurons. However, this requires a careful selection of hyper-parameters to generate interpretable examples for each neuron of interest, and current methods rely on a manual,…

    Submitted 27 April, 2019; v1 submitted 20 April, 2019; originally announced April 2019.

    Comments: 8 pages plus references and appendix. Accepted at the ICLR 2019 Workshop "Safe Machine Learning: Specification, Robustness and Assurance". Camera-ready version. v2: Corrected page header

    Journal ref: SafeML Workshop at the International Conference on Learning Representations (ICLR) 2019

  45. arXiv:1806.03185  [pdf, other

    cs.SD eess.AS stat.ML

    Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

    Authors: Daniel Stoller, Sebastian Ewert, Simon Dixon

    Abstract: Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependant on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, em…

    Submitted 8 June, 2018; originally announced June 2018.

    Comments: 7 pages (1 for references), 4 figures, 3 tables. Appearing in the proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018) (camera-ready version). Implementation available at https://github.com/f90/Wave-U-Net

    Journal ref: 19th International Society for Music Information Retrieval Conference (ISMIR 2018)

  46. arXiv:1804.01650  [pdf, other

    cs.SD cs.LG eess.AS

    Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

    Authors: Daniel Stoller, Sebastian Ewert, Simon Dixon

    Abstract: A main challenge in applying deep learning to music processing is the availability of training data. One potential solution is Multi-task Learning, in which the model also learns to solve related auxiliary tasks on additional datasets to exploit their correlation. While intuitive in principle, it can be challenging to identify related tasks and construct the model to optimally share information be…

    Submitted 4 April, 2018; originally announced April 2018.

    Comments: 10 pages, 2 figures, accepted for the 14th International Conference on Latent Variable Analysis and Signal Separation

  47. arXiv:1802.05178  [pdf, other

    cs.MM cs.SD eess.AS

    Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encoders

    Authors: Adib Mehrabi, Keunwoo Choi, Simon Dixon, Mark Sandler

    Abstract: The expressive nature of the voice provides a powerful medium for communicating sonic ideas, motivating recent research on methods for query by vocalisation. Meanwhile, deep learning methods have demonstrated state-of-the-art results for matching vocal imitations to imitated sounds, yet little is known about how well learned features represent the perceptual similarity between vocalisations and qu…

    Submitted 14 February, 2018; originally announced February 2018.

    Comments: ICASSP 2018 camera-ready

  48. arXiv:1711.00048  [pdf, other

    cs.LG cs.SD

    Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction

    Authors: Daniel Stoller, Sebastian Ewert, Simon Dixon

    Abstract: The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only few datasets available, often extensive data augmentation is used to combat overfitting. Mixing random tracks, however, can even reduce separation performance as instruments in real music are strongly correlated. Th…

    Submitted 6 April, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

    Comments: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Implementation available at https://github.com/f90/AdversarialAudioSeparation

    ACM Class: H.5.5; I.2.6

  49. Note Value Recognition for Piano Transcription Using Markov Random Fields

    Authors: Eita Nakamura, Kazuyoshi Yoshii, Simon Dixon

    Abstract: This paper presents a statistical method for use in music transcription that can estimate score times of note onsets and offsets from polyphonic MIDI performance signals. Because performed note durations can deviate largely from score-indicated values, previous methods had the problem of not being able to accurately estimate offset score times (or note values) and thus could only output incomplete…

    Submitted 7 July, 2017; v1 submitted 23 March, 2017; originally announced March 2017.

    Comments: 13 pages, 16 figures, version accepted to IEEE/ACM TASLP, minor revision

  50. arXiv:1604.08516  [pdf, other

    cs.SD

    Robust Joint Alignment of Multiple Versions of a Piece of Music

    Authors: Siying Wang, Sebastian Ewert, Simon Dixon

    Abstract: Large music content libraries often comprise multiple versions of a piece of music. To establish a link between different versions, automatic music alignment methods map each position in one version to a corresponding position in another version. Due to the leeway in interpreting a piece, any two versions can differ significantly, for example, in terms of local tempo, articulation, or playing styl…

    Submitted 28 April, 2016; originally announced April 2016.

    Comments: International Society for Music Information Retrieval Conference (ISMIR)

    Journal ref: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, pp. 83-88, 2014