-
Exploiting Parallel Audio Recordings to Enforce Device Invariance in CNN-based Acoustic Scene Classification
Authors:
Paul Primus,
Hamid Eghbal-zadeh,
David Eitelsebner,
Khaled Koutini,
Andreas Arzt,
Gerhard Widmer
Abstract:
Distribution mismatches between the data seen at training and at application time remain a major challenge in all application areas of machine learning. We study this problem in the context of machine listening (Task 1b of the DCASE 2019 Challenge). We propose a novel approach to learn domain-invariant classifiers in an end-to-end fashion by enforcing equal hidden layer representations for domain-parallel samples, i.e. time-aligned recordings from different recording devices. No classification labels are needed for our domain adaptation (DA) method, which makes the data collection process cheaper.
Submitted 4 September, 2019;
originally announced September 2019.
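The idea of enforcing equal hidden layer representations for domain-parallel samples can be sketched as an extra penalty added to the usual classification loss. The following is a minimal illustration only: the mean squared difference, the function names, and the `weight` parameter are assumptions made for this sketch and are not taken from the paper.

```python
def parallel_representation_loss(hidden_a, hidden_b):
    """Mean squared difference between the hidden activations of two
    time-aligned recordings of the same scene (one per device).
    Assumption: the paper may use a different distance."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("parallel samples must have equally sized representations")
    return sum((a - b) ** 2 for a, b in zip(hidden_a, hidden_b)) / len(hidden_a)


def total_loss(classification_loss, hidden_a, hidden_b, weight=1.0):
    """Classification loss plus the weighted invariance penalty.
    `weight` is a hypothetical trade-off hyper-parameter."""
    return classification_loss + weight * parallel_representation_loss(hidden_a, hidden_b)


# Identical representations incur no penalty; differing ones do.
same = parallel_representation_loss([1.0, 2.0], [1.0, 2.0])
different = parallel_representation_loss([0.0, 0.0], [1.0, 3.0])
combined = total_loss(0.5, [0.0, 0.0], [1.0, 3.0], weight=0.1)
```

Note that the penalty term uses only the paired activations, which is consistent with the abstract's point that the parallel recordings need no classification labels.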
-
Learning Complex Basis Functions for Invariant Representations of Audio
Authors:
Stefan Lattner,
Monika Dörfler,
Andreas Arzt
Abstract:
Learning features from data has been shown to be more successful than using hand-crafted features for many machine learning tasks. In music information retrieval (MIR), features learned from windowed spectrograms are highly variant to transformations like transposition or time-shift. Such variances are undesirable when they are irrelevant for the respective MIR task. We propose an architecture called Complex Autoencoder (CAE) which learns features invariant to orthogonal transformations. Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant "magnitude space" and a transformation-variant "phase space". The phase space is useful to infer transformations between data pairs. When exploiting the invariance property of the magnitude space, we achieve state-of-the-art results in audio-to-score alignment and repeated section discovery for audio. A PyTorch implementation of the CAE, including the repeated section discovery method, is available online.
Submitted 12 July, 2019;
originally announced July 2019.
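The split into a transformation-invariant magnitude space and a transformation-variant phase space has a familiar fixed-basis analogue: the discrete Fourier transform under circular time shifts. The sketch below is an illustration of that analogue only. It is not the CAE, whose basis functions are learned, and the example signal and the shift of 3 samples are arbitrary assumptions.

```python
import cmath


def dft(signal):
    """Project a real-valued signal onto the complex Fourier basis."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


signal = [0.0, 1.0, 0.5, -0.25, 0.0, 0.0, 2.0, -1.0]
shifted = signal[-3:] + signal[:-3]  # circular shift by 3 samples

coeffs = dft(signal)
shifted_coeffs = dft(shifted)

# Magnitudes agree up to rounding error: the "magnitude space" ignores the shift.
magnitude_gap = max(abs(abs(a) - abs(b)) for a, b in zip(coeffs, shifted_coeffs))

# The complex coefficients themselves differ: the shift lives in the phases.
coefficient_gap = max(abs(a - b) for a, b in zip(coeffs, shifted_coeffs))
```

In this fixed-basis case the shift only rotates the phase of each coefficient, which is why it could in principle be inferred by comparing the phases of the two signals.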
-
Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
Authors:
Stefan Balke,
Matthias Dorfer,
Luis Carvalho,
Andreas Arzt,
Gerhard Widmer
Abstract:
Connecting large libraries of digitized audio recordings to their corresponding sheet music images has long been a motivation for researchers to develop new cross-modal retrieval systems. In recent years, retrieval systems based on embedding space learning with deep neural networks got a step closer to fulfilling this vision. However, global and local tempo deviations in the music recordings still require careful tuning of the amount of temporal context given to the system. In this paper, we address this problem by introducing an additional soft-attention mechanism on the audio input. Quantitative and qualitative results on synthesized piano data indicate that this attention increases the robustness of the retrieval system by focusing on different parts of the input representation based on the tempo of the audio. Encouraged by these results, we argue for the potential of attention models as a very general tool for many MIR tasks.
Submitted 26 June, 2019;
originally announced June 2019.
-
Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies
Authors:
Meinard Müller,
Andreas Arzt,
Stefan Balke,
Matthias Dorfer,
Gerhard Widmer
Abstract:
There has been a rapid growth of digitally available music data, including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal retrieval algorithms that, given a query in one modality (e.g., a short audio excerpt), find corresponding information and entities in other modalities (e.g., the name of the piece and the sheet music). This goes beyond exact audio identification and subsequent retrieval of metainformation as performed by commercial applications like Shazam [1].
Submitted 12 February, 2019;
originally announced February 2019.
-
Audio-to-Score Alignment using Transposition-invariant Features
Authors:
Andreas Arzt,
Stefan Lattner
Abstract:
Audio-to-score alignment is an important pre-processing step for in-depth analysis of classical music. In this paper, we apply novel transposition-invariant audio features to this task. These low-dimensional features represent local pitch intervals and are learned in an unsupervised fashion by a gated autoencoder. Our results show that the proposed features are indeed fully transposition-invariant and enable accurate alignments between transposed scores and performances. Furthermore, they can even outperform widely used features for audio-to-score alignment on "untransposed data", and thus are a viable and more flexible alternative to well-established features for music alignment and matching.
Submitted 19 July, 2018;
originally announced July 2018.
-
The ACCompanion v0.1: An Expressive Accompaniment System
Authors:
Carlos Cancino-Chacón,
Martin Bonev,
Amaury Durand,
Maarten Grachten,
Andreas Arzt,
Laura Bishop,
Werner Goebl,
Gerhard Widmer
Abstract:
In this paper we present a preliminary version of the ACCompanion, an expressive accompaniment system for MIDI input. The system uses a probabilistic monophonic score follower to track the position of the soloist in the score, and a linear Gaussian model to compute tempo updates. The expressiveness of the system is powered by the Basis-Mixer, a state-of-the-art computational model of expressive music performance. The system allows for expressive dynamics, timing and articulation.
Submitted 7 November, 2017;
originally announced November 2017.
-
Aktuelle Entwicklungen in der Automatischen Musikverfolgung
Authors:
Andreas Arzt,
Matthias Dorfer
Abstract:
In this paper we present current trends in real-time music tracking (a.k.a. score following). Casually speaking, these algorithms "listen" to a live performance of music, compare the audio signal to an abstract representation of the score, and "read" along in the sheet music. In this way at any given time the exact position of the musician(s) in the sheet music is computed. Here, we focus on the aspects of flexibility and usability of these algorithms. This comprises work on automatic identification and flexible tracking of the piece being played as well as current approaches based on Deep Learning. The latter enables direct learning of correspondences between complex audio data and images of the sheet music, avoiding the complicated and time-consuming definition of a mid-level representation.
-----
This paper deals with current developments in automatic music tracking by computer. The algorithms in question "listen" to a musical performance, compare the recorded audio signal with an (abstract) representation of the score, and read along in it, so to speak. The algorithm thus knows the position of the musicians in the score at every point in time. Besides giving a general overview, the paper focuses on the flexibility and easier usability of these algorithms. It describes which steps have been taken (and are currently being taken) to make automatic music tracking more accessible. This includes work on the automatic identification of the pieces being played and their flexible tracking, as well as current approaches based on Deep Learning that make it possible to connect image and sound directly, without detours via abstract intermediate representations that are very time-consuming to create.
Submitted 7 August, 2017;
originally announced August 2017.
-
Piece Identification in Classical Piano Music Without Reference Scores
Authors:
Andreas Arzt,
Gerhard Widmer
Abstract:
In this paper we describe an approach to identify the name of a piece of piano music, based on a short audio excerpt of a performance. Given only a description of the pieces in text format (i.e. no score information is provided), a reference database is automatically compiled by acquiring a number of audio representations (performances of the pieces) from internet sources. These are transcribed, preprocessed, and used to build a reference database via a robust symbolic fingerprinting algorithm, which in turn is used to identify new, incoming queries. The main challenge is the amount of noise that is introduced into the identification process by the music transcription algorithm and the automatic (but possibly suboptimal) choice of performances to represent a piece in the reference database. In a number of experiments we show how to improve the identification performance by increasing redundancy in the reference database and by using a preprocessing step to rate the reference performances regarding their suitability as a representation of the pieces in question. As the results show this approach leads to a robust system that is able to identify piano music with high accuracy -- without any need for data annotation or manual data preparation.
Submitted 2 August, 2017;
originally announced August 2017.
-
Learning Audio - Sheet Music Correspondences for Score Identification and Offline Alignment
Authors:
Matthias Dorfer,
Andreas Arzt,
Gerhard Widmer
Abstract:
This work addresses the problem of matching short excerpts of audio with their respective counterparts in sheet music images. We show how to employ neural network-based cross-modality embedding spaces for solving the following two sheet music-related tasks: retrieving the correct piece of sheet music from a database when given a music audio as a search query; and aligning an audio recording of a piece with the corresponding images of sheet music. We demonstrate the feasibility of this in experiments on classical piano music by five different composers (Bach, Haydn, Mozart, Beethoven and Chopin), and additionally provide a discussion on why we expect multi-modal neural networks to be a fruitful paradigm for dealing with sheet music and audio at the same time.
Submitted 31 July, 2017;
originally announced July 2017.
-
Modeling Harmony with Skip-Grams
Authors:
David R. W. Sears,
Andreas Arzt,
Harald Frostel,
Reinhard Sonnleitner,
Gerhard Widmer
Abstract:
String-based (or viewpoint) models of tonal harmony often struggle with data sparsity in pattern discovery and prediction tasks, particularly when modeling composite events like triads and seventh chords, since the number of distinct n-note combinations in polyphonic textures is potentially enormous. To address this problem, this study examines the efficacy of skip-grams in music research, an alternative viewpoint method developed in corpus linguistics and natural language processing that includes sub-sequences of n events (or n-grams) in a frequency distribution if their constituent members occur within a certain number of skips.
Using a corpus consisting of four datasets of Western classical music in symbolic form, we found that including skip-grams reduces data sparsity in n-gram distributions by (1) minimizing the proportion of n-grams with negligible counts, and (2) increasing the coverage of contiguous n-grams in a test corpus. What is more, skip-grams significantly outperformed contiguous n-grams in discovering conventional closing progressions (called cadences).
Submitted 18 July, 2017; v1 submitted 14 July, 2017;
originally announced July 2017.
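The skip-gram idea described in this abstract can be sketched in a few lines. This is a minimal illustration under one stated assumption: "within a certain number of skips" is interpreted here as at most `max_skip` skipped events between adjacent members of the n-gram, which may differ from the paper's exact definition. The function name and the toy chord sequence are hypothetical.

```python
from collections import Counter


def skip_grams(events, n, max_skip):
    """Yield n-tuples whose adjacent members are separated by at most
    `max_skip` skipped events. With max_skip=0 these are the ordinary
    contiguous n-grams."""
    def extend(prefix, last_index):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # The next member may sit 1 .. max_skip + 1 positions after the last one.
        stop = min(last_index + max_skip + 2, len(events))
        for nxt in range(last_index + 1, stop):
            yield from extend(prefix + [events[nxt]], nxt)

    for start in range(len(events)):
        yield from extend([events[start]], start)


chords = ["I", "IV", "V", "I"]  # toy sequence, not data from the paper
contiguous = Counter(skip_grams(chords, n=2, max_skip=0))
with_skips = Counter(skip_grams(chords, n=2, max_skip=1))
```

On this toy sequence, allowing one skip adds the pairs `("I", "V")` and `("IV", "I")` to the frequency distribution. Admitting such non-adjacent pairs is how skip-grams raise the counts of patterns that contiguous n-grams would miss.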
-
On the Potential of Simple Framewise Approaches to Piano Transcription
Authors:
Rainer Kelz,
Matthias Dorfer,
Filip Korzeniowski,
Sebastian Böck,
Andreas Arzt,
Gerhard Widmer
Abstract:
In an attempt at exploring the limitations of simple approaches to the task of piano transcription (as usually defined in MIR), we conduct an in-depth analysis of neural network-based framewise transcription. We systematically compare different popular input representations for transcription systems to determine the ones most suitable for use with neural networks. Exploiting recent advances in training techniques and new regularizers, and taking into account hyper-parameter tuning, we show that it is possible, by simple bottom-up frame-wise processing, to obtain a piano transcriber that outperforms the current published state of the art on the publicly available MAPS dataset -- without any complex post-processing steps. Thus, we propose this simple approach as a new baseline for this dataset, for future transcription research to build on and improve.
Submitted 15 December, 2016;
originally announced December 2016.
-
Live Score Following on Sheet Music Images
Authors:
Matthias Dorfer,
Andreas Arzt,
Sebastian Böck,
Amaury Durand,
Gerhard Widmer
Abstract:
In this demo we show a novel approach to score following. Instead of relying on some symbolic representation, we are using a multi-modal convolutional neural network to match the incoming audio stream directly to sheet music images. This approach is in an early stage and should be seen as proof of concept. Nonetheless, the audience will have the opportunity to test our implementation themselves via 3 simple piano pieces.
Submitted 15 December, 2016;
originally announced December 2016.
-
Towards End-to-End Audio-Sheet-Music Retrieval
Authors:
Matthias Dorfer,
Andreas Arzt,
Gerhard Widmer
Abstract:
This paper demonstrates the feasibility of learning to retrieve short snippets of sheet music (images) when given a short query excerpt of music (audio) -- and vice versa --, without any symbolic representation of music or scores. This would be highly useful in many content-based musical retrieval scenarios. Our approach is based on Deep Canonical Correlation Analysis (DCCA) and learns correlated latent spaces allowing for cross-modality retrieval in both directions. Initial experiments with relatively simple monophonic music show promising results.
Submitted 15 December, 2016;
originally announced December 2016.
-
Towards Score Following in Sheet Music Images
Authors:
Matthias Dorfer,
Andreas Arzt,
Gerhard Widmer
Abstract:
This paper addresses the matching of short music audio snippets to the corresponding pixel location in images of sheet music. A system is presented that simultaneously learns to read notes, listens to music and matches the currently played music to its corresponding notes in the sheet. It consists of an end-to-end multi-modal convolutional neural network that takes as input images of sheet music and spectrograms of the respective audio snippets. It learns to predict, for a given unseen audio snippet (covering approximately one bar of music), the corresponding position in the respective score line. Our results suggest that with the use of (deep) neural networks -- which have proven to be powerful image processing models -- working with sheet music becomes feasible and a promising future research direction.
Submitted 15 December, 2016;
originally announced December 2016.