Showing 1–14 of 14 results for author: Primus, P

  1. arXiv:2505.07609

    eess.AS cs.LG cs.SD

    TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

    Authors: Paul Primus, Florian Schmid, Gerhard Widmer

    Abstract: Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-lik…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: submitted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Dataset (Zenodo): https://zenodo.org/records/15379789, Implementation (GitHub): https://github.com/OptimusPrimus/tacos

  2. arXiv:2505.01747

    eess.AS cs.SD

    Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

    Authors: Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Gerhard Widmer

    Abstract: This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge and its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022–2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the devel…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Task Description Page: https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information

  3. arXiv:2409.09546

    eess.AS cs.SD

    Effective Pre-Training of Audio Transformers for Sound Event Detection

    Authors: Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer

    Abstract: We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance imp…

    Submitted 28 November, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP'25. Source code available: https://github.com/fschmid56/PretrainedSED

  4. arXiv:2408.11641

    eess.AS cs.LG cs.SD

    Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

    Authors: Paul Primus, Florian Schmid, Gerhard Widmer

    Abstract: Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to cre…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: In Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE, Tokyo, Japan, 2024. Implementation available on GitHub: https://github.com/OptimusPrimus/salsa

  5. arXiv:2408.00791

    eess.AS cs.SD

    Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

    Authors: Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

    Abstract: This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system set…

    Submitted 17 July, 2024; originally announced August 2024.

    Comments: Technical Report describing our system for DCASE2024 Challenge Task 4: https://dcase.community/challenge2024/task-sound-event-detection-with-heterogeneous-training-dataset-and-potentially-missing-labels-results Code: https://github.com/CPJKU/cpjku_dcase24. arXiv admin note: text overlap with arXiv:2407.12997

  6. arXiv:2406.15897

    eess.AS cs.LG cs.SD

    Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

    Authors: Paul Primus, Gerhard Widmer

    Abstract: Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadat…

    Submitted 2 July, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

    Comments: In Proceedings of the 32nd European Signal Processing Conference, EUSIPCO 2024

  7. arXiv:2405.10018

    eess.AS cs.SD

    Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

    Authors: Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer

    Abstract: This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop da…

    Submitted 17 July, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

    Comments: Task Description Page: https://dcase.community/challenge2024/task-data-efficient-low-complexity-acoustic-scene-classification

  8. arXiv:2308.04258

    eess.AS cs.IR cs.LG cs.SD

    Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

    Authors: Paul Primus, Khaled Koutini, Gerhard Widmer

    Abstract: This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two k…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: submitted to DCASE Workshop 2023

  9. arXiv:2208.11460

    cs.SD cs.LG eess.AS eess.SP

    Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

    Authors: Paul Primus, Gerhard Widmer

    Abstract: The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embed…

    Submitted 29 October, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: accepted at DCASE Workshop 2022

  10. arXiv:2208.11402

    cs.SD cs.LG eess.AS eess.SP

    Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

    Authors: Paul Primus, Gerhard Widmer

    Abstract: Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: published in EUSIPCO 2022

  11. arXiv:2011.02949

    eess.AS cs.LG cs.SD

    Anomalous Sound Detection as a Simple Binary Classification Problem with Careful Selection of Proxy Outlier Examples

    Authors: Paul Primus, Verena Haunschmid, Patrick Praher, Gerhard Widmer

    Abstract: Unsupervised anomalous sound detection is concerned with identifying sounds that deviate from what is defined as 'normal', without explicitly specifying the types of anomalies. A significant obstacle is the diversity and rareness of outliers, which typically prevent us from collecting a representative set of anomalous sounds. As a consequence, most anomaly detection methods use unsupervised rather…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: published in DCASE 2020 Workshop

  12. arXiv:2007.13503

    eess.AS cs.LG cs.SD

    Receptive-Field Regularized CNNs for Music Classification and Tagging

    Authors: Khaled Koutini, Hamid Eghbal-Zadeh, Verena Haunschmid, Paul Primus, Shreyan Chowdhury, Gerhard Widmer

    Abstract: Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as attention, and/or techniques such as pre-training on l…

    Submitted 27 July, 2020; originally announced July 2020.

  13. arXiv:2007.02650

    cs.LG stat.ML

    On Data Augmentation and Adversarial Risk: An Empirical Analysis

    Authors: Hamid Eghbal-zadeh, Khaled Koutini, Paul Primus, Verena Haunschmid, Michal Lewandowski, Werner Zellinger, Bernhard A. Moser, Gerhard Widmer

    Abstract: Data augmentation techniques have become standard practice in deep learning, as it has been shown to greatly improve the generalisation abilities of models. These techniques rely on different ideas such as invariance-preserving transformations (e.g., expert-defined augmentation), statistical heuristics (e.g., Mixup), and learning the data distribution (e.g., GANs). However, in the adversarial setting…

    Submitted 6 July, 2020; originally announced July 2020.

    Comments: 21 pages, 15 figures, 3 tables

  14. arXiv:1909.02869

    eess.AS cs.LG cs.SD stat.ML

    Exploiting Parallel Audio Recordings to Enforce Device Invariance in CNN-based Acoustic Scene Classification

    Authors: Paul Primus, Hamid Eghbal-zadeh, David Eitelsebner, Khaled Koutini, Andreas Arzt, Gerhard Widmer

    Abstract: Distribution mismatches between the data seen at training and at application time remain a major challenge in all application areas of machine learning. We study this problem in the context of machine listening (Task 1b of the DCASE 2019 Challenge). We propose a novel approach to learn domain-invariant classifiers in an end-to-end fashion by enforcing equal hidden layer representations for domain-…

    Submitted 4 September, 2019; originally announced September 2019.

    Comments: Published at the Workshop on Detection and Classification of Acoustic Scenes and Events, 25-26 October 2019, New York, USA