Showing 1–15 of 15 results for author: Schmid, F

  1. arXiv:2505.07609  [pdf, other]

    eess.AS cs.LG cs.SD

    TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

    Authors: Paul Primus, Florian Schmid, Gerhard Widmer

    Abstract: Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-lik…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: submitted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Dataset (Zenodo): https://zenodo.org/records/15379789, Implementation (GitHub): https://github.com/OptimusPrimus/tacos

  2. arXiv:2505.01747  [pdf, other]

    eess.AS cs.SD

    Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

    Authors: Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Gerhard Widmer

    Abstract: This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge and its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022–2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the devel…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Task Description Page: https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information

  3. arXiv:2503.11373  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Exploring Performance-Complexity Trade-Offs in Sound Event Detection Models

    Authors: Tobias Morocutti, Florian Schmid, Jonathan Greif, Francesco Foscarin, Gerhard Widmer

    Abstract: We target the problem of developing new low-complexity networks for the sound event detection task. Our goal is to meticulously analyze the performance-complexity trade-off, aiming to be competitive with the large state-of-the-art models, at a fraction of the computational requirements. We find that low-complexity convolutional models previously proposed for audio tagging can be effectively adapte…

    Submitted 12 June, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

    Comments: In Proceedings of the 33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy

  4. arXiv:2503.11363  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification

    Authors: Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: Knowledge Distillation (KD) is a widespread technique for compressing the knowledge of large models into more compact and efficient models. KD has proved to be highly effective in building well-performing low-complexity Acoustic Scene Classification (ASC) systems and was used in all the top-ranked submissions to this task of the annual DCASE challenge in the past three years. There is extensive re…

    Submitted 14 March, 2025; originally announced March 2025.

  5. arXiv:2409.09546  [pdf, other]

    eess.AS cs.SD

    Effective Pre-Training of Audio Transformers for Sound Event Detection

    Authors: Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer

    Abstract: We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance imp…

    Submitted 28 November, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP'25. Source code available: https://github.com/fschmid56/PretrainedSED

  6. arXiv:2408.11641  [pdf, other]

    eess.AS cs.LG cs.SD

    Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

    Authors: Paul Primus, Florian Schmid, Gerhard Widmer

    Abstract: Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to cre…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: In Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE, Tokyo, Japan, 2024. Implementation available on GitHub: https://github.com/OptimusPrimus/salsa

  7. arXiv:2408.11638  [pdf, other]

    eess.AS

    Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining

    Authors: Jonathan Greif, Florian Schmid, Paul Primus, Gerhard Widmer

    Abstract: Query-by-Vocal Imitation (QBV) is about searching audio files within databases using vocal imitations created by the user's voice. Since most humans can effectively communicate sound concepts through voice, QBV offers the more intuitive and convenient approach compared to text-based search. To fully leverage QBV, developing robust audio feature representations for both the vocal imitation and the…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Accepted to the DCASE Workshop 2024. Source code available: https://github.com/Jonathan-Greif/QBV

  8. arXiv:2408.00791  [pdf, other]

    eess.AS cs.SD

    Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

    Authors: Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

    Abstract: This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system set…

    Submitted 17 July, 2024; originally announced August 2024.

    Comments: Technical Report describing our system for DCASE2024 Challenge Task 4: https://dcase.community/challenge2024/task-sound-event-detection-with-heterogeneous-training-dataset-and-potentially-missing-labels-results Code: https://github.com/CPJKU/cpjku_dcase24. arXiv admin note: text overlap with arXiv:2407.12997

  9. arXiv:2407.12997  [pdf, other]

    eess.AS

    Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

    Authors: Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

    Abstract: A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage proce…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Code: https://github.com/CPJKU/cpjku_dcase24

  10. arXiv:2405.10018  [pdf, other]

    eess.AS cs.SD

    Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

    Authors: Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer

    Abstract: This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop da…

    Submitted 17 July, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

    Comments: Task Description Page: https://dcase.community/challenge2024/task-data-efficient-low-complexity-acoustic-scene-classification

  11. arXiv:2310.15648  [pdf, other]

    cs.SD cs.LG eess.AS

    Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popula…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Source Code available at: https://github.com/fschmid56/EfficientAT

  12. arXiv:2305.07499  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Device-Robust Acoustic Scene Classification via Impulse Response Augmentation

    Authors: Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely wh…

    Submitted 27 June, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: In Proceedings of the 31st European Signal Processing Conference, EUSIPCO 2023. Source Code available at: https://github.com/theMoro/DIRAugmentation/

  13. arXiv:2303.01879  [pdf, other]

    cs.SD eess.AS

    Low-Complexity Audio Embedding Extractors

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: Solving tasks such as speaker recognition, music classification, or semantic audio event tagging with deep learning models typically requires computationally demanding networks. General-purpose audio embeddings (GPAEs) are dense representations of audio signals that allow lightweight, shallow classifiers to tackle various audio tasks. The idea is that a single complex feature extractor would extra…

    Submitted 23 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: In Proceedings of the 31st European Signal Processing Conference, EUSIPCO 2023. Source Code available at: https://github.com/fschmid56/EfficientAT_HEAR

  14. arXiv:2211.13956  [pdf, other]

    cs.SD cs.LG eess.AS

    Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

    Authors: Khaled Koutini, Shahed Masoudian, Florian Schmid, Hamid Eghbal-zadeh, Jan Schlüter, Gerhard Widmer

    Abstract: The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features, and learning a representation that can generalize onto unseen tasks and datasets that are from the same domain. Hence, these models can be used as powerful feature ex…

    Submitted 2 March, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: Will appear in HEAR: Holistic Evaluation of Audio Representations, Proceedings of Machine Learning Research (PMLR) 166. Source code: https://github.com/kkoutini/passt_hear21

    Journal ref: Proceedings of Machine Learning Research v166 (2022) 65-89

  15. arXiv:2211.04772  [pdf, other]

    cs.SD cs.LG eess.AS

    Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient…

    Submitted 23 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023. Source Code available at: https://github.com/fschmid56/EfficientAT