-
Comparative Study of Spike Encoding Methods for Environmental Sound Classification
Authors:
Andres Larroza,
Javier Naranjo-Alcazar,
Vicent Ortiz Castelló,
Pedro Zuccarello
Abstract:
Spiking Neural Networks (SNNs) offer a promising approach to reduce energy consumption and computational demands, making them particularly beneficial for embedded machine learning in edge applications. However, data from conventional digital sensors must first be converted into spike trains to be processed using neuromorphic computing technologies. The classification of environmental sounds presents unique challenges due to the high variability of frequencies, background noise, and overlapping acoustic events. Despite these challenges, most studies on spike-based audio encoding focus on speech processing, leaving non-speech environmental sounds underexplored. In this work, we conduct a comprehensive comparison of widely used spike encoding techniques, evaluating their effectiveness on the ESC-10 dataset. By understanding the impact of encoding choices on environmental sound processing, researchers and practitioners can select the most suitable approach for real-world applications such as smart surveillance, environmental monitoring, and industrial acoustic analysis. This study serves as a benchmark for spike encoding in environmental sound classification, providing a foundational reference for future research in neuromorphic audio processing.
Submitted 2 April, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
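As a rough illustration of what converting sampled sensor data into spike trains involves, the sketch below implements threshold-based (delta) encoding in plain Python. The abstract does not name the encoders that were compared, so the choice of this method, the function name `delta_encode`, and the threshold value are assumptions made purely for illustration.

```python
def delta_encode(signal, threshold):
    """Threshold-based (delta) spike encoding; an illustrative assumption,
    not necessarily one of the encoders evaluated in the paper.

    Emits +1 when the signal has risen by at least `threshold` above the
    running reference, -1 when it has fallen by at least `threshold` below
    it, and 0 otherwise. The reference moves by one threshold per spike.
    """
    if not signal:
        return []
    spikes = [0]          # no spike can be emitted for the first sample
    ref = signal[0]       # running reference level
    for x in signal[1:]:
        if x - ref >= threshold:
            spikes.append(1)
            ref += threshold
        elif ref - x >= threshold:
            spikes.append(-1)
            ref -= threshold
        else:
            spikes.append(0)
    return spikes
```

For example, `delta_encode([0.0, 0.5, 1.2, 1.0, 0.1, 0.1], 0.5)` yields `[0, 1, 1, 0, -1, 0]`: two upward spikes while the signal rises, one downward spike after the drop, and silence where the change stays below the threshold.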
-
A Data-Centric Framework for Machine Listening Projects: Addressing Large-Scale Data Acquisition and Labeling through Active Learning
Authors:
Javier Naranjo-Alcazar,
Jordi Grau-Haro,
Ruben Ribes-Serrano,
Pedro Zuccarello
Abstract:
Machine Listening focuses on developing technologies to extract relevant information from audio signals. A critical aspect of these projects is the acquisition and labeling of contextualized data, which is inherently complex and requires specific resources and strategies. Despite the availability of some audio datasets, many are unsuitable for commercial applications. The paper emphasizes the importance of Active Learning (AL) using expert labelers over crowdsourcing, which often lacks detailed insights into dataset structures. AL is an iterative process combining human labelers and AI models to optimize the labeling budget by intelligently selecting samples for human review. This approach addresses the challenge of handling large, constantly growing datasets that exceed available computational resources and memory. The paper presents a comprehensive data-centric framework for Machine Listening projects, detailing the configuration of recording nodes, database structure, and labeling budget optimization in resource-constrained scenarios. Applied to an industrial port in Valencia, Spain, the framework successfully labeled 6540 ten-second audio samples over five months with a small team, demonstrating its effectiveness and adaptability to various resource availability situations.
Acknowledgments: The participation of Javier Naranjo-Alcazar, Jordi Grau-Haro and Pedro Zuccarello in this research was funded by the Valencian Institute for Business Competitiveness (IVACE) and the FEDER funds by means of project Soroll-IA2 (IMDEEA/2023/91). The research carried out for this publication has been partially funded by the project STARRING-NEURO (PID2022-137048OA-C44) funded by the Ministry of Science, Innovation and Universities of Spain and the European Union.
Submitted 8 October, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
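The sample-selection step of an Active Learning loop can be sketched as follows. The abstract does not state which selection criterion the framework uses, so entropy-based uncertainty sampling, the function names, and the clip identifiers are all assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` clips whose model predictions are most uncertain.

    predictions: dict mapping a clip id to its list of class probabilities.
    Returns clip ids ordered from most to least uncertain.
    """
    ranked = sorted(predictions,
                    key=lambda clip_id: entropy(predictions[clip_id]),
                    reverse=True)
    return ranked[:budget]
```

In one iteration, the selected clips would be sent to the human labelers, the model retrained on the enlarged labeled set, and the selection repeated until the labeling budget is spent.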
-
Female mosquito detection by means of AI techniques inside release containers in the context of a Sterile Insect Technique program
Authors:
Javier Naranjo-Alcazar,
Jordi Grau-Haro,
David Almenar,
Pedro Zuccarello
Abstract:
The Sterile Insect Technique (SIT) is a biological pest control technique based on the release into the environment of sterile males of the insect species whose population is to be controlled. The entire SIT process involves mass-rearing within a biofactory, sorting of the specimens by sex, sterilization, and subsequent release of the sterile males into the environment. The release of female specimens is avoided because, unlike males, females bite, with the subsequent risk of disease transmission. In the case of Aedes mosquito biofactories for SIT, the key point of the whole process is sex separation. This process is nowadays performed by a combination of mechanical devices and AI-based vision systems. However, there is still a possibility of false negatives, so a last stage of verification is necessary before releasing the mosquitoes into the environment. It is known that the sound produced by the flapping of adult male mosquitoes is different from that produced by females, so this feature can be used to detect the presence of females in containers prior to environmental release. This paper presents a study for the detection of females in Aedes mosquito release vessels for SIT programs. The containers used have a tubular PVC design, 8.8 cm in diameter and 12.5 cm in height. The containers were placed in an experimental setup that allowed the recording of the sound of mosquito flight inside them. Each container was filled with 250 specimens, considering the cases of (i) only male mosquitoes, (ii) only female mosquitoes, and (iii) 75% males and 25% females. Case (i) was used for training and testing, whereas cases (ii) and (iii) were used only for testing. Two algorithms were implemented for the detection of female mosquitoes: an unsupervised outlier detection algorithm (iForest) and a one-class SVM trained with male-only recordings.
Submitted 31 May, 2024; v1 submitted 19 June, 2023;
originally announced June 2023.
-
DCASE 2022: Comparative Analysis Of CNNs For Acoustic Scene Classification Under Low-Complexity Considerations
Authors:
Josep Zaragoza-Paredes,
Javier Naranjo-Alcazar,
Valery Naranjo,
Pedro Zuccarello
Abstract:
Acoustic scene classification is an automatic listening problem that aims to assign an audio recording to a pre-defined scene based on its audio data. Over the years (and in past editions of DCASE) this problem has often been solved with techniques known as ensembles (use of several machine learning models to combine their predictions in the inference phase). While these solutions can perform well in terms of accuracy, they can be very expensive in terms of computational capacity, making it impossible to deploy them in IoT devices. Due to the drift in this field of study, this task has two limitations in terms of model complexity. It should be noted that there is also the added complexity of mismatching devices (the audio clips provided are recorded by different sources of information). This technical report makes a comparative study of two different network architectures: conventional CNN and Conv-mixer. Although both networks exceed the baseline required by the competition, the conventional CNN shows a higher performance, exceeding the baseline by 8 percentage points. Solutions based on Conv-mixer architectures show worse performance although they are much lighter solutions.
Submitted 16 June, 2022;
originally announced June 2022.
-
Task 1A DCASE 2021: Acoustic Scene Classification with mismatch-devices using squeeze-excitation technique and low-complexity constraint
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Maximo Cobos,
Francesc J. Ferri,
Pedro Zuccarello
Abstract:
Acoustic scene classification (ASC) is one of the most popular problems in the field of machine listening. The objective of this problem is to classify an audio clip into one of the predefined scenes using only the audio data. This problem has considerably progressed over the years in the different editions of DCASE. It usually has several subtasks that allow this problem to be tackled with different approaches. The subtask presented in this report corresponds to an ASC problem that is constrained by the complexity of the model as well as having audio recorded from different devices, known as mismatch devices (real and simulated). The work presented in this report follows the research line carried out by the team in previous years. Specifically, a system based on two steps is proposed: a two-dimensional representation of the audio using the Gammatone filter bank and a convolutional neural network using squeeze-excitation techniques. The presented system outperforms the baseline by about 17 percentage points.
Submitted 30 July, 2021;
originally announced July 2021.
-
TASK3 DCASE2021 Challenge: Sound event localization and detection using squeeze-excitation residual CNNs
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Pedro Zuccarello,
Francesc J. Ferri,
Maximo Cobos
Abstract:
Sound event localisation and detection (SELD) is a problem in the field of automatic listening that aims at the temporal detection and localisation (direction of arrival estimation) of sound events within an audio clip, usually of long duration. Due to the amount of data present in the datasets related to this problem, solutions based on deep learning have positioned themselves at the top of the state of the art. Most solutions are based on 2D representations of the audio (different spectrograms) that are processed by a convolutional-recurrent network. The motivation of this submission is to study the squeeze-excitation technique in the convolutional part of the network and how it improves the performance of the system. This study is based on the one carried out by the same team last year. This year, it has been decided to study how this technique improves the results on each of the datasets (last year only the MIC dataset was studied). This modification shows an improvement in the performance of the system compared to the baseline using the MIC dataset.
Submitted 30 July, 2021;
originally announced July 2021.
-
Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Aaron Lopez-Garcia,
Pedro Zuccarello,
Maximo Cobos,
Francesc J. Ferri
Abstract:
The use of multiple and semantically correlated sources can provide complementary information to each other that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks which are respectively trained in isolation on audio and visual data, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the fusion of information from the audio and visual streams is performed at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method using the recently published TAU Audio-Visual Urban Scenes 2021, which contains synchronized audio and video recordings from 12 European cities in 10 different scene classes. The proposed model has been shown to provide an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters) in the evaluation results of the DCASE 2021 Challenge.
Submitted 28 July, 2021;
originally announced July 2021.
-
Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation
Authors:
Sergi Perez-Castanos,
Javier Naranjo-Alcazar,
Pedro Zuccarello,
Maximo Cobos
Abstract:
Automated audio captioning is a machine listening task whose goal is to describe an audio clip using free text. An automated audio captioning system accepts an audio signal as input and outputs a textual description, that is, the caption of the signal. This task can be useful in many applications such as automatic content description or machine-to-machine interaction. In this work, an automatic audio captioning system based on residual learning in the encoder phase is proposed. The encoder phase is implemented via different Residual Network configurations. The decoder phase (which creates the caption) is run using recurrent layers plus an attention mechanism. The audio representation chosen is Gammatone. Results show that the framework proposed in this work surpasses the baseline system in the challenge results.
Submitted 8 July, 2020; v1 submitted 27 June, 2020;
originally announced June 2020.
-
Anomalous Sound Detection using unsupervised and semi-supervised autoencoders and gammatone audio representation
Authors:
Sergi Perez-Castanos,
Javier Naranjo-Alcazar,
Pedro Zuccarello,
Maximo Cobos
Abstract:
Anomalous sound detection (ASD) is, nowadays, one of the topical subjects in the machine listening discipline. Unsupervised detection is attracting a lot of interest due to its immediate applicability in many fields. For example, in industrial processes, the early detection of malfunctions or damage in machines can mean great savings and an improvement in efficiency. This problem can be solved with an unsupervised ASD solution, since industrial machines do not have to be damaged simply to have anomalous audio data in the training stage. This paper proposes a novel framework based on convolutional autoencoders (both unsupervised and semi-supervised) and a Gammatone-based representation of the audio. The results obtained by these architectures substantially exceed the results presented as a baseline.
Submitted 27 June, 2020;
originally announced June 2020.
-
Sound Event Localization and Detection using Squeeze-Excitation Residual CNNs
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Jose Ferrandis,
Pedro Zuccarello,
Maximo Cobos
Abstract:
Sound Event Localization and Detection (SELD) is a problem related to the field of machine listening whose objective is to recognize individual sound events, detect their temporal activity, and estimate their spatial location. Thanks to the emergence of more hard-labeled audio datasets, deep learning techniques have become state-of-the-art solutions. The most common ones are those that implement a convolutional recurrent network (CRNN), having previously transformed the audio signal into a multichannel 2D representation. The squeeze-excitation technique can be considered as a convolution enhancement that aims to learn spatial and channel feature maps independently rather than together as standard convolutions do. This is usually achieved by combining some global clustering operators, linear operators and a final calibration between the block input and its learned relationships. This work aims to improve the accuracy results of the baseline CRNN presented in DCASE 2020 Task 3 by adding residual squeeze-excitation (SE) blocks in the convolutional part of the CRNN. The procedure followed involves a grid search of the ratio parameter (used in the linear relationships) of the residual SE block, whereas the hyperparameters of the network remain the same as in the baseline. Experiments show that simply introducing the residual SE blocks improves considerably on the baseline results.
Submitted 30 July, 2021; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Acoustic Scene Classification with Squeeze-Excitation Residual Networks
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Pedro Zuccarello,
Maximo Cobos
Abstract:
Acoustic scene classification (ASC) is a problem related to the field of machine listening whose objective is to classify/tag an audio clip with a predefined label describing a scene location (e.g. park, airport, etc.). Many state-of-the-art solutions to ASC incorporate data augmentation techniques and model ensembles. However, considerable improvements can also be achieved only by modifying the architecture of convolutional neural networks (CNNs). In this work we propose two novel squeeze-excitation blocks to improve the accuracy of a CNN-based ASC framework based on residual learning. The main idea of squeeze-excitation blocks is to learn spatial and channel-wise feature maps independently instead of jointly as standard CNNs do. This is usually achieved by some global grouping operators, linear operators and a final calibration between the input of the block and its obtained relationships. The behavior of the block that implements such operators and, therefore, the entire neural network, can be modified depending on the input to the block, the established residual configurations and the selected non-linear activations. The analysis has been carried out using the TAU Urban Acoustic Scenes 2019 dataset (https://zenodo.org/record/2589280) presented in the 2019 edition of the DCASE challenge. All configurations discussed in this document exceed the performance of the baseline proposed by the DCASE organization by 13 percentage points. In turn, the novel configurations proposed in this paper outperform the residual configurations proposed in previous works.
Submitted 26 June, 2020; v1 submitted 20 March, 2020;
originally announced March 2020.
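The three ingredients named in the abstract (a global grouping operator, linear operators, and a final calibration of the block input) can be sketched in plain Python as a standard channel squeeze-excitation step. This is the generic form, not the two novel blocks proposed in the paper; the function name, the omission of biases, and the weight layout are assumptions for illustration.

```python
import math


def squeeze_excite(feature_maps, w_reduce, w_expand):
    """Generic channel squeeze-excitation on C flattened feature maps.

    feature_maps: list of C channels, each a list of activations.
    w_reduce: (C // ratio) rows of C weights (reduction layer).
    w_expand: C rows of (C // ratio) weights (expansion layer).
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation, step 1: linear reduction followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: linear expansion followed by a sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Calibration: rescale every channel of the input by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]
```

In a residual configuration, the rescaled maps would then be combined with a skip connection; where that addition is placed is one of the design choices the abstract refers to.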
-
On the performance of residual block design alternatives in convolutional neural networks for end-to-end audio classification
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Irene Martin-Morato,
Pedro Zuccarello,
Maximo Cobos
Abstract:
Residual learning is a recently proposed learning framework to facilitate the training of very deep neural networks. Residual blocks or units are made of a set of stacked layers, where the inputs are added back to their outputs with the aim of creating identity mappings. In practice, such identity mappings are accomplished by means of the so-called skip or residual connections. However, multiple implementation alternatives arise with respect to where such skip connections are applied within the set of stacked layers that make up a residual block. While ResNet architectures for image classification using convolutional neural networks (CNNs) have been widely discussed in the literature, few works have adopted ResNet architectures so far for 1D audio classification tasks. Thus, the suitability of different residual block designs for raw audio classification is partly unknown. The purpose of this paper is to analyze and discuss the performance of several residual block implementations within a state-of-the-art CNN-based architecture for end-to-end audio classification using raw audio waveforms. For comparison purposes, we also analyze the performance of the residual blocks under a similar 2D architecture using a conventional time-frequency audio representation as input. The results show that the achieved accuracy is considerably dependent, not only on the specific residual block implementation, but also on the selected input normalization.
Submitted 26 September, 2019; v1 submitted 26 June, 2019;
originally announced June 2019.
-
CNN depth analysis with different channel inputs for Acoustic Scene Classification
Authors:
Sergi Perez-Castanos,
Javier Naranjo-Alcazar,
Pedro Zuccarello,
Maximo Cobos,
Frances J. Ferri
Abstract:
Acoustic scene classification (ASC) has been approached in recent years using deep learning techniques such as convolutional neural networks or recurrent neural networks. Many state-of-the-art solutions are based on image classification frameworks and, as such, a 2D representation of the audio signal is considered for training these networks. Finding the most suitable audio representation is still a research area of interest. In this paper, different log-Mel representations and combinations are analyzed. Experiments show that the best results are obtained using the harmonic and percussive components plus the difference between left and right stereo channels (L-R). On the other hand, it is a common strategy to ensemble different models in order to increase the final accuracy. Even though averaging different model predictions is a common choice, an exhaustive analysis of different ensemble techniques has not been presented in ASC problems. In this paper, the geometric and arithmetic means plus the Ordered Weighted Averaging (OWA) operator are studied as aggregation operators for the output of the different models of the ensemble. Finally, the work carried out in this paper is highly oriented towards real-time implementations. In this context, as the number of applications for audio classification on edge devices is increasing exponentially, we also analyze different network depths and efficient solutions for aggregating ensemble predictions.
Submitted 13 August, 2021; v1 submitted 10 June, 2019;
originally announced June 2019.
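The three aggregation operators mentioned in the abstract can be sketched as follows for the scores that several ensemble members assign to a single class. The function names and the example weights are assumptions for illustration; the abstract does not say which OWA weights were used.

```python
import math


def arithmetic_mean(scores):
    """Plain average of the per-model scores."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """n-th root of the product of the per-model scores."""
    return math.prod(scores) ** (1.0 / len(scores))


def owa(scores, weights):
    """Ordered Weighted Averaging.

    The weights (non-negative, summing to 1) are attached to rank
    positions, not to models: weights[0] multiplies the largest score,
    weights[1] the second largest, and so on.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))
```

With weights `[1, 0, 0]` the OWA operator reduces to the maximum, with `[0, 0, 1]` to the minimum, and with equal weights to the arithmetic mean, which is why it can be seen as a family that interpolates between these simpler rules.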