-
Generalizable Detection of Audio Deepfakes
Authors:
Jose A. Lopez,
Georg Stemmer,
Héctor Cordourier Maruri
Abstract:
In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data augmentation strategies and loss functions on model performance. The results of our research demonstrate substantial enhancements in the generalization capabilities of audio deepfake detection models, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge. This study contributes valuable insights into the optimization of audio models for more robust deepfake detection and facilitates future research in this critical area.
Submitted 2 July, 2025;
originally announced July 2025.
-
MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models
Authors:
Jhon Lopez,
Carlos Hinojosa,
Henry Arguello,
Bernard Ghanem
Abstract:
The task of inverting real images into StyleGAN's latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages vision state-space models (VSSMs) to address these challenges. Specifically, our approach integrates VSSMs within the proposed architecture, enabling high-quality image inversion and flexible editing with significantly fewer parameters and reduced computational complexity compared to state-of-the-art methods. Extensive experiments show that MambaStyle achieves a superior balance among inversion accuracy, editing quality, and computational efficiency. Notably, our method achieves superior inversion and editing results with reduced model complexity and faster inference, making it suitable for real-time applications.
Submitted 6 May, 2025;
originally announced May 2025.
-
Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings
Authors:
Owais Mujtaba Khanday,
Pablo Rodroguez San Esteban,
Zubair Ahmad Lone,
Marc Ouellet,
Jose Andres Gonzalez Lopez
Abstract:
Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct high-gamma neural activity characteristics, key indicators of cortical processing, recorded during speech production. We leverage pre-trained embeddings from deep learning models trained on linguistic and acoustic data to represent high-level speech features and map them onto these high-gamma signals. We analyze the extent to which these embeddings preserve the spatio-temporal dynamics of brain activity. Reconstructed neural signals are evaluated against high-gamma ground-truth activity using correlation metrics and signal reconstruction quality assessments. The results indicate that high-gamma activity can be effectively reconstructed using large language and speech model embeddings in all study participants, generating Pearson's correlation coefficients ranging from 0.79 to 0.99.
Submitted 21 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
Unsupervised Welding Defect Detection Using Audio And Video
Authors:
Georg Stemmer,
Jose A. Lopez,
Juan A. Del Hoyo Ontiveros,
Arvind Raju,
Tara Thimmanaik,
Sovan Biswas
Abstract:
In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database with more than 4000 welding samples we collected which covers different weld types, materials and various defect categories. All deep learning models are trained in an unsupervised fashion because the space of possible defects is large and the defects in our data may contain biases. We demonstrate that a reliable real-time detection of most categories of weld defects is feasible both from audio and video, with improvements achieved by combining both modalities. Specifically, the multi-modal approach achieves an average Area-under-ROC-Curve (AUC) of 0.92 over all eleven defect types in our data. We conclude the paper with an analysis of the results by defect type and a discussion of future work.
Submitted 3 September, 2024;
originally announced September 2024.
-
Privacy-Preserving Deep Learning Using Deformable Operators for Secure Task Learning
Authors:
Fabian Perez,
Jhon Lopez,
Henry Arguello
Abstract:
In the era of cloud computing and data-driven applications, it is crucial to protect sensitive information to maintain data privacy, ensuring truly reliable systems. As a result, preserving privacy in deep learning systems has become a critical concern. Existing methods for privacy preservation rely on image encryption or perceptual transformation approaches. However, they often suffer from reduced task performance and high computational costs. To address these challenges, we propose a novel privacy-preserving framework that uses a set of deformable operators for secure task learning. Our method involves shuffling pixels during the analog-to-digital conversion process to generate visually protected data. These data are then fed into a well-known network enhanced with deformable operators. Using our approach, users can achieve equivalent performance to original images without additional training using a secret key. Moreover, our method enables access control against unauthorized users. Experimental results demonstrate the efficacy of our approach, showcasing its potential in cloud-based scenarios and privacy-sensitive applications.
Submitted 8 April, 2024;
originally announced April 2024.
-
Improvement of Performance in Freezing of Gait detection in Parkinson's Disease using Transformer networks and a single waist-worn triaxial accelerometer
Authors:
Luis Sigcha,
Luigi Borzì,
Ignacio Pavón,
Nélson Costa,
Susana Costa,
Pedro Arezes,
Juan-Manuel López,
Guillermo De Arcas
Abstract:
Freezing of gait (FOG) is one of the most incapacitating symptoms in Parkinson's disease, affecting more than 50 percent of patients in advanced stages of the disease. The presence of FOG may lead to falls and a loss of independence with a consequent reduction in the quality of life. Wearable technology and artificial intelligence have been used for automatic FOG detection to optimize monitoring. However, differences between laboratory and daily-life conditions present challenges for the implementation of reliable detection systems. Consequently, improvement of FOG detection methods remains important to provide accurate monitoring mechanisms intended for free-living and real-time use. This paper presents advances in automatic FOG detection using a single body-worn triaxial accelerometer and a novel classification algorithm based on Transformers and convolutional networks. This study was performed with data from 21 patients who manifested FOG episodes while performing activities of daily living in a home setting. Results indicate that the proposed FOG-Transformer can bring a significant improvement in FOG detection using leave-one-subject-out cross-validation (LOSO CV). These results bring opportunities for the implementation of accurate monitoring systems for use in ambulatory or home settings.
Submitted 4 April, 2024;
originally announced April 2024.
-
Privacy-preserving Optics for Enhancing Protection in Face De-identification
Authors:
Jhon Lopez,
Carlos Hinojosa,
Henry Arguello,
Bernard Ghanem
Abstract:
The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes, offices, hospitals, etc. The need to access or process personal information for these purposes raises privacy concerns. While software-level solutions like face de-identification provide a good privacy/utility trade-off, they present vulnerabilities to sniffing attacks. In this paper, we propose a hardware-level face de-identification method to solve this vulnerability. Specifically, our approach first learns an optical encoder along with a regression model to obtain a face heatmap while hiding the face identity from the source image. We also propose an anonymization framework that generates a new face using the privacy-preserving image, face heatmap, and a reference face image from a public dataset as input. We validate our approach with extensive simulations and hardware experiments.
Submitted 31 March, 2024;
originally announced April 2024.
-
NOSTROMO: Lessons learned, conclusions and way forward
Authors:
Mayte Cano,
Andrés Perillo,
Juan Antonio López,
Faustino Tello,
Javier Poveda,
Francisco Câmara,
Francisco Antunes,
Christoffer Riis,
Ian Crook,
Abderrazak Tibichte,
Sandrine Molton,
David Mocholí,
Ricardo Herranz,
Gérald Gurtner,
Tatjana Bolić,
Andrew Cook,
Jovana Kuljanin,
Xavier Prats
Abstract:
This White Paper sets out to explain the value that metamodelling can bring to air traffic management (ATM) research. It will define metamodelling and explore what it can, and cannot, do. The reader is assumed to have basic knowledge of SESAR: the Single European Sky ATM Research project. An important element of SESAR, as the technological pillar of the Single European Sky initiative, is to bring about improvements, as measured through specific key performance indicators (KPIs), and as implemented by a series of so-called SESAR 'Solutions'. These 'Solutions' are new or improved operational procedures or technologies, designed to meet operational and performance improvements described in the European ATM Master Plan.
Submitted 29 March, 2023;
originally announced March 2023.
-
EndoMapper dataset of complete calibrated endoscopy procedures
Authors:
Pablo Azagra,
Carlos Sostres,
Ángel Ferrandez,
Luis Riazuelo,
Clara Tomasini,
Oscar León Barbed,
Javier Morlana,
David Recasens,
Victor M. Batlle,
Juan J. Gómez-Rodríguez,
Richard Elvira,
Julia López,
Cristina Oriol,
Javier Civera,
Juan D. Tardós,
Ana Cristina Murillo,
Angel Lanas,
José M. M. Montiel
Abstract:
Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset range from anatomical landmarks, procedure labeling, segmentations, and reconstructions to simulated sequences with ground truth and same-patient procedures. The software used in this paper is publicly available.
Submitted 10 October, 2023; v1 submitted 29 April, 2022;
originally announced April 2022.
-
Short-Term Flow-Based Bandwidth Forecasting using Machine Learning
Authors:
Maxime Labonne,
Jorge López,
Claude Poletti,
Jean-Baptiste Munier
Abstract:
This paper proposes a novel framework to predict traffic flows' bandwidth ahead of time. Modern network management systems share a common issue: the network situation evolves between the moment the decision is made and the moment when actions (countermeasures) are applied. This framework converts packets from real-life traffic into flows containing relevant features. Machine learning models, including Decision Tree, Random Forest, XGBoost, and Deep Neural Network, are trained on these data to predict the bandwidth at the next time instance for every flow. Predictions can be fed to the management system instead of the current flows' bandwidth so that decisions are based on a more accurate network state. Experiments were performed on 981,774 flows and 15 different time windows (from 0.03s to 4s). They show that the Random Forest is the best performing and most reliable model, with a predictive performance consistently better than relying on the current bandwidth (+19.73% in mean absolute error and +18.00% in root mean square error). Experimental results indicate that this framework can help network management systems to take more informed decisions using a predicted network state.
Submitted 3 December, 2020; v1 submitted 29 November, 2020;
originally announced November 2020.
-
Runtime Safety Assurance Using Reinforcement Learning
Authors:
Christopher Lazarus,
James G. Lopez,
Mykel J. Kochenderfer
Abstract:
The airworthiness and safety of a non-pedigreed autopilot must be verified, but the cost to formally do so can be prohibitive. We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as a mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specified behavior as the system operates. When the system is triggered, a verified recovery controller is deployed. Recovery controllers are designed to be safe but very likely disruptive to the operational objective of the system, and thus RTSA systems must balance safety and efficiency. The objective of this paper is to design a meta-controller capable of identifying unsafe situations with high accuracy. The high-dimensional, non-linear dynamics in which modern controllers are deployed, along with the black-box nature of the nominal controllers, make this a difficult problem. Current approaches rely heavily on domain expertise and human engineering. We frame the design of RTSA with the Markov decision process (MDP) framework and use reinforcement learning (RL) to solve it. Our learned meta-controller consistently exhibits superior performance in our experiments compared to our baseline, a human-engineered approach.
Submitted 20 October, 2020;
originally announced October 2020.
-
An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments
Authors:
Javier Naranjo-Alcazar,
Sergi Perez-Castanos,
Pedro Zuccarrello,
Ana M. Torres,
Jose J. Lopez,
Franscesc J. Ferri,
Maximo Cobos
Abstract:
The problem of training with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications include those related to face recognition. In the audio domain, music fraud detection or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as doorbells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications to other available datasets. This paper aims to provide the audio recognition community with a carefully annotated dataset (https://zenodo.org/record/3689288) for FSL in an OSR context, comprising 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results with state-of-the-art baseline systems based on transfer learning are also presented.
Submitted 11 April, 2022; v1 submitted 26 February, 2020;
originally announced February 2020.
-
How To Tame Your Sparsity Constraints
Authors:
Jose A. Lopez
Abstract:
We show that designing sparse $H_\infty$ controllers, in a discrete (LTI) setting, is easy when the controller is assumed to be an FIR filter. In this case, the problem reduces to a static output feedback problem with equality constraints. We show how to obtain an initial guess, for the controller, and then provide a simple algorithm that alternates between two (convex) feasibility programs until converging, when the problem is feasible, to a suboptimal $H_\infty$ controller that is automatically stable. As FIR filters contain the information of their impulse response in their coefficients, it is easy to see that our results provide a path of least resistance to designing sparse robust controllers for continuous-time plants, via system identification methods.
Submitted 31 May, 2015;
originally announced June 2015.
-
Robust Anomaly Detection Using Semidefinite Programming
Authors:
Jose A. Lopez,
Octavia Camps,
Mario Sznaier
Abstract:
This paper presents a new approach, based on polynomial optimization and the method of moments, to the problem of anomaly detection. The proposed technique only requires information about the statistical moments of the normal-state distribution of the features of interest and compares favorably with existing approaches (such as Parzen windows and 1-class SVM). In addition, it provides a succinct description of the normal state. Thus, it leads to a substantial simplification of the anomaly detection problem when working with higher dimensional datasets.
Submitted 30 May, 2015; v1 submitted 3 April, 2015;
originally announced April 2015.