Search | arXiv e-print repository

Generalizable Detection of Audio Deepfakes

Authors: Jose A. Lopez, Georg Stemmer, Héctor Cordourier Maruri

Abstract: In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data… ▽ More In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data augmentation strategies and loss functions on model performance. The results of our research demonstrate substantial enhancements in the generalization capabilities of audio deepfake detection models, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge. This study contributes valuable insights into the optimization of audio models for more robust deepfake detection and facilitates future research in this critical area. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: 8 pages, 3 figures

arXiv:2505.15822 [pdf, other]

MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models

Authors: Jhon Lopez, Carlos Hinojosa, Henry Arguello, Bernard Ghanem

Abstract: The task of inverting real images into StyleGAN's latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages… ▽ More The task of inverting real images into StyleGAN's latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages vision state-space models (VSSMs) to address these challenges. Specifically, our approach integrates VSSMs within the proposed architecture, enabling high-quality image inversion and flexible editing with significantly fewer parameters and reduced computational complexity compared to state-of-the-art methods. Extensive experiments show that MambaStyle achieves a superior balance among inversion accuracy, editing quality, and computational efficiency. Notably, our method achieves superior inversion and editing results with reduced model complexity and faster inference, making it suitable for real-time applications. △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.14074 [pdf, ps, other]

Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings

Authors: Owais Mujtaba Khanday, Pablo Rodroguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose Andres Gonzalez Lopez

Abstract: Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct high-gamma neural activity characteristics, key indicators of cortical processing, recorded during speech production. We le… ▽ More Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct high-gamma neural activity characteristics, key indicators of cortical processing, recorded during speech production. We leverage pre-trained embeddings from deep learning models trained on linguistic and acoustic data to represent high-level speech features and map them onto these high-gamma signals. We analyze the extent to which these embeddings preserve the spatio-temporal dynamics of brain activity. Reconstructed neural signals are evaluated against high-gamma ground-truth activity using correlation metrics and signal reconstruction quality assessments. The results indicate that high-gamma activity can be effectively reconstructed using large language and speech model embeddings in all study participants, generating Pearson's correlation coefficients ranging from 0.79 to 0.99. △ Less

Submitted 21 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted for presentation at Interspeech2025

arXiv:2409.02290 [pdf, other]

Unsupervised Welding Defect Detection Using Audio And Video

Authors: Georg Stemmer, Jose A. Lopez, Juan A. Del Hoyo Ontiveros, Arvind Raju, Tara Thimmanaik, Sovan Biswas

Abstract: In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with m… ▽ More In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database with more than 4000 welding samples we collected which covers different weld types, materials and various defect categories. All deep learning models are trained in an unsupervised fashion because the space of possible defects is large and the defects in our data may contain biases. We demonstrate that a reliable real-time detection of most categories of weld defects is feasible both from audio and video, with improvements achieved by combining both modalities. Specifically, the multi-modal approach achieves an average Area-under-ROC-Curve (AUC) of 0.92 over all eleven defect types in our data. We conclude the paper with an analysis of the results by defect type and a discussion of future work. △ Less

Submitted 3 September, 2024; originally announced September 2024.

Comments: 21 pages

arXiv:2404.05828 [pdf, other]

doi 10.1109/ICASSP48485.2024.10446218

Privacy-Preserving Deep Learning Using Deformable Operators for Secure Task Learning

Authors: Fabian Perez, Jhon Lopez, Henry Arguello

Abstract: In the era of cloud computing and data-driven applications, it is crucial to protect sensitive information to maintain data privacy, ensuring truly reliable systems. As a result, preserving privacy in deep learning systems has become a critical concern. Existing methods for privacy preservation rely on image encryption or perceptual transformation approaches. However, they often suffer from reduce… ▽ More In the era of cloud computing and data-driven applications, it is crucial to protect sensitive information to maintain data privacy, ensuring truly reliable systems. As a result, preserving privacy in deep learning systems has become a critical concern. Existing methods for privacy preservation rely on image encryption or perceptual transformation approaches. However, they often suffer from reduced task performance and high computational costs. To address these challenges, we propose a novel Privacy-Preserving framework that uses a set of deformable operators for secure task learning. Our method involves shuffling pixels during the analog-to-digital conversion process to generate visually protected data. Those are then fed into a well-known network enhanced with deformable operators. Using our approach, users can achieve equivalent performance to original images without additional training using a secret key. Moreover, our method enables access control against unauthorized users. Experimental results demonstrate the efficacy of our approach, showcasing its potential in cloud-based scenarios and privacy-sensitive applications. △ Less

Submitted 8 April, 2024; originally announced April 2024.

Comments: copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2404.03704 [pdf]

doi 10.1016/j.engappai.2022.105482

Improvement of Performance in Freezing of Gait detection in Parkinsons Disease using Transformer networks and a single waist worn triaxial accelerometer

Authors: Luis Sigcha, Luigi Borzì, Ignacio Pavón, Nélson Costa, Susana Costa, Pedro Arezes, Juan-Manuel López, Guillermo De Arcas

Abstract: Freezing of gait (FOG) is one of the most incapacitating symptoms in Parkinsons disease, affecting more than 50 percent of patients in advanced stages of the disease. The presence of FOG may lead to falls and a loss of independence with a consequent reduction in the quality of life. Wearable technology and artificial intelligence have been used for automatic FOG detection to optimize monitoring. H… ▽ More Freezing of gait (FOG) is one of the most incapacitating symptoms in Parkinsons disease, affecting more than 50 percent of patients in advanced stages of the disease. The presence of FOG may lead to falls and a loss of independence with a consequent reduction in the quality of life. Wearable technology and artificial intelligence have been used for automatic FOG detection to optimize monitoring. However, differences between laboratory and daily-life conditions present challenges for the implementation of reliable detection systems. Consequently, improvement of FOG detection methods remains important to provide accurate monitoring mechanisms intended for free-living and real-time use. This paper presents advances in automatic FOG detection using a single body-worn triaxial accelerometer and a novel classification algorithm based on Transformers and convolutional networks. This study was performed with data from 21 patients who manifested FOG episodes while performing activities of daily living in a home setting. Results indicate that the proposed FOG-Transformer can bring a significant improvement in FOG detection using leave-one-subject-out cross-validation (LOSO CV). These results bring opportunities for the implementation of accurate monitoring systems for use in ambulatory or home settings. △ Less

Submitted 4 April, 2024; originally announced April 2024.

Journal ref: Engineering Applications of Artificial Intelligence Volume 116, November 2022, 105482

arXiv:2404.00777 [pdf, other]

Privacy-preserving Optics for Enhancing Protection in Face De-identification

Authors: Jhon Lopez, Carlos Hinojosa, Henry Arguello, Bernard Ghanem

Abstract: The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes, offices, hospitals, etc. The need to access or process personal information for these purposes raises privacy concerns. While softwar… ▽ More The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes, offices, hospitals, etc. The need to access or process personal information for these purposes raises privacy concerns. While software-level solutions like face de-identification provide a good privacy/utility trade-off, they present vulnerabilities to sniffing attacks. In this paper, we propose a hardware-level face de-identification method to solve this vulnerability. Specifically, our approach first learns an optical encoder along with a regression model to obtain a face heatmap while hiding the face identity from the source image. We also propose an anonymization framework that generates a new face using the privacy-preserving image, face heatmap, and a reference face image from a public dataset as input. We validate our approach with extensive simulations and hardware experiments. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024. Project Website and Code coming soon

arXiv:2303.18060 [pdf]

NOSTROMO: Lessons learned, conclusions and way forward

Authors: Mayte Cano, Andrés Perillo, Juan Antonio López, Faustino Tello, Javier Poveda, Francisco Câmara, Francisco Antunes, Christoffer Riis, Ian Crook, Abderrazak Tibichte, Sandrine Molton, David Mocholí, Ricardo Herranz, Gérald Gurtner, Tatjana Bolić, Andrew Cook, Jovana Kuljanin, Xavier Prats

Abstract: This White Paper sets out to explain the value that metamodelling can bring to air traffic management (ATM) research. It will define metamodelling and explore what it can, and cannot, do. The reader is assumed to have basic knowledge of SESAR: the Single European Sky ATM Research project. An important element of SESAR, as the technological pillar of the Single European Sky initiative, is to bring… ▽ More This White Paper sets out to explain the value that metamodelling can bring to air traffic management (ATM) research. It will define metamodelling and explore what it can, and cannot, do. The reader is assumed to have basic knowledge of SESAR: the Single European Sky ATM Research project. An important element of SESAR, as the technological pillar of the Single European Sky initiative, is to bring about improvements, as measured through specific key performance indicators (KPIs), and as implemented by a series of so-called SESAR 'Solutions'. These 'Solutions' are new or improved operational procedures or technologies, designed to meet operational and performance improvements described in the European ATM Master Plan. △ Less

Submitted 29 March, 2023; originally announced March 2023.

Comments: White Paper of the NOSTROMO, an exploratory research project funded by the SESAR Joint Undertaking (SJU) under the European Union's Horizon 2020 research and innovation programme

arXiv:2204.14240 [pdf, other]

doi 10.1038/s41597-023-02564-7

EndoMapper dataset of complete calibrated endoscopy procedures

Authors: Pablo Azagra, Carlos Sostres, Ángel Ferrandez, Luis Riazuelo, Clara Tomasini, Oscar León Barbed, Javier Morlana, David Recasens, Victor M. Batlle, Juan J. Gómez-Rodríguez, Richard Elvira, Julia López, Cristina Oriol, Javier Civera, Juan D. Tardós, Ana Cristina Murillo, Angel Lanas, José M. M. Montiel

Abstract: Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introdu… ▽ More Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset vary from the anatomical landmarks, procedure labeling, segmentations, reconstructions, simulated sequences with ground truth and same patient procedures. The software used in this paper is publicly available. △ Less

Submitted 10 October, 2023; v1 submitted 29 April, 2022; originally announced April 2022.

Comments: 17 pages, 14 figures, 8 tables

Journal ref: Sci Data 10, 671 (2023)

arXiv:2011.14421 [pdf, ps, other]

doi 10.1109/WoWMoM51794.2021.00048

Short-Term Flow-Based Bandwidth Forecasting using Machine Learning

Authors: Maxime Labonne, Jorge López, Claude Poletti, Jean-Baptiste Munier

Abstract: This paper proposes a novel framework to predict traffic flows' bandwidth ahead of time. Modern network management systems share a common issue: the network situation evolves between the moment the decision is made and the moment when actions (countermeasures) are applied. This framework converts packets from real-life traffic into flows containing relevant features. Machine learning models, inclu… ▽ More This paper proposes a novel framework to predict traffic flows' bandwidth ahead of time. Modern network management systems share a common issue: the network situation evolves between the moment the decision is made and the moment when actions (countermeasures) are applied. This framework converts packets from real-life traffic into flows containing relevant features. Machine learning models, including Decision Tree, Random Forest, XGBoost, and Deep Neural Network, are trained on these data to predict the bandwidth at the next time instance for every flow. Predictions can be fed to the management system instead of current flows bandwidth in order to take decisions on a more accurate network state. Experiments were performed on 981,774 flows and 15 different time windows (from 0.03s to 4s). They show that the Random Forest is the best performing and most reliable model, with a predictive performance consistently better than relying on the current bandwidth (+19.73% in mean absolute error and +18.00% in root mean square error). Experimental results indicate that this framework can help network management systems to take more informed decisions using a predicted network state. △ Less

Submitted 3 December, 2020; v1 submitted 29 November, 2020; originally announced November 2020.

Comments: 4 pages, 1 figure 3 tables

arXiv:2010.10618 [pdf, other]

Runtime Safety Assurance Using Reinforcement Learning

Authors: Christopher Lazarus, James G. Lopez, Mykel J. Kochenderfer

Abstract: The airworthiness and safety of a non-pedigreed autopilot must be verified, but the cost to formally do so can be prohibitive. We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specifie… ▽ More The airworthiness and safety of a non-pedigreed autopilot must be verified, but the cost to formally do so can be prohibitive. We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specified behavior as the system operates. When the system is triggered, a verified recovery controller is deployed. Recovery controllers are designed to be safe but very likely disruptive to the operational objective of the system, and thus RTSA systems must balance safety and efficiency. The objective of this paper is to design a meta-controller capable of identifying unsafe situations with high accuracy. High dimensional and non-linear dynamics in which modern controllers are deployed along with the black-box nature of the nominal controllers make this a difficult problem. Current approaches rely heavily on domain expertise and human engineering. We frame the design of RTSA with the Markov decision process (MDP) framework and use reinforcement learning (RL) to solve it. Our learned meta-controller consistently exhibits superior performance in our experiments compared to our baseline, human engineered approach. △ Less

Submitted 20 October, 2020; originally announced October 2020.

Journal ref: 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC)

arXiv:2002.11561 [pdf, other]

An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments

Authors: Javier Naranjo-Alcazar, Sergi Perez-Castanos, Pedro Zuccarrello, Ana M. Torres, Jose J. Lopez, Franscesc J. Ferri, Maximo Cobos

Abstract: The problem of training with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications include those related to face… ▽ More The problem of training with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications include those related to face recognition. In the audio domain, music fraud or speaker recognition can be clearly benefited from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as door bells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications on other available datasets. This paper is aimed at poviding the audio recognition community with a carefully annotated dataset (https://zenodo.org/record/3689288) for FSL in an OSR context comprised of 1360 clips from 34 classes divided into pattern sounds} and unwanted sounds. To facilitate and promote research on this area, results with state-of-the-art baseline systems based on transfer learning are also presented. △ Less

Submitted 11 April, 2022; v1 submitted 26 February, 2020; originally announced February 2020.

Comments: Submitted to IEEEAccess

arXiv:1506.00300 [pdf, other]

How To Tame Your Sparsity Constraints

Authors: Jose A. Lopez

Abstract: We show that designing sparse $H_\infty$ controllers, in a discrete (LTI) setting, is easy when the controller is assumed to be an FIR filter. In this case, the problem reduces to a static output feedback problem with equality constraints. We show how to obtain an initial guess, for the controller, and then provide a simple algorithm that alternates between two (convex) feasibility programs until… ▽ More We show that designing sparse $H_\infty$ controllers, in a discrete (LTI) setting, is easy when the controller is assumed to be an FIR filter. In this case, the problem reduces to a static output feedback problem with equality constraints. We show how to obtain an initial guess, for the controller, and then provide a simple algorithm that alternates between two (convex) feasibility programs until converging, when the problem is feasible, to a suboptimal $H_\infty$ controller that is automatically stable. As FIR filters contain the information of their impulse response in their coefficients, it is easy to see that our results provide a path of least resistance to designing sparse robust controllers for continuous-time plants, via system identification methods. △ Less

Submitted 31 May, 2015; originally announced June 2015.

Comments: 12 pages, 1 figure

arXiv:1504.00905 [pdf, other]

Robust Anomaly Detection Using Semidefinite Programming

Authors: Jose A. Lopez, Octavia Camps, Mario Sznaier

Abstract: This paper presents a new approach, based on polynomial optimization and the method of moments, to the problem of anomaly detection. The proposed technique only requires information about the statistical moments of the normal-state distribution of the features of interest and compares favorably with existing approaches (such as Parzen windows and 1-class SVM). In addition, it provides a succinct d… ▽ More This paper presents a new approach, based on polynomial optimization and the method of moments, to the problem of anomaly detection. The proposed technique only requires information about the statistical moments of the normal-state distribution of the features of interest and compares favorably with existing approaches (such as Parzen windows and 1-class SVM). In addition, it provides a succinct description of the normal state. Thus, it leads to a substantial simplification of the the anomaly detection problem when working with higher dimensional datasets. △ Less

Submitted 30 May, 2015; v1 submitted 3 April, 2015; originally announced April 2015.

Comments: 13 pages, 11 figures

Showing 1–14 of 14 results for author: Lopez, J