-
Diversity-Driven Learning: Tackling Spurious Correlations and Data Heterogeneity in Federated Models
Authors:
Gergely D. Németh,
Eros Fanì,
Yeat Jeng Ng,
Barbara Caputo,
Miguel Ángel Lozano,
Nuria Oliver,
Novi Quadrianto
Abstract:
Federated Learning (FL) enables decentralized training of machine learning models on distributed data while preserving privacy. However, in real-world FL settings, client data is often non-identically distributed and imbalanced, resulting in statistical data heterogeneity which impacts the generalization capabilities of the server's model across clients, slows convergence and reduces performance. In this paper, we address this challenge by first proposing a characterization of statistical data heterogeneity by means of 6 metrics of global and client attribute imbalance, class imbalance, and spurious correlations. Next, we create and share 7 computer vision datasets for binary and multiclass image classification tasks in Federated Learning that cover a broad range of statistical data heterogeneity and hence simulate real-world situations. Finally, we propose FedDiverse, a novel client selection algorithm in FL which is designed to manage and leverage data heterogeneity across clients by promoting collaboration between clients with complementary data distributions. Experiments on the seven proposed FL datasets demonstrate FedDiverse's effectiveness in enhancing the performance and robustness of a variety of FL methods while having low communication and computational overhead.
Submitted 15 April, 2025;
originally announced April 2025.
-
Compute-Efficient Active Learning
Authors:
Gábor Németh,
Tamás Matuszka
Abstract:
Active learning, a powerful paradigm in machine learning, aims at reducing labeling costs by selecting the most informative samples from an unlabeled dataset. However, the traditional active learning process often demands extensive computational resources, hindering scalability and efficiency. In this paper, we address this critical issue by presenting a novel method designed to alleviate the computational burden associated with active learning on massive datasets. To achieve this goal, we introduce a simple, yet effective method-agnostic framework that outlines how to strategically choose and annotate data points, optimizing the process for efficiency while maintaining model performance. Through case studies, we demonstrate the effectiveness of our proposed method in reducing computational costs while maintaining or, in some cases, even surpassing baseline model outcomes. Code is available at https://github.com/aimotive/Compute-Efficient-Active-Learning.
Submitted 15 January, 2024;
originally announced January 2024.
-
Privacy and Accuracy Implications of Model Complexity and Integration in Heterogeneous Federated Learning
Authors:
Gergely Dániel Németh,
Miguel Ángel Lozano,
Novi Quadrianto,
Nuria Oliver
Abstract:
Federated Learning (FL) has been proposed as a privacy-preserving solution for distributed machine learning, particularly in heterogeneous FL settings where clients have varying computational capabilities and thus train models with different complexities compared to the server's model. However, FL is not without vulnerabilities: recent studies have shown that it is susceptible to membership inference attacks (MIA), which can compromise the privacy of client data. In this paper, we examine the intersection of these two aspects, heterogeneous FL and its privacy vulnerabilities, by focusing on the role of client model integration, the process through which the server integrates parameters from clients' smaller models into its larger model. To better understand this process, we first propose a taxonomy that categorizes existing heterogeneous FL methods and enables the design of seven novel heterogeneous FL model integration strategies. Using CIFAR-10, CIFAR-100, and FEMNIST vision datasets, we evaluate the privacy and accuracy trade-offs of these approaches under three types of MIAs. Our findings reveal significant differences in privacy leakage and performance depending on the integration method. Notably, introducing randomness in the model integration process enhances client privacy while maintaining competitive accuracy for both the clients and the server. This work sheds quantitative light on the privacy-accuracy implications of client model integration in heterogeneous FL settings, paving the way towards more secure and efficient FL systems.
Submitted 10 March, 2025; v1 submitted 29 November, 2023;
originally announced November 2023.
-
aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception
Authors:
Tamás Matuszka,
Iván Barton,
Ádám Butykai,
Péter Hajas,
Dávid Kiss,
Domonkos Kovács,
Sándor Kunsági-Máté,
Péter Lengyel,
Gábor Németh,
Levente Pető,
Dezső Ribli,
Dávid Szeghy,
Szabolcs Vajna,
Bálint Varga
Abstract:
Autonomous driving is a popular research area within the computer vision research community. Since autonomous vehicles are highly safety-critical, ensuring robustness is essential for real-world deployment. While several public multimodal datasets are accessible, they mainly comprise two sensor modalities (camera, LiDAR) which are not well suited for adverse weather. In addition, they lack far-range annotations, making it harder to train neural networks that are the base of a highway assistant function of an autonomous vehicle. Therefore, we introduce a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. The collected data was captured in highway, urban, and suburban areas during daytime, night, and rain and is annotated with 3D bounding boxes with consistent identifiers across frames. Furthermore, we trained unimodal and multimodal baseline models for 3D object detection. Data are available at https://github.com/aimotive/aimotive_dataset.
Submitted 22 September, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
A Snapshot of the Frontiers of Client Selection in Federated Learning
Authors:
Gergely Dániel Németh,
Miguel Ángel Lozano,
Novi Quadrianto,
Nuria Oliver
Abstract:
Federated learning (FL) has been proposed as a privacy-preserving approach in distributed machine learning. A federated learning architecture consists of a central server and a number of clients that have access to private, potentially sensitive data. Clients are able to keep their data in their local machines and only share their locally trained model's parameters with a central server that manages the collaborative learning process. FL has delivered promising results in real-life scenarios, such as healthcare, energy, and finance. However, when the number of participating clients is large, the overhead of managing the clients slows down the learning. Thus, client selection has been introduced as a strategy to limit the number of communicating parties at every step of the process. Since the early naïve random selection of clients, several client selection methods have been proposed in the literature. Unfortunately, given that this is an emergent field, there is a lack of a taxonomy of client selection methods, making it hard to compare approaches. In this paper, we propose a taxonomy of client selection in Federated Learning that enables us to shed light on current progress in the field and identify potential areas of future research in this promising area of machine learning.
Submitted 2 January, 2023; v1 submitted 27 September, 2022;
originally announced October 2022.
-
Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0
Authors:
Mohammed Salah Al-Radhi,
Tamás Gábor Csapó,
Csaba Zainkó,
Géza Németh
Abstract:
Neural network-based Text-to-Speech has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron2, FastSpeech, FastPitch) usually generate a Mel-spectrogram from text and then synthesize speech using a vocoder (e.g., WaveNet, WaveGlow, HiFiGAN). Compared with traditional parametric approaches (e.g., STRAIGHT and WORLD), neural vocoder-based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust and lacks controllability. In this work, we propose a novel updated vocoder, a simple signal model that is easy to train and that generates waveforms easily. We use the Gaussian-Markov model for robust learning of the spectral envelope and wavelet-based statistical signal processing to characterize and decompose F0 features. It can retain the fine spectral envelope and achieve high controllability of natural speech. The experimental results demonstrate that our proposed vocoder achieves better naturalness of reconstructed speech than the conventional STRAIGHT vocoder, slightly better than WaveNet, and somewhat worse than WaveRNN.
Submitted 15 August, 2022;
originally announced August 2022.
-
Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
Authors:
Csaba Zainkó,
László Tóth,
Amin Honarmandi Shandiz,
Gábor Gosztolya,
Alexandra Markó,
Géza Németh,
Tamás Gábor Csapó
Abstract:
For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion contains three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. The generated speech retains the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information are predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images, but represent the target speaker, as they are inferred from the pre-trained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech is more natural with the proposed solutions than with our earlier model.
Submitted 26 July, 2021;
originally announced July 2021.
-
Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters
Authors:
Mohammed Salah Al-Radhi,
Tamás Gábor Csapó,
Géza Németh
Abstract:
Vocoders have received renewed attention as main components in statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Even though some vocoding techniques yield nearly acceptable synthesized speech, their high computational complexity and irregular structures remain challenging concerns that cause a variety of voice quality degradations. Therefore, this paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system. First, a new continuous noise masking based on phase distortion is proposed to eliminate the perceptual impact of the residual noise and to allow an accurate reconstruction of noise characteristics. Second, we address the need for a neural sequence-to-sequence modeling approach for TTS based on recurrent networks. Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model continuous parameters for more natural, human-like speech. The evaluation results show that the proposed model achieves state-of-the-art speech synthesis performance compared with other traditional methods.
Submitted 19 June, 2021;
originally announced June 2021.
-
Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis
Authors:
Mohammed Salah Al-Radhi,
Tamás Gábor Csapó,
Csaba Zainkó,
Géza Németh
Abstract:
To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing speech waveforms that plays a major role in the performance of statistical parametric speech synthesis. WaveNet, one of the best models that nearly resembles the human voice, has to generate a waveform in a time-consuming sequential manner with an extremely complex neural network structure.
Submitted 12 June, 2021;
originally announced June 2021.
-
Self-Attention Networks for Intent Detection
Authors:
Sevinj Yolchuyeva,
Géza Németh,
Bálint Gyires-Tóth
Abstract:
Self-attention networks (SAN) have shown promising performance in various Natural Language Processing (NLP) scenarios, especially in machine translation. One of the main strengths of SANs is their ability to capture long-range and multi-scale dependencies from the data. In this paper, we present a novel intent detection system which is based on a self-attention network and a Bi-LSTM. Our approach shows improvement by using a transformer model and deep averaging network-based universal sentence encoder compared to previous solutions. We evaluate the system on the Snips, Smart Speaker, Smart Lights, and ATIS datasets using different evaluation metrics. The performance of the proposed model is compared with that of an LSTM on the same datasets.
Submitted 28 June, 2020;
originally announced June 2020.
-
Transformer based Grapheme-to-Phoneme Conversion
Authors:
Sevinj Yolchuyeva,
Géza Németh,
Bálint Gyires-Tóth
Abstract:
The attention mechanism is one of the most successful techniques in deep learning-based Natural Language Processing (NLP). The transformer network architecture is based entirely on attention mechanisms, and it outperforms sequence-to-sequence models in neural machine translation without recurrent and convolutional layers. Grapheme-to-phoneme (G2P) conversion is the task of converting letters (grapheme sequence) to their pronunciations (phoneme sequence). It plays a significant role in text-to-speech (TTS) and automatic speech recognition (ASR) systems. In this paper, we investigate the application of the transformer architecture to G2P conversion and compare its performance with recurrent and convolutional neural network-based approaches. Phoneme and word error rates are evaluated on the CMUDict dataset for US English and the NetTalk dataset. The results show that transformer-based G2P outperforms the convolutional-based approach in terms of word error rate, and our results significantly exceed previous recurrent approaches (without attention) regarding word and phoneme error rates on both datasets. Furthermore, the size of the proposed model is much smaller than the size of the previous approaches.
Submitted 26 June, 2020; v1 submitted 14 April, 2020;
originally announced April 2020.
-
Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder
Authors:
Tamás Gábor Csapó,
Mohammed Salah Al-Radhi,
Géza Németh,
Gábor Gosztolya,
Tamás Grósz,
László Tóth,
Alexandra Markó
Abstract:
Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced / unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
Submitted 24 June, 2019;
originally announced June 2019.
-
RNN-based speech synthesis using a continuous sinusoidal model
Authors:
Mohammed Salah Al-Radhi,
Tamás Gábor Csapó,
Géza Németh
Abstract:
Recently in statistical parametric speech synthesis, we proposed a continuous sinusoidal model (CSM) using continuous F0 (contF0) in combination with Maximum Voiced Frequency (MVF), which achieved state-of-the-art vocoder performance (e.g. similar to STRAIGHT) in synthesized speech. In this paper, we address the use of sequence-to-sequence modeling with recurrent neural networks (RNNs). Bidirectional long short-term memory (Bi-LSTM) is investigated and applied using our CSM to model contF0, MVF, and Mel-Generalized Cepstrum (MGC) for more natural-sounding synthesized speech. To refine the output of the contF0 estimation, post-processing based on a time-warping approach is applied to reduce the unwanted voiced component of the unvoiced speech sounds, resulting in an enhanced contF0 track. The overall conclusion is supported by objective evaluation and a subjective listening test, showing that the proposed framework provides satisfactory results in terms of naturalness and intelligibility, and is comparable to the high-quality WORLD model-based RNNs.
Submitted 12 April, 2019;
originally announced April 2019.