-
Private kNN-VC: Interpretable Anonymization of Converted Speech
Authors:
Carlos Franzreb,
Arnab Das,
Tim Polzehl,
Sebastian Möller
Abstract:
Speaker anonymization seeks to conceal a speaker's identity while preserving the utility of their speech. The achieved privacy is commonly evaluated with a speaker recognition model trained on anonymized speech. Although this represents a strong attack, it is unclear which aspects of speech are exploited to identify the speakers. Our research sets out to unveil these aspects. It starts with kNN-VC, a powerful voice conversion model that performs poorly as an anonymization system, presumably because of prosody leakage. To test this hypothesis, we extend kNN-VC with two interpretable components that anonymize the duration and variation of phones. These components increase privacy significantly, proving that the studied prosodic factors encode speaker identity and are exploited by the privacy attack. Additionally, we show that changes in the target selection algorithm considerably influence the outcome of the privacy attack.
Submitted 23 May, 2025;
originally announced May 2025.
-
BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention
Authors:
Yassine El Kheir,
Tim Polzehl,
Sebastian Möller
Abstract:
We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, our proposed framework leverages a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements, a 67.74% and 26.3% relative gain over state-of-the-art AASIST on ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.
Submitted 20 May, 2025;
originally announced May 2025.
-
Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
Authors:
Yassine El Kheir,
Youness Samih,
Suraj Maharjan,
Tim Polzehl,
Sebastian Möller
Abstract:
This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection, offering valuable insights applicable across varied linguistic and contextual settings. Our trained models and code are publicly available: https://github.com/Yaselley/SSL_Layerwise_Deepfake.
Submitted 7 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Accelerated MRI With Deep Linear Convolutional Transform Learning
Authors:
Hongyi Gu,
Burhaneddin Yaman,
Steen Moeller,
Il Yong Chun,
Mehmet Akçakaya
Abstract:
Recent studies show that deep learning (DL) based MRI reconstruction outperforms conventional methods, such as parallel imaging and compressed sensing (CS), in multiple applications. Unlike CS, which is typically implemented with pre-determined linear representations for regularization, DL inherently uses a non-linear representation learned from a large database. Another line of work uses transform learning (TL) to bridge the gap between these two approaches by learning linear representations from data. In this work, we combine ideas from CS, TL and DL reconstructions to learn deep linear convolutional transforms as part of an algorithm unrolling approach. Using end-to-end training, our results show that the proposed technique can reconstruct MR images to a level comparable to DL methods, while supporting uniform undersampling patterns, unlike conventional CS methods. Our proposed method relies on convex sparse image reconstruction with linear representation at inference time, which may be beneficial for characterizing robustness, stability and generalizability.
Submitted 19 August, 2022; v1 submitted 17 April, 2022;
originally announced April 2022.
-
On incorporating social speaker characteristics in synthetic speech
Authors:
Sai Sirisha Rallabandi,
Sebastian Möller
Abstract:
In our previous work, we derived the acoustic features that contribute to the perception of warmth and competence in synthetic speech. As an extension, in our current work, we investigate the impact of the derived vocal features on the generation of the desired characteristics. The acoustic features spectral flux, F1 mean, and F2 mean, and their convex combinations, were explored for the generation of higher warmth in female speech. The voiced slope, spectral flux, and their convex combinations were investigated for the generation of higher competence in female speech. We employed a feature quantization approach in the traditional end-to-end Tacotron-based speech synthesis model. The listening tests showed that the convex combinations of acoustic features yield higher Mean Opinion Scores for warmth and competence than the individual features.
Submitted 3 April, 2022;
originally announced April 2022.
-
ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications
Authors:
Gaoxiong Yi,
Wei Xiao,
Yiming Xiao,
Babak Naderi,
Sebastian Möller,
Wafaa Wardah,
Gabriel Mittag,
Ross Cutler,
Zhuohuang Zhang,
Donald S. Williamson,
Fei Chen,
Fuzheng Yang,
Shidong Shang
Abstract:
With the advances in speech communication systems such as online conferencing applications, we can seamlessly work with people regardless of where they are. However, during online meetings, speech quality can be significantly affected by background noise, reverberation, packet loss, network jitter, etc. Because of its nature, speech quality is traditionally assessed in subjective tests in laboratories and lately also in crowdsourcing following the international standards from ITU-T Rec. P.800 series. However, those approaches are costly and cannot be applied to customer data. Therefore, an effective objective assessment approach is needed to evaluate or monitor the speech quality of the ongoing conversation. The ConferencingSpeech 2022 challenge targets the non-intrusive deep neural network models for the speech quality assessment task. We open-sourced a training corpus with more than 86K speech clips in different languages, with a wide range of synthesized and live degradations and their corresponding subjective quality scores through crowdsourcing. 18 teams submitted their models for evaluation in this challenge. The blind test sets included about 4300 clips from wide ranges of degradations. This paper describes the challenge, the datasets, and the evaluation methods and reports the final results.
Submitted 31 March, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Visualising and Explaining Deep Learning Models for Speech Quality Prediction
Authors:
H. Tilkorn,
G. Mittag,
S. Möller
Abstract:
Estimating the quality of transmitted speech is known to be a non-trivial task. While test participants are traditionally asked to rate the quality of samples, automated methods are now available. These methods can be divided into 1) intrusive models, which use both the original and the degraded signals, and 2) non-intrusive models, which only require the degraded signal. Recently, non-intrusive models based on neural networks have been shown to outperform signal-processing-based models. However, the advantages of deep-learning-based models come at the cost of being more challenging to interpret. To gain more insight into the prediction models, the non-intrusive speech quality prediction model NISQA is analyzed in this paper. NISQA is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN). The task of the CNN is to compute relevant features for the speech quality prediction on a frame level, while the RNN models time-dependencies between the individual speech frames. Different explanation algorithms are used to understand the automatically learned features of the CNN. In this way, several interpretable features could be identified, such as the sensitivity to noise or strong interruptions. On the other hand, it was found that multiple features carry redundant information.
Submitted 12 December, 2021;
originally announced December 2021.
-
20-fold Accelerated 7T fMRI Using Referenceless Self-Supervised Deep Learning Reconstruction
Authors:
Omer Burak Demirel,
Burhaneddin Yaman,
Logan Dowdle,
Steen Moeller,
Luca Vizioli,
Essa Yacoub,
John Strupp,
Cheryl A. Olman,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
High spatial and temporal resolution across the whole brain is essential to accurately resolve neural activities in fMRI. Therefore, accelerated imaging techniques target improved coverage with high spatio-temporal resolution. Simultaneous multi-slice (SMS) imaging combined with in-plane acceleration is used in large studies that involve ultrahigh field fMRI, such as the Human Connectome Project. However, for even higher acceleration rates, these methods cannot be reliably utilized due to aliasing and noise artifacts. Deep learning (DL) reconstruction techniques have recently gained substantial interest for improving highly-accelerated MRI. Supervised learning of DL reconstructions generally requires fully-sampled training datasets, which are not available for high-resolution fMRI studies. To tackle this challenge, self-supervised learning has been proposed for training of DL reconstruction with only undersampled datasets, showing similar performance to supervised learning. In this study, we utilize a self-supervised physics-guided DL reconstruction on 5-fold SMS and 4-fold in-plane accelerated 7T fMRI data. Our results show that our self-supervised DL reconstruction produces high-quality images at this 20-fold acceleration, substantially improving on existing methods, while showing similar functional precision and temporal effects in the subsequent analysis compared to a standard 10-fold accelerated acquisition.
Submitted 12 May, 2021;
originally announced May 2021.
-
Improved Simultaneous Multi-Slice Functional MRI Using Self-supervised Deep Learning
Authors:
Omer Burak Demirel,
Burhaneddin Yaman,
Logan Dowdle,
Steen Moeller,
Luca Vizioli,
Essa Yacoub,
John Strupp,
Cheryl A. Olman,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
Functional MRI (fMRI) is commonly used for interpreting neural activities across the brain. Numerous accelerated fMRI techniques aim to provide improved spatiotemporal resolutions. Among these, simultaneous multi-slice (SMS) imaging has emerged as a powerful strategy, becoming a part of large-scale studies, such as the Human Connectome Project. However, when SMS imaging is combined with in-plane acceleration for higher acceleration rates, conventional SMS reconstruction methods may suffer from noise amplification and other artifacts. Recently, deep learning (DL) techniques have gained interest for improving MRI reconstruction. However, these methods are typically trained in a supervised manner that necessitates fully-sampled reference data, which is not feasible in highly-accelerated fMRI acquisitions. Self-supervised learning that does not require fully-sampled data has recently been proposed and has shown similar performance to supervised learning. However, it has only been applied to in-plane acceleration. Furthermore, the effect of DL reconstruction on subsequent fMRI analysis remains unclear. In this work, we extend self-supervised DL reconstruction to SMS imaging. Our results on prospectively 10-fold accelerated 7T fMRI data show that self-supervised DL reduces reconstruction noise and suppresses residual artifacts. Subsequent fMRI analysis remains unaltered by DL processing, while the improved temporal signal-to-noise ratio produces higher coherence estimates between task runs.
Submitted 10 May, 2021;
originally announced May 2021.
-
Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks
Authors:
Gabriel Mittag,
Sebastian Möller
Abstract:
In this paper, we present a full-reference speech quality prediction model with a deep learning approach. The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network that shares the weights for both signals as input. The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality. The proposed network architecture represents a simple solution for the time-alignment problem that occurs for speech signals transmitted through Voice-Over-IP networks and shows how the clean reference signal can be incorporated into speech quality models that are based on end-to-end trained neural networks.
Submitted 3 May, 2021;
originally announced May 2021.
-
Deep Learning Based Assessment of Synthetic Speech Naturalness
Authors:
Gabriel Mittag,
Sebastian Möller
Abstract:
In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works independently of language. The model is trained end-to-end and based on a CNN-LSTM network that was previously shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, such as those from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
Submitted 23 April, 2021;
originally announced April 2021.
-
Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets
Authors:
Gabriel Mittag,
Saman Zadtootaghaj,
Thilo Michael,
Babak Naderi,
Sebastian Möller
Abstract:
The ground truth used for training image, video, or speech quality prediction models is based on the Mean Opinion Scores (MOS) obtained from subjective experiments. Usually, it is necessary to conduct multiple experiments, mostly with different test participants, to obtain enough data to train quality models based on machine learning. Each of these experiments is subject to an experiment-specific bias, where the rating of the same file may be substantially different in two experiments (e.g. depending on the overall quality distribution). These different ratings for the same distortion levels confuse neural networks during training and lead to lower performance. To overcome this problem, we propose a bias-aware loss function that estimates each dataset's biases during training with a linear function and considers it while optimising the network weights. We prove the efficiency of the proposed method by training and validating quality prediction models on synthetic and subjective image and speech quality datasets.
Submitted 20 April, 2021;
originally announced April 2021.
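The bias-aware loss summarized in the abstract above can be illustrated with a minimal sketch. It is one possible reading of "estimates each dataset's biases with a linear function", not the authors' implementation: the function name `bias_aware_mse`, the per-dataset least-squares fit via `np.polyfit`, and the toy ratings are all assumptions made for illustration. How and when the paper updates the bias estimates during training is not stated in the abstract.

```python
import numpy as np

def bias_aware_mse(pred, target, dataset_ids):
    """MSE computed after a per-dataset linear mapping of the predictions.

    Hypothetical sketch: for each dataset, fit target ~ a * pred + b by least
    squares and measure the error of the mapped predictions. Each dataset
    needs at least two distinct prediction values for the fit to be defined.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    dataset_ids = np.asarray(dataset_ids)
    sq_err = np.empty_like(pred)
    for ds in np.unique(dataset_ids):
        idx = dataset_ids == ds
        # First-order fit estimated from this dataset's samples only.
        a, b = np.polyfit(pred[idx], target[idx], deg=1)
        sq_err[idx] = (a * pred[idx] + b - target[idx]) ** 2
    return float(sq_err.mean())

# Toy data (invented): dataset 0 is unbiased, dataset 1 compresses the scale.
pred = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
target = np.where(ids == 0, pred, 0.8 * pred + 0.5)

plain = float(np.mean((pred - target) ** 2))
aware = bias_aware_mse(pred, target, ids)
print(f"plain MSE = {plain:.4f}, bias-aware MSE = {aware:.2e}")
```

In this toy case the plain MSE penalizes the model for the second dataset's rating offset, while the bias-aware variant does not, which is the effect the abstract describes.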
-
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets
Authors:
Gabriel Mittag,
Babak Naderi,
Assmaa Chehadi,
Sebastian Möller
Abstract:
In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end, and the time-dependency modelling and time-pooling are achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and was shown to provide reliable predictions even for unknown speech samples. The code, model weights, and datasets are open-sourced.
Submitted 19 April, 2021;
originally announced April 2021.
-
Speech Quality Assessment in Crowdsourcing: Comparison Category Rating Method
Authors:
Babak Naderi,
Sebastian Möller,
Ross Cutler
Abstract:
Traditionally, Quality of Experience (QoE) for a communication system is evaluated through a subjective test. The most common test method for speech QoE is the Absolute Category Rating (ACR), in which participants listen to a set of stimuli, processed by the underlying test conditions, and rate their perceived quality for each stimulus on a specific scale. The Comparison Category Rating (CCR) is another standard approach in which participants listen to both reference and processed stimuli and rate their quality compared to the other one. The CCR method is particularly suitable for systems that improve the quality of input speech. This paper evaluates an adaptation of the CCR test procedure for assessing speech quality in the crowdsourcing set-up. The CCR method was introduced in the ITU-T Rec. P.800 for laboratory-based experiments. We adapted the test for the crowdsourcing approach following the guidelines from ITU-T Rec. P.800 and P.808. We show that the results of the CCR procedure via crowdsourcing are highly reproducible. We also compared the CCR test results with widely used ACR test procedures obtained in the laboratory and crowdsourcing. Our results show that the CCR procedure in crowdsourcing is a reliable and valid test method.
Submitted 9 April, 2021;
originally announced April 2021.
-
Incorporating Wireless Communication Parameters into the E-Model Algorithm
Authors:
Demóstenes Z. Rodríguez,
Dick Carrillo Melgarejo,
Miguel A. Ramírez,
Pedro H. J. Nardelli,
Sebastian Möller
Abstract:
Telecommunication service providers have to guarantee acceptable speech quality during a phone call to avoid a negative impact on the users' quality of experience. Currently, there are different speech quality assessment methods. ITU-T Recommendation G.107 describes the E-model algorithm, which is a computational model developed for network planning purposes focused on narrowband (NB) networks. Later, ITU-T Recommendations G.107.1 and G.107.2 were developed for wideband (WB) and fullband (FB) networks. These algorithms use different impairment factors, each one related to different speech communication steps. However, the NB, WB, and FB E-model algorithms do not consider wireless techniques used in these networks, such as Multiple-Input-Multiple-Output (MIMO) systems, which are used to improve the communication system robustness in the presence of different types of wireless channel degradation. In this context, the main objective of this study is to propose a general methodology to incorporate wireless network parameters into the NB and WB E-model algorithms. To accomplish this goal, MIMO and wireless channel parameters are incorporated into the E-model algorithms, specifically into the $I_{e,eff}$ and $I_{e,eff,WB}$ impairment factors. For performance validation, subjective tests were carried out, and the proposed methodology reached a Pearson correlation coefficient (PCC) and a root mean square error (RMSE) of $0.9732$ and $0.2351$, respectively. It is noteworthy that our proposed methodology does not affect the rest of the E-model input parameters, and it intends to be useful for wireless network planning in speech communication services.
Submitted 5 March, 2021;
originally announced March 2021.
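The two validation metrics named in the abstract above, the Pearson correlation coefficient (PCC) and the root mean square error (RMSE), can be computed as in the following sketch. The score arrays are invented placeholders, not data from the study, and the helper names `pcc` and `rmse` are illustrative.

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Placeholder scores (assumed for illustration only).
subjective = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]

print(f"PCC  = {pcc(subjective, predicted):.4f}")
print(f"RMSE = {rmse(subjective, predicted):.4f}")
```

The placeholder data differ by a constant offset, so the correlation is perfect while the RMSE is not zero; this is why the two metrics are usually reported together.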
-
On Instabilities of Conventional Multi-Coil MRI Reconstruction to Small Adversarial Perturbations
Authors:
Chi Zhang,
Jinghan Jia,
Burhaneddin Yaman,
Steen Moeller,
Sijia Liu,
Mingyi Hong,
Mehmet Akçakaya
Abstract:
Although deep learning (DL) has received much attention in accelerated MRI, recent studies suggest small perturbations may lead to instabilities in DL-based reconstructions, leading to concern for their clinical application. However, these works focus on single-coil acquisitions, which is not practical. We investigate instabilities caused by small adversarial attacks for multi-coil acquisitions. Our results suggest that parallel imaging and multi-coil CS exhibit considerable instabilities against small adversarial perturbations.
Submitted 25 February, 2021;
originally announced February 2021.
-
Self-Supervised Physics-Guided Deep Learning Reconstruction For High-Resolution 3D LGE CMR
Authors:
Burhaneddin Yaman,
Chetan Shenoy,
Zilin Deng,
Steen Moeller,
Hossam El-Rewaidy,
Reza Nezafat,
Mehmet Akçakaya
Abstract:
Late gadolinium enhancement (LGE) cardiac MRI (CMR) is the clinical standard for diagnosis of myocardial scar. 3D isotropic LGE CMR provides improved coverage and resolution compared to 2D imaging. However, image acceleration is required due to long scan times and contrast washout. Physics-guided deep learning (PG-DL) approaches have recently emerged as an improved accelerated MRI strategy. Training of PG-DL methods is typically performed in a supervised manner requiring fully-sampled data as reference, which is challenging in 3D LGE CMR. Recently, a self-supervised learning approach was proposed to enable training PG-DL techniques without fully-sampled data. In this work, we extend this self-supervised learning approach to 3D imaging, while tackling challenges related to small training database sizes of 3D volumes. Results and a reader study on prospectively accelerated 3D LGE show that the proposed approach at 6-fold acceleration outperforms the clinically utilized compressed sensing approach at 3-fold acceleration.
Submitted 18 November, 2020;
originally announced November 2020.
-
Improved Supervised Training of Physics-Guided Deep Learning Image Reconstruction with Multi-Masking
Authors:
Burhaneddin Yaman,
Seyed Amir Hossein Hosseini,
Steen Moeller,
Mehmet Akçakaya
Abstract:
Physics-guided deep learning (PG-DL) via algorithm unrolling has received significant interest for improved image reconstruction, including MRI applications. These methods unroll an iterative optimization algorithm into a series of regularizer and data consistency units. The unrolled networks are typically trained end-to-end using a supervised approach. Current supervised PG-DL approaches use all of the available sub-sampled measurements in their data consistency units. Thus, the network learns to fit the rest of the measurements. In this study, we propose to improve the performance and robustness of supervised training by utilizing randomness by retrospectively selecting only a subset of all the available measurements for data consistency units. The process is repeated multiple times using different random masks during training for further enhancement. Results on knee MRI show that the proposed multi-mask supervised PG-DL enhances reconstruction performance compared to conventional supervised PG-DL approaches.
Submitted 26 October, 2020;
originally announced October 2020.
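The multi-masking idea in the abstract above (retrospectively keeping only a random subset of the acquired measurements for the data consistency units, repeated with different random masks) can be sketched as follows. This is an illustrative assumption-laden example: the abstract does not say how the subsets are drawn, so uniform random selection, the function name `make_sub_masks`, and the parameters `num_masks` and `keep_fraction` are all hypothetical.

```python
import numpy as np

def make_sub_masks(sampling_mask, num_masks, keep_fraction, seed=0):
    """Draw several random subsets of the acquired k-space locations.

    sampling_mask: boolean array, True where a measurement was acquired.
    Returns a list of boolean masks, each a subset of sampling_mask.
    Uniform random selection is an assumption, not the paper's stated scheme.
    """
    rng = np.random.default_rng(seed)
    acquired = np.flatnonzero(sampling_mask)
    n_keep = int(round(keep_fraction * acquired.size))
    sub_masks = []
    for _ in range(num_masks):
        # Sample without replacement so each kept location is unique.
        keep = rng.choice(acquired, size=n_keep, replace=False)
        mask = np.zeros(sampling_mask.size, dtype=bool)
        mask[keep] = True
        sub_masks.append(mask.reshape(sampling_mask.shape))
    return sub_masks

# Toy 8x8 sampling pattern: every other column acquired (32 locations).
omega = np.zeros((8, 8), dtype=bool)
omega[:, ::2] = True

sub_masks = make_sub_masks(omega, num_masks=3, keep_fraction=0.75)
for k, m in enumerate(sub_masks):
    print(f"mask {k}: {int(m.sum())} of {int(omega.sum())} acquired locations kept")
```

During training, each sub-mask would feed the data consistency units in turn, while the loss is still computed against the supervised reference as described in the abstract.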
-
Effect of Language Proficiency on Subjective Evaluation of Noise Suppression Algorithms
Authors:
Babak Naderi,
Gabriel Mittag,
Rafael Zequeira Jiménez,
Sebastian Möller
Abstract:
Speech communication systems based on Voice-over-IP technology are frequently used by native as well as non-native speakers of a target language, e.g. in international phone calls or telemeetings. Frequently, such calls also occur in a noisy environment, making noise suppression modules necessary to increase perceived quality of experience. Whereas standard tests for assessing perceived quality make use of native listeners, we assume that noise-reduced speech and residual noise may affect native and non-native listeners of a target language in different ways. To test this assumption, we report results of two subjective tests conducted with English and German native listeners who judge the quality of speech samples recorded by native English, German, and Mandarin speakers, which are degraded with different background noise levels and noise suppression effects. The experiments were conducted following the standardized ITU-T Rec. P.835 approach, however implemented in a crowdsourcing setting according to ITU-T Rec. P.808. Our results show a significant influence of language on speech signal ratings and, consequently, on the overall perceived quality in specific conditions.
Submitted 25 October, 2020;
originally announced October 2020.
-
A Self-Decoupled 32 Channel Receive Array for Human Brain Magnetic Resonance Imaging at 10.5T
Authors:
Nader Tavaf,
Russell L. Lagore,
Steve Jungst,
Shajan Gunamony,
Jerahmie Radder,
Andrea Grant,
Steen Moeller,
Edward Auerbach,
Kamil Ugurbil,
Gregor Adriany,
Pierre-Francois Van de Moortele
Abstract:
Purpose: Receive array layout, noise mitigation and B0 field strength are crucial contributors to signal-to-noise ratio (SNR) and parallel imaging performance. Here, we investigate SNR and parallel imaging gains at 10.5 Tesla (T) compared to 7T using 32-channel receive arrays at both fields. Methods: A self-decoupled 32-channel receive array for human brain imaging at 10.5T (10.5T-32Rx), consisting of 31 loops and one cloverleaf element, was co-designed and built in tandem with a 16-channel dual-row loop transmitter. Novel receive array design and self-decoupling techniques were implemented. Parallel imaging performance, in terms of SNR and noise amplification (g-factor), of the 10.5T-32Rx was compared to the performance of an industry-standard 32-channel receiver at 7T (7T-32Rx) via experimental phantom measurements. Results: Compared to the 7T-32Rx, the 10.5T-32Rx provided 1.46 times the central SNR and 2.08 times the peripheral SNR. Minimum inverse g-factor value of the 10.5T-32Rx (min(1/g) = 0.56) was 51% higher than that of the 7T-32Rx (min(1/g) = 0.37) with R=4x4 2D acceleration, resulting in significantly enhanced parallel imaging performance at 10.5T compared to 7T. The g-factor values of 10.5T-32Rx were on par with those of a 64-channel receiver at 7T, e.g. 1.8 versus 1.9, respectively, with R=4x4 axial acceleration. Conclusion: Experimental measurements demonstrated effective self-decoupling of the receive array as well as substantial gains in SNR and parallel imaging performance at 10.5T compared to 7T.
Submitted 9 November, 2020; v1 submitted 15 September, 2020;
originally announced September 2020.
-
Multi-Mask Self-Supervised Learning for Physics-Guided Neural Networks in Highly Accelerated MRI
Authors:
Burhaneddin Yaman,
Hongyi Gu,
Seyed Amir Hossein Hosseini,
Omer Burak Demirel,
Steen Moeller,
Jutta Ellermann,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
Self-supervised learning has shown great promise due to its capability to train deep learning MRI reconstruction methods without fully-sampled data. Current self-supervised learning methods for physics-guided reconstruction networks split acquired undersampled data into two disjoint sets, where one is used for data consistency (DC) in the unrolled network and the other to define the training loss. In this study, we propose an improved self-supervised learning strategy that more efficiently uses the acquired data to train a physics-guided reconstruction network without a database of fully-sampled data. The proposed multi-mask self-supervised learning via data undersampling (SSDU) applies a hold-out masking operation on acquired measurements to split it into multiple pairs of disjoint sets for each training sample, while using one of these pairs for DC units and the other for defining loss, thereby more efficiently using the undersampled data. Multi-mask SSDU is applied on fully-sampled 3D knee and prospectively undersampled 3D brain MRI datasets, for various acceleration rates and patterns, and compared to CG-SENSE and single-mask SSDU DL-MRI, as well as supervised DL-MRI when fully-sampled data is available. Results on knee MRI show that the proposed multi-mask SSDU outperforms SSDU and performs closely with supervised DL-MRI. A clinical reader study further ranks the multi-mask SSDU higher than supervised DL-MRI in terms of SNR and aliasing artifacts. Results on brain MRI show that multi-mask SSDU achieves better reconstruction quality compared to SSDU. Reader study demonstrates that multi-mask SSDU at R=8 significantly improves reconstruction compared to single-mask SSDU at R=8, as well as CG-SENSE at R=2.
Submitted 8 June, 2022; v1 submitted 13 August, 2020;
originally announced August 2020.
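The hold-out masking described in the abstract above can be illustrated with a minimal sketch. It assumes the acquired measurements are identified by integer k-space line indices and that each mask is drawn uniformly at random; the function name `multi_mask_split`, the 40% loss fraction, and the three masks are hypothetical choices for illustration, not values taken from the paper.

```python
import random


def multi_mask_split(sampled, num_masks=3, loss_fraction=0.4, seed=0):
    """Return num_masks (dc_set, loss_set) pairs.

    Each pair partitions the acquired indices into two disjoint sets:
    one for the data consistency (DC) units, one for defining the loss.
    Uniform random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_masks):
        shuffled = list(sampled)
        rng.shuffle(shuffled)
        n_loss = round(loss_fraction * len(shuffled))
        loss_set = frozenset(shuffled[:n_loss])   # held out, used only in the loss
        dc_set = frozenset(shuffled[n_loss:])     # fed to the DC units
        pairs.append((dc_set, loss_set))
    return pairs


# Hypothetical acquisition: every 4th of 256 k-space lines was measured.
sampled = list(range(0, 256, 4))
pairs = multi_mask_split(sampled)
```

Every pair reuses the same acquired data with a different partition, which is the sense in which the multi-mask variant uses the undersampled measurements more efficiently than a single split.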
-
High-Fidelity Accelerated MRI Reconstruction by Scan-Specific Fine-Tuning of Physics-Based Neural Networks
Authors:
Seyed Amir Hossein Hosseini,
Burhaneddin Yaman,
Steen Moeller,
Mehmet Akçakaya
Abstract:
Long scan duration remains a challenge for high-resolution MRI. Deep learning has emerged as a powerful means for accelerated MRI reconstruction by providing data-driven regularizers that are directly learned from data. These data-driven priors typically remain unchanged for future data in the testing phase once they are learned during training. In this study, we propose to use a transfer learning approach to fine-tune these regularizers for new subjects using a self-supervision approach. While the proposed approach can compromise the extremely fast reconstruction time of deep learning MRI methods, our results on knee MRI indicate that such adaptation can substantially reduce the remaining artifacts in reconstructed images. In addition, the proposed approach has the potential to reduce the risks of generalization to rare pathological conditions, which may be unavailable in the training data.
Submitted 12 May, 2020;
originally announced May 2020.
-
Towards Deep Learning Methods for Quality Assessment of Computer-Generated Imagery
Authors:
Markus Utke,
Saman Zadtootaghaj,
Steven Schmidt,
Sebastian Möller
Abstract:
Video gaming streaming services are growing rapidly due to new services such as passive video streaming, e.g. Twitch.tv, and cloud gaming, e.g. Nvidia Geforce Now. In contrast to traditional video content, gaming content has special characteristics such as extremely high motion for some games, special motion patterns, synthetic content and repetitive content, which make state-of-the-art video and image quality metrics perform worse on this special computer-generated content. In this paper, we outline our plan to build a deep learning-based quality metric for video gaming quality assessment. In addition, we present initial results obtained by training the network with VMAF values as ground truth, to give some insight into how to build such a metric in the future. The paper describes the method that is used to choose an appropriate Convolutional Neural Network architecture. Furthermore, we estimate the size of the subjective quality dataset required to achieve a sufficiently high performance. The results show that by taking around 5k images for training of the last six modules of Xception, we can obtain a relatively high-performance metric to assess the quality of distorted video games.
Submitted 2 May, 2020;
originally announced May 2020.
-
Multi-episodic Perceived Quality of an Audio-on-Demand Service
Authors:
Dennis Guse,
Oliver Hohlfeld,
Anna Wunderlich,
Benjamin Weiss,
Sebastian Möller
Abstract:
QoE is traditionally evaluated by using short stimuli usually representing parts or single usage episodes. This opens the question of how the overall service perception involving multiple usage episodes can be evaluated, a question of high practical relevance to service operators. Despite initial research on this challenging aspect of multi-episodic perceived quality, the underlying quality formation processes and their factors are still to be discovered. We present a multi-episodic experiment of an Audio on Demand service over a usage period of 6 days with 93 participants. Our work directly extends prior work investigating the impact of time between usage episodes. The results show similar effects; also, the recency effect is not statistically significant. In addition, we extend prediction of multi-episodic judgments by accounting for the observed saturation.
Submitted 1 May, 2020;
originally announced May 2020.
-
Self-Supervised Learning of Physics-Guided Reconstruction Neural Networks without Fully-Sampled Reference Data
Authors:
Burhaneddin Yaman,
Seyed Amir Hossein Hosseini,
Steen Moeller,
Jutta Ellermann,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
Purpose: To develop a strategy for training a physics-guided MRI reconstruction neural network without a database of fully-sampled datasets. Theory and Methods: Self-supervised learning via data under-sampling (SSDU) for physics-guided deep learning (DL) reconstruction partitions available measurements into two disjoint sets, one of which is used in the data consistency units in the unrolled network and the other is used to define the loss for training. The proposed training without fully-sampled data is compared to fully-supervised training with ground-truth data, as well as conventional compressed sensing and parallel imaging methods using the publicly available fastMRI knee database. The same physics-guided neural network is used for both proposed SSDU and supervised training. The SSDU training is also applied to prospectively 2-fold accelerated high-resolution brain datasets at different acceleration rates, and compared to parallel imaging. Results: Results on five different knee sequences at an acceleration rate of 4 show that the proposed self-supervised approach performs closely with supervised learning, while significantly outperforming conventional compressed sensing and parallel imaging, as characterized by quantitative metrics and a clinical reader study. The results on prospectively sub-sampled brain datasets, where supervised learning cannot be employed due to lack of ground-truth reference, show that the proposed self-supervised approach successfully performs reconstruction at high acceleration rates (4, 6 and 8). Image readings indicate improved visual reconstruction quality with the proposed approach compared to parallel imaging at acquisition acceleration. Conclusion: The proposed SSDU approach allows training of physics-guided DL-MRI reconstruction without fully-sampled data, while achieving comparable results with supervised DL-MRI trained on fully-sampled data.
Submitted 14 April, 2020; v1 submitted 16 December, 2019;
originally announced December 2019.
-
Dense Recurrent Neural Networks for Accelerated MRI: History-Cognizant Unrolling of Optimization Algorithms
Authors:
Seyed Amir Hossein Hosseini,
Burhaneddin Yaman,
Steen Moeller,
Mingyi Hong,
Mehmet Akçakaya
Abstract:
Inverse problems for accelerated MRI typically incorporate domain-specific knowledge about the forward encoding operator in a regularized reconstruction framework. Recently physics-driven deep learning (DL) methods have been proposed to use neural networks for data-driven regularization. These methods unroll iterative optimization algorithms to solve the inverse problem objective function, by alternating between domain-specific data consistency and data-driven regularization via neural networks. The whole unrolled network is then trained end-to-end to learn the parameters of the network. Due to simplicity of data consistency updates with gradient descent steps, proximal gradient descent (PGD) is a common approach to unroll physics-driven DL reconstruction methods. However, PGD methods have slow convergence rates, necessitating a higher number of unrolled iterations, leading to memory issues in training and slower reconstruction times in testing. Inspired by efficient variants of PGD methods that use a history of the previous iterates, we propose a history-cognizant unrolling of the optimization algorithm with dense connections across iterations for improved performance. In our approach, the gradient descent steps are calculated at a trainable combination of the outputs of all the previous regularization units. We also apply this idea to unrolling variable splitting methods with quadratic relaxation. Our results in reconstruction of the fastMRI knee dataset show that the proposed history-cognizant approach reduces residual aliasing artifacts compared to its conventional unrolled counterpart without requiring extra computational power or increasing reconstruction time.
Submitted 8 July, 2020; v1 submitted 16 December, 2019;
originally announced December 2019.
-
Self-Supervised Physics-Based Deep Learning MRI Reconstruction Without Fully-Sampled Data
Authors:
Burhaneddin Yaman,
Seyed Amir Hossein Hosseini,
Steen Moeller,
Jutta Ellermann,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
Deep learning (DL) has emerged as a tool for improving accelerated MRI reconstruction. A common strategy among DL methods is the physics-based approach, where a regularized iterative algorithm alternating between data consistency and a regularizer is unrolled for a finite number of iterations. This unrolled network is then trained end-to-end in a supervised manner, using fully-sampled data as ground truth for the network output. However, in a number of scenarios, it is difficult to obtain fully-sampled datasets, due to physiological constraints such as organ motion or physical constraints such as signal decay. In this work, we tackle this issue and propose a self-supervised learning strategy that enables physics-based DL reconstruction without fully-sampled data. Our approach is to divide the acquired sub-sampled points for each scan into training and validation subsets. During training, data consistency is enforced over the training subset, while the validation subset is used to define the loss function. Results show that the proposed self-supervised learning method successfully reconstructs images without fully-sampled data, performing similarly to the supervised approach that is trained with fully-sampled references. This has implications for physics-based inverse problem approaches for other settings, where fully-sampled data is not available or possible to acquire.
Submitted 20 October, 2019;
originally announced October 2019.
-
Accelerated Coronary MRI with sRAKI: A Database-Free Self-Consistent Neural Network k-space Reconstruction for Arbitrary Undersampling
Authors:
Seyed Amir Hossein Hosseini,
Chi Zhang,
Sebastian Weingärtner,
Steen Moeller,
Matthias Stuber,
Kâmil Uğurbil,
Mehmet Akçakaya
Abstract:
This study aims to accelerate coronary MRI using a novel reconstruction algorithm, called self-consistent robust artificial-neural-networks for k-space interpolation (sRAKI). sRAKI performs iterative parallel imaging reconstruction by enforcing coil self-consistency using subject-specific neural networks. This approach extends the linear convolutions in SPIRiT to nonlinear interpolation using convolutional neural networks (CNNs). These CNNs are trained individually for each scan using the scan-specific autocalibrating signal (ACS) data. Reconstruction is performed by imposing the learned self-consistency and data-consistency enabling sRAKI to support random undersampling patterns. Fully-sampled targeted right coronary artery MRI was acquired in six healthy subjects for evaluation. The data were retrospectively undersampled, and reconstructed using SPIRiT, $\ell_1$-SPIRiT and sRAKI for acceleration rates of 2 to 5. Additionally, prospectively undersampled whole-heart coronary MRI was acquired to further evaluate performance. The results indicate that sRAKI reduces noise amplification and blurring artifacts compared with SPIRiT and $\ell_1$-SPIRiT, especially at high acceleration rates in targeted data. Quantitative analysis shows that sRAKI improves normalized mean-squared-error (~44% and ~21% over SPIRiT and $\ell_1$-SPIRiT at rate 5) and vessel sharpness (~10% and ~20% over SPIRiT and $\ell_1$-SPIRiT at rate 5). In addition, whole-heart data shows the sharpest coronary arteries when resolved using sRAKI, with 11% and 15% improvement in vessel sharpness over SPIRiT and $\ell_1$-SPIRiT, respectively. Thus, sRAKI is a database-free neural network-based reconstruction technique that may further accelerate coronary MRI with arbitrary undersampling patterns, while improving noise resilience over linear parallel imaging and image sharpness over $\ell_1$ regularization techniques.
Submitted 18 July, 2019;
originally announced July 2019.
-
Deep Learning Methods for Parallel Magnetic Resonance Image Reconstruction
Authors:
Florian Knoll,
Kerstin Hammernik,
Chi Zhang,
Steen Moeller,
Thomas Pock,
Daniel K. Sodickson,
Mehmet Akcakaya
Abstract:
Following the success of deep learning in a wide range of applications, neural network-based machine learning techniques have received interest as a means of accelerating magnetic resonance imaging (MRI). A number of ideas inspired by deep learning techniques from computer vision and image processing have been successfully applied to non-linear image reconstruction in the spirit of compressed sensing for both low-dose computed tomography and accelerated MRI. The additional integration of multi-coil information to recover missing k-space lines in the MRI reconstruction process is still studied less frequently, even though it is the de facto standard for currently used accelerated MR acquisitions. This manuscript provides an overview of the recent machine learning approaches that have been proposed specifically for improving parallel imaging. A general background introduction to parallel MRI is given that is structured around the classical view of image space and k-space based methods. Both linear and non-linear methods are covered, followed by a discussion of recent efforts to further improve parallel imaging using machine learning, and specifically using artificial neural networks. Image-domain based techniques that introduce improved regularizers are covered as well as k-space based methods, where the focus is on better interpolation strategies using neural networks. Issues and open problems are discussed as well as recent efforts for producing open datasets and benchmarks for the community.
Submitted 1 April, 2019;
originally announced April 2019.
-
LIFO-Backpressure Achieves Near Optimal Utility-Delay Tradeoff
Authors:
Longbo Huang,
Scott Moeller,
Michael J. Neely,
Bhaskar Krishnamachari
Abstract:
There has been considerable recent work developing a new stochastic network utility maximization framework using Backpressure algorithms, also known as MaxWeight. A key open problem has been the development of utility-optimal algorithms that are also delay efficient. In this paper, we show that the Backpressure algorithm, when combined with the LIFO queueing discipline (called LIFO-Backpressure), is able to achieve a utility that is within $O(1/V)$ of the optimal value, while maintaining an average delay of $O([\log(V)]^2)$ for all but a tiny fraction of the network traffic. This result holds for general stochastic network optimization problems and general Markovian dynamics. Remarkably, the performance of LIFO-Backpressure can be achieved by simply changing the queueing discipline; it requires no other modifications of the original Backpressure algorithm. We validate the results through empirical measurements from a sensor network testbed, which show good match between theory and practice.
Submitted 3 April, 2011; v1 submitted 28 August, 2010;
originally announced August 2010.
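The intuition behind the abstract above, that changing only the queueing discipline lowers delay for all but a small fraction of traffic, can be illustrated with a toy single-queue comparison. This is not the Backpressure algorithm itself: the standing backlog of 50 packets, the one-arrival-one-service-per-slot pattern, and the function name `simulate` are all assumptions made for this sketch.

```python
from collections import deque


def simulate(discipline, backlog=50, slots=1000):
    """Toy queue: `backlog` packets are present at time 0, then exactly
    one packet arrives and one packet is served in every slot.

    Returns the delays of the served packets and the number of packets
    still queued at the end.
    """
    queue = deque(0 for _ in range(backlog))  # each entry is an arrival time
    delays = []
    for t in range(1, slots + 1):
        queue.append(t)
        if discipline == "LIFO":
            arrived = queue.pop()       # serve the newest packet
        else:
            arrived = queue.popleft()   # FIFO: serve the oldest packet
        delays.append(t - arrived)
    return delays, len(queue)


fifo_delays, fifo_left = simulate("FIFO")
lifo_delays, lifo_left = simulate("LIFO")
print("FIFO mean delay:", sum(fifo_delays) / len(fifo_delays))
print("LIFO mean delay:", sum(lifo_delays) / len(lifo_delays))
print("Packets left unserved under LIFO:", lifo_left)
```

In this toy setting every new packet under FIFO waits behind the standing backlog, whereas under LIFO new packets are served immediately and only the initial backlog is left waiting. Those stranded packets play the role of the small fraction of traffic that the delay bound excludes.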