-
Deformable Image Registration of Dark-Field Chest Radiographs for Local Lung Signal Change Assessment
Authors:
Fabian Drexel,
Vasiliki Sideri-Lampretsa,
Henriette Bast,
Alexander W. Marka,
Thomas Koehler,
Florian T. Gassert,
Daniela Pfeiffer,
Daniel Rueckert,
Franz Pfeiffer
Abstract:
Dark-field radiography of the human chest has been demonstrated to have promising potential for the analysis of the lung microstructure and the diagnosis of respiratory diseases. However, previous studies of dark-field chest radiographs evaluated the lung signal only in the inspiratory breathing state. Our work aims to add a new perspective to these previous assessments by locally comparing dark-f…
▽ More
Dark-field radiography of the human chest has been demonstrated to have promising potential for the analysis of the lung microstructure and the diagnosis of respiratory diseases. However, previous studies of dark-field chest radiographs evaluated the lung signal only in the inspiratory breathing state. Our work aims to add a new perspective to these previous assessments by locally comparing dark-field lung information between different respiratory states. To this end, we discuss suitable image registration methods for dark-field chest radiographs to enable consistent spatial alignment of the lung in distinct breathing states. Utilizing full inspiration and expiration scans from a clinical chronic obstructive pulmonary disease study, we assess the performance of the proposed registration framework and outline applicable evaluation approaches. Our regional characterization of lung dark-field signal changes between the breathing states provides a proof-of-principle that dynamic radiography-based lung function assessment approaches may benefit from considering registered dark-field images in addition to standard plain chest radiographs.
△ Less
Submitted 18 January, 2025;
originally announced January 2025.
-
Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis
Authors:
Prabhav Agrawal,
Thilo Koehler,
Zhiping Xiu,
Prashant Serai,
Qing He
Abstract:
Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run real-time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and therefore, is a magnitude faster than any neural vocoder. A DSP vocoder o…
▽ More
Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run real-time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and therefore, is a magnitude faster than any neural vocoder. A DSP vocoder often gets a lower audio quality due to consuming over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders
Authors:
Jason Fong,
Yun Wang,
Prabhav Agrawal,
Vimal Manohar,
Jilong Wu,
Thilo Köhler,
Qing He
Abstract:
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model…
▽ More
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the edited synthetic speech and unedited original recording in the zero-shot setting.
△ Less
Submitted 28 October, 2022;
originally announced October 2022.
-
3D helical CT Reconstruction with a Memory Efficient Learned Primal-Dual Architecture
Authors:
Jevgenija Rudzusika,
Buda Bajić,
Thomas Koehler,
Ozan Öktem
Abstract:
Deep learning based computed tomography (CT) reconstruction has demonstrated outstanding performance on simulated 2D low-dose CT data. This applies in particular to domain adapted neural networks, which incorporate a handcrafted physics model for CT imaging. Empirical evidence shows that employing such architectures reduces the demand for training data and improves upon generalisation. However, th…
▽ More
Deep learning based computed tomography (CT) reconstruction has demonstrated outstanding performance on simulated 2D low-dose CT data. This applies in particular to domain adapted neural networks, which incorporate a handcrafted physics model for CT imaging. Empirical evidence shows that employing such architectures reduces the demand for training data and improves upon generalisation. However, their training requires large computational resources that quickly become prohibitive in 3D helical CT, which is the most common acquisition geometry used for medical imaging. Furthermore, clinical data also comes with other challenges not accounted for in simulations, like errors in flux measurement, resolution mismatch and, most importantly, the absence of the real ground truth. The necessity to have a computationally feasible training combined with the need to address these issues has made it difficult to evaluate deep learning based reconstruction on clinical 3D helical CT. This paper modifies a domain adapted neural network architecture, the Learned Primal-Dual (LPD), so that it can be trained and applied to reconstruction in this setting. We achieve this by splitting the helical trajectory into sections and applying the unrolled LPD iterations to those sections sequentially. To the best of our knowledge, this work is the first to apply an unrolled deep learning architecture for reconstruction on full-sized clinical data, like those in the Low dose CT image and projection data set (LDCT). Moreover, training and testing is done on a single GPU card with 24GB of memory.
△ Less
Submitted 28 November, 2023; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Deep learning based dictionary learning and tomographic image reconstruction
Authors:
Jevgenija Rudzusika,
Thomas Koehler,
Ozan Öktem
Abstract:
This work presents an approach for image reconstruction in clinical low-dose tomography that combines principles from sparse signal processing with ideas from deep learning. First, we describe sparse signal representation in terms of dictionaries from a statistical perspective and interpret dictionary learning as a process of aligning distribution that arises from a generative model with empirical…
▽ More
This work presents an approach for image reconstruction in clinical low-dose tomography that combines principles from sparse signal processing with ideas from deep learning. First, we describe sparse signal representation in terms of dictionaries from a statistical perspective and interpret dictionary learning as a process of aligning distribution that arises from a generative model with empirical distribution of true signals. As a result we can see that sparse coding with learned dictionaries resembles a specific variational autoencoder, where the decoder is a linear function and the encoder is a sparse coding algorithm. Next, we show that dictionary learning can also benefit from computational advancements introduced in the context of deep learning, such as parallelism and as stochastic optimization. Finally, we show that regularization by dictionaries achieves competitive performance in computed tomography (CT) reconstruction comparing to state-of-the-art model based and data driven approaches.
△ Less
Submitted 26 August, 2021;
originally announced August 2021.
-
Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling
Authors:
Qing He,
Zhiping Xiu,
Thilo Koehler,
Jilong Wu
Abstract:
Typical high quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term (BLSTM) units. While these models can produce high quality speech,…
▽ More
Typical high quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term (BLSTM) units. While these models can produce high quality speech, they often incur O($L$) increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs leads to longer delay and slower synthesis speed, limiting its use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to groundtruth 4.48), low latency, and low RTF at the same time. Meanwhile, both latency and RTF of the proposed system stay constant regardless of input lengths, making it ideal for real-time applications.
△ Less
Submitted 1 April, 2021;
originally announced April 2021.
-
FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge
Authors:
Bichen Wu,
Qing He,
Peizhao Zhang,
Thilo Koehler,
Kurt Keutzer,
Peter Vajda
Abstract:
Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal…
▽ More
Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.
△ Less
Submitted 25 November, 2020;
originally announced November 2020.
-
X-ray dark-field signal reduction due to hardening of the visibility spectrum
Authors:
Fabio De Marco,
Jana Andrejewski,
Theresa Urban,
Konstantin Willer,
Lukas Gromann,
Thomas Koehler,
Hanns-Ingo Maack,
Julia Herzen,
Franz Pfeiffer
Abstract:
X-ray dark-field imaging enables a spatially-resolved visualization of ultra-small-angle X-ray scattering. Using phantom measurements, we demonstrate that a material's effective dark-field signal may be reduced by modification of the visibility spectrum by other dark-field-active objects in the beam. This is the dark-field equivalent of conventional beam-hardening, and is distinct from related, kn…
▽ More
X-ray dark-field imaging enables a spatially-resolved visualization of ultra-small-angle X-ray scattering. Using phantom measurements, we demonstrate that a material's effective dark-field signal may be reduced by modification of the visibility spectrum by other dark-field-active objects in the beam. This is the dark-field equivalent of conventional beam-hardening, and is distinct from related, known effects, where the dark-field signal is modified by attenuation or phase shifts. We present a theoretical model for this group of effects and verify it by comparison to the measurements. These findings have significant implications for the interpretation of dark-field signal strength in polychromatic measurements.
△ Less
Submitted 3 November, 2023; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Interactive Text-to-Speech System via Joint Style Analysis
Authors:
Yang Gao,
Weiyi Zheng,
Zhaojun Yang,
Thilo Kohler,
Christian Fuegen,
Qing He
Abstract:
While modern TTS technologies have made significant advancements in audio quality, there is still a lack of behavior naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the speech query style. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then…
▽ More
While modern TTS technologies have made significant advancements in audio quality, there is still a lack of behavior naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the speech query style. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then used by the TTS to produce a matching response. We faced two main challenges: 1) only a small portion of the TTS training dataset has style labels, which is needed to train a multi-style TTS that respects different style embeddings during inference. 2) The TTS system and the style extraction model have disjoint training datasets. We need consistent style labels across these two datasets so that the TTS can learn to respect the labels produced by the style extraction model during inference. To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset and applied transfer learning to learn the style embedding jointly. Our experiment results show user preference for the styled TTS responses and demonstrate the style-embedded TTS system's capability of mimicking the speech query style.
△ Less
Submitted 21 September, 2020; v1 submitted 16 February, 2020;
originally announced February 2020.
-
Merging-ISP: Multi-Exposure High Dynamic Range Image Signal Processing
Authors:
Prashant Chaudhari,
Franziska Schirrmacher,
Andreas Maier,
Christian Riess,
Thomas Köhler
Abstract:
High dynamic range (HDR) imaging combines multiple images with different exposure times into a single high-quality image. The image signal processing pipeline (ISP) is a core component in digital cameras to perform these operations. It includes demosaicing of raw color filter array (CFA) data at different exposure times, alignment of the exposures, conversion to HDR domain, and exposure merging in…
▽ More
High dynamic range (HDR) imaging combines multiple images with different exposure times into a single high-quality image. The image signal processing pipeline (ISP) is a core component in digital cameras to perform these operations. It includes demosaicing of raw color filter array (CFA) data at different exposure times, alignment of the exposures, conversion to HDR domain, and exposure merging into an HDR image. Traditionally, such pipelines cascade algorithms that address these individual subtasks. However, cascaded designs suffer from error propagation, since simply combining multiple steps is not necessarily optimal for the entire imaging task. This paper proposes a multi-exposure HDR image signal processing pipeline (Merging-ISP) to jointly solve all these subtasks. Our pipeline is modeled by a deep neural network architecture. As such, it is end-to-end trainable, circumvents the use of hand-crafted and potentially complex algorithms, and mitigates error propagation. Merging-ISP enables direct reconstructions of HDR images of dynamic scenes from multiple raw CFA images with different exposures. We compare Merging-ISP against several state-of-the-art cascaded pipelines. The proposed method provides HDR reconstructions of high perceptual quality and it quantitatively outperforms competing ISPs by more than 1 dB in terms of PSNR.
△ Less
Submitted 4 October, 2021; v1 submitted 12 November, 2019;
originally announced November 2019.
-
G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR
Authors:
Duc Le,
Thilo Koehler,
Christian Fuegen,
Michael L. Seltzer
Abstract:
Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel…
▽ More
Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces Word Error Rate by 3% to 11% relative over a strong graphemic baseline and bridges the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.
△ Less
Submitted 13 February, 2020; v1 submitted 22 October, 2019;
originally announced October 2019.