-
SUSEP-Net: Simulation-Supervised and Contrastive Learning-based Deep Neural Networks for Susceptibility Source Separation
Authors:
Min Li,
Chen Chen,
Zhenghao Li,
Yin Liu,
Shanshan Shan,
Peng Wu,
Pengfei Rong,
Feng Liu,
G. Bruce Pike,
Alan H. Wilman,
Hongfu Sun,
Yang Gao
Abstract:
Quantitative susceptibility mapping (QSM) provides a valuable tool for quantifying susceptibility distributions in human brains; however, two types of opposing susceptibility sources (i.e., paramagnetic and diamagnetic) may coexist in a single voxel and cancel each other out in net QSM images. Susceptibility source separation techniques enable the extraction of sub-voxel information from QSM maps. This study proposes a novel SUSEP-Net for susceptibility source separation by training a dual-branch U-net with a simulation-supervised training strategy. In addition, a contrastive learning framework is included to explicitly impose similarity-based constraints between the branch-specific guidance features in specially designed encoders and the latent features in the decoders. Comprehensive experiments were carried out on both simulated and in vivo data, including healthy subjects and patients with pathological conditions, to compare SUSEP-Net with three state-of-the-art susceptibility source separation methods (i.e., APART-QSM, χ-separation, and χ-sepnet). SUSEP-Net consistently showed improved results compared with the other three methods, with better numerical metrics, improved high-intensity hemorrhage and calcification lesion contrasts, and reduced artifacts in brains with pathological conditions. In addition, experiments on agarose gel phantom data were conducted to validate the accuracy and the generalization capability of SUSEP-Net.
Submitted 16 June, 2025;
originally announced June 2025.
-
Movable-Antenna Array Enhanced Downlink NOMA
Authors:
Nianzu Li,
Peiran Wu,
Lipeng Zhu,
Derrick Wing Kwan Ng
Abstract:
Movable antenna (MA) has gained increasing attention in the field of wireless communications due to its exceptional capability to proactively reconfigure wireless channels via localized antenna movements. In this paper, we investigate the resource allocation design for an MA array-enabled base station serving multiple single-antenna users in a downlink non-orthogonal multiple access (NOMA) system. We aim to maximize the sum rate of all users by jointly optimizing the transmit beamforming and the positions of all MAs at the BS, subject to the constraints of transmit power budget, finite antenna moving region, and the conditions for successive interference cancellation decoding rate. The formulated problem, inherently highly non-convex, is addressed by successive convex approximation (SCA) and alternating optimization methods to obtain a high-quality suboptimal solution. Simulation results unveil that the proposed MA-enhanced downlink NOMA system can significantly improve the sum rate performance compared to both the fixed-position antenna (FPA) system and the traditional orthogonal multiple access (OMA) system.
Submitted 12 June, 2025;
originally announced June 2025.
-
NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction
Authors:
Qichao Wang,
Ziqiao Meng,
Wenqian Cui,
Yifei Zhang,
Pengcheng Wu,
Bingzhe Wu,
Irwin King,
Liang Chen,
Peilin Zhao
Abstract:
Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
Submitted 11 June, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results
Authors:
Sangmin Lee,
Eunpil Park,
Angel Canelo,
Hyunhee Park,
Youngjo Kim,
Hyung-Ju Chun,
Xin Jin,
Chongyi Li,
Chun-Le Guo,
Radu Timofte,
Qi Wu,
Tianheng Qiu,
Yuchun Dong,
Shenglin Ding,
Guanghua Pan,
Weiyu Zhou,
Tao Hu,
Yixu Feng,
Duwei Dai,
Yu Cao,
Peng Wu,
Wei Dong,
Yanning Zhang,
Qingsen Yan,
Simon J. Larsen
, et al. (11 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.
Submitted 17 May, 2025;
originally announced May 2025.
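The PSNR figure quoted in the abstract above relates to mean squared error through a standard formula. The following is a minimal sketch, assuming pixel values normalized to [0, 1] and flattened into plain sequences; the challenge's actual evaluation protocol (color space, tone mapping, averaging across scenes) is not stated in the abstract, and the function name `psnr` is illustrative only.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `peak` is the maximum possible pixel value (1.0 for normalized images;
    an assumption here, not a detail taken from the challenge).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Under this definition, every 10 dB gain corresponds to a tenfold reduction in mean squared error.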
-
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
Authors:
Zhifei Xie,
Mingbao Lin,
Zihang Liu,
Pengcheng Wu,
Shuicheng Yan,
Chunyan Miao
Abstract:
Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling and QA generation, along with a structured CoT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings stress the central role of structured CoT training in advancing audio reasoning.
Submitted 4 March, 2025;
originally announced March 2025.
-
An Attention-Assisted Multi-Modal Data Fusion Model for Real-Time Estimation of Underwater Sound Velocity
Authors:
Pengfei Wu,
Wei Huang,
Yujie Shi,
Hao Zhang
Abstract:
The estimation of underwater sound velocity distribution serves as a critical basis for facilitating effective underwater communication and precise positioning, given that variations in sound velocity influence the path of signal transmission. Conventional techniques for the direct measurement of sound velocity, as well as methods that involve the inversion of sound velocity utilizing acoustic field data, necessitate on-site data collection. This requirement not only places high demands on device deployment, but also presents challenges in achieving real-time estimation of sound velocity distribution. In order to construct a real-time sound velocity field and eliminate the need for underwater on-site data measurement operations, we propose a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. The proposed model seeks to elucidate the inherent relationship between remote sensing sea surface temperature (SST) data, the primary component characteristics of historical SSPs, and their spatial coordinates. This is achieved by employing CNNs and attention mechanisms to extract local and global correlations from the input data, respectively. The ultimate objective is to facilitate a rapid and precise estimation of sound velocity distribution within a specified task area. Experimental results show that the method proposed in this paper has lower root mean square error (RMSE) and stronger robustness than other state-of-the-art methods.
Submitted 2 March, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Spatiotemporal Trajectory Tracking Method for Vehicles Incorporating Lead-Lag Judgement
Authors:
Yuan Li,
Xiang Dong,
Tao Li,
Junfeng Hao,
Xiaoxue Xu,
Sana Ullaha,
Yincai Cai,
Peng Wu,
Ting Peng
Abstract:
In the domain of intelligent transportation systems, especially within the context of autonomous vehicle control, the preemptive holistic collaborative system has been presented as a promising solution for substantially improving traffic efficiency and reducing the accident rate. To ensure that this system operates as intended, accurate tracking of the spatiotemporal trajectory is crucial, and minimizing the tracking error is a necessary step in this process. To this end, a novel lead-lag judgment mechanism is proposed. This mechanism quantifies the longitudinal positional deviation between the vehicle and the target trajectory over time; the deviation is then corrected with a real-time acceleration compensation strategy, which significantly enhances the accuracy and reliability of trajectory tracking. Real-vehicle experiments were conducted in a dedicated test field to validate the feasibility of this approach empirically. The obtained tracking data were subsequently processed using the lead-lag judgment mechanism, and the spatiotemporal error patterns between the vehicle and the target trajectory were analyzed under different alignments and speeds. Finally, using real highway speed and alignment data, we conducted comprehensive spatiotemporal trajectory tracking simulations. In both experiments and simulations, tracking errors remained within an acceptable range, and a reasonable spatiotemporal distance is given for the preemptive merging process on highway ramps. Overall, this study offers valuable insights for highway ramp merging safety. Future work can expand on these findings.
Submitted 6 February, 2025;
originally announced February 2025.
-
DDUNet: Dual Dynamic U-Net for Highly-Efficient Cloud Segmentation
Authors:
Yijie Li,
Hewei Wang,
Jinfeng Xu,
Puzhen Wu,
Yunzhong Xiao,
Shaofan Wang,
Soumyabrata Dev
Abstract:
Cloud segmentation amounts to separating cloud pixels from non-cloud pixels in an image. Current deep learning methods for cloud segmentation suffer from three issues: (a) a constrained receptive field due to the fixed size of the convolution kernel; (b) a lack of robustness across different scenarios; and (c) a large number of parameters, which limits real-time implementation. To address these issues, we propose a Dual Dynamic U-Net (DDUNet) for supervised cloud segmentation. The DDUNet adheres to a U-Net architecture and integrates two crucial modules: the dynamic multi-scale convolution (DMSC), which improves feature merging under different receptive fields, and the dynamic weights and bias generator (DWBG) in the classification layers, which enhances generalization ability. More importantly, owing to the use of depth-wise convolution, the DDUNet is a lightweight network that achieves 95.3% accuracy on the SWINySEG dataset with only 0.33M parameters, and achieves superior performance on three different configurations of the SWINySEG dataset in both accuracy and efficiency.
Submitted 25 January, 2025;
originally announced January 2025.
-
Movable Antenna Enhanced DF and AF Relaying Systems: Performance Analysis and Optimization
Authors:
Nianzu Li,
Weidong Mei,
Peiran Wu,
Boyu Ning,
Lipeng Zhu
Abstract:
Movable antenna (MA) has been deemed a promising technology to flexibly reconfigure wireless channels by adjusting the antenna positions in a given local region. In this paper, we investigate the application of the MA technology in both decode-and-forward (DF) and amplify-and-forward (AF) relaying systems, where a relay is equipped with multiple MAs to assist in the data transmission between two single-antenna nodes. For the DF relaying system, our objective is to maximize the achievable rate at the destination by jointly optimizing the positions of the MAs in two stages for receiving signals from the source and transmitting signals to the destination, respectively. To draw essential insights, we first derive a closed-form upper bound on the maximum achievable rate of the DF relaying system. Then, a low-complexity algorithm based on projected gradient ascent (PGA) and alternating optimization (AO) is proposed to solve the antenna position optimization problem. For the AF relaying system, our objective is to maximize the achievable rate by jointly optimizing the two-stage MA positions as well as the AF beamforming matrix at the relay, which results in a more challenging optimization problem due to the intricately coupled variables. To tackle this challenge, we first reveal the hidden separability among the antenna position optimization in the two stages and the beamforming optimization. Based on such separability, we derive a closed-form upper bound on the maximum achievable rate of the AF relaying system and propose a low-complexity algorithm to obtain a high-quality suboptimal solution to the considered problem. Simulation results validate the efficacy of our theoretical analysis and demonstrate the superiority of the MA-enhanced relaying systems to the conventional relaying systems with fixed-position antennas (FPAs) and other benchmark schemes.
Submitted 14 January, 2025;
originally announced January 2025.
-
Deep Speech Synthesis from Multimodal Articulatory Representations
Authors:
Peter Wu,
Bohan Yu,
Kevin Scheck,
Alan W Black,
Aditi S. Krishnapriyan,
Irene Y. Chen,
Tanja Schultz,
Shinji Watanabe,
Gopala K. Anumanchipalli
Abstract:
The amount of articulatory data available for training deep learning models is much less compared to acoustic speech data. In order to improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, utilizing our proposed transfer learning methods improves the MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
Submitted 17 December, 2024;
originally announced December 2024.
-
$\mathscr{H}_2$ Model Reduction for Linear Quantum Systems
Authors:
G. P. Wu,
S. Xue,
G. F. Zhang,
I. R. Petersen
Abstract:
In this paper, an $\mathscr{H}_2$ norm-based model reduction method for linear quantum systems is presented, which can obtain a physically realizable model with a reduced order for closely approximating the original system. The model reduction problem is described as an optimization problem, whose objective is taken as an $\mathscr{H}_2$ norm of the difference between the transfer function of the original system and that of the reduced one. Different from classical model reduction problems, physical realizability conditions for guaranteeing that the reduced-order system is also a quantum system should be taken as nonlinear constraints in the optimization. To solve the optimization problem with such nonlinear constraints, we employ a matrix inequality approach to transform nonlinear inequality constraints into readily solvable linear matrix inequalities (LMIs) and nonlinear equality constraints, so that the optimization problem can be solved by a lifting variables approach. We emphasize that different from existing work, which only introduces a criterion to evaluate the performance after model reduction, we guide our method to obtain an optimal reduced model with respect to the $\mathscr{H}_2$ norm. In addition, the above approach for model reduction is extended to passive linear quantum systems. Finally, examples of active and passive linear quantum systems validate the efficacy of the proposed method.
Submitted 19 November, 2024; v1 submitted 12 November, 2024;
originally announced November 2024.
-
Improved Video VAE for Latent Video Diffusion Model
Authors:
Pingyu Wu,
Kai Zhu,
Yu Liu,
Liming Zhao,
Wei Zhai,
Yang Cao,
Zheng-Jun Zha
Abstract:
Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in the above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (https://wpy1999.github.io/IV-VAE/).
Submitted 10 November, 2024;
originally announced November 2024.
-
Over-the-Air Computation via 2D Movable Antenna Array
Authors:
Nianzu Li,
Peiran Wu,
Boyu Ning,
Lipeng Zhu,
Weidong Mei
Abstract:
Movable antenna (MA) has emerged as a promising technology for improving the performance of wireless communication systems, which enables local movement of the antennas to create more favorable channel conditions. In this letter, we advance its application for over-the-air computation (AirComp) network, where an access point is equipped with a two-dimensional (2D) MA array to aggregate wireless data from massive users. We aim to minimize the computation mean square error (CMSE) by jointly optimizing the antenna position vector (APV), the receive combining vector at the access point and the transmit coefficients from all users. To tackle this highly non-convex problem, we propose a two-loop iterative algorithm, where the particle swarm optimization (PSO) approach is leveraged to obtain a suboptimal APV in the outer loop while the receive combining vector and transmit coefficients are alternately optimized in the inner loop. Numerical results demonstrate that the proposed MA-enhanced AirComp network outperforms the conventional network with fixed-position antennas (FPAs).
Submitted 16 September, 2024;
originally announced September 2024.
-
Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP
Authors:
Yisi Liu,
Bohan Yu,
Drake Lin,
Peter Wu,
Cheol Jun Cho,
Gopala Krishna Anumanchipalli
Abstract:
Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and 0.16 compared to the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference, and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
Submitted 4 September, 2024;
originally announced September 2024.
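The word error rate (WER) reported in the abstract above is conventionally computed as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below illustrates that metric only; the speech recognizer and text normalization used by the authors are not described in the abstract, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words.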
-
Enhancing Expressway Ramp Merge Safety and Efficiency via Spatiotemporal Cooperative Control
Authors:
Ting Peng,
Xiaoxue Xu,
Yuan Li,
Jie WU,
Tao Li,
Xiang Dong,
Yincai Cai,
Peng Wu,
Sana Ullah
Abstract:
In the context of autonomous driving on expressways, the issue of ensuring safe and efficient ramp merging remains a significant challenge. Existing systems often struggle to accurately assess the status and intentions of other vehicles, leading to a persistent occurrence of accidents despite efforts to maintain safe distances. This study proposes a novel spatiotemporal cooperative control approach integrating vehicle-road coordination to address this critical issue. A comprehensive methodology is developed, beginning with the calculation of safe distances under varying spatiotemporal conditions. This involves considering multiple factors, including vehicle speed differentials, positioning errors, and clock synchronization errors. Subsequently, an advanced vehicle conflict risk evaluation model is constructed. By incorporating collision acceleration and emergency acceleration as key parameters, this model offers a more accurate and detailed evaluation of potential risks during the ramp merging process. Based on the calculated safe distances and conflict risk evaluations, a mainline priority coordinated control method is formulated. This method enables the pre-planning of vehicle trajectories, effectively reducing conflicts among vehicles. Through rigorous simulations using diverse traffic volume and speed scenarios, the efficacy of the proposed strategy is validated. The results demonstrate remarkable improvements, with the average delay time reduced by an impressive 97.96% and fuel consumption decreased by 6.01%. These outcomes indicate that the proposed approach not only enhances the speed of vehicle merging but also significantly reduces latency and fuel consumption, thereby enhancing the overall performance of ramp merging operations.
Submitted 14 February, 2025; v1 submitted 15 August, 2024;
originally announced August 2024.
-
Sum Rate Maximization for Movable Antenna Enabled Uplink NOMA
Authors:
Nianzu Li,
Peiran Wu,
Boyu Ning,
Lipeng Zhu
Abstract:
Movable antenna (MA) has been recently proposed as a promising candidate technology for the next generation wireless communication systems due to its significant capability of reconfiguring wireless channels via antenna movement. In this letter, we study an MA-enabled uplink non-orthogonal multiple access (NOMA) system, where each user is equipped with a single MA. Our objective is to maximize the users' sum rate by jointly optimizing the MAs' positions, the decoding order and the power control. To solve this non-convex problem, we equivalently transform it into two tractable subproblems. First, we use the successive convex approximation (SCA) to find a locally optimal solution for the antenna position optimization subproblem. Next, we derive the closed-form optimal solution of the decoding order and power control subproblem. Numerical results show that our proposed MA-enabled NOMA system can significantly enhance the sum rate compared to fixed-position antenna (FPA) systems and orthogonal multiple access (OMA) systems.
Submitted 13 August, 2024;
originally announced August 2024.
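To make the roles of decoding order and power control in the abstract above concrete, here is a minimal sketch of the textbook uplink rate expression under successive interference cancellation (SIC): each user is decoded while treating the not-yet-decoded users as interference. The scalar channel gains, unit noise power, and the function name `sic_rates` are assumptions for illustration; the paper's actual system model and constraints are not given in the abstract.

```python
import math


def sic_rates(powers, gains, order, noise=1.0):
    """Per-user uplink rates (bit/s/Hz) when users are decoded in `order`.

    powers[k] is the transmit power of user k, gains[k] its channel power
    gain, and `order` a permutation of user indices, first decoded first.
    """
    n = len(powers)
    if len(gains) != n or sorted(order) != list(range(n)):
        raise ValueError("order must be a permutation of the user indices")
    received = [p * g for p, g in zip(powers, gains)]
    remaining = sum(received)
    rates = [0.0] * n
    for k in order:
        # Remove user k's own signal; users decoded later still interfere.
        remaining -= received[k]
        rates[k] = math.log2(1.0 + received[k] / (remaining + noise))
    return rates


powers, gains = [1.0, 2.0, 0.5], [0.8, 0.3, 1.5]
rates_a = sic_rates(powers, gains, order=[0, 1, 2])
rates_b = sic_rates(powers, gains, order=[2, 0, 1])
```

In this simplified model the individual rates in `rates_a` and `rates_b` differ, while their sums coincide because the per-user terms telescope to a single logarithm of the total received power over the noise power.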
-
Movable Antenna Enhanced AF Relaying: Two-Stage Antenna Position Optimization
Authors:
Nianzu Li,
Weidong Mei,
Boyu Ning,
Peiran Wu
Abstract:
Movable antenna (MA) technology has attracted increasing attention in wireless communications due to its capability for flexibly adjusting the positions of multiple antennas in a local region to reconfigure channel conditions. In this paper, we investigate its application in an amplify-and-forward (AF) relay system, where a multi-MA AF relay is deployed to assist the wireless communication from a source to a destination. In particular, we aim to maximize the achievable rate at the destination by jointly optimizing the AF weight matrix at the relay and its MAs' positions in two stages, for receiving the signal from the source and transmitting its amplified version to the destination, respectively. However, compared to the existing one-stage antenna position optimization, the two-stage position optimization is more challenging due to its intricate coupling in the achievable rate at the destination. To tackle this challenge, we decompose the considered problem into several subproblems by invoking alternating optimization (AO) and solve them using semidefinite programming and gradient ascent. Numerical results demonstrate the superiority of our proposed system over the conventional relaying system with fixed-position antennas (FPAs) and also provide essential insights.
Submitted 11 August, 2024;
originally announced August 2024.
-
Learned Trimmed-Ridge Regression for Channel Estimation in Millimeter-Wave Massive MIMO
Authors:
Pengxia Wu,
Julian Cheng,
Yonina C. Eldar,
John M. Cioffi
Abstract:
Channel estimation poses significant challenges in millimeter-wave massive multiple-input multiple-output systems, especially when the base station has fewer radio-frequency chains than antennas. To address this challenge, one promising solution exploits the beamspace channel sparsity to reconstruct full-dimensional channels from incomplete measurements. This paper presents a model-based deep learning method to reconstruct sparse, as well as approximately sparse, vectors fast and accurately. To implement this method, we propose a trimmed-ridge regression that transforms the sparse-reconstruction problem into a least-squares problem regularized by a nonconvex penalty term, and then derive an iterative solution. We then unfold the iterations into a deep network that can be implemented in online applications to realize real-time computations. To this end, an unfolded trimmed-ridge regression model is constructed using a structural configuration to reduce computational complexity and a model ensemble strategy to improve accuracy. Compared with other state-of-the-art deep learning models, the proposed learning scheme achieves better accuracy and supports higher downlink sum rates.
Submitted 5 August, 2024;
originally announced August 2024.
-
Towards EMG-to-Speech with a Necklace Form Factor
Authors:
Peter Wu,
Ryan Kaveh,
Raghav Nautiyal,
Christine Zhang,
Albert Guo,
Anvitha Kachinthaya,
Tavish Mishra,
Bohan Yu,
Alan W Black,
Rikky Muller,
Gopala Krishna Anumanchipalli
Abstract:
Electrodes for decoding speech from electromyography (EMG) are typically placed on the face, requiring adhesives that are inconvenient and skin-irritating if used regularly. We explore a different device form factor, where dry electrodes are placed around the neck instead. 11-word, multi-speaker voiced EMG classifiers trained on data recorded with this device achieve 92.7% accuracy. Ablation studies reveal the importance of having more than two electrodes on the neck, and phonological analyses reveal similar classification confusions between neck-only and neck-and-face form factors. Finally, speech-EMG correlation experiments demonstrate a linear relationship between many EMG spectrogram frequency bins and self-supervised speech representation dimensions.
Submitted 31 July, 2024;
originally announced July 2024.
-
Multi-Agent Deep Reinforcement Learning for Energy Efficient Multi-Hop STAR-RIS-Assisted Transmissions
Authors:
Pei-Hsiang Liao,
Li-Hsiang Shen,
Po-Chen Wu,
Kai-Ten Feng
Abstract:
Simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) provide a promising way to expand coverage in wireless communications. However, the limitations of a single STAR-RIS inspire us to integrate the concept of multi-hop transmissions, which existing research has studied for RISs. Therefore, we propose a novel architecture of multi-hop STAR-RISs to achieve a wider range of full-plane service coverage. In this paper, we address the active beamforming of the base station and the passive beamforming of the STAR-RISs, aiming to maximize the energy efficiency subject to the hardware limitations of the STAR-RISs. Furthermore, we investigate the impact of the on-off state of STAR-RIS elements on energy efficiency. To tackle this complex problem, a Multi-Agent Global and locAl deep Reinforcement learning (MAGAR) algorithm is designed. The global agent enhances the collaboration among local agents, which focus on individual learning. Numerical results show a significant improvement of MAGAR over the benchmarks, including Q-learning, multi-agent deep Q network (DQN) with global reward, and multi-agent DQN with local rewards. Moreover, the proposed architecture of multi-hop STAR-RISs achieves the highest energy efficiency compared to mode-switching-based STAR-RISs, conventional RISs, and deployment without RISs or STAR-RISs.
Submitted 26 July, 2024;
originally announced July 2024.
-
System-Level Simulation Framework for NB-IoT: Key Features and Performance Evaluation
Authors:
Shutao Zhang,
Wenkun Wen,
Peiran Wu,
Hongqing Huang,
Liya Zhu,
Yijia Guo,
Tingting Yang,
Minghua Xia
Abstract:
Narrowband Internet of Things (NB-IoT) is a technology specifically designated by the 3rd Generation Partnership Project (3GPP) to meet the explosive demand for massive machine-type communications (mMTC), and it is evolving to RedCap. Industrial companies have increasingly adopted NB-IoT as the solution for mMTC due to its lightweight design and comprehensive technical specifications released by 3GPP. This paper presents a system-level simulation framework for NB-IoT networks to evaluate their performance. The system-level simulator is structured into four parts: initialization, pre-generation, main simulation loop, and post-processing. Additionally, three essential features are investigated to enhance coverage, support massive connections, and ensure low power consumption, respectively. Simulation results demonstrate that the cumulative distribution function curves of the signal-to-interference-and-noise ratio fully comply with industrial standards. Furthermore, the throughput performance explains how NB-IoT networks realize massive connections at the cost of data rate. This work highlights its practical utility and paves the way for developing NB-IoT networks.
Submitted 13 August, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
Multimodal Segmentation for Vocal Tract Modeling
Authors:
Rishi Jain,
Bohan Yu,
Peter Wu,
Tejas Prabhune,
Gopala Anumanchipalli
Abstract:
Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.
Submitted 22 June, 2024;
originally announced June 2024.
-
Coding Speech through Vocal Tract Kinematics
Authors:
Cheol Jun Cho,
Peter Wu,
Tejas S. Prabhune,
Dhruv Agarwal,
Gopala K. Anumanchipalli
Abstract:
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
Submitted 14 December, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Optimal Reference Nodes Deployment for Positioning Seafloor Anchor Nodes
Authors:
Wei Huang,
Pengfei Wu,
Tianhe Xu,
Hao Zhang,
Kaitao Meng
Abstract:
Seafloor anchor nodes, which form a geodetic network, are designed to provide surface and underwater users with positioning, navigation and timing (PNT) services. Due to the non-uniform distribution of underwater sound speed, accurate positioning of underwater anchor nodes is a challenging task. Traditional anchor node positioning typically uses cross or circular shapes; however, how to optimize the deployment of reference nodes for positioning underwater anchor nodes while considering the variability of sound speed has not yet been studied. This paper focuses on optimal reference node deployment strategies for time-of-arrival (TOA) localization in three-dimensional (3D) underwater space. We adopt the criterion of minimizing the trace of the inverse Fisher information matrix (FIM) to determine the optimal reference node deployment under Gaussian measurement noise, which is positively related to the signal propagation path. A comprehensive analysis of optimal reference-target geometries is provided for the general case, with no restriction on the number of reference nodes, the elevation angle, or the reference-target range. A new semi-closed-form solution is found to determine the optimal geometries. To demonstrate the findings of this paper, we conducted both simulations and sea trials on underwater anchor node positioning. Both the simulation and experimental results are consistent with the theoretical analysis.
Submitted 23 May, 2024;
originally announced May 2024.
-
CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task
Authors:
Kangzhen Yang,
Tao Hu,
Kexin Dai,
Genggeng Chen,
Yu Cao,
Wei Dong,
Peng Wu,
Yanning Zhang,
Qingsen Yan
Abstract:
In real-world scenarios, captured images often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people can usually obtain only low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, performing only a single type of image enhancement still cannot yield satisfactory images. To address this challenge, we propose the Composite Refinement Network (CRNet), which uses multiple exposure images. By fully integrating information-rich multiple exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high- and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.
Submitted 22 April, 2024;
originally announced April 2024.
-
Bracketing Image Restoration and Enhancement with High-Low Frequency Decomposition
Authors:
Genggeng Chen,
Kexin Dai,
Kangzhen Yang,
Tao Hu,
Xiangyu Chen,
Yongqing Yang,
Wei Dong,
Peng Wu,
Yanning Zhang,
Qingsen Yan
Abstract:
In real-world scenarios, due to a series of image degradations, obtaining high-quality, clear content photos is challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlooked the characteristics of different degradations. They applied the same structure to address various types of degradation, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high/low frequency information is applicable to different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two modules for feature extraction: shared weight modules and non-shared weight modules. In the shared weight modules, we use SCConv to extract common features from different degradations. In the non-shared weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which employs different methods to handle high-low frequency information, enabling the model to address different degradations more effectively. Compared to other networks, our method takes into account the characteristics of different degradations, thus achieving higher-quality image restoration.
Submitted 24 April, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
Air-to-Ground Communications Beyond 5G: UAV Swarm Formation Control and Tracking
Authors:
Xiao Fan,
Peiran Wu,
Minghua Xia
Abstract:
Unmanned aerial vehicle (UAV) communications have been widely accepted as promising technologies to support air-to-ground communications in the forthcoming sixth-generation (6G) wireless networks. This paper proposes a novel air-to-ground communication model consisting of aerial base stations served by UAVs and terrestrial user equipments (UEs) by integrating the technique of coordinated multi-point (CoMP) transmission with the theory of stochastic geometry. In particular, a CoMP set consisting of multiple UAVs is developed based on the theory of Poisson-Delaunay tetrahedralization. Effective UAV formation control and UAV swarm tracking schemes for two typical scenarios, including static and mobile UEs, are also developed using the multi-agent system theory to ensure that collaborative UAVs can efficiently reach target spatial positions for mission execution. Thanks to the ease of mathematical tractability, this model provides explicit performance expressions for a typical UE's coverage probability and achievable ergodic rate. Extensive simulation and numerical results corroborate that the proposed scheme outperforms UAV communications without CoMP transmission and obtains similar performance to the conventional CoMP scheme while avoiding search overhead.
Submitted 25 December, 2023;
originally announced December 2023.
-
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Authors:
Jiachen Lian,
Carly Feng,
Naasir Farooqi,
Steve Li,
Anshul Kashyap,
Cheol Jun Cho,
Peter Wu,
Robbie Netzorg,
Tingle Li,
Gopala Krishna Anumanchipalli
Abstract:
Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.
Submitted 20 December, 2023;
originally announced December 2023.
-
Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection
Authors:
Davide Berghi,
Peipei Wu,
Jinzheng Zhao,
Wenwu Wang,
Philip J. B. Jackson
Abstract:
Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.
Submitted 14 December, 2023;
originally announced December 2023.
-
Coronary Atherosclerotic Plaque Characterization with Photon-counting CT: a Simulation-based Feasibility Study
Authors:
Mengzhou Li,
Mingye Wu,
Jed Pack,
Pengwei Wu,
Bruno De Man,
Adam Wang,
Koen Nieman,
Ge Wang
Abstract:
Recent development of photon-counting CT (PCCT) brings great opportunities for plaque characterization with much-improved spatial resolution and spectral imaging capability. While existing coronary plaque PCCT imaging results are based on detectors made of CZT or CdTe materials, deep-silicon photon-counting detectors have unique performance characteristics and promise distinct imaging capabilities. In this work, we report a systematic simulation study of a deep-silicon PCCT scanner with a new clinically-relevant digital plaque phantom with realistic geometrical parameters and chemical compositions. This work investigates the effects of spatial resolution, noise, motion artifacts, radiation dose, and spectral characterization. Our simulation results suggest that the deep-silicon PCCT design provides adequate spatial resolution for visualizing a necrotic core and quantitation of key plaque features. Advanced denoising techniques and aggressive bowtie filter designs can keep image noise to acceptable levels at this resolution while keeping radiation dose comparable to that of a conventional CT scan. The ultrahigh resolution of PCCT also means an elevated sensitivity to motion artifacts. It is found that a tolerance of less than 0.4 mm residual movement range requires the application of accurate motion correction methods for best plaque imaging quality with PCCT.
Submitted 3 December, 2023;
originally announced December 2023.
-
Future Full-Ocean Deep SSPs Prediction based on Hierarchical Long Short-Term Memory Neural Networks
Authors:
Jiajun Lu,
Hao Zhang,
Pengfei Wu,
Sijia Li,
Wei Huang
Abstract:
The spatial-temporal distribution of underwater sound velocity affects the propagation mode of underwater acoustic signals. Therefore, rapid estimation and prediction of the underwater sound velocity distribution is crucial for providing underwater positioning, navigation and timing (PNT) services. Currently, sound speed profile (SSP) inversion methods respond faster than direct measurement methods. However, most SSP inversion methods focus on constructing sound velocity fields in the spatial dimension and are highly dependent on sonar observation data, which places high requirements on the observation data sources. To explore the distribution pattern of sound velocity in the time dimension and achieve future SSP prediction without sonar observation data, we propose a hierarchical long short-term memory (H-LSTM) neural network for SSP prediction. With our SSP prediction method, the sound speed distribution can be estimated without any on-site data measurement, so that time efficiency can be greatly improved. Compared with other state-of-the-art methods, H-LSTM achieves better accuracy in predicting the monthly average sound velocity distribution, with an error of less than 1 m/s in different depth layers.
Submitted 15 November, 2023;
originally announced November 2023.
-
Towards Streaming Speech-to-Avatar Synthesis
Authors:
Tejas S. Prabhune,
Peter Wu,
Bohan Yu,
Gopala K. Anumanchipalli
Abstract:
Streaming speech-to-avatar synthesis creates real-time animations for a virtual character from audio data. Accurate avatar representations of speech are important for the visualization of sound in linguistics, phonetics, and phonology, visual feedback to assist second language acquisition, and virtual embodiment for paralyzed patients. Previous works have highlighted the capability of deep articulatory inversion to perform high-quality avatar animation using electromagnetic articulography (EMA) features. However, these models focus on offline avatar synthesis with recordings rather than real-time audio, which is necessary for live avatar visualization or embodiment. To address this issue, we propose a method using articulatory inversion for streaming high quality facial and inner-mouth avatar animation from real-time audio. Our approach achieves 130ms average streaming latency for every 0.1 seconds of audio with a 0.792 correlation with ground truth articulations. Finally, we show generated mouth and tongue animations to demonstrate the efficacy of our methodology.
Submitted 24 October, 2023;
originally announced October 2023.
-
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Authors:
Jinzheng Zhao,
Yong Xu,
Xinyuan Qian,
Davide Berghi,
Peipei Wu,
Meng Cui,
Jianyuan Sun,
Philip J. B. Jackson,
Wenwu Wang
Abstract:
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, Bayesian-based filters and deep learning-based methods can solve the problems of data association, audio-visual fusion and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.
Submitted 13 April, 2025; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Underwater Sound Speed Profile Construction: A Review
Authors:
Wei Huang,
Jixuan Zhou,
Fan Gao,
Jiajun Lu,
Sijia Li,
Pengfei Wu,
Junting Wang,
Hao Zhang,
Tianhe Xu
Abstract:
Real-time and accurate construction of regional sound speed profiles (SSPs) is important for building underwater positioning, navigation, and timing (PNT) systems, as the SSP greatly affects signal propagation modes such as the trajectory. In this paper, we summarize and analyze the current research status in the field of underwater SSP construction; the mainstream methods include direct SSP measurement and SSP inversion. For the direct measurement method, we compare the performance of popular international commercial conductivity, temperature, and depth (CTD) profilers. For the inversion methods, the framework and basic principles of matched field processing (MFP), compressive sensing (CS), and deep learning (DL) for constructing SSPs are introduced, and their advantages and disadvantages are compared. The traditional direct measurement method has good accuracy, but it usually takes a long time. SSP inversion methods greatly improve convenience and real-time performance, but their accuracy is not as good as that of the direct measurement method. Currently, SSP inversion relies on sonar observation data, making it difficult to apply to areas that cannot be covered by underwater observation systems, and these methods are unable to predict the distribution of sound velocity at future times. How to comprehensively utilize multi-source data and provide elastic sound velocity distribution estimation services with different accuracy and real-time requirements for underwater users without sonar observation data is the mainstream trend in future research on SSP construction.
Submitted 12 October, 2023;
originally announced October 2023.
-
Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities
Authors:
Robin Netzorg,
Bohan Yu,
Andrea Guzman,
Peter Wu,
Luna McNulty,
Gopala Anumanchipalli
Abstract:
Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that is an intermediary of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the information encoded in a PQ-based representation is predictable by various speech representations.
Submitted 3 October, 2023;
originally announced October 2023.
-
CiwaGAN: Articulatory information exchange
Authors:
Gašper Beguš,
Thomas Lu,
Alan Zhou,
Peter Wu,
Gopala K. Anumanchipalli
Abstract:
Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeling and information exchange separately, our model is the first to combine the two components. The paper also proposes an improved articulatory model with more interpretable internal representations. The proposed CiwaGAN model is the most realistic approximation of human spoken language acquisition using deep learning. As such, it is useful for cognitively plausible simulations of the human speech act.
Submitted 14 September, 2023;
originally announced September 2023.
-
Robust Interference Mitigation techniques for Direct Position Estimation
Authors:
Haoqing Li,
Shuo Tang,
Peng Wu,
Pau Closas
Abstract:
Global Navigation Satellite System (GNSS) is pervasive in navigation and positioning applications, where precise position and time referencing estimations are required. Conventional methods for GNSS positioning involve a two-step process, where intermediate measurements such as Doppler shift and time delay of received GNSS signals are computed and then used to solve for the receiver's position. Alternatively, Direct Position Estimation (DPE) was proposed to infer the position directly from the sampled signal without intermediate variables, yielding superior levels of sensitivity and operation under challenging environments. However, the positioning resilience of the DPE method is still threatened by various types of interference. Robust Interference Mitigation (RIM) processing has been studied and proved to be efficient against various types of interference in conventional two-step positioning (2SP) methods, and is therefore worth exploring for its potential to enhance DPE. This article extends the DPE methodology by incorporating RIM strategies that address the increasing need to protect GNSS receivers against intentional or unintentional interference, such as jamming signals, which can deny GNSS-based positioning. RIM, which leverages robust statistics, was shown to provide competitive results in two-step approaches and is here employed in a high-sensitivity DPE framework with successful results. The article also quantifies the loss of efficiency incurred by using RIM when no interference is present and validates the proposed methodology on relevant interference cases, while the approach can be used to mitigate other common interference signals.
Submitted 9 August, 2023;
originally announced August 2023.
-
A Safe DRL Method for Fast Solution of Real-Time Optimal Power Flow
Authors:
Pengfei Wu,
Chen Chen,
Dexiang Lai,
Jian Zhong
Abstract:
High-level penetration of intermittent renewable energy sources (RESs) has introduced significant uncertainties into modern power systems. In order to rapidly and economically respond to the fluctuations of power system operating state, this paper proposes a safe deep reinforcement learning (SDRL) based method for fast solution of real-time optimal power flow (RT-OPF) problems. The proposed method considers the volatility of RESs and temporal constraints, and formulates the RT-OPF as a Constrained Markov Decision Process (CMDP). In the training process, the proposed method hybridizes the proximal policy optimization (PPO) and the primal-dual method. Instead of integrating the constraint violation penalty with the reward function, its actor gradients are estimated by a Lagrange advantage function which is derived from two critic systems based on economic reward and violation cost. The decoupling of reward and cost alleviates reward sparsity while improving critic approximation accuracy. Moreover, the introduction of Lagrange multipliers enables the agent to comprehend the trade-off between optimality and feasibility. Numerical tests are carried out and compared with penalty-based DRL methods on the IEEE 9-bus, 30-bus, and 118-bus test systems. The results show that the well-trained SDRL agent can significantly improve the computation efficiency while satisfying the security constraints and optimality requirements.
Submitted 7 August, 2023;
originally announced August 2023.
-
D-STAR: Dual Simultaneously Transmitting and Reflecting Reconfigurable Intelligent Surfaces for Joint Uplink/Downlink Transmission
Authors:
Li-Hsiang Shen,
Po-Chen Wu,
Chia-Jou Ku,
Yu-Ting Li,
Kai-Ten Feng,
Yuanwei Liu,
Lajos Hanzo
Abstract:
The joint uplink/downlink (JUD) design of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) is conceived in support of both uplink (UL) and downlink (DL) users. Furthermore, the dual STAR-RISs (D-STAR) concept is conceived as a promising architecture for 360-degree full-plane service coverage, including UL/DL users located between the base station (BS) and the D-STAR as well as beyond. The corresponding regions are termed the primary (P) and secondary (S) regions. Both the BS and users are located in the P-region, but only users are located in the S-region. The primary STAR-RIS (STAR-P) plays an important role in tackling the P-region inter-user interference and the self-interference (SI) from the BS and from the reflective as well as refractive UL users imposed on the DL receiver. By contrast, the secondary STAR-RIS (STAR-S) aims to mitigate the S-region interference. The non-linear and non-convex rate-maximization problem formulated is solved by alternating optimization amongst the decomposed convex sub-problems of the BS beamformer, and the D-STAR amplitude as well as phase shift configurations. We also propose a D-STAR based active beamforming and passive STAR-RIS amplitude/phase (DBAP) optimization scheme to solve the respective sub-problems by Lagrange dual with Dinkelbach's transformation, alternating direction method of multipliers (ADMM) with successive convex approximation (SCA), and penalty convex-concave procedure (PCCP). Our simulation results reveal that the proposed D-STAR architecture outperforms the conventional single RIS, single STAR-RIS, and half-duplex networks. The proposed DBAP of D-STAR outperforms the state-of-the-art solutions found in the open literature for different numbers of quantization levels, geographic deployment, transmit power and for diverse numbers of transmit antennas, patch partitions as well as D-STAR elements.
Submitted 8 February, 2024; v1 submitted 29 July, 2023;
originally announced July 2023.
-
Deep Speech Synthesis from MRI-Based Articulatory Representations
Authors:
Peter Wu,
Tingle Li,
Yijing Lu,
Yubin Zhang,
Jiachen Lian,
Alan W Black,
Louis Goldstein,
Shinji Watanabe,
Gopala K. Anumanchipalli
Abstract:
In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.
Submitted 5 July, 2023;
originally announced July 2023.
-
One-Bit Spectrum Sensing for Cognitive Radio
Authors:
Pei-Wen Wu,
Lei Huang,
David Ramírez,
Yu-Hang Xiao,
Hing Cheung So
Abstract:
Spectrum sensing in cognitive radio necessitates effective monitoring of wide bandwidths, which requires high-rate sampling. Traditional spectrum sensing methods employing high-precision analog-to-digital converters (ADCs) result in increased power consumption and expensive hardware costs. In this paper, we explore blind spectrum sensing utilizing one-bit ADCs. We derive a closed-form detector based on Rao's test and demonstrate its equivalence with the second-order eigenvalue-moment-ratio test. Furthermore, a near-exact distribution based on the moment-based method, and an approximate distribution in the low signal-to-noise ratio (SNR) regime with the use of the central limit theorem, are obtained. Theoretical analysis is then performed and our results show that the performance loss of the proposed detector is approximately $2$ dB ($π/2$) compared to detectors employing $\infty$-bit ADCs when SNR is low. This loss can be compensated for by using approximately $2.47$ ($π^2/4$) times more samples. In addition, we unveil that the efficiency of incoherent accumulation in one-bit detection is the square root of that of coherent accumulation. Simulation results corroborate the correctness of our theoretical calculations.
Submitted 23 June, 2023;
originally announced June 2023.
-
Text-Driven Foley Sound Generation With Latent Diffusion Model
Authors:
Yi Yuan,
Haohe Liu,
Xubo Liu,
Xiyuan Kang,
Peipei Wu,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vectors). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.
Submitted 18 September, 2023; v1 submitted 17 June, 2023;
originally announced June 2023.
-
Continuous and Noninvasive Measurement of Arterial Pulse Pressure and Pressure Waveform using an Image-free Ultrasound System
Authors:
Lirui Xu,
Pang Wu,
Pan Xia,
Fanglin Geng,
Peng Wang,
Xianxiang Chen,
Zhenfeng Li,
Lidong Du,
Shuping Liu,
Li Li,
Hongbo Chang,
Zhen Fang
Abstract:
The beat-to-beat local pulse pressure (PP) and blood pressure waveform of arteries, especially central arteries, are important indicators of the course of cardiovascular diseases (CVDs). Nevertheless, measuring them noninvasively remains a challenge in the clinic. This work presents a three-element image-free ultrasound system with a low-computational method for real-time measurement of local pulse wave velocity (PWV) and diameter waveforms, enabling real-time, noninvasive, and continuous measurement of PP and blood pressure waveforms without calibration. The developed system has been validated in vitro and in vivo. In in vitro cardiovascular phantom experiments, the results demonstrated high accuracy in the measurement of PP (error < 3 mmHg) and blood pressure waveform (root-mean-square error (RMSE) < 2 mmHg, correlation coefficient (r) > 0.99). In subsequent human carotid experiments, the system was compared with an arterial tonometer and showed excellent PP accuracy (mean absolute error (MAE) = 3.7 ± 3.4 mmHg) and pressure waveform similarity (RMSE = 3.7 ± 1.6 mmHg, r = 0.98 ± 0.01). Furthermore, comparative experiments with the volume clamp device demonstrated the system's ability to accurately trace blood pressure changes (induced by deep breathing) over a period of one minute, with the MAE of DBP, MAP, and SBP within 5 ± 8 mmHg. The present results demonstrate the accuracy and reliability of the developed system for continuous and noninvasive measurement of arterial PP and blood pressure waveforms, with potential applications in the diagnosis and prevention of CVDs.
Submitted 29 May, 2023;
originally announced May 2023.
-
CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training
Authors:
Linhao Dong,
Zhecheng An,
Peihao Wu,
Jun Zhang,
Lu Lu,
Zejun Ma
Abstract:
Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment, continuous integrate-and-fire (CIF), to bridge the representations between speech and text. It jointly performs speech-to-text training and language model distillation through CIF as the pre-training (PT). Evaluated on the SLURP SLU benchmark dataset, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot filling, respectively. We also observe that the cross-modal representation extracted by CIF-PT performs better than other neural interfaces on SLU tasks, including the dominant speech representation learned from self-supervised pre-training.
Submitted 27 May, 2023;
originally announced May 2023.
-
Edge Learning for Large-Scale Internet of Things With Task-Oriented Efficient Communication
Authors:
Haihui Xie,
Minghua Xia,
Peiran Wu,
Shuai Wang,
H. Vincent Poor
Abstract:
In Internet of Things (IoT) networks, edge learning for data-driven tasks provides intelligent applications and services. As the network size becomes large, different users may generate distinct datasets. Thus, to suit multiple edge learning tasks for large-scale IoT networks, this paper performs efficient communication under the task-oriented principle by using the collaborative design of wireless resource allocation and edge learning error prediction. In particular, we start with multi-user scheduling to alleviate co-channel interference in dense networks. Then, we perform optimal power allocation in parallel for different learning tasks. Thanks to the high parallelization of the designed algorithm, extensive experimental results corroborate that the multi-user scheduling and task-oriented power allocation efficiently improve the performance of distinct edge learning tasks compared with state-of-the-art benchmark algorithms.
Submitted 30 April, 2023;
originally announced May 2023.
-
Speaker-Independent Acoustic-to-Articulatory Speech Inversion
Authors:
Peter Wu,
Li-Wei Chen,
Cheol Jun Cho,
Shinji Watanabe,
Louis Goldstein,
Alan W Black,
Gopala K. Anumanchipalli
Abstract:
To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages self-supervision to generalize to unseen speakers. Our approach obtains 0.784 correlation on an electromagnetic articulography (EMA) dataset, improving the state-of-the-art by 12.5%. Additionally, we show the interpretability of these representations through directly comparing the behavior of estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, demonstrating its efficacy with an 18-speaker dataset.
Submitted 24 July, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation
Authors:
Rao Ma,
Xiaobo Wu,
Jin Qiu,
Yanan Qin,
Haihua Xu,
Peihao Wu,
Zejun Ma
Abstract:
The deployment environment of ASR models is ever-changing, and the incoming speech can switch across different domains during a session. This poses a challenge for effective domain adaptation when only target-domain text data is available; our objective is to obtain clearly improved performance on the target domain while limiting the degradation on the general domain. In this paper, we propose an adaptive LM fusion approach called internal language model estimation based adaptive domain adaptation (ILME-ADA). To realize ILME-ADA, an interpolated log-likelihood score is calculated based on the maximum of the scores from the internal LM and the external LM (ELM). We demonstrate the efficacy of the proposed ILME-ADA method with both RNN-T and LAS modeling frameworks, employing neural network and n-gram LMs as ELMs respectively, on two domain-specific (target) test sets. Compared with both shallow and ILME-based LM fusion methods, the proposed method achieves significantly better performance on the target test sets with minimal performance degradation on the general test set.
Submitted 2 March, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution
Authors:
Yisi Liu,
Peter Wu,
Alan W Black,
Gopala K. Anumanchipalli
Abstract:
Estimation of fundamental frequency (F0) in voiced segments of speech signals, also known as pitch tracking, plays a crucial role in pitch synchronous speech analysis, speech synthesis, and speech manipulation. In this paper, we capitalize on the high time and frequency resolution of the pseudo Wigner-Ville distribution (PWVD) and propose a new PWVD-based pitch estimation method. We devise an efficient algorithm to compute PWVD faster and use cepstrum-based pre-filtering to avoid cross-term interference. Evaluating our approach on a database with speech and electroglottograph (EGG) recordings yields a state-of-the-art mean absolute error (MAE) of around 4 Hz. Our approach is also effective at voiced/unvoiced classification and handling sudden frequency changes.
Submitted 27 October, 2022;
originally announced October 2022.
-
Articulation GAN: Unsupervised modeling of articulatory learning
Authors:
Gašper Beguš,
Alan Zhou,
Peter Wu,
Gopala K Anumanchipalli
Abstract:
Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, which results in the production of speech sounds through physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new unsupervised generative model of speech production/synthesis. The Articulatory Generator more closely mimics human speech production by learning to generate articulatory representations (electromagnetic articulography or EMA) in a fully unsupervised manner. A separate pre-trained physical model (ema2wav) then transforms the generated EMA representations to speech waveforms, which get sent to the Discriminator for evaluation. Articulatory analysis suggests that the network learns to control articulators in a similar manner to humans during speech production. Acoustic analysis of the outputs suggests that the network learns to generate words that are both present and absent in the training distribution. We additionally discuss implications of articulatory representations for cognitive models of human language and speech technology in general.
Submitted 12 March, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech
Authors:
Cheol Jun Cho,
Peter Wu,
Abdelrahman Mohamed,
Gopala K. Anumanchipalli
Abstract:
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand such utilities, various analyses have been done for speech SSL models to reveal what information is encoded in the learned representations and how. Although the scope of previous analyses is extensive in acoustic, phonetic, and semantic perspectives, the physical grounding by speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach where we measure the articulatory score as the average correlation of a linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further layer-wise analyses on the two most successful models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from the recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes of data are sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
Submitted 20 July, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.