Search | arXiv e-print repository

arXiv:2506.13293 [pdf]

SUSEP-Net: Simulation-Supervised and Contrastive Learning-based Deep Neural Networks for Susceptibility Source Separation

Authors: Min Li, Chen Chen, Zhenghao Li, Yin Liu, Shanshan Shan, Peng Wu, Pengfei Rong, Feng Liu, G. Bruce Pike, Alan H. Wilman, Hongfu Sun, Yang Gao

Abstract: Quantitative susceptibility mapping (QSM) provides a valuable tool for quantifying susceptibility distributions in human brains; however, two types of opposing susceptibility sources (i.e., paramagnetic and diamagnetic) may coexist in a single voxel and cancel each other out in net QSM images. Susceptibility source separation techniques enable the extraction of sub-voxel information from QSM maps. This study proposes a novel SUSEP-Net for susceptibility source separation by training a dual-branch U-net with a simulation-supervised training strategy. In addition, a contrastive learning framework is included to explicitly impose similarity-based constraints between the branch-specific guidance features in specially-designed encoders and the latent features in the decoders. Comprehensive experiments were carried out on both simulated and in vivo data, including healthy subjects and patients with pathological conditions, to compare SUSEP-Net with three state-of-the-art susceptibility source separation methods (i.e., APART-QSM, χ-separation, and χ-sepnet). SUSEP-Net consistently showed improved results compared with the other three methods, with better numerical metrics, improved high-intensity hemorrhage and calcification lesion contrasts, and reduced artifacts in brains with pathological conditions. In addition, experiments on agarose gel phantom data were conducted to validate the accuracy and the generalization capability of SUSEP-Net.

Submitted 16 June, 2025; originally announced June 2025.

Comments: 8 figures, 2 tables

arXiv:2506.11438 [pdf, ps, other]

Movable-Antenna Array Enhanced Downlink NOMA

Authors: Nianzu Li, Peiran Wu, Lipeng Zhu, Derrick Wing Kwan Ng

Abstract: Movable antenna (MA) has gained increasing attention in the field of wireless communications due to its exceptional capability to proactively reconfigure wireless channels via localized antenna movements. In this paper, we investigate the resource allocation design for an MA array-enabled base station serving multiple single-antenna users in a downlink non-orthogonal multiple access (NOMA) system. We aim to maximize the sum rate of all users by jointly optimizing the transmit beamforming and the positions of all MAs at the BS, subject to the constraints of transmit power budget, finite antenna moving region, and the conditions for successive interference cancellation decoding rate. The formulated problem, inherently highly non-convex, is addressed by successive convex approximation (SCA) and alternating optimization methods to obtain a high-quality suboptimal solution. Simulation results unveil that the proposed MA-enhanced downlink NOMA system can significantly improve the sum rate performance compared to both the fixed-position antenna (FPA) system and the traditional orthogonal multiple access (OMA) system.

Submitted 12 June, 2025; originally announced June 2025.

Comments: Accepted in 2025 IEEE ICC Workshops

arXiv:2506.00975 [pdf, ps, other]

NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Authors: Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao

Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.

Submitted 11 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

Comments: Accepted by ICML 2025

arXiv:2505.12089 [pdf, ps, other]

NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results

Authors: Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyung-Ju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, Qi Wu, Tianheng Qiu, Yuchun Dong, Shenglin Ding, Guanghua Pan, Weiyu Zhou, Tao Hu, Yixu Feng, Duwei Dai, Yu Cao, Peng Wu, Wei Dong, Yanning Zhang, Qingsen Yan, Simon J. Larsen, et al. (11 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.

Submitted 17 May, 2025; originally announced May 2025.

arXiv:2503.02318 [pdf, other]

Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models

Authors: Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, Chunyan Miao

Abstract: Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling and QA generation, along with a structured CoT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings underscore the central role of structured CoT training in advancing audio reasoning.

Submitted 4 March, 2025; originally announced March 2025.

Comments: Technical report, in process

arXiv:2502.12817 [pdf, ps, other]

An Attention-Assisted Multi-Modal Data Fusion Model for Real-Time Estimation of Underwater Sound Velocity

Authors: Pengfei Wu, Wei Huang, Yujie Shi, Hao Zhang

Abstract: The estimation of underwater sound velocity distribution serves as a critical basis for facilitating effective underwater communication and precise positioning, given that variations in sound velocity influence the path of signal transmission. Conventional techniques for the direct measurement of sound velocity, as well as methods that involve the inversion of sound velocity utilizing acoustic field data, necessitate on-site data collection. This requirement not only places high demands on device deployment, but also presents challenges in achieving real-time estimation of sound velocity distribution. In order to construct a real-time sound velocity field and eliminate the need for underwater on-site data measurement operations, we propose a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. The proposed model seeks to elucidate the inherent relationship between remote sensing sea surface temperature (SST) data, the primary component characteristics of historical SSPs, and their spatial coordinates. This is achieved by employing CNNs and attention mechanisms to extract local and global correlations from the input data, respectively. The ultimate objective is to facilitate a rapid and precise estimation of sound velocity distribution within a specified task area. Experimental results show that the method proposed in this paper has lower root mean square error (RMSE) and stronger robustness than other state-of-the-art methods.

Submitted 2 March, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

arXiv:2502.03974 [pdf]

Spatiotemporal Trajectory Tracking Method for Vehicles Incorporating Lead-Lag Judgement

Authors: Yuan Li, Xiang Dong, Tao Li, Junfeng Hao, Xiaoxue Xu, Sana Ullaha, Yincai Cai, Peng Wu, Ting Peng

Abstract: In the domain of intelligent transportation systems, especially within the context of autonomous vehicle control, the preemptive holistic collaborative system has been presented as a promising solution to bring a remarkable enhancement in traffic efficiency and a substantial reduction in the accident rate, demonstrating great potential for development. In order to ensure this system operates as intended, accurate tracking of the spatiotemporal trajectory is of crucial significance. Moreover, minimizing the tracking error is a necessary step in this process. To this end, a novel lead-lag judgment mechanism is proposed. This mechanism precisely quantifies the longitudinal positional deviation between the vehicle and the target trajectory over time; the deviation is then corrected with a real-time acceleration compensation strategy. As a result, the accuracy and reliability of trajectory tracking are significantly enhanced. Real-vehicle experiments were conducted in a dedicated test field to validate the feasibility of this innovative approach empirically. The obtained tracking data were then processed using the lead-lag judgment mechanism. In this step, we carefully analyzed the spatiotemporal error patterns between the vehicle and the target trajectory under different alignments and speeds. Finally, using real highway speed and alignment data, we conducted comprehensive spatiotemporal trajectory tracking simulations. Through experiments and simulations, tracking errors were maintained within an acceptable range, and a reasonable spatiotemporal distance is given for the preemptive merging process on highway ramps. Overall, this study offers valuable insights for highway ramp merging safety. Future work can expand on these findings.

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2501.15385 [pdf, other]

DDUNet: Dual Dynamic U-Net for Highly-Efficient Cloud Segmentation

Authors: Yijie Li, Hewei Wang, Jinfeng Xu, Puzhen Wu, Yunzhong Xiao, Shaofan Wang, Soumyabrata Dev

Abstract: Cloud segmentation amounts to separating cloud pixels from non-cloud pixels in an image. Current deep learning methods for cloud segmentation suffer from three issues: (a) constraints on their receptive field due to the fixed size of the convolution kernel; (b) lack of robustness towards different scenarios; and (c) the requirement of a large number of parameters, which limits real-time implementation. To address these issues, we propose a Dual Dynamic U-Net (DDUNet) for supervised cloud segmentation. The DDUNet adheres to a U-Net architecture and integrates two crucial modules: the dynamic multi-scale convolution (DMSC), which improves the merging of features under different receptive fields, and the dynamic weights and bias generator (DWBG) in classification layers, which enhances generalization ability. More importantly, owing to the use of depth-wise convolution, the DDUNet is a lightweight network that can achieve 95.3% accuracy on the SWINySEG dataset with only 0.33M parameters, and achieve superior performance over three different configurations of the SWINySEG dataset in both accuracy and efficiency.

Submitted 25 January, 2025; originally announced January 2025.

Comments: 5 pages

arXiv:2501.07989 [pdf, ps, other]

Movable Antenna Enhanced DF and AF Relaying Systems: Performance Analysis and Optimization

Authors: Nianzu Li, Weidong Mei, Peiran Wu, Boyu Ning, Lipeng Zhu

Abstract: Movable antenna (MA) has been deemed as a promising technology to flexibly reconfigure wireless channels by adjusting the antenna positions in a given local region. In this paper, we investigate the application of the MA technology in both decode-and-forward (DF) and amplify-and-forward (AF) relaying systems, where a relay is equipped with multiple MAs to assist in the data transmission between two single-antenna nodes. For the DF relaying system, our objective is to maximize the achievable rate at the destination by jointly optimizing the positions of the MAs in two stages for receiving signals from the source and transmitting signals to the destination, respectively. To gain essential insights, we first derive a closed-form upper bound on the maximum achievable rate of the DF relaying system. Then, a low-complexity algorithm based on projected gradient ascent (PGA) and alternating optimization (AO) is proposed to solve the antenna position optimization problem. For the AF relaying system, our objective is to maximize the achievable rate by jointly optimizing the two-stage MA positions as well as the AF beamforming matrix at the relay, which results in a more challenging optimization problem due to the intricate coupling variables. To tackle this challenge, we first reveal the hidden separability among the antenna position optimization in the two stages and the beamforming optimization. Based on such separability, we derive a closed-form upper bound on the maximum achievable rate of the AF relaying system and propose a low-complexity algorithm to obtain a high-quality suboptimal solution to the considered problem. Simulation results validate the efficacy of our theoretical analysis and demonstrate the superiority of the MA-enhanced relaying systems over the conventional relaying systems with fixed-position antennas (FPAs) and other benchmark schemes.

Submitted 14 January, 2025; originally announced January 2025.

arXiv:2412.13387 [pdf, other]

Deep Speech Synthesis from Multimodal Articulatory Representations

Authors: Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli

Abstract: The amount of articulatory data available for training deep learning models is much less compared to acoustic speech data. In order to improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, utilizing our proposed transfer learning methods improves the MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.

Submitted 17 December, 2024; originally announced December 2024.

arXiv:2411.07603 [pdf, other]

$\mathscr{H}_2$ Model Reduction for Linear Quantum Systems

Authors: G. P. Wu, S. Xue, G. F. Zhang, I. R. Petersen

Abstract: In this paper, an $\mathscr{H}_2$ norm-based model reduction method for linear quantum systems is presented, which can obtain a physically realizable model with a reduced order for closely approximating the original system. The model reduction problem is described as an optimization problem, whose objective is taken as an $\mathscr{H}_2$ norm of the difference between the transfer function of the original system and that of the reduced one. Different from classical model reduction problems, physical realizability conditions for guaranteeing that the reduced-order system is also a quantum system should be taken as nonlinear constraints in the optimization. To solve the optimization problem with such nonlinear constraints, we employ a matrix inequality approach to transform nonlinear inequality constraints into readily solvable linear matrix inequalities (LMIs) and nonlinear equality constraints, so that the optimization problem can be solved by a lifting variables approach. We emphasize that different from existing work, which only introduces a criterion to evaluate the performance after model reduction, we guide our method to obtain an optimal reduced model with respect to the $\mathscr{H}_2$ norm. In addition, the above approach for model reduction is extended to passive linear quantum systems. Finally, examples of active and passive linear quantum systems validate the efficacy of the proposed method.

Submitted 19 November, 2024; v1 submitted 12 November, 2024; originally announced November 2024.

Comments: 13 pages, 3 figures

arXiv:2411.06449 [pdf, other]

Improved Video VAE for Latent Video Diffusion Model

Authors: Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, Zheng-Jun Zha

Abstract: Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in the above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (https://wpy1999.github.io/IV-VAE/).

Submitted 10 November, 2024; originally announced November 2024.

arXiv:2409.10351 [pdf, other]

doi 10.1109/LWC.2024.3485513

Over-the-Air Computation via 2D Movable Antenna Array

Authors: Nianzu Li, Peiran Wu, Boyu Ning, Lipeng Zhu, Weidong Mei

Abstract: Movable antenna (MA) has emerged as a promising technology for improving the performance of wireless communication systems, which enables local movement of the antennas to create more favorable channel conditions. In this letter, we advance its application for over-the-air computation (AirComp) network, where an access point is equipped with a two-dimensional (2D) MA array to aggregate wireless data from massive users. We aim to minimize the computation mean square error (CMSE) by jointly optimizing the antenna position vector (APV), the receive combining vector at the access point and the transmit coefficients from all users. To tackle this highly non-convex problem, we propose a two-loop iterative algorithm, where the particle swarm optimization (PSO) approach is leveraged to obtain a suboptimal APV in the outer loop while the receive combining vector and transmit coefficients are alternately optimized in the inner loop. Numerical results demonstrate that the proposed MA-enhanced AirComp network outperforms the conventional network with fixed-position antennas (FPAs).

Submitted 16 September, 2024; originally announced September 2024.

Journal ref: IEEE Wireless Communications Letters, vol. 14, no. 1, pp. 33-37, Jan. 2025

arXiv:2409.02451 [pdf, other]

Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

Authors: Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli

Abstract: Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and 0.16 compared to the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference, and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.

Submitted 4 September, 2024; originally announced September 2024.

Comments: accepted for Spoken Language Technology Workshop 2024

arXiv:2408.08121 [pdf]

doi 10.1109/ACCESS.2025.3539370

Enhancing Expressway Ramp Merge Safety and Efficiency via Spatiotemporal Cooperative Control

Authors: Ting Peng, Xiaoxue Xu, Yuan Li, Jie WU, Tao Li, Xiang Dong, Yincai Cai, Peng Wu, Sana Ullah

Abstract: In the context of autonomous driving on expressways, the issue of ensuring safe and efficient ramp merging remains a significant challenge. Existing systems often struggle to accurately assess the status and intentions of other vehicles, leading to a persistent occurrence of accidents despite efforts to maintain safe distances. This study proposes a novel spatiotemporal cooperative control approach integrating vehicle-road coordination to address this critical issue. A comprehensive methodology is developed, beginning with the calculation of safe distances under varying spatiotemporal conditions. This involves considering multiple factors, including vehicle speed differentials, positioning errors, and clock synchronization errors. Subsequently, an advanced vehicle conflict risk evaluation model is constructed. By incorporating collision acceleration and emergency acceleration as key parameters, this model offers a more accurate and detailed evaluation of potential risks during the ramp merging process. Based on the calculated safe distances and conflict risk evaluations, a mainline priority coordinated control method is formulated. This method enables the pre-planning of vehicle trajectories, effectively reducing conflicts among vehicles. Through rigorous simulations using diverse traffic volume and speed scenarios, the efficacy of the proposed strategy is validated. The results demonstrate remarkable improvements, with the average delay time reduced by an impressive 97.96% and fuel consumption decreased by 6.01%. These outcomes indicate that the proposed approach not only enhances the speed of vehicle merging but also significantly reduces latency and fuel consumption, thereby enhancing the overall performance of ramp merging operations.

Submitted 14 February, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

Journal ref: IEEE Access, vol. 13, pp. 25664-25682, 2025

arXiv:2408.06789 [pdf, ps, other]

doi 10.1109/LWC.2024.3403138

Sum Rate Maximization for Movable Antenna Enabled Uplink NOMA

Authors: Nianzu Li, Peiran Wu, Boyu Ning, Lipeng Zhu

Abstract: Movable antenna (MA) has been recently proposed as a promising candidate technology for the next generation wireless communication systems due to its significant capability of reconfiguring wireless channels via antenna movement. In this letter, we study an MA-enabled uplink non-orthogonal multiple access (NOMA) system, where each user is equipped with a single MA. Our objective is to maximize the users' sum rate by jointly optimizing the MAs' positions, the decoding order and the power control. To solve this non-convex problem, we equivalently transform it into two tractable subproblems. First, we use the successive convex approximation (SCA) to find a locally optimal solution for the antenna position optimization subproblem. Next, we derive the closed-form optimal solution of the decoding order and power control subproblem. Numerical results show that our proposed MA-enabled NOMA system can significantly enhance the sum rate compared to fixed-position antenna (FPA) systems and orthogonal multiple access (OMA) systems.

Submitted 13 August, 2024; originally announced August 2024.

Comments: 5 pages, 3 figures. Accepted to IEEE Wireless Communications Letters

Journal ref: IEEE Wireless Communications Letters, vol. 13, no. 8, pp. 2140-2144, Aug. 2024

arXiv:2408.05746 [pdf, ps, other]

Movable Antenna Enhanced AF Relaying: Two-Stage Antenna Position Optimization

Authors: Nianzu Li, Weidong Mei, Boyu Ning, Peiran Wu

Abstract: The movable antenna (MA) technology has attracted increasing attention in wireless communications due to its capability for flexibly adjusting the positions of multiple antennas in a local region to reconfigure channel conditions. In this paper, we investigate its application in an amplify-and-forward (AF) relay system, where a multi-MA AF relay is deployed to assist in the wireless communications from a source to a destination. In particular, we aim to maximize the achievable rate at the destination by jointly optimizing the AF weight matrix at the relay and its MAs' positions in two stages, for receiving the signal from the source and transmitting its amplified version to the destination, respectively. However, compared to the existing one-stage antenna position optimization, the two-stage position optimization is more challenging due to its intricate coupling in the achievable rate at the destination. To tackle this challenge, we decompose the considered problem into several subproblems by invoking alternating optimization (AO) and solve them using semidefinite programming and gradient ascent. Numerical results demonstrate the superiority of our proposed system over the conventional relaying system with fixed-position antennas (FPAs) and also provide essential insights.

Submitted 11 August, 2024; originally announced August 2024.

arXiv:2408.02934 [pdf, other]

Learned Trimmed-Ridge Regression for Channel Estimation in Millimeter-Wave Massive MIMO

Authors: Pengxia Wu, Julian Cheng, Yonina C. Eldar, John M. Cioffi

Abstract: Channel estimation poses significant challenges in millimeter-wave massive multiple-input multiple-output systems, especially when the base station has fewer radio-frequency chains than antennas. To address this challenge, one promising solution exploits the beamspace channel sparsity to reconstruct full-dimensional channels from incomplete measurements. This paper presents a model-based deep learning method to reconstruct sparse, as well as approximately sparse, vectors fast and accurately. To implement this method, we propose a trimmed-ridge regression that transforms the sparse-reconstruction problem into a least-squares problem regularized by a nonconvex penalty term, and then derive an iterative solution. We then unfold the iterations into a deep network that can be implemented in online applications to realize real-time computations. To this end, an unfolded trimmed-ridge regression model is constructed using a structural configuration to reduce computational complexity and a model ensemble strategy to improve accuracy. Compared with other state-of-the-art deep learning models, the proposed learning scheme achieves better accuracy and supports higher downlink sum rates.

Submitted 5 August, 2024; originally announced August 2024.

Comments: Accepted by IEEE Transactions on Communications

arXiv:2407.21345 [pdf, other]

Towards EMG-to-Speech with a Necklace Form Factor

Authors: Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W Black, Rikky Muller, Gopala Krishna Anumanchipalli

Abstract: Electrodes for decoding speech from electromyography (EMG) are typically placed on the face, requiring adhesives that are inconvenient and skin-irritating if used regularly. We explore a different device form factor, where dry electrodes are placed around the neck instead. 11-word, multi-speaker voiced EMG classifiers trained on data recorded with this device achieve 92.7% accuracy. Ablation studies reveal the importance of having more than two electrodes on the neck, and phonological analyses reveal similar classification confusions between neck-only and neck-and-face form factors. Finally, speech-EMG correlation experiments demonstrate a linear relationship between many EMG spectrogram frequency bins and self-supervised speech representation dimensions.

Submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.18627 [pdf, ps, other]

Multi-Agent Deep Reinforcement Learning for Energy Efficient Multi-Hop STAR-RIS-Assisted Transmissions

Authors: Pei-Hsiang Liao, Li-Hsiang Shen, Po-Chen Wu, Kai-Ten Feng

Abstract: Simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) provides a promising way to expand coverage in wireless communications. However, the limitations of a single STAR-RIS inspire us to integrate the concept of multi-hop transmissions, as studied for RIS in existing research. Therefore, we propose a novel architecture of multi-hop STAR-RISs to achieve a wider range of full-plane service coverage. In this paper, we solve for the active beamforming of the base station and the passive beamforming of the STAR-RISs, aiming to maximize the energy efficiency under the hardware limitations of STAR-RISs. Furthermore, we investigate the impact of the on-off state of STAR-RIS elements on energy efficiency. To tackle this complex problem, a Multi-Agent Global and locAl deep Reinforcement learning (MAGAR) algorithm is designed. The global agent elevates the collaboration among local agents, which focus on individual learning. In numerical results, we observe a significant improvement of MAGAR over the other benchmarks, including Q-learning, multi-agent deep Q network (DQN) with global reward, and multi-agent DQN with local rewards. Moreover, the proposed architecture of multi-hop STAR-RISs achieves the highest energy efficiency compared to mode-switching-based STAR-RISs, conventional RISs, and deployment without RISs or STAR-RISs.

Submitted 26 July, 2024; originally announced July 2024.

Comments: Accepted by Proc. IEEE VTC-fall

arXiv:2407.17691 [pdf, other]

System-Level Simulation Framework for NB-IoT: Key Features and Performance Evaluation

Authors: Shutao Zhang, Wenkun Wen, Peiran Wu, Hongqing Huang, Liya Zhu, Yijia Guo, Tingting Yang, Minghua Xia

Abstract: Narrowband Internet of Things (NB-IoT) is a technology specifically designated by the 3rd Generation Partnership Project (3GPP) to meet the explosive demand for massive machine-type communications (mMTC), and it is evolving to RedCap. Industrial companies have increasingly adopted NB-IoT as the solution for mMTC due to its lightweight design and comprehensive technical specifications released by 3GPP. This paper presents a system-level simulation framework for NB-IoT networks to evaluate their performance. The system-level simulator is structured into four parts: initialization, pre-generation, main simulation loop, and post-processing. Additionally, three essential features are investigated to enhance coverage, support massive connections, and ensure low power consumption, respectively. Simulation results demonstrate that the cumulative distribution function curves of the signal-to-interference-and-noise ratio fully comply with industrial standards. Furthermore, the throughput performance explains how NB-IoT networks realize massive connections at the cost of data rate. This work highlights its practical utility and paves the way for developing NB-IoT networks.

Submitted 13 August, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

arXiv:2406.15754 [pdf, other]

Multimodal Segmentation for Vocal Tract Modeling

Authors: Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.

Submitted 22 June, 2024; originally announced June 2024.

Comments: Interspeech 2024

arXiv:2406.12998 [pdf, other]

doi 10.1109/JSTSP.2024.3497655

Coding Speech through Vocal Tract Kinematics

Authors: Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

Abstract: Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech: Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.

Submitted 14 December, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1427-1440, Dec. 2024

arXiv:2405.15153 [pdf, other]

doi 10.1109/JIOT.2024.3524405

Optimal Reference Nodes Deployment for Positioning Seafloor Anchor Nodes

Authors: Wei Huang, Pengfei Wu, Tianhe Xu, Hao Zhang, Kaitao Meng

Abstract: Seafloor anchor nodes, which form a geodetic network, are designed to provide surface and underwater users with positioning, navigation and timing (PNT) services. Due to the non-uniform distribution of underwater sound speed, accurate positioning of underwater anchor nodes is challenging. Traditional anchor node positioning typically uses cross or circular shapes; however, how to optimize the deployment of reference nodes for positioning underwater anchor nodes while considering the variability of sound speed has not yet been studied. This paper focuses on optimal reference node deployment strategies for time-of-arrival (TOA) localization in three-dimensional (3D) underwater space. We adopt the criterion of minimizing the trace of the inverse Fisher information matrix (FIM) to determine the optimal reference node deployment with Gaussian measurement noise, which is positively related to the signal propagation path. A comprehensive analysis of optimal reference-target geometries is provided for the general case, with no restriction on the number of reference nodes, elevation angle, or reference-target range. A new semi-closed-form solution is found to determine the optimal geometries. To demonstrate the findings in this paper, we conducted both simulations and sea trials on underwater anchor node positioning. Both the simulation and experimental results are consistent with the theoretical analysis.

Submitted 23 May, 2024; originally announced May 2024.

Journal ref: IEEE Internet of Things Journal, 2024

arXiv:2404.14132 [pdf, other]

CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task

Authors: Kangzhen Yang, Tao Hu, Kexin Dai, Genggeng Chen, Yu Cao, Wei Dong, Peng Wu, Yanning Zhang, Qingsen Yan

Abstract: In real-world scenarios, captured images often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people can usually only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, merely performing a single type of image enhancement still cannot yield satisfactory images. To deal with this challenge, we propose the Composite Refinement Network (CRNet), which uses multiple exposure images. By fully integrating information-rich multiple exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.

Submitted 22 April, 2024; originally announced April 2024.

Comments: This paper is accepted by CVPR2024 Workshop, Code: https://github.com/CalvinYang0/CRNet

arXiv:2404.13537 [pdf, other]

Bracketing Image Restoration and Enhancement with High-Low Frequency Decomposition

Authors: Genggeng Chen, Kexin Dai, Kangzhen Yang, Tao Hu, Xiangyu Chen, Yongqing Yang, Wei Dong, Peng Wu, Yanning Zhang, Qingsen Yan

Abstract: In real-world scenarios, due to a series of image degradations, obtaining high-quality, clear content photos is challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlooked the characteristics of different degradations. They applied the same structure to address various types of degradation, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high/low frequency information is applicable to different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two modules for feature extraction: shared weight modules and non-shared weight modules. In the shared weight modules, we use SCConv to extract common features from different degradations. In the non-shared weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which employs different methods to handle high-low frequency information, enabling the model to address different degradations more effectively. Compared to other networks, our method takes into account the characteristics of different degradations, thus achieving higher-quality image restoration.

Submitted 24 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

Comments: This paper is accepted by CVPR 2024 Workshop, code: https://github.com/chengeng0613/HLNet

arXiv:2312.15668 [pdf, ps, other]

Air-to-Ground Communications Beyond 5G: UAV Swarm Formation Control and Tracking

Authors: Xiao Fan, Peiran Wu, Minghua Xia

Abstract: Unmanned aerial vehicle (UAV) communications have been widely accepted as promising technologies to support air-to-ground communications in the forthcoming sixth-generation (6G) wireless networks. This paper proposes a novel air-to-ground communication model consisting of aerial base stations served by UAVs and terrestrial user equipments (UEs) by integrating the technique of coordinated multi-point (CoMP) transmission with the theory of stochastic geometry. In particular, a CoMP set consisting of multiple UAVs is developed based on the theory of Poisson-Delaunay tetrahedralization. Effective UAV formation control and UAV swarm tracking schemes for two typical scenarios, including static and mobile UEs, are also developed using the multi-agent system theory to ensure that collaborative UAVs can efficiently reach target spatial positions for mission execution. Thanks to the ease of mathematical tractability, this model provides explicit performance expressions for a typical UE's coverage probability and achievable ergodic rate. Extensive simulation and numerical results corroborate that the proposed scheme outperforms UAV communications without CoMP transmission and obtains similar performance to the conventional CoMP scheme while avoiding search overhead.

Submitted 25 December, 2023; originally announced December 2023.

Comments: 14 pages, 9 figures, to appear in IEEE TWC

arXiv:2312.12810 [pdf, other]

Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection

Authors: Jiachen Lian, Carly Feng, Naasir Farooqi, Steve Li, Anshul Kashyap, Cheol Jun Cho, Peter Wu, Robbie Netzorg, Tingle Li, Gopala Krishna Anumanchipalli

Abstract: Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 2023 ASRU

arXiv:2312.09034 [pdf, other]

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

Authors: Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson

Abstract: Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.

Submitted 14 December, 2023; originally announced December 2023.

Comments: ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2312.01566 [pdf, other]

Coronary Atherosclerotic Plaque Characterization with Photon-counting CT: a Simulation-based Feasibility Study

Authors: Mengzhou Li, Mingye Wu, Jed Pack, Pengwei Wu, Bruno De Man, Adam Wang, Koen Nieman, Ge Wang

Abstract: Recent development of photon-counting CT (PCCT) brings great opportunities for plaque characterization with much-improved spatial resolution and spectral imaging capability. While existing coronary plaque PCCT imaging results are based on detectors made of CZT or CdTe materials, deep-silicon photon-counting detectors have unique performance characteristics and promise distinct imaging capabilities. In this work, we report a systematic simulation study of a deep-silicon PCCT scanner with a new clinically-relevant digital plaque phantom with realistic geometrical parameters and chemical compositions. This work investigates the effects of spatial resolution, noise, motion artifacts, radiation dose, and spectral characterization. Our simulation results suggest that the deep-silicon PCCT design provides adequate spatial resolution for visualizing a necrotic core and quantitation of key plaque features. Advanced denoising techniques and aggressive bowtie filter designs can keep image noise to acceptable levels at this resolution while keeping radiation dose comparable to that of a conventional CT scan. The ultrahigh resolution of PCCT also means an elevated sensitivity to motion artifacts. It is found that a tolerance of less than 0.4 mm residual movement range requires the application of accurate motion correction methods for best plaque imaging quality with PCCT.

Submitted 3 December, 2023; originally announced December 2023.

Comments: 13 figures, 5 tables

arXiv:2311.09537 [pdf, other]

doi 10.3390/jmse12060943

Future Full-Ocean Deep SSPs Prediction based on Hierarchical Long Short-Term Memory Neural Networks

Authors: Jiajun Lu, Hao Zhang, Pengfei Wu, Sijia Li, Wei Huang

Abstract: The spatial-temporal distribution of underwater sound velocity affects the propagation mode of underwater acoustic signals. Therefore, rapid estimation and prediction of the underwater sound velocity distribution is crucial for providing underwater positioning, navigation and timing (PNT) services. Currently, sound speed profile (SSP) inversion methods have a faster time response than direct measurement methods; however, most SSP inversion methods focus on constructing spatial sound velocity fields and are highly dependent on sonar observation data, which places high requirements on observation data sources. To explore the distribution pattern of sound velocity in the time dimension and achieve future SSP prediction without sonar observation data, we propose a hierarchical long short-term memory (H-LSTM) neural network for SSP prediction. With our SSP prediction method, the sound speed distribution can be estimated without any on-site data measurement, so that the time efficiency can be greatly improved. Compared with other state-of-the-art methods, H-LSTM achieves better accuracy in predicting the monthly average sound velocity distribution, with an error of less than 1 m/s across different depth layers.

Submitted 15 November, 2023; originally announced November 2023.

Comments: arXiv admin note: text overlap with arXiv:2310.09522

arXiv:2310.16287 [pdf, other]

Towards Streaming Speech-to-Avatar Synthesis

Authors: Tejas S. Prabhune, Peter Wu, Bohan Yu, Gopala K. Anumanchipalli

Abstract: Streaming speech-to-avatar synthesis creates real-time animations for a virtual character from audio data. Accurate avatar representations of speech are important for the visualization of sound in linguistics, phonetics, and phonology, visual feedback to assist second language acquisition, and virtual embodiment for paralyzed patients. Previous works have highlighted the capability of deep articulatory inversion to perform high-quality avatar animation using electromagnetic articulography (EMA) features. However, these models focus on offline avatar synthesis with recordings rather than real-time audio, which is necessary for live avatar visualization or embodiment. To address this issue, we propose a method using articulatory inversion for streaming high quality facial and inner-mouth avatar animation from real-time audio. Our approach achieves 130ms average streaming latency for every 0.1 seconds of audio with a 0.792 correlation with ground truth articulations. Finally, we show generated mouth and tongue animations to demonstrate the efficacy of our methodology.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2310.14778 [pdf, other]

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

Abstract: Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, Bayesian filters and deep learning-based methods can solve the problems of data association, audio-visual fusion, and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.

Submitted 13 April, 2025; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.08251 [pdf, other]

doi 10.3390/jmse12122356

Underwater Sound Speed Profile Construction: A Review

Authors: Wei Huang, Jixuan Zhou, Fan Gao, Jiajun Lu, Sijia Li, Pengfei Wu, Junting Wang, Hao Zhang, Tianhe Xu

Abstract: Real-time and accurate construction of regional sound speed profiles (SSP) is important for building underwater positioning, navigation, and timing (PNT) systems, as it greatly affects signal propagation modes such as trajectory. In this paper, we summarize and analyze the current research status in the field of underwater SSP construction; the mainstream methods include direct SSP measurement and SSP inversion. For the direct measurement method, we compare the performance of popular international commercial conductivity, temperature, and depth (CTD) profilers. For the inversion methods, the framework and basic principles of matched field processing (MFP), compressive sensing (CS), and deep learning (DL) for constructing SSP are introduced, and their advantages and disadvantages are compared. The traditional direct measurement method has good accuracy, but it usually takes a long time. The SSP inversion method greatly improves convenience and real-time performance, but its accuracy is not as good as that of the direct measurement method. Currently, SSP inversion relies on sonar observation data, making it difficult to apply to areas that are not covered by underwater observation systems, and these methods are unable to predict the distribution of sound velocity at future times. How to comprehensively utilize multi-source data and provide elastic sound velocity distribution estimation services with different accuracy and real-time requirements for underwater users without sonar observation data is the mainstream trend in future research on SSP construction.

Submitted 12 October, 2023; originally announced October 2023.

Journal ref: Journal of Marine Science and Engineering, 2024

arXiv:2310.02497 [pdf, other]

Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities

Authors: Robin Netzorg, Bohan Yu, Andrea Guzman, Peter Wu, Luna McNulty, Gopala Anumanchipalli

Abstract: Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that is an intermediary of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the information encoded in a PQ-based representation is predictable by various speech representations.

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.07861 [pdf, other]

CiwaGAN: Articulatory information exchange

Authors: Gašper Beguš, Thomas Lu, Alan Zhou, Peter Wu, Gopala K. Anumanchipalli

Abstract: Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeling and information exchange separately, our model is the first to combine the two components. The paper also proposes an improved articulatory model with more interpretable internal representations. The proposed CiwaGAN model is the most realistic approximation of human spoken language acquisition using deep learning. As such, it is useful for cognitively plausible simulations of the human speech act.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.05262 [pdf, other]

Robust Interference Mitigation techniques for Direct Position Estimation

Authors: Haoqing Li, Shuo Tang, Peng Wu, Pau Closas

Abstract: Global Navigation Satellite System (GNSS) is pervasive in navigation and positioning applications, where precise position and time referencing estimations are required. Conventional methods for GNSS positioning involve a two-step process, where intermediate measurements such as Doppler shift and time delay of received GNSS signals are computed and then used to solve for the receiver's position. Alternatively, Direct Position Estimation (DPE) was proposed to infer the position directly from the sampled signal without intermediate variables, yielding superior levels of sensitivity and operation under challenging environments. However, the positioning resilience of the DPE method is still under threat from various types of interference. Robust Interference Mitigation (RIM) processing has been studied and proved to be efficient against various types of interference in conventional two-step positioning (2SP) methods, and is therefore worth exploring for its potential to enhance DPE. This article extends DPE methodology by incorporating RIM strategies that address the increasing need to protect GNSS receivers against intentional or unintentional interference, such as jamming signals, which can deny GNSS-based positioning. RIM, which leverages robust statistics, was shown to provide competitive results in two-step approaches and is here employed in a high-sensitivity DPE framework with successful results. The article also provides a quantification of the loss of efficiency of using RIM when no interference is present and validates the proposed methodology on relevant interference cases, while the approach can be used to mitigate other common interference signals.

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2308.03420 [pdf]

A Safe DRL Method for Fast Solution of Real-Time Optimal Power Flow

Authors: Pengfei Wu, Chen Chen, Dexiang Lai, Jian Zhong

Abstract: High-level penetration of intermittent renewable energy sources (RESs) has introduced significant uncertainties into modern power systems. In order to rapidly and economically respond to the fluctuations of power system operating state, this paper proposes a safe deep reinforcement learning (SDRL) based method for fast solution of real-time optimal power flow (RT-OPF) problems. The proposed method considers the volatility of RESs and temporal constraints, and formulates the RT-OPF as a Constrained Markov Decision Process (CMDP). In the training process, the proposed method hybridizes the proximal policy optimization (PPO) and the primal-dual method. Instead of integrating the constraint violation penalty with the reward function, its actor gradients are estimated by a Lagrange advantage function which is derived from two critic systems based on economic reward and violation cost. The decoupling of reward and cost alleviates reward sparsity while improving critic approximation accuracy. Moreover, the introduction of Lagrange multipliers enables the agent to comprehend the trade-off between optimality and feasibility. Numerical tests are carried out and compared with penalty-based DRL methods on the IEEE 9-bus, 30-bus, and 118-bus test systems. The results show that the well-trained SDRL agent can significantly improve the computation efficiency while satisfying the security constraints and optimality requirements.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2307.16096 [pdf, ps, other]

D-STAR: Dual Simultaneously Transmitting and Reflecting Reconfigurable Intelligent Surfaces for Joint Uplink/Downlink Transmission

Authors: Li-Hsiang Shen, Po-Chen Wu, Chia-Jou Ku, Yu-Ting Li, Kai-Ten Feng, Yuanwei Liu, Lajos Hanzo

Abstract: The joint uplink/downlink (JUD) design of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) is conceived in support of both uplink (UL) and downlink (DL) users. Furthermore, the dual STAR-RISs (D-STAR) concept is conceived as a promising architecture for 360-degree full-plane service coverage, including UL/DL users located between the base station (BS) and the D-STAR as well as beyond. The corresponding regions are termed the primary (P) and secondary (S) regions. Both the BS and users exist in the P-region, but only users are located in the S-region. The primary STAR-RIS (STAR-P) plays an important role in terms of tackling the P-region inter-user interference, the self-interference (SI) from the BS and from the reflective as well as refractive UL users imposed on the DL receiver. By contrast, the secondary STAR-RIS (STAR-S) aims to mitigate the S-region interference. The non-linear and non-convex rate-maximization problem formulated is solved by alternating optimization amongst the decomposed convex sub-problems of the BS beamformer, and the D-STAR amplitude as well as phase shift configurations. We also propose a D-STAR based active beamforming and passive STAR-RIS amplitude/phase (DBAP) optimization scheme to solve the respective sub-problems by Lagrange dual with Dinkelbach's transformation, alternating direction method of multipliers (ADMM) with successive convex approximation (SCA), and penalty convex-concave procedure (PCCP). Our simulation results reveal that the proposed D-STAR architecture outperforms the conventional single RIS, single STAR-RIS, and half-duplex networks. The proposed DBAP of D-STAR outperforms the state-of-the-art solutions found in the open literature for different numbers of quantization levels, geographic deployment, transmit power and for diverse numbers of transmit antennas, patch partitions as well as D-STAR elements.

Submitted 8 February, 2024; v1 submitted 29 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE TCOM

arXiv:2307.02471 [pdf, other]

Deep Speech Synthesis from MRI-Based Articulatory Representations

Authors: Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

Abstract: In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.

Submitted 5 July, 2023; originally announced July 2023.

arXiv:2306.13558 [pdf, other]

One-Bit Spectrum Sensing for Cognitive Radio

Authors: Pei-Wen Wu, Lei Huang, David Ramírez, Yu-Hang Xiao, Hing Cheung So

Abstract: Spectrum sensing in cognitive radio necessitates effective monitoring of wide bandwidths, which requires high-rate sampling. Traditional spectrum sensing methods employing high-precision analog-to-digital converters (ADCs) result in increased power consumption and expensive hardware costs. In this paper, we explore blind spectrum sensing utilizing one-bit ADCs. We derive a closed-form detector based on Rao's test and demonstrate its equivalence with the second-order eigenvalue-moment-ratio test. Furthermore, a near-exact distribution based on the moment-based method, and an approximate distribution in the low signal-to-noise ratio (SNR) regime with the use of the central limit theorem, are obtained. Theoretical analysis is then performed and our results show that the performance loss of the proposed detector is approximately $2$ dB ($π/2$) compared to detectors employing $\infty$-bit ADCs when SNR is low. This loss can be compensated for by using approximately $2.47$ ($π^2/4$) times more samples. In addition, we unveil that the efficiency of incoherent accumulation in one-bit detection is the square root of that of coherent accumulation. Simulation results corroborate the correctness of our theoretical calculations.

Submitted 23 June, 2023; originally announced June 2023.

arXiv:2306.10359 [pdf, other]

Text-Driven Foley Sound Generation With Latent Diffusion Model

Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang

Abstract: Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vectors). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.

Submitted 18 September, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

Comments: Submitted to the DCASE Workshop 2023; extends and supersedes the previous technical report arXiv:2305.15905

arXiv:2305.17896 [pdf, other]

Continuous and Noninvasive Measurement of Arterial Pulse Pressure and Pressure Waveform using an Image-free Ultrasound System

Authors: Lirui Xu, Pang Wu, Pan Xia, Fanglin Geng, Peng Wang, Xianxiang Chen, Zhenfeng Li, Lidong Du, Shuping Liu, Li Li, Hongbo Chang, Zhen Fang

Abstract: The local beat-to-beat pulse pressure (PP) and blood pressure waveform of arteries, especially central arteries, are important indicators of the course of cardiovascular diseases (CVDs). Nevertheless, noninvasive measurement of them remains a challenge in the clinic. This work presents a three-element image-free ultrasound system with a low-computational method for real-time measurement of local pulse wave velocity (PWV) and diameter waveforms, enabling real-time and noninvasive continuous PP and blood pressure waveform measurement without calibration. The developed system has been well-validated in vitro and in vivo. In in vitro cardiovascular phantom experiments, the results demonstrated high accuracy in the measurement of PP (error < 3 mmHg) and blood pressure waveform (root-mean-square error (RMSE) < 2 mmHg, correlation coefficient (r) > 0.99). In subsequent human carotid experiments, the system was compared with an arterial tonometer, which showed excellent PP accuracy (mean absolute error (MAE) = 3.7 ± 3.4 mmHg) and pressure waveform similarity (RMSE = 3.7 ± 1.6 mmHg, r = 0.98 ± 0.01). Furthermore, comparative experiments with the volume clamp device demonstrated the system's ability to accurately trace blood pressure changes (induced by deep breathing) over a period of one minute, with the MAE of DBP, MAP, and SBP within 5 ± 8 mmHg. The present results demonstrate the accuracy and reliability of the developed system for continuous and noninvasive measurement of arterial PP and blood pressure waveforms, with potential applications in the diagnosis and prevention of CVDs.

Submitted 29 May, 2023; originally announced May 2023.

Comments: 13 pages, 12 figures

arXiv:2305.17499 [pdf, other]

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Authors: Linhao Dong, Zhecheng An, Peihao Wu, Jun Zhang, Lu Lu, Zejun Ma

Abstract: Speech or text representations generated by pre-trained models contain modal-specific information that could be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment, continuous integrate-and-fire (CIF), to bridge the representations between speech and text. It jointly performs speech-to-text training and language model distillation through CIF as the pre-training (PT). Evaluated on the SLU benchmark SLURP dataset, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot filling, respectively. We also observe that the cross-modal representation extracted by CIF-PT obtains better performance than other neural interfaces for the tasks of SLU, including the dominant speech representation learned from self-supervised pre-training.

Submitted 27 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023 Findings

arXiv:2305.00383 [pdf, other]

Edge Learning for Large-Scale Internet of Things With Task-Oriented Efficient Communication

Authors: Haihui Xie, Minghua Xia, Peiran Wu, Shuai Wang, H. Vincent Poor

Abstract: In Internet of Things (IoT) networks, edge learning for data-driven tasks provides intelligent applications and services. As the network size becomes large, different users may generate distinct datasets. Thus, to suit multiple edge learning tasks for large-scale IoT networks, this paper performs efficient communication under the task-oriented principle by using the collaborative design of wireless resource allocation and edge learning error prediction. In particular, we start with multi-user scheduling to alleviate co-channel interference in dense networks. Then, we perform optimal power allocation in parallel for different learning tasks. Thanks to the high parallelization of the designed algorithm, extensive experimental results corroborate that the multi-user scheduling and task-oriented power allocation improve the performance of distinct edge learning tasks efficiently compared with the state-of-the-art benchmark algorithms.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 16 pages, 8 figures; accepted for publication in IEEE TWC

arXiv:2302.06774 [pdf, other]

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

Authors: Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

Abstract: To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages self-supervision to generalize to unseen speakers. Our approach obtains 0.784 correlation on an electromagnetic articulography (EMA) dataset, improving the state-of-the-art by 12.5%. Additionally, we show the interpretability of these representations through directly comparing the behavior of estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, demonstrating its efficacy with an 18-speaker dataset.

Submitted 24 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

arXiv:2211.00968 [pdf, ps, other]

Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

Authors: Rao Ma, Xiaobo Wu, Jin Qiu, Yanan Qin, Haihua Xu, Peihao Wu, Zejun Ma

Abstract: The ASR model deployment environment is ever-changing, and the incoming speech can be switched across different domains during a session. This brings a challenge for effective domain adaptation when only target domain text data is available, and our objective is to obtain obviously improved performance on the target domain while the performance on the general domain is less undermined. In this paper, we propose an adaptive LM fusion approach called internal language model estimation based adaptive domain adaptation (ILME-ADA). To realize such an ILME-ADA, an interpolated log-likelihood score is calculated based on the maximum of the scores from the internal LM and the external LM (ELM) respectively. We demonstrate the efficacy of the proposed ILME-ADA method with both RNN-T and LAS modeling frameworks employing neural network and n-gram LMs as ELMs respectively on two domain specific (target) test sets. The proposed method can achieve significantly better performance on the target test sets while incurring minimal performance degradation on the general test set, compared with both shallow and ILME-based LM fusion methods.

Submitted 2 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted by ICASSP 2023

arXiv:2210.15272 [pdf, ps, other]

A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution

Authors: Yisi Liu, Peter Wu, Alan W Black, Gopala K. Anumanchipalli

Abstract: Estimation of fundamental frequency (F0) in voiced segments of speech signals, also known as pitch tracking, plays a crucial role in pitch synchronous speech analysis, speech synthesis, and speech manipulation. In this paper, we capitalize on the high time and frequency resolution of the pseudo Wigner-Ville distribution (PWVD) and propose a new PWVD-based pitch estimation method. We devise an efficient algorithm to compute PWVD faster and use cepstrum-based pre-filtering to avoid cross-term interference. Evaluating our approach on a database with speech and electroglottograph (EGG) recordings yields a state-of-the-art mean absolute error (MAE) of around 4Hz. Our approach is also effective at voiced/unvoiced classification and handling sudden frequency changes.

Submitted 27 October, 2022; originally announced October 2022.

arXiv:2210.15173 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096800

Articulation GAN: Unsupervised modeling of articulatory learning

Authors: Gašper Beguš, Alan Zhou, Peter Wu, Gopala K Anumanchipalli

Abstract: Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, which results in the production of speech sounds through physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new unsupervised generative model of speech production/synthesis. The Articulatory Generator more closely mimics human speech production by learning to generate articulatory representations (electromagnetic articulography or EMA) in a fully unsupervised manner. A separate pre-trained physical model (ema2wav) then transforms the generated EMA representations to speech waveforms, which get sent to the Discriminator for evaluation. Articulatory analysis suggests that the network learns to control articulators in a similar manner to humans during speech production. Acoustic analysis of the outputs suggests that the network learns to generate words that are both present and absent in the training distribution. We additionally discuss implications of articulatory representations for cognitive models of human language and speech technology in general.

Submitted 12 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2210.11723 [pdf, other]

doi 10.1109/ICASSP49357.2023.10094711

Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech

Authors: Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala K. Anumanchipalli

Abstract: Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand such utilities, various analyses have been done for speech SSL models to reveal which and how information is encoded in the learned representations. Although the scope of previous analyses is extensive in acoustic, phonetic, and semantic perspectives, the physical grounding by speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach where we measure articulatory score as an average correlation of linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further layer-wise analyses on two most successful models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from the recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes are sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.

Submitted 20 July, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Showing 1–50 of 85 results for author: Wu, P