-
Large Language Models Can Achieve Explainable and Training-Free One-shot HRRP ATR
Authors:
Lingfeng Chen,
Panhe Hu,
Zhiliang Pan,
Qi Liu,
Zhen Liu
Abstract:
This letter introduces a pioneering, training-free and explainable framework for High-Resolution Range Profile (HRRP) automatic target recognition (ATR) utilizing large-scale pre-trained Large Language Models (LLMs). Diverging from conventional methods requiring extensive task-specific training or fine-tuning, our approach converts one-dimensional HRRP signals into textual scattering center repres…
▽ More
This letter introduces a pioneering, training-free and explainable framework for High-Resolution Range Profile (HRRP) automatic target recognition (ATR) utilizing large-scale pre-trained Large Language Models (LLMs). Diverging from conventional methods requiring extensive task-specific training or fine-tuning, our approach converts one-dimensional HRRP signals into textual scattering center representations. Prompts are designed to align LLMs' semantic space for ATR via few-shot in-context learning, effectively leveraging its vast pre-existing knowledge without any parameter update. We make our codes publicly available to foster research into LLMs for HRRP ATR.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Authors:
Zijian Lin,
Yang Zhang,
Yougen Yuan,
Yuming Yan,
Jinjiang Liu,
Zhiyong Wu,
Pengfei Hu,
Qun Yu
Abstract:
Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synt…
▽ More
Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality of the target model while accelerating inference.
△ Less
Submitted 2 June, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars
Authors:
Tianbao Zhang,
Jian Zhao,
Yuer Li,
Zheng Zhu,
Ping Hu,
Zhaoxin Fan,
Wenjun Wu,
Xuelong Li
Abstract:
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a signif…
▽ More
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
An Edge AI Solution for Space Object Detection
Authors:
Wenxuan Zhang,
Peng Hu
Abstract:
Effective Edge AI for space object detection (SOD) tasks that can facilitate real-time collision assessment and avoidance is essential with the increasing space assets in near-Earth orbits. In SOD, low Earth orbit (LEO) satellites must detect other objects with high precision and minimal delay. We explore an Edge AI solution based on deep-learning-based vision sensing for SOD tasks and propose a d…
▽ More
Effective Edge AI for space object detection (SOD) tasks that can facilitate real-time collision assessment and avoidance is essential with the increasing space assets in near-Earth orbits. In SOD, low Earth orbit (LEO) satellites must detect other objects with high precision and minimal delay. We explore an Edge AI solution based on deep-learning-based vision sensing for SOD tasks and propose a deep learning model based on Squeeze-and-Excitation (SE) layers, Vision Transformers (ViT), and YOLOv9 framework. We evaluate the performance of these models across various realistic SOD scenarios, demonstrating their ability to detect multiple satellites with high accuracy and very low latency.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Model-Based Closed-Loop Control Algorithm for Stochastic Partial Differential Equation Control
Authors:
Peiyan Hu,
Haodong Feng,
Yue Wang,
Zhiming Ma
Abstract:
Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one ha…
▽ More
Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one hand, the regularity of the system's state (which can be intuitively understood as smoothness) deteriorates, making modeling and generalization more challenging. On the other hand, this stochasticity also renders control more unstable and thus less accurate. To address this gap, we propose the Model-Based Closed-Loop Control Algorithm (MB-CC), the first model-based closed-loop control method for SPDEs. MB-CC introduces two key innovations to enhance control robustness and efficiency: a Regularity Feature (RF) block and a closed-loop strategy with an operator-encoded policy network. The RF block, inspired by the regularity structure theory of SPDEs, addresses noise-induced irregularities by transforming the network's input, including the system state and noise-perturbed external forces, into a refined feature space for improved forward prediction. Compared to previous works using regularity features, we introduce a new parameterization, data augmentation, and extend the RF block as a plug-and-play component. Additionally, to achieve closed-loop control, we introduce an operator-encoded policy network to map the current state to optimal control, which integrates physical priors and swiftly makes decisions based on states returned by the environment. We conduct a systematic evaluation of MB-CC on two notable SPDEs, showcasing its effectiveness and efficiency. The ablation studies show its ability to handle stochasticity more effectively.
△ Less
Submitted 15 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Pairing Real-Time Piano Transcription with Symbol-level Tracking for Precise and Robust Score Following
Authors:
Silvan Peter,
Patricia Hu,
Gerhard Widmer
Abstract:
Real-time music tracking systems follow a musical performance and at any time report the current position in a corresponding score. Most existing methods approach this problem exclusively in the audio domain, typically using online time warping (OLTW) techniques on incoming audio and an audio representation of the score. Audio OLTW techniques have seen incremental improvements both in features and…
▽ More
Real-time music tracking systems follow a musical performance and at any time report the current position in a corresponding score. Most existing methods approach this problem exclusively in the audio domain, typically using online time warping (OLTW) techniques on incoming audio and an audio representation of the score. Audio OLTW techniques have seen incremental improvements both in features and model heuristics which reached a performance plateau in the past ten years. We argue that converting and representing the performance in the symbolic domain -- thereby transforming music tracking into a symbolic task -- can be a more effective approach, even when the domain transformation is imperfect. Our music tracking system combines two real-time components: one handling audio-to-note transcription and the other a novel symbol-level tracker between transcribed input and score. We compare the performance of this mixed audio-symbolic approach with its equivalent audio-only counterpart, and demonstrate that our method outperforms the latter in terms of both precision, i.e., absolute tracking error, and robustness, i.e., tracking success.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
How to Infer Repeat Structures in MIDI Performances
Authors:
Silvan Peter,
Patricia Hu,
Gerhard Widmer
Abstract:
MIDI performances are generally expedient in performance research and music information retrieval, and even more so if they can be connected to a score. This connection is usually established by means of alignment, linking either notes or time points between the score and the performance. The first obstacle when trying to establish such an alignment is that a performance realizes one (out of many)…
▽ More
MIDI performances are generally expedient in performance research and music information retrieval, and even more so if they can be connected to a score. This connection is usually established by means of alignment, linking either notes or time points between the score and the performance. The first obstacle when trying to establish such an alignment is that a performance realizes one (out of many) structural versions of the score that can plausibly result from instructions such as repeats, variations, and navigation markers like 'dal segno/da capo al coda'. A score needs to be unfolded, that is, its repeats and navigation markers need to be explicitly written out to create a single timeline without jumps matching the performance, before alignment algorithms can be applied. In the curation of large performance corpora this process is carried out manually, as no tools are available to infer the repeat structure of the performance. To ease this process, we develop a method to automatically infer the repeat structure of a MIDI performance, given a symbolically encoded score including repeat and navigation markers. The intuition guiding our design is: 1) local alignment of every contiguous section of the score with a section of a performance containing the same material should receive high alignment gain, whereas local alignment with any other performance section should accrue a low or zero gain. And 2) stitching local alignments together according to a valid structural version of the score should result in an approximate full alignment and correspondingly high global accumulated gain if the structural version corresponds to the performance, and low gain for all other, ill-fitting structural versions.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability
Authors:
Wenxuan Zhang,
Peng Hu
Abstract:
The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high…
▽ More
The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high precision and minimal delay. This paper investigates the feasibility and effectiveness of employing vision sensors for SOD tasks based on deep learning (DL) models. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN) and evaluates their performance under SOD scenarios. Experimental results show that the proposed models achieve mean average precision at intersection over union threshold 0.5 (mAP50) scores of up to 0.751 and mean average precision averaged over intersection over union thresholds from 0.5 to 0.95 (mAP50:95) scores of up to 0.280. Compared to the baseline GELAN-t model, the proposed GELAN-ViT-SE model increases the average mAP50 from 0.721 to 0.751, improves the mAP50:95 from 0.266 to 0.274, reduces giga floating point operations (GFLOPs) from 7.3 to 5.6, and lowers peak power consumption from 2080.7 mW to 2028.7 mW by 2.5\%.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Rotational ultrasound and photoacoustic tomography of the human body
Authors:
Yang Zhang,
Shuai Na,
Jonathan J. Russin,
Karteekeya Sastry,
Li Lin,
Junfu Zheng,
Yilin Luo,
Xin Tong,
Yujin An,
Peng Hu,
Konstantin Maslov,
Tze-Woei Tan,
Charles Y. Liu,
Lihong V. Wang
Abstract:
Imaging the human body's morphological and angiographic information is essential for diagnosing, monitoring, and treating medical conditions. Ultrasonography performs the morphological assessment of the soft tissue based on acoustic impedance variations, whereas photoacoustic tomography (PAT) can visualize blood vessels based on intrinsic hemoglobin absorption. Three-dimensional (3D) panoramic ima…
▽ More
Imaging the human body's morphological and angiographic information is essential for diagnosing, monitoring, and treating medical conditions. Ultrasonography performs the morphological assessment of the soft tissue based on acoustic impedance variations, whereas photoacoustic tomography (PAT) can visualize blood vessels based on intrinsic hemoglobin absorption. Three-dimensional (3D) panoramic imaging of the vasculature is generally not practical in conventional ultrasonography with limited field-of-view (FOV) probes, and PAT does not provide sufficient scattering-based soft tissue morphological contrast. Complementing each other, fast panoramic rotational ultrasound tomography (RUST) and PAT are integrated for hybrid rotational ultrasound and photoacoustic tomography (RUS-PAT), which obtains 3D ultrasound structural and PAT angiographic images of the human body quasi-simultaneously. The RUST functionality is achieved in a cost-effective manner using a single-element ultrasonic transducer for ultrasound transmission and rotating arc-shaped arrays for 3D panoramic detection. RUST is superior to conventional ultrasonography, which either has a limited FOV with a linear array or is high-cost with a hemispherical array that requires both transmission and receiving. By switching the acoustic source to a light source, the system is conveniently converted to PAT mode to acquire angiographic images in the same region. Using RUS-PAT, we have successfully imaged the human head, breast, hand, and foot with a 10 cm diameter FOV, submillimeter isotropic resolution, and 10 s imaging time for each modality. The 3D RUS-PAT is a powerful tool for high-speed, 3D, dual-contrast imaging of the human body with potential for rapid clinical translation.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Cross-Environment Transfer Learning for Location-Aided Beam Prediction in 5G and Beyond Millimeter-Wave Networks
Authors:
Enrico Tosi,
Panwei Hu,
Aleksandar Ichkov,
Marina Petrova,
Ljiljana Simić
Abstract:
Millimeter-wave (mm-wave) communications requirebeamforming and consequent precise beam alignmentbetween the gNodeB (gNB) and the user equipment (UE) toovercome high propagation losses. This beam alignment needs tobe constantly updated for different UE locations based on beamsweepingradio frequency measurements, leading to significantbeam management overhead. One potential solution involvesusing m…
▽ More
Millimeter-wave (mm-wave) communications requirebeamforming and consequent precise beam alignmentbetween the gNodeB (gNB) and the user equipment (UE) toovercome high propagation losses. This beam alignment needs tobe constantly updated for different UE locations based on beamsweepingradio frequency measurements, leading to significantbeam management overhead. One potential solution involvesusing machine learning (ML) beam prediction algorithms thatleverage UE position information to select the serving beamwithout the overhead of beam sweeping. However, the highlysite-specific nature of mm-wave propagation means that MLmodels require training from scratch for each scenario, whichis inefficient in practice. In this paper, we propose a robustcross-environment transfer learning solution for location-aidedbeam prediction, whereby the ML model trained on a referencegNB is transferred to a target gNB by fine-tuning with a limiteddataset. Extensive simulation results based on ray-tracing in twourban environments show the effectiveness of our solution forboth inter- and intra-city model transfer. Our results show thatby training the model on a reference gNB and transferring themodel by fine-tuning with only 5% of the target gNB dataset,we can achieve 80% accuracy in predicting the best beamfor the target gNB. Importantly, our approach improves thepoor generalization accuracy of transferring the model to newenvironments without fine-tuning by around 75 percentage points.This demonstrates that transfer learning enables high predictionaccuracy while reducing the computational and training datasetcollection burden of ML-based beam prediction, making itpractical for 5G-and-beyond deployments.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
A Three-Dimensional Pursuit-Evasion Game Based on Fuzzy Actor-Critic Learning Algorithm
Authors:
Penglin Hu
Abstract:
Most of the existing research on pursuit-evasion game (PEG) is conducted in a two-dimensional (2D) environment. In this paper, we investigate the PEG in a 3D space. We extend the Apollonius circle (AC) to the 3D space and introduce its detailed analytical form. To enhance the capture efficiency, we derive the optimal motion space for both the pursuer and the evader. To address the issue arising fr…
▽ More
Most of the existing research on pursuit-evasion game (PEG) is conducted in a two-dimensional (2D) environment. In this paper, we investigate the PEG in a 3D space. We extend the Apollonius circle (AC) to the 3D space and introduce its detailed analytical form. To enhance the capture efficiency, we derive the optimal motion space for both the pursuer and the evader. To address the issue arising from a discrete state space, we design a fuzzy actor-critic learning (FACL) algorithm to obtain the agents' strategies. To improve learning performance, we devise a reward function for the agents, which enables obstacle avoidance functionality. The effectiveness of the proposed algorithm is validated through simulation experiments.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
A Novel Multi-Objective Reinforcement Learning Algorithm for Pursuit-Evasion Game
Authors:
Penglin Hu,
Chunhui Zhao,
Quan Pan
Abstract:
In practical application, the pursuit-evasion game (PEG) often involves multiple complex and conflicting objectives. The single-objective reinforcement learning (RL) usually focuses on a single optimization objective, and it is difficult to find the optimal balance among multiple objectives. This paper proposes a three-objective RL algorithm based on fuzzy Q-learning (FQL) to solve the PEG with di…
▽ More
In practical application, the pursuit-evasion game (PEG) often involves multiple complex and conflicting objectives. The single-objective reinforcement learning (RL) usually focuses on a single optimization objective, and it is difficult to find the optimal balance among multiple objectives. This paper proposes a three-objective RL algorithm based on fuzzy Q-learning (FQL) to solve the PEG with different optimization objectives. First, the multi-objective FQL algorithm is introduced, which uses the reward function to represent three optimization objectives: evading pursuit, reaching target, and avoiding obstacle. Second, a multi-objective evaluation method and action selection strategy based on three-dimensional hypervolume are designed, which solved the dilemma of exploration-exploitation. By sampling the Pareto front, the update rule of the global strategy is obtained. The proposed algorithm reduces computational load while ensuring exploration ability. Finally, the performance of the algorithm is verified by simulation results.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Few-shot Human Motion Recognition through Multi-Aspect mmWave FMCW Radar Data
Authors:
Hao Fan,
Lingfeng Chen,
Chengbai Xu,
Jiadong Zhou,
Yongpeng Dai,
Panhe HU
Abstract:
Radar human motion recognition methods based on deep learning models has been a heated spot of remote sensing in recent years, yet the existing methods are mostly radial-oriented. In practical application, the test data could be multi-aspect and the sample number of each motion could be very limited, causing model overfitting and reduced recognition accuracy. This paper proposed channel-DN4, a mul…
▽ More
Radar human motion recognition methods based on deep learning models has been a heated spot of remote sensing in recent years, yet the existing methods are mostly radial-oriented. In practical application, the test data could be multi-aspect and the sample number of each motion could be very limited, causing model overfitting and reduced recognition accuracy. This paper proposed channel-DN4, a multi-aspect few-shot human motion recognition method. First, local descriptors are introduced for a precise classification metric. Moreover, episodic training strategy was adopted to reduce model overfitting. To utilize the invariant sematic information in multi-aspect conditions, we considered channel attention after the embedding network to obtain precise implicit high-dimensional representation of sematic information. We tested the performance of channel-DN4 and methods for comparison on measured mmWave FMCW radar data. The proposed channel-DN4 produced competitive and convincing results, reaching the highest 87.533% recognition accuracy in 3-way 10-shot condition while other methods suffer from overfitting. Codes are available at: https://github.com/MountainChenCad/channel-DN4
△ Less
Submitted 19 January, 2025;
originally announced January 2025.
-
Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers
Authors:
Wenxuan Zhang,
Peng Hu
Abstract:
The rapid increase of space assets represented by small satellites in low Earth orbit can enable ubiquitous digital services for everyone. However, due to the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily introduce adverse conditions affecting space safety, operations, and sustainability of the outer space environment. This chal…
▽ More
The rapid increase of space assets represented by small satellites in low Earth orbit can enable ubiquitous digital services for everyone. However, due to the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily introduce adverse conditions affecting space safety, operations, and sustainability of the outer space environment. This challenge calls for responsive, effective satellite object detection (SOD) solutions that allow a small satellite to assess and respond to collision risks, with the consideration of constrained resources on a small satellite platform. This paper discusses the SOD tasks and onboard deep learning (DL) approach to the tasks. Two new DL models are proposed, called GELAN-ViT and GELAN-RepViT, which incorporate vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and address limitations by separating the convolutional neural network and ViT paths. These models outperform the state-of-the-art YOLOv9-t in terms of mean average precision (mAP) and computational costs. On the SOD dataset, our proposed models can achieve around 95% mAP50 with giga-floating point operations (GFLOPs) reduced by over 5.0. On the VOC 2012 dataset, they can achieve $\geq$ 60.7% mAP50 with GFLOPs reduced by over 5.2.
△ Less
Submitted 14 December, 2024; v1 submitted 11 December, 2024;
originally announced December 2024.
-
SigmaRL: A Sample-Efficient and Generalizable Multi-Agent Reinforcement Learning Framework for Motion Planning
Authors:
Jianye Xu,
Pan Hu,
Bassam Alrifaee
Abstract:
This paper introduces an open-source, decentralized framework named SigmaRL, designed to enhance both sample efficiency and generalization of multi-agent Reinforcement Learning (RL) for motion planning of connected and automated vehicles. Most RL agents exhibit a limited capacity to generalize, often focusing narrowly on specific scenarios, and are usually evaluated in similar or even the same sce…
▽ More
This paper introduces an open-source, decentralized framework named SigmaRL, designed to enhance both sample efficiency and generalization of multi-agent Reinforcement Learning (RL) for motion planning of connected and automated vehicles. Most RL agents exhibit a limited capacity to generalize, often focusing narrowly on specific scenarios, and are usually evaluated in similar or even the same scenarios seen during training. Various methods have been proposed to address these challenges, including experience replay and regularization. However, how observation design in RL affects sample efficiency and generalization remains an under-explored area. We address this gap by proposing five strategies to design information-dense observations, focusing on general features that are applicable to most traffic scenarios. We train our RL agents using these strategies on an intersection and evaluate their generalization through numerical experiments across completely unseen traffic scenarios, including a new intersection, an on-ramp, and a roundabout. Incorporating these information-dense observations reduces training times to under one hour on a single CPU, and the evaluation results reveal that our RL agents can effectively zero-shot generalize. Code: github.com/bassamlab/SigmaRL
△ Less
Submitted 10 April, 2025; v1 submitted 14 August, 2024;
originally announced August 2024.
-
Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems
Authors:
Lukáš Samuel Marták,
Patricia Hu,
Gerhard Widmer
Abstract:
Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high quality data, they are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to oth…
▽ More
Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high quality data, they are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music, and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem, and the extent to which it continues to trouble these systems.
△ Less
Submitted 8 August, 2024;
originally announced August 2024.
-
CL-DiffPhyCon: Closed-loop Diffusion Control of Complex Physical Systems
Authors:
Long Wei,
Haodong Feng,
Yuchen Yang,
Ruiqi Feng,
Peiyan Hu,
Xiang Zheng,
Tao Zhang,
Dixia Fan,
Tailin Wu
Abstract:
The control problems of complex physical systems have broad applications in science and engineering. Previous studies have shown that generative control methods based on diffusion models offer significant advantages for solving these problems. However, existing generative control approaches face challenges in both performance and efficiency when extended to the closed-loop setting, which is essent…
▽ More
The control problems of complex physical systems have broad applications in science and engineering. Previous studies have shown that generative control methods based on diffusion models offer significant advantages for solving these problems. However, existing generative control approaches face challenges in both performance and efficiency when extended to the closed-loop setting, which is essential for effective control. In this paper, we propose an efficient Closed-Loop Diffusion method for Physical systems Control (CL-DiffPhyCon). By employing an asynchronous denoising framework for different physical time steps, CL-DiffPhyCon generates control signals conditioned on real-time feedback from the system with significantly reduced computational cost during sampling. Additionally, the control process could be further accelerated by incorporating fast sampling techniques, such as DDIM. We evaluate CL-DiffPhyCon on two tasks: 1D Burgers' equation control and 2D incompressible fluid control. The results demonstrate that CL-DiffPhyCon achieves superior control performance with significant improvements in sampling efficiency. The code can be found at https://github.com/AI4Science-WestlakeU/CL_DiffPhyCon.
△ Less
Submitted 22 February, 2025; v1 submitted 31 July, 2024;
originally announced August 2024.
-
A Deep Learning-Based Target Radial Length Estimation Method through HRRP Sequence
Authors:
Lingfeng Chen,
Panhe Hu,
Zhiliang Pan,
Xiao Sun,
Zehao Wang
Abstract:
This paper introduces an innovative deep learning-based method for end-to-end target radial length estimation from HRRP (High Resolution Range Profile) sequences. Firstly, the HRRP sequences are normalized and transformed into GAF (Gram Angular Field) images to effectively capture and utilize the temporal information. Subsequently, these GAF images serve as the input for a pretrained ResNet-101 mo…
▽ More
This paper introduces an innovative deep learning-based method for end-to-end target radial length estimation from HRRP (High Resolution Range Profile) sequences. Firstly, the HRRP sequences are normalized and transformed into GAF (Gram Angular Field) images to effectively capture and utilize the temporal information. Subsequently, these GAF images serve as the input for a pretrained ResNet-101 model, which is then fine-tuned for target radial length estimation. The simulation results show that compared to traditional threshold method and simple networks e.g. one-dimensional CNN (Convolutional Neural Network), the proposed method demonstrates superior noise resistance and higher accuracy under low SNR (Signal-to-Noise Ratio) conditions.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
HRRPGraphNet: Make HRRPs to Be Graphs for Efficient Target Recognition
Authors:
Lingfeng Chen,
Xiao Sun,
Zhiliang Pan,
Zehao Wang,
Xiaolong Su,
Zhen Liu,
Panhe Hu
Abstract:
High Resolution Range Profiles (HRRP) have become a key area of focus in the domain of Radar Automatic Target Recognition (RATR). Despite the success of deep learning based HRRP recognition, these methods needs a large amount of training samples to generate good performance, which could be a severe challenge under non-cooperative circumstances. Currently, deep learning based models treat HRRP as s…
▽ More
High Resolution Range Profiles (HRRP) have become a key area of focus in the domain of Radar Automatic Target Recognition (RATR). Despite the success of deep learning based HRRP recognition, these methods needs a large amount of training samples to generate good performance, which could be a severe challenge under non-cooperative circumstances. Currently, deep learning based models treat HRRP as sequences, which may lead to ignorance of the internal relationship of range cells. This letter introduces HRRPGraphNet, whose pivotal innovation is the transformation of HRRP data into a novel graph structure, utilizing a range cell amplitude(hyphen)based node vector and a range(hyphen)relative adjacency matrix. This graph(hyphen)based approach facilitates both local feature extraction via one(hyphen)dimensional convolution layers, global feature extraction through a graph convolution layer and a attention module. Experiments on the aircraft electromagnetic simulation dataset confirmed HRRPGraphNet superior accuracy and robustness, particularly in limited training sample environments, underscoring the potential of graph(hyphen)driven innovations in HRRP(hyphen)based RATR.
△ Less
Submitted 1 November, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios
Authors:
Ya Jiang,
Qing Wang,
Jun Du,
Maocheng Hu,
Pengfei Hu,
Zeyan Liu,
Shi Cheng,
Zhaoxu Nian,
Yuxuan Dong,
Mingqi Cai,
Xin Fang,
Chin-Hui Lee
Abstract:
This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich c…
▽ More
This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performances. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Towards Musically Informed Evaluation of Piano Transcription Models
Authors:
Patricia Hu,
Lukáš Samuel Marták,
Carlos Cancino-Chacón,
Gerhard Widmer
Abstract:
Automatic piano transcription models are typically evaluated using simple frame- or note-wise information retrieval (IR) metrics. Such benchmark metrics do not provide insights into the transcription quality of specific musical aspects such as articulation, dynamics, or rhythmic precision of the output, which are essential in the context of expressive performance analysis. Furthermore, in recent y…
▽ More
Automatic piano transcription models are typically evaluated using simple frame- or note-wise information retrieval (IR) metrics. Such benchmark metrics do not provide insights into the transcription quality of specific musical aspects such as articulation, dynamics, or rhythmic precision of the output, which are essential in the context of expressive performance analysis. Furthermore, in recent years, MAESTRO has become the de-facto training and evaluation dataset for such models. However, inference performance has been observed to deteriorate substantially when applied on out-of-distribution data, thereby questioning the suitability and reliability of transcribed outputs from such models for specific MIR tasks. In this work, we investigate the performance of three state-of-the-art piano transcription models in two experiments. In the first one, we propose a variety of musically informed evaluation metrics which, in contrast to the IR metrics, offer more detailed insight into the musical quality of the transcriptions. In the second experiment, we compare inference performance on real-world and perturbed audio recordings, and highlight musical dimensions which our metrics can help explain. Our experimental results highlight the weaknesses of existing piano transcription metrics and contribute to a more musically sound error analysis of transcription outputs.
△ Less
Submitted 7 October, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Comparing remote sensing-based forest biomass mapping approaches using new forest inventory plots in contrasting forests in northeastern and southwestern China
Authors:
Wenquan Dong,
Edward T. A. Mitchard,
Yuwei Chen,
Man Chen,
Congfeng Cao,
Peilun Hu,
Cong Xu,
Steven Hancock
Abstract:
Large-scale high spatial resolution aboveground biomass (AGB) maps play a crucial role in determining forest carbon stocks and how they are changing, which is instrumental in understanding the global carbon cycle, and implementing policy to mitigate climate change. The advent of the new space-borne LiDAR sensor, NASA's GEDI instrument, provides unparalleled possibilities for the accurate and unbia…
▽ More
Large-scale high spatial resolution aboveground biomass (AGB) maps play a crucial role in determining forest carbon stocks and how they are changing, which is instrumental in understanding the global carbon cycle, and implementing policy to mitigate climate change. The advent of the new space-borne LiDAR sensor, NASA's GEDI instrument, provides unparalleled possibilities for the accurate and unbiased estimation of forest AGB at high resolution, particularly in dense and tall forests, where Synthetic Aperture Radar (SAR) and passive optical data exhibit saturation. However, GEDI is a sampling instrument, collecting dispersed footprints, and its data must be combined with that from other continuous cover satellites to create high-resolution maps, using local machine learning methods. In this study, we developed local models to estimate forest AGB from GEDI L2A data, as the models used to create GEDI L4 AGB data incorporated minimal field data from China. We then applied LightGBM and random forest regression to generate wall-to-wall AGB maps at 25 m resolution, using extensive GEDI footprints as well as Sentinel-1 data, ALOS-2 PALSAR-2 and Sentinel-2 optical data. Through a 5-fold cross-validation, LightGBM demonstrated a slightly better performance than Random Forest across two contrasting regions. However, in both regions, the computation speed of LightGBM is substantially faster than that of the random forest model, requiring roughly one-third of the time to compute on the same hardware. Through the validation against field data, the 25 m resolution AGB maps generated using the local models developed in this study exhibited higher accuracy compared to the GEDI L4B AGB data. We found in both regions an increase in error as slope increased. The trained models were tested on nearby but different regions and exhibited good performance.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Latency versus Transmission Power Trade-off in Free-Space Optical (FSO) Satellite Networks with Multiple Inter-Continental Connections
Authors:
Jintao Liang,
Aizaz Chaudhry,
John Chinneck,
Halim Yanikomeroglu,
Gunes Kurt,
Peng Hu,
Khaled Ahmed,
Stephane Martel
Abstract:
In free-space optical satellite networks (FSOSNs), satellites connected via laser inter-satellite links (LISLs), latency is a critical factor, especially for long-distance inter-continental connections. Since satellites depend on solar panels for power supply, power consumption is also a vital factor. We investigate the minimization of total network latency (i.e., the sum of the network latencies…
▽ More
In free-space optical satellite networks (FSOSNs), satellites connected via laser inter-satellite links (LISLs), latency is a critical factor, especially for long-distance inter-continental connections. Since satellites depend on solar panels for power supply, power consumption is also a vital factor. We investigate the minimization of total network latency (i.e., the sum of the network latencies of all inter-continental connections in a time slot) in a realistic model of a FSOSN, the latest version of the Starlink Phase 1 Version 3 constellation. We develop mathematical formulations of the total network latency over different LISL ranges and different satellite transmission power constraints for multiple simultaneous inter-continental connections. We use practical system models for calculating network latency and satellite optical link transmission power, and we formulate the problem as a binary integer linear program. The results reveal that, for satellite transmission power limits set at 0.5 W, 0.3 W, and 0.1 W, the average total network latency for all five inter-continental connections studied in this work levels off at 339 ms, 361 ms, and 542 ms, respectively. Furthermore, the corresponding LISL ranges required to achieve these average total network latency values are 4500 km, 3000 km, and 1731 km, respectively. Different limitations on satellite transmission power exhibit varying effects on average total network latency (over 100 time slots), and they also induce differing changes in the corresponding LISL ranges. In the absence of satellite transmission power constraints, as the LISL range extends from the minimum feasible range of 1575 km to the maximum feasible range of 5016 km, the average total network latency decreases from 589 ms to 311 ms.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Free-Space Optical (FSO) Satellite Networks Performance Analysis: Transmission Power, Latency, and Outage Probability
Authors:
Jintao Liang,
Aizaz U. Chaudhry,
Eylem Erdogan,
Halim Yanikomeroglu,
Gunes Karabulut Kurt,
Peng Hu,
Khaled Ahmed,
Stephane Martel
Abstract:
In free-space optical satellite networks (FSOSNs), satellites can have different laser inter-satellite link (LISL) ranges for connectivity. Greater LISL ranges can reduce network latency of the path but can also result in an increase in transmission power for satellites on the path. Consequently, this tradeoff between satellite transmission power and network latency should be investigated, and in…
▽ More
In free-space optical satellite networks (FSOSNs), satellites can have different laser inter-satellite link (LISL) ranges for connectivity. Greater LISL ranges can reduce network latency of the path but can also result in an increase in transmission power for satellites on the path. Consequently, this tradeoff between satellite transmission power and network latency should be investigated, and in this work we examine it in FSOSNs drawing on the Starlink Phase 1 Version 3 and Kuiper Shell 2 constellations for different LISL ranges and different inter-continental connections. We use appropriate system models for calculating the average satellite transmission power and network latency. The results show that the mean network latency decreases and mean average satellite transmission power increases with an increase in LISL range. For the Toronto--Sydney inter-continental connection in an FSOSN with Starlink's Phase 1 Version 3 constellation, when the LISL range is approximately 2,900 km, the mean network latency and mean average satellite transmission power intersect are approximately 135 ms and 380 mW, respectively. For an FSOSN with the Kuiper Shell 2 constellation in this inter-continental connection, this LISL range is around 3,800 km, and the two parameters are approximately 120 ms and 700 mW, respectively. For the Toronto--Istanbul and Toronto--London inter-continental connections, the LISL ranges at the intersection are different and vary from 2,600 km to 3,400 km. Furthermore, we analyze outage probability performance of optical uplink/downlink due to atmosphere attenuation and turbulence.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition
Authors:
Qijie Shao,
Pengcheng Guo,
Jinghao Yan,
Pengfei Hu,
Lei Xie
Abstract:
Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related a…
▽ More
Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related accent characteristics, while coarse-grained units are better for learning linguistic information. Moreover, an explicit interaction of two tasks can also provide complementary information and improve the performance of each other, but it is rarely used by existing approaches. In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR are first decoupled by separated branches and two-granular modeling units to learn task-specific representations. The AR branch is from our previously proposed linguistic-acoustic bimodal AR model and the ASR branch is an encoder-decoder based Conformer model. Then, for the task interaction, the CTC branch provides aligned text for the AR task, while accent embeddings extracted from our AR model are incorporated into the ASR branch's encoder and decoder. Finally, during ASR inference, a cross-granular rescoring method is introduced to fuse the complementary information from the CTC and attention decoder after the decoupling. Our experiments on English and Chinese datasets demonstrate the effectiveness of the proposed model, which achieves 21.45%/28.53% AR accuracy relative improvement and 32.33%/14.55% ASR error rate relative reduction over a published standard baseline, respectively.
△ Less
Submitted 17 November, 2023; v1 submitted 12 November, 2023;
originally announced November 2023.
-
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Authors:
Haotian Wang,
Yuxuan Xi,
Hang Chen,
Jun Du,
Yan Song,
Qing Wang,
Hengshun Zhou,
Chenxi Wang,
Jiefeng Ma,
Pengfei Hu,
Ya Jiang,
Shi Cheng,
Jie Zhang,
Yuzhe Weng
Abstract:
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for e…
▽ More
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for emotion classification and valence regression in the decoding stage. A multi-task loss based on uncertainty is also designed to optimize the whole process. Finally, by combining three different structures on the posterior probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of multimodal emotion recognition challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard on MER-MULTI sub-challenge.
△ Less
Submitted 10 September, 2023;
originally announced September 2023.
-
The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations
Authors:
Patricia Hu,
Gerhard Widmer
Abstract:
We present the Batik-plays-Mozart Corpus, a piano performance dataset combining professional Mozart piano sonata performances with expert-labelled scores at a note-precise level. The performances originate from a recording by Viennese pianist Roland Batik on a computer-monitored Bösendorfer grand piano, and are available both as MIDI files and audio recordings. They have been precisely aligned, no…
▽ More
We present the Batik-plays-Mozart Corpus, a piano performance dataset combining professional Mozart piano sonata performances with expert-labelled scores at a note-precise level. The performances originate from a recording by Viennese pianist Roland Batik on a computer-monitored Bösendorfer grand piano, and are available both as MIDI files and audio recordings. They have been precisely aligned, note by note, with a current standard edition of the corresponding scores (the New Mozart Edition) in such a way that they can further be connected to the musicological annotations (harmony, cadences, phrases) on these scores that were recently published by Hentschel et al. (2021).
The result is a high-quality, high-precision corpus mapping scores and musical structure annotations to precise note-level professional performance information. As the first of its kind, it can serve as a valuable resource for studying various facets of expressive performance and their relationship with structural aspects. In the paper, we outline the curation process of the alignment and conduct two exploratory experiments to demonstrate its usefulness in analyzing expressive performance.
△ Less
Submitted 6 September, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Single-shot 3D photoacoustic computed tomography with a densely packed array for transcranial functional imaging
Authors:
Rui Cao,
Yilin Luo,
Jinhua Xu,
Xiaofei Luo,
Ku Geng,
Yousuf Aborahama,
Manxiu Cui,
Samuel Davis,
Shuai Na,
Xin Tong,
Cindy Liu,
Karteek Sastry,
Konstantin Maslov,
Peng Hu,
Yide Zhang,
Li Lin,
Yang Zhang,
Lihong V. Wang
Abstract:
Photoacoustic computed tomography (PACT) is emerging as a new technique for functional brain imaging, primarily due to its capabilities in label-free hemodynamic imaging. Despite its potential, the transcranial application of PACT has encountered hurdles, such as acoustic attenuations and distortions by the skull and limited light penetration through the skull. To overcome these challenges, we hav…
▽ More
Photoacoustic computed tomography (PACT) is emerging as a new technique for functional brain imaging, primarily due to its capabilities in label-free hemodynamic imaging. Despite its potential, the transcranial application of PACT has encountered hurdles, such as acoustic attenuations and distortions by the skull and limited light penetration through the skull. To overcome these challenges, we have engineered a PACT system that features a densely packed hemispherical ultrasonic transducer array with 3072 channels, operating at a central frequency of 1 MHz. This system allows for single-shot 3D imaging at a rate equal to the laser repetition rate, such as 20 Hz. We have achieved a single-shot light penetration depth of approximately 9 cm in chicken breast tissue utilizing a 750 nm laser (withstanding 3295-fold light attenuation and still retaining an SNR of 74) and successfully performed transcranial imaging through an ex vivo human skull using a 1064 nm laser. Moreover, we have proven the capacity of our system to perform single-shot 3D PACT imaging in both tissue phantoms and human subjects. These results suggest that our PACT system is poised to unlock potential for real-time, in vivo transcranial functional imaging in humans.
△ Less
Submitted 26 June, 2023;
originally announced June 2023.
-
Non-Asymptotic Performance of Social Machine Learning Under Limited Data
Authors:
Ping Hu,
Virginia Bordignon,
Mert Kayaalp,
Ali H. Sayed
Abstract:
This paper studies the probability of error associated with the social machine learning framework, which involves an independent training phase followed by a cooperative decision-making phase over a graph. This framework addresses the problem of classifying a stream of unlabeled data in a distributed manner. In this work, we examine the classification task with limited observations during the deci…
▽ More
This paper studies the probability of error associated with the social machine learning framework, which involves an independent training phase followed by a cooperative decision-making phase over a graph. This framework addresses the problem of classifying a stream of unlabeled data in a distributed manner. In this work, we examine the classification task with limited observations during the decision-making phase, which requires a non-asymptotic performance analysis. We establish a condition for consistent training and derive an upper bound on the probability of error for classification. The results clarify the dependence on the statistical properties of the data and the combination policy used over the graph. They also establish the exponential decay of the probability of error with respect to the number of unlabeled samples.
△ Less
Submitted 9 July, 2024; v1 submitted 15 June, 2023;
originally announced June 2023.
-
The ACCompanion: Combining Reactivity, Robustness, and Musical Expressivity in an Automatic Piano Accompanist
Authors:
Carlos Cancino-Chacón,
Silvan Peter,
Patricia Hu,
Emmanouil Karystinaios,
Florian Henkel,
Francesco Foscarin,
Nimrod Varga,
Gerhard Widmer
Abstract:
This paper introduces the ACCompanion, an expressive accompaniment system. Similarly to a musician who accompanies a soloist playing a given musical piece, our system can produce a human-like rendition of the accompaniment part that follows the soloist's choices in terms of tempo, dynamics, and articulation. The ACCompanion works in the symbolic domain, i.e., it needs a musical instrument capable…
▽ More
This paper introduces the ACCompanion, an expressive accompaniment system. Similarly to a musician who accompanies a soloist playing a given musical piece, our system can produce a human-like rendition of the accompaniment part that follows the soloist's choices in terms of tempo, dynamics, and articulation. The ACCompanion works in the symbolic domain, i.e., it needs a musical instrument capable of producing and playing MIDI data, with explicitly encoded onset, offset, and pitch for each played note. We describe the components that go into such a system, from real-time score following and prediction to expressive performance generation and online adaptation to the expressive choices of the human player. Based on our experience with repeated live demonstrations in front of various audiences, we offer an analysis of the challenges of combining these components into a system that is highly reactive and precise, while still a reliable musical partner, robust to possible performance errors and responsive to expressive variations.
△ Less
Submitted 30 May, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
HAPS-UAV-Enabled Heterogeneous Networks: A Deep Reinforcement Learning Approach
Authors:
Atefeh H. Arani,
Peng Hu,
Yeying Zhu
Abstract:
The integrated use of non-terrestrial network (NTN) entities such as the high-altitude platform station (HAPS) and low-altitude platform station (LAPS) has become essential elements in the space-air-ground integrated networks (SAGINs). However, the complexity, mobility, and heterogeneity of NTN entities and resources present various challenges from system design to deployment. This paper proposes…
▽ More
The integrated use of non-terrestrial network (NTN) entities such as the high-altitude platform station (HAPS) and low-altitude platform station (LAPS) has become essential elements in the space-air-ground integrated networks (SAGINs). However, the complexity, mobility, and heterogeneity of NTN entities and resources present various challenges from system design to deployment. This paper proposes a novel approach to designing a heterogeneous network consisting of HAPSs and unmanned aerial vehicles (UAVs) being LAPS entities. Our approach involves jointly optimizing the three-dimensional trajectory and channel allocation for aerial base stations, with a focus on ensuring fairness and the provision of quality of service (QoS) to ground users. Furthermore, we consider the load on base stations and incorporate this information into the optimization problem. The proposed approach utilizes a combination of deep reinforcement learning and fixed-point iteration techniques to determine the UAV locations and channel allocation strategies. Simulation results reveal that our proposed deep learning-based approach significantly outperforms learning-based and conventional benchmark models.
△ Less
Submitted 22 March, 2023;
originally announced March 2023.
-
Quantification of cervical elasticity during pregnancy based on transvaginal ultrasound imaging and stress measurement
Authors:
Peng Hu,
Peinan Zhao,
Yuan Qu,
Konstantin Maslov,
Jessica Chubiz,
Methodius G. Tuuli,
Molly J. Stout,
Lihong V. Wang
Abstract:
Objective: Strain elastography and shear wave elastography are two commonly used methods to quantify cervical elasticity; however, they have limitations. Strain elastography is effective in showing tissue elasticity distribution in a single image, but the absence of stress information causes difficulty in comparing the results acquired from different imaging sessions. Shear wave elastography is ef…
▽ More
Objective: Strain elastography and shear wave elastography are two commonly used methods to quantify cervical elasticity; however, they have limitations. Strain elastography is effective in showing tissue elasticity distribution in a single image, but the absence of stress information causes difficulty in comparing the results acquired from different imaging sessions. Shear wave elastography is effective in measuring shear wave speed (an intrinsic tissue property correlated with elasticity) in relatively homogeneous tissue, such as in the liver. However, for inhomogeneous tissue in the cervix, the shear wave speed measurement is less robust. To overcome these limitations, we develop a quantitative cervical elastography system by adding a stress sensor to an ultrasound imaging system.
Methods: In an imaging session for quantitative cervical elastography, we use the transvaginal ultrasound imaging system to record B-mode images of the cervix showing its deformation and use the stress sensor to record the probe-surface stress simultaneously. We develop a correlation-based automatic feature tracking algorithm to quantify the deformation, from which the strain is quantified. After each imaging session, we calibrate the stress sensor and transform its measurement to true stress. Applying a linear regression to the stress and strain, we obtain an approximation of the cervical Young's modulus.
Results: We validate the accuracy and robustness of this elastography system using phantom experiments. Applying this system to pregnant participants, we observe significant softening of the cervix during pregnancy (p-value < 0.001) with the cervical Young's modulus decreasing 3.95% per week. We estimate that geometric mean values of cervical Young's moduli during the first (11 to 13 weeks), second, and third trimesters are 13.07 kPa, 7.59 kPa, and 4.40 kPa, respectively.
△ Less
Submitted 9 March, 2023;
originally announced March 2023.
-
Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
Authors:
Tarasha Khurana,
Peiyun Hu,
David Held,
Deva Ramanan
Abstract:
Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud…
▽ More
Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud forecasting from unannotated LiDAR sequences. We show that this task requires algorithms to implicitly capture (1) sensor extrinsics (i.e., the egomotion of the autonomous vehicle), (2) sensor intrinsics (i.e., the sampling pattern specific to the particular LiDAR sensor), and (3) the shape and motion of other objects in the scene. But autonomous systems should make predictions about the world and not their sensors. To this end, we factor out (1) and (2) by recasting the task as one of spacetime (4D) occupancy forecasting. But because it is expensive to obtain ground-truth 4D occupancy, we render point cloud data from 4D occupancy predictions given sensor extrinsics and intrinsics, allowing one to train and test occupancy algorithms with unannotated LiDAR sequences. This also allows one to evaluate and compare point cloud forecasting algorithms across diverse datasets, sensors, and vehicles.
△ Less
Submitted 30 April, 2023; v1 submitted 25 February, 2023;
originally announced February 2023.
-
Satellite Anomaly Detection Using Variance Based Genetic Ensemble of Neural Networks
Authors:
Mohammad Amin Maleki Sadr,
Yeying Zhu,
Peng Hu
Abstract:
In this paper, we use a variance-based genetic ensemble (VGE) of Neural Networks (NNs) to detect anomalies in the satellite's historical data. We use an efficient ensemble of the predictions from multiple Recurrent Neural Networks (RNNs) by leveraging each model's uncertainty level (variance). For prediction, each RNN is guided by a Genetic Algorithm (GA) which constructs the optimal structure for…
▽ More
In this paper, we use a variance-based genetic ensemble (VGE) of Neural Networks (NNs) to detect anomalies in the satellite's historical data. We use an efficient ensemble of the predictions from multiple Recurrent Neural Networks (RNNs) by leveraging each model's uncertainty level (variance). For prediction, each RNN is guided by a Genetic Algorithm (GA) which constructs the optimal structure for each RNN model. However, finding the model uncertainty level is challenging in many cases. Although the Bayesian NNs (BNNs)-based methods are popular for providing the confidence bound of the models, they cannot be employed in complex NN structures as they are computationally intractable. This paper uses the Monte Carlo (MC) dropout as an approximation version of BNNs. Then these uncertainty levels and each predictive model suggested by GA are used to generate a new model, which is then used for forecasting the TS and AD. Simulation results show that the forecasting and AD capability of the ensemble model outperforms existing approaches.
△ Less
Submitted 10 February, 2023;
originally announced February 2023.
-
Toward Multi-Layer Networking for Satellite Network Operations
Authors:
Peng Hu
Abstract:
Recent advancements in low-Earth-orbit (LEO) satellites aim to bring resilience, ubiquitous, and high-quality service to future Internet infrastructure. However, the soaring number of space assets, increasing dynamics of LEO satellites and expanding dimensions of network threats call for an enhanced approach to efficient satellite operations. To address these pressing challenges, we propose an app…
▽ More
Recent advancements in low-Earth-orbit (LEO) satellites aim to bring resilience, ubiquitous, and high-quality service to future Internet infrastructure. However, the soaring number of space assets, increasing dynamics of LEO satellites and expanding dimensions of network threats call for an enhanced approach to efficient satellite operations. To address these pressing challenges, we propose an approach for satellite network operations based on multi-layer satellite networking (MLSN), called "SatNetOps". Two SatNetOps schemes are proposed, referred to as LEO-LEO MLSN (LLM) and GEO-LEO MLSN (GLM). The performance of the proposed schemes is evaluated in 24-hr satellite scenarios with typical payload setups in simulations, where the key metrics such as latency and reliability are discussed with the consideration of the Consultative Committee for Space Data Systems (CCSDS) standard-compliant telemetry and telecommand missions. Although the SatNetOps approach is promising, we analyze the factors affecting the performance of the LLM and GLM schemes. The discussions on the results and conclusive remarks are made in the end.
△ Less
Submitted 19 November, 2024; v1 submitted 9 January, 2023;
originally announced January 2023.
-
A Cross-Layer Descent Approach for Resilient Network Operations of Proliferated LEO Satellites
Authors:
Peng Hu
Abstract:
With the proliferated low-Earth-orbit (LEO) satellites in mega-constellations, the future Internet will be able to reach any place on Earth, providing high-quality services to everyone. However, high-quality operations in terms of timeliness and resilience are lacking in the current solutions. This paper proposes a multi-layer networking approach called "Cross-Layer Descent (CLD)". Based on the pr…
▽ More
With the proliferated low-Earth-orbit (LEO) satellites in mega-constellations, the future Internet will be able to reach any place on Earth, providing high-quality services to everyone. However, high-quality operations in terms of timeliness and resilience are lacking in the current solutions. This paper proposes a multi-layer networking approach called "Cross-Layer Descent (CLD)". Based on the proposed system model, principles, and measures, CLD can support foundational services such as telecommand (TC) transmissions for various network operation missions for LEO satellites compliant with the Consultative Committee for Space Data Systems (CCSDS) standards. The CLD approach enhances timing and resilience requirements using advanced communication payloads. From the simulation-based analysis, the proposed scheme outperforms other classical ones in resilience and latency for typical TC missions. The future work and conclusive remarks are discussed at the end.
△ Less
Submitted 12 December, 2022;
originally announced December 2022.
-
Relationship Quantification of Image Degradations
Authors:
Wenxin Wang,
Boyun Li,
Yuanbiao Gou,
Peng Hu,
Wangmeng Zuo,
Xi Peng
Abstract:
In this paper, we study two challenging but less-touched problems in image restoration, namely, i) how to quantify the relationship between image degradations and ii) how to improve the performance of a specific restoration task using the quantified relationship. To tackle the first challenge, we proposed a Degradation Relationship Index (DRI) which is defined as the mean drop rate difference in t…
▽ More
In this paper, we study two challenging but less-touched problems in image restoration, namely, i) how to quantify the relationship between image degradations and ii) how to improve the performance of a specific restoration task using the quantified relationship. To tackle the first challenge, we proposed a Degradation Relationship Index (DRI) which is defined as the mean drop rate difference in the validation loss between two models which are respectively trained using the anchor degradation and the mixture of the anchor and the auxiliary degradations. Through quantifying the degradation relationship using DRI, we reveal that i) a positive DRI always predicts performance improvement by using the specific degradation as an auxiliary to train models; ii) the degradation proportion is crucial to the image restoration performance. In other words, the restoration performance is improved only if the anchor and the auxiliary degradations are mixed with an appropriate proportion. Based on the observations, we further propose a simple but effective method (dubbed DPD) to estimate whether the given degradation combinations could improve the performance on the anchor degradation with the assistance of the auxiliary degradation. Extensive experimental results verify the effectiveness of our method in dehazing, denoising, deraining, and desnowing. The code will be released after acceptance.
△ Less
Submitted 5 August, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Enabling Resilient and Real-Time Network Operations in Space: A Novel Multi-Layer Satellite Networking Scheme
Authors:
Peng Hu
Abstract:
Recently advanced low-Earth-orbit (LEO) satellite networks represented by large constellations and advanced payloads provide great promises for enabling high-quality Internet connectivity to any place on Earth. However, the traditional access-based approach to satellite operations cannot meet the pressing requirements of real-time, reliable, and resilient operations for LEO satellites. A new schem…
▽ More
Recently advanced low-Earth-orbit (LEO) satellite networks represented by large constellations and advanced payloads provide great promises for enabling high-quality Internet connectivity to any place on Earth. However, the traditional access-based approach to satellite operations cannot meet the pressing requirements of real-time, reliable, and resilient operations for LEO satellites. A new scheme is proposed based on multi-layer satellite networking considering the advanced Ka-band and optical communications payloads on a satellite platform. The proposed scheme can enable efficient and resilient message transmissions for critical telecommand and telemetry missions through different layers of satellite networks, which consist of LEO, medium-Earth-orbit (MEO), and geostationary (GEO) satellites. The proposed scheme is evaluated in a 24-hr satellite mission and shows superior performance improvements compared to the traditional operations approach.
△ Less
Submitted 7 December, 2022;
originally announced December 2022.
-
AccEar: Accelerometer Acoustic Eavesdropping with Unconstrained Vocabulary
Authors:
Pengfei Hu,
Hui Zhuang,
Panneer Selvam Santhalingamy,
Riccardo Spolaor,
Parth Pathaky,
Guoming Zhang,
Xiuzhen Cheng
Abstract:
With the increasing popularity of voice-based applications, acoustic eavesdropping has become a serious threat to users' privacy. While on smartphones the access to microphones needs an explicit user permission, acoustic eavesdropping attacks can rely on motion sensors (such as accelerometer and gyroscope), which access is unrestricted. However, previous instances of such attacks can only recogniz…
▽ More
With the increasing popularity of voice-based applications, acoustic eavesdropping has become a serious threat to users' privacy. While on smartphones the access to microphones needs an explicit user permission, acoustic eavesdropping attacks can rely on motion sensors (such as accelerometer and gyroscope), which access is unrestricted. However, previous instances of such attacks can only recognize a limited set of pre-trained words or phrases. In this paper, we present AccEar, an accelerometerbased acoustic eavesdropping attack that can reconstruct any audio played on the smartphone's loudspeaker with unconstrained vocabulary. We show that an attacker can employ a conditional Generative Adversarial Network (cGAN) to reconstruct highfidelity audio from low-frequency accelerometer signals. The presented cGAN model learns to recreate high-frequency components of the user's voice from low-frequency accelerometer signals through spectrogram enhancement. We assess the feasibility and effectiveness of AccEar attack in a thorough set of experiments using audio from 16 public personalities. As shown by the results in both objective and subjective evaluations, AccEar successfully reconstructs user speeches from accelerometer signals in different scenarios including varying sampling rate, audio volume, device model, etc.
△ Less
Submitted 2 December, 2022;
originally announced December 2022.
-
An Anomaly Detection Method for Satellites Using Monte Carlo Dropout
Authors:
Mohammad Amin Maleki Sadr,
Yeying Zhu,
Peng Hu
Abstract:
Recently, there has been a significant amount of interest in satellite telemetry anomaly detection (AD) using neural networks (NN). For AD purposes, the current approaches focus on either forecasting or reconstruction of the time series, and they cannot measure the level of reliability or the probability of correct detection. Although the Bayesian neural network (BNN)-based approaches are well kno…
▽ More
Recently, there has been a significant amount of interest in satellite telemetry anomaly detection (AD) using neural networks (NN). For AD purposes, the current approaches focus on either forecasting or reconstruction of the time series, and they cannot measure the level of reliability or the probability of correct detection. Although the Bayesian neural network (BNN)-based approaches are well known for time series uncertainty estimation, they are computationally intractable. In this paper, we present a tractable approximation for BNN based on the Monte Carlo (MC) dropout method for capturing the uncertainty in the satellite telemetry time series, without sacrificing accuracy. For time series forecasting, we employ an NN, which consists of several Long Short-Term Memory (LSTM) layers followed by various dense layers. We employ the MC dropout inside each LSTM layer and before the dense layers for uncertainty estimation. With the proposed uncertainty region and by utilizing a post-processing filter, we can effectively capture the anomaly points. Numerical results show that our proposed time series AD approach outperforms the existing methods from both prediction accuracy and AD perspectives.
△ Less
Submitted 27 November, 2022;
originally announced November 2022.
-
UAV-Assisted Space-Air-Ground Integrated Networks: A Technical Review of Recent Learning Algorithms
Authors:
Atefeh H. Arani,
Peng Hu,
Yeying Zhu
Abstract:
Recent technological advancements in space, air, and ground components have made possible a new network paradigm called space-air-ground integrated network (SAGIN). Unmanned aerial vehicles (UAVs) play a key role in SAGINs. However, due to UAVs' high dynamics and complexity, real-world deployment of a SAGIN becomes a significant barrier to realizing such SAGINs. UAVs are expected to meet key perfo…
▽ More
Recent technological advancements in space, air, and ground components have made possible a new network paradigm called space-air-ground integrated network (SAGIN). Unmanned aerial vehicles (UAVs) play a key role in SAGINs. However, due to UAVs' high dynamics and complexity, real-world deployment of a SAGIN becomes a significant barrier to realizing such SAGINs. UAVs are expected to meet key performance requirements with limited maneuverability and resources with space and terrestrial components. Therefore, employing UAVs in various usage scenarios requires well-designed planning in algorithmic approaches. This paper provides an essential review and analysis of recent learning algorithms in a UAV-assisted SAGIN. We consider possible reward functions and discuss the state-of-the-art algorithms for optimizing the reward functions, including Q-learning, deep Q-learning, multi-armed bandit, particle swarm optimization, and satisfaction-based learning algorithms. Unlike other survey papers, we focus on the methodological perspective of the optimization problem, applicable to various missions on a SAGIN. We consider real-world configurations and the 2-dimensional (2D) and 3-dimensional (3D) UAV trajectories to reflect deployment cases. Our simulations suggest the 3D satisfaction-based learning algorithm outperforms other approaches in most cases. With open challenges discussed at the end, we aim to provide design and deployment guidelines for UAV-assisted SAGINs.
△ Less
Submitted 16 July, 2024; v1 submitted 27 November, 2022;
originally announced November 2022.
-
Learning Audio-Visual embedding for Person Verification in the Wild
Authors:
Peiwen Sun,
Shanshan Zhang,
Zishan Liu,
Yougen Yuan,
Taotao Zhang,
Honggang Zhang,
Pengfei Hu
Abstract:
It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. Here, we proposed a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduced weight-enhanced attentive statistics pooling for the first time in face verification. We find that a strong correlation exists between modalities during…
▽ More
It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. Here, we proposed a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduced weight-enhanced attentive statistics pooling for the first time in face verification. We find that a strong correlation exists between modalities during pooling, so joint attentive pooling is proposed which contains cycle consistency to learn the implicit inter-frame weight. Finally, each modality is fused with a gated attention mechanism to gain robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset and the best system obtains 0.18%, 0.27%, and 0.49% EER on three official trial lists of VoxCeleb1 respectively, which is to our knowledge the best-published results for person verification.
△ Less
Submitted 26 October, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
High Speed Rotation Estimation with Dynamic Vision Sensors
Authors:
Guangrong Zhao,
Yiran Shen,
Ning Chen,
Pengfei Hu,
Lei Liu,
Hongkai Wen
Abstract:
Rotational speed is one of the important metrics to be measured for calibrating the electric motors in manufacturing, monitoring engine during car repairing, faults detection on electrical appliance and etc. However, existing measurement techniques either require prohibitive hardware (e.g., high-speed camera) or are inconvenient to use in real-world application scenarios. In this paper, we propose…
▽ More
Rotational speed is one of the important metrics to be measured for calibrating the electric motors in manufacturing, monitoring engine during car repairing, faults detection on electrical appliance and etc. However, existing measurement techniques either require prohibitive hardware (e.g., high-speed camera) or are inconvenient to use in real-world application scenarios. In this paper, we propose, EV-Tach, an event-based tachometer via efficient dynamic vision sensing on mobile devices. EV-Tach is designed as a high-fidelity and convenient tachometer by introducing dynamic vision sensor as a new sensing modality to capture the high-speed rotation precisely under various real-world scenarios. By designing a series of signal processing algorithms bespoke for dynamic vision sensing on mobile devices, EV-Tach is able to extract the rotational speed accurately from the event stream produced by dynamic vision sensing on rotary targets. According to our extensive evaluations, the Relative Mean Absolute Error (RMAE) of EV-Tach is as low as 0.03% which is comparable to the state-of-the-art laser tachometer under fixed measurement mode. Moreover, EV-Tach is robust to subtle movement of user's hand, therefore, can be used as a handheld device, where the laser tachometer fails to produce reasonable results.
△ Less
Submitted 6 September, 2022;
originally announced September 2022.
-
Transcranial photoacoustic computed tomography of human brain function
Authors:
Yang Zhang,
Shuai Na,
Karteekeya Sastry,
Jonathan J. Russin,
Peng Hu,
Li Lin,
Xin Tong,
Kay B. Jann,
Danny J. Wang,
Charles Y. Liu,
Lihong V. Wang
Abstract:
Herein we report the first in-human transcranial imaging of brain function using photoacoustic computed tomography. Functional responses to benchmark motor tasks were imaged on both the skull-less and the skull-intact hemispheres of a hemicraniectomy patient. The observed brain responses in these preliminary results demonstrate the potential of photoacoustic computed tomography for achieving trans…
▽ More
Herein we report the first in-human transcranial imaging of brain function using photoacoustic computed tomography. Functional responses to benchmark motor tasks were imaged on both the skull-less and the skull-intact hemispheres of a hemicraniectomy patient. The observed brain responses in these preliminary results demonstrate the potential of photoacoustic computed tomography for achieving transcranial functional imaging.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.
-
A CNN with Noise Inclined Module and Denoise Framework for Hyperspectral Image Classification
Authors:
Zhiqiang Gong,
Ping Zhong,
Jiahao Qi,
Panhe Hu
Abstract:
Deep Neural Networks have been successfully applied in hyperspectral image classification. However, most of prior works adopt general deep architectures while ignore the intrinsic structure of the hyperspectral image, such as the physical noise generation. This would make these deep models unable to generate discriminative features and provide impressive classification performance. To leverage suc…
▽ More
Deep Neural Networks have been successfully applied in hyperspectral image classification. However, most of prior works adopt general deep architectures while ignore the intrinsic structure of the hyperspectral image, such as the physical noise generation. This would make these deep models unable to generate discriminative features and provide impressive classification performance. To leverage such intrinsic information, this work develops a novel deep learning framework with the noise inclined module and denoise framework for hyperspectral image classification. First, we model the spectral signature of hyperspectral image with the physical noise model to describe the high intraclass variance of each class and great overlapping between different classes in the image. Then, a noise inclined module is developed to capture the physical noise within each object and a denoise framework is then followed to remove such noise from the object. Finally, the CNN with noise inclined module and the denoise framework is developed to obtain discriminative features and provides good classification performance of hyperspectral image. Experiments are conducted over two commonly used real-world datasets and the experimental results show the effectiveness of the proposed method. The implementation of the proposed method and other compared methods could be accessed at https://github.com/shendu-sw/noise-physical-framework.
△ Less
Submitted 24 May, 2022;
originally announced May 2022.
-
Segmentation Network with Compound Loss Function for Hydatidiform Mole Hydrops Lesion Recognition
Authors:
Chengze Zhu,
Pingge Hu,
Xianxu Zeng,
Xingtong Wang,
Zehua Ji,
Li Shi
Abstract:
Pathological morphology diagnosis is the standard diagnosis method of hydatidiform mole. As a disease with malignant potential, the hydatidiform mole section of hydrops lesions is an important basis for diagnosis. Due to incomplete lesion development, early hydatidiform mole is difficult to distinguish, resulting in a low accuracy of clinical diagnosis. As a remarkable machine learning technology,…
▽ More
Pathological morphology diagnosis is the standard diagnosis method of hydatidiform mole. As a disease with malignant potential, the hydatidiform mole section of hydrops lesions is an important basis for diagnosis. Due to incomplete lesion development, early hydatidiform mole is difficult to distinguish, resulting in a low accuracy of clinical diagnosis. As a remarkable machine learning technology, image semantic segmentation networks have been used in many medical image recognition tasks. We developed a hydatidiform mole hydrops lesion segmentation model based on a novel loss function and training method. The model consists of different networks that segment the section image at the pixel and lesion levels. Our compound loss function assign weights to the segmentation results of the two levels to calculate the loss. We then propose a stagewise training method to combine the advantages of various loss functions at different levels. We evaluate our method on a hydatidiform mole hydrops dataset. Experiments show that the proposed model with our loss function and training method has good recognition performance under different segmentation metrics.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
A Semantic Segmentation Network Based Real-Time Computer-Aided Diagnosis System for Hydatidiform Mole Hydrops Lesion Recognition in Microscopic View
Authors:
Chengze Zhu,
Pingge Hu,
Xianxu Zeng,
Xingtong Wang,
Zehua Ji,
Li Shi
Abstract:
As a disease with malignant potential, hydatidiform mole (HM) is one of the most common gestational trophoblastic diseases. For pathologists, the HM section of hydrops lesions is an important basis for diagnosis. In pathology departments, the diverse microscopic manifestations of HM lesions and the limited view under the microscope mean that physicians with extensive diagnostic experience are requ…
▽ More
As a disease with malignant potential, hydatidiform mole (HM) is one of the most common gestational trophoblastic diseases. For pathologists, the HM section of hydrops lesions is an important basis for diagnosis. In pathology departments, the diverse microscopic manifestations of HM lesions and the limited view under the microscope mean that physicians with extensive diagnostic experience are required to prevent missed diagnosis and misdiagnosis. Feature extraction can significantly improve the accuracy and speed of the diagnostic process. As a remarkable diagnosis assisting technology, computer-aided diagnosis (CAD) has been widely used in clinical practice. We constructed a deep-learning-based CAD system to identify HM hydrops lesions in the microscopic view in real-time. The system consists of three modules; the image mosaic module and edge extension module process the image to improve the outcome of the hydrops lesion recognition module, which adopts a semantic segmentation network, our novel compound loss function, and a stepwise training function in order to achieve the best performance in identifying hydrops lesions. We evaluated our system using an HM hydrops dataset. Experiments show that our system is able to respond in real-time and correctly display the entire microscopic view with accurately labeled HM hydrops lesions.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition
Authors:
Qijie Shao,
Jinghao Yan,
Jian Kang,
Pengcheng Guo,
Xian Shi,
Pengfei Hu,
Lei Xie
Abstract:
General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfit on speakers or channels. Considering accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of native utterance as an anchor, estimating the acce…
▽ More
General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfit on speakers or channels. Considering accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of native utterance as an anchor, estimating the accent shift is difficult. In this paper, we propose linguistic-acoustic similarity based accent shift (LASAS) for AR tasks. For an accent speech utterance, after mapping the corresponding text vector to multiple accent-associated spaces as anchors, its accent shift could be estimated by the similarities between the acoustic embedding and those anchors. Then, we concatenate the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation. Compared with pure acoustic embedding, the bimodal representation is richer and more clear by taking full advantage of both linguistic and acoustic information, which can effectively improve AR performance. Experiments on Accented English Speech Recognition Challenge (AESRC) dataset show that our method achieves 77.42% accuracy on Test set, obtaining a 6.94% relative improvement over a competitive system in the challenge.
△ Less
Submitted 1 July, 2022; v1 submitted 7 April, 2022;
originally announced April 2022.
-
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Authors:
Guodong Ma,
Pengfei Hu,
Jian Kang,
Shen Huang,
Hao Huang
Abstract:
In Uyghur speech, consonant and vowel reduction are often encountered, especially in spontaneous speech with high speech rate, which will cause a degradation of speech recognition performance. To solve this problem, we propose an effective phone mask training method for Conformer-based Uyghur end-to-end (E2E) speech recognition. The idea is to randomly mask off a certain percentage features of pho…
▽ More
In Uyghur speech, consonant and vowel reduction are often encountered, especially in spontaneous speech with high speech rate, which will cause a degradation of speech recognition performance. To solve this problem, we propose an effective phone mask training method for Conformer-based Uyghur end-to-end (E2E) speech recognition. The idea is to randomly mask off a certain percentage features of phones during model training, which simulates the above verbal phenomena and facilitates E2E model to learn more contextual information. According to experiments, the above issues can be greatly alleviated. In addition, deep investigations are carried out into different units in masking, which shows the effectiveness of our proposed masking unit. We also further study the masking method and optimize filling strategy of phone mask. Finally, compared with Conformer-based E2E baseline without mask training, our model demonstrates about 5.51% relative Word Error Rate (WER) reduction on reading speech and 12.92% on spontaneous speech, respectively. The above approach has also been verified on test-set of open-source data THUYG-20, which shows 20% relative improvements.
△ Less
Submitted 2 April, 2022;
originally announced April 2022.
-
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
Authors:
Yang Zhang,
Zhiqiang Lv,
Haibin Wu,
Shanshan Zhang,
Pengfei Hu,
Zhiyong Wu,
Hung-yi Lee,
Helen Meng
Abstract:
In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent stateof-the-art models in speech recognition and speaker verification. Firstly, we introduce a convolution s…
▽ More
In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent stateof-the-art models in speech recognition and speaker verification. Firstly, we introduce a convolution subsampling layer to decrease the computational cost of the model. Secondly, we adopt Conformer blocks which combine Transformers and convolution neural networks (CNNs) to capture global and local features effectively. Finally, the output feature maps from all Conformer blocks are concatenated to aggregate multi-scale representations before final pooling. We evaluate the MFA-Conformer on the widely used benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on VoxCeleb1-O, SITW.Dev, and SITW.Eval set, respectively. MFA-Conformer significantly outperforms the popular ECAPA-TDNN systems in both recognition performance and inference speed. Last but not the least, the ablation studies clearly demonstrate that the combination of global and local feature learning can lead to robust and accurate speaker embedding extraction. We have also released the code for future comparison.
△ Less
Submitted 10 November, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.