K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Authors: Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Abstract: Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes, absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2504.09441 [pdf, other]

Structure-Accurate Medical Image Translation via Dynamic Frequency Balance and Knowledge Guidance

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao

Abstract: Multimodal medical images play a crucial role in precise and comprehensive clinical diagnosis. Diffusion models are a powerful strategy for synthesizing the required medical images. However, existing approaches still suffer from anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using a wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency, enhancing global low-frequency features and effective high-frequency details while suppressing high-frequency noise. To further overcome the challenges posed by the large differences between medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

Submitted 27 May, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

Comments: Medical image translation, Diffusion model, 16 pages

arXiv:2502.02603 [pdf, other]

SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

Authors: Chunyu Sun, Bingyu Liu, Zhichao Cui, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu

Abstract: Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLM retrieval systems.

Submitted 26 January, 2025; originally announced February 2025.

arXiv:2501.16780 [pdf, ps, other]

doi 10.1109/THMS.2025.3585165

AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals

Authors: Dongliang Zhou, Yakun Zhang, Jinghan Wu, Xingyu Zhang, Liang Xie, Erwei Yin

Abstract: The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these, we introduce AVE Speech, a comprehensive multi-modal dataset for speech recognition tasks. The dataset includes a 100-sentence Mandarin corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG) data, collected from 100 participants. Each subject read the entire corpus ten times, with each sentence averaging approximately two seconds in duration, resulting in over 55 hours of multi-modal speech data per modality. Experiments demonstrate that combining these modalities significantly improves recognition performance, particularly in cross-subject and high-noise environments. To our knowledge, this is the first publicly available sentence-level dataset integrating these three modalities for large-scale Mandarin speech recognition. We expect this dataset to drive advancements in both acoustic and non-acoustic speech recognition research, enhancing cross-modal learning and human-machine interaction.

Submitted 5 July, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

Comments: The paper has been accepted by IEEE Transactions on Human-Machine Systems

arXiv:2412.19497 [pdf, other]

Multi-Condition Fault Diagnosis of Dynamic Systems: A Survey, Insights, and Prospects

Authors: Pengyu Han, Zeyi Liu, Xiao He, Steven X. Ding, Donghua Zhou

Abstract: With the increasing complexity of industrial production systems, accurate fault diagnosis is essential to ensure safe and efficient system operation. However, due to changes in production demands, dynamic process adjustments, and complex external environmental disturbances, multiple operating conditions frequently arise during production. The multi-condition characteristics pose significant challenges to traditional fault diagnosis methods. In this context, multi-condition fault diagnosis has gradually become a key area of research, attracting extensive attention from both academia and industry. This paper aims to provide a systematic and comprehensive review of existing research in the field. Firstly, the mathematical definition of the problem is presented, followed by an overview of the current research status. Subsequently, the existing literature is reviewed and categorized from the perspectives of single-model and multi-model approaches. In addition, standard evaluation metrics and typical real-world application scenarios are summarized and analyzed. Finally, the key challenges and prospects in the field are thoroughly discussed.

Submitted 27 December, 2024; originally announced December 2024.

Comments: 17 pages, 14 figures

arXiv:2412.15622 [pdf, other]

TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch

Authors: Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng

Abstract: Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can only be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we first propose the elastic mixture of the expert (eMoE) model. This model can be trained just once and then be elastically scaled in accordance with deployment requirements. Secondly, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO testsets from 4.98% to 2.45%. Thirdly, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.

Submitted 20 December, 2024; originally announced December 2024.

Comments: Technical Report

arXiv:2412.08237 [pdf, other]

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

Authors: Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu

Abstract: It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.

Submitted 12 December, 2024; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: Technical Report

arXiv:2412.07590 [pdf, other]

Motion Artifact Removal in Pixel-Frequency Domain via Alternate Masks and Diffusion Model

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Jianfeng Guo, Feng Yang, Zaiyi Liu, Nannan Wang, Xinbo Gao

Abstract: Motion artifacts present in magnetic resonance imaging (MRI) can seriously interfere with clinical diagnosis. Removing motion artifacts is a straightforward solution and has been extensively studied. However, paired data are still heavily relied on in recent works and the perturbations in k-space (frequency domain) are not well considered, which limits their applications in the clinical field. To address these issues, we propose a novel unsupervised purification method which leverages pixel-frequency information of noisy MRI images to guide a pre-trained diffusion model to recover clean MRI images. Specifically, considering that motion artifacts are mainly concentrated in high-frequency components in k-space, we utilize the low-frequency components as the guide to ensure correct tissue textures. Additionally, given that high-frequency and pixel information are helpful for recovering shape and detail textures, we design alternate complementary masks to simultaneously destroy the artifact structure and exploit useful information. Quantitative experiments are performed on datasets from different tissues and show that our method achieves superior performance on several metrics. Qualitative evaluations with radiologists also show that our method provides better clinical feedback. Our code is available at https://github.com/medcx/PFAD.

Submitted 11 December, 2024; v1 submitted 10 December, 2024; originally announced December 2024.

Comments: 12 pages, 8 figures, AAAI 2025

arXiv:2411.10775 [pdf, other]

Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion

Authors: Kepeng Xu, Li Xu, Gang He, Zhiqiang Zhang, Wenxin Yu, Shihao Wang, Dajiang Zhou, Yunsong Li

Abstract: The rise of HDR-WCG display devices has highlighted the need to convert SDRTV to HDRTV, as most video sources are still in SDR. Existing methods primarily focus on designing neural networks to learn a single-style mapping from SDRTV to HDRTV. However, the limited information in SDRTV and the diversity of styles in real-world conversions render this process an ill-posed problem, thereby constraining the performance and generalization of these methods. Inspired by generative approaches, we propose a novel method for SDRTV to HDRTV conversion guided by real HDRTV priors. Despite the limited information in SDRTV, introducing real HDRTV as reference priors significantly constrains the solution space of the originally high-dimensional ill-posed problem. This shift transforms the task from solving an unreferenced prediction problem to making a referenced selection, thereby markedly enhancing the accuracy and reliability of the conversion process. Specifically, our approach comprises two stages: the first stage employs a Vector Quantized Generative Adversarial Network to capture HDRTV priors, while the second stage matches these priors to the input SDRTV content to recover realistic HDRTV outputs. We evaluate our method on public datasets, demonstrating its effectiveness with significant improvements in both objective and subjective metrics across real and synthetic datasets.

Submitted 16 November, 2024; originally announced November 2024.

Comments: 8 pages, 4 figures

arXiv:2404.16407 [pdf, other]

U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF

Authors: Xingchen Song, Di Wu, Binbin Zhang, Dinghao Zhou, Zhendong Peng, Bo Dang, Fuping Pan, Chao Yang

Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) models have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models have complex designs such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that delicate designs are not necessary, while an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large-scale inner-source dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B level Word Error Rate (WER) while maintaining a Dense-225M level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve the streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate the research on scaling speech foundation models without sacrificing deployment efficiency.

Submitted 8 August, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

ACM Class: I.2.7

arXiv:2404.11313 [pdf, other]

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo, et al. (43 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions were submitted and evaluated on the collected dataset KVQ from a popular short-form video platform, i.e., the Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2403.12521 [pdf]

Multi-mode Fault Diagnosis Datasets of Gearbox Under Variable Working Conditions

Authors: Shijin Chen, Zeyi Liu, Xiao He, Dongliang Zou, Donghua Zhou

Abstract: The gearbox is a critical component of electromechanical systems. The occurrence of multiple faults can significantly impact system accuracy and service life. The vibration signal of the gearbox is an effective indicator of its operational status and fault information. However, gearboxes in real industrial settings often operate under variable working conditions, such as varying speeds and loads. It is a significant and challenging research area to complete the gearbox fault diagnosis procedure under varying operating conditions using vibration signals. This data article presents vibration datasets collected from a gearbox exhibiting various fault degrees of severity and fault types, operating under diverse speed and load conditions. These faults are manually implanted into the gears or bearings through precise machining processes, which include health, missing teeth, wear, pitting, root cracks, and broken teeth. Several kinds of actual compound faults are also encompassed. The development of these datasets facilitates testing the effectiveness and reliability of newly developed fault diagnosis methods.

Submitted 8 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 10 pages, 12 figures

arXiv:2312.09621 [pdf, other]

Inter-domain Resource Collaboration in Satellite Networks: An Intelligent Scheduling Approach Towards Hybrid Missions

Authors: Chenxi Bao, Di Zhou, Min Sheng, Yan Shi, Jiandong Li

Abstract: Since the next-generation satellite network, consisting of various service function domains such as communication, observation, and navigation, is moving towards large scale, it is difficult for single-domain resources to provide satisfactory and timely service guarantees for the rapidly increasing mission demands of each domain. Breaking the barriers of resource independence in each domain and realizing the cross-domain transmission of missions to efficiently collaborate inter-domain resources is a promising solution. However, the hybrid scheduling of different missions and the continuous increase in the number of service domains have strengthened the differences and dynamics of mission demands, making efficient cross-domain mission scheduling (CMS) challenging. To this end, this paper first accurately characterizes the inter-satellite communication resource state in real time by exploiting a sparse resource representation scheme, and systematically characterizes the differentiation of mission demands by constructing a mission priority model. Based on the information of resources and missions, we construct the top- and bottom-layer mission scheduling models of reward association, exploiting the correlation of intra- and inter-domain mission scheduling, and formulate the Markov decision process-based hierarchical CMS problem. Further, to achieve higher adaptability and autonomy of CMS and efficiently mitigate the impact of network scale, a hierarchical intelligent CMS algorithm is developed to dynamically adjust and efficiently match the CMS policy according to different mission demands. Simulation results demonstrate that the proposed algorithm has significant performance gain compared with independent domains and the existing CMS algorithms, and can still guarantee high service performance under different network scales.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2310.01633 [pdf, other]

Distributionally Robust Path Integral Control

Authors: Hyuk Park, Duo Zhou, Grani A. Hanasusanto, Takashi Tanaka

Abstract: We consider a continuous-time continuous-space stochastic optimal control problem, where the controller lacks exact knowledge of the underlying diffusion process, relying instead on a finite set of historical disturbance trajectories. In situations where data collection is limited, the controller synthesized from empirical data may exhibit poor performance. To address this issue, we introduce a novel approach named Distributionally Robust Path Integral (DRPI). The proposed method employs distributionally robust optimization (DRO) to robustify the resulting policy against the unknown diffusion process. Notably, the DRPI scheme shows similarities with risk-sensitive control, which enables us to utilize the path integral control (PIC) framework as an efficient solution scheme. We derive theoretical performance guarantees for the DRPI scheme, which closely aligns with selecting a risk parameter in risk-sensitive control. We validate the efficacy of our scheme and showcase its superiority when compared to risk-neutral PIC policies in the absence of the true diffusion process.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2309.09776 [pdf, other]

MAD: Meta Adversarial Defense Benchmark

Authors: X. Peng, D. Zhou, G. Sun, J. Shi, L. Wu

Abstract: Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a novel benchmark called meta adversarial defense (MAD). The MAD benchmark consists of two MAD datasets, along with a MAD evaluation protocol. The two large-scale MAD datasets were generated through experiments using 30 kinds of attacks on MNIST and CIFAR-10 datasets. In addition, we introduce a meta-learning based adversarial training (Meta-AT) algorithm as the baseline, which features high robustness to unseen adversarial attacks through few-shot learning. Experimental results demonstrate the effectiveness of our Meta-AT algorithm compared to the state-of-the-art methods. Furthermore, the model after Meta-AT maintains a relatively high clean-samples classification accuracy (CCA). It is worth noting that Meta-AT addresses all three aforementioned limitations, leading to substantial improvements. This benchmark ultimately achieved breakthroughs in investigating the transferability of adversarial defense methods to new attacks and the ability to learn from a limited number of adversarial examples. Our codes and attacked datasets address will be available at https://github.com/PXX1110/Meta_AT.

Submitted 18 September, 2023; originally announced September 2023.

Comments: 12 pages, 11 figures, IEEE Transactions on Neural Networks and Learning Systems

arXiv:2308.10420 [pdf, other]

doi 10.1109/TVT.2023.3305330

Reconfigurable Intelligent Surface Enabled Joint Backscattering and Communication

Authors: Jinqiu Zhao, Jia Ye, Shuaishuai Guo, Zhiquan Bai, Di Zhou, Abeer Mohamed

Abstract: Reconfigurable intelligent surface (RIS) as an essential topic in the sixth-generation (6G) communications aims to enhance communication performance or mitigate undesired transmission. However, the controllability of each reflecting element on RIS also enables it to act as a passive backscatter device (BD) and transmit its information to reader devices. In this paper, we propose a RIS-enabled joint backscattering and communication (JBAC) system, where the backscatter communication coexists with the primary communication and occupies no extra spectrum. Specifically, the RIS modifies its reflecting pattern to act as a passive BD and reflect its own information back to the base station (BS) in the backscatter communication, while helping the primary communication from the BS to the users simultaneously. We further present an iterative active beamforming and reflecting pattern design to maximize the user average transmission rate of the primary communication and the goodput of the backscatter communication by solving the formulated multi-objective optimization problem (MOOP). Numerical results fully uncover the impacts of the number of reflecting elements and the reflecting patterns on the system performance, and demonstrate the effectiveness of the proposed scheme. Important practical implementation remarks have also been discussed.

Submitted 20 August, 2023; originally announced August 2023.

Comments: 11 pages, 8 figures, published to IEEE TVT

Journal ref: IEEE Transactions on Vehicular Technology, 2023

arXiv:2307.14132 [pdf, other]

CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

Authors: Tian-Hao Zhang, Dinghao Zhou, Guiping Zhong, Jiaming Zhou, Baoxiang Li

Abstract: RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, thus bringing a computational reduction and allowing the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, Unified Gating and Bilinear Pooling joint network, and auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.

Submitted 26 November, 2024; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted by ICASSP 2024

arXiv:2307.01525 [pdf, other]

OTFS-based Robust MMSE Precoding Design in Over-the-air Computation

Authors: Dongkai Zhou, Jing Guo, Siqiang Wang, Zhong Zheng, Zesong Fei, Weijie Yuan, Xinyi Wang

Abstract: Over-the-air computation (AirComp), as a data aggregation method that can improve network efficiency by exploiting the superposition characteristics of wireless channels, has received much attention recently. Meanwhile, the orthogonal time frequency space (OTFS) modulation can provide a strong Doppler resilience and facilitate reliable transmission for high-mobility communications. Hence, in this work, we investigate an OTFS-based AirComp system in the presence of time-frequency dual-selective channels. In particular, we commence from the development of a novel transmission framework for the considered system, where the pilot signal is sent together with data, and the channel estimation is implemented according to the echo from the access point to the sensor, thereby reducing the overhead of channel state information (CSI) feedback. Hereafter, based on the CSI estimated from the previous frame, a robust precoding matrix aiming at minimizing mean square error in the current frame is designed, which takes into account the estimation error from the receiver noise and the outdated CSI. The simulation results demonstrate the effectiveness of the proposed robust precoding scheme by comparing it with the non-robust precoding. The performance gain is more obvious in a high signal-to-noise ratio in case of large channel estimation errors.

Submitted 26 March, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

arXiv:2305.13616 [pdf]

An Entire Renal Anatomy Extraction Network for Advanced CAD During Partial Nephrectomy

Authors: Nan Ma, Ying Yang, Dongkai Zhou

Abstract: Partial nephrectomy (PN) is common surgery in urology. Digitization of renal anatomies brings much help to many computer-aided diagnosis (CAD) techniques during PN. However, the manual delineation of kidney vascular system and tumor on each slice is time consuming, error-prone, and inconsistent. Therefore, we proposed an entire renal anatomies extraction method from Computed Tomographic Angiographic (CTA) images fully based on deep learning. We adopted a coarse-to-fine workflow to extract target tissues: first, we roughly located the kidney region, and then cropped the kidney region for more detail extraction. The network we used in our workflow is based on 3D U-Net. To dealing with the imbalance of class contributions to loss, we combined the dice loss with focal loss, and added an extra weight to prevent excessive attention. We also improved the manual annotations of vessels by merging semi-trained model's prediction and original annotations under supervision. We performed several experiments to find the best-fitting combination of variables for training. We trained and evaluated the models on our 60 cases dataset with 3 different sources. The average dice score coefficient (DSC) of kidney, tumor, cyst, artery, and vein, were 90.9%, 90.0%, 89.2%, 80.1% and 82.2% respectively. Our modulate weight and hybrid strategy of loss function increased the average DSC of all tissues about 8-20%. Our optimization of vessel annotation improved the average DSC about 1-5%. We proved the efficiency of our network on renal anatomies segmentation. The high accuracy and fully automation make it possible to quickly digitize the personal renal anatomies, which greatly increases the feasibility and practicability of CAD application on urology surgery.

Submitted 22 May, 2023; originally announced May 2023.

arXiv:2303.04644 [pdf, other]

Robust Trajectory and Offloading for Energy-Efficient UAV Edge Computing in Industrial Internet of Things

Authors: Xiao Tang, Hongrui Zhang, Ruonan Zhang, Deyun Zhou, Yan Zhang, Zhu Han

Abstract: Efficient data processing and computation are essential for the industrial Internet of things (IIoT) to empower various applications, which yet can be significantly bottlenecked by the limited energy capacity and computation capability of the IIoT nodes. In this paper, we employ an unmanned aerial vehicle (UAV) as an edge server to assist IIoT data processing, while considering the practical issue of UAV jittering. Specifically, we propose a joint design on trajectory and offloading strategies to minimize energy consumption due to local and edge computation, as well as data transmission. We particularly address the UAV jittering that induces Gaussian-distributed uncertainties associated with flying waypoints, resulting in probabilistic-form flying speed and data offloading constraints. We exploit the Bernstein-type inequality to reformulate the constraints in deterministic forms and decompose the energy minimization to solve for trajectory and offloading separately within an alternating optimization framework. The subproblems are then tackled with the successive convex approximation technique. Simulation results show that our proposal strictly guarantees robustness under uncertainties and effectively reduces energy consumption as compared with the baselines.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 11 pages, 12 figures; accepted at IEEE TII

arXiv:2212.04248 [pdf, other]

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Authors: Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang

Abstract: In this paper, we introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and disentangled non-lip facial representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through comprehensive evaluations on public benchmarks, we conclude that (1) our diffusion prior outperforms auto-regressive prior significantly on almost all the concerned metrics; (2) our overall system is competitive with prior works in terms of audio-lip synchronization but can effectively sample rich and natural-looking lip-irrelevant facial motions while still semantically harmonized with the audio input.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 16 pages

arXiv:2211.17106 [pdf, other]

Diffusion Probabilistic Model Made Slim

Authors: Xingyi Yang, Daquan Zhou, Jiashi Feng, Xinchao Wang

Abstract: Despite the recent visually-pleasing results achieved, the massive computational cost has been a long-standing flaw for diffusion probabilistic models (DPMs), which, in turn, greatly limits their applications on resource-limited platforms. Prior methods towards efficient DPM, however, have largely focused on accelerating the testing yet overlooked their huge complexity and sizes. In this paper, we make a dedicated attempt to lighten DPM while striving to preserve its favourable performance. We start by training a small-sized latent diffusion model (LDM) from scratch, but observe a significant fidelity drop in the synthetic images. Through a thorough assessment, we find that DPM is intrinsically biased against high-frequency generation, and learns to recover different frequency components at different time-steps. These properties make compact networks unable to represent frequency dynamics with accurate high-frequency estimation. Towards this end, we introduce a customized design for slim DPM, which we term as Spectral Diffusion (SD), for light-weight image synthesis. SD incorporates wavelet gating in its architecture to enable frequency dynamic feature extraction at every reverse steps, and conducts spectrum-aware distillation to promote high-frequency recovery by inverse weighting the objective based on spectrum magnitudes. Experimental results demonstrate that, SD achieves 8-18x computational complexity reduction as compared to the latent diffusion models on a series of conditional and unconditional image generation tasks while retaining competitive image fidelity.

Submitted 27 November, 2022; originally announced November 2022.

arXiv:2210.10264 [pdf, other]

SignReLU neural network and its approximation ability

Authors: Jianfei Li, Han Feng, Ding-Xuan Zhou

Abstract: Deep neural networks (DNNs) have garnered significant attention in various fields of science and technology in recent years. Activation functions define how neurons in DNNs process incoming signals for them. They are essential for learning non-linear transformations and for performing diverse computations among successive neuron layers. In the last few years, researchers have investigated the approximation ability of DNNs to explain their power and success. In this paper, we explore the approximation ability of DNNs using a different activation function, called SignReLU. Our theoretical results demonstrate that SignReLU networks outperform rational and ReLU networks in terms of approximation performance. Numerical experiments are conducted comparing SignReLU with the existing activations such as ReLU, Leaky ReLU, and ELU, which illustrate the competitive practical performance of SignReLU.

Submitted 30 August, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

arXiv:2209.08317 [pdf, other]

doi 10.1109/TSTE.2023.3271317

On Power Control of Grid-Forming Converters: Modeling, Controllability, and Full-State Feedback Design

Authors: Meng Chen, Dao Zhou, Ali Tayyebi, Eduardo Prieto-Araujo, Florian Dörfler, Frede Blaabjerg

Abstract: The popular single-input single-output control structures and classic design methods (e.g., root locus analysis) for the power control of grid-forming converters have limitations in applying to different line characteristics and providing favorable performance. This paper studies the grid-forming converter power loops from the perspective of multi-input multi-output systems. First, the error dynamics associated with power control loops (error-based state-space model) are derived while taking into account the natural dynamical coupling terms of the power converter models. Thereafter, the controllability Gramian of the grid-forming converter power loops is studied. Last, a full-state feedback control design using only the local measurements is applied. By this way, the eigenvalues of the system can be arbitrarily placed in the timescale of power loops based on predefined time-domain specifications. A step-by-step construction and design procedure of the power control of grid-forming converters is also given. The analysis and proposed method are verified by experimental results.

Submitted 17 September, 2022; originally announced September 2022.

Comments: arXiv admin note: text overlap with arXiv:2205.03465

arXiv:2208.13019 [pdf, other]

Impact of Loss Model Selection on Power Semiconductor Lifetime Prediction in Electric Vehicles

Authors: Hongjian Xia, Yi Zhang, Dao Zhou, Minyou Chen, Wei Lai, Yunhai Wei, Huai Wang

Abstract: Power loss estimation is an indispensable procedure to conduct lifetime prediction for power semiconductor device. The previous studies successfully perform steady-state power loss estimation for different applications, but which may be limited for the electric vehicles (EVs) with high dynamics. Based on two EV standard driving cycle profiles, this paper gives a comparative study of power loss estimation models with two different time resolutions, i.e., the output period average and the switching period average. The correspondingly estimated power losses, thermal profiles, and lifetime clearly pointed out that the widely applied power loss model with the output period average is limited for EV applications, in particular for the highly dynamic driving cycle. The difference in the predicted lifetime can be up to 300 times due to the unreasonable choice the loss model, which calls for the industry attention on the differences of the EVs and the importance of loss model selection in lifetime prediction.

Submitted 27 August, 2022; originally announced August 2022.

Comments: 8 pages, 11 figures

arXiv:2206.13804 [pdf, other]

doi 10.1109/ECCE50734.2022.9947432

Multivariable Grid-Forming Converters with Direct States Control

Authors: Meng Chen, Dao Zhou, Frede Blaabjerg

Abstract: A multi-input multi-output based grid-forming (MIMO-GFM) converter has been proposed using multivariable feedback control, which has been proven as a superior and robust system using low-order controllers. However, the original MIMO-GFM control is easily affected by the high-frequency components especially for the converter without inner cascaded voltage and current loops and when it is connected into a strong grid. This paper proposes an improved MIMO-GFM control method, where the frequency and internal voltage are chosen as state variables to be controlled directly. In this way, the impact of high-frequency components is eliminated without increasing the complexity of the control system. The H-infinity synthesis is used to tune the parameters to obtain an optimized performance. Experimental results verify the effectiveness of the proposed method.

Submitted 28 June, 2022; originally announced June 2022.

arXiv:2205.05675 [pdf, other]

NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

Authors: Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang, et al. (86 additional authors not shown)

Abstract: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The rank of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered. And the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.

Submitted 11 May, 2022; originally announced May 2022.

Comments: Validation code of the baseline model is available at https://github.com/ofsoundof/IMDN. Validation of all submitted models is available at https://github.com/ofsoundof/NTIRE2022_ESR

arXiv:2205.03465 [pdf, other]

doi 10.1109/IECON49645.2022.9968836

Power Control of Grid-Forming Converters Based on Full-State Feedback

Authors: Meng Chen, Dao Zhou, Frede Blaabjerg

Abstract: The active and reactive power controllers of grid-forming converters are traditionally designed separately, which relies on the assumption of loop decoupling. This paper proposes a full-state feedback control for the power loops of grid-forming converters. First, the power loops are modeled considering their natural coupling, which, therefore, can apply to all kinds of line impedance, i.e., resistive, inductive, or complex. Then a full-state feedback control design is used. By this way, the eigenvalues of the system can be arbitrarily placed to any positions in the timescale of power loops. Therefore, the parameters can be directly chosen by the predefined specifications. A step-by-step parameters design procedure is also given in this paper. Experimental results verify the proposed method.

Submitted 6 May, 2022; originally announced May 2022.

arXiv:2205.02682 [pdf]

doi 10.1016/j.optcom.2022.128982

Temporally and Spatially variant-resolution illumination patterns in computational ghost imaging

Authors: Dong Zhou, Jie Cao, Huan Cui, Li-Xing Lin, Haoyu Zhang, Yingqiang Zhang, Qun Hao

Abstract: Conventional computational ghost imaging (CGI) uses light carrying a sequence of patterns with uniform-resolution to illuminate the object, then performs correlation calculation based on the light intensity value reflected by the target and the preset patterns to obtain object image. It requires a large number of measurements to obtain high-quality images, especially if high-resolution images are to be obtained. To solve this problem, we developed temporally variable-resolution illumination patterns, replacing the conventional uniform-resolution illumination patterns with a sequence of patterns of different imaging resolutions. In addition, we propose to combine temporally variable-resolution illumination patterns and spatially variable-resolution structure to develop temporally and spatially variable-resolution (TSV) illumination patterns, which not only improve the imaging quality of the region of interest (ROI) but also improve the robustness to noise. The methods using proposed illumination patterns are verified by simulations and experiments compared with CGI. For the same number of measurements, the method using temporally variable-resolution illumination patterns has better imaging quality than CGI, but it is less robust to noise. The method using TSV illumination patterns has better imaging quality in ROI than the method using temporally variable-resolution illumination patterns and CGI under the same number of measurements. We also experimentally verify that the method using TSV patterns have better imaging performance when applied to higher resolution imaging. The proposed methods are expected to solve the current computational ghost imaging that is difficult to achieve high-resolution and high-quality imaging.

Submitted 14 May, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

arXiv:2204.07988 [pdf, other]

Automatic spinal curvature measurement on ultrasound spine images using Faster R-CNN

Authors: Zhichao Liu, Liyue Qian, Wenke Jing, Desen Zhou, Xuming He, Edmond Lou, Rui Zheng

Abstract: Ultrasound spine imaging technique has been applied to the assessment of spine deformity. However, manual measurements of scoliotic angles on ultrasound images are time-consuming and heavily rely on raters experience. The objectives of this study are to construct a fully automatic framework based on Faster R-CNN for detecting vertebral lamina and to measure the fitting spinal curves from the detected lamina pairs. The framework consisted of two closely linked modules: 1) the lamina detector for identifying and locating each lamina pairs on ultrasound coronal images, and 2) the spinal curvature estimator for calculating the scoliotic angles based on the chain of detected lamina. Two hundred ultrasound images obtained from AIS patients were identified and used for the training and evaluation of the proposed method. The experimental results showed the 0.76 AP on the test set, and the Mean Absolute Difference (MAD) between automatic and manual measurement which was within the clinical acceptance error. Meanwhile the correlation between automatic measurement and Cobb angle from radiographs was 0.79. The results revealed that our proposed technique could provide accurate and reliable automatic curvature measurements on ultrasound spine images for spine deformities.

Submitted 20 April, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

Comments: Accepted by IUS2021

arXiv:2203.15613 [pdf, other]

Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

Authors: Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li

Abstract: Streaming automatic speech recognition models frequently perform worse than non-streaming models because they lack future context. To improve the performance of the streaming model and reduce computational complexity, this paper employs a frame-level model using an efficient augmented memory transformer block and a dynamic latency training method for streaming automatic speech recognition. The long-range history context is stored in the augmented memory bank as a complement to the limited history context used in the encoder. Keys and values are cached and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance and to support low- and high-latency inference simultaneously. Our experiments are conducted on the benchmark 960h LibriSpeech data set. With an average latency of 640ms, our model achieves a relative WER reduction of 6.0% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.

Submitted 29 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, submitted to interspeech 2022

arXiv:2203.15609 [pdf, other]

Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition

Authors: Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li, Yiran Zhong

Abstract: Conformer has shown great success in automatic speech recognition (ASR) on many public benchmarks. One of its crucial drawbacks is the quadratic time-space complexity with respect to the input sequence length, which prevents the model from scaling up and from processing longer input audio sequences. To solve this issue, numerous linear attention methods have been proposed. However, these methods often have limited performance on ASR because they treat tokens equally in modeling, neglecting the fact that neighbouring tokens are often more closely connected than distant tokens. In this paper, we take this fact into account and propose a new locality-biased linear attention for Conformer. It not only achieves higher accuracy than the vanilla Conformer, but also enjoys linear space-time computational complexity. Specifically, we replace the softmax attention with a locality-biased linear attention (LBLA) mechanism in Conformer blocks. The LBLA contains a kernel function to ensure linear complexity and a cosine reweighting matrix to place more weight on neighbouring tokens. Extensive experiments on the LibriSpeech corpus show that by introducing this locality bias to the Conformer, our method achieves a lower word error rate with an inference speedup of more than 22%.

Submitted 29 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, submitted to interspeech 2022

arXiv:2203.13535 [pdf, other]

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

Authors: Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Ziwei Liu, Di Hu

Abstract: Recent years have witnessed the success of deep learning on the visual sound separation task. However, existing works follow similar settings where the training and testing datasets share the same musical instrument categories, which to some extent limits the versatility of this task. In this work, we focus on a more general and challenging scenario, namely the separation of unknown musical instruments, where the categories in the training and testing phases have no overlap with each other. To tackle this new setting, we propose the Separation-with-Consistency (SeCo) framework, which can accomplish the separation on unknown categories by exploiting consistency constraints. Furthermore, to capture richer characteristics of the novel melodies, we devise an online matching strategy, which can bring stable enhancements at no cost in extra parameters. Experiments demonstrate that our SeCo framework exhibits strong adaptation ability on the novel musical categories and outperforms the baseline methods by a significant margin.

Submitted 25 March, 2022; originally announced March 2022.

arXiv:2202.11295 [pdf, other]

Continual learning-based probabilistic slow feature analysis for multimode dynamic process monitoring

Authors: Jingxin Zhang, Donghua Zhou, Maoyin Chen, Xia Hong

Abstract: In this paper, a novel multimode dynamic process monitoring approach is proposed by extending elastic weight consolidation (EWC) to probabilistic slow feature analysis (PSFA) in order to extract multimode slow features for online monitoring. EWC was originally introduced in the setting of machine learning of sequential multi-tasks with the aim of avoiding the catastrophic forgetting issue, which is equally a major challenge in multimode dynamic process monitoring. When a new mode arrives, a set of data should be collected so that this mode can be identified by PSFA and prior knowledge. Then, a regularization term is introduced to prevent new data from significantly interfering with the learned knowledge, where the parameter importance measures are estimated. The proposed method is denoted as PSFA-EWC, which is updated continually and is capable of achieving excellent performance for successive modes. Unlike traditional multimode monitoring algorithms, PSFA-EWC furnishes backward and forward transfer ability. The significant features of previous modes are retained while consolidating new information, which may contribute to learning new relevant modes. The effectiveness of the proposed method is demonstrated, in comparison with several known methods, via a continuous stirred tank heater and a practical coal pulverizing system.
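
The regularization idea mentioned in the abstract above can be illustrated with the generic EWC penalty: a quadratic term that pulls each parameter towards its value from the previous mode, weighted by an importance measure. This is only a minimal sketch of the standard EWC form; it is not the PSFA-EWC objective from the paper, and the function names, the `strength` coefficient and the scalar `new_mode_loss` are hypothetical.

```python
def ewc_penalty(params, anchor_params, importance, strength):
    """Generic EWC penalty: 0.5 * strength * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values (hypothetical flat list)
    anchor_params -- parameter values learned on the previous mode
    importance    -- per-parameter importance measures (e.g. Fisher diagonal)
    strength      -- assumed trade-off coefficient between old and new modes
    """
    if not (len(params) == len(anchor_params) == len(importance)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )


def ewc_objective(new_mode_loss, params, anchor_params, importance, strength):
    """Loss on the new mode plus the penalty that protects earlier knowledge."""
    return new_mode_loss + ewc_penalty(params, anchor_params, importance, strength)
```

Parameters with a large importance value are held close to their previous values, while unimportant ones remain free to adapt to the new mode.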

Submitted 28 April, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

Comments: This paper has been submitted to IEEE Transactions on Automation Science and Engineering for potential publication

arXiv:2202.08639 [pdf, other]

doi 10.23919/IPEC-Himeji2022-ECCE53331.2022.9807103

Augmentation of Generalized Multivariable Grid-Forming Control for Power Converters with Cascaded Controllers

Authors: Meng Chen, Dao Zhou, Ali Tayyebi, Eduardo Prieto-Araujo, Florian Dörfler, Frede Blaabjerg

Abstract: The classic design of grid-forming control strategies for power converters relies on the stringent assumption of timescale separation between DC and AC states and their corresponding control loops, e.g., AC and DC loops, power and cascaded voltage and current loops, etc. This paper proposes a multi-input multi-output based grid-forming (MIMO-GFM) control for power converters using a multivariable feedback structure. First, the MIMO-GFM control couples the AC and DC loops by a general multivariable control transfer matrix. Then, the parameter design is transformed into a standard fixed-structure H-infinity synthesis. In this way, all the loops can be tuned simultaneously and optimally without relying on the assumptions of loop decoupling. Therefore, a superior and robust performance can be achieved. Experimental results verify the proposed method.

Submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.04250 [pdf, other]

GenAD: General Representations of Multivariate Time Series for Anomaly Detection

Authors: Xiaolei Hua, Lin Zhu, Shenglin Zhang, Zeyan Li, Su Wang, Dong Zhou, Shuo Wang, Chao Deng

Abstract: The reliability of wireless base stations in China Mobile is of vital importance, because cell phone users are connected to the stations and the behaviors of the stations are directly related to user experience. Although the monitoring of station behaviors can be realized by anomaly detection on multivariate time series, building a general unsupervised anomaly detection model with a higher F1-score remains a challenging task because of the complex correlations and various temporal patterns of multivariate series in large-scale stations. In this paper, we propose a General representation of multivariate time series for Anomaly Detection (GenAD). First, we pre-train a general model on large-scale wireless base stations with self-supervision, which can be easily transferred to anomaly detection for a specific station with a small amount of training data. Second, we employ Multi-Correlation Attention and Time-Series Attention to represent the correlations and temporal patterns of the stations. With the above innovations, GenAD increases the F1-score by a total of 9% on real-world datasets in China Mobile, while the performance does not significantly degrade on public datasets with only 10% of the training data.

Submitted 8 February, 2022; originally announced February 2022.

arXiv:2110.11684 [pdf, other]

Multimodal-Boost: Multimodal Medical Image Super-Resolution using Multi-Attention Network with Wavelet Transform

Authors: Fayaz Ali Dharejo, Muhammad Zawish, Farah Deeba, Yuanchun Zhou, Kapal Dev, Sunder Ali Khowaja, Nawab Muhammad Faseeh Qureshi

Abstract: Deep learning based single image super resolution (SISR) algorithms have revolutionized the overall diagnosis framework by continually improving the architectural components and training strategies associated with convolutional neural networks (CNN) on low-resolution images. However, existing work is lacking in two ways: i) the SR output exhibits poor texture details and often has blurred edges, and ii) most of the models have been developed for a single modality and hence require modification to adapt to a new one. This work addresses (i) by proposing a generative adversarial network (GAN) with deep multi-attention modules to learn high-frequency information from low-frequency data. Existing approaches based on the GAN have yielded good SR results; however, the texture details of their SR output have been experimentally confirmed to be deficient, particularly for medical images. The integration of wavelet transform (WT) and GANs in our proposed SR model addresses the aforementioned limitation concerning textons. While the WT divides the LR image into multiple frequency bands, the transferred GAN uses multi-attention and upsample blocks to predict high-frequency components. Additionally, we present a learning method for training domain-specific classifiers as perceptual loss functions. Using a combination of multi-attention GAN loss and a perceptual loss function results in efficient and reliable performance. Applying the same model to medical images from diverse modalities is challenging; our work addresses (ii) by training and performing on several modalities via transfer learning. Using two medical datasets, we validate our proposed SR network against existing state-of-the-art approaches and achieve promising results in terms of SSIM and PSNR.

Submitted 12 March, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

Comments: 14 pages, 13 Figures, and 3 Tables. Submitted to IEEE/ACM TCBB

arXiv:2110.09704 [pdf, other]

Hybrid variable monitoring: An unsupervised process monitoring framework with binary and continuous variables

Authors: Min Wang, Donghua Zhou, Maoyin Chen

Abstract: Traditional process monitoring methods, such as PCA, PLS, ICA, MD, etc., are strongly dependent on continuous variables because most of them inevitably involve Euclidean or Mahalanobis distance. With industrial processes becoming more and more complex and integrated, binary variables also appear among the monitoring variables alongside continuous variables, which makes process monitoring more challenging. The aforementioned traditional approaches are unable to mine the information in binary variables, so the useful information contained in them is usually discarded during data preprocessing. To solve the problem, this paper focuses on the issue of hybrid variable monitoring (HVM) and proposes a novel unsupervised framework of process monitoring with hybrid variables including continuous and binary variables. HVM is addressed in the probabilistic framework, which can effectively exploit the process information implicit in both continuous and binary variables at the same time. In HVM, the statistics and the monitoring strategy suitable for hybrid variables with only healthy state data are defined and the physical explanation behind the framework is elaborated. In addition, the estimation of the parameters required in HVM is derived in detail and the detectable condition of the proposed method is analyzed. Finally, the superiority of HVM is fully demonstrated first on a numerical simulation and then on an actual case of a thermal power plant.

Submitted 10 March, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: This paper has been submitted to Automatica for potential publication

arXiv:2109.06982 [pdf, other]

doi 10.1109/TSG.2022.3161608

Generalized Multivariable Grid-Forming Control Design for Power Converters

Authors: Meng Chen, Dao Zhou, Ali Tayyebi, Eduardo Prieto-Araujo, Florian Dörfler, Frede Blaabjerg

Abstract: The grid-forming converter is an important unit in the future power system with more inverter-interfaced generators. However, improving its performance is still a key challenge. This paper proposes a generalized architecture of the grid-forming converter from the view of multivariable feedback control. As a result, many of the existing popular control strategies, i.e., droop control, power synchronization control, virtual synchronous generator control, matching control, dispatchable virtual oscillator control, and their improved forms, are unified into a multivariable feedback control transfer matrix working on several linear and nonlinear error signals. Meanwhile, unlike the traditional assumptions of decoupling between AC and DC control and between active power and reactive power control, the proposed configuration takes all of them into consideration simultaneously and can therefore provide better performance. As an example, a new multi-input-multi-output-based grid-forming (MIMO-GFM) control is proposed based on the generalized configuration. To cope with the multivariable feedback, an optimal and structured $H_{\infty}$ synthesis is used to design the control parameters. Finally, simulation and experimental results show superior performance and robustness of the proposed configuration and control.

Submitted 14 September, 2021; originally announced September 2021.

arXiv:2109.00617 [pdf, other]

LinEasyBO: Scalable Bayesian Optimization Approach for Analog Circuit Synthesis via One-Dimensional Subspaces

Authors: Shuhan Zhang, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng

Abstract: A large body of literature has shown that the Bayesian optimization framework is especially efficient and effective in analog circuit synthesis. However, most previous research focuses only on designing informative surrogate models or efficient acquisition functions. Although searching for the global optimum over the acquisition function surface is itself a difficult task, it has been largely ignored. In this paper, we propose a fast and robust Bayesian optimization approach via one-dimensional subspaces for analog circuit synthesis. By focusing solely on optimizing one-dimensional subspaces at each iteration, we greatly reduce the computational overhead of the Bayesian optimization framework while safely maximizing the acquisition function. By combining the benefits of different dimension selection strategies, we adaptively balance global and local search. By leveraging the batch Bayesian optimization framework, we further accelerate the optimization procedure by making full use of the hardware resources. Experimental results quantitatively show that our proposed algorithm can accelerate the optimization procedure by up to 9x and 38x compared to LP-EI and REMBOpBO respectively when the batch size is 15.

Submitted 1 September, 2021; originally announced September 2021.

Comments: 6 pages, 4 figures

arXiv:2108.05096 [pdf]

doi 10.1364/OL.440660

Omnidirectional ghost imaging system and unwrapping-free panoramic ghost imaging

Authors: Huan Cui, Jie Cao, Qun Hao, Dong Zhou, Mingyuan Tang, Kaiyu Zhang, Yingqiang Zhang

Abstract: Ghost imaging (GI) is a novel imaging method that can reconstruct object information from light intensity correlation measurements. However, at present, the field of view (FOV) is limited to the illuminating range of the light patterns. To enlarge the FOV of GI efficiently, we propose the omnidirectional ghost imaging system (OGIS), which can achieve a 360° omnidirectional FOV in one shot only by adding a curved mirror. Moreover, by designing retina-like annular patterns with log-polar patterns, OGIS can obtain unwrapping-free, undistorted panoramic images with uniform resolution, which opens up a new way for the application of GI.

Submitted 11 August, 2021; originally announced August 2021.

arXiv:2108.01667 [pdf]

doi 10.1364/OE.439704

Optimization of retina-like illumination patterns in ghost imaging

Authors: Jie Cao, Dong Zhou, Ying-Qiang Zhang, Huan Cui, Fang-Hua Zhang, Qun Hao

Abstract: Ghost imaging (GI) reconstructs images using a single-pixel or bucket detector, which has the advantages of scattering robustness, wide spectrum and beyond-visual-field imaging. However, this technique needs a large number of measurements to obtain a sharp image. Many methods have been proposed to overcome this disadvantage. Retina-like patterns, as one of the compressive sensing approaches, enhance the imaging quality of the region of interest (ROI) without increasing the number of measurements. The design of the retina-like patterns determines the performance of the ROI in the reconstructed image. Unlike the conventional method of filling in the ROI with random patterns, we propose to optimize retina-like patterns by filling in the ROI with patterns containing the sparsity prior of objects. The proposed method is verified by simulations and experiments in comparison with conventional GI, retina-like GI and GI using patterns optimized by principal component analysis. The method using optimized retina-like patterns obtains better imaging quality in the ROI than the other methods. Meanwhile, the good generalization ability of the optimized retina-like pattern is also verified. While designing the size and position of the ROI of the retina-like pattern, the feature information of the target can be obtained to optimize the pattern of the ROI. The proposed method paves the way for realizing high-quality GI.

Submitted 2 August, 2021; originally announced August 2021.

arXiv:2108.01666 [pdf]

Complementary Fourier single-pixel imaging

Authors: Dong Zhou, Jie Cao, Huan Cui, Qun Hao, Bing-Kun Chen, Kai Lin

Abstract: Single-pixel imaging, with the advantages of a wide spectrum, beyond-visual-field imaging, and robustness to light scattering, has attracted increasing attention in recent years. Fourier single-pixel imaging (FSI) can reconstruct sharp images under sub-Nyquist sampling. However, conventional FSI has difficulty balancing imaging quality and efficiency. To overcome this issue, we propose a novel approach called complementary Fourier single-pixel imaging (CFSI) to reduce measurements while retaining robustness. The complementary nature of Fourier patterns based on a four-step phase-shift algorithm is combined with the complementary nature of a digital micromirror device. CFSI requires only two phase-shifted patterns to obtain one Fourier spectral value. Four light intensity values are obtained by loading the two patterns, and the spectral value is calculated through differential measurement, which has good robustness to noise. The proposed method is verified by simulations and experiments in comparison with FSI based on two-, three-, and four-step phase-shift algorithms. CFSI performed better than the other methods under the condition that the best imaging quality of CFSI is not reached. The reported technique provides an alternative approach to realize real-time and high-quality imaging.
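
The differential measurement mentioned in the abstract above can be illustrated with the conventional four-step phase-shift baseline, in which four bucket measurements under patterns shifted by 0, π/2, π and 3π/2 yield one Fourier coefficient. This is a minimal simulated sketch of that baseline only, not of CFSI itself (which the abstract says needs just two patterns plus the complementary output of the micromirror device). The pattern model `a + b*cos(...)`, the function names and the tiny test image are all assumptions made for illustration.

```python
import cmath
import math


def fourier_pattern(width, height, fx, fy, phase, a=0.5, b=0.5):
    """Sinusoidal illumination pattern a + b*cos(2*pi*(fx*x/W + fy*y/H) + phase)."""
    return [
        [a + b * math.cos(2 * math.pi * (fx * x / width + fy * y / height) + phase)
         for x in range(width)]
        for y in range(height)
    ]


def bucket_measurement(image, pattern):
    """Simulated single-pixel reading: total light after pattern modulation."""
    return sum(i * p for row_i, row_p in zip(image, pattern)
               for i, p in zip(row_i, row_p))


def four_step_coefficient(image, fx, fy, b=0.5):
    """Fourier coefficient from four phase-shifted bucket measurements.

    (D_0 - D_pi) + j*(D_{pi/2} - D_{3pi/2}) equals 2*b times the DFT value;
    the differences cancel the constant term a and any constant background.
    """
    height, width = len(image), len(image[0])
    d = [bucket_measurement(
            image, fourier_pattern(width, height, fx, fy, k * math.pi / 2, b=b))
         for k in range(4)]
    return complex(d[0] - d[2], d[1] - d[3]) / (2 * b)


def dft_coefficient(image, fx, fy):
    """Direct 2-D DFT value, used here only to check the measurement model."""
    height, width = len(image), len(image[0])
    return sum(
        image[y][x] * cmath.exp(-2j * math.pi * (fx * x / width + fy * y / height))
        for y in range(height) for x in range(width)
    )
```

Because each coefficient is formed from differences of measurements, a constant offset added to every bucket reading drops out, which is the usual argument for the noise robustness of differential schemes.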

Submitted 2 August, 2021; originally announced August 2021.

arXiv:2106.15412 [pdf, other]

doi 10.1109/TCAD.2021.3054811

An Efficient Batch Constrained Bayesian Optimization Approach for Analog Circuit Synthesis via Multi-objective Acquisition Ensemble

Authors: Shuhan Zhang, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng

Abstract: Bayesian optimization is a promising methodology for analog circuit synthesis. However, the sequential nature of the Bayesian optimization framework significantly limits its ability to fully utilize real-world computational resources. In this paper, we propose an efficient parallelizable Bayesian optimization algorithm via Multi-objective ACquisition function Ensemble (MACE) to further accelerate the optimization procedure. By sampling query points from the Pareto front of the probability of improvement (PI), expected improvement (EI) and lower confidence bound (LCB), we combine the benefits of state-of-the-art acquisition functions to achieve a delicate tradeoff between exploration and exploitation for the unconstrained optimization problem. Based on this batch design, we further adjust the algorithm for the constrained optimization problem. By dividing the optimization procedure into two stages and first focusing on finding an initial feasible point, we manage to gain more information about the valid region and can better avoid sampling around the infeasible area. After achieving the first feasible point, we favor the feasible region by adding a specially designed penalization term to the acquisition function ensemble. The experimental results quantitatively demonstrate that our proposed algorithm can reduce the overall simulation time by up to 74 times compared to differential evolution (DE) for the unconstrained optimization problem when the batch size is 15. For the constrained optimization problem, our proposed algorithm can speed up the optimization process by up to 15 times compared to the weighted expected improvement based Bayesian optimization (WEIBO) approach when the batch size is 15.
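
The notion of a Pareto front over several acquisition functions, as used in the abstract above, can be illustrated with a simple non-dominated filter. This is only a sketch under stated assumptions: every candidate point is represented by a tuple of objective values that are all to be minimized (for instance negated PI, negated EI and LCB, already computed elsewhere), and the function names are hypothetical. It does not reproduce how MACE searches for or samples from the front.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the objective vectors `(1, 5)`, `(2, 2)`, `(3, 3)` and `(5, 1)`, only `(3, 3)` is dominated (by `(2, 2)`), so the other three form the front from which a batch of query points could be drawn.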

Submitted 28 June, 2021; originally announced June 2021.

Comments: 14 pages, 5 figures

arXiv:2106.15054 [pdf]

Time-Domain Doppler Biomotion Detections Immune to Unavoidable DC Offsets

Authors: Qinyi Lv, Lingtong Min, Congqi Cao, Shigang Zhou, Deyun Zhou, Chengkai Zhu, Yun Li, Zhongbo Zhu, Xiaojun Li, Lixin Ran

Abstract: In the past decades, continuous Doppler radar sensor-based bio-signal detection has attracted much research interest. A typical example is Doppler heartbeat detection. While significant progress has been achieved, reliable, time-domain accurate demodulation of bio-signals in the presence of unavoidable DC offsets remains a technical challenge. Aiming to overcome this difficulty, we propose in this paper a novel demodulation algorithm that does not need to trace and eliminate dynamic DC offsets, based on approximating segmented arcs in a quadrature constellation of sampling data to directional chords. Assisted by principal component analysis, such chords and their directions can be deterministically determined. Simulations and experimental validations showed full recovery of micron-level pendulum movements and strongly noisy human heartbeats, verifying the effectiveness and accuracy of the proposed approach.

Submitted 29 October, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

Comments: Accepted by IEEE Transactions on Instrumentation & Measurement

arXiv:2106.14683 [pdf, other]

doi 10.1109/DAC18072.2020.9218592

An Efficient Asynchronous Batch Bayesian Optimization Approach for Analog Circuit Synthesis

Authors: Shuhan Zhang, Fan Yang, Dian Zhou, Xuan Zeng

Abstract: In this paper, we propose EasyBO, an Efficient ASYnchronous Batch Bayesian Optimization approach for analog circuit synthesis. In this proposed approach, instead of waiting for the slowest simulations in the batch to finish, we accelerate the optimization procedure by asynchronously issuing the next query points whenever there is an idle worker. We introduce a new acquisition function that can better explore the design space for asynchronous batch Bayesian optimization. A new strategy is proposed to better balance exploration and exploitation and to guarantee the diversity of the query points. A penalization scheme is also proposed to further avoid redundant queries during the asynchronous batch optimization. The efficiency of optimization can thus be further improved. Compared with the state-of-the-art batch Bayesian optimization algorithm, EasyBO achieves up to 7.35 times speed-up without sacrificing the optimization results.

Submitted 28 June, 2021; originally announced June 2021.

Comments: 6 pages, 6 figures

arXiv:2105.15077 [pdf, other]

SDNet: mutil-branch for single image deraining using swin

Authors: Fuxiang Tan, YuTing Kong, Yingying Fan, Feng Liu, Daxin Zhou, Hao Zhang, Long Chen, Liang Gao, Yurong Qian

Abstract: Rain streaks degrade image quality and seriously affect the performance of subsequent computer vision tasks, such as autonomous driving, social security, etc. Therefore, removing rain streaks from a given rainy image is of great significance. Convolutional neural networks (CNN) have been widely used in image deraining tasks; however, the local computational characteristics of convolutional operations limit the development of image deraining tasks. Recently, the popular transformer has global computational features that can further facilitate the development of image deraining tasks. In this paper, we introduce Swin-transformer into the field of image deraining for the first time to study its performance and potential in this field. Specifically, we improve the basic module of Swin-transformer and design a three-branch model to implement single-image rain removal. The former implements the basic rain pattern feature extraction, while the latter fuses different features to further extract and process the image features. In addition, we employ a jump connection to fuse deep features and shallow features. In terms of experiments, the existing public dataset suffers from image duplication and relatively homogeneous backgrounds. Therefore, we propose a new dataset, Rain3000, to validate our model. Experimental results on the publicly available datasets Rain100L and Rain100H and on our dataset Rain3000 show that our proposed method has performance and inference speed advantages over the current mainstream single-image rain streak removal models. The source code will be available at https://github.com/H-tfx/SDNet.

Submitted 31 May, 2021; originally announced May 2021.

arXiv:2105.03847 [pdf]

Automatic segmentation of vertebral features on ultrasound spine images using Stacked Hourglass Network

Authors: Hong-Ye Zeng, Song-Han Ge, Yu-Chong Gao, De-Sen Zhou, Kang Zhou, Xu-Ming He, Edmond Lou, Rui Zheng

Abstract: Objective: The spinous process angle (SPA) is one of the essential parameters to denote three-dimensional (3-D) deformity of the spine. We propose an automatic segmentation method based on Stacked Hourglass Network (SHN) to detect the spinous processes (SP) on ultrasound (US) spine images and to measure the SPAs of clinical scoliotic subjects. Methods: The network was trained to detect vertebral SP and laminae as five landmarks on 1200 ultrasound transverse images and validated on 100 images. All the processed transverse images with highlighted SP and laminae were reconstructed into a 3-D image volume, and the SPAs were measured on the projected coronal images. The trained network was tested on 400 images by calculating the percentage of correct keypoints (PCK), and the SPA measurements were evaluated on 50 scoliotic subjects by comparing the results from US images and radiographs. Results: The trained network achieved a high average PCK (86.8%) on the test datasets; in particular, the PCK of SP detection was 90.3%. The SPAs measured from US and radiographic methods showed good correlation (r>0.85), and the mean absolute difference (MAD) between the two modalities was 3.3°, which was less than the clinical acceptance error (5°). Conclusion: The vertebral features can be accurately segmented on US spine images using SHN, and the measurement results of SPA from US data were comparable to the gold standard from radiography.

Submitted 23 May, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

Comments: 9 pages, 5 figures

arXiv:2105.03660 [pdf, other]

Deep learning of nanopore sensing signals using a bi-path network

Authors: Dario Dematties, Chenyu Wen, Mauricio David Pérez, Dian Zhou, Shi-Li Zhang

Abstract: Temporary changes in electrical resistance of a nanopore sensor caused by translocating target analytes are recorded as a sequence of pulses on current traces. Prevalent algorithms for feature extraction in pulse-like signals lack objectivity because empirical amplitude thresholds are user-defined to single out the pulses from the noisy background. Here, we use deep learning for feature extraction based on a bi-path network (B-Net). After training, the B-Net acquires the prototypical pulses and the ability of both pulse recognition and feature extraction without a priori assigned parameters. The B-Net performance is evaluated on generated datasets and further applied to experimental data of DNA and protein translocation. The B-Net results show remarkably small relative errors and stable trends. The B-Net is further shown capable of processing data with a signal-to-noise ratio equal to one, an impossibility for threshold-based algorithms. The developed B-Net is generic for pulse-like signals beyond pulsed nanopore currents.

Submitted 8 May, 2021; originally announced May 2021.

Showing 1–50 of 71 results for author: Zhou, D