-
Target Localization with Coprime Multistatic MIMO Radar via Coupled Canonical Polyadic Decomposition Based on Joint Eigenvalue Decomposition
Authors:
Guo-Zhao Liao,
Xiao-Feng Gong,
Wei Liu,
Hing Cheung So
Abstract:
This paper investigates target localization using a multistatic multiple-input multiple-output (MIMO) radar system with two distinct coprime array configurations: coprime L-shaped arrays and coprime planar arrays. The observed signals are modeled as tensors that admit a coupled canonical polyadic decomposition (C-CPD) model. For each configuration, a C-CPD method is presented based on joint eigenvalue decomposition (J-EVD). This computational framework includes (semi-)algebraic and optimization-based C-CPD algorithms and a target localization step that fuses direction-of-arrival (DOA) information to calculate the optimal position of each target. Specifically, the proposed (semi-)algebraic methods exploit the rotational invariance of the Vandermonde structure in coprime arrays, similar to the multiple invariance property of estimation of signal parameters via rotational invariance techniques (ESPRIT), which transforms the model into a J-EVD problem and reduces computational complexity. The study also investigates the working conditions of the algorithm to understand model identifiability. Additionally, the proposed method does not rely on prior knowledge of non-orthogonal probing waveforms and is effective in challenging underdetermined scenarios. Experimental results demonstrate that our method outperforms existing tensor-based approaches in both accuracy and computational efficiency.
Submitted 28 May, 2025;
originally announced May 2025.
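The rotational-invariance idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's J-EVD method for coprime arrays; it is plain single-invariance ESPRIT on a noise-free uniform linear array with half-wavelength spacing, and the function name `esprit_generators`, the array size, and the test angles are all assumptions made for illustration.

```python
import numpy as np

def esprit_generators(X, R):
    """Estimate the R Vandermonde generators z_r from data X (sensors x snapshots).

    Uses the shift invariance A[:-1] @ diag(z) == A[1:] of a Vandermonde
    steering matrix A, which turns the estimation into an eigenvalue problem.
    """
    # Signal subspace: the R dominant left singular vectors of the data.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Us = U[:, :R]
    # Us = A @ T for some invertible T, so pinv(Us[:-1]) @ Us[1:]
    # is similar to diag(z) and shares its eigenvalues.
    Psi = np.linalg.pinv(Us[:-1]) @ Us[1:]
    return np.linalg.eigvals(Psi)

# Hypothetical noise-free scenario: 3 sources, 8 sensors, 50 snapshots.
rng = np.random.default_rng(0)
angles = np.deg2rad([-20.0, 10.0, 35.0])
z_true = np.exp(1j * np.pi * np.sin(angles))          # half-wavelength spacing assumed
M, N = 8, 50
A = z_true[None, :] ** np.arange(M)[:, None]           # M x 3 Vandermonde matrix
S = rng.standard_normal((3, N)) + 1j * rng.standard_normal((3, N))
X = A @ S

z_est = esprit_generators(X, 3)
# Map generator phases back to angles in degrees and sort for comparison.
est_deg = np.sort(np.rad2deg(np.arcsin(np.angle(z_est) / np.pi)))
```

With several shift-invariant subarrays, as in the multiple-invariance setting the abstract refers to, one would obtain several such matrices sharing eigenvectors, which is what motivates a joint EVD.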
-
BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
Authors:
Xun Gong,
Anqi Lv,
Zhiming Wang,
Huijia Zhu,
Yanmin Qian
Abstract:
While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion, which negatively impacts final performance. BR-ASR is a general framework that allows seamless integration of the retrieved candidates into diverse ASR systems without fine-tuning. Experiments on LibriSpeech test-clean/-other achieve state-of-the-art (SOTA) biased word error rates (B-WER) of 2.8%/7.1% with 2000 bias words, delivering a 45% relative improvement over prior methods. BR-ASR also demonstrates high scalability: when expanding the bias list to 200k entries, where traditional methods generally fail, it induces only 0.3%/2.9% absolute WER/B-WER degradation with a 99.99% pruning rate and only 20 ms latency per query on test-other.
Submitted 25 May, 2025;
originally announced May 2025.
-
S2MNet: Speckle-To-Mesh Net for Three-Dimensional Cardiac Morphology Reconstruction via Echocardiogram
Authors:
Xilin Gong,
Yongkai Chen,
Shushan Wu,
Fang Wang,
Ping Ma,
Wenxuan Zhong
Abstract:
Echocardiography is the most commonly used imaging modality in cardiac assessment due to its non-invasive nature, real-time capability, and cost-effectiveness. Despite its advantages, most clinical echocardiograms provide only two-dimensional views, limiting the ability to fully assess cardiac anatomy and function in three dimensions. While three-dimensional echocardiography exists, it often suffers from reduced resolution, limited availability, and higher acquisition costs. To overcome these challenges, we propose a deep learning framework, S2MNet, that reconstructs continuous and high-fidelity 3D heart models by integrating six slices of routinely acquired 2D echocardiogram views. Our method has three advantages. First, our method avoids the difficulties of training data acquisition by simulating six 2D echocardiogram images from corresponding slices of a given 3D heart mesh. Second, we introduce a deformation field-based method, which avoids spatial discontinuities and structural artifacts in 3D echocardiogram reconstructions. We validate our method using clinically collected echocardiograms and demonstrate that our estimated left ventricular volume, a key clinical indicator of cardiac function, is strongly correlated with the doctor-measured GLPS, a clinical measurement that should demonstrate a negative correlation with LVE in medical theory. This association confirms the reliability of our proposed 3D construction method.
Submitted 9 May, 2025;
originally announced May 2025.
-
Statistical CSI Acquisition for Multi-frequency Massive MIMO Systems
Authors:
Jinke Tang,
Li You,
Xinrui Gong,
Chenjie Xie,
Xiqi Gao,
Xiang-Gen Xia,
Xueyuan Shi
Abstract:
Multi-frequency massive multi-input multi-output (MIMO) communication is a promising strategy for both 5G and future 6G systems, ensuring reliable transmission while enhancing frequency resource utilization. Statistical channel state information (CSI) has been widely adopted in multi-frequency massive MIMO transmissions to reduce overhead and improve transmission performance. In this paper, we propose efficient and accurate methods for obtaining statistical CSI in multi-frequency massive MIMO systems. First, we introduce a multi-frequency massive MIMO channel model and analyze the mapping relationship between two types of statistical CSI, namely the angular power spectrum (APS) and the spatial covariance matrix, along with their correlation across different frequency bands. Next, we propose an autoregressive (AR) method to predict the spatial covariance matrix of any frequency band based on that of another frequency band. Furthermore, we emphasize that channels across different frequency bands share similar APS characteristics. Leveraging the maximum entropy (ME) criterion, we develop a low-complexity algorithm for high-resolution APS estimation. Simulation results validate the effectiveness of the AR-based covariance prediction method and demonstrate the high-resolution estimation capability of the ME-based approach. Furthermore, we demonstrate the effectiveness of multi-frequency cooperative transmission by applying the proposed methods to obtain statistical CSI from low-frequency bands and utilizing it for high-frequency channel transmission. This approach significantly enhances high-frequency transmission performance while effectively reducing system overhead.
Submitted 8 May, 2025;
originally announced May 2025.
-
GNN-enabled Precoding for Massive MIMO LEO Satellite Communications
Authors:
Huibin Zhou,
Xinrui Gong,
Christos G. Tsinos,
Li You,
Xiqi Gao,
Björn Ottersten
Abstract:
Low Earth Orbit (LEO) satellite communication is a critical component in the development of sixth generation (6G) networks. The integration of massive multiple-input multiple-output (MIMO) technology is being actively explored to enhance the performance of LEO satellite communications. However, the limited power of LEO satellites poses a significant challenge in improving communication energy efficiency (EE) under constrained power conditions. Artificial intelligence (AI) methods are increasingly recognized as promising solutions for optimizing energy consumption while enhancing system performance, thus enabling more efficient and sustainable communications. This paper proposes approaches to address the challenges associated with precoding in massive MIMO LEO satellite communications. First, we introduce an end-to-end graph neural network (GNN) framework that effectively reduces the computational complexity of traditional precoding methods. Next, we introduce a deep unfolding of the Dinkelbach algorithm and the weighted minimum mean square error (WMMSE) approach to achieve enhanced EE, transforming iterative optimization processes into a structured neural network, thereby improving convergence speed and computational efficiency. Furthermore, we incorporate the Taylor expansion method to approximate matrix inversion within the GNN, enhancing both the interpretability and performance of the proposed method. Numerical experiments demonstrate the validity of our proposed method in terms of complexity and robustness, achieving significant improvements over state-of-the-art methods.
Submitted 6 May, 2025;
originally announced May 2025.
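For readers unfamiliar with the Dinkelbach algorithm named in the abstract above, the following is a minimal sketch of the classical (non-unfolded) iteration on a toy single-link energy-efficiency problem. It is not the paper's precoder or network: the objective log2(1 + h*p) / (p + p_circuit), the closed-form inner step, and all parameter names are assumptions chosen so the example stays self-contained.

```python
import math

def dinkelbach_ee(h, p_circuit, p_max, tol=1e-10, max_iter=100):
    """Maximize EE(p) = log2(1 + h*p) / (p + p_circuit) over 0 <= p <= p_max.

    Dinkelbach's method replaces the ratio by a sequence of subtractive
    problems max_p log2(1 + h*p) - lam * (p + p_circuit), updating lam to
    the ratio achieved by each inner solution.
    """
    lam = 0.0
    p = p_max
    for _ in range(max_iter):
        if lam > 0.0:
            # Inner problem is concave in p: take the stationary point
            # h / ((1 + h*p) * ln 2) = lam and clip it to the feasible interval.
            p = min(max(1.0 / (lam * math.log(2.0)) - 1.0 / h, 0.0), p_max)
        else:
            # With lam = 0 the inner objective is increasing, so use full power.
            p = p_max
        rate = math.log2(1.0 + h * p)
        gap = rate - lam * (p + p_circuit)   # value of the inner problem at lam
        lam = rate / (p + p_circuit)         # Dinkelbach update
        if abs(gap) < tol:
            break
    return p, lam

# Hypothetical channel gain, circuit power, and power budget.
p_opt, ee_opt = dinkelbach_ee(h=2.0, p_circuit=1.0, p_max=10.0)
```

Deep unfolding, as described in the abstract, would treat a fixed number of such iterations as network layers with learnable parameters; that step is not shown here.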
-
Diffusion-empowered AutoPrompt MedSAM
Authors:
Peng Huang,
Shu Hu,
Bo Peng,
Xun Gong,
Penghang Yin,
Hongtu Zhu,
Xi Wu,
Xin Wang
Abstract:
MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM's image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model's ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model's predictions, transforming MedSAM's semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM's pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at https://github.com/HP-ML/AutoPromptMedSAM.git.
Submitted 15 April, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
A parametric non-negative coupled canonical polyadic decomposition algorithm for hyperspectral super-resolution
Authors:
Xi-Yuan Liu,
Xiao-Feng Gong,
Lei Wang,
Wei Feng,
Qiu-Hua Lin
Abstract:
Recently, coupled tensor decomposition has been widely used in data fusion of a hyperspectral image (HSI) and a multispectral image (MSI) for hyperspectral super-resolution (HSR). However, existing works often ignore the inherent non-negative (NN) property of the image data, or impose the NN constraint via hard-thresholding, which may interfere with the optimization procedure and cause the method to be sub-optimal. As such, we propose a novel NN coupled canonical polyadic decomposition (NN-C-CPD) algorithm, which makes use of the parametric method and the nonlinear least squares (NLS) framework to impose the NN constraint on the C-CPD computation. More exactly, the NN constraint is converted into a squared relationship between the NN entries of the factor matrices and a set of latent parameters. Using the chain rule, key quantities such as the gradient and the Jacobian with respect to the latent parameters can be derived, so that the NN constraint is naturally integrated without interfering with the optimization procedure. Experimental results are provided to demonstrate the performance of the proposed NN-C-CPD algorithm in HSR applications.
Submitted 25 January, 2025;
originally announced January 2025.
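The squared parameterization described in the abstract above can be written out as a short worked derivation. The notation is assumed for illustration and is not taken from the paper: A is one non-negative factor matrix, Θ holds its latent parameters, f is the least-squares cost, r is the residual vector, and ∗ denotes the entry-wise (Hadamard) product.

```latex
% Squared parameterization: every entry of A is non-negative by construction.
a_{ij} = \theta_{ij}^{2} \;\ge\; 0
\qquad\Longleftrightarrow\qquad
\mathbf{A} = \boldsymbol{\Theta} \ast \boldsymbol{\Theta}

% Chain rule for the gradient with respect to the latent parameters.
\frac{\partial f}{\partial \theta_{ij}}
  = \frac{\partial f}{\partial a_{ij}} \,\frac{\partial a_{ij}}{\partial \theta_{ij}}
  = 2\,\theta_{ij}\,\frac{\partial f}{\partial a_{ij}}
\qquad\Longrightarrow\qquad
\nabla_{\boldsymbol{\Theta}} f = 2\,\boldsymbol{\Theta} \ast \nabla_{\mathbf{A}} f

% Chain rule for the Jacobian of the residual r used in NLS-type updates.
\mathbf{J}_{\boldsymbol{\Theta}}
  = \mathbf{J}_{\mathbf{A}} \,
    \operatorname{diag}\!\bigl( 2\,\operatorname{vec}(\boldsymbol{\Theta}) \bigr),
\qquad
\mathbf{J}_{\mathbf{A}} = \frac{\partial \mathbf{r}}{\partial \operatorname{vec}(\mathbf{A})^{\mathsf T}}
```

Under this sketch, an unconstrained solver can update Θ freely while A stays non-negative, which is the sense in which no thresholding step is needed.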
-
A Block Term Decomposition Model Based Algorithm for Tensor Completion of Multidimensional Harmonic Signals
Authors:
Lei Wang,
Xiao-Feng Gong,
Xi-Yuan Liu,
Wei Feng,
Qiu-Hua Lin
Abstract:
We consider tensor data completion of an incomplete observation of multidimensional harmonic (MH) signals. Unlike existing tensor-based techniques for MH retrieval (MHR), which mostly adopt the canonical polyadic decomposition (CPD) to model the simple "one-to-one" correspondence among harmonics across different modes, we herein use the more flexible block term decomposition (BTD) model, which can describe the complex mutual correspondences among several groups of harmonics across different modes. An optimization principle that aims to fit the BTD model in the least squares sense, subject to rank minimization of hankelized MH components, is set up for the tensor completion task, and an algorithm based on the alternating direction method of multipliers is proposed. Its effectiveness and applicability are validated through both numerical simulations and an application in Sub-6GHz channel state information (CSI) completion.
Submitted 25 January, 2025;
originally announced January 2025.
-
Target Localization with a Coprime Multistatic MIMO Radar via Coupled Canonical Polyadic Decomposition Based on Joint EVD
Authors:
Guo-Zhao Liao,
Xiao-Feng Gong,
Wei Liu,
Hing Cheung So
Abstract:
This paper addresses target localization using a multistatic multiple-input multiple-output (MIMO) radar system with coprime L-shaped receive arrays (CLsA). A target localization method is proposed by modeling the observed signals as tensors that admit a coupled canonical polyadic decomposition (C-CPD) model without matched filtering. It consists of a novel joint eigenvalue decomposition (J-EVD) based (semi-)algebraic algorithm, and a post-processing approach to determine the target locations by fusing the direction-of-arrival estimates extracted from the J-EVD-based C-CPD results. In particular, by leveraging the rotational invariance of the Vandermonde structure in CLsA, we convert the C-CPD problem into a J-EVD problem, significantly reducing its computational complexity. Experimental results show that our method outperforms existing tensor-based ones.
Submitted 25 January, 2025;
originally announced January 2025.
-
Deep Distance Map Regression Network with Shape-aware Loss for Imbalanced Medical Image Segmentation
Authors:
Huiyu Li,
Xiabi Liu,
Said Boumaraf,
Xiaopeng Gong,
Donghai Liao,
Xiaohong Ma
Abstract:
Small object segmentation, such as tumor segmentation, is a difficult and critical task in the field of medical image analysis. Although deep learning based methods have achieved promising performance, they are restricted to the use of binary segmentation masks. Inspired by the rigorous mapping between binary segmentation masks and distance maps, we adopt the distance map as a novel ground truth and employ a network to compute it. Specifically, we propose a new segmentation framework that incorporates an existing binary segmentation network and a lightweight regression network (dubbed LR-Net). Thus, the LR-Net can convert the distance map computation into a regression task and leverage the rich information of distance maps. Additionally, we derive a shape-aware loss by employing distance maps as penalty maps to infer the complete shape of an object. We evaluated our approach on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset. Experimental results show that our approach outperforms the classification-based methods as well as other existing state-of-the-art methods.
Submitted 15 January, 2025;
originally announced January 2025.
-
Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives
Authors:
Jinchao Li,
Yuejiao Wang,
Junan Li,
Jiawen Kang,
Bo Zheng,
Simon Wong,
Brian Mak,
Helene Fung,
Jean Woo,
Man-Wai Mak,
Timothy Kwok,
Vincent Mok,
Xianmin Gong,
Xixin Wu,
Xunying Liu,
Patrick Wong,
Helen Meng
Abstract:
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Speech analysis offers a non-intrusive and scalable screening method, particularly through narrative tasks in neuropsychological assessment tools. Traditional narrative analysis often focuses on local indicators in microstructure, such as word usage and syntax. While these features provide insights into language production abilities, they often fail to capture global narrative patterns, or macrostructures. Macrostructures include coherence, thematic organization, and logical progressions, reflecting essential cognitive skills potentially critical for recognizing NCDs. Addressing this gap, we propose to investigate specific cognitive and linguistic challenges by analyzing topical shifts, temporal dynamics, and the coherence of narratives over time, aiming to reveal cognitive deficits by identifying narrative impairments, and exploring their impact on communication and cognition. The investigation is based on the CU-MARVEL Rabbit Story corpus, which comprises recordings of a story-telling task from 758 older adults. We developed two approaches: the Dynamic Topic Models (DTM)-based temporal analysis to examine the evolution of topics over time, and the Text-Image Temporal Alignment Network (TITAN) to evaluate the coherence between spoken narratives and visual stimuli. The DTM-based approach validated the effectiveness of dynamic topic consistency as a macrostructural metric (F1=0.61, AUC=0.78). The TITAN approach achieved the highest performance (F1=0.72, AUC=0.81), surpassing established microstructural and macrostructural feature sets. Cross-comparison and regression tasks further demonstrated the effectiveness of the proposed dynamic macrostructural modeling approaches for NCD detection.
Submitted 7 January, 2025;
originally announced January 2025.
-
GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications
Authors:
Zhenzhou Jin,
Li You,
Huibin Zhou,
Yuanshuo Wang,
Xiaofeng Liu,
Xinrui Gong,
Xiqi Gao,
Derrick Wing Kwan Ng,
Xiang-Gen Xia
Abstract:
Massive multiple-input multiple-output (MIMO) offers significant advantages in spectral and energy efficiencies, positioning it as a cornerstone technology of fifth-generation (5G) wireless communication systems and a promising solution for the burgeoning data demands anticipated in sixth-generation (6G) networks. In recent years, with the continuous advancement of artificial intelligence (AI), a multitude of task-oriented generative foundation models (GFMs) have emerged, achieving remarkable performance in various fields such as computer vision (CV), natural language processing (NLP), and autonomous driving. As a pioneering force, these models are driving the paradigm shift in AI towards generative AI (GenAI). Among them, the generative diffusion model (GDM), as one of the state-of-the-art families of generative models, demonstrates an exceptional capability to learn implicit prior knowledge and robust generalization capabilities, thereby enhancing its versatility and effectiveness across diverse applications. In this paper, we delve into the potential applications of GDM in massive MIMO communications. Specifically, we first provide an overview of massive MIMO communication, the framework of GFMs, and the working mechanism of GDM. Following this, we discuss recent research advancements in the field and present a case study of near-field channel estimation based on GDM, demonstrating its promising potential for facilitating efficient ultra-dimensional channel state information (CSI) acquisition in the context of massive MIMO communications. Finally, we highlight several pressing challenges in future mobile communications and identify promising research directions surrounding GDM.
Submitted 24 December, 2024;
originally announced December 2024.
-
Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
Authors:
Zhilin Zhang,
Jie Wang,
Zhanghao Qin,
Ruiqi Zhu,
Xiaoliang Gong
Abstract:
Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of medical image understanding and natural language processing for clinical applications. By interpreting medical images and providing precise answers to relevant clinical inquiries, MedVQA has the potential to support diagnostic decision-making and reduce workload across various fields like radiology. While recent approaches rely heavily on unified large pre-trained Visual-Language Models, research on more efficient fusion mechanisms remains relatively limited in this domain. In this paper, we introduce a fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency as well as solid performance. We conduct comprehensive experiments and demonstrate how bilinear attention fusion can approximate the performance of larger fusion models like cross-modal Transformer. Our results show that OMniBAN requires fewer parameters (approximately 2/3 of Transformer-based Co-Attention) and substantially lower FLOPs (approximately 1/4), while achieving comparable overall performance and even slight improvements on closed-ended questions on two key MedVQA benchmarks. This balance between efficiency and accuracy suggests that OMniBAN could be a viable option for real-world medical image question answering, where computational resources are often constrained.
Submitted 11 May, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Unrevealed Threats: A Comprehensive Study of the Adversarial Robustness of Underwater Image Enhancement Models
Authors:
Siyu Zhai,
Zhibo He,
Xiaofeng Cong,
Junming Hou,
Jie Gui,
Jian Wei You,
Xin Gong,
James Tin-Yau Kwok,
Yuan Yan Tang
Abstract:
Learning-based methods for underwater image enhancement (UWIE) have undergone extensive exploration. However, learning-based models are usually vulnerable to adversarial examples, and UWIE models are no exception. To the best of our knowledge, there is no comprehensive study on the adversarial robustness of UWIE models, which indicates that UWIE models are potentially under the threat of adversarial attacks. In this paper, we propose a general adversarial attack protocol. We make a first attempt to conduct adversarial attacks on five well-designed UWIE models on three common underwater image benchmark datasets. Considering the scattering and absorption of light in the underwater environment, there exists a strong correlation between color correction and underwater image enhancement. On this basis, we also design two effective UWIE-oriented adversarial attack methods, Pixel Attack and Color Shift Attack, targeting different color spaces. The results show that the five models exhibit varying degrees of vulnerability to adversarial attacks and that well-designed small perturbations on degraded images are capable of preventing UWIE models from generating enhanced results. Further, we conduct adversarial training on these models and successfully mitigate the effectiveness of adversarial attacks. In summary, we reveal the adversarial vulnerability of UWIE models and propose a new evaluation dimension for UWIE models.
Submitted 10 September, 2024;
originally announced September 2024.
-
Knowledge-driven AI-generated data for accurate and interpretable breast ultrasound diagnoses
Authors:
Haojun Yu,
Youcheng Li,
Nan Zhang,
Zihan Niu,
Xuantong Gong,
Yanwen Luo,
Quanlin Wu,
Wangyan Qin,
Mengyuan Zhou,
Jie Han,
Jia Tao,
Ziwei Zhao,
Di Dai,
Di He,
Dong Wang,
Binghui Tang,
Ling Huo,
Qingli Zhu,
Yong Wang,
Liwei Wang
Abstract:
Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifically, we introduce a pipeline, TAILOR, that builds a knowledge-driven generative model to produce tailored synthetic data. The generative model, using 3,749 lesions as source data, can generate millions of breast-US images, especially for error-prone rare cases. The generated data can be further used to build a diagnostic model for accurate and interpretable diagnoses. In the prospective external evaluation, our diagnostic model outperforms the average performance of nine radiologists by 33.5% in specificity with the same sensitivity, improving their performance by providing predictions with an interpretable decision-making process. Moreover, on ductal carcinoma in situ (DCIS), our diagnostic model outperforms all radiologists by a large margin, with only 34 DCIS lesions in the source data. We believe that TAILOR can potentially be extended to various diseases and imaging modalities.
Submitted 23 July, 2024;
originally announced July 2024.
-
Advanced Long-Content Speech Recognition With Factorized Neural Transducer
Authors:
Xun Gong,
Yu Wu,
Jinyu Li,
Shujie Liu,
Rui Zhao,
Xie Chen,
Yanmin Qian
Abstract:
In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model. Instead, FNT shows its potential in utilizing long-content information, where we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on LibriSpeech and GigaSpeech corpora, and obtains relative 19% and 12% word error rate reduction, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, which is named SLongFNT, consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative 26% and 17% WER reduction on LibriSpeech and GigaSpeech respectively while keeping a good latency, compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.
Submitted 20 March, 2024;
originally announced March 2024.
-
Traffic Control via Connected and Automated Vehicles: An Open-Road Field Experiment with 100 CAVs
Authors:
Jonathan W. Lee,
Han Wang,
Kathy Jang,
Amaury Hayat,
Matthew Bunting,
Arwa Alanqary,
William Barbour,
Zhe Fu,
Xiaoqian Gong,
George Gunter,
Sharon Hornstein,
Abdul Rahman Kreidieh,
Nathan Lichtlé,
Matthew W. Nice,
William A. Richardson,
Adit Shah,
Eugene Vinitsky,
Fangyu Wu,
Shengquan Xiang,
Sulaiman Almatrudi,
Fahd Althukair,
Rahul Bhadani,
Joy Carpio,
Raphael Chekroun,
Eric Cheng
, et al. (39 additional authors not shown)
Abstract:
The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves" are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system, referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experiment leveraged a heterogeneous fleet of 100 longitudinally-controlled vehicles as Lagrangian traffic actuators, each of which ran a controller with the architecture described in this paper. The MegaController is a hierarchical control architecture, which consists of two main layers. The upper layer is called Speed Planner, and is a centralized optimal control algorithm. It assigns speed targets to the vehicles, conveyed through the LTE cellular network. The lower layer is a control layer, running on each vehicle. It performs local actuation by overriding the stock adaptive cruise controller, using the stock on-board sensors. The Speed Planner ingests live data feeds provided by third parties, as well as data from our own control vehicles, and uses both to perform the speed assignment. The architecture of the Speed Planner allows for modular use of standard control techniques, such as optimal control, model predictive control, kernel methods, and others, including deep RL and explicit controllers. Depending on the vehicle architecture, all onboard sensing data can be accessed by the local controllers, or only some. Control inputs vary across automakers, ranging from torque or acceleration requests for some cars to electronic selection of ACC set points in others. The proposed architecture allows for the combination of all possible settings proposed above. Most configurations were tested throughout the ramp up to the MegaVandertest.
Submitted 26 February, 2024;
originally announced February 2024.
-
On Data-Driven Modeling and Control in Modern Power Grids Stability: Survey and Perspective
Authors:
Xun Gong,
Xiaozhe Wang,
Bo Cao
Abstract:
Modern power grids are evolving rapidly with increasingly volatile renewable generation, distributed energy resources (DERs), and time-varying operating conditions. The DERs include rooftop photovoltaics (PV), small wind turbines, energy storage, flexible loads, electric vehicles (EVs), etc. Grid control is confronted with low inertia, uncertainty, and nonlinearity that challenge operational security, efficacy, and efficiency. The ongoing digitization of power grids provides opportunities to address these challenges with data-driven modeling and control. This paper provides a comprehensive review of emerging data-driven dynamical modeling and control methods and their various applications in power grids. Future trends are also discussed based on advances in data-driven control.
Submitted 7 August, 2023;
originally announced August 2023.
-
Human Emotion Recognition Based On Galvanic Skin Response signal Feature Selection and SVM
Authors:
Di Fan,
Mingyang Liu,
Xiaohan Zhang,
Xiaopeng Gong
Abstract:
A novel human emotion recognition method based on automatically selected Galvanic Skin Response (GSR) signal features and SVM is proposed in this paper. GSR signals were acquired by the e-Health Sensor Platform V2.0. Then, the data are de-noised with a wavelet function and normalized to remove individual differences. Thirty features are extracted from the normalized data; however, using these features directly leads to a low recognition rate. To obtain optimized features, a covariance-based feature selection is employed in our method. Finally, an SVM that takes the optimized features as input is used to perform human emotion recognition. The experimental results indicate that the proposed method leads to good human emotion recognition, and the recognition accuracy is more than 66.67%.
Submitted 3 July, 2023;
originally announced July 2023.
-
Deep-Learning-Aided Alternating Least Squares for Tensor CP Decomposition and Its Application to Massive MIMO Channel Estimation
Authors:
Xiao Gong,
Wei Chen,
Bo Ai,
Geert Leus
Abstract:
CANDECOMP/PARAFAC (CP) decomposition is the most widely used model to formulate the received tensor signal in a massive MIMO system, as the receiver generally sums the components from different paths or users. To achieve accurate and low-latency channel estimation, good and fast CP decomposition (CPD) algorithms are desired. The CP alternating least squares (CPALS) is the workhorse algorithm for calculating the CPD. However, its performance depends on the initializations, and good starting values can lead to more efficient solutions. Existing initialization strategies are decoupled from the CPALS and are not necessarily favorable for solving the CPD. This paper proposes a deep-learning-aided CPALS (DL-CPALS) method that uses a deep neural network (DNN) to generate favorable initializations. The proposed DL-CPALS integrates the DNN and CPALS into a model-based deep learning paradigm, where it trains the DNN to generate an initialization that facilitates fast and accurate CPD. Moreover, benefiting from the CP low-rankness, the proposed method is trained using noisy data and does not require paired clean data. The proposed DL-CPALS is applied to millimeter wave MIMO-OFDM channel estimation. Experimental results demonstrate the significant improvements of the proposed method in terms of both speed and accuracy for CPD and channel estimation.
Submitted 20 November, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
Authors:
Hang Shao,
Bei Liu,
Wei Wang,
Xun Gong,
Yanmin Qian
Abstract:
As a popular multilingual and multitask pre-trained speech model, Whisper suffers from the curse of multilinguality. To enhance multilingual capabilities in small Whisper models, we propose DQ-Whisper, a novel joint distillation and quantization framework to compress Whisper for efficient inference. First, we propose a novel dynamic matching distillation strategy. Then, a quantization-aware distillation framework is introduced to integrate quantization with distillation. Experimental results on various multilingual datasets show that our proposed distillation approach can effectively enhance the multilingual capabilities of small Whisper models without increasing computational costs. Up to 5.18x reduction in model size is achieved with marginal performance degradation. In addition, quantization is compatible with distillation, which can result in a higher compression rate.
Submitted 29 September, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Optimal Resource Allocation between Two Nonfully Cooperative Wireless Networks under Malicious Attacks: A Gestalt Game Perspective
Authors:
Yukang Cui,
Xinru Yang,
Tingwen Huang,
Xin Gong
Abstract:
In this paper, the problem of seeking optimal distributed resource allocation (DRA) policies on cellular networks in the presence of an unknown malicious adding-edge attacker is investigated. This problem is described as the games of games (GoG) model. Specifically, two subnetwork policymakers constitute a Nash game, while the confrontation between each subnetwork policymaker and the attacker is captured by a Stackelberg game. First, we show that the communication resource allocation of cellular networks based on the Foschini-Miljanic (FM) algorithm can be transformed into a geometric program and be efficiently solved via convex optimization. Second, the upper limit of attack magnitude that can be tolerated by the network is calculated by the corresponding theory, and it is proved that the above geometric programming (GP) framework is solvable within the attack bound, that is, there exists a Gestalt Nash equilibrium (GNE) in our GoG. Third, a heuristic algorithm that iteratively uses GP is proposed to identify the optimal policy profiles of both subnetworks, for which asymptotic convergence is also confirmed. Fourth, a greedy heuristic adding-edge strategy is developed for the attacker to determine the set of the most vulnerable edges. Finally, simulation examples illustrate that the proposed theoretical results are robust and can achieve the GNE. It is verified that the transmission gains and interference gains of all channels are well tuned within a limited budget, despite the existence of malicious attacks.
Submitted 22 March, 2023;
originally announced April 2023.
-
Resilient Output Consensus Control of Heterogeneous Multi-agent Systems against Byzantine Attacks: A Twin Layer Approach
Authors:
Xin Gong,
Yiwen Liang,
Yukang Cui,
Shi Liang,
Tingwen Huang
Abstract:
This paper studies the problem of cooperative control of heterogeneous multi-agent systems (MASs) against Byzantine attacks. An agent affected by Byzantine attacks sends different wrong values to all neighbors while applying wrong input signals to itself, which makes such attacks aggressive and difficult to defend against. Inspired by the concept of Digital Twin, a new hierarchical protocol equipped with a virtual twin layer (TL) is proposed, which decouples the above problems into the defense scheme against Byzantine edge attacks on the TL and the defense scheme against Byzantine node attacks on the cyber-physical layer (CPL). On the TL, we propose a resilient topology reconfiguration strategy by adding a minimum number of key edges to improve network resilience. It is rigorously proved that the control strategy is sufficient to achieve asymptotic consensus in finite time with the topology on the TL satisfying strongly $(2f+1)$-robustness. On the CPL, decentralized chattering-free controllers are proposed to guarantee the resilient output consensus for the heterogeneous MASs against Byzantine node attacks. Moreover, the obtained controller shows exponential convergence. The effectiveness and practicality of the theoretical results are verified by numerical examples.
Submitted 22 March, 2023;
originally announced March 2023.
-
Data-Driven Leader-following Consensus for Nonlinear Multi-Agent Systems against Composite Attacks: A Twins Layer Approach
Authors:
Xin Gong,
Jintao Peng,
Dong Yang,
Zhan Shu,
Tingwen Huang,
Yukang Cui
Abstract:
This paper studies the leader-following consensus of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequently, the resilient control task against CAs can be divided into two parts: one is distributed estimation against DoS attacks on the TL, and the other is resilient decentralized tracking control against actuation attacks on the CPL. The data-driven scheme is used to deal with both model non-linearity and model uncertainty, in which only the input and output data of the system are employed throughout the whole control process. First, a distributed observer based on a switching estimation law against DoS is designed on the TL. Second, a distributed model-free adaptive control (DMFAC) protocol based on attack compensation against AAs is designed on the CPL. Moreover, the uniformly ultimately bounded convergence of the consensus error of the proposed double-layer DMFAC algorithm is rigorously proved. Finally, the simulation verifies the effectiveness of the resilient double-layer control scheme.
Submitted 22 March, 2023;
originally announced March 2023.
-
Resilient Output Containment Control of Heterogeneous Multiagent Systems Against Composite Attacks: A Digital Twin Approach
Authors:
Yukang Cui,
Lingbo Cao,
Michael V. Basin,
Jun Shen,
Tingwen Huang,
Xin Gong
Abstract:
This paper studies the distributed resilient output containment control of heterogeneous multiagent systems against composite attacks, including denial-of-service (DoS) attacks, false-data injection (FDI) attacks, camouflage attacks, and actuation attacks. Inspired by digital twins, a twin layer (TL) with higher security and privacy is used to decouple the above problem into two tasks: defense protocols against DoS attacks on the TL and defense protocols against actuation attacks on the cyber-physical layer (CPL). First, considering modeling errors of leader dynamics, we introduce distributed observers to reconstruct the leader dynamics for each follower on the TL under DoS attacks. Second, distributed estimators are used to estimate follower states according to the reconstructed leader dynamics on the TL. Third, according to the reconstructed leader dynamics, we design decentralized solvers that calculate the output regulator equations on the CPL. Fourth, decentralized adaptive attack-resilient control schemes that resist unbounded actuation attacks are provided on the CPL. Furthermore, we apply the above control protocols to prove that the followers can achieve uniformly ultimately bounded (UUB) convergence, and the upper bound of the UUB convergence is determined explicitly. Finally, two simulation examples are provided to show the effectiveness of the proposed control protocols.
Submitted 22 March, 2023;
originally announced March 2023.
-
Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion
Authors:
Jiangyi Deng,
Yanjiao Chen,
Yinan Zhong,
Qianhao Miao,
Xueluan Gong,
Wenyuan Xu
Abstract:
Voice conversion (VC) techniques can be abused by malicious parties to transform their audios to sound like a target speaker, making it hard for a human being or a speaker verification/identification system to trace the source speaker. In this paper, we make the first attempt to restore the source voiceprint from audios synthesized by voice conversion methods with high credit. However, unveiling the features of the source speaker from a converted audio is challenging since the voice conversion operation intends to disentangle the original features and infuse the features of the target speaker. To fulfill our goal, we develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples. We equip Revelio with a carefully-designed differential rectification algorithm to eliminate the influence of the target speaker by removing the representation component that is parallel to the voiceprint of the target speaker. We have conducted extensive experiments to evaluate the capability of Revelio in restoring voiceprint from audios converted by VQVC, VQVC+, AGAIN, and BNE. The experiments verify that Revelio is able to rebuild voiceprints that can be traced to the source speaker by speaker verification and identification systems. Revelio also exhibits robust performance under inter-gender conversion, unseen languages, and telephony networks.
Submitted 23 February, 2023;
originally announced February 2023.
-
A Novel Koopman-Inspired Method for the Secondary Control of Microgrids with Grid-Forming and Grid-Following Sources
Authors:
Xun Gong,
Xiaozhe Wang
Abstract:
This paper proposes an online data-driven Koopman-inspired identification and control method for microgrid secondary voltage and frequency control. Unlike typical data-driven methods, the proposed method requires no warm-up training, yet it guarantees bounded-input-bounded-output (BIBO) stability and even asymptotic stability under some mild conditions. The proposed method estimates the Koopman state space model adaptively so as to perform effective secondary voltage and frequency control that can handle microgrid nonlinearity and uncertainty. Case studies in the 4-bus and 13-bus microgrid test systems (with grid-forming and grid-following sources) demonstrate the effectiveness and robustness of the proposed identification and control method under changes in operating conditions and large disturbances (e.g., microgrid mode transitions, generation/load variations), even with measurement noise and time delays.
Submitted 4 January, 2023;
originally announced January 2023.
-
LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
Authors:
Xun Gong,
Yu Wu,
Jinyu Li,
Shujie Liu,
Rui Zhao,
Xie Chen,
Yanmin Qian
Abstract:
Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history for a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input and achieve the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
Submitted 17 November, 2022;
originally announced November 2022.
-
A Random Forest and Current Fault Texture Feature-Based Method for Current Sensor Fault Diagnosis in Three-Phase PWM VSR
Authors:
Lei Kou,
Xiao-dong Gong,
Yi Zheng,
Xiu-hui Ni,
Yang Li,
Quan-de Yuan,
Ya-nan Dong
Abstract:
Three-phase PWM voltage-source rectifier (VSR) systems have been widely used in various energy conversion systems, where current sensors are the key component for state monitoring and system control. Current sensor faults may pose hidden dangers or cause damage to the whole system; therefore, this paper proposes a random forest (RF) and current fault texture feature-based method for current sensor fault diagnosis in three-phase PWM VSR systems. First, the three-phase alternating currents (ACs) of the three-phase PWM VSR are collected to extract the current fault texture features; no additional hardware sensors are needed, which avoids introducing additional sources of instability. Then, the current fault texture features are used to train the random forest current sensor fault detection and diagnosis (CSFDD) classifier, which is a data-driven CSFDD classifier. Finally, the effectiveness of the proposed method is verified by simulation experiments. The results show that current sensor faults can be detected and located successfully, and that the method can effectively provide fault locations for maintenance personnel to keep the whole system operating stably.
Submitted 7 November, 2022;
originally announced November 2022.
-
Review on Monitoring, Operation and Maintenance of Smart Offshore Wind Farms
Authors:
Lei Kou,
Yang Li,
Fangfang Zhang,
Xiaodong Gong,
Yinghong Hu,
Quande Yuan,
Wende Ke
Abstract:
In recent years, with the development of wind energy, the number and scale of wind farms have grown rapidly. Because offshore wind farms offer stable wind speeds, are clean, renewable, and non-polluting, and do not occupy cultivated land, they have gradually become a new trend in the wind power industry worldwide. The operation and maintenance mode of offshore wind power is developing in the direction of digitization and intelligence. Research on the monitoring, operation, and maintenance of offshore wind farms is therefore of great significance: it helps reduce operation and maintenance costs, improve power generation efficiency, improve the stability of offshore wind farm systems, and build smart offshore wind farms. This paper analyzes and summarizes the monitoring, operation, and maintenance of offshore wind farms, focusing on the following points: monitoring of "offshore wind power engineering & biological & environment", the monitoring of power equipment, and the operation & maintenance of smart offshore wind farms. Finally, future research challenges in the monitoring, operation, and maintenance of smart offshore wind farms are presented, and future research directions in this field are outlined.
Submitted 31 October, 2022;
originally announced November 2022.
-
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
Authors:
Ziqiang Zhang,
Sanyuan Chen,
Long Zhou,
Yu Wu,
Shuo Ren,
Shujie Liu,
Zhuoyuan Yao,
Xun Gong,
Lirong Dai,
Jinyu Li,
Furu Wei
Abstract:
How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
Submitted 15 June, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition
Authors:
Xun Gong,
Zhikai Zhou,
Yanmin Qian
Abstract:
Modern non-autoregressive (NAR) speech recognition systems aim to accelerate the inference speed; however, they suffer from performance degradation compared with autoregressive (AR) models as well as the huge model size issue. We propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve the NAR performance while reducing the model's size. Frame- and sequence-level objectives are well-designed for transfer learning. To further boost the performance of NAR, a beam search method on Mask-CTC is developed to enlarge the search space during the inference stage. Experiments show that the proposed NAR beam search relatively reduces CER by over 5% on the AISHELL-1 benchmark with a tolerable real-time-factor (RTF) increment. By knowledge transfer, the NAR student that has the same size as the AR teacher obtains relative CER reductions of 8/16% on AISHELL-1 dev/test sets, and over 25% relative WER reductions on LibriSpeech test-clean/other sets. Moreover, the ~9x smaller NAR models achieve ~25% relative CER/WER reductions on both AISHELL-1 and LibriSpeech benchmarks with the proposed knowledge transfer and distillation.
Submitted 15 July, 2022;
originally announced July 2022.
-
An Online Data-Driven Method for Microgrid Secondary Voltage and Frequency Control with Ensemble Koopman Modeling
Authors:
Xun Gong,
Xiaozhe Wang,
Geza Joos
Abstract:
Low inertia, nonlinearity and a high level of uncertainty (varying topologies and operating conditions) pose challenges to microgrid (MG) systemwide operation. This paper proposes an online adaptive Koopman operator optimal control (AKOOC) method for MG secondary voltage and frequency control. Unlike typical data-driven methods that are data-hungry and lack guaranteed stability, the proposed AKOOC requires no warm-up training, yet it guarantees bounded-input-bounded-output (BIBO) stability and even asymptotic stability under some mild conditions. The proposed AKOOC is developed based on an ensemble Koopman state space modeling with full basis functions that combines both linear and nonlinear bases without the need for event detection or switching. An iterative learning method is also developed to exploit model parameters, ensuring the effectiveness and the adaptiveness of the designed control. Simulation studies in the 4-bus (with detailed inner-loop control) MG system and the 34-bus MG system showed improved modeling accuracy and control, verifying the effectiveness of the proposed method under various changes in operating conditions, even with time delay, measurement noise, and missing measurements.
Submitted 11 July, 2022;
originally announced July 2022.
-
A rigorous multi-population multi-lane hybrid traffic model and its mean-field limit for dissipation of waves via autonomous vehicles
Authors:
Nicolas Kardous,
Amaury Hayat,
Sean T. McQuade,
Xiaoqian Gong,
Sydney Truong,
Tinhinane Mezair,
Paige Arnold,
Ryan Delorenzo,
Alexandre Bayen,
Benedetto Piccoli
Abstract:
In this paper, a multi-lane multi-population microscopic model, which presents stop-and-go waves, is proposed to simulate traffic on a ring-road. Vehicles are divided between human-driven and autonomous vehicles (AV). Control strategies are designed with the ultimate goal of using a small number of AVs (less than 5% penetration rate) to represent Lagrangian control actuators that can smooth the multilane traffic flow and dissipate the stop-and-go waves. This in turn may reduce fuel consumption and emissions.
The lane-changing mechanism is based on three components that we treat as parameters in the model: safety, incentive and cool-down time. The choice of these parameters in the lane-change mechanism is critical to modeling traffic accurately, because different parameter values can lead to drastically different traffic behaviors. In particular, the number of lane-changes and the speed variance are highly affected by the choice of parameters. Despite this modeling issue, when using sufficiently simple and robust controllers for AVs, the stabilization of uniform flow steady-state is effective for any realistic value of the parameters, and ultimately bypasses the observed modeling issue. Our approach is based on accurate and rigorous mathematical models, which allows a limit procedure that is termed, in gas dynamic terminology, mean-field. In simple words, by increasing the human-driven population to infinity, a system of coupled ordinary and partial differential equations is obtained. Moreover, control problems also pass to the limit, allowing the design to be tackled at different scales.
Submitted 13 May, 2022;
originally announced May 2022.
-
Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
Authors:
Xun Gong,
Yizhou Lu,
Zhikai Zhou,
Yanmin Qian
Abstract:
Accent variability has posed a huge challenge to automatic speech recognition (ASR) modeling. Although one-hot accent vector based adaptation systems are commonly used, they require prior knowledge about the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge, which yields limited improvements. In this work, we aim to tackle these problems with a novel layer-wise adaptation structure injected into the E2E ASR model encoder. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through the linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that the proposed adaptation structure brings 12% and 10% relative word error rate (WER) reduction on the AESRC2020 accent dataset and the Librispeech dataset, respectively, compared to the baseline.
Submitted 21 April, 2022;
originally announced April 2022.
-
Towards Integrated Sensing and Communications for 6G
Authors:
Qi Wang,
Anastasios Kakkavas,
Xitao Gong,
Richard A. Stirling-Gallacher
Abstract:
For the next generation of mobile communications systems, the integration of sensing and communications promises benefits in terms of spectrum utilization, cost, latency, area and weight. In this paper, we categorize and summarize the key features and technical considerations for different integration approaches and discuss related waveform design issues for a future 6G system. We provide results on new candidate waveforms for monostatic sensing and finally highlight important open issues and directions that deserve future in-depth research.
Submitted 12 January, 2022;
originally announced January 2022.
-
Continuous Human Action Detection Based on Wearable Inertial Data
Authors:
Xia Gong,
Yan Lu,
Haoran Wei
Abstract:
Human action detection is a hot topic, which is widely used in video surveillance, human-machine interfaces, healthcare monitoring, gaming, dance training and musical instrument teaching. Because inertial sensors are low cost, portable, and not confined to an operating space, they are suitable for detecting human action. In real-world applications, actions of interest appear among actions of no interest without pauses in between. Recognizing and detecting actions of interest in continuous action streams is more challenging and more useful for real applications. Based on inertial sensors and the C-MHAD smart TV gesture recognition dataset, this paper used different inertial sensor feature formats, then compared the performance of different deep neural network structures on these feature formats. Experimental results show that the best performance was achieved by image-based inertial features with a convolutional neural network, which achieved an F1 score of 51.1%.
Submitted 11 December, 2021;
originally announced December 2021.
-
ChMusic: A Traditional Chinese Music Dataset for Evaluation of Instrument Recognition
Authors:
Xia Gong,
Yuxiang Zhu,
Haidi Zhu,
Haoran Wei
Abstract:
Musical instrument recognition is a widely used application in music information retrieval. As most previous musical instrument recognition datasets focus on western musical instruments, it is difficult for researchers to study and evaluate traditional Chinese musical instrument recognition. This paper proposes a traditional Chinese music dataset for model training and performance evaluation, named ChMusic. This dataset is free and publicly available; it contains recordings of 11 traditional Chinese musical instruments and 55 traditional Chinese music excerpts. An evaluation standard is then proposed based on the ChMusic dataset. With this standard, researchers can compare their results following the same rule, and results from different researchers will become comparable.
Submitted 11 December, 2021; v1 submitted 18 August, 2021;
originally announced August 2021.
-
Integrated Framework of Vehicle Dynamics, Instabilities, Energy Models, and Sparse Flow Smoothing Controllers
Authors:
Jonathan W. Lee,
George Gunter,
Rabie Ramadan,
Sulaiman Almatrudi,
Paige Arnold,
John Aquino,
William Barbour,
Rahul Bhadani,
Joy Carpio,
Fang-Chieh Chou,
Marsalis Gibson,
Xiaoqian Gong,
Amaury Hayat,
Nour Khoudari,
Abdul Rahman Kreidieh,
Maya Kumar,
Nathan Lichtlé,
Sean McQuade,
Brian Nguyen,
Megan Ross,
Sydney Truong,
Eugene Vinitsky,
Yibo Zhao,
Jonathan Sprinkle,
Benedetto Piccoli
, et al. (3 additional authors not shown)
Abstract:
This work presents an integrated framework of: vehicle dynamics models, with particular attention to instabilities and traffic waves; vehicle energy models, with particular attention to accurate energy values for strongly unsteady driving profiles; and sparse Lagrangian controls via automated vehicles, with a focus on controls that can be executed via existing technology such as adaptive cruise control systems. This framework serves as a key building block in developing control strategies for human-in-the-loop traffic flow smoothing on real highways. In this contribution, we outline the fundamental merits of integrating vehicle dynamics and energy modeling into a single framework, and we demonstrate the energy impact of sparse flow smoothing controllers via simulation results.
Submitted 22 April, 2021;
originally announced April 2021.
-
Limitations and Improvements of the Intelligent Driver Model (IDM)
Authors:
Saleh Albeaik,
Alexandre Bayen,
Maria Teresa Chiri,
Xiaoqian Gong,
Amaury Hayat,
Nicolas Kardous,
Alexander Keimer,
Sean T. McQuade,
Benedetto Piccoli,
Yiling You
Abstract:
This contribution analyzes the widely used and well-known "intelligent driver model" (IDM for short), which is a second order car-following model governed by a system of ordinary differential equations. Although this model was intensively studied in recent years for properly capturing traffic phenomena and driver braking behavior, a rigorous study of the well-posedness has, to our knowledge, never been performed. First it is shown that, for a specific class of initial data, the vehicles' velocities become negative or even diverge to $-\infty$ in finite time, both undesirable properties for a car-following model. Various modifications of the IDM are then proposed in order to avoid such ill-posedness. The theoretical remediation of the model, rather than post facto ad-hoc modification of code implementations, allows a more sound numerical implementation and preservation of the model features. Indeed, to avoid inconsistencies and ensure dynamics close to those of the original model, one may need to inspect and clean large input data, which may be practically impossible for large-scale simulations. Although well-posedness issues occur only for specific initial data, this may happen frequently when different traffic scenarios are analyzed, and especially in the presence of lane-changing, on-ramps and other network components, as is the case for most commonly used micro-simulators. On the other hand, it is shown that well-posedness can be guaranteed by straightforward improvements, such as those obtained by slightly changing the acceleration to prevent the velocity from becoming negative.
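The negative-velocity behavior mentioned in the abstract can be illustrated with the commonly cited form of the IDM acceleration. The sketch below is not the authors' code: the parameter values, the initial data, and the explicit Euler step are assumptions chosen only to show how a small gap produces a large deceleration.

```python
import math


def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, a=1.0, b=1.5, s0=2.0, delta=4):
    """Commonly cited IDM acceleration of a follower vehicle.

    v   : follower speed (m/s)
    gap : bumper-to-bumper distance to the leader (m), must be > 0
    dv  : approach rate, follower speed minus leader speed (m/s)
    The remaining parameters are illustrative defaults (assumptions).
    """
    desired_gap = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))
    return a * (1.0 - (v / v0) ** delta - (desired_gap / gap) ** 2)


def euler_step(v, gap, v_leader, dt=0.1):
    """One explicit Euler update of the follower speed (assumed integrator)."""
    return v + idm_acceleration(v, gap, v - v_leader) * dt


# Assumed initial data: a slow follower very close to a stopped leader.
v_next = euler_step(v=1.0, gap=0.5, v_leader=0.0)
print(v_next)  # negative: the unmodified model drives the vehicle backwards
```

With these assumed values the desired gap is far larger than the actual gap, so the braking term dominates and a single step already yields a negative speed, which is the kind of behavior the proposed modifications are meant to rule out.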
Submitted 1 April, 2022; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Enhanced Few-shot Learning for Intrusion Detection in Railway Video Surveillance
Authors:
Xiao Gong,
Xi Chen,
Wei Chen
Abstract:
Video surveillance has gained increasing popularity in recent years as an aid to railway intrusion detection. However, efficient and accurate intrusion detection remains a challenging issue due to: (a) limited sample number: only a small sample size (or portion) of intrusive video frames is available; (b) low inter-scene dissimilarity: various railway track area scenes are captured by cameras installed in different landforms; (c) high intra-scene similarity: the video frames captured by an individual camera share the same background. In this paper, an efficient few-shot learning solution is developed to address the above issues. In particular, an enhanced model-agnostic meta-learner is trained using both the original video frames and segmented masks of the track area extracted from the video. Moreover, theoretical analysis and engineering solutions are provided to cope with the highly similar video frames in the meta-model training phase. The proposed method is tested on a realistic railway video dataset. Numerical results show that the enhanced meta-learner successfully adapts to unseen scenes with only a few newly collected video frame samples, and its intrusion detection accuracy outperforms that of standard randomly initialized supervised learning.
Submitted 9 November, 2020;
originally announced November 2020.
-
Achievable Rates of Opportunistic Cognitive Radio Systems Using Reconfigurable Antennas with Imperfect Sensing and Channel Estimation
Authors:
Hassan Yazdani,
Azadeh Vosoughi,
Xun Gong
Abstract:
We consider an opportunistic cognitive radio (CR) system in which the secondary transmitter (SUtx) is equipped with a reconfigurable antenna (RA). Utilizing the beam steering capability of the RA, we consider a design framework for integrated sector-based spectrum sensing and data communication. In this framework, SUtx senses the spectrum and detects the beam corresponding to the active primary user's (PU) location. SUtx also sends training symbols (prior to data symbols) to enable channel estimation at the secondary receiver (SUrx) and selection of the strongest beam between SUtx and SUrx for data transmission. We establish a lower bound on the achievable rates of the SUtx-SUrx link in the presence of spectrum sensing and channel estimation errors, and errors due to incorrect detection of the beam corresponding to the PU's location and incorrect selection of the strongest beam for data transmission. We formulate a novel constrained optimization problem, aiming at maximizing the derived achievable rate lower bound subject to average transmit and interference power constraints. We optimize the durations of spatial spectrum sensing and channel training as well as the data symbol transmission power. Our numerical results demonstrate that, between optimizing spectrum sensing and channel training durations, the latter is more important for providing higher achievable rates.
Submitted 8 July, 2020;
originally announced July 2020.
-
AutoSpeech: Neural Architecture Search for Speaker Recognition
Authors:
Shaojin Ding,
Tianlong Chen,
Xinyu Gong,
Weiwei Zha,
Zhangyang Wang
Abstract:
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be a natural fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the CNN architectures derived by the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
Submitted 31 August, 2020; v1 submitted 6 May, 2020;
originally announced May 2020.
-
Multi-modal Datasets for Super-resolution
Authors:
Haoran Li,
Weihong Quan,
Meijun Yan,
Jin zhang,
Xiaoli Gong,
Jin Zhou
Abstract:
Nowadays, most datasets used to train and evaluate super-resolution models are single-modal simulation datasets. However, due to the variety of image degradation types in the real world, models trained on single-modal simulation datasets do not always have good robustness and generalization ability in different degradation scenarios. Previous work tended to focus only on true-color images. In contrast, we first propose a real-world black-and-white old photo dataset for super-resolution (OID-RW), which is constructed using two methods: manually filling pixels and shooting with different cameras. The dataset contains 82 groups of images, including 22 groups of character type and 60 groups of landscape and architecture. We also propose a multi-modal degradation dataset (MDD400) to address super-resolution reconstruction in real-life image degradation scenarios. We simulate the process of generating degraded images by the following four methods: an interpolation algorithm, a CNN, a GAN, and capturing videos with different bit rates. Our experiments demonstrate not only that the models trained on our dataset have better generalization capability and robustness, but also that the trained images can maintain better edge contours and texture features.
Submitted 13 April, 2020;
originally announced April 2020.
-
Fully Dense Neural Network for the Automatic Modulation Recognition
Authors:
Miao Du,
Qin Yu,
Shaomin Fei,
Chen Wang,
Xiaofeng Gong,
Ruisen Luo
Abstract:
Nowadays, various convolutional neural network (CNN) structures are mainly used to extract features from radio data or spectrograms in automatic modulation recognition (AMR). Because they rely on expert experience and spectrograms, they not only increase the difficulty of preprocessing, but also consume a lot of memory. In order to directly use the in-phase and quadrature (IQ) data obtained by the receiver and to extract features more efficiently so as to improve the recognition rate of the modulation mode, this paper proposes a new network structure called Fully Dense Neural Network (FDNN). This network uses residual blocks to extract features, dense connections to reduce model size, and an attention mechanism for recalibration. Experiments on RML2016.10a show that this network has a higher recognition rate and lower model complexity. They also show that the FDNN model with dense connections can not only extract features effectively but also greatly reduce model parameters, which is a significant contribution to the application of deep learning to intelligent radio systems.
Submitted 7 December, 2019;
originally announced December 2019.
-
A New Three-stage Curriculum Learning Approach to Deep Network Based Liver Tumor Segmentation
Authors:
Huiyu Li,
Xiabi Liu,
Said Boumaraf,
Weihua Liu,
Xiaopeng Gong,
Xiaohong Ma
Abstract:
Automatic segmentation of liver tumors in medical images is crucial for computer-aided diagnosis and therapy. It is a challenging task, since the tumors are notoriously small against the background voxels. This paper proposes a new three-stage curriculum learning approach for training deep networks to tackle this small object segmentation problem. The learning in the first stage is performed on the whole input to obtain an initial deep network for tumor segmentation. Then the second stage of learning focuses on strengthening tumor-specific features by continuing to train the network on the tumor patches. Finally, we retrain the network on the whole input in the third stage, so that the tumor-specific features and the global context can be integrated ideally under the segmentation objective. Benefitting from the proposed learning approach, we only need to employ one single network to segment the tumors directly. We evaluated our approach on the 2017 MICCAI Liver Tumor Segmentation challenge dataset. In the experiments, our approach exhibits significant improvement compared with the commonly used cascaded counterpart.
Submitted 17 October, 2019;
originally announced October 2019.
-
A New Deep Learning Method for Image Deblurring in Optical Microscopic Systems
Authors:
Huangxuan Zhao,
Ziwen Ke,
Ningbo Chen,
Ke Li,
Lidai Wang,
Xiaojing Gong,
Wei Zheng,
Liang Song,
Zhicheng Liu,
Dong Liang,
Chengbo Liu
Abstract:
Deconvolution is the most commonly used image processing method to remove the blur caused by the point-spread-function (PSF) in optical imaging systems. While this method has been successful in deblurring, it suffers from several disadvantages, including being slow, since it takes many iterations, and being suboptimal in cases where the experimental operator chosen to represent the PSF is not optimal. In this paper, we propose a deep-learning-based deblurring method applicable to optical microscopic imaging systems. We tested the proposed method on database data, simulated data, and experimental data (including 2D optical microscopic data and 3D photoacoustic microscopic data), all of which showed much improved deblurred results compared to deconvolution. To quantify the improved performance, we compared our results against several deconvolution methods. Our results are better than those of conventional techniques and do not require multiple iterations or a pre-determined experimental operator. Our method has the advantages of simple operation, short computation time, good deblurring results and wide applicability to all types of optical microscopic imaging systems. The deep learning approach opens up a new path for deblurring and can be applied in various biomedical imaging fields.
Submitted 8 October, 2019;
originally announced October 2019.
-
A Radio Signal Modulation Recognition Algorithm Based on Residual Networks and Attention Mechanisms
Authors:
Ruisen Luo,
Tao Hu,
Zuodong Tang,
Chen Wang,
Xiaofeng Gong,
Haiyan Tu
Abstract:
To solve the problem of inaccurate recognition of communication signal modulation types, an RNN recognition algorithm combining a residual block network with an attention mechanism is proposed. In this method, 10 kinds of communication signals with Gaussian white noise are generated from standard data sets, such as MASK, MPSK, MFSK, OFDM, 16QAM, AM and FM. Based on the original RNN, a residual block network is added to solve the vanishing gradient problem caused by deep network layers. An attention mechanism is added to the network to accelerate the gradient descent. In the experiment, 16QAM, 2FSK and 4FSK are used as actual samples, IQ data frames of the signals are used as input, and the RNN combined with the residual block network and attention mechanism is trained. The final recognition results show that the average recognition rate of real-time signals is over 93%. The network is highly robust and of good practical value.
Submitted 26 September, 2019;
originally announced September 2019.
-
AutoGAN: Neural Architecture Search for Generative Adversarial Networks
Authors:
Xinyu Gong,
Shiyu Chang,
Yifan Jiang,
Zhangyang Wang
Abstract:
Neural architecture search (NAS) has witnessed prevailing success in image classification and (very recently) segmentation tasks. In this paper, we present the first preliminary study on introducing the NAS algorithm to generative adversarial networks (GANs), dubbed AutoGAN. The marriage of NAS and GANs faces its unique challenges. We define the search space for the generator architectural variations and use an RNN controller to guide the search, with parameter sharing and dynamic-resetting to accelerate the process. Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way. Experiments validate the effectiveness of AutoGAN on the task of unconditional image generation. Specifically, our discovered architectures achieve highly competitive performance compared to current state-of-the-art hand-crafted GANs, e.g., setting new state-of-the-art FID scores of 12.42 on CIFAR-10, and 31.01 on STL-10, respectively. We also conclude with a discussion of the current limitations and future potential of AutoGAN. The code is available at https://github.com/TAMU-VITA/AutoGAN
Submitted 10 August, 2019;
originally announced August 2019.
-
Multi-layer Attention Mechanism for Speech Keyword Recognition
Authors:
Ruisen Luo,
Tianran Sun,
Chen Wang,
Miao Du,
Zuodong Tang,
Kai Zhou,
Xiaofeng Gong,
Xiaomei Yang
Abstract:
As an important part of speech recognition technology, automatic speech keyword recognition has been intensively studied in recent years. Such technology becomes especially pivotal in situations with limited infrastructure and computational resources, such as voice command recognition in vehicles and robot interaction. At present, the mainstream methods in automatic speech keyword recognition are based on long short-term memory (LSTM) networks with an attention mechanism. However, due to inevitable information losses in the LSTM layer during feature extraction, the calculated attention weights are biased. In this paper, a novel approach, namely the Multi-layer Attention Mechanism, is proposed to handle the problem of inaccurate attention weights. The key idea is that, in addition to the conventional attention mechanism, information from layers prior to feature extraction and LSTM is introduced into the attention weight calculations. Therefore, the attention weights are more accurate because the overall model can have more precise and focused areas. We conduct a comprehensive comparison and analysis of the keyword spotting performance of a convolutional neural network, a bi-directional LSTM recurrent neural network, and a recurrent neural network with the proposed attention mechanism on the Google Speech Command V2 dataset. Experimental results are favorable for the proposed method and demonstrate its validity. The proposed multi-layer attention methods can be useful for other research related to object spotting.
Submitted 10 July, 2019;
originally announced July 2019.