Showing 1–27 of 27 results for author: Shen, P

Searching in archive eess.
  1. arXiv:2505.22051  [pdf, ps, other]

    eess.AS

    ARiSE: Auto-Regressive Multi-Channel Speech Enhancement

    Authors: Pengjie Shen, Xueliang Zhang, Zhong-Qiu Wang

    Abstract: We propose ARiSE, an auto-regressive algorithm for multi-channel speech enhancement. ARiSE improves existing deep neural network (DNN) based frame-online multi-channel speech enhancement models by introducing auto-regressive connections, where the estimated target speech at previous frames is leveraged as extra input features to help the DNN estimate the target speech at the current frame. The ext…

    Submitted 28 May, 2025; originally announced May 2025.

  2. arXiv:2505.13079  [pdf, ps, other]

    eess.AS cs.AI

    Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the W…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: To appear in Interspeech 2025

  3. arXiv:2505.05114  [pdf, other]

    eess.AS cs.SD

    Listen to Extract: Onset-Prompted Target Speaker Extraction

    Authors: Pengjie Shen, Kangrui Chen, Shulin He, Pengru Chen, Shuqi Yuan, He Kong, Xueliang Zhang, Zhong-Qiu Wang

    Abstract: We propose $\textit{listen to extract}$ (LExt), a highly effective yet extremely simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: in submission

  4. arXiv:2503.21401  [pdf, other]

    cs.RO cs.LG eess.SY

    AcL: Action Learner for Fault-Tolerant Quadruped Locomotion Control

    Authors: Tianyu Xu, Yaoyu Cheng, Pinxi Shen, Lin Zhao

    Abstract: Quadrupedal robots can learn versatile locomotion skills but remain vulnerable when one or more joints lose power. In contrast, dogs and cats can adopt limping gaits when injured, demonstrating their remarkable ability to adapt to physical conditions. Inspired by such adaptability, this paper presents Action Learner (AcL), a novel teacher-student reinforcement learning framework that enables quadr…

    Submitted 28 March, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

  5. arXiv:2502.15264  [pdf, other]

    cs.CL cs.SD eess.AS

    Retrieval-Augmented Speech Recognition Approach for Domain Challenges

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that inc…

    Submitted 21 February, 2025; originally announced February 2025.

  6. arXiv:2409.02239  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging ta…

    Submitted 5 September, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted to IEEE SLT 2024

  7. arXiv:2404.15312  [pdf, other]

    eess.SP cs.CV

    Realtime Person Identification via Gait Analysis

    Authors: Shanmuga Venkatachalam, Harideep Nair, Prabhu Vellaisamy, Yongqi Zhou, Ziad Youssfi, John Paul Shen

    Abstract: Each person has a unique gait, i.e., walking style, that can be used as a biometric for personal identification. Recent works have demonstrated effective gait recognition using deep neural networks; however, most of these works focus on classification accuracy rather than model efficiency. In order to perform gait recognition using wearable devices on the edge, it is imperative to dev…

    Submitted 2 April, 2024; originally announced April 2024.

  8. arXiv:2312.10964  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative linguistic representation for spoken language identification

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage such pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the d…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE ASRU2023

  9. arXiv:2312.10959  [pdf, other]

    cs.SD cs.CL eess.AS

    Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this paper, to better address these tasks, we first introduce speaker labels into an autoregressive transformer-based speech recognition model to support multi-speaker overlapped speech recognition. Then, to improve speaker diariza…

    Submitted 18 December, 2023; originally announced December 2023.

  10. arXiv:2311.01003  [pdf, other]

    eess.SY cs.RO

    Minimum Snap Trajectory Generation and Control for an Under-actuated Flapping Wing Aerial Vehicle

    Authors: Chen Qian, Rui Chen, Peiyao Shen, Yongchun Fang, Jifu Yan, Tiefeng Li

    Abstract: This paper presents both the trajectory generation and tracking control strategies for an underactuated flapping wing aerial vehicle (FWAV). First, the FWAV dynamics are analyzed from a practical perspective. Then, based on these analyses, we demonstrate the differential flatness of the FWAV system, and d…

    Submitted 2 November, 2023; originally announced November 2023.

  11. arXiv:2310.13471  [pdf, ps, other]

    eess.AS cs.SD

    Neural domain alignment for spoken language recognition based on optimal transport

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Domain shift poses a significant challenge in cross-domain spoken language recognition (SLR) by reducing its effectiveness. Unsupervised domain adaptation (UDA) algorithms have been explored to address domain shifts in SLR without relying on class labels in the target domain. One successful UDA approach focuses on learning domain-invariant representations to align feature distributions between dom…

    Submitted 20 October, 2023; originally announced October 2023.

  12. arXiv:2309.16093  [pdf, ps, other]

    eess.AS cs.SD

    Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a temporal connectionist temporal classification (CTC) base…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  13. arXiv:2309.13650  [pdf, ps, other]

    eess.AS cs.SD

    Cross-modal Alignment with Optimal Transport for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Temporal connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required, which destroys its fast parallel decoding property. Several studies have been proposed to transfer linguistic knowledge from a pretraine…

    Submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023

  14. arXiv:2309.10832  [pdf, ps, other]

    cs.SD eess.AS

    Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

    Authors: Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel speech enhancement and often takes short-time Fourier Transform (STFT) as inputs directly. To fully leverage the spatial information, we introduce a method usin…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: text overlap with arXiv:2309.10393

  15. arXiv:2305.11651  [pdf, other]

    cs.IT cs.MA cs.PF eess.SY

    Channel Cycle Time: A New Measure of Short-term Fairness

    Authors: Pengfei Shen, Yulin Shao, Haoyuan Pan, Lu Lu, Yonina C. Eldar

    Abstract: This paper puts forth a new metric, dubbed channel cycle time (CCT), to measure the short-term fairness of communication networks. CCT characterizes the average duration between two consecutive successful transmissions of a user, during which all other users successfully accessed the channel at least once. In contrast to existing short-term fairness measures, CCT provides more comprehensive insigh…

    Submitted 14 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  16. arXiv:2212.01106   

    eess.AS eess.SP

    ExARN: self-attending RNN for target speaker extraction

    Authors: Pengjie Shen, Shulin He, Xueliang Zhang

    Abstract: Target speaker extraction aims to extract the target speaker, specified by an enrollment utterance, in an environment with other competing speakers. Therefore, the task needs to solve two problems, speaker identification and separation, at the same time. In this paper, we combine self-attention and Recurrent Neural Networks (RNN). Further, we exploit various ways of combining different auxiliary inform…

    Submitted 12 March, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: The overall quality of the article is not good enough

  17. arXiv:2209.07313  [pdf, other]

    eess.IV cs.CV

    HarDNet-DFUS: An Enhanced Harmonically-Connected Network for Diabetic Foot Ulcer Image Segmentation and Colonoscopy Polyp Segmentation

    Authors: Ting-Yu Liao, Ching-Hui Yang, Yu-Wen Lo, Kuan-Ying Lai, Po-Huai Shen, Youn-Long Lin

    Abstract: We present a neural network architecture for medical image segmentation of diabetic foot ulcers and colonoscopy polyps. Diabetic foot ulcers are caused by neuropathic and vascular complications of diabetes mellitus. In order to provide a proper diagnosis and treatment, wound care professionals need to extract accurate morphological features from the foot wounds. Using computer-aided systems is a p…

    Submitted 15 September, 2022; originally announced September 2022.

  18. arXiv:2207.14578  [pdf, other]

    cs.CL cs.SD eess.AS

    Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, compared to character-based modeling units, pronunciation-based modeling units could improve the sharing of modeling units in model training but meet homophone problems. In this study, we propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems. The proposed encodin…

    Submitted 29 July, 2022; originally announced July 2022.

  19. arXiv:2207.06309  [pdf, other]

    cs.IT cs.NI eess.SY

    Dynamic gNodeB Sleep Control for Energy-Conserving 5G Radio Access Network

    Authors: Pengfei Shen, Yulin Shao, Qi Cao, Lu Lu

    Abstract: 5G radio access network (RAN) is consuming much more energy than legacy RAN due to the denser deployments of gNodeBs (gNBs) and higher single-gNB power consumption. In an effort to achieve an energy-conserving RAN, this paper develops a dynamic on-off switching paradigm, where the ON/OFF states of gNBs can be dynamically configured according to the evolution of the associated users. We formulate…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Keywords: Base station sleep control, 5G, radio access network, Markov decision process, greedy policy, index policy

  20. arXiv:2204.03888  [pdf, other]

    cs.CL cs.SD eess.AS

    Transducer-based language embedding for spoken language identification

    Authors: Peng Shen, Xugang Lu, Hisashi Kawai

    Abstract: The acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features that lack the usage of explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefi…

    Submitted 29 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: This paper was accepted by Interspeech 2022

  21. arXiv:2203.17036  [pdf, ps, other]

    eess.AS cs.CL

    Partial Coupling of Optimal Transport for Spoken Language Identification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: In order to reduce domain discrepancy to improve the performance of cross-domain spoken language identification (SLID) system, as an unsupervised domain adaptation (UDA) method, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT). A discrepancy measurement based on OT was adopted for JDA between training and test data sets. In our previous study, it was supp…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: This work was submitted to INTERSPEECH 2022

  22. arXiv:2106.12864  [pdf, other]

    eess.IV cs.CV cs.LG

    A Systematic Collection of Medical Image Datasets for Deep Learning

    Authors: Johann Li, Guangming Zhu, Cong Hua, Mingtao Feng, Basheer Bennamoun, Ping Li, Xiaoyuan Lu, Juan Song, Peiyi Shen, Xu Xu, Lin Mei, Liang Zhang, Syed Afaq Ali Shah, Mohammed Bennamoun

    Abstract: The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analy…

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: This paper has been submitted to one journal

  23. arXiv:2104.03004  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Generative probability models are widely used for speaker verification (SV). However, the generative models lack discriminative feature selection ability. As a hypothesis test, the SV can be regarded as a binary classification task which can be designed as a Siamese neural network (SiamNN) with discriminative training. However, in most of the discriminative training for SiamNN, only the dis…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2101.03329

  24. arXiv:2101.03329  [pdf, ps, other]

    eess.AS cs.SD

    Coupling a generative model with a discriminative learning framework for speaker verification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: The speaker verification (SV) task is to decide whether an utterance is spoken by a target or an imposter speaker. For most studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features and compared with a threshold for making a decision. However, the generative model usually focuses on individual feature distributions, does not have the discr…

    Submitted 24 November, 2021; v1 submitted 9 January, 2021; originally announced January 2021.

  25. arXiv:2012.13152  [pdf, ps, other]

    cs.LG cs.CL cs.SD eess.AS

    Unsupervised neural adaptation model based on optimal transport for spoken language identification

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as to reduce the distribution dis…

    Submitted 24 December, 2020; originally announced December 2020.

  26. arXiv:2007.06013  [pdf, other]

    cs.CV eess.IV

    MeDaS: An open-source platform as service to help break the walls between medicine and informatics

    Authors: Liang Zhang, Johann Li, Ping Li, Xiaoyuan Lu, Peiyi Shen, Guangming Zhu, Syed Afaq Shah, Mohammed Bennamoun, Kun Qian, Björn W. Schuller

    Abstract: In the past decade, deep learning (DL) has achieved unprecedented success in numerous fields including computer vision, natural language processing, and healthcare. In particular, DL is experiencing an increasing development in applications for advanced medical image analysis in terms of analysis, segmentation, classification, and more. On the one hand, tremendous needs that leverage the po…

    Submitted 13 July, 2020; v1 submitted 12 July, 2020; originally announced July 2020.

    Comments: layout error fixed

  27. arXiv:1912.12011  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Cross-scale Attention Model for Acoustic Event Classification

    Authors: Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai

    Abstract: A major advantage of a deep convolutional neural network (CNN) is that the focused receptive field size is increased by stacking multiple convolutional layers. Accordingly, the model can explore the long-range dependency of features from the top layers. However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range depende…

    Submitted 15 June, 2020; v1 submitted 27 December, 2019; originally announced December 2019.