Search | arXiv e-print repository

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Authors: Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei

Abstract: Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including KeSpeech. We will open-source our work to promote reproducible research.

Submitted 16 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

arXiv:2305.12231 [pdf, other]

Bi-VLGM : Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation

Authors: Chen Wenting, Liu Jie, Yuan Yixuan

Abstract: Medical reports with substantial information can be naturally complementary to medical images for computer vision tasks, and the modality gap between vision and language can be solved by vision-language matching (VLM). However, current vision-language models distort the intra-modal relation and mainly include class information in prompt learning, which is insufficient for the segmentation task. In this paper, we introduce a Bi-level class-severity-aware Vision-Language Graph Matching (Bi-VLGM) for text guided medical image segmentation, composed of a word-level VLGM module and a sentence-level VLGM module, to exploit the class-severity-aware relation among visual-textual features. In word-level VLGM, to mitigate the distorted intra-modal relation during VLM, we reformulate VLM as a graph matching problem and introduce a vision-language graph matching (VLGM) to exploit the high-order relation among visual-textual features. Then, we perform VLGM between the local features for each class region and class-aware prompts to bridge their gap. In sentence-level VLGM, to provide disease severity information for the segmentation task, we introduce severity-aware prompting to quantify the severity level of retinal lesions, and perform VLGM between the global features and the severity-aware prompts. By exploiting the relation between the local (global) and class (severity) features, the segmentation model can selectively learn the class-aware and severity-aware information to promote performance. Extensive experiments prove the effectiveness of our method and its superiority to existing methods. Source code is to be released.

Submitted 20 May, 2023; originally announced May 2023.

arXiv:1909.04779 [pdf, other]

Localized Adversarial Training for Increased Accuracy and Robustness in Image Classification

Authors: Eitan Rothberg, Tingting Chen, Luo Jie, Hao Ji

Abstract: Today's state-of-the-art image classifiers fail to correctly classify carefully manipulated adversarial images. In this work, we develop a new, localized adversarial attack that generates adversarial examples by imperceptibly altering the backgrounds of normal images. We first use this attack to highlight the unnecessary sensitivity of neural networks to changes in the background of an image, then use it as part of a new training technique: localized adversarial training. By including locally adversarial images in the training set, we are able to create a classifier that suffers less loss than a non-adversarially trained counterpart model on both natural and adversarial inputs. The evaluation of our localized adversarial training algorithm on MNIST and CIFAR-10 datasets shows decreased accuracy loss on natural images, and increased robustness against adversarial inputs.

Submitted 10 September, 2019; originally announced September 2019.

Comments: 4 pages (excluding references). Presented at AdvML: 1st Workshop on Adversarial Learning Methods for Machine Learning and Data Mining at KDD '19

arXiv:1711.04646 [pdf]

doi 10.1364/OE.26.004243

Orbital-angular-momentum mode-group multiplexed transmission over a graded-index ring-core fiber based on receive diversity and maximal ratio combining

Authors: Junwei Zhang, Guoxuan Zhu, Liu Jie, Xiong Wu, Jianbo Zhu, Cheng Du, Wenyong Luo, Siyuan Yu

Abstract: An orbital-angular-momentum (OAM) mode-group multiplexing (MGM) scheme based on a graded-index ring-core fiber (GIRCF) is proposed, in which a single-input two-output (or receive diversity) architecture is designed for each MG channel and simple digital signal processing (DSP) is utilized to adaptively resist the mode partition noise resulting from random intra-group mode crosstalk. There is no need for complex multiple-input multiple-output (MIMO) equalization in this scheme. Furthermore, the signal-to-noise ratio (SNR) of the received signals can be improved if a simple maximal ratio combining (MRC) technique is employed on the receiver side to efficiently take advantage of the receiver's diversity gain. Intensity-modulated direct-detection (IM-DD) systems transmitting three OAM mode groups with a total of 100-Gb/s discrete multi-tone (DMT) signals over a 1-km GIRCF and two OAM mode groups with a total of 40-Gb/s DMT signals over an 18-km GIRCF are experimentally demonstrated, respectively, to confirm the feasibility of our proposed OAM-MGM scheme.

Submitted 9 November, 2017; originally announced November 2017.

Comments: 13 pages, 6 figures

Showing 1–4 of 4 results for author: Jie, L