Search | arXiv e-print repository

Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

Authors: Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu

Abstract: Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and has shown potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take spoken questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen to the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on listened spoken content and perceived emotional cues. We conduct experiments to demonstrate the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.

Submitted 18 January, 2025; originally announced January 2025.

Comments: Accepted by ICASSP 2025

arXiv:2412.08237 [pdf, other]

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

Authors: Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu

Abstract: It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.

Submitted 12 December, 2024; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: Technical Report

arXiv:2412.01100 [pdf, other]

The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024

Authors: Shuoyi Zhou, Yixuan Zhou, Weiqin Li, Jun Chen, Runchuan Ye, Weihao Wu, Zijian Lin, Shun Lei, Zhiyong Wu

Abstract: This paper describes the zero-shot spontaneous style TTS system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous style voice cloning. To improve speech intelligibility, we introduce the Classifier-Free Guidance (CFG) strategy in the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we adopt effective data preprocessing operations and fine-tune our model with selected high-quality spontaneous speech data. The official evaluations in the CoVoC constrained track show that our system achieves the best speech naturalness MOS of 3.80 and obtains considerable speech quality and speaker similarity results.

Submitted 4 February, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

Comments: Accepted by ISCSLP 2024

arXiv:2411.18266 [pdf]

Wearable intelligent throat enables natural speech in stroke patients with dysarthria

Authors: Chenyu Tang, Shuo Gao, Cong Li, Wentian Yi, Yuxuan Jin, Xiaoxue Zhai, Sixuan Lei, Hongbei Meng, Zibo Zhang, Muzi Xu, Shengbo Wang, Xuhang Chen, Chenxi Wang, Hongyun Yang, Ningli Wang, Wenyu Wang, Jin Cao, Xiaodong Feng, Peter Smielewski, Yu Pan, Wenhui Song, Martin Birchall, Luigi G. Occhipinti

Abstract: Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT's LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems.

Submitted 14 March, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

Comments: 5 figures, 45 references

arXiv:2411.04472 [pdf]

Accurate Calculation of Switching Events in Electromagnetic Transient Simulation Considering State Variable Discontinuities

Authors: Sheng Lei

Abstract: Accurate calculation of switching events is important for electromagnetic transient simulation to obtain reliable results. The common presumption of continuous differential state variables could prevent the accurate calculation, thus leading to unreliable results. This paper explores accurately calculating switching events without presuming continuous differential state variables. Possibility of the calculation is revealed by the proposal of related methods. Feasibility and accuracy of the proposed methods are demonstrated and analyzed via numerical case studies.

Submitted 29 March, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

Comments: Accepted by the 2025 IEEE PES General Meeting

arXiv:2409.13216 [pdf, other]

MuCodec: Ultra Low-Bitrate Music Codec

Authors: Yaoxun Xu, Hangting Chen, Jianwei Yu, Wei Tan, Rongzhi Gu, Shun Lei, Zhiwei Lin, Zhiyong Wu

Abstract: Music codecs are a vital aspect of audio codec research, and ultra low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely relying on modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction tasks at ultra low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow-matching. The music is then reconstructed using a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra low (0.35 kbps) or high bitrates (1.35 kbps), achieving the best results to date in both subjective and objective metrics. Code and Demo: https://xuyaoxun.github.io/MuCodec_demo/.

Submitted 28 September, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

arXiv:2409.06307 [pdf, other]

An End-to-End Approach for Chord-Conditioned Song Generation

Authors: Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

Abstract: The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiency in music performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the foundation of accompaniment and provide vocal melody with associated harmony. Given the inaccuracy of automatic chord extractors, we devise a robust cross-attention mechanism augmented with dynamic weight sequence to integrate extracted chord information into song generations and reduce frame-level flaws, and propose a novel model termed Chord-Conditioned Song Generator (CSG) based on it. Experimental evidence demonstrates our proposed method outperforms other approaches in terms of musical performance and control precision of generated songs.

Submitted 10 September, 2024; originally announced September 2024.

arXiv:2409.06029 [pdf, other]

SongCreator: Lyrics-based Universal Song Generation

Authors: Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

Abstract: Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the application of music generation models in the real world. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for DSLM, which allows our model to understand, generate and edit songs, making it suitable for various song-related generation tasks by utilizing specific attention masks. Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks. Notably, it surpasses previous works by a large margin in lyrics-to-song and lyrics-to-vocals. Additionally, it is able to independently control the acoustic conditions of the vocals and accompaniment in the generated song through different audio prompts, exhibiting its potential applicability. Our samples are available at https://thuhcsi.github.io/SongCreator/.

Submitted 30 October, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

Comments: Accepted by NeurIPS 2024

arXiv:2408.15676 [pdf, other]

doi 10.1145/3664647.3681680

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Authors: Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia

Abstract: Recent AIGC systems possess the capability to generate digital multimedia content based on human language instructions, such as text, image and video. However, when it comes to speech, existing methods related to human instruction-to-speech generation exhibit two limitations. Firstly, they require the division of inputs into content prompt (transcript) and description prompt (style and speaker), instead of directly supporting human instruction. This division is less natural in form and does not align with other AIGC models. Secondly, the practice of utilizing an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthens the generated speech following human instructions. Furthermore, our model architecture and training strategies allow for the simultaneous support of combining speech prompt and descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt. Codes, models and demos are at: https://github.com/thuhcsi/VoxInstruct.

Submitted 28 August, 2024; originally announced August 2024.

Comments: Accepted by ACM Multimedia 2024

arXiv:2408.14340 [pdf, other]

Foundation Models for Music: A Survey

Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, et al. (17 additional authors not shown)

Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

Submitted 3 September, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

arXiv:2404.16619 [pdf, other]

The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

Authors: Yixuan Zhou, Shuoyi Zhou, Shun Lei, Zhiyong Wu, Menglin Wu

Abstract: This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix them with pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains considerable naturalness MOS of 3.97.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Accepted in Grand Challenge of ICASSP 2024

arXiv:2401.09639 [pdf]

Uncertainty Modeling in Ultrasound Image Segmentation for Precise Fetal Biometric Measurements

Authors: Shuge Lei

Abstract: Medical image segmentation, particularly in the context of ultrasound data, is a crucial aspect of computer vision and medical imaging. This paper delves into the complexities of uncertainty in the segmentation process, focusing on fetal head and femur ultrasound images. The proposed methodology involves extracting target contours and exploring techniques for precise parameter measurement. Uncertainty modeling methods are employed to enhance the training and testing processes of the segmentation network. The study reveals that the average absolute error in fetal head circumference measurement is 8.0833mm, with a relative error of 4.7347%. Similarly, the average absolute error in fetal femur measurement is 2.6163mm, with a relative error of 6.3336%. Uncertainty modeling experiments employing Test-Time Augmentation (TTA) demonstrate effective interpretability of data uncertainty on both datasets. This suggests that incorporating data uncertainty based on the TTA method can support clinical practitioners in making informed decisions and obtaining more reliable measurement results in practical clinical applications. The paper contributes to the advancement of ultrasound image segmentation, addressing critical challenges and improving the reliability of biometric measurements.

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.03664 [pdf]

Dual-Channel Reliable Breast Ultrasound Image Classification Based on Explainable Attribution and Uncertainty Quantification

Authors: Shuge Lei, Haonan Hu, Dasheng Sun, Huabin Zhang, Kehong Yuan, Jian Dai, Jijun Tang, Yan Tong

Abstract: This paper focuses on the classification task of breast ultrasound images and investigates the reliability measurement of classification results. We propose a dual-channel evaluation framework based on the proposed inference reliability and predictive reliability scores. For the inference reliability evaluation, human-aligned and doctor-agreed inference rationales based on the improved feature attribution algorithm SP-RISA are gracefully applied. Uncertainty quantification is used to evaluate the predictive reliability via the Test Time Enhancement. The effectiveness of this reliability evaluation framework has been verified on our breast ultrasound clinical dataset YBUS, and its robustness is verified on the public dataset BUSI. The expected calibration errors on both datasets are significantly lower than traditional evaluation methods, which proves the effectiveness of our proposed reliability measurement.

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.00269 [pdf]

doi 10.1109/TPWRS.2021.3081557

Sample Robust Scheduling of Electricity-Gas Systems Under Wind Power Uncertainty

Authors: Rong-Peng Liu, Yunhe Hou, Yujia Li, Shunbo Lei, Wei Wei, Xiaozhe Wang

Abstract: This paper adopts a two-stage sample robust optimization (SRO) model to address the wind power penetrated unit commitment optimal energy flow (UC-OEF) problem for IEGSs. The two-stage SRO model can be approximately transformed into a computationally efficient form. Specifically, we employ linear decision rules to simplify the proposed UC-OEF model. Moreover, we further enhance the tractability of the simplified model by exploring its structural features and, accordingly, develop a solution method.

Submitted 30 December, 2023; originally announced January 2024.

Comments: 10 pages

Journal ref: IEEE Trans. Power Syst., vol. 36, no. 6, pp. 5889-5900, Nov. 2021

arXiv:2309.11977 [pdf, other]

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Abstract: Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

Submitted 9 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.09799 [pdf, other]

Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

Authors: Shanglin Lei, Xiaoping Wang, Guanting Dong, Jiang Li, Yingjian Liu

Abstract: Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods face challenges in achieving generalization to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues from the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, aiming to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, a comprehensive loss function, the emotional cognitive loss $\mathcal{L}_{\rm EC}$, is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Extensive comparative experiments and ablation studies on three benchmarks are conducted to provide evidence supporting the efficacy of each module. Further exploration of generalization ability experiments shows the plug-and-play nature of the EAE module in our method.

Submitted 19 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.02780 [pdf, other]

GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Authors: Aobo Xia, Shuyu Lei, Yushu Yang, Xiang Guo, Hua Chai

Abstract: This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.

Submitted 11 September, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

arXiv:2308.16836 [pdf, other]

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

Authors: Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representations from Transformers (BERT) derived semantic embeddings to improve the expressiveness of the synthesized singing voice. Based on the main architecture of the recently proposed VISinger, we put forward several specific designs for expressive singing voice synthesis. First, different from previous SVS models, we use a text representation of lyrics extracted from pre-trained BERT as additional input to the model. The representation contains information about the semantics of the lyrics, which could help the SVS system produce a more expressive and natural voice. Second, we further introduce an energy predictor to stabilize the synthesized voice and model the wider range of energy variations that also contribute to the expressiveness of the singing voice. Last but not least, to attenuate off-key issues, the pitch predictor is re-designed to predict the real-to-note pitch ratio. Both objective and subjective experimental results indicate that the proposed SVS system can produce singing voice with higher quality, outperforming VISinger.

Submitted 31 August, 2023; originally announced August 2023.

arXiv:2308.16593 [pdf, other]

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

Authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.

Submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2307.16012 [pdf, other]

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng

Abstract: Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the style embeddings at one single scale from the information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in context and to predict style embeddings at global-level, sentence-level and subword-level. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations that have never been discussed before.

Submitted 29 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2304.12704 [pdf, other]

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation based on Pre-trained Genre Token Network

Authors: Haolin Zhuang, Shun Lei, Long Xiao, Weiqin Li, Liyang Chen, Sicheng Yang, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Music-driven 3D dance generation has become an intensive research topic in recent years with great potential for real-world applications. Most existing methods lack the consideration of genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between the dance genre and the music has not been investigated. To address these issues, we propose a genre-consistent dance generation framework, GTN-Bailando. First, we propose the Genre Token Network (GTN), which infers the genre from music to enhance the genre consistency of long-term dance generation. Second, to improve the generalization capability of the model, the strategy of pre-training and fine-tuning is adopted. Experimental results on the AIST++ dataset show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and genre consistency.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023. Demo page: https://im1eon.github.io/ICASSP23-GTNB-DG/

arXiv:2304.06359 [pdf, other]

Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, a hierarchical transformer-based context-aware style predictor with a mixture attention mask is designed, considering both text-side context information and speech-side style information of previous speeches. Based on this, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model can generate speech with a more expressive and coherent speaking style than baselines, for both single-sentence and multi-sentence tests.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023

arXiv:2207.00741 [pdf, other]

doi 10.1109/TSG.2023.3310979

A Distributionally Robust Resilience Enhancement Strategy for Distribution Networks Considering Decision-Dependent Contingencies

Authors: Yujia Li, Shunbo Lei, Wei Sun, Chenxi Hu, Yunhe Hou

Abstract: When performing the resilience enhancement for distribution networks, there are two obstacles to reliably model the uncertain contingencies: 1) decision-dependent uncertainty (DDU) due to various line hardening decisions, and 2) distributional ambiguity due to limited outage information during extreme weather events (EWEs). To address these two challenges, this paper develops scenario-wise decision-dependent ambiguity sets (SWDD-ASs), where the DDU and distributional ambiguity inherent in EWE-induced contingencies are simultaneously captured for each possible EWE scenario. Then, a two-stage trilevel decision-dependent distributionally robust resilient enhancement (DD-DRRE) model is formulated, whose outputs include the optimal line hardening, distributed generation (DG) allocation, and proactive network reconfiguration strategy under the worst-case distributions in SWDD-ASs. Subsequently, the DD-DRRE model is equivalently recast to a mixed-integer linear programming (MILP)-based master problem and multiple scenario-wise subproblems, facilitating the adoption of a customized column-and-constraint generation (C&CG) algorithm. Finally, case studies demonstrate a remarkable improvement in the out-of-sample performance of our model, compared to its prevailing stochastic and robust counterparts. Moreover, the potential values of incorporating the ambiguity and distributional information are quantitatively estimated, providing a useful reference for planners with different budgets and risk-aversion levels.

Submitted 23 August, 2022; v1 submitted 2 July, 2022; originally announced July 2022.

arXiv:2204.02743 [pdf, other]

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Previous works on expressive speech synthesis focus on modelling the mono-scale style embedding from the current sentence or context, but the multi-scale nature of speaking style in human speech is neglected. In this paper, we propose a multi-scale speaking style modelling method to capture and predict multi-scale speaking style for improving the naturalness and expressiveness of synthetic speech. A multi-scale extractor is proposed to extract speaking style embeddings at three different levels from the ground-truth speech, and explicitly guide the training of a multi-scale style predictor based on hierarchical context information. Both objective and subjective evaluations on a Mandarin audiobooks dataset demonstrate that our proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Submitted 5 July, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: Accepted by INTERSPEECH 2022

arXiv:2203.16746 [pdf]

Resilient Distribution System Restoration with Communication Recovery by Drone Small Cells

Authors: Haochen Zhang, Chen Chen, Shunbo Lei, Zhaohong Bie

Abstract: Distribution system (DS) restoration after natural disasters often faces the challenge of communication failures to feeder automation (FA) facilities, resulting in a prolonged load pick-up process. This letter discusses the utilization of drone small cells for wireless communication recovery of FA, and proposes an integrated DS restoration strategy with communication recovery. Demonstrative case studies are conducted to validate the proposed model, and its advantages are illustrated by comparison with benchmark strategies.

Submitted 30 March, 2022; originally announced March 2022.

arXiv:2203.14000 [pdf]

On Time Stepping Schemes Considering Switching Behaviors for Power System Electromagnetic Transient Simulation

Authors: Sheng Lei

Abstract: Several difficulties will appear when typical electromagnetic transient simulation, using the implicit trapezoidal method and fixed step sizes, is applied to power systems with switching behaviors. These difficulties are addressed by different aspects of time stepping schemes in the literature. This paper first details the different aspects and reviews corresponding methods. Some misunderstanding in the literature is clarified. Issues that may be encountered by the existing methods are concurrently revealed. Based on the detailed review, the paper then puts forward a novel time stepping scheme which fully addresses the difficulties. The effectiveness of the proposed scheme is demonstrated via numerical case studies.

Submitted 26 March, 2022; originally announced March 2022.

Comments: Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2203.12201 [pdf, other]

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Previous works on expressive speech synthesis mainly focus on the current sentence. The context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text, which lacks speech variations. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information considering structural relationships in context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn style representation better, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Submitted 6 April, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted by ICASSP 2022

arXiv:2106.03329 [pdf]

Improved Method for Dealing with Discontinuities in Power System Transient Simulation Based on Frequency Response Optimized Integrators Considering Second Order Derivative

Authors: Sheng Lei, Alexander Flueck

Abstract: Potential disagreement in the result induced by discontinuities is revealed in this paper between a novel power system transient simulation scheme using numerical integrators considering second order derivative and conventional ones using numerical integrators considering first order derivative. The disagreement is due to the formula of the different numerical integrators. An improved method for dealing with discontinuities in the novel transient simulation scheme is proposed to resolve the disagreement. The effectiveness of the improved method is demonstrated and verified via numerical case studies. Although the disagreement is studied on and the improved method is proposed for a particular transient simulation scheme, similar conclusions also apply to other ones using numerical integrators considering high order derivative.

Submitted 7 June, 2021; originally announced June 2021.

Comments: Accepted by the 2021 IEEE Midwest Symposium on Circuits and Systems

arXiv:2104.10385 [pdf, other]

Wide-Beam Array Antenna Power Gain Maximization via ADMM Framework

Authors: Shiwen Lei, Jing Tian, Zhipeng Lin, Haoquan Hu, Bo Chen, Wei Yang, Pu Tang, Xiangdong Qiu

Abstract: This paper proposes two algorithms to maximize the minimum array power gain in a wide-beam mainlobe by solving the power gain pattern synthesis (PGPS) problem with and without sidelobe constraints. Firstly, the nonconvex PGPS problem is transformed into a nonconvex linear inequality optimization problem and then converted to an augmented Lagrangian problem by introducing auxiliary variables via the Alternating Direction Method of Multipliers (ADMM) framework. Next, the original intractable problem is converted into a series of nonconvex and convex subproblems. The nonconvex subproblems are solved by dividing their solution space into a finite set of smaller ones, in which the solution would be obtained pseudoanalytically. In such a way, the proposed algorithms are superior to the existing PGPS-based ones as their convergence can be theoretically guaranteed with a lower computational burden. Numerical examples with both isotropic element pattern (IEP) and active element pattern (AEP) arrays are simulated to show the effectiveness and superiority of the proposed algorithms by comparing with the related existing algorithms.

Submitted 21 April, 2021; originally announced April 2021.

arXiv:2101.03266 [pdf]

Studies on Frequency Response Optimized Integrators Considering Second Order Derivative

Authors: Sheng Lei, Alexander Flueck

Abstract: This paper presents comprehensive studies on frequency response optimized integrators considering second order derivative regarding their numerical error, numerical stability and transient performance. Frequency domain error analysis is conducted on these numerical integrators to reveal their accuracy. Numerical stability of the numerical integrators is investigated. Interesting new types of numerical stability are recognized. Transient performance of the numerical integrators is defined to qualitatively characterize their ability to track fast decaying transients. This property is related to unsatisfactory phenomena such as numerical oscillation which frequently appear in time domain simulation of circuits and systems. Transient performance analysis of the numerical integrators is provided. Theoretical observations from the analysis of the numerical integrators are verified via time domain case studies.

Submitted 8 January, 2021; originally announced January 2021.

Comments: Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2101.03063 [pdf]

Knowledge AI: New Medical AI Solution for Medical image Diagnosis

Authors: Yingni Wang, Shuge Lei, Jian Dai, Kehong Yuan

Abstract: The implementation of medical AI has always been a problem. The effectiveness of traditional perceptual AI algorithms in medical image processing needs to be improved. Here we propose a method of knowledge AI, which is a combination of perceptual AI and clinical knowledge and experience. Based on this method, the geometric information mining of medical images can represent the experience and information and evaluate the quality of medical images.

Submitted 8 January, 2021; originally announced January 2021.

Comments: 9 pages, 8 figures. arXiv admin note: text overlap with arXiv:2101.02639

arXiv:2012.01375 [pdf]

Proper Selection of Obreshkov-Like Numerical Integrators Used as Numerical Differentiators for Power System Transient Simulation

Authors: Sheng Lei, Alexander Flueck

Abstract: Obreshkov-like numerical integrators have been widely applied to power system transient simulation. Misuse of the numerical integrators as numerical differentiators may lead to numerical oscillation or bias. Criteria for Obreshkov-like numerical integrators to be used as numerical differentiators are proposed in this paper to avoid these misleading phenomena. The coefficients of a numerical integrator for the highest order derivative turn out to determine its suitability. Some existing Obreshkov-like numerical integrators are examined under the proposed criteria. It is revealed that the notorious numerical oscillations induced by the implicit trapezoidal method cannot always be eliminated by using the backward Euler method for a few time steps. Guided by the proposed criteria, a frequency response optimized integrator considering second order derivative is put forward which is suitable to be used as a numerical differentiator. Theoretical observations are demonstrated in time domain via case studies. The paper points out how to properly select the numerical integrators for power system transient simulation and helps to prevent their misuse.

Submitted 15 February, 2022; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: Accepted by the 2022 IEEE PES General Meeting

arXiv:2011.05439 [pdf]

Transient Simulation of Grid-Feeding Converter System for Stability Studies Using Frequency Response Optimized Integrators

Authors: Sheng Lei, Alexander Flueck

Abstract: A grid-feeding converter system is added to a novel power system transient simulation scheme based on frequency response optimized integrators considering second order derivative. The converter system and its implementation in the simulation scheme are detailed. Case studies verify the accuracy and efficiency of the simulation scheme. Furthermore, this paper proposes and justifies extending the simulation scheme by integrating commonly used numerical integrators considering first order derivative for part of the studied system. The proposed extension has an insignificant impact on the accuracy of the simulation scheme while significantly enhancing its efficiency. It also reduces the development burden in adding new devices.

Submitted 20 February, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

Comments: Accepted by the 2021 IEEE PES General Meeting

arXiv:2011.00711 [pdf]

Multistep Frequency Response Optimized Integrators and Their Application to Accelerating a Power System Transient Simulation Scheme

Authors: Sheng Lei, Alexander Flueck

Abstract: This paper proposes several explicit and implicit multistep frequency response optimized integrators considering first or second order derivative. A prediction-based method aiming at accelerating a novel power system transient simulation scheme without impacting its accuracy is further put forward utilizing the proposed numerical integrators and some others available in the literature. Case studies verify the effectiveness of the proposed prediction method. Although they are utilized to accelerate the simulation scheme in this paper, the proposed numerical integrators are in fact general-purpose and can be applied to other areas.

Submitted 15 February, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

Comments: Accepted by the 2021 IEEE PES General Meeting

arXiv:2008.13059 [pdf]

Initialization Process of a Power System Transient Simulation Scheme for Stability Studies

Authors: Sheng Lei, Alexander Flueck

Abstract: The initialization process of a novel power system transient simulation scheme for stability studies is put forward, by further developing a "time-domain harmonic power-flow algorithm". The initialization process is formulated as an algebraic problem to ensure that the power system under study is in steady state and operated at a specified operating point, at the beginning of a transient simulation run. The algebraic problem is then solved efficiently by a preconditioned finite difference Newton-GMRES method. Case studies verify the validity and efficiency of the initialization process. The proposed initialization process is general-purpose and can be applied to other power system transient simulation schemes.

Submitted 29 August, 2020; originally announced August 2020.

Comments: Accepted by the 52nd North American Power Symposium

arXiv:2007.01496 [pdf, other]

Few-Shot Semantic Segmentation Augmented with Image-Level Weak Annotations

Authors: Shuo Lei, Xuchao Zhang, Jianfeng He, Fanglan Chen, Chang-Tien Lu

Abstract: Despite the great progress made by deep neural networks in the semantic segmentation task, traditional neural-network-based methods typically suffer from a shortage of large amounts of pixel-level annotations. Recent progress in few-shot semantic segmentation tackles the issue by using only a few pixel-level annotated examples. However, these few-shot approaches cannot easily be applied to multi-way or weak annotation settings. In this paper, we advance the few-shot segmentation paradigm towards a scenario where image-level annotations are available to help the training process of a few pixel-level annotations. Our key idea is to learn a better prototype representation of the class by fusing the knowledge from the image-level labeled data. Specifically, we propose a new framework, called PAIA, to learn the class prototype representation in a metric space by integrating image-level annotations. Furthermore, by considering the uncertainty of pseudo-masks, a distilled soft masked average pooling strategy is designed to handle distractions in image-level annotations. Extensive empirical results on two datasets show superior performance of PAIA.

Submitted 18 June, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: Accepted to ICME 2021

arXiv:2005.00964 [pdf]

Efficient Power System Transient Simulation Based on Frequency Response Optimized Integrators Considering Second Order Derivative

Authors: Sheng Lei, Alexander Flueck

Abstract: Frequency response optimized integrators considering second order derivative are proposed in this paper. Based on the proposed numerical integrators, and others which also consider second order derivative, this paper puts forward a novel power system transient simulation scheme. Instead of using a unique numerical integrator, the proposed simulation scheme chooses proper ones according to the dominant frequency component of the differential state variables. With the proposed simulation scheme, computational efficiency is improved by using large step sizes without sacrificing accuracy. Numerical case studies demonstrate the validity and efficiency of the simulation scheme.

Submitted 2 May, 2020; originally announced May 2020.

Comments: Accepted by the 2020 IEEE PES General Meeting

arXiv:2004.13557 [pdf, other]

Baseline Estimation of Commercial Building HVAC Fan Power Using Tensor Completion

Authors: Shunbo Lei, David Hong, Johanna L. Mathieu, Ian A. Hiskens

Abstract: Commercial building heating, ventilation, and air conditioning (HVAC) systems have been studied for providing ancillary services to power grids via demand response (DR). One critical issue is to estimate the counterfactual baseline power consumption that would have prevailed without DR. Baseline methods have been developed based on whole building electric load profiles. New methods are necessary to estimate the baseline power consumption of HVAC sub-components (e.g., supply and return fans), which have different characteristics compared to that of the whole building. Tensor completion can estimate the unobserved entries of multi-dimensional tensors describing complex data sets. It exploits high-dimensional data to capture granular insights into the problem. This paper proposes to use it for baselining HVAC fan power, by utilizing its capability of capturing dominant fan power patterns. The tensor completion method is evaluated using HVAC fan power data from several buildings at the University of Michigan, and compared with several existing methods. The tensor completion method generally outperforms the benchmarks.

Submitted 24 April, 2020; originally announced April 2020.

arXiv:1912.06936 [pdf]

doi 10.1021/acs.jpca.9b11681

Compressed Sensing for Reconstructing Coherent Multidimensional Spectra

Authors: Zhengjun Wang, Shiwen Lei, Khadga Jung Karki, Andreas Jakobsson, Tönu Pullerits

Abstract: We apply two sparse reconstruction techniques, the least absolute shrinkage and selection operator (LASSO) and the sparse exponential mode analysis (SEMA), to two-dimensional (2D) spectroscopy. The algorithms are first tested on model data, showing that both are able to reconstruct the spectra using only a fraction of the data required by the traditional Fourier-based estimator. Through the analysis of sparsely sampled experimental fluorescence-detected 2D spectra of LH2 complexes, we conclude that both SEMA and LASSO can be used to significantly reduce the required data while still allowing the multidimensional spectra to be reconstructed. Of the two techniques, it is shown that SEMA offers preferable performance, providing more accurate estimation of the spectral line widths and their positions. Furthermore, SEMA allows for off-grid components, enabling the use of a much smaller dictionary than the LASSO, thereby both improving the performance and lowering the computational complexity of reconstructing coherent multidimensional spectra.

Submitted 14 December, 2019; originally announced December 2019.

arXiv:1911.09987 [pdf, other]

Transmission System Resilience Enhancement with Extended Steady-state Security Region in Consideration of Uncertain Topology Changes

Authors: Chong Wang, Feng Wu, Ping Ju, Shunbo Lei, Tianguang Lu, Yunhe Hou

Abstract: The increasing number of extreme weather events poses unprecedented challenges to power system operation because of their uncertain and sequential impacts on power systems. This paper proposes the concept of an extended steady-state security region (ESSR) and implements ESSR-based resilience enhancement for transmission systems in consideration of uncertain topology changes caused by extreme weather events. ESSR is a polytope describing a region in which the operating points are within the operating constraints. In consideration of uncertain topology changes with ESSR, the resilience enhancement problem is built as a bilevel programming optimization model, in which the system operators deploy the optimal strategy against the most threatening scenario caused by the extreme weather events. To avoid the curse of dimensionality with regard to system topologies for a large-scale system, the Monte Carlo method is used to generate uncertain system topologies, and a recursive McCormick envelope-based approach is proposed to connect generated system topologies to optimization variables. Karush-Kuhn-Tucker (KKT) conditions are used to transform the suboptimization model in the second level into a group of equivalent constraints in the first level. A simple test system and the IEEE 118-bus system are used to validate the proposed approach.

Submitted 22 November, 2019; originally announced November 2019.

Showing 1–40 of 40 results for author: Lei, S