-
AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation
Authors:
Yan Rong,
Jinting Wang,
Shan Yang,
Guangzhi Lei,
Li Liu
Abstract:
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling these issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that AudioGenie outperforms state-of-the-art (SOTA) methods across 9 metrics in 8 tasks. A user study further validates the effectiveness of the proposed method in terms of quality, accuracy, alignment, and aesthetics. The anonymous project website with samples can be found at https://audiogenie.github.io/.
Submitted 28 May, 2025;
originally announced May 2025.
-
Low-Complexity Channel Estimation in OTFS Systems with Fractional Effects
Authors:
Guangyu Lei,
Yanduo Qiao,
Tianhao Liang,
Weijie Yuan,
Tingting Zhang
Abstract:
Orthogonal Time Frequency Space (OTFS) modulation exploits the sparsity of Delay-Doppler domain channels, making it highly effective in high-mobility scenarios. Its accurate channel estimation supports integrated sensing and communication (ISAC) systems. This letter introduces a low-complexity technique for estimating delay and Doppler shifts under fractional effects, while addressing inter-path interference. The method employs a sequential estimation process combined with interference elimination based on energy leakage, ensuring accurate channel estimation. Furthermore, the estimated channel parameters can significantly improve ISAC system performance by enhancing sensing capabilities. Experimental results validate the effectiveness of this approach in achieving accurate channel estimation and facilitating sensing tasks for ISAC systems.
Submitted 28 April, 2025;
originally announced May 2025.
-
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation
Authors:
Yan Rong,
Shan Yang,
Guangzhi Lei,
Li Liu
Abstract:
Audiobook generation, which creates vivid and emotion-rich audio works, faces challenges in conveying complex emotions, achieving human-like qualities, and aligning evaluations with human preferences. Existing text-to-speech (TTS) methods are often limited to specific scenarios, struggle with emotional transitions, and lack automatic human-aligned evaluation benchmarks, instead relying on either misaligned automated metrics or costly human assessments. To address these issues, we propose Dopamine Audiobook, a new unified training-free system leveraging a multimodal large language model (MLLM) as an AI agent for emotional and human-like audiobook generation and evaluation. Specifically, we first design a flow-based emotion-enhanced framework that decomposes complex emotional speech synthesis into controllable sub-tasks. Then, we propose an adaptive model selection module that dynamically selects the most suitable TTS methods from a set of existing state-of-the-art (SOTA) TTS methods for diverse scenarios. We further enhance emotional expressiveness through paralinguistic augmentation and prosody retrieval at word and utterance levels. For evaluation, we propose a novel GPT-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experiments show that our method generates long speech with superior emotional expression to SOTA TTS models in various metrics. Importantly, our evaluation framework demonstrates better alignment with human preferences and transferability across audio tasks. Project website with audio samples can be found at https://dopamine-audiobook.github.io.
Submitted 15 April, 2025;
originally announced April 2025.
-
Promoting Shared Energy Storage Aggregation among High Price-Tolerance Prosumer: An Incentive Deposit and Withdrawal Service
Authors:
Xin Lu,
Jing Qiu,
Cuo Zhang,
Gang Lei,
Jianguo Zhu
Abstract:
Many residential prosumers exhibit a high price-tolerance for household electricity bills and a low response to price incentives. This is because the household electricity bills are not inherently high, and the potential for saving on electricity bills through participation in conventional Shared Energy Storage (SES) is limited, which diminishes their motivation to actively engage in SES. Additionally, existing SES models often require prosumers to take additional actions, such as optimizing rental capacity and bidding prices, which happen to be capabilities that typical household prosumers do not possess. To incentivize these high price-tolerance residential prosumers to participate in SES, a novel SES aggregation framework is proposed, which does not require prosumers to take additional actions and allows them to maintain existing energy storage patterns. Compared to conventional long-term operation of SES, the proposed framework introduces an additional short-term construction step during which the energy service provider (ESP) acquires control of the energy storage systems (ESS) and offers electricity deposit and withdrawal services (DWS) with dynamic coefficients, enabling prosumers to withdraw more electricity than they deposit without additional actions. Additionally, a matching mechanism is proposed to align prosumers' electricity consumption behaviors with ESP's optimization strategies. Finally, the dynamic coefficients in DWS and trading strategies are optimized by an improved deep reinforcement learning (DRL) algorithm. Case studies are conducted to verify the effectiveness of the proposed SES aggregation framework with DWS and the matching mechanism.
Submitted 13 January, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
Authors:
Yu Gu,
Qiushi Zhu,
Guangzhi Lei,
Chao Weng,
Dan Su
Abstract:
This paper proposes an improved version of DurIAN-E (DurIAN-E 2), which is also a duration informed attention neural network for expressive and high-fidelity text-to-speech (TTS) synthesis. Similar to the DurIAN-E model, multiple stacked SwishRNN-based Transformer blocks are utilized as linguistic encoders, and Style-Adaptive Instance Normalization (SAIN) layers are incorporated into frame-level encoders to improve the modeling of expressiveness in the proposed DurIAN-E 2. Meanwhile, motivated by other TTS models that use generative models, such as VITS, the proposed DurIAN-E 2 utilizes variational autoencoders (VAEs) augmented with normalizing flows and a BigVGAN waveform generator with an adversarial training strategy, which further improve the synthesized speech quality and expressiveness. Both objective test and subjective evaluation results show that the proposed expressive TTS model DurIAN-E 2 can achieve better performance than several state-of-the-art approaches in addition to DurIAN-E.
Submitted 17 October, 2024;
originally announced October 2024.
-
DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis
Authors:
Yu Gu,
Yianrao Bian,
Guangzhi Lei,
Chao Weng,
Dan Su
Abstract:
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. DurIAN-E inherits from the original DurIAN model an auto-regressive structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model. Meanwhile, the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are incorporated into frame-level encoders to improve the modeling of expressiveness. A denoiser incorporating both a denoising diffusion probabilistic model (DDPM) for mel-spectrograms and SAIN modules is employed to further improve the synthetic speech quality and expressiveness. Experimental results show that the proposed expressive TTS model can achieve better performance than state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
Submitted 22 September, 2023;
originally announced September 2023.
-
Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams
Authors:
Huirong Huang,
Zhiyong Wu,
Shiyin Kang,
Dongyang Dai,
Jia Jia,
Tianxiao Fu,
Deyi Tuo,
Guangzhi Lei,
Peng Liu,
Dan Su,
Dong Yu,
Helen Meng
Abstract:
Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches have the following main limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phonetic posteriorgrams (PPG). In this way, our method doesn't need handcrafted features and is more robust to noise compared to recent approaches. Furthermore, our method can support multilingual speech as input by building a universal phoneme space. As far as we know, our model is the first to support multilingual/mixlingual speech as input with convincing results. Objective and subjective experiments have shown that our model can generate high-quality animations given speech from unseen languages or speakers and is robust to noise.
Submitted 20 June, 2020;
originally announced June 2020.
-
DurIAN: Duration Informed Attention Network For Multimodal Synthesis
Authors:
Chengzhu Yu,
Heng Lu,
Na Hu,
Meng Yu,
Chao Weng,
Kun Xu,
Peng Liu,
Deyi Tuo,
Shiyin Kang,
Guangzhi Lei,
Dan Su,
Dong Yu
Abstract:
In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, which accounts for various unavoidable artifacts in those systems. Furthermore, DurIAN can be used to generate high-quality facial expression which can be synchronized with generated speech with/without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN effectively reduces the total computational complexity from 9.8 to 5.5 GFLOPS, and is able to generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while avoiding the word skipping/repeating errors of those systems. Finally, a simple yet effective approach for fine-grained control of expressiveness of speech and facial expression is introduced.
Submitted 5 September, 2019; v1 submitted 4 September, 2019;
originally announced September 2019.
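The DurIAN abstract above says that alignments between input text and output acoustic features are inferred from a duration model rather than learned by end-to-end attention. The following is a minimal sketch of that general idea, not the authors' implementation: each phoneme-level state is repeated for its duration in frames, which gives a hard, monotonic alignment. The function names `expand_by_duration` and `alignment_matrix` are hypothetical, and the durations are assumed to be given as integer frame counts.

```python
from typing import List, Sequence


def expand_by_duration(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level state for its (assumed given) number of frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("one duration is required per phoneme state")
    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state so that frames do not share a list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


def alignment_matrix(durations: Sequence[int]) -> List[List[int]]:
    """Build the hard 0/1 alignment (phonemes x frames) implied by the durations."""
    total_frames = sum(durations)
    rows: List[List[int]] = []
    start = 0
    for n_frames in durations:
        row = [0] * total_frames
        for t in range(start, start + n_frames):
            row[t] = 1
        rows.append(row)
        start += n_frames
    return rows


if __name__ == "__main__":
    # Toy one-dimensional "encoder states" for three phonemes.
    states = [[1.0], [2.0], [3.0]]
    print(expand_by_duration(states, [2, 0, 3]))
    print(alignment_matrix([2, 1]))
```

Because every frame is assigned to exactly one phoneme and the assignment only moves forward, this kind of alignment cannot skip or repeat an input token, which is consistent with the abstract's claim about avoiding word skipping/repeating errors.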