Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation

Rong, Yan; Yang, Shan; Li, Chenxing; Yu, Dong; Liu, Li

Abstract: Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty of conveying expressive, fine-grained emotions, which often results in machine-like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training-free multi-agent system in which a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human-like, and immersive audiobook generation and evaluation. Specifically, we first propose a flow-based, context-aware framework for diverse audio generation with word-level semantic and temporal alignment. To enhance expressiveness, we then design word-level paralinguistic augmentation, utterance-level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.

Subjects:	Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2504.11002 [cs.SD]
	(or arXiv:2504.11002v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2504.11002
