-
Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech
Authors:
Yejin Lee,
Jaehoon Kang,
Kyuhong Shim
Abstract:
In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experime…
▽ More
In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experimental results demonstrate that our methods enhance the naturalness, clarity, and consistency of synthesized speech. Finally, we analyze implicit social biases introduced by LLM-based rewriting, with a focus on gender. We underscore voice style as a crucial factor for persona-driven AI dialogue systems.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions
Authors:
Junzhi Ning,
Cheng Tang,
Kaijin Zhou,
Diping Song,
Lihao Liu,
Ming Hu,
Wei Li,
Yanzhou Su,
Tianbing Li,
Jiyao Liu,
Yejin,
Sheng Zhang,
Yuanfeng Ji,
Junjun He
Abstract:
The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to gener…
▽ More
The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to generate images for broader categories with diverse and fine-grained anatomical structures. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, synthetic Caption-CFP dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses large language models (LLMs) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Furthermore, based on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, with 62.07% of text-driven synthetic images indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 10%-25% in diabetic retinopathy grading and glaucoma detection, thereby providing a scalable solution to augment ophthalmic datasets.
△ Less
Submitted 24 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
Authors:
Youngjae Kim,
Yejin Jeon,
Gary Geunbae Lee
Abstract:
The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, current literature rely on fixed expressions from language IDs, which results in the inadequate learning of language representations, and the failure to generate speech in unseen languages. To address these challenges, we propose a nove…
▽ More
The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, current literature rely on fixed expressions from language IDs, which results in the inadequate learning of language representations, and the failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information including speaker-specific attributes like timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech, and highlight its superiority in low-resource transfer learning for previously unseen language.
△ Less
Submitted 27 September, 2024;
originally announced September 2024.
-
An Investigation Into Explainable Audio Hate Speech Detection
Authors:
Jinmyeong An,
Wonjun Lee,
Yejin Jeon,
Jungseul Ok,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. While there has been limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. Therefore, we introduce a new task of explainable audio hate speech detection. Specifically…
▽ More
Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. While there has been limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. Therefore, we introduce a new task of explainable audio hate speech detection. Specifically, we aim to identify the precise time intervals, referred to as audio frame-level rationales, which serve as evidence for hate speech classification. Towards this end, we propose two different approaches: cascading and End-to-End (E2E). The cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Additionally, due to the lack of explainable audio hate speech datasets that include audio frame-level rationales, we curated a synthetic audio dataset to train our models. We further validated these models on actual human speech utterances and found that the E2E approach outperforms the cascading method in terms of the audio frame Intersection over Union (IoU) metric. Furthermore, we observed that including frame-level rationales significantly enhances hate speech detection accuracy for the E2E approach.
\textbf{Disclaimer} The reader may encounter content of an offensive or hateful nature. However, given the nature of the work, this cannot be avoided.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
An Application of Model Reference Adaptive Control for Multi-Agent Synchronization in Drone Networks
Authors:
Miguel F. Arevalo-Castiblanco,
Yejin Wi,
Marzia Cescon and,
Cesar A. Uribe
Abstract:
This paper presents the application of a Distributed Model Reference Adaptive Control (DMRAC) strategy for robust multi-agent synchronization of a network of drones. The proposed approach enables the development of controllers capable of accommodating differences in real-life model parameters between agents, thereby enhancing overall network performance. We compare the performance of the adaptive…
▽ More
This paper presents the application of a Distributed Model Reference Adaptive Control (DMRAC) strategy for robust multi-agent synchronization of a network of drones. The proposed approach enables the development of controllers capable of accommodating differences in real-life model parameters between agents, thereby enhancing overall network performance. We compare the performance of the adaptive control laws with classical PID controllers for the reference tracking task. Each follower drone has a model reference adaptive controller that continuously updates its parameters based on real-time feedback and reference model information. This adaptability ensures an adequate performance that, compared to conventional non-adaptive techniques, can reduce the amount of energy required and consequently increase the operating duration of the drones. The experimental results, particularly in vertical velocity control, underscore the effectiveness of the proposed approach in achieving synchronized behavior.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation
Authors:
Yejin Jeon,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and Fas…
▽ More
Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to the Korean language, which affects speech perception and naturalness. In order to address the aforementioned issues, we propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns. Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Multi-Level Attention Aggregation for Language-Agnostic Speaker Replication
Authors:
Yejin Jeon,
Gary Geunbae Lee
Abstract:
This paper explores the task of language-agnostic speaker replication, a novel endeavor that seeks to replicate a speaker's voice irrespective of the language they are speaking. Towards this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies various speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide…
▽ More
This paper explores the task of language-agnostic speaker replication, a novel endeavor that seeks to replicate a speaker's voice irrespective of the language they are speaking. Towards this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies various speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide range of scenarios including seen and unseen speakers conversing in seen and unseen lingua, we establish that our proposed model is able to achieve substantial speaker similarity, and is able to generalize to out-of-domain (OOD) cases.
△ Less
Submitted 3 April, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
Authors:
Yejin Jeon,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that mode…
▽ More
Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-`a-vis alternative baseline models.
△ Less
Submitted 5 March, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking
Authors:
Jihyun Lee,
Yejin Jeon,
Wonjun Lee,
Yunsu Kim,
Gary Geunbae Lee
Abstract:
Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research are limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audi…
▽ More
Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research are limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel PhonemeF1 to capture pronunciation similarity. Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Authors:
Rowan Zellers,
Jiasen Lu,
Ximing Lu,
Youngjae Yu,
Yanpeng Zhao,
Mohammadreza Salehi,
Aditya Kusupati,
Jack Hessel,
Ali Farhadi,
Yejin Choi
Abstract:
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Ou…
▽ More
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos.
Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.
We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
△ Less
Submitted 13 May, 2022; v1 submitted 7 January, 2022;
originally announced January 2022.
-
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Authors:
Yanpeng Zhao,
Jack Hessel,
Youngjae Yu,
Ximing Lu,
Rowan Zellers,
Yejin Choi
Abstract:
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces \textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text data. Our key idea is t…
▽ More
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces \textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly.
In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2\% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8\% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2M$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
△ Less
Submitted 2 May, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Multi-Channel Volumetric Neural Network for Knee Cartilage Segmentation in Cone-beam CT
Authors:
Jennifer Maier,
Luis Carlos Rivera Monroy,
Christopher Syben,
Yejin Jeon,
Jang-Hwan Choi,
Mary Elizabeth Hall,
Marc Levenston,
Garry Gold,
Rebecca Fahrig,
Andreas Maier
Abstract:
Analyzing knee cartilage thickness and strain under load can help to further the understanding of the effects of diseases like Osteoarthritis. A precise segmentation of the cartilage is a necessary prerequisite for this analysis. This segmentation task has mainly been addressed in Magnetic Resonance Imaging, and was rarely investigated on contrast-enhanced Computed Tomography, where contrast agent…
▽ More
Analyzing knee cartilage thickness and strain under load can help to further the understanding of the effects of diseases like Osteoarthritis. A precise segmentation of the cartilage is a necessary prerequisite for this analysis. This segmentation task has mainly been addressed in Magnetic Resonance Imaging, and was rarely investigated on contrast-enhanced Computed Tomography, where contrast agent visualizes the border between femoral and tibial cartilage. To overcome the main drawback of manual segmentation, namely its high time investment, we propose to use a 3D Convolutional Neural Network for this task. The presented architecture consists of a V-Net with SeLu activation, and a Tversky loss function. Due to the high imbalance between very few cartilage pixels and many background pixels, a high false positive rate is to be expected. To reduce this rate, the two largest segmented point clouds are extracted using a connected component analysis, since they most likely represent the medial and lateral tibial cartilage surfaces. The resulting segmentations are compared to manual segmentations, and achieve on average a recall of 0.69, which confirms the feasibility of this approach.
△ Less
Submitted 3 December, 2019;
originally announced December 2019.
-
From Brain Imaging to Graph Analysis: a study on ADNI's patient cohort
Authors:
Rui Zhang,
Luca Giancardo,
Danilo A. Pena,
Yejin Kim,
Hanghang Tong,
Xiaoqian Jiang
Abstract:
In this paper, we studied the association between the change of structural brain volumes to the potential development of Alzheimer's disease (AD). Using a simple abstraction technique, we converted regional cortical and subcortical volume differences over two time points for each study subject into a graph. We then obtained substructures of interest using a graph decomposition algorithm in order t…
▽ More
In this paper, we studied the association between the change of structural brain volumes to the potential development of Alzheimer's disease (AD). Using a simple abstraction technique, we converted regional cortical and subcortical volume differences over two time points for each study subject into a graph. We then obtained substructures of interest using a graph decomposition algorithm in order to extract pivotal nodes via multi-view feature selection. Intensive experiments using robust classification frameworks were conducted to evaluate the performance of using the brain substructures obtained under different thresholds. The results indicated that compact substructures acquired by examining the differences between patient groups were sufficient to discriminate between AD and healthy controls with an area under the receiver operating curve of 0.72.
△ Less
Submitted 14 May, 2019;
originally announced May 2019.