-
Articulatory Configurations across Genders and Periods in French Radio and TV archives
Authors:
Benjamin Elie,
David Doukhan,
Rémi Uro,
Lucas Ondel-Yang,
Albert Rilliard,
Simon Devauchelle
Abstract:
This paper studies changes in articulatory configurations across genders and periods using an inversion from acoustic to articulatory parameters. From a diachronic corpus based on French media archives spanning 60 years from 1955 to 2015, automatic transcription and forced alignment made it possible to extract the central frame of each vowel. More than one million frames were obtained from over a thousand speakers across gender and age categories. The formants measured on these vocalic frames were used to fit the parameters of Maeda's articulatory model. Evaluations of the quality of these processes are provided. We focus here on two parameters of Maeda's model linked to total vocal tract length: the relative position of the larynx (higher for females) and lip protrusion (more protruded for males). Implications for voice quality across genders are discussed. The effect across periods seems gender independent; thus, the assertion that females lowered their pitch with time is not supported.
Submitted 8 August, 2024;
originally announced August 2024.
-
Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content
Authors:
Rémi Uro,
Marie Tahon,
David Doukhan,
Antoine Laurent,
Albert Rilliard
Abstract:
Transition Relevance Places are defined as the end of an utterance where the interlocutor may take the floor without interrupting the current speaker, i.e., a place where the turn is terminal. Analyzing turn terminality is useful for studying the dynamics of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compared audio, text, and fusions of both approaches on a French corpus of TV and Radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also examines the problem of performance variability by analyzing the differences in results for multiple training runs with random initialization. The measured accuracy would allow the use of these models for large-scale analysis of turn-taking.
Submitted 14 June, 2024;
originally announced June 2024.
-
A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification
Authors:
Rémi Uro,
David Doukhan,
Albert Rilliard,
Laëtitia Larcher,
Anissa-Claire Adgharouamane,
Marie Tahon,
Antoine Laurent
Abstract:
This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker's age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at the French National Audiovisual Institute (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have been found so far). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal, and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. Evaluation of the quality of the automatic processing and of the final output is provided. It shows that the automatic processing is comparable to up-to-date processes, and that the output provides high-quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.
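The category arithmetic in this abstract (2 genders × 4 age ranges × 4 recording periods = 32 categories, with a target of at least 30 speakers each) can be illustrated with a minimal Python sketch. It is not the authors' tooling: the age and period labels are placeholders, and `build_categories` and `missing_speakers` are hypothetical helper names introduced only for this illustration.

```python
from itertools import product

# Gender labels follow the abstract; the age-range and period labels are
# placeholders (assumption), since this abstract only states there are 4 of each.
GENDERS = ("female", "male")
AGE_RANGES = tuple(f"age_{i}" for i in range(1, 5))
PERIODS = tuple(f"period_{i}" for i in range(1, 5))
MIN_SPEAKERS_PER_CATEGORY = 30  # target stated in the abstract


def build_categories():
    """Return every (gender, age range, period) combination."""
    return list(product(GENDERS, AGE_RANGES, PERIODS))


def missing_speakers(counts):
    """Given {category: speakers found}, return the shortfall per category.

    Hypothetical helper: categories absent from `counts` are treated as empty.
    """
    return {
        category: max(0, MIN_SPEAKERS_PER_CATEGORY - counts.get(category, 0))
        for category in build_categories()
    }


categories = build_categories()
target_total = len(categories) * MIN_SPEAKERS_PER_CATEGORY
print(len(categories), target_total)
```

A per-category shortfall table of this kind is one simple way to see which cells of a balanced design still need speakers.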
Submitted 26 April, 2024;
originally announced April 2024.
-
Evolution of Voices in French Audiovisual Media Across Genders and Age in a Diachronic Perspective
Authors:
Albert Rilliard,
David Doukhan,
Rémi Uro,
Simon Devauchelle
Abstract:
We present a diachronic acoustic analysis of the voices of 1023 speakers from French media archives. The speakers are spread across 32 categories based on four periods (years 1955/56, 1975/76, 1995/96, 2015/16), four age groups (20-35, 36-50, 51-65, >65), and two genders. The fundamental frequency ($F_0$) and the first four formants (F1-4) were estimated. Procedures used to ensure the quality of these estimations on heterogeneous data are described. From each speaker's $F_0$ distribution, the base-$F_0$ value was calculated to estimate the register. Average vocal tract length was estimated from formant frequencies. Base-$F_0$ and vocal tract length were fit by linear mixed models to evaluate how they may have changed across time periods and genders, corrected for age effects. Results show an effect of the period with a tendency to lower voices, independently of gender. A lowering of pitch is observed with age for female but not male speakers.
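The abstract does not say which estimator was used to derive vocal tract length from formants. As a hedged illustration only, the sketch below uses the textbook uniform-tube (quarter-wavelength) approximation, in which the n-th resonance of a tube closed at one end is $F_n = (2n-1)c/(4L)$, so each formant yields an estimate $L = (2n-1)c/(4F_n)$. The function name `vocal_tract_length_cm` and the speed-of-sound constant are assumptions of this sketch, not values taken from the paper.

```python
# Assumed speed of sound in the vocal tract, in cm/s (a common textbook value).
SPEED_OF_SOUND_CM_S = 35000.0


def vocal_tract_length_cm(formants_hz):
    """Average vocal tract length (cm) under the uniform-tube approximation.

    `formants_hz` lists F1, F2, ... in Hz. Each formant F_n gives one
    estimate L_n = (2n - 1) * c / (4 * F_n); the estimates are averaged.
    """
    if not formants_hz:
        raise ValueError("at least one formant is required")
    if any(f <= 0 for f in formants_hz):
        raise ValueError("formant frequencies must be positive")
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * f)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return sum(estimates) / len(estimates)


# Idealized neutral-vowel formants for a 17.5 cm uniform tube.
length = vocal_tract_length_cm([500.0, 1500.0, 2500.0, 3500.0])
print(f"{length:.1f} cm")
```

Under this approximation, lower formants overall correspond to a longer estimated tract; real vowels deviate from the uniform tube, which is why estimates are averaged over several formants here.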
Submitted 24 April, 2024;
originally announced April 2024.