Search | arXiv e-print repository

An evaluation of data augmentation methods for sound scene geotagging

Authors: Helen L. Bear, Veronica Morfi, Emmanouil Benetos

Abstract: Sound scene geotagging is a new topic of research which has evolved from acoustic scene classification. It is motivated by the idea of audio surveillance. Not content with only describing a scene in a recording, a machine which can locate where the recording was captured would be of use to many. In this paper we explore a series of common audio data augmentation methods to evaluate which best impr… ▽ More Sound scene geotagging is a new topic of research which has evolved from acoustic scene classification. It is motivated by the idea of audio surveillance. Not content with only describing a scene in a recording, a machine which can locate where the recording was captured would be of use to many. In this paper we explore a series of common audio data augmentation methods to evaluate which best improves the accuracy of audio geotagging classifiers. Our work improves on the state-of-the-art city geotagging method by 23% in terms of classification accuracy. △ Less

Submitted 9 October, 2021; originally announced October 2021.

Comments: Presented at Interspeech 2021

arXiv:2110.04584 [pdf, other]

Visually Exploring Multi-Purpose Audio Data

Authors: David Heise, Helen L. Bear

Abstract: We analyse multi-purpose audio using tools to visualise similarities within the data that may be observed via unsupervised methods. The success of machine learning classifiers is affected by the information contained within system inputs, so we investigate whether latent patterns within the data may explain performance limitations of such classifiers. We use the visual assessment of cluster tenden… ▽ More We analyse multi-purpose audio using tools to visualise similarities within the data that may be observed via unsupervised methods. The success of machine learning classifiers is affected by the information contained within system inputs, so we investigate whether latent patterns within the data may explain performance limitations of such classifiers. We use the visual assessment of cluster tendency (VAT) technique on a well known data set to observe how the samples naturally cluster, and we make comparisons to the labels used for audio geotagging and acoustic scene classification. We demonstrate that VAT helps to explain and corroborate confusions observed in prior work to classify this audio, yielding greater insight into the performance - and limitations - of supervised classification systems. While this exploratory analysis is conducted on data for which we know the "ground truth" labels, this method of visualising the natural groupings as dictated by the data leads to important questions about unlabelled data that can help the evaluation and realistic expectations of future (including self-supervised) classification systems. △ Less

Submitted 9 October, 2021; originally announced October 2021.

Comments: Presented at MMSP 2021

arXiv:2005.06650 [pdf, other]

Memory Controlled Sequential Self Attention for Sound Recognition

Authors: Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos

Abstract: In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recog… ▽ More In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recognition performance with the self attention induced SED model. We extend the proposed idea with a multi-head self attention mechanism where each attention head processes the audio embedding with explicit attention width values. The proposed use of memory controlled sequential self attention offers a way to induce relations among frames of sound event tokens. We show that our memory controlled self attention model achieves an event based F -score of 33.92% on the URBAN-SED dataset, outperforming the F -score of 20.10% reported by the model without self attention. △ Less

Submitted 5 August, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

Comments: Accepted to INTERSPEECH 2020

arXiv:1907.05122 [pdf, other]

Polyphonic Sound Event and Sound Activity Detection: A Multi-task approach

Authors: Arjun Pankajakshan, Helen L. Bear, Emmanouil Benetos

Abstract: Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is m… ▽ More Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is much lower than the segment-wise detection performance. In this work, we propose a joint model approach to improve the temporal localization of sound events using a multi-task learning setup. The first task predicts which sound events are present at each time frame; we call this branch 'Sound Event Detection (SED) model', while the second task predicts if a sound event is present or not at each frame; we call this branch 'Sound Activity Detection (SAD) model'. We verify the proposed joint model by comparing it with a separate implementation of both tasks aggregated together from individual task predictions. Our experiments on the URBAN-SED dataset show that the proposed joint model can alleviate False Positive (FP) and False Negative (FN) errors and improve both the segment-wise and the event-wise metrics. △ Less

Submitted 1 August, 2019; v1 submitted 11 July, 2019; originally announced July 2019.

Comments: Accepted to WASPAA 2019

arXiv:1905.00979 [pdf, other]

City classification from multiple real-world sound scenes

Authors: Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a… ▽ More The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like `park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes? In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. A simple adaptation to the class labels of pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches. △ Less

Submitted 29 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: Accepted to WASPAA 2019

arXiv:1904.10408 [pdf, other]

Towards joint sound scene and polyphonic sound event recognition

Authors: Helen L. Bear, Ines Nolasco, Emmanouil Benetos

Abstract: Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two separate tasks in the field of computational sound scene analysis. In this work, we present a new dataset with both sound scene and sound event labels and use this to demonstrate a novel method for jointly classifying sound scenes and recognizing sound events. We show that by taking a joint approach, learning is more effic… ▽ More Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two separate tasks in the field of computational sound scene analysis. In this work, we present a new dataset with both sound scene and sound event labels and use this to demonstrate a novel method for jointly classifying sound scenes and recognizing sound events. We show that by taking a joint approach, learning is more efficient and whilst improvements are still needed for sound event detection, SED results are robust in a dataset where the sample distribution is skewed towards sound scenes. △ Less

Submitted 1 July, 2019; v1 submitted 23 April, 2019; originally announced April 2019.

Comments: Accepted to Interspeech 2019

arXiv:1811.06330 [pdf, other]

Audio-based identification of beehive states

Authors: Inês Nolasco, Alessandro Terenzi, Stefania Cecchi, Simone Orcioni, Helen L. Bear, Emmanouil Benetos

Abstract: The absence of the queen in a beehive is a very strong indicator of the need for beekeeper intervention. Manually searching for the queen is an arduous recurrent task for beekeepers that disrupts the normal life cycle of the beehive and can be a source of stress for bees. Sound is an indicator for signalling different states of the beehive, including the absence of the queen bee. In this work, we… ▽ More The absence of the queen in a beehive is a very strong indicator of the need for beekeeper intervention. Manually searching for the queen is an arduous recurrent task for beekeepers that disrupts the normal life cycle of the beehive and can be a source of stress for bees. Sound is an indicator for signalling different states of the beehive, including the absence of the queen bee. In this work, we apply machine learning methods to automatically recognise different states in a beehive using audio as input. % The system is built on top of a method for beehive sound recognition in order to detect bee sounds from other external sounds. We investigate both support vector machines and convolutional neural networks for beehive state recognition, using audio data of beehives collected from the NU-Hive project. Results indicate the potential of machine learning methods as well as the challenges of generalizing the system to new hives. △ Less

Submitted 15 February, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

Comments: Accepted for ICASSP 2019

arXiv:1810.10597 [pdf, other]

The speaker-independent lipreading play-off; a survey of lipreading machines

Authors: Jake Burton, David Frank, Madhi Saleh, Nassir Navab, Helen L. Bear

Abstract: Lipreading is a difficult gesture classification task. One problem in computer lipreading is speaker-independence. Speaker-independence means to achieve the same accuracy on test speakers not included in the training set as speakers within the training set. Current literature is limited on speaker-independent lipreading, the few independent test speaker accuracy scores are usually aggregated withi… ▽ More Lipreading is a difficult gesture classification task. One problem in computer lipreading is speaker-independence. Speaker-independence means to achieve the same accuracy on test speakers not included in the training set as speakers within the training set. Current literature is limited on speaker-independent lipreading, the few independent test speaker accuracy scores are usually aggregated within dependent test speaker accuracies for an averaged performance. This leads to unclear independent results. Here we undertake a systematic survey of experiments with the TCD-TIMIT dataset using both conventional approaches and deep learning methods to provide a series of wholly speaker-independent benchmarks and show that the best speaker-independent machine scores 69.58% accuracy with CNN features and an SVM classifier. This is less than state of the art speaker-dependent lipreading machines, but greater than previously reported in independence experiments. △ Less

Submitted 24 October, 2018; originally announced October 2018.

Comments: To appear at the third IEEE International Conference on Image Processing, Applications and Systems 2018

arXiv:1809.10047 [pdf, other]

An extensible cluster-graph taxonomy for open set sound scene analysis

Authors: Helen L Bear, Emmanouil Benetos

Abstract: We present a new extensible and divisible taxonomy for open set sound scene analysis. This new model allows complex scene analysis with tangible descriptors and perception labels. Its novel structure is a cluster graph such that each cluster (or subset) can stand alone for targeted analyses such as office sound event detection, whilst maintaining integrity over the whole graph (superset) of labels… ▽ More We present a new extensible and divisible taxonomy for open set sound scene analysis. This new model allows complex scene analysis with tangible descriptors and perception labels. Its novel structure is a cluster graph such that each cluster (or subset) can stand alone for targeted analyses such as office sound event detection, whilst maintaining integrity over the whole graph (superset) of labels. The key design benefit is its extensibility as new labels are needed during new data capture. Furthermore, datasets which use the same taxonomy are easily augmented, saving future data collection effort. We balance the details needed for complex scene analysis with avoiding 'the taxonomy of everything' with our framework to ensure no duplicity in the superset of labels and demonstrate this with DCASE challenge classifications. △ Less

Submitted 26 September, 2018; originally announced September 2018.

Comments: To be presented at Detection and Classification of Audio Scenes and Events (DCASE) workshop, November 2018

arXiv:1809.06800 [pdf, other]

doi 10.13140/RG.2.2.15396.94081

Visual Speech Language Models

Authors: Helen L Bear

Abstract: Language models (LM) are very powerful in lipreading systems. Language models built upon the ground truth utterances of datasets learn grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LM's as visually, people do not utter what the language model… ▽ More Language models (LM) are very powerful in lipreading systems. Language models built upon the ground truth utterances of datasets learn grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LM's as visually, people do not utter what the language model expects. These models are commonplace but while higher-order N-gram LM's may improve classification rates, the cost of this model is disproportionate to the common goal of developing more accurate classifiers. So we compare which unit would best optimize a lipreading (visual speech) LM to observe their limitations. We compare three units; visemes (visual speech units) \cite{lan2010improving}, phonemes (audible speech units), and words. △ Less

Submitted 14 September, 2018; originally announced September 2018.

Comments: Extended abstract based on Decoding Visemes: improving machine lipreading, Bear & Harvey, ICASSP 2016

arXiv:1805.02948 [pdf, other]

Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals

Authors: Helen L Bear, Richard Harvey

Abstract: Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly… ▽ More Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. So here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme to viseme maps to examine how similarly speakers talk visually and we use signed rank tests to measure the distance between individuals. We conclude that broadly speaking, speakers have the same repertoire of mouth gestures, where they differ is in the use of the gestures. △ Less

Submitted 8 May, 2018; originally announced May 2018.

Journal ref: Computer Speech and Language, May 2018

arXiv:1805.02934 [pdf, other]

doi 10.1016/j.specom.2017.07.001

Phoneme-to-viseme mappings: the good, the bad, and the ugly

Authors: Helen L Bear, Richard Harvey

Abstract: Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many to one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguit… ▽ More Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many to one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, `Bear' visemes, are shown to perform better than previously known units. △ Less

Submitted 8 May, 2018; originally announced May 2018.

Journal ref: Speech Communication, Special Issue on AV expressive speech. 2017

arXiv:1805.02924 [pdf, other]

Comparing phonemes and visemes with DNN-based lipreading

Authors: Kwanchiva Thangthai, Helen L Bear, Richard Harvey

Abstract: There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report t… ▽ More There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as a representation of mouth ROI image. The phoneme lipreading system word accuracy outperforms the viseme based system word accuracy. However, the phoneme system achieved lower accuracy at the unit level which shows the importance of the dictionary for decoding classification outputs into words. △ Less

Submitted 8 May, 2018; originally announced May 2018.

Journal ref: BMVC Lipreading Workshop 2017

arXiv:1710.01351 [pdf, other]

Understanding the visual speech signal

Authors: Helen L Bear

Abstract: For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and demons… ▽ More For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and demonstrate how one can use visemes to boost lipreading. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Comments: Computer Vision and Pattern Recognition (CVPR) Women in Computer Vision (WiCV) workshop. 2017

arXiv:1710.01297 [pdf, other]

Visual gesture variability between talkers in continuous visual speech

Authors: Helen L Bear

Abstract: Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop end-to-end systems holistically or, we experiment to further our understanding of the visual speech signal. The latter option is more difficult but this knowledge would enable researchers to both improve systems and apply the new know… ▽ More Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop end-to-end systems holistically or, we experiment to further our understanding of the visual speech signal. The latter option is more difficult but this knowledge would enable researchers to both improve systems and apply the new knowledge to other domains such as speech therapy. One challenge in lipreading systems is the correct labeling of the classifiers. These labels map an estimated function between visemes on the lips and the phonemes uttered. Here we ask if such maps are speaker-dependent? Prior work investigated isolated word recognition from speaker-dependent (SD) visemes, we extend this to continuous speech. Benchmarked against SD results, and the isolated words performance, we test with RMAV dataset speakers and observe that with continuous speech, the trajectory between visemes has a greater negative effect on the speaker differentiation. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L Bear. Visual gesture variability between talkers in continuous visual speech. British Machine Vision Conference (BMVC) Deep learning for machine lip reading workshop. 2017

arXiv:1710.01292 [pdf, other]

Visual speech recognition: aligning terminologies for better understanding

Authors: Helen L Bear, Sarah Taylor

Abstract: We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems. But now, the computer vision community is also participating. This joining of two previously disparate areas with different perspectives on computer lipreading is creating opportunities for collaborations, but in doing so the literature is experiencing challenges in kno… ▽ More We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems. But now, the computer vision community is also participating. This joining of two previously disparate areas with different perspectives on computer lipreading is creating opportunities for collaborations, but in doing so the literature is experiencing challenges in knowledge sharing due to multiple uses of terms and phrases and the range of methods for scoring results. In particular we highlight three areas with the intention to improve communication between those researching lipreading; the effects of interchanging between speech reading and lipreading; speaker dependence across train, validation, and test splits; and the use of accuracy, correctness, errors, and varying units (phonemes, visemes, words, and sentences) to measure system performance. We make recommendations as to how we can be more consistent. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L Bear and Sarah Taylor. Visual speech recognition: aligning terminologies for better understanding. British Machine Vision Conference (BMVC) Deep learning for machine lip reading workshop. 2017

arXiv:1710.01288 [pdf, other]

Decoding visemes: improving machine lipreading

Authors: Helen L Bear

Abstract: Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing & computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech or; the parameters of the video recording e.g, video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machin… ▽ More Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing & computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech or; the parameters of the video recording e.g, video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter, because there are more phonemes per viseme, maps between units show a many-to-one relationship. Many maps have been presented, we compare these and our results show Lee's is best. We propose a new method of speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR, that classifiers need training on test subjects to achieve accuracy. Thus machine lipreading is highly speaker-dependent. Conversely speaker independence is robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show there is not a high variability of visemes, but there is high variability in trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and the optimal size, which varies by speaker, ranges from 11-35. Finally we decode from visemes back to phonemes and into words. Our novel approach uses the optimum range visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Comments: PhD thesis. Computer Vision and Pattern Recognition (CVPR), Women in Computer Vision (WiCV) Workshop 2017

Journal ref: Helen L Bear. Decoding visemes: improving lipreading (PhD thesis). University of East Anglia. July 2016

arXiv:1710.01169 [pdf, other]

Decoding visemes: improving machine lipreading

Authors: Helen L. Bear, Richard Harvey

Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses p… ▽ More To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance which significantly improves on previous lip-reading results. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L Bear and Richard Harvey. Decoding visemes: improving machine lipreading. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. p2009-2013

arXiv:1710.01142 [pdf, other]

Finding phonemes: improving machine lip-reading

Authors: Helen L. Bear, Richard W. Harvey, Yuxuan Lan

Abstract: In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated pho… ▽ More In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated phonemes, which have been confused during phoneme recognition, into viseme groups. Using these maps, with the LiLIR dataset, we show the effect of changing the viseme map size in speaker-dependent machine lip-reading, measured by word recognition correctness and so demonstrate that word recognition with phoneme classifiers is not just possible, but often better than word recognition with viseme classifiers. Furthermore, there are intermediate units between visemes and phonemes which are better still. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L. Bear, Richard W. Harvey, Yuxuan Lan. Finding phonemes: improving machine lip-reading. Audio-Visual Speech Processing (AVSP), 2015 p115-120

arXiv:1710.01122 [pdf, other]

Speaker-independent machine lip-reading with speaker-dependent viseme classifiers

Authors: Helen L. Bear, Stephen J. Cox, Richard W. Harvey

Abstract: In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that broadly speaking, spea… ▽ More In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that broadly speaking, speakers have the same repertoire of mouth gestures, where they differ is in the use of the gestures. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L. Bear, Stephen J. Cox, Richard W. Harvey, Speaker-independent machine lip-reading with speaker-dependent viseme classifiers. Audio-Visual Speech Processing (AVSP) 2015, p190-195

arXiv:1710.01093 [pdf, other]

Which phoneme-to-viseme maps best improve visual-only computer lip-reading?

Authors: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, Yuxuan Lan

Abstract: A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings… ▽ More A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings and consider if any are stable across talkers. We show a method for devising maps based on phoneme confusions from an automated lip-reading system, and we present new mappings that show improvements for individual talkers. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, and Yuxuan Lan. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? Advances in Visual Computing 2014. p230-239

arXiv:1710.01084 [pdf, other]

Some observations on computer lip-reading: moving from the dream to the reality

Authors: Helen L. Bear, Gari Owen, Richard Harvey, Barry-John Theobald

Abstract: In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution,… ▽ More In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L. Bear, Gari Owen, Richard Harvey, and Barry-John Theobald. Some observations on computer lip-reading: moving from the dream to the reality. International Society for Optics and Photonics- Security and defence. 2014. p92530G--92530G

arXiv:1710.01073 [pdf, other]

Resolution limits on visual speech recognition

Authors: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan

Abstract: Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test… ▽ More Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so we can measure the affect of video resolution on recognition accuracy. We conclude that, contrary to common practice, resolution need not be that great for automatic lip-reading. However it is highly unlikely that automatic lip-reading can work reliably when the distance between the bottom of the lower lip and the top of the upper lip is less than four pixels at rest. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Journal ref: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan. Resolution limits on visual speech recognition. International Conference on Image Processing (ICIP). 2014. p1371-1375

Showing 1–23 of 23 results for author: Bear, H L