Showing 1–23 of 23 results for author: Bear, H L

  1. arXiv:2110.04585

    eess.AS cs.SD

    An evaluation of data augmentation methods for sound scene geotagging

    Authors: Helen L. Bear, Veronica Morfi, Emmanouil Benetos

    Abstract: Sound scene geotagging is a new topic of research which has evolved from acoustic scene classification. It is motivated by the idea of audio surveillance. Not content with only describing a scene in a recording, a machine which can locate where the recording was captured would be of use to many. In this paper we explore a series of common audio data augmentation methods to evaluate which best impr…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: Presented at Interspeech 2021

  2. arXiv:2110.04584

    eess.AS cs.HC cs.SD

    Visually Exploring Multi-Purpose Audio Data

    Authors: David Heise, Helen L. Bear

    Abstract: We analyse multi-purpose audio using tools to visualise similarities within the data that may be observed via unsupervised methods. The success of machine learning classifiers is affected by the information contained within system inputs, so we investigate whether latent patterns within the data may explain performance limitations of such classifiers. We use the visual assessment of cluster tenden…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: Presented at MMSP 2021

  3. arXiv:2005.06650

    eess.AS cs.LG cs.SD

    Memory Controlled Sequential Self Attention for Sound Recognition

    Authors: Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos

    Abstract: In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recog…

    Submitted 5 August, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

    Comments: Accepted to INTERSPEECH 2020

  4. arXiv:1907.05122

    eess.AS cs.SD

    Polyphonic Sound Event and Sound Activity Detection: A Multi-task approach

    Authors: Arjun Pankajakshan, Helen L. Bear, Emmanouil Benetos

    Abstract: Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is m…

    Submitted 1 August, 2019; v1 submitted 11 July, 2019; originally announced July 2019.

    Comments: Accepted to WASPAA 2019

  5. arXiv:1905.00979

    eess.AS cs.SD

    City classification from multiple real-world sound scenes

    Authors: Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

    Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a…

    Submitted 29 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

    Comments: Accepted to WASPAA 2019

  6. arXiv:1904.10408

    eess.AS cs.SD

    Towards joint sound scene and polyphonic sound event recognition

    Authors: Helen L. Bear, Ines Nolasco, Emmanouil Benetos

    Abstract: Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two separate tasks in the field of computational sound scene analysis. In this work, we present a new dataset with both sound scene and sound event labels and use this to demonstrate a novel method for jointly classifying sound scenes and recognizing sound events. We show that by taking a joint approach, learning is more effic…

    Submitted 1 July, 2019; v1 submitted 23 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  7. arXiv:1811.06330

    cs.SD eess.AS

    Audio-based identification of beehive states

    Authors: Inês Nolasco, Alessandro Terenzi, Stefania Cecchi, Simone Orcioni, Helen L. Bear, Emmanouil Benetos

    Abstract: The absence of the queen in a beehive is a very strong indicator of the need for beekeeper intervention. Manually searching for the queen is an arduous recurrent task for beekeepers that disrupts the normal life cycle of the beehive and can be a source of stress for bees. Sound is an indicator for signalling different states of the beehive, including the absence of the queen bee. In this work, we…

    Submitted 15 February, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: Accepted for ICASSP 2019

  8. arXiv:1810.10597

    cs.CV eess.AS

    The speaker-independent lipreading play-off; a survey of lipreading machines

    Authors: Jake Burton, David Frank, Madhi Saleh, Nassir Navab, Helen L. Bear

    Abstract: Lipreading is a difficult gesture classification task. One problem in computer lipreading is speaker-independence. Speaker-independence means to achieve the same accuracy on test speakers not included in the training set as speakers within the training set. Current literature is limited on speaker-independent lipreading, the few independent test speaker accuracy scores are usually aggregated withi…

    Submitted 24 October, 2018; originally announced October 2018.

    Comments: To appear at the third IEEE International Conference on Image Processing, Applications and Systems 2018

  9. arXiv:1809.10047

    cs.SD eess.AS

    An extensible cluster-graph taxonomy for open set sound scene analysis

    Authors: Helen L Bear, Emmanouil Benetos

    Abstract: We present a new extensible and divisible taxonomy for open set sound scene analysis. This new model allows complex scene analysis with tangible descriptors and perception labels. Its novel structure is a cluster graph such that each cluster (or subset) can stand alone for targeted analyses such as office sound event detection, whilst maintaining integrity over the whole graph (superset) of labels…

    Submitted 26 September, 2018; originally announced September 2018.

    Comments: To be presented at Detection and Classification of Audio Scenes and Events (DCASE) workshop, November 2018

  10. Visual Speech Language Models

    Authors: Helen L Bear

    Abstract: Language models (LM) are very powerful in lipreading systems. Language models built upon the ground truth utterances of datasets learn grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LM's as visually, people do not utter what the language model…

    Submitted 14 September, 2018; originally announced September 2018.

    Comments: Extended abstract based on Decoding Visemes: improving machine lipreading, Bear & Harvey, ICASSP 2016

  11. arXiv:1805.02948

    eess.IV cs.CV cs.SD eess.AS

    Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals

    Authors: Helen L Bear, Richard Harvey

    Abstract: Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly…

    Submitted 8 May, 2018; originally announced May 2018.

    Journal ref: Computer Speech and Language, May 2018

  12. arXiv:1805.02934

    cs.CV cs.SD eess.AS eess.IV

    Phoneme-to-viseme mappings: the good, the bad, and the ugly

    Authors: Helen L Bear, Richard Harvey

    Abstract: Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many to one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguit…

    Submitted 8 May, 2018; originally announced May 2018.

    Journal ref: Speech Communication, Special Issue on AV expressive speech. 2017

  13. arXiv:1805.02924

    cs.CV cs.CL cs.SD eess.AS eess.IV

    Comparing phonemes and visemes with DNN-based lipreading

    Authors: Kwanchiva Thangthai, Helen L Bear, Richard Harvey

    Abstract: There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report t…

    Submitted 8 May, 2018; originally announced May 2018.

    Journal ref: BMVC Lipreading Workshop 2017

  14. arXiv:1710.01351

    cs.CV eess.AS

    Understanding the visual speech signal

    Authors: Helen L Bear

    Abstract: For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and demons…

    Submitted 3 October, 2017; originally announced October 2017.

    Comments: Computer Vision and Pattern Recognition (CVPR) Women in Computer Vision (WiCV) workshop. 2017

  15. arXiv:1710.01297

    cs.CV eess.AS

    Visual gesture variability between talkers in continuous visual speech

    Authors: Helen L Bear

    Abstract: Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop end-to-end systems holistically or, we experiment to further our understanding of the visual speech signal. The latter option is more difficult but this knowledge would enable researchers to both improve systems and apply the new know…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L Bear. Visual gesture variability between talkers in continuous visual speech. British Machine Vision Conference (BMVC) Deep learning for machine lip reading workshop. 2017

  16. arXiv:1710.01292

    cs.CV eess.AS

    Visual speech recognition: aligning terminologies for better understanding

    Authors: Helen L Bear, Sarah Taylor

    Abstract: We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems. But now, the computer vision community is also participating. This joining of two previously disparate areas with different perspectives on computer lipreading is creating opportunities for collaborations, but in doing so the literature is experiencing challenges in kno…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L Bear and Sarah Taylor. Visual speech recognition: aligning terminologies for better understanding. British Machine Vision Conference (BMVC) Deep learning for machine lip reading workshop. 2017

  17. arXiv:1710.01288

    cs.CV eess.AS

    Decoding visemes: improving machine lipreading

    Authors: Helen L Bear

    Abstract: Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing & computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech or; the parameters of the video recording e.g, video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machin…

    Submitted 3 October, 2017; originally announced October 2017.

    Comments: PhD thesis. Computer Vision and Pattern Recognition (CVPR), Women in Computer Vision (WiCV) Workshop 2017

    Journal ref: Helen L Bear. Decoding visemes: improving lipreading (PhD thesis). University of East Anglia. July 2016

  18. arXiv:1710.01169

    cs.CV eess.AS

    Decoding visemes: improving machine lipreading

    Authors: Helen L. Bear, Richard Harvey

    Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses p…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L Bear and Richard Harvey. Decoding visemes: improving machine lipreading. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. p2009-2013

  19. arXiv:1710.01142

    cs.CV cs.CL eess.AS

    Finding phonemes: improving machine lip-reading

    Authors: Helen L. Bear, Richard W. Harvey, Yuxuan Lan

    Abstract: In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated pho…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Richard W. Harvey, Yuxuan Lan. Finding phonemes: improving machine lip-reading. Audio-Visual Speech Processing (AVSP), 2015 p115-120

  20. arXiv:1710.01122

    cs.CV eess.AS

    Speaker-independent machine lip-reading with speaker-dependent viseme classifiers

    Authors: Helen L. Bear, Stephen J. Cox, Richard W. Harvey

    Abstract: In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that broadly speaking, spea…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Stephen J. Cox, Richard W. Harvey, Speaker-independent machine lip-reading with speaker-dependent viseme classifiers. Audio-Visual Speech Processing (AVSP) 2015, p190-195

  21. arXiv:1710.01093

    cs.CV cs.CL eess.AS

    Which phoneme-to-viseme maps best improve visual-only computer lip-reading?

    Authors: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, Yuxuan Lan

    Abstract: A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, and Yuxuan Lan. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? Advances in Visual Computing 2014. p230-239

  22. arXiv:1710.01084

    cs.CV eess.IV

    Some observations on computer lip-reading: moving from the dream to the reality

    Authors: Helen L. Bear, Gari Owen, Richard Harvey, Barry-John Theobald

    Abstract: In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution,…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Gari Owen, Richard Harvey, and Barry-John Theobald. Some observations on computer lip-reading: moving from the dream to the reality. International Society for Optics and Photonics- Security and defence. 2014. p92530G--92530G

  23. arXiv:1710.01073

    cs.CV eess.IV

    Resolution limits on visual speech recognition

    Authors: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan

    Abstract: Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan. Resolution limits on visual speech recognition. International Conference on Image Processing (ICIP). 2014. p1371-1375