Showing 1–9 of 9 results for author: Kajarekar, S

  1. arXiv:2202.03587  [pdf, other]

    eess.AS cs.SD eess.SP

    CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

    Authors: Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou

    Abstract: Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acou…

    Submitted 7 February, 2022; originally announced February 2022.

  2. arXiv:2110.04656  [pdf, other]

    cs.SD cs.LG eess.AS

    Streaming on-device detection of device directed speech from voice and touch-based invocation

    Authors: Ognjen Rudovic, Akanksha Bindal, Vineet Garg, Pramod Simha, Pranay Dighe, Sachin Kajarekar

    Abstract: When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can accidentally be invoked by the keyword-like speech or accidental button press, which may have implications on user experience and privacy. To this end, we propose an acoustic false-t…

    Submitted 9 October, 2021; originally announced October 2021.

  3. arXiv:2106.11759  [pdf, other]

    eess.AS cs.AI cs.CL cs.CV cs.LG cs.SD

    Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

    Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham

    Abstract: Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetition…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 page reference, 2 figures

  4. arXiv:2102.12394  [pdf, other]

    eess.AS cs.SD

    SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter

    Authors: Colin Lea, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, Jeffrey P. Bigham

    Abstract: The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work,…

    Submitted 24 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021

  5. arXiv:2010.10591  [pdf, other]

    eess.AS cs.LG cs.SD

    Knowledge Transfer for Efficient On-device False Trigger Mitigation

    Authors: Pranay Dighe, Erik Marchi, Srikanth Vishnubhotla, Sachin Kajarekar, Devang Naik

    Abstract: In this paper, we address the task of determining whether a given utterance is directed towards a voice-enabled smart-assistant device or not. An undirected utterance is termed as a "false trigger" and false trigger mitigation (FTM) is essential for designing a privacy-centric non-intrusive smart assistant. The directedness of an utterance can be identified by running automatic speech recognition…

    Submitted 20 October, 2020; originally announced October 2020.

  6. arXiv:2008.00620  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Audiovisual Speech Synthesis using Tacotron2

    Authors: Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar

    Abstract: Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a s…

    Submitted 29 August, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

    Comments: This work has been submitted to the 23rd ACM International Conference on Multimodal Interaction for possible publication

  7. arXiv:2004.12031  [pdf, ps, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    On the Role of Visual Cues in Audiovisual Speech Enhancement

    Authors: Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz

    Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of…

    Submitted 25 February, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: ICASSP 2021

  8. arXiv:2002.01323  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions

    Authors: Vasudha Kowtha, Vikramjit Mitra, Chris Bartels, Erik Marchi, Sue Booker, William Caruso, Sachin Kajarekar, Devang Naik

    Abstract: Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity. While modern speech technologies rely heavily on speech recognition and natural language understanding for speech content understanding, the investigation of vocal expression is increasingly gaining attention. Key considerations for building robust emotion…

    Submitted 30 January, 2020; originally announced February 2020.

    Comments: 5 pages

  9. arXiv:2001.10816  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Multi-task Learning for Speaker Verification and Voice Trigger Detection

    Authors: Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle

    Abstract: Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classific…

    Submitted 26 January, 2020; originally announced January 2020.

    Journal ref: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Spain, 2020, pp. 6844-6848