Search | arXiv e-print repository

Improving End-to-End SLU performance with Prosodic Attention and Distillation

Abstract: Most End-to-End SLU methods depend on the pretrained ASR or language model features for intent prediction. However, other essential information in speech, such as prosody, is often ignored. Recent research has shown improved results in classifying dialogue acts by incorporating prosodic information. The margins of improvement in these methods are minimal as the neural models ignore prosodic featur… ▽ More Most End-to-End SLU methods depend on the pretrained ASR or language model features for intent prediction. However, other essential information in speech, such as prosody, is often ignored. Recent research has shown improved results in classifying dialogue acts by incorporating prosodic information. The margins of improvement in these methods are minimal as the neural models ignore prosodic features. In this work, we propose prosody-attention, which uses the prosodic features differently to generate attention maps across time frames of the utterance. Then we propose prosody-distillation to explicitly learn the prosodic information in the acoustic encoder rather than concatenating the implicit prosodic features. Both the proposed methods improve the baseline results, and the prosody-distillation method gives an intent classification accuracy improvement of 8\% and 2\% on SLURP and STOP datasets over the prosody baseline. △ Less

Submitted 14 May, 2023; originally announced May 2023.

Comments: Submitted to InterSpeech 2023

arXiv:2212.13015 [pdf, other]

Skit-S2I: An Indian Accented Speech to Intent dataset

Authors: Shangeth Rajaa, Swaraj Dalmia, Kumarmanas Nethil

Abstract: Conventional conversation assistants extract text transcripts from the speech signal using automatic speech recognition (ASR) and then predict intent from the transcriptions. Using end-to-end spoken language understanding (SLU), the intents of the speaker are predicted directly from the speech signal without requiring intermediate text transcripts. As a result, the model can optimize directly for… ▽ More Conventional conversation assistants extract text transcripts from the speech signal using automatic speech recognition (ASR) and then predict intent from the transcriptions. Using end-to-end spoken language understanding (SLU), the intents of the speaker are predicted directly from the speech signal without requiring intermediate text transcripts. As a result, the model can optimize directly for intent classification and avoid cascading errors from ASR. The end-to-end SLU system also helps in reducing the latency of the intent prediction model. Although many datasets are available publicly for text-to-intent tasks, the availability of labeled speech-to-intent datasets is limited, and there are no datasets available in the Indian accent. In this paper, we release the Skit-S2I dataset, the first publicly available Indian-accented SLU dataset in the banking domain in a conversational tonality. We experiment with multiple baselines, compare different pretrained speech encoder's representations, and find that SSL pretrained representations perform slightly better than ASR pretrained representations lacking prosodic features for speech-to-intent classification. The dataset and baseline code is available at \url{https://github.com/skit-ai/speech-to-intent-dataset} △ Less

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2110.13653 [pdf, other]

Learning Speaker Representation with Semi-supervised Learning approach for Speaker Profiling

Authors: Shangeth Rajaa, Pham Van Tung, Chng Eng Siong

Abstract: Speaker profiling, which aims to estimate speaker characteristics such as age and height, has a wide range of applications inforensics, recommendation systems, etc. In this work, we propose a semisupervised learning approach to mitigate the issue of low training data for speaker profiling. This is done by utilizing external corpus with speaker information to train a better representation which can… ▽ More Speaker profiling, which aims to estimate speaker characteristics such as age and height, has a wide range of applications inforensics, recommendation systems, etc. In this work, we propose a semisupervised learning approach to mitigate the issue of low training data for speaker profiling. This is done by utilizing external corpus with speaker information to train a better representation which can help to improve the speaker profiling systems. Specifically, besides the standard supervised learning path, the proposed framework has two more paths: (1) an unsupervised speaker representation learning path that helps to capture the speaker information; (2) a consistency training path that helps to improve the robustness of the system by enforcing it to produce similar predictions for utterances of the same speaker.The proposed approach is evaluated on the TIMIT and NISP datasets for age, height, and gender estimation, while the Librispeech is used as the unsupervised external corpus. Trained both on single-task and multi-task settings, our approach was able to achieve state-of-the-art results on age estimation on the TIMIT Test dataset with Root Mean Square Error(RMSE) of6.8 and 7.4 years and Mean Absolute Error(MAE) of 4.8 and5.0 years for male and female speakers respectively. △ Less

Submitted 24 October, 2021; originally announced October 2021.

Comments: 5 pages, 4 figures

arXiv:2004.13122 [pdf]

Development of a Machine-Learning System to Classify Lung CT Scan Images into Normal/COVID-19 Class

Authors: Seifedine Kadry, Venkatesan Rajinikanth, Seungmin Rho, Nadaradjane Sri Madhava Raja, Vaddi Seshagiri Rao, Krishnan Palani Thanaraj

Abstract: Recently, the lung infection due to Coronavirus Disease (COVID-19) affected a large human group worldwide and the assessment of the infection rate in the lung is essential for treatment planning. This research aims to propose a Machine-Learning-System (MLS) to detect the COVID-19 infection using the CT scan Slices (CTS). This MLS implements a sequence of methods, such as multi-thresholding, image… ▽ More Recently, the lung infection due to Coronavirus Disease (COVID-19) affected a large human group worldwide and the assessment of the infection rate in the lung is essential for treatment planning. This research aims to propose a Machine-Learning-System (MLS) to detect the COVID-19 infection using the CT scan Slices (CTS). This MLS implements a sequence of methods, such as multi-thresholding, image separation using threshold filter, feature-extraction, feature-selection, feature-fusion and classification. The initial part implements the Chaotic-Bat-Algorithm and Kapur's Entropy (CBA+KE) thresholding to enhance the CTS. The threshold filter separates the image into two segments based on a chosen threshold 'Th'. The texture features of these images are extracted, refined and selected using the chosen procedures. Finally, a two-class classifier system is implemented to categorize the chosen CTS (n=500 with a pixel dimension of 512x512x1) into normal/COVID-19 group. In this work, the classifiers, such as Naive Bayes (NB), k-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF) and Support Vector Machine with linear kernel (SVM) are implemented and the classification task is performed using various feature vectors. The experimental outcome of the SVM with Fused-Feature-Vector (FFV) helped to attain a detection accuracy of 89.80%. △ Less

Submitted 24 April, 2020; originally announced April 2020.

Comments: 16 PAGES

arXiv:2004.10246 [pdf]

Music Generation with Temporal Structure Augmentation

Authors: Shakeel Raja

Abstract: In this paper we introduce a novel feature augmentation approach for generating structured musical compositions comprising melodies and harmonies. The proposed method augments a connectionist generation model with count-down to song conclusion and meter markers as extra input features to study whether neural networks can learn to produce more aesthetically pleasing and structured musical output as… ▽ More In this paper we introduce a novel feature augmentation approach for generating structured musical compositions comprising melodies and harmonies. The proposed method augments a connectionist generation model with count-down to song conclusion and meter markers as extra input features to study whether neural networks can learn to produce more aesthetically pleasing and structured musical output as a consequence of augmenting the input data with structural features. An RNN architecture with LSTM cells is trained on the Nottingham folk music dataset in a supervised sequence learning setup, following a Music Language Modelling approach, and then applied to generation of harmonies and melodies. Our experiments show an improved prediction performance for both types of annotation. The generated music was also subjectively evaluated using an on-line Turing style listening test which confirms a substantial improvement in the aesthetic quality and in the perceived structure of the music generated using the temporal structure. △ Less

Submitted 21 April, 2020; originally announced April 2020.

arXiv:2004.03431 [pdf]

Harmony-Search and Otsu based System for Coronavirus Disease (COVID-19) Detection using Lung CT Scan Images

Authors: V. Rajinikanth, Nilanjan Dey, Alex Noel Joseph Raj, Aboul Ella Hassanien, K. C. Santosh, N. Sri Madhava Raja

Abstract: Pneumonia is one of the foremost lung diseases and untreated pneumonia will lead to serious threats for all age groups. The proposed work aims to extract and evaluate the Coronavirus disease (COVID-19) caused pneumonia infection in lung using CT scans. We propose an image-assisted system to extract COVID-19 infected sections from lung CT scans (coronal view). It includes following steps: (i) Thres… ▽ More Pneumonia is one of the foremost lung diseases and untreated pneumonia will lead to serious threats for all age groups. The proposed work aims to extract and evaluate the Coronavirus disease (COVID-19) caused pneumonia infection in lung using CT scans. We propose an image-assisted system to extract COVID-19 infected sections from lung CT scans (coronal view). It includes following steps: (i) Threshold filter to extract the lung region by eliminating possible artifacts; (ii) Image enhancement using Harmony-Search-Optimization and Otsu thresholding; (iii) Image segmentation to extract infected region(s); and (iv) Region-of-interest (ROI) extraction (features) from binary image to compute level of severity. The features that are extracted from ROI are then employed to identify the pixel ratio between the lung and infection sections to identify infection level of severity. The primary objective of the tool is to assist the pulmonologist not only to detect but also to help plan treatment process. As a consequence, for mass screening processing, it will help prevent diagnostic burden. △ Less

Submitted 6 April, 2020; originally announced April 2020.

Comments: 13 pages

Showing 1–6 of 6 results for author: Raja, S