Search | arXiv e-print repository

Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

Authors: Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik

Abstract: Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Accepted at INTERSPEECH 2025, 5 pages, 4 figures, 2 tables
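
The two simple fusion strategies named in the abstract above (concatenation and addition) can be illustrated with a minimal sketch. This is not the FusionVAD implementation: the function names are hypothetical, and the sketch assumes the MFCC and PTM features are already time-aligned frame by frame and, for addition, already projected to the same dimension.

```python
def fuse_concat(mfcc_frames, ptm_frames):
    """Concatenate per-frame feature vectors (assumed time-aligned)."""
    if len(mfcc_frames) != len(ptm_frames):
        raise ValueError("feature streams must have the same number of frames")
    # List concatenation: output dimension is dim(MFCC) + dim(PTM).
    return [m + p for m, p in zip(mfcc_frames, ptm_frames)]


def fuse_add(mfcc_frames, ptm_frames):
    """Add per-frame feature vectors element-wise (assumed same dimension)."""
    if len(mfcc_frames) != len(ptm_frames):
        raise ValueError("feature streams must have the same number of frames")
    fused = []
    for m, p in zip(mfcc_frames, ptm_frames):
        if len(m) != len(p):
            raise ValueError("addition requires equal feature dimensions")
        # Element-wise sum: output dimension equals the input dimension.
        fused.append([a + b for a, b in zip(m, p)])
    return fused
```

Addition keeps the fused dimension equal to the input dimension, whereas concatenation grows it, which is one plausible reason a downstream classifier on added features can be cheaper.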

arXiv:2412.17823 [pdf]

doi 10.1016/j.heliyon.2024.e39268

RUL forecasting for wind turbine predictive maintenance based on deep learning

Authors: Syed Shazaib Shah, Tan Daoliang, Sah Chandan Kumar

Abstract: Predictive maintenance (PdM) is increasingly pursued to reduce wind farm operation and maintenance costs by accurately predicting the remaining useful life (RUL) and strategically scheduling maintenance. However, the remoteness of wind farms often renders current methodologies ineffective, as they fail to provide a sufficiently reliable advance time window for maintenance planning, limiting PdM's practicality. This study introduces a novel deep learning (DL) methodology for future RUL forecasting. Using a multi-parametric attention-based DL approach that bypasses feature engineering, thereby minimizing the risk of human error, two models, ForeNet-2d and ForeNet-3d, are proposed. These models successfully forecast the RUL for seven multifaceted wind turbine (WT) failures with a 2-week forecast window. The most precise forecast deviated by only 10 minutes from the actual RUL, while the least accurate prediction deviated by 1.8 days, with most predictions being off by only a few hours. This methodology offers a substantial time frame to access remote WTs and perform necessary maintenance, thereby enabling the practical implementation of PdM.

Submitted 9 December, 2024; originally announced December 2024.

Comments: 19 pages, 16 figures, Journal Paper

Report number: Volume 10, Issue 20, e39268, October 30, 2024; MSC Class: 14J60 (Primary)

Journal ref: Heliyon, Volume 10, Issue 20, e39268, October 30, 2024

arXiv:2407.08655 [pdf, other]

SPOCKMIP: Segmentation of Vessels in MRAs with Enhanced Continuity using Maximum Intensity Projection as Loss

Authors: Chethan Radhakrishna, Karthikesh Varma Chintalapati, Sri Chandana Hudukula Ram Kumar, Raviteja Sutrave, Hendrik Mattern, Oliver Speck, Andreas Nürnberger, Soumick Chatterjee

Abstract: Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial continuity of such features across dimensions. Semi-supervised patch-based approaches have been effective in identifying small vessels of one to two voxels in diameter. This study focuses on improving the segmentation quality by considering the spatial correlation of the features using the Maximum Intensity Projection (MIP) as an additional loss criterion. Two methods are proposed with the incorporation of MIPs of label segmentation on the single (z-axis) and multiple perceivable axes of the 3D volume. The proposed MIP-based methods produce segmentations with improved vessel continuity, which is evident in visual examinations of ROIs. Patch-based training is improved by introducing an additional loss term, MIP loss, to penalise the predicted discontinuity of vessels. A training set of 14 volumes is selected from the StudyForrest dataset comprising 18 7-Tesla 3D Time-of-Flight (ToF) Magnetic Resonance Angiography (MRA) images. The generalisation performance of the method is evaluated using the other unseen volumes in the dataset. It is observed that the proposed method with multi-axes MIP loss produces better quality segmentations with a median Dice of $80.245 \pm 0.129$. Also, the method with single-axis MIP loss produces segmentations with a median Dice of $79.749 \pm 0.109$. Furthermore, a visual comparison of the ROIs in the predicted segmentation reveals a significant improvement in the continuity of the vessels when MIP loss is incorporated into training.

Submitted 11 July, 2024; originally announced July 2024.
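
To make the idea of a MIP-based loss term in the abstract above more concrete, here is a minimal sketch. It is not the SPOCKMIP implementation: the function names are hypothetical, volumes are plain nested lists indexed as volume[z][y][x], and the mean squared difference between projections is an assumed form of the penalty chosen only for illustration.

```python
def mip(volume, axis):
    """Maximum intensity projection of volume[z][y][x] along axis (0=z, 1=y, 2=x)."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # collapse z, result indexed [y][x]
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == 1:  # collapse y, result indexed [z][x]
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == 2:  # collapse x, result indexed [z][y]
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError("axis must be 0, 1, or 2")


def mip_loss(pred, label, axes=(0,)):
    """Mean squared difference between MIPs of pred and label, averaged over axes.

    axes=(0,) mimics a single-axis (z) variant; axes=(0, 1, 2) a multi-axis one.
    The squared-error form is an assumption for this sketch.
    """
    total = 0.0
    for axis in axes:
        p, q = mip(pred, axis), mip(label, axis)
        diffs = [(a - b) ** 2 for rp, rq in zip(p, q) for a, b in zip(rp, rq)]
        total += sum(diffs) / len(diffs)
    return total / len(axes)
```

In training, a term like this would be added to the usual segmentation loss, so that a voxel missing from a predicted vessel also shows up as a gap in the projection and is penalised there.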

arXiv:2312.09842 [pdf, ps, other]

On the compression of shallow non-causal ASR models using knowledge distillation and tied-and-reduced decoder for low-latency on-device speech recognition

Authors: Nagaraj Adiga, Jinhwan Park, Chintigari Shiva Kumar, Shatrughan Singh, Kyungmin Lee, Chanwoo Kim, Dhananjaya Gowda

Abstract: Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose a shallow cascaded model that combines various model compression techniques, such as knowledge distillation, a shared decoder, and a tied-and-reduced transducer network, in order to reduce the model footprint. The shared decoder is changed into a tied-and-reduced network. The cascaded two-pass model is further compressed through knowledge distillation with a Kullback-Leibler divergence loss on the model posteriors. We demonstrate a 50% reduction in the size of a 41 M parameter cascaded teacher model with no noticeable degradation in ASR accuracy and a 30% reduction in latency.

Submitted 15 December, 2023; originally announced December 2023.
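
The Kullback-Leibler divergence loss on model posteriors mentioned in the abstract above can be sketched as follows. This is a generic illustration, not the authors' training code: the function names are hypothetical, the posteriors are assumed to come from a softmax over per-token logits, and the direction KL(teacher || student) is the common distillation convention assumed here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    # eps guards against log(0) when the student assigns zero probability.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits):
    """KL divergence between teacher and student posteriors for one output step."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

The loss is zero when the student reproduces the teacher's posterior exactly and grows as the two distributions diverge, which is what pushes the smaller model toward the teacher's behaviour.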

arXiv:2311.12758 [pdf, other]

Estimating time of arrival of vehicle fleets with GCN based traffic prediction

Authors: Shivika Sharma, Nandini Mawane, Dhruthick Gowda M, Mayur Taware, Chetan Kumar, Yash Chandrashekhar Dixit, Rakshit Ramesh

Abstract: This paper presents an effective framework for estimating the time of arrival of vehicles (buses) in an Intelligent Transit Management System (ITMS) with sparse position updates. Our first contribution is a constrained-optimization-based road linestring segmentation framework that ensures ideal segment lengths and segments with a sufficient density of vehicle position measurements, which yields valid statistics in scenarios involving sparse position measurements. On top of this, we propose a comprehensive approach for predicting traffic delays and estimated time of vehicle arrival that addresses both the spatial and temporal dependencies of traffic. The traffic delay model is built on the T-GCN architecture, which we augment with an adjacency matrix that models a complexly connected road network by considering the degree of influence between road segments. This enables the traffic delay model to look beyond physical road connectivity when predicting traffic delays, and therefore to produce better estimates of arrival times at points along the designated route of the vehicles.

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2202.13541 [pdf, other]

Pattern Based Multivariable Regression using Deep Learning (PBMR-DP)

Authors: Jiztom Kavalakkatt Francis, Chandan Kumar, Jansel Herrera-Gerena, Kundan Kumar, Matthew J Darr

Abstract: We propose a deep learning methodology for multivariate regression that is based on pattern recognition that triggers fast learning over sensor data. We used a conversion of sensors-to-image which enables us to take advantage of Computer Vision architectures and training processes. In addition to this data preparation methodology, we explore the use of state-of-the-art architectures to generate regression outputs to predict agricultural crop continuous yield information. Finally, we compare with some of the top models reported in MLCAS2021. We found that using a straightforward training process, we were able to accomplish an MAE of 4.394, RMSE of 5.945, and R^2 of 0.861.

Submitted 9 March, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

Comments: 7 pages, 5 figures, 3 tables
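
For readers unfamiliar with the three regression metrics reported in the abstract above (MAE, RMSE, and R^2), the standard definitions can be sketched as below. The function name is hypothetical and the sketch is unrelated to the authors' evaluation code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)                      # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return mae, rmse, r2
```

MAE and RMSE are in the units of the target (here, yield), while R^2 is unitless and measures the share of variance in the target explained by the predictions.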

arXiv:2110.11795 [pdf, other]

HDRVideo-GAN: Deep Generative HDR Video Reconstruction

Authors: Mrinal Anand, Nidhin Harilal, Chandan Kumar, Shanmuganathan Raman

Abstract: High dynamic range (HDR) videos provide a more visually realistic experience than standard low dynamic range (LDR) videos. Despite significant progress in HDR imaging, it is still a challenging task to capture high-quality HDR video with a conventional off-the-shelf camera. Existing approaches rely entirely on using dense optical flow between the neighboring LDR sequences to reconstruct an HDR frame. However, they lead to inconsistencies in color and exposure over time when applied to alternating exposures with noisy frames. In this paper, we propose an end-to-end GAN-based framework for HDR video reconstruction from LDR sequences with alternating exposures. We first extract clean LDR frames from noisy LDR video with alternating exposures with a denoising network trained in a self-supervised setting. Using optical flow, we then align the neighboring alternating-exposure frames to a reference frame and then reconstruct high-quality HDR frames in a complete adversarial setting. To further improve the robustness and quality of generated frames, we incorporate a temporal-stability-based regularization term along with content and style-based losses in the cost function during the training procedure. Experimental results demonstrate that our framework achieves state-of-the-art performance and generates superior quality HDR frames of a video over the existing methods.

Submitted 3 November, 2021; v1 submitted 22 October, 2021; originally announced October 2021.

Comments: In Proceedings of 12th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP-21)

arXiv:2002.00336 [pdf, other]

3D Object Detection on Point Clouds using Local Ground-aware and Adaptive Representation of scenes' surface

Authors: Arun CS Kumar, Disha Ahuja, Ashwath Aithal

Abstract: A novel, adaptive ground-aware, and cost-effective 3D Object Detection pipeline is proposed. The ground surface representation introduced in this paper, in comparison to its uni-planar counterparts (methods that model the surface of a whole 3D scene using a single plane), is far more accurate while being ~10x faster. The novelty of the ground representation lies both in the way in which the ground surface of the scene is represented in Lidar perception problems, as well as in the (cost-efficient) way in which it is computed. Furthermore, the proposed object detection pipeline builds on the traditional two-stage object detection models by incorporating the ability to dynamically reason about the surface of the scene, ultimately achieving a new state-of-the-art 3D object detection performance among the two-stage Lidar Object Detection pipelines.

Submitted 26 June, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

Showing 1–8 of 8 results for author: Kumar, C