-
Perturbed State Space Feature Encoders for Optical Flow with Event Cameras
Authors:
Gokul Raju Govinda Raju,
Nikola Zubić,
Marco Cannici,
Davide Scaramuzza
Abstract:
With their motion-responsive nature, event-based cameras offer significant advantages over traditional cameras for optical flow estimation. While deep learning has improved upon traditional methods, current neural networks adopted for event-based optical flow still face temporal and spatial reasoning limitations. We propose Perturbed State Space Feature Encoders (P-SSE) for multi-frame optical flow with event cameras to address these challenges. P-SSE adaptively processes spatiotemporal features with a large receptive field akin to Transformer-based methods, while maintaining the linear computational complexity characteristic of SSMs. However, the key innovation that enables the state-of-the-art performance of our model lies in our perturbation technique applied to the state dynamics matrix governing the SSM system. This approach significantly improves the stability and performance of our model. We integrate P-SSE into a framework that leverages bi-directional flows and recurrent connections, expanding the temporal context of flow prediction. Evaluations on DSEC-Flow and MVSEC datasets showcase P-SSE's superiority, with 8.48% and 11.86% improvements in EPE performance, respectively.
Submitted 14 April, 2025;
originally announced April 2025.
-
IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
Authors:
Ashwin Sankar,
Srija Anand,
Praveen Srinivasa Varadhan,
Sherry Thomas,
Mehak Singal,
Shridhar Kumar,
Deovrat Mehendale,
Aditi Krishana,
Giri Raju,
Mitesh Khapra
Abstract:
Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.
Submitted 7 October, 2024; v1 submitted 9 September, 2024;
originally announced September 2024.
-
Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings
Authors:
Praveen Srinivasa Varadhan,
Ashwin Sankar,
Giri Raju,
Mitesh M. Khapra
Abstract:
We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
Submitted 30 August, 2024; v1 submitted 19 July, 2024;
originally announced July 2024.
-
Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies
Authors:
Srija Anand,
Praveen Srinivasa Varadhan,
Ashwin Sankar,
Giri Raju,
Mitesh M. Khapra
Abstract:
Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model's OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers as opposed to high-quality voice artists to record words containing character bigrams unseen in the training data. We show that using such inexpensive data, the model's performance improves on OOV words, while not affecting voice quality and in-domain performance.
Submitted 18 July, 2024;
originally announced July 2024.
-
Inception-inspired LSTM for Next-frame Video Prediction
Authors:
Matin Hosseini,
Anthony S. Maida,
Majid Hosseini,
Gottumukkala Raju
Abstract:
The problem of video frame prediction has received much interest due to its relevance to many computer vision applications such as autonomous vehicles or robotics. Supervised methods for video frame prediction rely on labeled data, which may not always be available. In this paper, we provide a novel unsupervised deep-learning method called Inception-based LSTM for video frame prediction. The general idea of inception networks is to implement wider networks instead of deeper networks. This network design was shown to improve the performance of image classification. The proposed method is evaluated on both Inception-v1 and Inception-v2 structures. The proposed Inception LSTM methods are compared with convolutional LSTM when applied using the PredNet predictive coding framework for both the KITTI and KTH data sets. We observed that the Inception-based LSTM outperforms the convolutional LSTM. Also, Inception LSTM has better prediction performance compared to Inception v2 LSTM. However, Inception v2 LSTM has a lower computational cost compared to Inception LSTM.
Submitted 24 April, 2020; v1 submitted 27 August, 2019;
originally announced September 2019.
-
A Review on Failure Node Recovery Algorithms in Wireless Sensor Actor Networks
Authors:
G. Sumalatha,
N. Zareena,
Ch. Gopi Raju
Abstract:
In wireless sensor-actor networks, sensors probe their surroundings and forward their data to actor nodes. Actors collect sensor data and perform certain tasks in response to various events. Since actors operate in harsh environments, they may easily be damaged or fail. Failed actor nodes may partition the network into disjoint subsets. To reestablish connectivity, nodes may be relocated to new positions. This paper reviews three node recovery algorithms (LeDir, RIM, DARA) and analyses their performance in terms of network overhead and path length validation metrics.
Submitted 30 June, 2014;
originally announced July 2014.
-
Cellular Automata based Feedback Mechanism in Strengthening biological Sequence Analysis Approach to Robotic Soccer
Authors:
P. Kiran Sree,
G. V. S. Raju,
S. Viswandha Raju,
N. S. S. S. N Usha Devi
Abstract:
This paper reports on the application of sequence analysis algorithms for agents in robotic soccer, and a suitable representation is proposed to achieve this mapping. The objective of this research is to generate novel, better in-game strategies with the aim of faster adaptation to the changing environment. A homogeneous non-communicating multi-agent architecture using the representation is presented. To achieve real-time learning during a game, a bucket brigade algorithm is used to reinforce a Cellular Automata Based Classifier. A technique for selecting strategies based on sequence analysis is adopted.
Submitted 9 December, 2013;
originally announced December 2013.
-
Mine Blood Donors Information through Improved K-Means Clustering
Authors:
Bondu Venkateswarlu,
Prof G. S. V. Prasad Raju
Abstract:
The number of accidents and diseases is increasing at an alarming rate, resulting in a huge increase in the demand for blood. There is a need for organized analysis of blood donor databases and blood bank repositories. Clustering analysis is one of the data mining applications, and the K-means clustering algorithm is the fundamental algorithm for modern clustering techniques. K-means is a traditional, iterative algorithm: at every iteration, it computes the distance from the centroid of each cluster to every data point. This paper improves the original K-means algorithm by choosing the initial centroids according to the distribution of the data. Results and discussion show that the improved K-means algorithm produces accurate clusters in less computation time when mining donor information.
Submitted 10 September, 2013;
originally announced September 2013.
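Illustrative note on the entry above: the abstract describes K-means with initial centroids chosen from the distribution of the data, but it does not say how. The sketch below is therefore only an assumption, not the authors' method. It sorts the points by their distance from the origin, splits them into k equal chunks, and uses each chunk's mean as a starting centroid. The names init_centroids, mean and kmeans are hypothetical, and the sample points are made up.

```python
import math


def mean(pts):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(pts[0])
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))


def init_centroids(points, k):
    """Assumed data-driven initialisation (not taken from the paper):
    sort points by distance from the origin, cut the sorted list into
    k contiguous chunks, and return the mean of each chunk.
    Requires len(points) >= k so that no chunk is empty."""
    ordered = sorted(points, key=lambda p: math.hypot(*p))
    n = len(ordered)
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [mean(chunk) for chunk in chunks]


def kmeans(points, k, max_iter=100):
    """Standard Lloyd iterations, started from init_centroids."""
    centroids = init_centroids(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    # Made-up 2-D points forming two well-separated groups.
    pts = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = kmeans(pts, 2)
    print(centroids)
    print(clusters)
```

Because the starting centroids already lie inside dense regions of the data, fewer assignment and update rounds are typically needed than with random starting points, which is the kind of saving the abstract reports.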