-
Visual-Aware Speech Recognition for Noisy Scenarios
Authors:
Lakshmipathi Balaji,
Karan Singla
Abstract:
Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual c…
▽ More
Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. Unlike works that rely on lip motion and require the speaker's visibility, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method re-purposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline to develop audio-visual datasets, where visual cues correlate to noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improved transcription accuracy.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Authors:
Sankalp KJ,
Ashutosh Kumar,
Laxmaan Balaji,
Nikunj Kotecha,
Vinija Jain,
Aman Chadha,
Sreyoshi Bhaduri
Abstract:
Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro…
▽ More
Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmarks' design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.
△ Less
Submitted 27 January, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
A Synthetic Prediction Market for Estimating Confidence in Published Work
Authors:
Sarah Rajtmajer,
Christopher Griffin,
Jian Wu,
Robert Fraleigh,
Laxmaan Balaji,
Anna Squicciarini,
Anthony Kwasnica,
David Pennock,
Michael McLaughlin,
Timothy Fritton,
Nishanth Nakshatri,
Arjun Menon,
Sai Ajay Modukuri,
Rajal Nivargi,
Xin Wei,
C. Lee Giles
Abstract:
Explainably estimating confidence in published scholarly work offers opportunity for faster and more robust scientific progress. We develop a synthetic prediction market to assess the credibility of published claims in the social and behavioral sciences literature. We demonstrate our system and detail our findings using a collection of known replication projects. We suggest that this work lays the…
▽ More
Explainably estimating confidence in published scholarly work offers opportunity for faster and more robust scientific progress. We develop a synthetic prediction market to assess the credibility of published claims in the social and behavioral sciences literature. We demonstrate our system and detail our findings using a collection of known replication projects. We suggest that this work lays the foundation for a research agenda that creatively uses AI for peer review.
△ Less
Submitted 23 December, 2021;
originally announced January 2022.
-
H.264/SVC Mode Decision Based on Mode Correlation and Desired Mode List
Authors:
L. Balaji,
K. K. Thyagharajan
Abstract:
The design of video encoders involves the implementation of fast mode decision (FMD) algorithm to reduce computation complexity while maintaining the performance of the coding. Although H.264/scalable video coding (SVC) achieves high scalability and coding efficiency, it also has high complexity in implementing its exhaustive computation. In this paper, a novel algorithm is proposed to reduce the…
▽ More
The design of video encoders involves the implementation of fast mode decision (FMD) algorithm to reduce computation complexity while maintaining the performance of the coding. Although H.264/scalable video coding (SVC) achieves high scalability and coding efficiency, it also has high complexity in implementing its exhaustive computation. In this paper, a novel algorithm is proposed to reduce the redundant candidate modes by making use of the correlation among layers. The desired mode list is created based on the probability to be the best mode for each block in the base layer and a candidate mode selection in the enhancement layer by the correlations of modes among the reference frame and current frame. Our algorithm is implemented in joint scalable video model (JSVM) 9.19.15 reference software and the performance is evaluated based on the average encoding time, peak signal to noise ratio (PSNR) and bit rate. The experimental results show 41.89% improvement in encoding time with minimal loss of 0.02dB in PSNR and 0.05% increase in bit rate.
△ Less
Submitted 22 September, 2020;
originally announced September 2020.
-
An enhanced performance for H.265/SHVC based on combined AEGBM3D filter and back-propagation neural network
Authors:
L. Balaji,
K. K. Thyagharajan
Abstract:
This paper deals with the latest video coding standard H265 SHVC, a scalable extension to High Efficiency Video Coding (HEVC). HEVC introduces new coding tools compared to its predecessor and is backward compatible with all types of electronic gadgets. The gadgets with different display capabilities cannot be offered the same quality video due to the constraints in transmission bandwidth is a majo…
▽ More
This paper deals with the latest video coding standard H265 SHVC, a scalable extension to High Efficiency Video Coding (HEVC). HEVC introduces new coding tools compared to its predecessor and is backward compatible with all types of electronic gadgets. The gadgets with different display capabilities cannot be offered the same quality video due to the constraints in transmission bandwidth is a major problem. One solution to this problem will be the compression of the video sequence which is focused in this paper to preserve or increase PSNR while reducing bit-rate besides a novel method implemented in SHVC encoder. The novel method undergoes a combined AEGBM3D (adaptive edge guided block-matching and 3D) filtering and back-propagation technique. The technique includes an AEGBM3D filter which avoids spatial redundancy and de-noise frames; hence enhancement in PSNR is achieved. The obtained PSNR of the video is compared with the set threshold PSNR to maintain PSNR above the threshold by repeated AEGBM3D filtering. The BP technique based on the neural network machine learning approach continually restrains the output if the input block does not contain a feature they were trained to recognize. This frequent control over the output produces few bits; hence reduction in bit-rate is achieved. The simulation results show that the proposed technique delivers an average increment of 0.16 and 0.25dB in PSNR and an average decrement of 28 and 37% in bit-rate for 1.5 and 2 times spatial ratios respectively, compared with the existing methods.
△ Less
Submitted 20 September, 2020;
originally announced September 2020.
-
An optimal mode selection algorithm for scalable video coding
Authors:
L. Balaji,
K. K. Thyagharajan,
C. Raja,
A. Dhanalakshmi
Abstract:
Scalable video coding (SVC) is extended from its predecessor advanced video coding (AVC) because of its flexible transmission to all type of gadgets. However, SVC is more flexible and scalable than AVC, but it is more complex in determining the computations than AVC. The traditional full search method in the standard H.264 SVC consumes more encoding time for computation. This complexity in computa…
▽ More
Scalable video coding (SVC) is extended from its predecessor advanced video coding (AVC) because of its flexible transmission to all type of gadgets. However, SVC is more flexible and scalable than AVC, but it is more complex in determining the computations than AVC. The traditional full search method in the standard H.264 SVC consumes more encoding time for computation. This complexity in computation need to be reduced and many fast mode decision (FMD) algorithms were developed, but many fail to balance in all the three measures such as peak signal to noise ratio (PSNR), encoding time and bit rate. In this paper, the proposed optimal mode selection algorithm based on the orientation of pixels achieves better time saving, good PSNR and coding efficiency. The proposed algorithm is compared with the standard H.264 JSVM reference software and found to be 57.44% time saving, 0.43 dB increments in PSNR and 0.23% compression in bit rate.
△ Less
Submitted 8 September, 2020;
originally announced September 2020.