-
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Authors:
Sanjoy Chowdhury,
Hanan Gani,
Nishit Anand,
Sayan Nag,
Ruohan Gao,
Mohamed Elhoseiny,
Salman Khan,
Dinesh Manocha
Abstract:
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.
Submitted 29 March, 2025;
originally announced March 2025.
-
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Authors:
Sanjoy Chowdhury,
Sayan Nag,
Subhrajyoti Dasgupta,
Jun Chen,
Mohamed Elhoseiny,
Ruohan Gao,
Dinesh Manocha
Abstract:
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
Submitted 3 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
Authors:
Sanjoy Chowdhury,
Sayan Nag,
K J Joseph,
Balaji Vasan Srinivasan,
Dinesh Manocha
Abstract:
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.
Submitted 7 June, 2024;
originally announced June 2024.
-
OmniCount: Multi-label Object Counting with Semantic-Geometric Priors
Authors:
Anindya Mondal,
Sauradip Nag,
Xiatian Zhu,
Anjan Dutta
Abstract:
Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions. The project webpage is available at https://mondalanindya.github.io/OmniCount.
Submitted 22 February, 2025; v1 submitted 8 March, 2024;
originally announced March 2024.
-
DiffSED: Sound Event Detection with Denoising Diffusion
Authors:
Swapnil Bhosale,
Sauradip Nag,
Diptesh Kanojia,
Jiankang Deng,
Xiatian Zhu
Abstract:
Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training.
Submitted 16 August, 2023; v1 submitted 14 August, 2023;
originally announced August 2023.
-
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Authors:
Anindya Mondal,
Sauradip Nag,
Joaquin M Prada,
Xiatian Zhu,
Anjan Dutta
Abstract:
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
Submitted 10 January, 2024; v1 submitted 20 July, 2023;
originally announced July 2023.
-
BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
Authors:
Ahana Deb,
Sayan Nag,
Ayan Mahapatra,
Soumitri Chattopadhyay,
Aritra Marik,
Pijush Kanti Gayen,
Shankha Sanyal,
Archi Banerjee,
Samir Karmakar
Abstract:
Spoken languages often utilise intonation, rhythm, intensity, and structure, to communicate intention, which can be interpreted differently depending on the rhythm of speech of their utterance. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, demonstrating their ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are ideal to model specific tasks in low resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, by using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts recognition using Multimodal $\underline{\textbf{At}}$tention Fu$\underline{\textbf{s}}$ion) significantly outperforms both the unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts
Submitted 5 June, 2023;
originally announced June 2023.
-
WaferSegClassNet -- A Light-weight Network for Classification and Segmentation of Semiconductor Wafer Defects
Authors:
Subhrajit Nag,
Dhruv Makwana,
Sai Chandra Teja R,
Sparsh Mittal,
C Krishna Mohan
Abstract:
As the integration density and design intricacy of semiconductor wafers increase, the magnitude and complexity of defects in them are also on the rise. Since the manual inspection of wafer defects is costly, an automated artificial intelligence (AI) based computer-vision approach is highly desired. The previous works on defect analysis have several limitations, such as low accuracy and the need for separate models for classification and segmentation. For analyzing mixed-type defects, some previous works require separately training one model for each defect type, which is non-scalable. In this paper, we present WaferSegClassNet (WSCN), a novel network based on encoder-decoder architecture. WSCN performs simultaneous classification and segmentation of both single and mixed-type wafer defects. WSCN uses a "shared encoder" for classification, and segmentation, which allows training WSCN end-to-end. We use N-pair contrastive loss to first pretrain the encoder and then use BCE-Dice loss for segmentation, and categorical cross-entropy loss for classification. Use of N-pair contrastive loss helps in better embedding representation in the latent dimension of wafer maps. WSCN has a model size of only 0.51MB and performs only 0.2M FLOPS. Thus, it is much lighter than other state-of-the-art models. Also, it requires only 150 epochs for convergence, compared to 4,000 epochs needed by a previous work. We evaluate our model on the MixedWM38 dataset, which has 38,015 images. WSCN achieves an average classification accuracy of 98.2% and a dice coefficient of 0.9999. We are the first to show segmentation results on the MixedWM38 dataset. The source code can be obtained from https://github.com/ckmvigil/WaferSegClassNet.
Submitted 3 July, 2022;
originally announced July 2022.
-
Agile Satellite Planning for Multi-Payload Observations for Earth Science
Authors:
Rich Levinson,
Sreeja Nag,
Vinay Ravindra
Abstract:
We present planning challenges, methods and preliminary results for a new model-based paradigm for earth observing systems in adaptive remote sensing. Our heuristically guided constraint optimization planner produces coordinated plans for multiple satellites, each with multiple instruments (payloads). The satellites are agile, meaning they can quickly maneuver to change viewing angles in response to rapidly changing phenomena. The planner operates in a closed-loop context, updating the plan as it receives regular sensor data and updated predictions. We describe the planner's search space and search procedure, and present preliminary experiment results. Contributions include initial identification of the planner's search space, constraints, heuristics, and performance metrics applied to a soil moisture monitoring scenario using spaceborne radars.
Submitted 13 November, 2021;
originally announced November 2021.
-
Attitude Trajectory Optimization for Agile Satellites in Autonomous Remote Sensing Constellation
Authors:
Emmanuel Sin,
Sreeja Nag,
Vinay Ravindra,
Alan Li,
Murat Arcak
Abstract:
Agile attitude maneuvering maximizes the utility of remote sensing satellite constellations. By taking into account a satellite's physical properties and its actuator specifications, we may leverage the full performance potential of the attitude control system to conduct agile remote sensing beyond conventional slew-and-stabilize maneuvers. Employing a constellation of agile satellites, coordinated by an autonomous and responsive scheduler, can significantly increase overall response rate, revisit time and global coverage for the mission. In this paper, we use recent advances in sequential convex programming based trajectory optimization to enable rapid-target acquisition, pointing and tracking capabilities for a scheduler-based constellation. We present two problem formulations. The Minimum-Time Slew Optimal Control Problem determines the minimum time, required energy, and optimal trajectory to slew between any two orientations given nonlinear quaternion kinematics, gyrostat and actuator dynamics, and state/input constraints. By gridding the space of 3D rotations and efficiently solving this problem on the grid, we produce lookup tables or parametric fits off-line that can then be used on-line by a scheduler to compute accurate estimates of minimum-time and maneuver energy for real-time constellation scheduling. The Minimum-Effort Multi-Target Pointing Optimal Control Problem is used on-line by each satellite to produce continuous attitude-state and control-input trajectories that realize a given schedule while minimizing attitude error and control effort. The optimal trajectory may then be achieved by a low-level tracking controller. We demonstrate our approach with an example of a reference satellite in Sun-synchronous orbit passing over globally-distributed, Earth-observation targets.
Submitted 15 February, 2021;
originally announced February 2021.
-
A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction
Authors:
Sayan Nag,
Uddalok Sarkar,
Shankha Sanyal,
Archi Banerjee,
Souparno Roy,
Samir Karmakar,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
It is already known that both auditory and visual stimuli can convey emotions in the human mind to different extents. The strength or intensity of the emotional arousal varies depending on the type of stimulus chosen. In this study, we investigate emotional arousal in a cross-modal scenario involving both auditory and visual stimuli while studying their source characteristics. A robust fractal analytic technique called Detrended Fluctuation Analysis (DFA) and its 2D analogue have been used to characterize three (3) standardized audio and video signals, quantifying their scaling exponents corresponding to positive and negative valence. It was found that there is a significant difference in scaling exponents corresponding to the two different modalities. Detrended Cross Correlation Analysis (DCCA) has also been applied to decipher the degree of cross-correlation among the individual audio and visual stimuli. This is the first study of its kind to propose a novel algorithm with which emotional arousal can be classified in a cross-modal scenario using only the source audio and visual signals while also attempting a correlation between them.
Submitted 11 February, 2021;
originally announced February 2021.
-
Language Independent Emotion Quantification using Non linear Modelling of Speech
Authors:
Uddalok Sarkar,
Sayan Nag,
Chirayata Bhattacharya,
Shankha Sanyal,
Archi Banerjee,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
At present, emotion extraction from speech is a very important issue due to its diverse applications. Hence, it becomes absolutely necessary to obtain models that take into consideration the speaking styles of a person, vocal tract information, timbral qualities and other congenital information regarding his voice. Our speech production system is a nonlinear system like most other real-world systems. Hence the need arises for modelling our speech information using nonlinear techniques. In this work we have modelled our articulation system using nonlinear multifractal analysis. The multifractal spectral width and scaling exponents essentially reveal the complexity associated with the speech signals taken. The multifractal spectra are well distinguishable in the low fluctuation region for different emotions. The source characteristics have been quantified with the help of different nonlinear models such as Multi-Fractal Detrended Fluctuation Analysis and Wavelet Transform Modulus Maxima. The results obtained from this study are very good for emotion clustering.
Submitted 11 February, 2021;
originally announced February 2021.
-
Neural Network architectures to classify emotions in Indian Classical Music
Authors:
Uddalok Sarkar,
Sayan Nag,
Medha Basu,
Archi Banerjee,
Shankha Sanyal,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
Music is often considered the language of emotions. It has long been known to elicit emotions in human beings, and thus categorizing music based on the type of emotions it induces is a very intriguing topic of research. When the task is to classify emotions elicited by Indian Classical Music (ICM), it becomes much more challenging because of the inherent ambiguity associated with ICM. The fact that a single musical performance can evoke a variety of emotional responses in the audience is implicit to the nature of ICM renditions. With the rapid advancements in the field of Deep Learning, this Music Emotion Recognition (MER) task is becoming more and more relevant and robust, and hence can be applied to one of the most challenging test cases, i.e., classifying emotions elicited from ICM. In this paper we present a new dataset called JUMusEmoDB which presently has 400 audio clips (30 seconds each) where 200 clips correspond to happy emotions and the remaining 200 clips correspond to sad emotions. For supervised classification purposes, we have used 4 existing deep Convolutional Neural Network (CNN) based architectures (resnet18, mobilenet v2.0, squeezenet v1.0 and vgg16) on corresponding music spectrograms of the 2000 sub-clips (where every clip was segmented into 5 sub-clips of about 5 seconds each) which contain both time as well as frequency domain information. The initial results are quite inspiring, and we look forward to setting the baseline values for the dataset using these architectures. This type of CNN based classification algorithm using a rich corpus of Indian Classical Music is unique even in the global perspective and can be replicated in other modalities of music also. This dataset is still under development and we plan to include more data containing other emotional features as well. We plan to make the dataset publicly available soon.
Submitted 31 January, 2021;
originally announced February 2021.
-
Planning a Reference Constellation for Radiometric Cross-Calibration of Commercial Earth Observing Sensors
Authors:
Sreeja Nag,
Philip Dabney,
Vinay Ravindra,
Cody Anderson
Abstract:
The Earth Observation planning community has access to tools that can propagate orbits and compute coverage of Earth observing imagers with customizable shapes and orientation, model the expected Earth Reflectance at various bands, epochs and directions, generate simplified instrument performance metrics for imagers and radars, and schedule single and multiple spacecraft payload operations. We are working toward integrating existing tools to design a planner that allows commercial small spacecraft to assess the opportunities for cross-calibration of their sensors against current satellite to be calibrated, specifications of the reference instruments, sensor stability, allowable latency between calibration measurements, differences in viewing and solar geometry between calibration measurements, etc. The planner would output cross-calibration opportunities for every reference target pair as a function of flexible user-defined parameters. We use a preliminary version of this planner to inform the design of a constellation of transfer radiometers that can serve as stable, radiometric references for commercial sensors to cross-calibrate with. We propose such a constellation for either vicarious cross-calibration using pre-selected sites, or top of the atmosphere (TOA) cross-calibration globally. Results from the calibration planner applied to a subset of informed architecture designs show that a 4 sat constellation provides multiple calibration opportunities within a half-day planning horizon, for Cubesat sensors deployed into typical rideshare orbits. While such opportunities are available for cross-calibration image pairs within 5 deg of solar or view directions, and within an hour (for TOA) and less than a day (vicariously), the planner allows us to identify many more by relaxing user-defined restrictions.
Submitted 19 October, 2020;
originally announced October 2020.
-
Autonomous Scheduling of Agile Spacecraft Constellations with Delay Tolerant Networking for Reactive Imaging
Authors:
Sreeja Nag,
Alan S. Li,
Vinay Ravindra,
Marc Sanchez Net,
Kar-Ming Cheung,
Rod Lammers,
Brian Bledsoe
Abstract:
Small spacecraft now have precise attitude control systems available commercially, allowing them to slew in 3 degrees of freedom, and capture images within short notice. When combined with appropriate software, this agility can significantly increase response rate, revisit time and coverage. In prior work, we have demonstrated an algorithmic framework that combines orbital mechanics, attitude control and scheduling optimization to plan the time-varying, full-body orientation of agile, small spacecraft in a constellation. The proposed schedule optimization would run at the ground station autonomously, and the resultant schedules uplinked to the spacecraft for execution. The algorithm is generalizable over small steerable spacecraft, control capability, sensor specs, imaging requirements, and regions of interest. In this article, we modify the algorithm to run onboard small spacecraft, such that the constellation can make time-sensitive decisions to slew and capture images autonomously, without ground control. We have developed a communication module based on Delay/Disruption Tolerant Networking (DTN) for onboard data management and routing among the satellites, which will work in conjunction with the other modules to optimize the schedule of agile communication and steering. We then apply this preliminary framework on representative constellations to simulate targeted measurements of episodic precipitation events and subsequent urban floods. The command and control efficiency of our agile algorithm is compared to non-agile (11.3x improvement) and non-DTN (21% improvement) constellations.
Submitted 19 October, 2020;
originally announced October 2020.
-
E2GC: Energy-efficient Group Convolution in Deep Neural Networks
Authors:
Nandan Kumar Jha,
Rajat Saini,
Subhrajit Nag,
Sparsh Mittal
Abstract:
The number of groups ($g$) in group convolution (GConv) is selected to boost the predictive performance of deep neural networks (DNNs) in a compute and parameter efficient manner. However, we show that naive selection of $g$ in GConv creates an imbalance between the computational complexity and degree of data reuse, which leads to suboptimal energy efficiency in DNNs. We devise an optimum group size model, which enables a balance between computational cost and data movement cost and thus optimizes the energy efficiency of DNNs. Based on the insights from this model, we propose an "energy-efficient group convolution" (E2GC) module where, unlike the previous implementations of GConv, the group size ($G$) remains constant. Further, to demonstrate the efficacy of the E2GC module, we incorporate this module in the design of MobileNet-V1 and ResNeXt-50 and perform experiments on two GPUs, P100 and P4000. We show that, at comparable computational complexity, DNNs with constant group size (E2GC) are more energy-efficient than DNNs with a fixed number of groups (F$g$GC). For example, on P100 GPU, the energy efficiency of MobileNet-V1 and ResNeXt-50 is increased by 10.8% and 4.73% (respectively) when E2GC modules substitute the F$g$GC modules in both the DNNs. Furthermore, through our extensive experimentation with ImageNet-1K and Food-101 image classification datasets, we show that the E2GC module enables a trade-off between generalization ability and representational power of DNN. Thus, the predictive performance of DNNs can be optimized by selecting an appropriate $G$. The code and trained models are available at https://github.com/iithcandle/E2GC-release.
Submitted 26 June, 2020;
originally announced June 2020.
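The difference between a fixed number of groups ($g$) and a constant group size ($G$) described in the abstract above can be made concrete with the standard cost formulas for a group convolution. The following is only an illustrative sketch, not the authors' energy model: the layer widths, the 3x3 kernel, the 14x14 output size, and the function names are all hypothetical choices made for this example.

```python
def gconv_cost(c_in, c_out, k, groups, h_out, w_out):
    """Parameter and multiply-accumulate (MAC) counts for one group convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    group_size = c_in // groups              # input channels seen by each filter
    params = k * k * group_size * c_out      # weights only, bias ignored
    macs = params * h_out * w_out            # one MAC per weight per output position
    return group_size, params, macs


# Hypothetical layer widths (input channels == output channels for simplicity).
WIDTHS = [64, 128, 256, 512]


def fixed_groups(g):
    """Fixed number of groups: the group size grows with the layer width."""
    return [gconv_cost(c, c, 3, g, 14, 14) for c in WIDTHS]


def constant_group_size(G):
    """Constant group size: the number of groups grows with the layer width."""
    return [gconv_cost(c, c, 3, c // G, 14, 14) for c in WIDTHS]


if __name__ == "__main__":
    for label, rows in (("fixed g=8", fixed_groups(8)),
                        ("constant G=16", constant_group_size(16))):
        for width, (size, params, macs) in zip(WIDTHS, rows):
            print(f"{label:14s} width={width:4d} group_size={size:3d} "
                  f"params={params:7d} macs={macs}")
```

With a fixed $g$, the per-filter work scales with the layer width, whereas with a constant $G$ it stays the same in every layer. This sketch counts only arithmetic; it does not model data movement or energy.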
-
Acoustical classification of different speech acts using nonlinear methods
Authors:
Chirayata Bhattacharyya,
Sourya Sengupta,
Sayan Nag,
Shankha Sanyal,
Archi Banerjee,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
A recitation is a way of combining words so that they carry a sense of rhythm and are thereby imbued with emotional content. In this study, we set out to examine this in a scientific manner, taking into consideration five well-known Bengali recitations by different poets that convey a variety of moods ranging from joy to sorrow. The clips were recited as well as read (in the form of flat speech without any rhythm) by the same person to avoid any perceptual difference arising from timbre variation. Next, the emotional content of the five recitations was standardized through a listening test conducted with a pool of 50 participants. The recitations as well as the speech were analyzed with a nonlinear technique called Detrended Fluctuation Analysis (DFA), which gives a scaling exponent α that is essentially a measure of the long-range correlations present in the signal. Similar pieces (the parts that have exactly the same lyrical content in the speech and in the recital) were extracted from the complete signal and analyzed with the DFA technique. Our analysis shows that the scaling exponent for all parts of the recitation was, in general, much higher than that of their counterparts in speech. We have also established a critical value from our analysis, above which mere speech may become a recitation. The case may be similar to a conventional phase transition, wherein the value of the external condition (generally temperature) at which the transformation occurs is called the transition point. Further, we have also categorized the five recitations on the basis of their emotional content using the same DFA technique. Analysis with a greater variety of recitations is being carried out to yield more interesting results.
Submitted 5 August, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
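The DFA scaling exponent α mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of the textbook first-order DFA procedure, not the authors' pipeline: the abstract does not state the detrending order or the window sizes, so linear detrending, the window sizes below, and the synthetic test signal are all assumptions, and the function names are hypothetical.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(signal, window_sizes):
    """Estimate the DFA scaling exponent using linear (first-order) detrending."""
    # Step 1: integrate the mean-subtracted signal to obtain the profile.
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for value in signal:
        total += value - mean
        profile.append(total)

    log_n, log_f = [], []
    for n in window_sizes:
        n_windows = len(profile) // n
        xs = list(range(n))
        squared_residuals = 0.0
        # Step 2: remove a local linear trend in each non-overlapping window.
        for w in range(n_windows):
            segment = profile[w * n:(w + 1) * n]
            slope, intercept = linear_fit(xs, segment)
            squared_residuals += sum(
                (y - (slope * x + intercept)) ** 2 for x, y in zip(xs, segment)
            )
        # Step 3: root-mean-square fluctuation F(n) at this window size.
        f_n = math.sqrt(squared_residuals / (n_windows * n))
        log_n.append(math.log(n))
        log_f.append(math.log(f_n))

    # Step 4: alpha is the slope of log F(n) against log n.
    alpha, _ = linear_fit(log_n, log_f)
    return alpha


# Synthetic check: uncorrelated noise is expected to give alpha near 0.5.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
alpha_noise = dfa_exponent(noise, [16, 32, 64, 128, 256])
```

A larger α indicates stronger long-range correlations, which is the sense in which the abstract compares recitation with flat speech. Applying the sketch to real audio would require choosing window sizes suited to the sampling rate and clip length.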
-
Speaker Recognition in Bengali Language from Nonlinear Features
Authors:
Uddalok Sarkar,
Soumyadeep Pal,
Sayan Nag,
Chirayata Bhattacharya,
Shankha Sanyal,
Archi Banerjee,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
Automatic speaker recognition is currently an important problem owing to its diverse applications. Hence, it becomes necessary to obtain models that take into consideration the speaking style of a person, vocal tract information, the timbral qualities of the voice, and other congenital information regarding the voice. Studies of Bengali speech recognition and speaker identification are scarce in the literature. Hence the need arises to involve Bengali subjects in modelling our speaker identification engine. In this work, we have extracted some acoustic features of speech using nonlinear multifractal analysis. Multifractal Detrended Fluctuation Analysis (MFDFA) essentially reveals the complexity associated with the speech signals taken. The source characteristics have been quantified with the help of different techniques such as the correlation matrix, the skewness of the MFDFA spectrum, etc. The results obtained from this study give a good recognition rate for Bengali speakers.
Submitted 15 April, 2020;
originally announced April 2020.
-
Music of Brain and Music on Brain: A Novel EEG Sonification approach
Authors:
Sayan Nag,
Shankha Sanyal,
Archi Banerjee,
Ranjan Sengupta,
Dipak Ghosh
Abstract:
Can we hear the sound of our brain? Is there any technique that can enable us to hear the neuro-electrical impulses originating from the different lobes of the brain? The answer to all these questions is YES. In this paper we present a novel method with which we can sonify Electroencephalogram (EEG) data recorded in the rest state as well as under the influence of one of the simplest acoustical stimuli, a tanpura drone. The tanpura drone, which is generally used to create an ambiance during a musical performance, has acoustic features that are simple yet very complex. Hence, for this pilot project we chose to study the correlation between a simple acoustic stimulus (the tanpura drone) and sonified EEG data. To date, no study has dealt with the direct correlation between a bio-signal and its acoustic counterpart, or with how that correlation varies under the influence of different types of stimuli. This is the first study of its kind to bridge this gap and look for a direct correlation between a music signal and EEG data, using a robust mathematical microscope called Multifractal Detrended Cross Correlation Analysis (MFDXA). For this, we took EEG data from 10 participants in a 2 min 'rest state' (i.e. with white noise) and in a 2 min 'tanpura drone' (musical stimulus) listening condition. Next, the EEG signals from different electrodes were sonified, and the MFDXA technique was used to assess the degree of correlation (the cross-correlation coefficient) between the tanpura signal and the EEG signals. The variation of γx for different lobes during the course of the experiment also provides interesting new information. Only musical stimuli have the ability to engage several areas of the brain significantly, unlike other stimuli (which engage specific domains only).
Submitted 22 December, 2017;
originally announced December 2017.