Showing 1–19 of 19 results for author: Nag, S

Searching in archive eess.
  1. arXiv:2503.23219  [pdf, other]

    eess.AS cs.AI cs.CV cs.LG

    Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

    Authors: Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha

    Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into…

    Submitted 29 March, 2025; originally announced March 2025.

  2. arXiv:2407.01851  [pdf, other]

    cs.CV cs.AI cs.LG eess.AS

    Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

    Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

    Abstract: Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained un…

    Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  3. arXiv:2406.04673  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

    Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

    Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, w…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: https://schowdhury671.github.io/melfusion_cvpr2024/

  4. arXiv:2403.05435  [pdf, other]

    cs.CV eess.IV eess.SP

    OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

    Authors: Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

    Abstract: Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficien…

    Submitted 22 February, 2025; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to AAAI 2025

  5. arXiv:2308.07293  [pdf, other]

    cs.SD cs.LG eess.AS

    DiffSED: Sound Event Detection with Denoising Diffusion

    Authors: Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

    Abstract: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate t…

    Submitted 16 August, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

  6. arXiv:2307.10763  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Actor-agnostic Multi-label Action Recognition with Multi-modal Query

    Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

    Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecti…

    Submitted 10 January, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Published at the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France

  7. arXiv:2306.02680  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

    Authors: Ahana Deb, Sayan Nag, Ayan Mahapatra, Soumitri Chattopadhyay, Aritra Marik, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar

    Abstract: Spoken languages often utilise intonation, rhythm, intensity, and structure, to communicate intention, which can be interpreted differently depending on the rhythm of speech of their utterance. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, demonstrating their ability to learn powerful represent…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  8. WaferSegClassNet -- A Light-weight Network for Classification and Segmentation of Semiconductor Wafer Defects

    Authors: Subhrajit Nag, Dhruv Makwana, Sai Chandra Teja R, Sparsh Mittal, C Krishna Mohan

    Abstract: As the integration density and design intricacy of semiconductor wafers increase, the magnitude and complexity of defects in them are also on the rise. Since the manual inspection of wafer defects is costly, an automated artificial intelligence (AI) based computer-vision approach is highly desired. The previous works on defect analysis have several limitations, such as low accuracy and the need fo…

    Submitted 3 July, 2022; originally announced July 2022.

    Comments: 11 pages, 2 figures, 7 tables, Published in Computers in Industry

    Journal ref: Volume 142, 2022, 103720, ISSN 0166-3615

  9. arXiv:2111.07042  [pdf]

    cs.RO eess.SY

    Agile Satellite Planning for Multi-Payload Observations for Earth Science

    Authors: Rich Levinson, Sreeja Nag, Vinay Ravindra

    Abstract: We present planning challenges, methods and preliminary results for a new model-based paradigm for earth observing systems in adaptive remote sensing. Our heuristically guided constraint optimization planner produces coordinated plans for multiple satellites, each with multiple instruments (payloads). The satellites are agile, meaning they can quickly maneuver to change viewing angles in response…

    Submitted 13 November, 2021; originally announced November 2021.

    Journal ref: International Workshop on Planning & Scheduling for Space (IWPSS) 2021

  10. arXiv:2102.07940  [pdf, other]

    eess.SY

    Attitude Trajectory Optimization for Agile Satellites in Autonomous Remote Sensing Constellation

    Authors: Emmanuel Sin, Sreeja Nag, Vinay Ravindra, Alan Li, Murat Arcak

    Abstract: Agile attitude maneuvering maximizes the utility of remote sensing satellite constellations. By taking into account a satellite's physical properties and its actuator specifications, we may leverage the full performance potential of the attitude control system to conduct agile remote sensing beyond conventional slew-and-stabilize maneuvers. Employing a constellation of agile satellites, coordinate…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: 24 pages, 27 figures

  11. arXiv:2102.06038  [pdf]

    cs.SD cs.CL eess.AS

    A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction

    Authors: Sayan Nag, Uddalok Sarkar, Shankha Sanyal, Archi Banerjee, Souparno Roy, Samir Karmakar, Ranjan Sengupta, Dipak Ghosh

    Abstract: It is already known that both auditory and visual stimuli are able to convey emotions in the human mind to different extents. The strength or intensity of the emotional arousal varies depending on the type of stimulus chosen. In this study, we try to investigate the emotional arousal in a cross-modal scenario involving both auditory and visual stimuli while studying their source characteristics. A robus…

    Submitted 11 February, 2021; originally announced February 2021.

  12. arXiv:2102.06003  [pdf]

    cs.SD cs.CL eess.AS

    Language Independent Emotion Quantification using Non linear Modelling of Speech

    Authors: Uddalok Sarkar, Sayan Nag, Chirayata Bhattacharya, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

    Abstract: At present emotion extraction from speech is a very important issue due to its diverse applications. Hence, it becomes absolutely necessary to obtain models that take into consideration the speaking styles of a person, vocal tract information, timbral qualities and other congenital information regarding his voice. Our speech production system is a nonlinear system like most other real world system…

    Submitted 11 February, 2021; originally announced February 2021.

  13. arXiv:2102.00616  [pdf]

    cs.SD cs.LG cs.MM eess.AS

    Neural Network architectures to classify emotions in Indian Classical Music

    Authors: Uddalok Sarkar, Sayan Nag, Medha Basu, Archi Banerjee, Shankha Sanyal, Ranjan Sengupta, Dipak Ghosh

    Abstract: Music is often considered the language of emotions. It has long been known to elicit emotions in human beings, and thus categorizing music based on the type of emotions it induces in human beings is a very intriguing topic of research. When the task comes to classify emotions elicited by Indian Classical Music (ICM), it becomes much more challenging because of the inherent ambiguity associated wi…

    Submitted 31 January, 2021; originally announced February 2021.

  14. arXiv:2010.09946  [pdf]

    eess.SY astro-ph.IM

    Planning a Reference Constellation for Radiometric Cross-Calibration of Commercial Earth Observing Sensors

    Authors: Sreeja Nag, Philip Dabney, Vinay Ravindra, Cody Anderson

    Abstract: The Earth Observation planning community has access to tools that can propagate orbits and compute coverage of Earth observing imagers with customizable shapes and orientation, model the expected Earth Reflectance at various bands, epochs and directions, generate simplified instrument performance metrics for imagers and radars, and schedule single and multiple spacecraft payload operations. We are…

    Submitted 19 October, 2020; originally announced October 2020.

    Journal ref: International Workshop on Planning and Scheduling for Space, Berkeley CA, July 2019

  15. arXiv:2010.09940  [pdf]

    eess.SY

    Autonomous Scheduling of Agile Spacecraft Constellations with Delay Tolerant Networking for Reactive Imaging

    Authors: Sreeja Nag, Alan S. Li, Vinay Ravindra, Marc Sanchez Net, Kar-Ming Cheung, Rod Lammers, Brian Bledsoe

    Abstract: Small spacecraft now have precise attitude control systems available commercially, allowing them to slew in 3 degrees of freedom, and capture images within short notice. When combined with appropriate software, this agility can significantly increase response rate, revisit time and coverage. In prior work, we have demonstrated an algorithmic framework that combines orbital mechanics, attitude cont…

    Submitted 19 October, 2020; originally announced October 2020.

    Journal ref: International Conference on Automated Planning and Scheduling SPARK Workshop, Berkeley, July 2019

  16. arXiv:2006.15100  [pdf, other]

    cs.LG eess.SP stat.ML

    E2GC: Energy-efficient Group Convolution in Deep Neural Networks

    Authors: Nandan Kumar Jha, Rajat Saini, Subhrajit Nag, Sparsh Mittal

    Abstract: The number of groups ($g$) in group convolution (GConv) is selected to boost the predictive performance of deep neural networks (DNNs) in a compute and parameter efficient manner. However, we show that naive selection of $g$ in GConv creates an imbalance between the computational complexity and degree of data reuse, which leads to suboptimal energy efficiency in DNNs. We devise an optimum group si…

    Submitted 26 June, 2020; originally announced June 2020.

    Comments: Accepted as a conference paper in 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID)

    ACM Class: I.5.1; I.5.2; I.5.5; C.0

    Journal ref: VLSID (2020) 155-160

  17. arXiv:2004.08248  [pdf]

    eess.AS cs.SD nlin.CD q-bio.NC

    Acoustical classification of different speech acts using nonlinear methods

    Authors: Chirayata Bhattacharyya, Sourya Sengupta, Sayan Nag, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

    Abstract: A recitation is a way of combining the words together so that they have a sense of rhythm and thus an emotional content is imbibed within. In this study we envisaged to answer these questions in a scientific manner taking into consideration 5 (five) well known Bengali recitations of different poets conveying a variety of moods ranging from joy to sorrow. The clips were recited as well as read (in…

    Submitted 5 August, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

    Comments: 6 pages, 2 figures; Proceedings of WESPAC 2018, New Delhi, India, November 11-15, 2018

  18. arXiv:2004.07820  [pdf]

    cs.SD cs.CL eess.AS

    Speaker Recognition in Bengali Language from Nonlinear Features

    Authors: Uddalok Sarkar, Soumyadeep Pal, Sayan Nag, Chirayata Bhattacharya, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

    Abstract: At present Automatic Speaker Recognition system is a very important issue due to its diverse applications. Hence, it becomes absolutely necessary to obtain models that take into consideration the speaking style of a person, vocal tract information, timbral qualities of his voice and other congenital information regarding his voice. The study of Bengali speech recognition and speaker identification…

    Submitted 15 April, 2020; originally announced April 2020.

    Comments: arXiv admin note: text overlap with arXiv:1612.00171, arXiv:1601.07709

  19. arXiv:1712.08336  [pdf]

    q-bio.NC cs.SD eess.AS physics.data-an

    Music of Brain and Music on Brain: A Novel EEG Sonification approach

    Authors: Sayan Nag, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

    Abstract: Can we hear the sound of our brain? Is there any technique which can enable us to hear the neuro-electrical impulses originating from the different lobes of brain? The answer to all these questions is YES. In this paper we present a novel method with which we can sonify the Electroencephalogram (EEG) data recorded in rest state as well as under the influence of a simplest acoustical stimuli - a ta…

    Submitted 22 December, 2017; originally announced December 2017.

    Comments: 6 pages, 4 figures; Presented at the International Symposium on Frontiers of Research in Speech and Music (FRSM)-2017, held at NIT, Rourkela on 15-16 December 2017