Showing 1–50 of 192 results for author: Kumar, S

Searching in archive eess.

  1. arXiv:2507.01350  [pdf, ps, other]

    eess.SY cs.MA cs.RO

    Cooperative Target Capture in 3D Engagements over Switched Dynamic Graphs

    Authors: Abhinav Sinha, Shashi Ranjan Kumar

    Abstract: This paper presents a leaderless cooperative guidance strategy for simultaneous time-constrained interception of a stationary target when the interceptors exchange information over switched dynamic graphs. We specifically focus on scenarios when the interceptors lack radial acceleration capabilities, relying solely on their lateral acceleration components. This consideration aligns with their inhe… ▽ More

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2506.19522  [pdf, ps, other]

    eess.SY

    Time-Constrained Interception of Seeker-Equipped Interceptors with Bounded Input

    Authors: Ashok Samrat R, Swati Singh, Shashi Ranjan Kumar

    Abstract: This paper presents a nonlinear guidance scheme designed to achieve precise interception of stationary targets at a pre-specified impact time. The proposed strategy essentially accounts for the constraints imposed by the interceptor's seeker field-of-view (FOV) and actuator limitations, which, if ignored, can degrade guidance performance. To address these challenges, the guidance law incorporates… ▽ More

    Submitted 24 June, 2025; originally announced June 2025.

  3. arXiv:2506.17005  [pdf, ps, other]

    eess.SY

    Trajectory tracking control of USV with actuator constraints in the presence of disturbances

    Authors: Ram Milan Kumar Verma, Shashi Ranjan Kumar, Hemendra Arya

    Abstract: All practical systems often pose a problem of finite control capability, which can notably degrade the performance if not properly addressed. Since actuator input bounds are typically known, integrating actuator saturation considerations into the control law design process can lead to enhanced performance and more precise trajectory tracking. Also, the actuators cannot provide the demanded forces… ▽ More

    Submitted 20 June, 2025; originally announced June 2025.

  4. arXiv:2506.14434  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Unifying Streaming and Non-streaming Zipformer-based ASR

    Authors: Bidisha Sharma, Karthik Pandia Durai, Shankar Venkatesan, Jeena J Prakash, Shashi Kumar, Malolan Chetlur, Andreas Stolcke

    Abstract: There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through the chunked atten… ▽ More

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted in ACL2025 Industry track

  5. arXiv:2506.09661  [pdf, ps, other]

    eess.IV cs.CV q-bio.TO

    A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma

    Authors: Garima Jain, Sanghamitra Pati, Mona Duggal, Amit Sethi, Abhijeet Patil, Gururaj Malekar, Nilesh Kowe, Jitender Kumar, Jatin Kashyap, Divyajeet Rout, Deepali, Hitesh, Nishi Halduniya, Sharat Kumar, Heena Tabassum, Rupinder Singh Dhaliwal, Sucheta Devi Khuraijam, Sushma Khuraijam, Sharmila Laishram, Simmi Kharb, Sunita Singh, K. Swaminadtan, Ranjana Solanki, Deepika Hemranjani, Shashank Nath Singh, et al. (12 additional authors not shown)

    Abstract: Oral squamous cell carcinoma (OSCC) is a major global health burden, particularly in several regions across Asia, Africa, and South America, where it accounts for a significant proportion of cancer cases. Early detection dramatically improves outcomes, with stage I cancers achieving up to 90 percent survival. However, traditional diagnosis based on histopathology has limited accessibility in low-res… ▽ More

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 7 pages, 2 figures

  6. arXiv:2506.05893  [pdf, ps, other]

    eess.SY

    Field-of-View and Input Constrained Impact Time Guidance Against Stationary Targets

    Authors: Swati Singh, Shashi Ranjan Kumar, Dwaipayan Mukherjee

    Abstract: This paper proposes a guidance strategy to achieve time-constrained interception of stationary targets, taking into account both the bounded field-of-view (FOV) of seeker-equipped interceptors and the actuator's physical constraints. Actuator saturation presents a significant challenge in real-world systems, often resulting in degraded performance. However, since these limitations are typically kn… ▽ More

    Submitted 6 June, 2025; originally announced June 2025.

  7. arXiv:2506.04981  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

    Authors: Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

    Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxilia… ▽ More

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025, Netherlands

  8. arXiv:2506.03681  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

    Authors: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

    Abstract: Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple sel… ▽ More

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025, Netherlands

  9. arXiv:2506.00145  [pdf]

    cs.CL cs.SD eess.AS

    Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry

    Authors: Sujeet Kumar, Pretam Ray, Abhinay Beerukuri, Shrey Kamoji, Manoj Balaji Jagadeeshan, Pawan Goyal

    Abstract: Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of… ▽ More

    Submitted 30 May, 2025; originally announced June 2025.

  10. arXiv:2505.18854  [pdf, ps, other]

    eess.SY

    Three-Dimensional Nonlinear Guidance with Impact Time and Field-of-view Constraints

    Authors: Ashok R Samrat, Swati Singh, Shashi Ranjan Kumar

    Abstract: This paper addresses the time-constrained interception of targets at a predetermined time with bounded field-of-view capability of the seeker-equipped interceptors. We propose guidance laws using the effective lead angle and velocity lead angles of the interceptor to achieve a successful interception of the target. The former scheme extends the existing two-dimensional guidance strategy to a three… ▽ More

    Submitted 24 May, 2025; originally announced May 2025.

  11. arXiv:2505.14716  [pdf]

    eess.IV cs.CV cs.ET cs.LG

    A Hybrid Quantum Classical Pipeline for X Ray Based Fracture Diagnosis

    Authors: Sahil Tomar, Rajeshwar Tripathi, Sandeep Kumar

    Abstract: Bone fractures are a leading cause of morbidity and disability worldwide, imposing significant clinical and economic burdens on healthcare systems. Traditional X ray interpretation is time consuming and error prone, while existing machine learning and deep learning solutions often demand extensive feature engineering, large, annotated datasets, and high computational resources. To address these ch… ▽ More

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 8 pages

  12. arXiv:2505.10933  [pdf, ps, other]

    eess.SP cs.IT

    Cross-layer Integrated Sensing and Communication: A Joint Industrial and Academic Perspective

    Authors: Henk Wymeersch, Nuutti Tervo, Stefan Wänstedt, Sharief Saleh, Joerg Ahlendorf, Ozgur Akgul, Vasileios Tsekenis, Sokratis Barmpounakis, Liping Bai, Martin Beale, Rafael Berkvens, Nabeel Nisar Bhat, Hui Chen, Shrayan Das, Claude Desset, Antonio de la Oliva, Prajnamaya Dass, Jeroen Famaey, Hamed Farhadi, Gerhard P. Fettweis, Yu Ge, Hao Guo, Rreze Halili, Katsuyuki Haneda, Abdur Rahman Mohamed Ismail, et al. (18 additional authors not shown)

    Abstract: Integrated sensing and communication (ISAC) enables radio systems to simultaneously sense and communicate with their environment. This paper, developed within the Hexa-X-II project funded by the European Union, presents a comprehensive cross-layer vision for ISAC in 6G networks, integrating insights from physical-layer design, hardware architectures, AI-driven intelligence, and protocol-level inno… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

  13. arXiv:2505.07365  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

    Authors: Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro

    Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Preprint. DCASE 2025 Audio QA Challenge: https://dcase.community/challenge2025/task-audio-question-answering

  14. arXiv:2505.04419  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Recognizing Ornaments in Vocal Indian Art Music with Active Annotation

    Authors: Sumit Kumar, Parampreet Singh, Vipul Arora

    Abstract: Ornamentations, embellishments, or microtonal inflections are essential to melodic expression across many musical traditions, adding depth, nuance, and emotional impact to performances. Recognizing ornamentations in singing voices is key to MIR, with potential applications in music pedagogy, singer identification, genre classification, and controlled singing voice generation. However, the lack of… ▽ More

    Submitted 7 May, 2025; originally announced May 2025.

  15. arXiv:2505.03054  [pdf, other]

    cs.AI cs.CL cs.SD eess.AS

    BLAB: Brutally Long Audio Bench

    Authors: Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar

    Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite… ▽ More

    Submitted 12 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

  16. arXiv:2505.01853  [pdf]

    eess.SP

    Artificial Intelligence implementation of onboard flexible payload and adaptive beamforming using commercial off-the-shelf devices

    Authors: Luis Manuel Garcés-Socarrás, Amirhosein Nik, Flor Ortiz, Juan A. Vásquez-Peralvo, Jorge Luis González Rios, Mouhamad Chehailty, Marcele Kuhfuss, Eva Lagunas, Jan Thoemel, Sumit Kumar, Vishal Singh, Juan Carlos Merlano Duncan, Sahar Malmir, Swetha Varadajulu, Jorge Querol, Symeon Chatzinotas

    Abstract: Very High Throughput satellites typically provide multibeam coverage; however, a common problem is that there can be a mismatch between the capacity of each beam and the traffic demand: some beams may fall short, while others exceed the requirements. This challenge can be addressed by integrating machine learning with flexible payload and adaptive beamforming techniques. These methods allow for dy… ▽ More

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: 5th ESA Workshop on Advanced Flexible Telecom Payloads, 10 pages, 6 figures

  17. arXiv:2504.11592  [pdf, other]

    eess.SY math.DS math.OC

    Provably Safe Control for Constrained Nonlinear Systems with Bounded Input

    Authors: Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha

    Abstract: In real-world control applications, actuator constraints and output constraints (specifically in tracking problems) are inherent and critical to ensuring safe and reliable operation. However, generally, control strategies often neglect these physical limitations, leading to potential instability, degraded performance, or even system failure when deployed on real-world systems. This paper addresses… ▽ More

    Submitted 15 April, 2025; originally announced April 2025.

  18. arXiv:2504.10793  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures

    Authors: Kuang Yuan, Yifeng Wang, Xiyuxing Zhang, Chengyi Shen, Swarun Kumar, Justin Chan

    Abstract: Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto inco… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

  19. arXiv:2503.13277  [pdf]

    eess.IV cs.AI cs.CV

    Artificial Intelligence-Driven Prognostic Classification of COVID-19 Using Chest X-rays: A Deep Learning Approach

    Authors: Alfred Simbun, Suresh Kumar

    Abstract: Background: The COVID-19 pandemic has overwhelmed healthcare systems, emphasizing the need for AI-driven tools to assist in rapid and accurate patient prognosis. Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency. Objective: This study presents a high-accuracy deep learning model for classifying COVID-19 seve… ▽ More

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: 27 pages, 6 figures, 10 tables

  20. arXiv:2503.06375  [pdf, other]

    eess.AS

    ProSE: Diffusion Priors for Speech Enhancement

    Authors: Sonal Kumar, Sreyan Ghosh, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha

    Abstract: Speech enhancement (SE) is the foundational task of enhancing the clarity and quality of speech in the presence of non-stationary additive noise. While deterministic deep learning models have been commonly employed for SE, recent research indicates that generative models, such as denoising diffusion probabilistic models (DDPMs), have shown promise. However, unlike speech generation, SE has a stron… ▽ More

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: Accepted at NAACL 2025

  21. arXiv:2503.06057  [pdf, other]

    eess.SY

    A 2-6 GHz Ultra-Wideband CMOS Transceiver for Radar Applications

    Authors: Alin Thomas Tharakan, Prince Philip, Gokulan T., Sumit Kumar, Gaurab Banerjee

    Abstract: This paper presents a low-power, low-cost transceiver architecture to implement radar-on-a-chip. The transceiver comprises a full ultra-wideband (UWB) transmitter and a full UWB receiver. A design methodology to maximize the tuning range of the voltage-controlled oscillator (VCO) is presented. At the transmitter side, a sub-harmonic mixer is used for signal up-conversion. The receiver lo… ▽ More

    Submitted 9 April, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

  22. arXiv:2503.03983  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

    Authors: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro

    Abstract: Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, an… ▽ More

    Submitted 5 March, 2025; originally announced March 2025.

  23. arXiv:2502.13893  [pdf, other]

    cs.SD eess.AS

    Audio-Based Classification of Insect Species Using Machine Learning Models: Cicada, Beetle, Termite, and Cricket

    Authors: Manas V Shetty, Yoga Disha Sendhil Kumar

    Abstract: This project addresses the challenge of classifying insect species: Cicada, Beetle, Termite, and Cricket using sound recordings. Accurate species identification is crucial for ecological monitoring and pest management. We employ machine learning models such as XGBoost, Random Forest, and K Nearest Neighbors (KNN) to analyze audio features, including Mel Frequency Cepstral Coefficients (MFCC). The… ▽ More

    Submitted 19 February, 2025; originally announced February 2025.

  24. arXiv:2502.01588  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

    Authors: Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

    Abstract: Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a no… ▽ More

    Submitted 3 February, 2025; originally announced February 2025.

  25. arXiv:2502.01108  [pdf, other]

    cs.LG cs.AI eess.SP

    Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings

    Authors: Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar

    Abstract: Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to generalize across diverse health applications. In this paper, we introduce Pulse-PPG, the first open-source PPG foundation model trained exclusively on raw PPG data collected over a 100-day field study with 120 participants. Existing PPG foundation… ▽ More

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: The first two listed authors contributed equally to this research

  26. arXiv:2501.09999  [pdf, other]

    cs.CV cs.AI eess.IV

    Deep Learning for Early Alzheimer Disease Detection with MRI Scans

    Authors: Mohammad Rafsan, Tamer Oraby, Upal Roy, Sanjeev Kumar, Hansapani Rodrigo

    Abstract: Alzheimer's Disease is a neurodegenerative condition characterized by dementia and impairment in neurological function. The study primarily focuses on the individuals above age 40, affecting their memory, behavior, and cognitive processes of the brain. Alzheimer's disease requires diagnosis by a detailed assessment of MRI scans and neuropsychological tests of the patients. This project compares ex… ▽ More

    Submitted 17 January, 2025; originally announced January 2025.

  27. arXiv:2412.18771  [pdf, ps, other]

    cs.IT eess.SP

    RIS-Assisted MIMO CV-QKD at THz Frequencies: Channel Estimation and SKR Analysis

    Authors: Sushil Kumar, Soumya P. Dash, Debasish Ghose, George C. Alexandropoulos

    Abstract: In this paper, a multiple-input multiple-output (MIMO) wireless system incorporating a reconfigurable intelligent surface (RIS) to efficiently operate at terahertz (THz) frequencies is considered. The transmitter, Alice, employs continuous-variable quantum key distribution (CV-QKD) to communicate secret keys to the receiver, Bob, which utilizes either homodyne or heterodyne detection. The latter n… ▽ More

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: 11 pages, 6 figures

  28. arXiv:2412.17823  [pdf]

    eess.SP cs.AI cs.LG eess.SY

    RUL forecasting for wind turbine predictive maintenance based on deep learning

    Authors: Syed Shazaib Shah, Tan Daoliang, Sah Chandan Kumar

    Abstract: Predictive maintenance (PdM) is increasingly pursued to reduce wind farm operation and maintenance costs by accurately predicting the remaining useful life (RUL) and strategically scheduling maintenance. However, the remoteness of wind farms often renders current methodologies ineffective, as they fail to provide a sufficiently reliable advance time window for maintenance planning, limiting PdM's… ▽ More

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: 19 pages, 16 figures, Journal Paper

    Report number: Volume 10, Issue 20, e39268, October 30, 2024

    MSC Class: 14J60 (Primary)

    Journal ref: Heliyon, Volume 10, Issue 20, e39268, October 30, 2024

  29. arXiv:2412.09789  [pdf, other]

    cs.SD eess.AS

    SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

    Authors: Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto

    Abstract: The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise and duration, enabling creative appl… ▽ More

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: Website: https://sonalkum.github.io/SILA/

  30. arXiv:2411.12919  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Robust multi-coil MRI reconstruction via self-supervised denoising

    Authors: Asad Aali, Marius Arvinte, Sidharth Kumar, Yamin I. Arefeen, Jonathan I. Tamir

    Abstract: We study the effect of incorporating self-supervised denoising as a pre-processing step for training deep learning (DL) based reconstruction methods on data corrupted by Gaussian noise. K-space data employed for training are typically multi-coil and inherently noisy. Although DL-based reconstruction methods trained on fully sampled data can enable high reconstruction quality, obtaining large, nois… ▽ More

    Submitted 24 May, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

    Journal ref: MRM, 2025

  31. arXiv:2411.10716  [pdf, other]

    cs.LG cs.CE eess.SP

    FlowScope: Enhancing Decision Making by Time Series Forecasting based on Prediction Optimization using HybridFlow Forecast Framework

    Authors: Nitin Sagar Boyeena, Begari Susheel Kumar

    Abstract: Time series forecasting is crucial in several sectors, such as meteorology, retail, healthcare, and finance. Accurately forecasting future trends and patterns is crucial for strategic planning and making well-informed decisions. In this case, it is crucial to include many forecasting methodologies. The strengths of Auto-regressive Integrated Moving Average (ARIMA) for linear time series, Seasonal… ▽ More

    Submitted 16 November, 2024; originally announced November 2024.

    Comments: 12 pages and 6 figures

    MSC Class: 62M10 (Primary); 68T07 (Secondary)

    ACM Class: I.2.6; G.3; I.5

  32. arXiv:2411.03866  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

    Authors: Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke

    Abstract: Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In th… ▽ More

    Submitted 22 January, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: Accepted in ICASSP 2025 SALMA Workshop

    Journal ref: Proc. ICASSP Workshop on Speech and Audio Language Models (SALMA), 2025

  33. arXiv:2410.19168  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

    Abstract: The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural langu… ▽ More

    Submitted 24 October, 2024; originally announced October 2024.

    Comments: Project Website: https://sakshi113.github.io/mmau_homepage/

  34. arXiv:2410.16505  [pdf, other]

    cs.SD cs.LG eess.AS

    Do Audio-Language Models Understand Linguistic Variations?

    Authors: Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha

    Abstract: Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, w… ▽ More

    Submitted 19 February, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted to NAACL 2025

  35. arXiv:2410.15670  [pdf]

    eess.IV cs.CV

    Transforming Blood Cell Detection and Classification with Advanced Deep Learning Models: A Comparative Study

    Authors: Shilpa Choudhary, Sandeep Kumar, Pammi Sri Siddhaarth, Guntu Charitasri

    Abstract: Efficient detection and classification of blood cells are vital for accurate diagnosis and effective treatment of blood disorders. This study utilizes a YOLOv10 model trained on Roboflow data with images resized to 640x640 pixels across varying epochs. The results show that increased training epochs significantly enhance accuracy, precision, and recall, particularly in real-time blood cell detecti… ▽ More

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: 26 pages, 4884 Words, 17 Figures, 10 Tables

  36. arXiv:2410.15062  [pdf, other]

    cs.SD eess.AS

    PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification

    Authors: Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

    Abstract: Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification performance of CLAP-like ALMs. To achieve this, we propose to improve the cross-modal interaction between audio and language modalities by… ▽ More

    Submitted 19 October, 2024; originally announced October 2024.

    Comments: 18 pages

  37. arXiv:2410.13179  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

    Authors: Ashish Seth, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

    Abstract: In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions… ▽ More

    Submitted 16 October, 2024; originally announced October 2024.

  38. arXiv:2410.02056  [pdf, other]

    eess.AS cs.AI cs.CL

    Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

    Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha

    Abstract: We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-wo… ▽ More

    Submitted 11 March, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: Accepted at ICLR 2025. Code and Checkpoints available here: https://github.com/Sreyan88/Synthio

  39. arXiv:2409.17352  [pdf, other]

    cs.SI eess.SY

    On the Interplay of Clustering and Evolution in the Emergence of Epidemic Outbreaks

    Authors: Mansi Sood, Hejin Gu, Rashad Eletreby, Swarun Kumar, Chai Wah Wu, Osman Yagan

    Abstract: In an increasingly interconnected world, a key scientific challenge is to examine mechanisms that lead to the widespread propagation of contagions, such as misinformation and pathogens, and identify risk factors that can trigger large-scale outbreaks. Underlying both the spread of disease and misinformation epidemics is the evolution of the contagion as it propagates, leading to the emergence of d… ▽ More

    Submitted 25 September, 2024; originally announced September 2024.

  40. arXiv:2409.16469  [pdf, other]

    cs.CL cs.SD eess.AS

    Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

    Authors: Leonid Velikovich, Christopher Li, Diamantino Caseiro, Shankar Kumar, Pat Rondon, Kandarp Joshi, Xavier Velez

    Abstract: For end-to-end Automatic Speech Recognition (ASR) models, recognizing personal or rare phrases can be hard. A promising way to improve accuracy is through spelling correction (or rewriting) of the ASR lattice, where potentially misrecognized phrases are replaced with acoustically similar and contextually relevant alternatives. However, rewriting is challenging for ASR models trained with connectio…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: 8 pages, 7 figures

  41. ECHO: Environmental Sound Classification with Hierarchical Ontology-guided Semi-Supervised Learning

    Authors: Pranav Gupta, Raunak Sharma, Rashmi Kumari, Sri Krishna Aditya, Shwetank Choudhary, Sumit Kumar, Kanchana M, Thilagavathy R

    Abstract: Environment Sound Classification has been a well-studied research problem in the field of signal processing and up till now more focus has been laid on fully supervised approaches. Over the last few years, focus has moved towards semi-supervised methods which concentrate on the utilization of unlabeled data, and self-supervised methods which learn the intermediate representation through pretext ta…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: IEEE CONECCT 2024, Signal Processing and Pattern Recognition, Environmental Sound Classification, ESC

  42. arXiv:2409.13514  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Unifying Global and Near-Context Biasing in a Single Trie Pass

    Authors: Iuliia Thorbecke, Esaú Villatoro-Tello, Juan Zuluaga-Gomez, Shashi Kumar, Sergio Burdisso, Pradeep Rangappa, Andrés Carofilis, Srikanth Madikeri, Petr Motlicek, Karthik Pandia, Kadri Hacioğlu, Andreas Stolcke

    Abstract: Despite the success of end-to-end automatic speech recognition (ASR) models, challenges persist in recognizing rare, out-of-vocabulary words - including named entities (NE) - and in adapting to new domains using only text data. This work presents a practical approach to address these challenges through an unexplored combination of an NE bias list and a word-level n-gram language model (LM). This s…

    Submitted 2 July, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted to TSD2025

  43. arXiv:2409.13499  [pdf, other]

    cs.CL cs.SD eess.AS

    Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

    Authors: Iuliia Thorbecke, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Shashi Kumar, Pradeep Rangappa, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

    Abstract: The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model just in one stage and…

    Submitted 7 October, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted to EMNLP Findings 2024

  44. arXiv:2409.09213  [pdf, other]

    eess.AS cs.CL cs.SD

    ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

    Authors: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: Open-vocabulary audio-language models, like CLAP, offer a promising approach for zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category l…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: Code and Checkpoints: https://github.com/Sreyan88/ReCLAP

  45. arXiv:2409.08507  [pdf, other]

    eess.SY cs.RO math.DS math.OC

    Three-dimensional Nonlinear Path-following Guidance with Bounded Input Constraints

    Authors: Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha

    Abstract: In this paper, we consider the tracking of arbitrary curvilinear geometric paths in three-dimensional output spaces of unmanned aerial vehicles (UAVs) without pre-specified timing requirements, commonly referred to as path-following problems, subjected to bounded inputs. Specifically, we propose a novel nonlinear path-following guidance law for a UAV that enables it to follow any smooth curvilinea…

    Submitted 16 March, 2025; v1 submitted 12 September, 2024; originally announced September 2024.

  46. arXiv:2409.06137  [pdf, other]

    eess.AS cs.SD eess.SP

    DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing

    Authors: Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj

    Abstract: The quality of audio recordings in outdoor environments is often degraded by the presence of wind. Mitigating the impact of wind noise on the perceptual quality of single-channel speech remains a significant challenge due to its non-stationary characteristics. Prior work in noise suppression treats wind noise as a general background noise without explicit modeling of its characteristics. In this p…

    Submitted 9 September, 2024; originally announced September 2024.

  47. arXiv:2409.05356  [pdf, other]

    cs.CL cs.LG cs.SD eess.SP

    IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS

    Authors: Ashwin Sankar, Srija Anand, Praveen Srinivasa Varadhan, Sherry Thomas, Mehak Singal, Shridhar Kumar, Deovrat Mehendale, Aditi Krishana, Giri Raju, Mitesh Khapra

    Abstract: Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations…

    Submitted 7 October, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted to NeurIPS 2024 Datasets and Benchmarks track

  48. arXiv:2409.04976  [pdf, other]

    cs.AR cs.AI cs.CV eess.IV

    HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator

    Authors: Sonu Kumar, Komal Gupta, Gopal Raut, Mukul Lokhande, Santosh Kumar Vishvakarma

    Abstract: Deep neural networks (DNNs) offer plenty of challenges in executing efficient computation at edge nodes, primarily due to the huge hardware resource demands. The article proposes HYDRA, hybrid data multiplexing, and runtime layer configurable DNN accelerators to overcome the drawbacks. The work proposes a layer-multiplexed approach, which further reuses a single activation function within the exec…

    Submitted 8 September, 2024; originally announced September 2024.

  49. HistoSPACE: Histology-Inspired Spatial Transcriptome Prediction And Characterization Engine

    Authors: Shivam Kumar, Samrat Chatterjee

    Abstract: Spatial transcriptomics (ST) enables the visualization of gene expression within the context of tissue morphology. This emerging discipline has the potential to serve as a foundation for developing tools to design precision medicines. However, due to the higher costs and expertise required for such experiments, its translation into a regular clinical practice might be challenging. Despite the impl…

    Submitted 7 August, 2024; originally announced August 2024.

  50. arXiv:2407.19773  [pdf, other]

    eess.IV cs.CV q-bio.QM

    Unmasking unlearnable models: a classification challenge for biomedical images without visible cues

    Authors: Shivam Kumar, Samrat Chatterjee

    Abstract: Predicting traits from images lacking visual cues is challenging, as algorithms are designed to capture visually correlated ground truth. This problem is critical in biomedical sciences, and their solution can improve the efficacy of non-invasive methods. For example, a recent challenge of predicting MGMT methylation status from MRI images is critical for treatment decisions of glioma patients. Us…

    Submitted 29 July, 2024; originally announced July 2024.