Search | arXiv e-print repository

A Versatile and Programmable UAV Platform for Radio Access Network and End-to-End Cellular Measurements

Authors: Sherwan Jalal Abdullah, Sravan Reddy Chintareddy, Victor S. Frost, Shawn Keshmiri, Morteza Hashemi

Abstract: In this work, we develop a measurement platform to capture mobile network performance metrics including coverage and quality of service in regions where conventional coverage testing approaches are frequently time-intensive, labor-demanding, and occasionally hazardous. Traditionally, crowd-sourcing methods are used to collect cellular network performance metrics. However, these approaches are inad… ▽ More In this work, we develop a measurement platform to capture mobile network performance metrics including coverage and quality of service in regions where conventional coverage testing approaches are frequently time-intensive, labor-demanding, and occasionally hazardous. Traditionally, crowd-sourcing methods are used to collect cellular network performance metrics. However, these approaches are inadequate in rural areas due to low-density population, and difficult terrain. The platform described here is a UAV-based and is designed to investigate the mobile network performance through aerial operations and gather Radio Access Network (RAN) signal alongside end-to-end network performance metrics. Our platform gathers metrics through the integration of an onboard computation unit and commercial off-the-shelf cellular modem. The gathered data are subsequently analyzed and displayed using geospatial mapping utilities and statistical techniques to deliver key observations on cellular network performance. Experimental results showed that the received signal power improves at higher altitudes due to enhanced line-of-sight (LoS) conditions as expected. However, the signal quality degrades as a result of increased interference from neighboring cells. The analysis reveals that for most of the geographic area covered in the initial experiments the system maintained acceptable signal quality, with adequate throughput performance for both uplink and downlink communications, while maintaining satisfactory round-trip time characteristics. Notably, the experiment showed that a strong radio signal metric for a given cell does not necessarily translate to consistent spatial coverage across the tested region. △ Less

Submitted 3 September, 2025; originally announced September 2025.

arXiv:2508.17180 [pdf, ps, other]

MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Authors: Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover

Abstract: A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on t… ▽ More A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities. △ Less

Submitted 23 August, 2025; originally announced August 2025.

arXiv:2508.15690 [pdf, ps, other]

GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

Authors: Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran

Abstract: GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematica… ▽ More GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field. △ Less

Submitted 21 August, 2025; originally announced August 2025.

Comments: 23 pages, 9 tables, 3 figures

arXiv:2508.06585 [pdf, ps, other]

CountQA: How Well Do MLLMs Count in the Wild?

Authors: Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, Sahiti Yerramilli

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densit… ▽ More Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research. △ Less

Submitted 8 August, 2025; originally announced August 2025.

arXiv:2507.08806 [pdf, ps, other]

Think Clearly: Improving Reasoning via Redundant Token Pruning

Authors: Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Abstract: Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In t… ▽ More Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking, i.e., removing distraction. Specifically, we systematically identify reasoning redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematical competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent. △ Less

Submitted 17 June, 2025; originally announced July 2025.

arXiv:2506.17530 [pdf, ps, other]

Deep-OFDM: Neural Modulation for High Mobility

Authors: Sravan Kumar Ankireddy, S. Ashwin Hebbar, Pramod Viswanath, Hyeji Kim

Abstract: Orthogonal Frequency Division Multiplexing (OFDM) is the foundational waveform in current 5G deployments due to its robustness in quasi-static channels and efficient spectrum use. However, in high-mobility scenarios, OFDM suffers from inter-carrier interference (ICI), and its reliance on dense pilot patterns and cyclic prefixes significantly reduces spectral efficiency. In this work, we propose De… ▽ More Orthogonal Frequency Division Multiplexing (OFDM) is the foundational waveform in current 5G deployments due to its robustness in quasi-static channels and efficient spectrum use. However, in high-mobility scenarios, OFDM suffers from inter-carrier interference (ICI), and its reliance on dense pilot patterns and cyclic prefixes significantly reduces spectral efficiency. In this work, we propose Deep-OFDM: a learnable modulation framework that augments traditional OFDM by incorporating neural parameterization. Instead of mapping each symbol to a fixed resource element, Deep-OFDM spreads information across the OFDM grid using a convolutional neural network modulator. This modulator is jointly optimized with a neural receiver through end-to-end training, enabling the system to adapt to time-varying channels without relying on explicit channel estimation. Deep-OFDM outperforms conventional OFDM when paired with neural receiver baselines, particularly in pilot-sparse and pilotless regimes, achieving substantial gains in BLER and goodput, especially at high Doppler frequencies. In the pilotless setting, the neural modulator learns a low-rank structure that resembles a superimposed pilot, effectively enabling reliable communication without explicit overhead. Deep-OFDM demonstrates significant improvements in BLER and goodput at high Doppler frequencies across various scenarios, including MIMO systems. Comprehensive ablation studies quantify the role of nonlinear activations and characterize performance-complexity trade-offs. These results highlight the potential of transmitter-receiver co-design for robust, resource-efficient communication, paving the way for AI-native physical layer designs in next-generation wireless systems. △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: 13 pages, 15 figures, 3 tables

arXiv:2506.12103 [pdf, other]

The Amazon Nova Family of Models: Technical Report and Model Card

Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue , et al. (761 additional authors not shown)

Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents… ▽ More We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation. △ Less

Submitted 17 March, 2025; originally announced June 2025.

Comments: 48 pages, 10 figures

Report number: 20250317

arXiv:2506.09068 [pdf, ps, other]

BG-HOP: A Bimanual Generative Hand-Object Prior

Authors: Sriram Krishna, Sravan Chittupalli, Sungjae Park

Abstract: In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model's capability to generate bimanual interaction… ▽ More In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model's capability to generate bimanual interactions and synthesize grasps for given objects. We make code and models publicly available. △ Less

Submitted 8 June, 2025; originally announced June 2025.

Comments: Presented at Agents in Interaction, from Humans to Robots, CVPR 2025

arXiv:2506.04708 [pdf, other]

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Authors: Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding appro… ▽ More Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning. △ Less

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.03194 [pdf, ps, other]

HueManity: Probing Fine-Grained Visual Perception in MLLMs

Authors: Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pat… ▽ More Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric `easy' task and a striking 3% on the alphanumeric `hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs. △ Less

Submitted 15 July, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.01215 [pdf, other]

Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Abstract: As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches req… ▽ More As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance. △ Less

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2506.01206 [pdf, other]

Mamba Drafters for Speculative Decoding

Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but requi… ▽ More Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability. △ Less

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2506.00785 [pdf, ps, other]

GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

Authors: Sahiti Yerramilli, Nilay Pande, Rynaa Grover, Jayant Sravan Tamarapalli

Abstract: This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization acr… ▽ More This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs. △ Less

Submitted 15 July, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.21681 [pdf, ps, other]

Residual Diffusion Models for Variable-Rate Joint Source Channel Coding of MIMO CSI

Authors: Sravan Kumar Ankireddy, Heasung Kim, Hyeji Kim

Abstract: Despite significant advancements in deep learning-based CSI compression, some key limitations remain unaddressed. Current approaches predominantly treat CSI compression as a source coding problem, neglecting transmission errors. In finite block length regimes, separate source and channel coding proves suboptimal, with reconstruction performance deteriorating significantly under challenging channel… ▽ More Despite significant advancements in deep learning-based CSI compression, some key limitations remain unaddressed. Current approaches predominantly treat CSI compression as a source coding problem, neglecting transmission errors. In finite block length regimes, separate source and channel coding proves suboptimal, with reconstruction performance deteriorating significantly under challenging channel conditions. While existing autoencoder-based compression schemes can be readily extended to support joint source-channel coding, they struggle to capture complex channel distributions and exhibit poor scalability with increasing parameter count. To overcome these inherent limitations of autoencoder-based approaches, we propose Residual-Diffusion Joint Source-Channel Coding (RD-JSCC), a novel framework that integrates a lightweight autoencoder with a residual diffusion module to iteratively refine CSI reconstruction. Our flexible decoding strategy balances computational efficiency and performance by dynamically switching between low-complexity autoencoder decoding and sophisticated diffusion-based refinement based on channel conditions. Comprehensive simulations demonstrate that RD-JSCC significantly outperforms existing autoencoder-based approaches in challenging wireless environments. Furthermore, RD-JSCC offers several practical features, including a low-latency 2-step diffusion during inference, support for multiple compression rates with a single model, robustness to fixed-bit quantization, and adaptability to imperfect channel estimation. △ Less

Submitted 27 May, 2025; originally announced May 2025.

Comments: 13 pages, 11 figures

arXiv:2504.08177 [pdf, other]

SynthFM: Training Modality-agnostic Foundation Models for Medical Image Segmentation without Real Medical Data

Authors: Sourya Sengupta, Satrajit Chakrabarty, Keerthi Sravan Ravi, Gopal Avinash, Ravi Soni

Abstract: Foundation models like the Segment Anything Model (SAM) excel in zero-shot segmentation for natural images but struggle with medical image segmentation due to differences in texture, contrast, and noise. Annotating medical images is costly and requires domain expertise, limiting large-scale annotated data availability. To address this, we propose SynthFM, a synthetic data generation framework that… ▽ More Foundation models like the Segment Anything Model (SAM) excel in zero-shot segmentation for natural images but struggle with medical image segmentation due to differences in texture, contrast, and noise. Annotating medical images is costly and requires domain expertise, limiting large-scale annotated data availability. To address this, we propose SynthFM, a synthetic data generation framework that mimics the complexities of medical images, enabling foundation models to adapt without real medical data. Using SAM's pretrained encoder and training the decoder from scratch on SynthFM's dataset, we evaluated our method on 11 anatomical structures across 9 datasets (CT, MRI, and Ultrasound). SynthFM outperformed zero-shot baselines like SAM and MedSAM, achieving superior results under different prompt settings and on out-of-distribution datasets. △ Less

Submitted 10 April, 2025; originally announced April 2025.

arXiv:2503.04992 [pdf, ps, other]

Wanda++: Pruning Large Language Models via Regional Gradients

Authors: Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar

Abstract: Large Language Models (LLMs) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level \textbf{regional} gradients.… ▽ More Large Language Models (LLMs) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level \textbf{regional} gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder output. Notably, Wanda++ improves perplexity by up to 32\% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Moreover, despite updating weights with regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, further reducing perplexity with LoRA in great extend. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU. △ Less

Submitted 1 June, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: Paper accepted at ACL 2025 Findings

arXiv:2502.18002 [pdf, other]

A Radon-Nikodým Perspective on Anomaly Detection: Theory and Implications

Authors: Shlok Mehendale, Aditya Challa, Rahul Yedida, Sravan Danda, Santonu Sarkar, Snehanshu Saha

Abstract: Which principle underpins the design of an effective anomaly detection loss function? The answer lies in the concept of Radon-Nikodým theorem, a fundamental concept in measure theory. The key insight from this article is: Multiplying the vanilla loss function with the Radon-Nikodým derivative improves the performance across the board. We refer to this as RN-Loss. We prove this using the setting of… ▽ More Which principle underpins the design of an effective anomaly detection loss function? The answer lies in the concept of Radon-Nikodým theorem, a fundamental concept in measure theory. The key insight from this article is: Multiplying the vanilla loss function with the Radon-Nikodým derivative improves the performance across the board. We refer to this as RN-Loss. We prove this using the setting of PAC (Probably Approximately Correct) learnability. Depending on the context a Radon-Nikodým derivative takes different forms. In the simplest case of supervised anomaly detection, Radon-Nikodým derivative takes the form of a simple weighted loss. In the case of unsupervised anomaly detection (with distributional assumptions), Radon-Nikodým derivative takes the form of the popular cluster based local outlier factor. We evaluate our algorithm on 96 datasets, including univariate and multivariate data from diverse domains, including healthcare, cybersecurity, and finance. We show that RN-Derivative algorithms outperform state-of-the-art methods on 68% of Multivariate datasets (based on F1 scores) and also achieves peak F1-scores on 72% of time series (Univariate) datasets. △ Less

Submitted 16 May, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

arXiv:2501.07957 [pdf, other]

AI Guide Dog: Egocentric Path Prediction on Smartphone

Authors: Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Priyam Kumar, Aditi Sharma, Ben Sukboontip, Jayant Sravan Tamarapalli, Jingyi Zhang, Anirudh Koul

Abstract: This paper presents AI Guide Dog (AIGD), a lightweight egocentric (first-person) navigation system for visually impaired users, designed for real-time deployment on smartphones. AIGD employs a vision-only multi-label classification approach to predict directional commands, ensuring safe navigation across diverse environments. We introduce a novel technique for goal-based outdoor navigation by inte… ▽ More This paper presents AI Guide Dog (AIGD), a lightweight egocentric (first-person) navigation system for visually impaired users, designed for real-time deployment on smartphones. AIGD employs a vision-only multi-label classification approach to predict directional commands, ensuring safe navigation across diverse environments. We introduce a novel technique for goal-based outdoor navigation by integrating GPS signals and high-level directions, while also handling uncertain multi-path predictions for destination-free indoor navigation. As the first navigation assistance system to handle both goal-oriented and exploratory navigation across indoor and outdoor settings, AIGD establishes a new benchmark in blind navigation. We present methods, datasets, evaluations, and deployment insights to encourage further innovations in assistive navigation systems. △ Less

Submitted 16 February, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

Comments: Accepted at the AAAI 2025 Spring Symposium on Human-Compatible AI for Well-being: Harnessing Potential of GenAI for AI-Powered Science

arXiv:2412.03035 [pdf, other]

A Granger-Causal Perspective on Gradient Descent with Application to Pruning

Authors: Aditya Shah, Aditya Challa, Sravan Danda, Archana Mathur, Snehanshu Saha

Abstract: Stochastic Gradient Descent (SGD) is the main approach to optimizing neural networks. Several generalization properties of deep networks, such as convergence to a flatter minima, are believed to arise from SGD. This article explores the causality aspect of gradient descent. Specifically, we show that the gradient descent procedure has an implicit granger-causal relationship between the reduction i… ▽ More Stochastic Gradient Descent (SGD) is the main approach to optimizing neural networks. Several generalization properties of deep networks, such as convergence to a flatter minima, are believed to arise from SGD. This article explores the causality aspect of gradient descent. Specifically, we show that the gradient descent procedure has an implicit granger-causal relationship between the reduction in loss and a change in parameters. By suitable modifications, we make this causal relationship explicit. A causal approach to gradient descent has many significant applications which allow greater control. In this article, we illustrate the significance of the causal approach using the application of Pruning. The causal approach to pruning has several interesting properties - (i) We observe a phase shift as the percentage of pruned parameters increase. Such phase shift is indicative of an optimal pruning strategy. (ii) After pruning, we see that minima becomes "flatter", explaining the increase in accuracy after pruning weights. △ Less

Submitted 4 December, 2024; originally announced December 2024.

arXiv:2410.20252 [pdf, other]

Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

Authors: Sullam Jeoung, Goeric Huybrechts, Bhavana Ganesh, Aram Galstyan, Sravan Bodapati

Abstract: Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-a… ▽ More Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time, and addresses an important limitation of existing methods which typically involve sampling redundant or irrelevant frames. To enhance the reasoning abilities of our video-understanding agent, we leverage the self-reflective capabilities of LLMs to provide verbal reinforcement to the agent, which leads to improved performance while minimizing the number of frames accessed. We evaluate our method across several video understanding benchmarks and demonstrate that not only it enhances state-of-the-art performance but also improves efficiency by reducing the number of frames sampled. △ Less

Submitted 26 October, 2024; originally announced October 2024.

arXiv:2410.09362 [pdf, other]

SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Authors: Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan

Abstract: Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used in DAAs are usually collected before the alignment training begins and remain unchanged (off-policy). This can lead to two problems where the policy… ▽ More Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used in DAAs are usually collected before the alignment training begins and remain unchanged (off-policy). This can lead to two problems where the policy model (1) picks up on spurious correlations in the dataset (as opposed to learning the intended alignment expressed in the human preference labels), and (2) overfits to feedback on off-policy trajectories that have less likelihood of being generated by an updated policy model. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises of two components: (1) sample selection using implicit reward margins, which helps alleviate over-fitting to some undesired features, and (2) preference bootstrapping using implicit rewards to augment preference data with updated policy models in a cost-efficient manner. Extensive experimentation, including some on instruction-following tasks, demonstrate the effectiveness and generality of SeRA in training LLMs on offline preference datasets with DAAs. △ Less

Submitted 12 October, 2024; originally announced October 2024.

arXiv:2410.03775 [pdf, other]

Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge

Authors: Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth

Abstract: The effectiveness of automatic evaluation of generative models is typically measured by comparing the labels generated via automation with labels by humans using correlation metrics. However, metrics like Krippendorff's $α$ and Randolph's $κ$ were originally designed to measure the reliability of human labeling, thus make assumptions about typical human labeling behavior, and these assumptions may… ▽ More The effectiveness of automatic evaluation of generative models is typically measured by comparing the labels generated via automation with labels by humans using correlation metrics. However, metrics like Krippendorff's $α$ and Randolph's $κ$ were originally designed to measure the reliability of human labeling, thus make assumptions about typical human labeling behavior, and these assumptions may not be applicable to machine generated labels. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human assigned labels is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label compared to the human-to-human (HH) correlation. This can create the illusion that labels from automatic evaluation approximates the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine and human labels fall well below HH correlation. Based on these findings, we first propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric - binned Jensen-Shannon Divergence for perception for such scenarios to better measure the effectiveness of automatic evaluations. We present visualization techniques -- perception charts, to contextualize correlation measures appropriately. We have open-sourced at https://github.com/amazon-science/BeyondCorrelation. △ Less

Submitted 27 January, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Accepted at ICLR 2025

arXiv:2409.11580 [pdf, other]

PLATO: Planning with LLMs and Affordances for Tool Manipulation

Authors: Arvind Car, Sai Sravan Yarlagadda, Alison Bartsch, Abraham George, Amir Barati Farimani

Abstract: As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language m… ▽ More As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents to operate without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is particularly tested on challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO's design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across various complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution to integrate LLMs with robotic platforms, advancing the state-of-the-art in autonomous robotic task execution. For videos and prompt details, please see our project website: https://sites.google.com/andrew.cmu.edu/plato △ Less

Submitted 17 September, 2024; originally announced September 2024.

Comments: 7 pages, 4 figures, submitted to ICRA 2025

arXiv:2407.06443 [pdf, other]

Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment

Authors: Qizhang Feng, Siva Rajesh Kasa, Santhosh Kumar Kasa, Hyokun Yun, Choon Hui Teo, Sravan Babu Bodapati

Abstract: Large Language Models (LLMs) have seen widespread adoption due to their remarkable natural language capabilities. However, when deploying them in real-world settings, it is important to align LLMs to generate texts according to acceptable human standards. Methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) have enabled significant progress in refining LLMs u… ▽ More Large Language Models (LLMs) have seen widespread adoption due to their remarkable natural language capabilities. However, when deploying them in real-world settings, it is important to align LLMs to generate texts according to acceptable human standards. Methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) have enabled significant progress in refining LLMs using human preference data. However, the privacy concerns inherent in utilizing such preference data have yet to be adequately studied. In this paper, we investigate the vulnerability of LLMs aligned using two widely used methods - DPO and PPO - to membership inference attacks (MIAs). Our study has two main contributions: first, we theoretically motivate that DPO models are more vulnerable to MIA compared to PPO models; second, we introduce a novel reference-based attack framework specifically for analyzing preference data called PREMIA (\uline{Pre}ference data \uline{MIA}). Using PREMIA and existing baselines we empirically show that DPO models have a relatively heightened vulnerability towards MIA. △ Less

Submitted 27 April, 2025; v1 submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.02233 [pdf, other]

Synthetic Multimodal Question Generation

Authors: Ian Wu, Sravan Jayanthi, Vijay Viswanathan, Simon Rosenberg, Sina Pakazad, Tongshuang Wu, Graham Neubig

Abstract: Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, large language model… ▽ More Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, large language model (LLM) and large multimodal model (LMM) to generate question and answer pairs directly from multimodal documents, with the questions conforming to specified styles and modalities. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it, revealing insights into model performance that are attainable only through style- and modality-specific evaluation data. Next, we measure the quality of data produced by SMMQG via a human study. We find that the quality of SMMQG-generated synthetic data is on par with the quality of the crowdsourced benchmark MMQA and that downstream evaluation results using both datasets strongly concur. △ Less

Submitted 3 October, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted to EMNLP 2024 Findings; Camera Ready

arXiv:2406.01727 [pdf, other]

Federated Learning-based Collaborative Wideband Spectrum Sensing and Scheduling for UAVs in UTM Systems

Authors: Sravan Reddy Chintareddy, Keenan Roach, Kenny Cheung, Morteza Hashemi

Abstract: In this paper, we propose a data-driven framework for collaborative wideband spectrum sensing and scheduling for networked unmanned aerial vehicles (UAVs), which act as the secondary users (SUs) to opportunistically utilize detected "spectrum holes". Our overall framework consists of three main stages. Firstly, in the model training stage, we explore dataset generation in a multi-cell environment… ▽ More In this paper, we propose a data-driven framework for collaborative wideband spectrum sensing and scheduling for networked unmanned aerial vehicles (UAVs), which act as the secondary users (SUs) to opportunistically utilize detected "spectrum holes". Our overall framework consists of three main stages. Firstly, in the model training stage, we explore dataset generation in a multi-cell environment and training a machine learning (ML) model using the federated learning (FL) architecture. Unlike the existing studies on FL for wireless that presume datasets are readily available for training, we propose a novel architecture that directly integrates wireless dataset generation, which involves capturing I/Q samples from over-the-air signals in a multi-cell environment, into the FL training process. Secondly, in the collaborative spectrum inference stage, we propose a collaborative spectrum fusion strategy that is compatible with the unmanned aircraft system traffic management (UTM) ecosystem. Finally, in the spectrum scheduling stage, we leverage reinforcement learning (RL) solutions to dynamically allocate the detected spectrum holes to the secondary users. To evaluate the proposed methods, we establish a comprehensive simulation framework that generates a near-realistic synthetic dataset using MATLAB LTE toolbox by incorporating base-station~(BS) locations in a chosen area of interest, performing ray-tracing, and emulating the primary users channel usage in terms of I/Q samples. This evaluation methodology provides a flexible framework to generate large spectrum datasets that could be used for developing ML/AI-based spectrum management solutions for aerial devices. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: This is a preprint version submitted to IEEE Transactions on Machine learning in Communications and Networking. arXiv admin note: text overlap with arXiv:2308.05036

arXiv:2406.00144 [pdf, other]

Query2CAD: Generating CAD models using natural language queries

Authors: Akshay Badagabettu, Sai Sravan Yarlagadda, Amir Barati Farimani

Abstract: Computer Aided Design (CAD) engineers typically do not achieve their best prototypes in a single attempt. Instead, they iterate and refine their designs to achieve an optimal solution through multiple revisions. This traditional approach, though effective, is time-consuming and relies heavily on the expertise of skilled engineers. To address these challenges, we introduce Query2CAD, a novel framew… ▽ More Computer Aided Design (CAD) engineers typically do not achieve their best prototypes in a single attempt. Instead, they iterate and refine their designs to achieve an optimal solution through multiple revisions. This traditional approach, though effective, is time-consuming and relies heavily on the expertise of skilled engineers. To address these challenges, we introduce Query2CAD, a novel framework to generate CAD designs. The framework uses a large language model to generate executable CAD macros. Additionally, Query2CAD refines the generation of the CAD model with the help of its self-refinement loops. Query2CAD operates without supervised data or additional training, using the LLM as both a generator and a refiner. The refiner leverages feedback generated by the BLIP2 model, and to address false negatives, we have incorporated human-in-the-loop feedback into our system. Additionally, we have developed a dataset that encompasses most operations used in CAD model designing and have evaluated our framework using this dataset. Our findings reveal that when we used GPT-4 Turbo as our language model, the architecture achieved a success rate of 53.6\% on the first attempt. With subsequent refinements, the success rate increased by 23.1\%. In particular, the most significant improvement in the success rate was observed with the first iteration of the refinement. With subsequent refinements, the accuracy of the correct designs did not improve significantly. We have open-sourced our data, model, and code (github.com/akshay140601/Query2CAD). △ Less

Submitted 31 May, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures

arXiv:2405.18638 [pdf, other]

doi 10.18653/v1/2024.acl-long.63

ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models

Authors: Aparna Elangovan, Ling Liu, Lei Xu, Sravan Bodapati, Dan Roth

Abstract: In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, a… ▽ More In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability. △ Less

Submitted 31 August, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted in ACL 2024

arXiv:2405.11573 [pdf, other]

Quantile Activation: Correcting a Failure Mode of ML Models

Authors: Aditya Challa, Sravan Danda, Laurent Najman, Snehanshu Saha

Abstract: Standard ML models fail to infer the context distribution and suitably adapt. For instance, the learning fails when the underlying distribution is actually a mixture of distributions with contradictory labels. Learning also fails if there is a shift between train and test distributions. Standard neural network architectures like MLPs or CNNs are not equipped to handle this. In this article, we p… ▽ More Standard ML models fail to infer the context distribution and suitably adapt. For instance, the learning fails when the underlying distribution is actually a mixture of distributions with contradictory labels. Learning also fails if there is a shift between train and test distributions. Standard neural network architectures like MLPs or CNNs are not equipped to handle this. In this article, we propose a simple activation function, quantile activation (QAct), that addresses this problem without significantly increasing computational costs. The core idea is to "adapt" the outputs of each neuron to its context distribution. The proposed quantile activation (QAct) outputs the relative quantile position of neuron activations within their context distribution, diverging from the direct numerical outputs common in traditional networks. A specific case of the above failure mode is when there is an inherent distribution shift, i.e the test distribution differs slightly from the train distribution. We validate the proposed activation function under covariate shifts, using datasets designed to test robustness against distortions. Our results demonstrate significantly better generalization across distortions compared to conventional classifiers and other adaptive methods, across various architectures. Although this paper presents a proof of concept, we find that this approach unexpectedly outperforms DINOv2 (small), despite DINOv2 being trained with a much larger network and dataset. △ Less

Submitted 2 April, 2025; v1 submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.08295 [pdf, other]

SpeechVerse: A Large-scale Generalizable Audio Language Model

Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sravan Bodapati, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel… ▽ More Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks. △ Less

Submitted 24 March, 2025; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Single Column, 13 page

arXiv:2404.09067 [pdf, other]

Exploring Explainability in Video Action Recognition

Authors: Avinab Saha, Shashank Gupta, Sravan Kumar Ankireddy, Karl Chahine, Joydeep Ghosh

Abstract: Image Classification and Video Action Recognition are perhaps the two most foundational tasks in computer vision. Consequently, explaining the inner workings of trained deep neural networks is of prime importance. While numerous efforts focus on explaining the decisions of trained deep neural networks in image classification, exploration in the domain of its temporal version, video action recognit… ▽ More Image Classification and Video Action Recognition are perhaps the two most foundational tasks in computer vision. Consequently, explaining the inner workings of trained deep neural networks is of prime importance. While numerous efforts focus on explaining the decisions of trained deep neural networks in image classification, exploration in the domain of its temporal version, video action recognition, has been scant. In this work, we take a deeper look at this problem. We begin by revisiting Grad-CAM, one of the popular feature attribution methods for Image Classification, and its extension to Video Action Recognition tasks and examine the method's limitations. To address these, we introduce Video-TCAV, by building on TCAV for Image Classification tasks, which aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models. As the scalable generation of concepts is still an open problem, we propose a machine-assisted approach to generate spatial and spatiotemporal concepts relevant to Video Action Recognition for testing Video-TCAV. We then establish the importance of temporally-varying concepts by demonstrating the superiority of dynamic spatiotemporal concepts over trivial spatial concepts. In conclusion, we introduce a framework for investigating hypotheses in action recognition and quantitatively testing them, thus advancing research in the explainability of deep neural networks used in video action recognition. △ Less

Submitted 13 April, 2024; originally announced April 2024.

Comments: 6 pages, 10 figures, Accepted to the 3rd Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2024

arXiv:2404.02359 [pdf, ps, other]

Attribution Regularization for Multimodal Paradigms

Authors: Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, Eric Nyberg

Abstract: Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dom… ▽ More Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dominates the decision-making process, resulting in suboptimal performance. This research project aims to address these challenges by proposing a novel regularization term that encourages multimodal models to effectively utilize information from all modalities when making decisions. The focus of this project lies in the video-audio domain, although the proposed regularization technique holds promise for broader applications in embodied AI research, where multiple modalities are involved. By leveraging this regularization term, the proposed approach aims to mitigate the issue of unimodal dominance and improve the performance of multimodal machine learning systems. Through extensive experimentation and evaluation, the effectiveness and generalizability of the proposed technique will be assessed. The findings of this research project have the potential to significantly contribute to the advancement of multimodal machine learning and facilitate its application in various domains, including multimedia analysis, human-computer interaction, and embodied AI research. △ Less

Submitted 9 July, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.02353 [pdf, ps, other]

Semantic Augmentation in Images using Language

Authors: Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, Eric Nyberg

Abstract: Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to tr… ▽ More Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to train these diffusion models, we propose a technique to utilize generated images to augment existing datasets. This paper explores various strategies for effective data augmentation to improve the out-of-domain generalization capabilities of deep learning models. △ Less

Submitted 9 July, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.10751 [pdf, other]

LightCode: Light Analytical and Neural Codes for Channels with Feedback

Authors: Sravan Kumar Ankireddy, Krishna Narayanan, Hyeji Kim

Abstract: The design of reliable and efficient codes for channels with feedback remains a longstanding challenge in communication theory. While significant improvements have been achieved by leveraging deep learning techniques, neural codes often suffer from high computational costs, a lack of interpretability, and limited practicality in resource-constrained settings. We focus on designing low-complexity c… ▽ More The design of reliable and efficient codes for channels with feedback remains a longstanding challenge in communication theory. While significant improvements have been achieved by leveraging deep learning techniques, neural codes often suffer from high computational costs, a lack of interpretability, and limited practicality in resource-constrained settings. We focus on designing low-complexity coding schemes that are interpretable and more suitable for communication systems. We advance both analytical and neural codes. First, we demonstrate that PowerBlast, an analytical coding scheme inspired by Schalkwijk-Kailath (SK) and Gallager-Nakiboğlu (GN) schemes, achieves notable reliability improvements over both SK and GN schemes, outperforming neural codes in high signal-to-noise ratio (SNR) regions. Next, to enhance reliability in low-SNR regions, we propose LightCode, a lightweight neural code that achieves state-of-the-art reliability while using a fraction of memory and compute compared to existing deeplearning-based codes. Finally, we systematically analyze the learned codes, establishing connections between LightCode and PowerBlast, identifying components crucial for performance, and providing interpretation aided by linear regression analysis. △ Less

Submitted 16 November, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: 16 pages, 12 figures, To appear in IEEE Journal on Selected Areas in Communications, 2024

arXiv:2402.08864 [pdf, other]

DeepPolar: Inventing Nonlinear Large-Kernel Polar Codes via Deep Learning

Authors: S Ashwin Hebbar, Sravan Kumar Ankireddy, Hyeji Kim, Sewoong Oh, Pramod Viswanath

Abstract: Progress in designing channel codes has been driven by human ingenuity and, fittingly, has been sporadic. Polar codes, developed on the foundation of Arikan's polarization kernel, represent the latest breakthrough in coding theory and have emerged as the state-of-the-art error-correction code for short-to-medium block length regimes. In an effort to automate the invention of good channel codes, es… ▽ More Progress in designing channel codes has been driven by human ingenuity and, fittingly, has been sporadic. Polar codes, developed on the foundation of Arikan's polarization kernel, represent the latest breakthrough in coding theory and have emerged as the state-of-the-art error-correction code for short-to-medium block length regimes. In an effort to automate the invention of good channel codes, especially in this regime, we explore a novel, non-linear generalization of Polar codes, which we call DeepPolar codes. DeepPolar codes extend the conventional Polar coding framework by utilizing a larger kernel size and parameterizing these kernels and matched decoders through neural networks. Our results demonstrate that these data-driven codes effectively leverage the benefits of a larger kernel size, resulting in enhanced reliability when compared to both existing neural codes and conventional Polar codes. △ Less

Submitted 4 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 22 pages, 24 figures

arXiv:2402.08749 [pdf]

Automated detection of motion artifacts in brain MR images using deep learning and explainable artificial intelligence

Authors: Marina Manso Jimeno, Keerthi Sravan Ravi, Maggie Fung, John Thomas Vaughan, Jr., Sairam Geethanath

Abstract: Quality assessment, including inspecting the images for artifacts, is a critical step during MRI data acquisition to ensure data quality and downstream analysis or interpretation success. This study demonstrates a deep learning model to detect rigid motion in T1-weighted brain images. We leveraged a 2D CNN for three-class classification and tested it on publicly available retrospective and prospec… ▽ More Quality assessment, including inspecting the images for artifacts, is a critical step during MRI data acquisition to ensure data quality and downstream analysis or interpretation success. This study demonstrates a deep learning model to detect rigid motion in T1-weighted brain images. We leveraged a 2D CNN for three-class classification and tested it on publicly available retrospective and prospective datasets. Grad-CAM heatmaps enabled the identification of failure modes and provided an interpretation of the model's results. The model achieved average precision and recall metrics of 85% and 80% on six motion-simulated retrospective datasets. Additionally, the model's classifications on the prospective dataset showed a strong inverse correlation (-0.84) compared to average edge strength, an image quality metric indicative of motion. This model is part of the ArtifactID tool, aimed at inline automatic detection of Gibbs ringing, wrap-around, and motion artifacts. This tool automates part of the time-consuming QA process and augments expertise on-site, particularly relevant in low-resource settings where local MR knowledge is scarce. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Comments: 25 pages, 9 figures, 1 table. Submitted to NMR in Biomedicine

arXiv:2402.08405 [pdf, other]

A Novel Approach to Regularising 1NN classifier for Improved Generalization

Authors: Aditya Challa, Sravan Danda, Laurent Najman

Abstract: In this paper, we propose a class of non-parametric classifiers, that learn arbitrary boundaries and generalize well. Our approach is based on a novel way to regularize 1NN classifiers using a greedy approach. We refer to this class of classifiers as Watershed Classifiers. 1NN classifiers are known to trivially over-fit but have very large VC dimension, hence do not generalize well. We show that… ▽ More In this paper, we propose a class of non-parametric classifiers, that learn arbitrary boundaries and generalize well. Our approach is based on a novel way to regularize 1NN classifiers using a greedy approach. We refer to this class of classifiers as Watershed Classifiers. 1NN classifiers are known to trivially over-fit but have very large VC dimension, hence do not generalize well. We show that watershed classifiers can find arbitrary boundaries on any dense enough dataset, and, at the same time, have very small VC dimension; hence a watershed classifier leads to good generalization. Traditional approaches to regularize 1NN classifiers are to consider $K$ nearest neighbours. Neighbourhood component analysis (NCA) proposes a way to learn representations consistent with ($n-1$) nearest neighbour classifier, where $n$ denotes the size of the dataset. In this article, we propose a loss function which can learn representations consistent with watershed classifiers, and show that it outperforms the NCA baseline. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2401.17188 [pdf, other]

Nested Construction of Polar Codes via Transformers

Authors: Sravan Kumar Ankireddy, S Ashwin Hebbar, Heping Wan, Joonyoung Cho, Charlie Zhang

Abstract: Tailoring polar code construction for decoding algorithms beyond successive cancellation has remained a topic of significant interest in the field. However, despite the inherent nested structure of polar codes, the use of sequence models in polar code construction is understudied. In this work, we propose using a sequence modeling framework to iteratively construct a polar code for any given lengt… ▽ More Tailoring polar code construction for decoding algorithms beyond successive cancellation has remained a topic of significant interest in the field. However, despite the inherent nested structure of polar codes, the use of sequence models in polar code construction is understudied. In this work, we propose using a sequence modeling framework to iteratively construct a polar code for any given length and rate under various channel conditions. Simulations show that polar codes designed via sequential modeling using transformers outperform both 5G-NR sequence and Density Evolution based approaches for both AWGN and Rayleigh fading channels. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: 7 pages; 8 figures

arXiv:2312.08356 [pdf, other]

CUTTANA: Scalable Graph Partitioning for Faster Distributed Graph Databases and Analytics

Authors: Milad Rezaei Hajidehi, Sraavan Sridhar, Margo Seltzer

Abstract: Graph partitioning plays a pivotal role in various distributed graph processing applications, including graph analytics, graph neural network training, and distributed graph databases. Graphs that require distributed settings are often too large to fit in the main memory of a single machine. This challenge renders traditional in-memory graph partitioners infeasible, leading to the emergence of str… ▽ More Graph partitioning plays a pivotal role in various distributed graph processing applications, including graph analytics, graph neural network training, and distributed graph databases. Graphs that require distributed settings are often too large to fit in the main memory of a single machine. This challenge renders traditional in-memory graph partitioners infeasible, leading to the emergence of streaming solutions. Streaming partitioners produce lower-quality partitions because they work from partial information and must make premature decisions before they have a complete view of a vertex's neighborhood. We introduce CUTTANA, a streaming graph partitioner that partitions massive graphs (Web/Twitter scale) with superior quality compared to existing streaming solutions. CUTTANA uses a novel buffering technique that prevents the premature assignment of vertices to partitions and a scalable coarsening and refinement technique that enables a complete graph view, improving the intermediate assignment made by a streaming partitioner. We implemented a parallel version for CUTTANA that offers nearly the same partitioning latency as existing streaming partitioners. Our experimental analysis shows that CUTTANA consistently yields better partitioning quality than existing state-of-the-art streaming vertex partitioners in terms of both edge-cut and communication volume metrics. We also evaluate the workload latencies that result from using CUTTANA and other partitioners in distributed graph analytics and databases. CUTTANA outperforms the other methods in most scenarios (algorithms, datasets). In analytics applications, CUTTANA improves runtime performance by up to 59% compared to various streaming partitioners (HDRF, Fennel, Ginger, HeiStream). In graph database tasks, CUTTANA results in higher query throughput by up to 23%, without hurting tail latency. △ Less

Submitted 9 December, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted at VLDB 2025 (Vol.18, No.1). Please use VLDB version of paper for the updated version and bibtex

arXiv:2311.18618 [pdf, other]

JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

Authors: Shishir Muralidhara, Sravan Kumar Jagadeesh, René Schuster, Didier Stricker

Abstract: Part-aware panoptic segmentation is a problem of computer vision that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF) that combines the three individual segmentations effectively to obtain a panop… ▽ More Part-aware panoptic segmentation is a problem of computer vision that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF) that combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this: First, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning. Second, balancing the combination so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted for Springer Nature Computer Science. arXiv admin note: substantial text overlap with arXiv:2212.07671

arXiv:2311.11518 [pdf, other]

Multi-teacher Distillation for Multilingual Spelling Correction

Authors: Jingfen Zhang, Xuan Guo, Sravan Bodapati, Christopher Potts

Abstract: Accurate spelling correction is a critical step in modern search interfaces, especially in an era of mobile devices and speech-to-text interfaces. For services that are deployed around the world, this poses a significant challenge for multilingual NLP: spelling errors need to be caught and corrected in all languages, and even in queries that use multiple languages. In this paper, we tackle this ch… ▽ More Accurate spelling correction is a critical step in modern search interfaces, especially in an era of mobile devices and speech-to-text interfaces. For services that are deployed around the world, this poses a significant challenge for multilingual NLP: spelling errors need to be caught and corrected in all languages, and even in queries that use multiple languages. In this paper, we tackle this challenge using multi-teacher distillation. On our approach, a monolingual teacher model is trained for each language/locale, and these individual models are distilled into a single multilingual student model intended to serve all languages/locales. In experiments using open-source data as well as user data from a worldwide search service, we show that this leads to highly effective spelling correction models that can meet the tight latency requirements of deployed services. △ Less

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.08402 [pdf, other]

Retrieve and Copy: Scaling ASR Personalization to Large Catalogs

Authors: Sai Muralidhar Jayanthi, Devang Kulshreshtha, Saket Dingliwal, Srikanth Ronanki, Sravan Bodapati

Abstract: Personalization of automatic speech recognition (ASR) models is a widely studied topic because of its many practical applications. Most recently, attention-based contextual biasing techniques are used to improve the recognition of rare words and domain specific entities. However, due to performance constraints, the biasing is often limited to a few thousand entities, restricting real-world usabili… ▽ More Personalization of automatic speech recognition (ASR) models is a widely studied topic because of its many practical applications. Most recently, attention-based contextual biasing techniques are used to improve the recognition of rare words and domain specific entities. However, due to performance constraints, the biasing is often limited to a few thousand entities, restricting real-world usability. To address this, we first propose a "Retrieve and Copy" mechanism to improve latency while retaining the accuracy even when scaled to a large catalog. We also propose a training strategy to overcome the degradation in recall at such scale due to an increased number of confusing entities. Overall, our approach achieves up to 6% more Word Error Rate reduction (WERR) and 3.6% absolute improvement in F1 when compared to a strong baseline. Our method also allows for large catalog sizes of up to 20K without significantly affecting WER and F1-scores, while achieving at least 20% inference speedup per acoustic frame. △ Less

Submitted 14 November, 2023; originally announced November 2023.

Comments: EMNLP 2023

arXiv:2311.02482 [pdf, other]

Generalized zero-shot audio-to-intent classification

Authors: Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan Bodapati, Srikanth Ronanki

Abstract: Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier by making use of a self-supervised pre-trai… ▽ More Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier by making use of a self-supervised pre-trained model. We then leverage a neural audio synthesizer to create audio embeddings for sample text utterances and perform generalized zero-shot classification on unseen intents using cosine similarity. We also propose a multimodal training strategy that incorporates lexical information into the audio representation to improve zero-shot performance. Our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents of SLURP by 2.75% and 18.2% for the SLURP and internal goal-oriented dialog datasets, respectively, compared to audio-only training. △ Less

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.08660 [pdf, other]

Learning RL-Policies for Joint Beamforming Without Exploration: A Batch Constrained Off-Policy Approach

Authors: Heasung Kim, Sravan Kumar Ankireddy

Abstract: In this work, we consider the problem of network parameter optimization for rate maximization. We frame this as a joint optimization problem of power control, beam forming, and interference cancellation. We consider the setting where multiple Base Stations (BSs) communicate with multiple user equipment (UEs). Because of the exponential computational complexity of brute force search, we instead sol… ▽ More In this work, we consider the problem of network parameter optimization for rate maximization. We frame this as a joint optimization problem of power control, beam forming, and interference cancellation. We consider the setting where multiple Base Stations (BSs) communicate with multiple user equipment (UEs). Because of the exponential computational complexity of brute force search, we instead solve this nonconvex optimization problem using deep reinforcement learning (RL) techniques. Modern communication systems are notorious for their difficulty in exactly modeling their behavior. This limits us in using RL-based algorithms as interaction with the environment is needed for the agent to explore and learn efficiently. Further, it is ill-advised to deploy the algorithm in the real world for exploration and learning because of the high cost of failure. In contrast to the previous RL-based solutions proposed, such as deep-Q network (DQN) based control, we suggest an offline model-based approach. We specifically consider discrete batch-constrained deep Q-learning (BCQ) and show that performance similar to DQN can be achieved with only a fraction of the data without exploring. This maximizes sample efficiency and minimizes risk in deploying a new algorithm to commercial networks. We provide the entire project resource, including code and data, at the following link: https://github.com/Heasung-Kim/ safe-rl-deployment-for-5g. △ Less

Submitted 11 November, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: 10 pages, 8 figures

arXiv:2308.11357 [pdf, other]

Exemplar-Free Continual Transformer with Convolutions

Authors: Anurag Roy, Vinay Kumar Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, Abir Das

Abstract: Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transf… ▽ More Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transformers, there is a need to explore their potential for CL. Although there have been some recent CL approaches for vision transformers, they either store training instances of previous tasks or require a task identifier during test time, which can be limiting. This paper proposes a new exemplar-free approach for class/task incremental learning called ConTraCon, which does not require task-id to be explicitly present during inference and avoids the need for storing previous training instances. The proposed approach leverages the transformer architecture and involves re-weighting the key, query, and value weights of the multi-head self-attention layers of a transformer trained on a similar task. The re-weighting is done using convolution, which enables the approach to maintain low parameter requirements per task. Additionally, an image augmentation-based entropic task identification approach is used to predict tasks without requiring task-ids during inference. Experiments on four benchmark datasets demonstrate that the proposed approach outperforms several competitive approaches while requiring fewer parameters. △ Less

Submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted in ICCV 2023

arXiv:2308.05036 [pdf, other]

Collaborative Wideband Spectrum Sensing and Scheduling for Networked UAVs in UTM Systems

Authors: Sravan Reddy Chintareddy, Keenan Roach, Kenny Cheung, Morteza Hashemi

Abstract: In this paper, we propose a data-driven framework for collaborative wideband spectrum sensing and scheduling for networked unmanned aerial vehicles (UAVs), which act as the secondary users to opportunistically utilize detected spectrum holes. To this end, we propose a multi-class classification problem for wideband spectrum sensing to detect vacant spectrum spots based on collected I/Q samples. To… ▽ More In this paper, we propose a data-driven framework for collaborative wideband spectrum sensing and scheduling for networked unmanned aerial vehicles (UAVs), which act as the secondary users to opportunistically utilize detected spectrum holes. To this end, we propose a multi-class classification problem for wideband spectrum sensing to detect vacant spectrum spots based on collected I/Q samples. To enhance the accuracy of the spectrum sensing module, the outputs from the multi-class classification by each individual UAV are fused at a server in the unmanned aircraft system traffic management (UTM) ecosystem. In the spectrum scheduling phase, we leverage reinforcement learning (RL) solutions to dynamically allocate the detected spectrum holes to the secondary users (i.e., UAVs). To evaluate the proposed methods, we establish a comprehensive simulation framework that generates a near-realistic synthetic dataset using MATLAB LTE toolbox by incorporating base-station~(BS) locations in a chosen area of interest, performing ray-tracing, and emulating the primary users channel usage in terms of I/Q samples. This evaluation methodology provides a flexible framework to generate large spectrum datasets that could be used for developing ML/AI-based spectrum management solutions for aerial devices. △ Less

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2307.13850 [pdf, other]

MAEA: Multimodal Attribution for Embodied AI

Authors: Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Yonatan Bisk

Abstract: Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different… ▽ More Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different policies trained on the ALFRED dataset. Attribution analysis can be utilized to rank and group the failure scenarios, investigate modeling and dataset biases, and critically analyze multimodal EAI policies for robustness and user trust before deployment. We present MAEA, a framework to compute global attributions per modality of any differentiable policy. In addition, we show how attributions enable lower-level behavior analysis in EAI policies for language and visual attributions. △ Less

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.00759 [pdf, other]

Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages

Authors: Devang Kulshreshtha, Saket Dingliwal, Brady Houston, Sravan Bodapati

Abstract: Connectionist Temporal Classification (CTC) models are popular for their balance between speed and performance for Automatic Speech Recognition (ASR). However, these CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, wherein an attention-based biasing model for CTC is used to improve the recognition of custom enti… ▽ More Connectionist Temporal Classification (CTC) models are popular for their balance between speed and performance for Automatic Speech Recognition (ASR). However, these CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, wherein an attention-based biasing model for CTC is used to improve the recognition of custom entities. While this approach works well with enough data, we showcase that it isn't an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. Further, we explore a multilingual strategy to improve performance with limited training data. Our method achieves 48% F1 improvement in retrieving unseen custom entities for a low-resource language. Interestingly, as a by-product of training the Contextual Adapters, we see a 5-11% Word Error Rate (WER) reduction in the performance of the base CTC model as well. △ Less

Submitted 3 July, 2023; originally announced July 2023.

Comments: Published at INTERSPEECH 2023

arXiv:2307.00453 [pdf, other]

Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Authors: Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan, Sravan Bodapati, Katrin Kirchhoff

Abstract: Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose an… ▽ More Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. We experiment with 4 accents and choose automatic speech recognition (ASR) as the downstream task of interest. We obtain strong word error rate reductions (WERR) over HuBERT-large for all 4 accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted. While our experiments utilize HuBERT and ASR as the downstream task, our proposed approach is both model and task-agnostic. △ Less

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2306.08175 [pdf, other]

DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR

Authors: Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff

Abstract: Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the… ▽ More Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative 25.0% word error rate, with a negligible latency impact due to the additional context embeddings. △ Less

Submitted 1 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Showing 1–50 of 94 results for author: Sraavan