Search | arXiv e-print repository

Compositional Image-Text Matching and Retrieval by Grounding Entities

Authors: Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká

Abstract: Vision-language pretraining on large datasets of images-text pairs is one of the main building blocks of current Vision-Language Models. While with additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP, is their inability to perform enti… ▽ More Vision-language pretraining on large datasets of images-text pairs is one of the main building blocks of current Vision-Language Models. While with additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP, is their inability to perform entity grounding and compositional image and text matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24}. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by the state of the art open vocabulary detectors and dynamically adjust the baseline global image embedding. % The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embedding is then utilized for similarity computation with text embedding, resulting in a average 1.5\% improvement in image-text matching accuracy on the Visual Genome and SVO Probes datasets~\cite{krishna2017visualgenome, svo}. Notably, the enhanced embeddings demonstrate superior retrieval performance, thus achieving significant gains on the Flickr30K and MS-COCO retrieval benchmarks~\cite{flickr30ke, mscoco}, improving the state-of-the-art Recall@1 by 12\% and 0.4\%, respectively. Our code is available at https://github.com/madhukarreddyvongala/GroundingCLIP. △ Less

Submitted 4 May, 2025; originally announced May 2025.

Comments: Accepted at CVPR-W

arXiv:2505.02222 [pdf, other]

Practical Efficiency of Muon for Pretraining

Authors: Essential AI, :, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani

Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study th… ▽ More We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture. △ Less

Submitted 10 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

arXiv:2504.20333 [pdf, other]

List Decoding Expander-Based Codes up to Capacity in Near-Linear Time

Authors: Shashank Srivastava, Madhur Tulsiani

Abstract: We give a new framework based on graph regularity lemmas, for list decoding and list recovery of codes based on spectral expanders. Using existing algorithms for computing regularity decompositions of sparse graphs in (randomized) near-linear time, and appropriate choices for the constant-sized inner/base codes, we prove the following: - Expander-based codes constructed using the distance amplif… ▽ More We give a new framework based on graph regularity lemmas, for list decoding and list recovery of codes based on spectral expanders. Using existing algorithms for computing regularity decompositions of sparse graphs in (randomized) near-linear time, and appropriate choices for the constant-sized inner/base codes, we prove the following: - Expander-based codes constructed using the distance amplification technique of Alon, Edmonds and Luby [FOCS 1995] with rate $ρ$, can be list decoded to a radius $1 - ρ- ε$ in near-linear time. By known results, the output list has size $O(1/ε)$. - The above codes of Alon, Edmonds and Luby, with rate $ρ$, can also be list recovered to radius $1 - ρ- ε$ in near-linear time, with constant-sized output lists. - The Tanner code construction of Sipser and Spielman [IEEE Trans. Inf. Theory 1996] with distance $δ$, can be list decoded to radius $δ- ε$ in near-linear time, with constant-sized output lists. Our results imply novel combinatorial as well as algorithmic bounds for each of the above explicit constructions. All of these bounds are obtained via combinatorial rigidity phenomena, proved using (weak) graph regularity. The regularity framework allows us to lift the list decoding and list recovery properties for the local base codes, to the global codes obtained via the above constructions. △ Less

Submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.07357 [pdf, ps, other]

Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Authors: Saurabh Srivastava, Ziyu Yao

Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to sys… ▽ More Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines. △ Less

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.04022 [pdf, other]

Rethinking Reflection in Pre-Training

Authors: Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk , et al. (4 additional authors not shown)

Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model c… ▽ More A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks. △ Less

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2502.18848 [pdf, other]

A Causal Lens for Evaluating Faithfulness Metrics

Authors: Kerem Zaman, Shashank Srivastava

Abstract: Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's internal reasoning faithfully, which is crucial for understanding the model's true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation… ▽ More Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's internal reasoning faithfully, which is crucial for understanding the model's true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation framework remains absent. To address this gap, we present Causal Diagnosticity, a framework to evaluate faithfulness metrics for natural language explanations. Our framework employs the concept of causal diagnosticity, and uses model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate a variety of faithfulness metrics, including post-hoc explanation and chain-of-thought-based methods. We find that all tested faithfulness metrics often fail to surpass a random baseline. Our work underscores the need for improved metrics and more reliable interpretability methods in LLMs. △ Less

Submitted 26 February, 2025; originally announced February 2025.

Comments: 18 pages, 18 figures, 6 tables

arXiv:2502.18710 [pdf, other]

Bridging Critical Gaps in Convergent Learning: How Representational Alignment Evolves Across Layers, Training, and Distribution Shifts

Authors: Chaitanya Kapoor, Sudhanshu Srivastava, Meenakshi Khosla

Abstract: Understanding convergent learning -- the extent to which artificial and biological neural networks develop similar representations -- is crucial for neuroscience and AI, as it reveals shared learning principles and guides brain-like model design. While several studies have noted convergence in early and late layers of vision networks, key gaps remain. First, much existing work relies on a limited… ▽ More Understanding convergent learning -- the extent to which artificial and biological neural networks develop similar representations -- is crucial for neuroscience and AI, as it reveals shared learning principles and guides brain-like model design. While several studies have noted convergence in early and late layers of vision networks, key gaps remain. First, much existing work relies on a limited set of metrics, overlooking transformation invariances required for proper alignment. We compare three metrics that ignore specific irrelevant transformations: linear regression (ignoring affine transformations), Procrustes (ignoring rotations and reflections), and permutation/soft-matching (ignoring unit order). Notably, orthogonal transformations align representations nearly as effectively as more flexible linear ones, and although permutation scores are lower, they significantly exceed chance, indicating a robust representational basis. A second critical gap lies in understanding when alignment emerges during training. Contrary to expectations that convergence builds gradually with task-specific learning, our findings reveal that nearly all convergence occurs within the first epoch -- long before networks achieve optimal performance. This suggests that shared input statistics, architectural biases, or early training dynamics drive convergence rather than the final task solution. Finally, prior studies have not systematically examined how changes in input statistics affect alignment. Our work shows that out-of-distribution (OOD) inputs consistently amplify differences in later layers, while early layers remain aligned for both in-distribution and OOD inputs, suggesting that this alignment is driven by generalizable features stable across distribution shifts. These findings fill critical gaps in our understanding of representational convergence, with implications for neuroscience and AI. △ Less

Submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.16377 [pdf, other]

Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines

Authors: Saurabh Srivastava, Sweta Pati, Ziyu Yao

Abstract: In this work, we study the effect of annotation guidelines -- textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amo… ▽ More In this work, we study the effect of annotation guidelines -- textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance. △ Less

Submitted 22 February, 2025; originally announced February 2025.

arXiv:2502.16255 [pdf]

rECGnition_v2.0: Self-Attentive Canonical Fusion of ECG and Patient Data using deep learning for effective Cardiac Diagnostics

Authors: Shreya Srivastava, Durgesh Kumar, Ram Jiwari, Sandeep Seth, Deepak Sharma

Abstract: The variability in ECG readings influenced by individual patient characteristics has posed a considerable challenge to adopting automated ECG analysis in clinical settings. A novel feature fusion technique termed SACC (Self Attentive Canonical Correlation) was proposed to address this. This technique is combined with DPN (Dual Pathway Network) and depth-wise separable convolution to create a robus… ▽ More The variability in ECG readings influenced by individual patient characteristics has posed a considerable challenge to adopting automated ECG analysis in clinical settings. A novel feature fusion technique termed SACC (Self Attentive Canonical Correlation) was proposed to address this. This technique is combined with DPN (Dual Pathway Network) and depth-wise separable convolution to create a robust, interpretable, and fast end-to-end arrhythmia classification model named rECGnition_v2.0 (robust ECG abnormality detection). This study uses MIT-BIH, INCARTDB and EDB dataset to evaluate the efficiency of rECGnition_v2.0 for various classes of arrhythmias. To investigate the influence of constituting model components, various ablation studies were performed, i.e. simple concatenation, CCA and proposed SACC were compared, while the importance of global and local ECG features were tested using DPN rECGnition_v2.0 model and vice versa. It was also benchmarked with state-of-the-art CNN models for overall accuracy vs model parameters, FLOPs, memory requirements, and prediction time. Furthermore, the inner working of the model was interpreted by comparing the activation locations in ECG before and after the SACC layer. rECGnition_v2.0 showed a remarkable accuracy of 98.07% and an F1-score of 98.05% for classifying ten distinct classes of arrhythmia with just 82.7M FLOPs per sample, thereby going beyond the performance metrics of current state-of-the-art (SOTA) models by utilizing MIT-BIH Arrhythmia dataset. Similarly, on INCARTDB and EDB datasets, excellent F1-scores of 98.01% and 96.21% respectively was achieved for AAMI classification. The compact architectural footprint of the rECGnition_v2.0, characterized by its lesser trainable parameters and diminished computational demands, unfurled several advantages including interpretability and scalability. △ Less

Submitted 22 February, 2025; originally announced February 2025.

arXiv:2502.07308 [pdf, other]

Explicit Codes approaching Generalized Singleton Bound using Expanders

Authors: Fernando Granha Jeronimo, Tushant Mittal, Shashank Srivastava, Madhur Tulsiani

Abstract: We construct a new family of explicit codes that are list decodable to capacity and achieve an optimal list size of $O(\frac{1}ε)$. In contrast to existing explicit constructions of codes achieving list decoding capacity, our arguments do not rely on algebraic structure but utilize simple combinatorial properties of expander graphs. Our construction is based on a celebrated distance amplificatio… ▽ More We construct a new family of explicit codes that are list decodable to capacity and achieve an optimal list size of $O(\frac{1}ε)$. In contrast to existing explicit constructions of codes achieving list decoding capacity, our arguments do not rely on algebraic structure but utilize simple combinatorial properties of expander graphs. Our construction is based on a celebrated distance amplification procedure due to Alon, Edmonds, and Luby [FOCS'95], which transforms any high-rate code into one with near-optimal rate-distance tradeoff. We generalize it to show that the same procedure can be used to transform any high-rate code into one that achieves list decoding capacity. Our proof can be interpreted as a "local-to-global" phenomenon for (a slight strengthening of) the generalized Singleton bound. Using this construction, for every $R, ε\in (0,1)$ and $k \in \mathbb{N}^+$, we obtain an \emph{explicit} family of codes $\mathcal{C} \subseteq Σ^n$, with rate $R$ such that, - They achieve the $ε$-relaxed generalized Singleton bound: for any $g \in Σ^n$ and any list $\mathcal{H}$ of at most $k$ codewords, we have, \[ \underset{h \in \mathcal{H}}{\mathbb{E}} [Δ(g,h)] ~\geq~ \frac{|\mathcal{H}|-1}{|\mathcal{H}|} \cdot (1 - R - ε). \] - The alphabet size is a constant depending only on $ε$ and $k$. - They can be list decoded up to radius $\frac{k-1}{k}(1-R-ε)$, in time $n^{O_{k,ε}(1)}$. As a corollary of our result, we also obtain the first explicit construction of LDPC codes achieving list decoding capacity, and in fact arbitrarily close to the generalized Singleton bound. △ Less

Submitted 11 February, 2025; originally announced February 2025.

Comments: STOC 2025

arXiv:2501.06348 [pdf, other]

Why Automate This? Exploring the Connection between Time Use, Well-being and Robot Automation Across Social Groups

Authors: Ruchira Ray, Leona Pang, Sanjana Srivastava, Li Fei-Fei, Samantha Shorey, Roberto Martín-Martín

Abstract: Understanding the motivations underlying the human inclination to automate tasks is vital to developing truly helpful robots integrated into daily life. Accordingly, we ask: are individuals more inclined to automate chores based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across different social groups (i.e… ▽ More Understanding the motivations underlying the human inclination to automate tasks is vital to developing truly helpful robots integrated into daily life. Accordingly, we ask: are individuals more inclined to automate chores based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across different social groups (i.e., gender category and income level). Leveraging data from the BEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use Survey Well-Being Module, we investigate the relationship between the desire for automation, time spent on daily activities, and their associated feelings - Happiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness. Our key findings show that, despite common assumptions, time spent does not strongly relate to the desire for automation for the general population. For the feelings analyzed, only happiness and pain are key indicators. Significant differences by gender and economic level also emerged: Women prefer to automate stressful activities, whereas men prefer to automate those that make them unhappy; mid-income individuals prioritize automating less enjoyable and meaningful activities, while low and high-income show no significant correlations. We hope our research helps motivate technologies to develop robots that match the priorities of potential users, moving domestic robotics toward more socially relevant solutions. We open-source all the data, including an online tool that enables the community to replicate our analysis and explore additional trends at https://hri1260.github.io/why-automate-this. △ Less

Submitted 10 January, 2025; originally announced January 2025.

Comments: 20 pages, 14 figures

arXiv:2412.16395 [pdf, other]

Autonomous Option Invention for Continual Hierarchical Reinforcement Learning and Planning

Authors: Rashmeet Kaur Nayyar, Siddharth Srivastava

Abstract: Abstraction is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses stream… ▽ More Abstraction is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses streams of stochastic problems characterized by long horizons, sparse rewards, and unknown transition and reward functions. Our approach continually learns and maintains an interpretable state abstraction, and uses it to invent high-level options with abstract symbolic representations. These options meet three key desiderata: (1) composability for solving tasks effectively with lookahead planning, (2) reusability across problem instances for minimizing the need for relearning, and (3) mutual independence for reducing interference among options. Our main contributions are approaches for continually learning transferable, generalizable options with symbolic representations, and for integrating search techniques with RL to efficiently plan over these learned options to solve new problems. Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods. △ Less

Submitted 20 December, 2024; originally announced December 2024.

arXiv:2412.11388 [pdf, other]

INTERACT: Enabling Interactive, Question-Driven Learning in Large Language Models

Authors: Aum Kendapadi, Kerem Zaman, Rakesh R. Menon, Shashank Srivastava

Abstract: Large language models (LLMs) excel at answering questions but remain passive learners--absorbing static data without the ability to question and refine knowledge. This paper explores how LLMs can transition to interactive, question-driven learning through student-teacher dialogues. We introduce INTERACT (INTEReractive Learning for Adaptive Concept Transfer), a framework in which a "student" LLM en… ▽ More Large language models (LLMs) excel at answering questions but remain passive learners--absorbing static data without the ability to question and refine knowledge. This paper explores how LLMs can transition to interactive, question-driven learning through student-teacher dialogues. We introduce INTERACT (INTEReractive Learning for Adaptive Concept Transfer), a framework in which a "student" LLM engages a "teacher" LLM through iterative inquiries to acquire knowledge across 1,347 contexts, including song lyrics, news articles, movie plots, academic papers, and images. Our experiments show that across a wide range of scenarios and LLM architectures, interactive learning consistently enhances performance, achieving up to a 25% improvement, with 'cold-start' student models matching static learning baselines in as few as five dialogue turns. Interactive setups can also mitigate the disadvantages of weaker teachers, showcasing the robustness of question-driven learning. △ Less

Submitted 15 December, 2024; originally announced December 2024.

Comments: 30 pages, 8 figures, 14 tables

arXiv:2412.05528 [pdf]

AI Planning: A Primer and Survey (Preliminary Report)

Authors: Dillon Z. Chen, Pulkit Verma, Siddharth Srivastava, Michael Katz, Sylvie Thiébaux

Abstract: Automated decision-making is a fundamental topic that spans multiple sub-disciplines in AI: reinforcement learning (RL), AI planning (AP), foundation models, and operations research, among others. Despite recent efforts to ``bridge the gaps'' between these communities, there remain many insights that have not yet transcended the boundaries. Our goal in this paper is to provide a brief and non-exha… ▽ More Automated decision-making is a fundamental topic that spans multiple sub-disciplines in AI: reinforcement learning (RL), AI planning (AP), foundation models, and operations research, among others. Despite recent efforts to ``bridge the gaps'' between these communities, there remain many insights that have not yet transcended the boundaries. Our goal in this paper is to provide a brief and non-exhaustive primer on ideas well-known in AP, but less so in other sub-disciplines. We do so by introducing the classical AP problem and representation, and extensions that handle uncertainty and time through the Markov Decision Process formalism. Next, we survey state-of-the-art techniques and ideas for solving AP problems, focusing on their ability to exploit problem structure. Lastly, we cover subfields within AP for learning structure from unstructured inputs and learning to generalise to unseen scenarios and situations. △ Less

Submitted 6 December, 2024; originally announced December 2024.

arXiv:2411.17957 [pdf, other]

Optimization-Free Image Immunization Against Diffusion-Based Editing

Authors: Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg

Abstract: Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image-taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framewor… ▽ More Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image-taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds-achieving a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided in our project webpage. △ Less

Submitted 26 November, 2024; originally announced November 2024.

Comments: Project webpage: https://diffvax.github.io/

arXiv:2411.14633 [pdf, other]

Evaluating Representational Similarity Measures from the Lens of Functional Correspondence

Authors: Yiqing Bo, Ansh Soni, Sudhanshu Srivastava, Meenakshi Khosla

Abstract: Neuroscience and artificial intelligence (AI) both face the challenge of interpreting high-dimensional neural data, where the comparative analysis of such data is crucial for revealing shared mechanisms and differences between these complex systems. Despite the widespread use of representational comparisons and the abundance classes of comparison methods, a critical question remains: which metrics… ▽ More Neuroscience and artificial intelligence (AI) both face the challenge of interpreting high-dimensional neural data, where the comparative analysis of such data is crucial for revealing shared mechanisms and differences between these complex systems. Despite the widespread use of representational comparisons and the abundance classes of comparison methods, a critical question remains: which metrics are most suitable for these comparisons? While some studies evaluate metrics based on their ability to differentiate models of different origins or constructions (e.g., various architectures), another approach is to assess how well they distinguish models that exhibit distinct behaviors. To investigate this, we examine the degree of alignment between various representational similarity measures and behavioral outcomes, employing group statistics and a comprehensive suite of behavioral metrics for comparison. In our evaluation of eight commonly used representational similarity metrics in the visual domain -- spanning alignment-based, Canonical Correlation Analysis (CCA)-based, inner product kernel-based, and nearest-neighbor methods -- we found that metrics like linear Centered Kernel Alignment (CKA) and Procrustes distance, which emphasize the overall geometric structure or shape of representations, excelled in differentiating trained from untrained models and aligning with behavioral measures, whereas metrics such as linear predictivity, commonly used in neuroscience, demonstrated only moderate alignment with behavior. These insights are crucial for selecting metrics that emphasize behaviorally meaningful comparisons in NeuroAI research. △ Less

Submitted 21 November, 2024; originally announced November 2024.

arXiv:2411.13903 [pdf]

AmpliNetECG12: A lightweight SoftMax-based relativistic amplitude amplification architecture for 12 lead ECG classification

Authors: Shreya Srivastava

Abstract: The urgent need to promptly detect cardiac disorders from 12-lead Electrocardiograms using limited computations is motivated by the heart's fast and complex electrical activity and restricted computational power of portable devices. Timely and precise diagnoses are crucial since delays might significantly impact patient health outcomes. This research presents a novel deep-learning architecture tha… ▽ More The urgent need to promptly detect cardiac disorders from 12-lead Electrocardiograms using limited computations is motivated by the heart's fast and complex electrical activity and restricted computational power of portable devices. Timely and precise diagnoses are crucial since delays might significantly impact patient health outcomes. This research presents a novel deep-learning architecture that aims to diagnose heart abnormalities quickly and accurately. We devised a new activation function called aSoftMax, designed to improve the visibility of ECG deflections. The proposed activation function is used with Convolutional Neural Network architecture to includes kernel weight sharing across the ECG's various leads. This innovative method thoroughly generalizes the global 12-lead ECG features and minimizes the model's complexity by decreasing the trainable parameters. aSoftMax, combined with enhanced CNN architecture yielded AmpliNetECG12, we obtain exceptional accuracy of 84% in diagnosing cardiac disorders. AmpliNetECG12 shows outstanding prediction ability when used with the CPSC2018 dataset for arrhythmia classification. The model attains an F1-score of 80.71% and a ROC-AUC score of 96.00%, with 280,000 trainable parameters which signifies the lightweight yet efficient nature of AmpliNetECG12. The stochastic characteristics of aSoftMax, a fundamental element of AmpliNetECG12, improve prediction accuracy and also increasse the model's interpretability. This feature enhances comprehension of important ECG segments in different forms of arrhythmias, establishing a new standard of explainable architecture for cardiac disorder classification. △ Less

Submitted 21 November, 2024; originally announced November 2024.

arXiv:2411.06559 [pdf, other]

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Authors: Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su

Abstract: Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermine… ▽ More Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by (1) Proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) Training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive, while being 4-5 times more efficient, with tree search in sandbox environments (VisualWebArena) and also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparable to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. △ Less

Submitted 1 April, 2025; v1 submitted 10 November, 2024; originally announced November 2024.

Comments: 22 pages, 11 figures, 6 tables

arXiv:2411.04517 [pdf]

doi 10.1007/s11277-024-11356-0

Continuous Sign Language Recognition System using Deep Learning with MediaPipe Holistic

Authors: Sharvani Srivastava, Sudhakar Singh, Pooja, Shiv Prakash

Abstract: Sign languages are the language of hearing-impaired people who use visuals like the hand, facial, and body movements for communication. There are different signs and gestures representing alphabets, words, and phrases. Nowadays approximately 300 sign languages are being practiced worldwide such as American Sign Language (ASL), Chinese Sign Language (CSL), Indian Sign Language (ISL), and many more.… ▽ More Sign languages are the language of hearing-impaired people who use visuals like the hand, facial, and body movements for communication. There are different signs and gestures representing alphabets, words, and phrases. Nowadays approximately 300 sign languages are being practiced worldwide such as American Sign Language (ASL), Chinese Sign Language (CSL), Indian Sign Language (ISL), and many more. Sign languages are dependent on the vocal language of a place. Unlike vocal or spoken languages, there are no helping words in sign language like is, am, are, was, were, will, be, etc. As only a limited population is well-versed in sign language, this lack of familiarity of sign language hinders hearing-impaired people from communicating freely and easily with everyone. This issue can be addressed by a sign language recognition (SLR) system which has the capability to translate the sign language into vocal language. In this paper, a continuous SLR system is proposed using a deep learning model employing Long Short-Term Memory (LSTM), trained and tested on an ISL primary dataset. This dataset is created using MediaPipe Holistic pipeline for tracking face, hand, and body movements and collecting landmarks. The system recognizes the signs and gestures in real-time with 88.23% accuracy. △ Less

Submitted 7 November, 2024; originally announced November 2024.

Comments: 14 pages, 4 figures, Wireless Pers Commun

Report number: WIRE-D-22-02256

Journal ref: Wireless Personal Communication, 2024

arXiv:2411.04306 [pdf, other]

List Decodable Quantum LDPC Codes

Authors: Thiago Bergamaschi, Fernando Granha Jeronimo, Tushant Mittal, Shashank Srivastava, Madhur Tulsiani

Abstract: We give a construction of Quantum Low-Density Parity Check (QLDPC) codes with near-optimal rate-distance tradeoff and efficient list decoding up to the Johnson bound in polynomial time. Previous constructions of list decodable good distance quantum codes either required access to a classical side channel or were based on algebraic constructions that preclude the LDPC property. Our construction r… ▽ More We give a construction of Quantum Low-Density Parity Check (QLDPC) codes with near-optimal rate-distance tradeoff and efficient list decoding up to the Johnson bound in polynomial time. Previous constructions of list decodable good distance quantum codes either required access to a classical side channel or were based on algebraic constructions that preclude the LDPC property. Our construction relies on new algorithmic results for codes obtained via the quantum analog of the distance amplification scheme of Alon, Edmonds, and Luby [FOCS 1995]. These results are based on convex relaxations obtained using the Sum-of-Squares hierarchy, which reduce the problem of list decoding the distance amplified codes to unique decoding the starting base codes. Choosing these base codes to be the recent breakthrough constructions of good QLDPC codes with efficient unique decoders, we get efficiently list decodable QLDPC codes. △ Less

Submitted 6 November, 2024; originally announced November 2024.

arXiv:2410.22239 [pdf, other]

DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers

Authors: Rakesh R. Menon, Shashank Srivastava

Abstract: Despite their high predictive accuracies, current machine learning systems often exhibit systematic biases stemming from annotation artifacts or insufficient support for certain classes in the dataset. Recent work proposes automatic methods for identifying and explaining systematic biases using keywords. We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using… ▽ More Despite their high predictive accuracies, current machine learning systems often exhibit systematic biases stemming from annotation artifacts or insufficient support for certain classes in the dataset. Recent work proposes automatic methods for identifying and explaining systematic biases using keywords. We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations. DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models. Finally, we use the descriptions to improve classifiers by augmenting classifier training sets with synthetically generated instances or annotated examples via active learning. On three text-classification datasets, we demonstrate that language explanations from our framework induce consistent performance improvements that go beyond what is achievable with exemplars of systematic bias. Finally, in human evaluations, we show that users can interpret systematic biases more effectively (by over 25% relative) and efficiently when described through language explanations as opposed to cluster exemplars. △ Less

Submitted 29 October, 2024; originally announced October 2024.

Comments: 20 pages, 9 figures, 15 tables; Accepted to EMNLP 2024

arXiv:2410.19925 [pdf, other]

Improving Multimodal Large Language Models Using Continual Learning

Authors: Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, Christopher Kanan

Abstract: Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using th… ▽ More Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15\% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities. △ Less

Submitted 25 October, 2024; originally announced October 2024.

Comments: NeurIPS 2024 Workshop on Scalable Continual Learning for Lifelong Foundation Models

arXiv:2410.19858 [pdf, other]

Enhancing Deep Learning based RMT Data Inversion using Gaussian Random Field

Authors: Koustav Ghosal, Arun Singh, Samir Malakar, Shalivahan Srivastava, Deepak Gupta

Abstract: Deep learning (DL) methods have emerged as a powerful tool for the inversion of geophysical data. When applied to field data, these models often struggle without additional fine-tuning of the network. This is because they are built on the assumption that the statistical patterns in the training and test datasets are the same. To address this, we propose a DL-based inversion scheme for Radio Magnet… ▽ More Deep learning (DL) methods have emerged as a powerful tool for the inversion of geophysical data. When applied to field data, these models often struggle without additional fine-tuning of the network. This is because they are built on the assumption that the statistical patterns in the training and test datasets are the same. To address this, we propose a DL-based inversion scheme for Radio Magnetotelluric data where the subsurface resistivity models are generated using Gaussian Random Fields (GRF). The network's generalization ability was tested with an out-of-distribution (OOD) dataset comprising a homogeneous background and various rectangular-shaped anomalous bodies. After end-to-end training with the GRF dataset, the pre-trained network successfully identified anomalies in the OOD dataset. Synthetic experiments confirmed that the GRF dataset enhances generalization compared to a homogeneous background OOD dataset. The network accurately recovered structures in a checkerboard resistivity model, and demonstrated robustness to noise, outperforming traditional gradient-based methods. Finally, the developed scheme is tested using exemplary field data from a waste site near Roorkee, India. The proposed scheme enhances generalization in a data-driven supervised learning framework, suggesting a promising direction for OOD generalization in DL methods. △ Less

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.18985 [pdf]

rECGnition_v1.0: Arrhythmia detection using cardiologist-inspired multi-modal architecture incorporating demographic attributes in ECG

Authors: Shreya Srivastava, Durgesh Kumar, Jatin Bedi, Sandeep Seth, Deepak Sharma

Abstract: A substantial amount of variability in ECG manifested due to patient characteristics hinders the adoption of automated analysis algorithms in clinical practice. None of the ECG annotators developed till date consider the characteristics of the patients in a multi-modal architecture. We employed the XGBoost model to analyze the UCI Arrhythmia dataset, linking patient characteristics to ECG morpholo… ▽ More A substantial amount of variability in ECG manifested due to patient characteristics hinders the adoption of automated analysis algorithms in clinical practice. None of the ECG annotators developed till date consider the characteristics of the patients in a multi-modal architecture. We employed the XGBoost model to analyze the UCI Arrhythmia dataset, linking patient characteristics to ECG morphological changes. The model accurately classified patient gender using discriminative ECG features with 87.75% confidence. We propose a novel multi-modal methodology for ECG analysis and arrhythmia classification that can help defy the variability in ECG related to patient-specific conditions. This deep learning algorithm, named rECGnition_v1.0 (robust ECG abnormality detection Version 1), fuses Beat Morphology with Patient Characteristics to create a discriminative feature map that understands the internal correlation between both modalities. A Squeeze and Excitation based Patient characteristic Encoding Network (SEPcEnet) has been introduced, considering the patient's demographics. The trained model outperformed the various existing algorithms by achieving the overall F1-score of 0.986 for the ten arrhythmia class classification in the MITDB and achieved near perfect prediction scores of ~0.99 for LBBB, RBBB, Premature ventricular contraction beat, Atrial premature beat and Paced beat. Subsequently, the methodology was validated across INCARTDB, EDB and different class groups of MITDB using transfer learning. The generalizability test provided F1-scores of 0.980, 0.946, 0.977, and 0.980 for INCARTDB, EDB, MITDB AAMI, and MITDB Normal vs. Abnormal Classification, respectively. Therefore, with a more enhanced and comprehensive understanding of the patient being examined and their ECG for diverse CVD manifestations, the proposed rECGnition_v1.0 algorithm paves the way for its deployment in clinics. △ Less

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2410.17481 [pdf, other]

AI, Global Governance, and Digital Sovereignty

Authors: Swati Srivastava, Justin Bullock

Abstract: This essay examines how Artificial Intelligence (AI) systems are becoming more integral to international affairs by affecting how global governors exert power and pursue digital sovereignty. We first introduce a taxonomy of multifaceted AI payoffs for governments and corporations related to instrumental, structural, and discursive power in the domains of violence, markets, and rights. We next leve… ▽ More This essay examines how Artificial Intelligence (AI) systems are becoming more integral to international affairs by affecting how global governors exert power and pursue digital sovereignty. We first introduce a taxonomy of multifaceted AI payoffs for governments and corporations related to instrumental, structural, and discursive power in the domains of violence, markets, and rights. We next leverage different institutional and practice perspectives on sovereignty to assess how digital sovereignty is variously implicated in AI-empowered global governance. States both seek sovereign control over AI infrastructures in the institutional approach, while establishing sovereign competence through AI infrastructures in the practice approach. Overall, we present the digital sovereignty stakes of AI as related to entanglements of public and private power. Rather than foreseeing technology companies as replacing states, we argue that AI systems will embed in global governance to create dueling dynamics of public/private cooperation and contestation. We conclude with sketching future directions for IR research on AI and global governance. △ Less

Submitted 22 October, 2024; originally announced October 2024.

Comments: 21 pages, 2 tables

arXiv:2410.09031 [pdf, other]

Improved List Size for Folded Reed-Solomon Codes

Authors: Shashank Srivastava

Abstract: Folded Reed-Solomon (FRS) codes are variants of Reed-Solomon codes, known for their optimal list decoding radius. We show explicit FRS codes with rate $R$ that can be list decoded up to radius $1-R-ε$ with lists of size $\mathcal{O}(1/ ε^2)$. This improves the best known list size among explicit list decoding capacity achieving codes. We also show a more general result that for any $k\geq 1$, th… ▽ More Folded Reed-Solomon (FRS) codes are variants of Reed-Solomon codes, known for their optimal list decoding radius. We show explicit FRS codes with rate $R$ that can be list decoded up to radius $1-R-ε$ with lists of size $\mathcal{O}(1/ ε^2)$. This improves the best known list size among explicit list decoding capacity achieving codes. We also show a more general result that for any $k\geq 1$, there are explicit FRS codes with rate $R$ and distance $1-R$ that can be list decoded arbitrarily close to radius $\frac{k}{k+1}(1-R)$ with lists of size $(k-1)^2+1$. Our results are based on a new and simple combinatorial viewpoint of the intersections between Hamming balls and affine subspaces that recovers previously known parameters. We then use folded Wronskian determinants to carry out an inductive proof that yields sharper bounds. △ Less

Submitted 11 October, 2024; originally announced October 2024.

Comments: Accepted to SODA 2025

arXiv:2410.08698 [pdf, other]

SocialGaze: Improving the Integration of Human Social Norms in Large Language Models

Authors: Anvesh Rao Vijjini, Rakesh R. Menon, Jiayi Fu, Shashank Srivastava, Snigdha Chaturvedi

Abstract: While much research has explored enhancing the reasoning capabilities of large language models (LLMs) in the last few years, there is a gap in understanding the alignment of these models with social values and norms. We introduce the task of judging social acceptance. Social acceptance requires models to judge and rationalize the acceptability of people's actions in social situations. For example,… ▽ More While much research has explored enhancing the reasoning capabilities of large language models (LLMs) in the last few years, there is a gap in understanding the alignment of these models with social values and norms. We introduce the task of judging social acceptance. Social acceptance requires models to judge and rationalize the acceptability of people's actions in social situations. For example, is it socially acceptable for a neighbor to ask others in the community to keep their pets indoors at night? We find that LLMs' understanding of social acceptance is often misaligned with human consensus. To alleviate this, we introduce SocialGaze, a multi-step prompting framework, in which a language model verbalizes a social situation from multiple perspectives before forming a judgment. Our experiments demonstrate that the SocialGaze approach improves the alignment with human judgments by up to 11 F1 points with the GPT-3.5 model. We also identify biases and correlations in LLMs in assigning blame that is related to features such as the gender (males are significantly more likely to be judged unfairly) and age (LLMs are more aligned with humans for older narrators). △ Less

Submitted 11 October, 2024; originally announced October 2024.

arXiv:2410.08437 [pdf, other]

Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

Authors: Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava

Abstract: This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of i… ▽ More This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update. △ Less

Submitted 11 April, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.07166 [pdf, other]

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Authors: Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu

Abstract: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evalua… ▽ More We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making. △ Less

Submitted 19 January, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

Comments: Accepted for oral presentation at NeurIPS 2024 in the Datasets and Benchmarks track. Final Camera version

arXiv:2409.12002 [pdf, other]

Towards Global Localization using Multi-Modal Object-Instance Re-Identification

Authors: Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna

Abstract: Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance… ▽ More Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available. △ Less

Submitted 1 May, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

Comments: 8 pages, 5 figures, 3 tables. Accepted at Advances in Robotics, AIR 2025 (Oral)

MSC Class: 68T40 ACM Class: I.2.9; I.2.10

arXiv:2409.08605 [pdf, other]

Effective Integration of KAN for Keyword Spotting

Authors: Anfeng Xu, Biqiao Zhang, Shuyu Kong, Yiteng Huang, Zhaojun Yang, Sangeeta Srivastava, Ming Sun

Abstract: Keyword spotting (KWS) is an important speech processing component for smart devices with voice assistance capability. In this paper, we investigate if Kolmogorov-Arnold Networks (KAN) can be used to enhance the performance of KWS. We explore various approaches to integrate KAN for a model architecture based on 1D Convolutional Neural Networks (CNN). We find that KAN is effective at modeling high-… ▽ More Keyword spotting (KWS) is an important speech processing component for smart devices with voice assistance capability. In this paper, we investigate if Kolmogorov-Arnold Networks (KAN) can be used to enhance the performance of KWS. We explore various approaches to integrate KAN for a model architecture based on 1D Convolutional Neural Networks (CNN). We find that KAN is effective at modeling high-level features in lower-dimensional spaces, resulting in improved KWS performance when integrated appropriately. The findings shed light on understanding KAN for speech processing tasks and on other modalities for future researchers. △ Less

Submitted 11 January, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

Comments: Accepted to ICASSP 2025

arXiv:2408.14652 [pdf, other]

Continuous Optimization for Decoding Errors

Authors: Shashank Srivastava

Abstract: Error-correcting codes are one of the most fundamental objects in pseudorandomness, with applications in communication, complexity theory, and beyond. Codes are useful because of their ability to support decoding, which is the task of recovering a codeword from its noisy copy. List decoding is a relaxation where the decoder is allowed to output a list of codewords, and has seen tremendous progress… ▽ More Error-correcting codes are one of the most fundamental objects in pseudorandomness, with applications in communication, complexity theory, and beyond. Codes are useful because of their ability to support decoding, which is the task of recovering a codeword from its noisy copy. List decoding is a relaxation where the decoder is allowed to output a list of codewords, and has seen tremendous progress over the last 25 years. In this thesis, we prove new algorithmic and combinatorial results about list decoding. We describe a list decoding framework that reduces the task of efficient decoding to proving distance in certain restricted proof systems. We then instantiate this framework for Tanner codes of Sipser and Spielman [IEEE Trans. Inf. Theory 1996] and Alon-Edmonds-Luby (AEL) distance amplification [FOCS 1995] of unique decodable base codes to get the first polynomial time list decoding algorithms for these codes. We also show extensions to the quantum version of AEL distance amplification to get polynomial-time list decodable quantum LDPC codes. We next give an alternate viewpoint of the above framework based on abstract regularity lemmas. We show how to efficiently implement the regularity lemma for the case of Ta-Shma's explicit codes near the Gilbert-Varshamov bound [STOC 2017]. This leads to a near-linear time algorithm for unique decoding of Ta-Shma's code. We also give new combinatorial results that improve known list sizes beyond the Johnson bound. Firstly, we adapt the AEL amplification to construct a new family of explicit codes that can be combinatorially list decoded to the optimal error correction radius. This is the first example of such a code that does not use significant algebraic ingredients. Secondly, we present list size improvements for Folded Reed-Solomon codes, improving the state of the art list size for explicit list decoding capacity achieving codes. △ Less

Submitted 26 August, 2024; originally announced August 2024.

Comments: PhD Thesis, 235 pages

arXiv:2408.05334 [pdf, other]

Revisiting Multi-Modal LLM Evaluation

Authors: Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan

Abstract: With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analys… ▽ More With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/ △ Less

Submitted 9 August, 2024; originally announced August 2024.

arXiv:2405.20772 [pdf]

Reinforcement Learning for Sociohydrology

Authors: Tirthankar Roy, Shivendra Srivastava, Beichen Zhang

Abstract: In this study, we discuss how reinforcement learning (RL) provides an effective and efficient framework for solving sociohydrology problems. The efficacy of RL for these types of problems is evident because of its ability to update policies in an iterative manner - something that is also foundational to sociohydrology, where we are interested in representing the co-evolution of human-water interac… ▽ More In this study, we discuss how reinforcement learning (RL) provides an effective and efficient framework for solving sociohydrology problems. The efficacy of RL for these types of problems is evident because of its ability to update policies in an iterative manner - something that is also foundational to sociohydrology, where we are interested in representing the co-evolution of human-water interactions. We present a simple case study to demonstrate the implementation of RL in a problem of runoff reduction through management decisions related to changes in land-use land-cover (LULC). We then discuss the benefits of RL for these types of problems and share our perspectives on the future research directions in this area. △ Less

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.15907 [pdf, other]

Belief-State Query Policies for User-Aligned POMDPs

Authors: Daniel Bramblett, Siddharth Srivastava

Abstract: Planning in real-world settings often entails addressing partial observability while aligning with users' requirements. We present a novel framework for expressing users' constraints and preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) policies in the setting of goal-oriented partially observable Markov decision processes (gPOMDPs). We… ▽ More Planning in real-world settings often entails addressing partial observability while aligning with users' requirements. We present a novel framework for expressing users' constraints and preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) policies in the setting of goal-oriented partially observable Markov decision processes (gPOMDPs). We present the first formal analysis of such constraints and prove that while the expected cost function of a parameterized BSQ policy w.r.t its parameters is not convex, it is piecewise constant and yields an implicit discrete parameter search space that is finite for finite horizons. This theoretical result leads to novel algorithms that optimize gPOMDP agent behavior with guaranteed user alignment. Analysis proves that our algorithms converge to the optimal user-aligned behavior in the limit. Empirical results show that parameterized BSQ policies provide a computationally feasible approach for user-aligned planning in partially observable settings. △ Less

Submitted 15 April, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.13004 [pdf, other]

MathDivide: Improved mathematical reasoning by large language models

Authors: Saksham Sahai Srivastava, Ashutosh Gandhi

Abstract: Large language models have been proven to be capable of handling complex linguistic and cognitive tasks. Therefore their usage has been extended to tasks requiring logical reasoning ability such as Mathematics. In this paper, we propose a prompting technique called MathDivide that breaks down the mathematical problem into simpler subproblems. Each of the subproblems is formulated as an algebraic e… ▽ More Large language models have been proven to be capable of handling complex linguistic and cognitive tasks. Therefore their usage has been extended to tasks requiring logical reasoning ability such as Mathematics. In this paper, we propose a prompting technique called MathDivide that breaks down the mathematical problem into simpler subproblems. Each of the subproblems is formulated as an algebraic expression whose value is evaluated by the Python code generated by the LLM for the corresponding algebraic expression. The values fed to the Python code are the numerical values provided in the problem statement. The solutions for the subproblems are composed together to obtain the final answer for the problem statement. Finally, the final answer is compared to the correct answer. If the final answer matches the correct answer, it is produced as output else a refinement prompt is fed to the LLM. We experiment with this prompting technique on both closed-source LLM models and open-source LLM models using GSM8K dataset. The results obtained demonstrate that MathDivide was able to significantly outperform the leading prompting technique called Math-prompter. △ Less

Submitted 12 May, 2024; originally announced May 2024.

Comments: 10 pages, 3 figures

arXiv:2405.09546 [pdf, other]

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Authors: Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and renderin… ▽ More The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/ △ Less

Submitted 15 May, 2024; originally announced May 2024.

Comments: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/

arXiv:2404.19095 [pdf]

Catalyzing Social Interactions in Mixed Reality using ML Recommendation Systems

Authors: Sparsh Srivastava, Rohan Arora

Abstract: We create an innovative mixed reality-first social recommendation model, utilizing features uniquely collected through mixed reality (MR) systems to promote social interaction, such as gaze recognition, proximity, noise level, congestion level, and conversational intensity. We further extend these models to include right-time features to deliver timely notifications. We measure performance metrics… ▽ More We create an innovative mixed reality-first social recommendation model, utilizing features uniquely collected through mixed reality (MR) systems to promote social interaction, such as gaze recognition, proximity, noise level, congestion level, and conversational intensity. We further extend these models to include right-time features to deliver timely notifications. We measure performance metrics across various models by creating a new intersection of user features, MR features, and right-time features. We create four model types trained on different combinations of the feature classes, where we compare the baseline model trained on the class of user features against the models trained on MR features, right-time features, and a combination of all of the feature classes. Due to limitations in data collection and cost, we observe performance degradation in the right-time, mixed reality, and combination models. Despite these challenges, we introduce optimizations to improve accuracy across all models by over 14 percentage points, where the best performing model achieved 24% greater accuracy. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.15325 [pdf]

Quantifying Social Presence in Mixed Reality: A Contemporary Review of Techniques and Innovations

Authors: Sparsh Srivastava

Abstract: This literature review investigates the transformative potential of mixed reality (MR) technology, where we explore the intersection of contemporary technological advancements, modern deep learning recommendation systems, and social psychology frameworks. This interdisciplinary study informs the understanding of MR's role in improving social presence, catalyzing novel social interactions, and enha… ▽ More This literature review investigates the transformative potential of mixed reality (MR) technology, where we explore the intersection of contemporary technological advancements, modern deep learning recommendation systems, and social psychology frameworks. This interdisciplinary study informs the understanding of MR's role in improving social presence, catalyzing novel social interactions, and enhancing the quality of interpersonal communication in the real world. We also discuss the challenges and barriers blocking the wide-spread adoption of social networking in MR, such as device constraints, privacy and accessibility concerns, and social norms. Through carefully structured, closed-environment experiments with diverse participants of varying levels of digital literacy, we measure the differences in social dynamics, frequency, quality, and duration of interactions, and levels of social anxiety between MR-enhanced, mobile-enhanced, and control condition participants. △ Less

Submitted 26 April, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.12415 [pdf]

doi 10.1016/j.soilad.2024.100016

Prediction of soil fertility parameters using USB-microscope imagery and portable X-ray fluorescence spectrometry

Authors: Shubhadip Dasgupta, Satwik Pate, Divya Rathore, L. G. Divyanth, Ayan Das, Anshuman Nayak, Subhadip Dey, Asim Biswas, David C. Weindorf, Bin Li, Sergio Henrique Godinho Silva, Bruno Teixeira Ribeiro, Sanjay Srivastava, Somsubhra Chakraborty

Abstract: This study investigated the use of portable X-ray fluorescence (PXRF) spectrometry and soil image analysis for rapid soil fertility assessment, with a focus on key indicators such as available boron (B), organic carbon (OC), available manganese (Mn), available sulfur (S), and the sulfur availability index (SAI). A total of 1,133 soil samples from diverse agro-climatic zones in Eastern India were a… ▽ More This study investigated the use of portable X-ray fluorescence (PXRF) spectrometry and soil image analysis for rapid soil fertility assessment, with a focus on key indicators such as available boron (B), organic carbon (OC), available manganese (Mn), available sulfur (S), and the sulfur availability index (SAI). A total of 1,133 soil samples from diverse agro-climatic zones in Eastern India were analyzed. The research integrated color and texture features from microscopic soil images, PXRF data, and auxiliary soil variables (AVs) using a Random Forest model. Results showed that combining image features (IFs) with AVs significantly improved prediction accuracy for available B (R2 = 0.80) and OC (R2 = 0.88). A data fusion approach, incorporating IFs, AVs, and PXRF data, further enhanced predictions for available Mn and SAI, with R2 values of 0.72 and 0.70, respectively. The study highlights the potential of integrating these technologies to offer rapid, cost-effective soil testing methods, paving the way for more advanced predictive models and a deeper understanding of soil fertility. Future work should explore the application of deep learning models on a larger dataset, incorporating soils from a wider range of agro-climatic zones under field conditions. △ Less

Submitted 5 September, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Comments: Published in 'Soil Advances'

Journal ref: Soil Advances, Volume 2, 2024, 100016

arXiv:2404.00846 [pdf, other]

Transfer Learning with Point Transformers

Authors: Kartik Gupta, Rahul Vippala, Sahima Srivastava

Abstract: Point Transformers are near state-of-the-art models for classification, segmentation, and detection tasks on Point Cloud data. They utilize a self attention based mechanism to model large range spatial dependencies between multiple point sets. In this project we explore two things: classification performance of these attention based networks on ModelNet10 dataset and then, we use the trained model… ▽ More Point Transformers are near state-of-the-art models for classification, segmentation, and detection tasks on Point Cloud data. They utilize a self attention based mechanism to model large range spatial dependencies between multiple point sets. In this project we explore two things: classification performance of these attention based networks on ModelNet10 dataset and then, we use the trained model to classify 3D MNIST dataset after finetuning. We also train the model from scratch on 3D MNIST dataset to compare the performance of finetuned and from-scratch model on the MNIST dataset. We observe that since the two datasets have a large difference in the degree of the distributions, transfer learned models do not outperform the from-scratch models in this case. Although we do expect transfer learned models to converge faster since they already know the lower level edges, corners, etc features from the ModelNet10 dataset. △ Less

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2404.00808 [pdf, other]

Using Explainable AI and Hierarchical Planning for Outreach with Robots

Authors: Rushang Karia, Jayesh Nagpal, Daksh Dobhal, Pulkit Verma, Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava

Abstract: Understanding how robots plan and execute tasks is crucial in today's world, where they are becoming more prevalent in our daily lives. However, teaching non-experts, such as K-12 students, the complexities of robot planning can be challenging. This work presents an open-source platform, \nameAbbr{}, that simplifies the process using a visual interface that abstracts the details of various plannin… ▽ More Understanding how robots plan and execute tasks is crucial in today's world, where they are becoming more prevalent in our daily lives. However, teaching non-experts, such as K-12 students, the complexities of robot planning can be challenging. This work presents an open-source platform, \nameAbbr{}, that simplifies the process using a visual interface that abstracts the details of various planning processes that robots use for performing complex mobile manipulation tasks. Using principles developed in the field of explainable AI, this intuitive platform enables students to use a high-level intuitive instruction set to perform complex tasks, visualize them on an in-built simulator, and to obtain helpful hints and natural language explanations for errors. Finally, \nameAbbr{}, includes an adaptive curriculum generation method that provides students with customized learning ramps. This platform's efficacy was tested through a user study with university students who had little to no computer science background. Our results show that \nameAbbr{} is highly effective in increasing student engagement, teaching robotics programming, and decreasing the time need to solve tasks as compared to baselines. △ Less

Submitted 11 November, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.18327 [pdf, other]

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

Authors: Rushang Karia, Daniel Bramblett, Daksh Dobhal, Pulkit Verma, Siddharth Srivastava

Abstract: This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in… ▽ More This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, the availability of which undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, thus making them inadequate for determining the safety or correctness of their generated outputs. We introduce a new approach that utilizes context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and perform closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately. △ Less

Submitted 21 July, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.09227 [pdf, other]

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Authors: Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R Matthews , et al. (10 additional authors not shown)

Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with… ▽ More We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)

arXiv:2403.05022 [pdf, other]

Effective Fault Localization using Probabilistic and Grouping Approach

Authors: Saksham Sahai Srivastava, Arpita Dutta, Rajib Mall

Abstract: Context: Fault localization (FL) is the key activity while debugging a program. Any improvement to this activity leads to significant improvement in total software development cost. There is an internal linkage between the program spectrum and test execution result. Conditional probability in statistics captures the probability of occurring one event in relationship to one or more other events. Ob… ▽ More Context: Fault localization (FL) is the key activity while debugging a program. Any improvement to this activity leads to significant improvement in total software development cost. There is an internal linkage between the program spectrum and test execution result. Conditional probability in statistics captures the probability of occurring one event in relationship to one or more other events. Objectives: The aim of this paper is to use the conception of conditional probability to design an effective fault localization technique. Methods: In the paper, we present a fault localization technique that derives the association between statement coverage information and test case execution result using condition probability statistics. This association with the failed test case result shows the fault containing the probability of that specific statement. Subsequently, we use a grouping method to refine the obtained statement ranking sequence for better fault localization. Results: We evaluated the effectiveness of proposed method over eleven open-source data sets. Our obtained results show that on average, the proposed CGFL method is 24.56% more effective than other contemporary fault localization methods such as D*, Tarantula, Ochiai, Crosstab, BPNN, RBFNN, DNN, and CNN. Conclusion: We devised an effective fault localization technique by combining the conditional probabilistic method with failed test case execution-based approach. Our experimental evaluation shows our proposed method outperforms the existing fault localization techniques. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.03360 [pdf, other]

Bridge the Future: High-Performance Networks in Confidential VMs without Trusted I/O devices

Authors: Mengyuan Li, Shashvat Srivastava, Mengjia Yan

Abstract: Trusted I/O (TIO) is an appealing solution to improve I/O performance for confidential VMs (CVMs), with the potential to eliminate broad sources of I/O overhead. However, this paper emphasizes that not all types of I/O can derive substantial benefits from TIO, particularly network I/O. Given the obligatory use of encryption protocols for network traffic in CVM's threat model, TIO's approach of I/O… ▽ More Trusted I/O (TIO) is an appealing solution to improve I/O performance for confidential VMs (CVMs), with the potential to eliminate broad sources of I/O overhead. However, this paper emphasizes that not all types of I/O can derive substantial benefits from TIO, particularly network I/O. Given the obligatory use of encryption protocols for network traffic in CVM's threat model, TIO's approach of I/O encryption over the PCIe bus becomes redundant. Furthermore, TIO solutions need to expand the Trusted Computing Base (TCB) to include TIO devices and are commercially unavailable. Motivated by these insights, the goal of this paper is to propose a software solution that helps CVMs immediately benefit from high-performance networks, while confining trust only to the on-chip CVM. We present FOLIO, a software solution crafted from a secure and efficient Data Plane Development Kit (DPDK) extension compatible with the latest version of AMD Secure Encrypted Virtualization (SEV), a.k.a., Secure Nested Paging (SNP). Our design is informed by a thorough analysis of all possible factors that impact SNP VM's network performance. By extensively removing overhead sources, we arrive at a design that approaches the efficiency of an optimal TIO-based configuration. Evaluation shows that FOLIO has a performance dip less than 6% relative to the optimal TIO configuration, while only relying on off-the-shelf CPUs. △ Less

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.19450 [pdf, other]

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

Authors: Saurabh Srivastava, Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, Sooraj Thomas

Abstract: We propose a framework for robust evaluation of reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance over the static version of a problem compared to a snapshot of the functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with… ▽ More We propose a framework for robust evaluation of reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance over the static version of a problem compared to a snapshot of the functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with functionalization of other benchmarks to follow. When evaluating current state-of-the-art models over snapshots of MATH(), we find a reasoning gap -- the percentage difference between the static and functional accuracies. We find reasoning gaps from 58.35% to 80.31% among the state-of-the-art closed and open weights models that perform well on static benchmarks, with the caveat that the gaps are likely to be smaller with more sophisticated prompting strategies. Here we show that models which anecdotally have good reasoning performance over real-world tasks, have quantifiable lower gaps, motivating the open problem of building "gap 0" models. Code for evaluation and new evaluation datasets, three MATH() snapshots, are publicly available at https://github.com/consequentai/fneval/. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: 37 pages, 10 figures

arXiv:2402.11871 [pdf, other]

From Reals to Logic and Back: Inventing Symbolic Vocabularies, Actions, and Models for Planning from Raw Data

Authors: Naman Shah, Jayesh Nagpal, Pulkit Verma, Siddharth Srivastava

Abstract: Hand-crafted, logic-based state and action representations have been widely used to overcome the intractable computational complexity of long-horizon robot planning problems, including task and motion planning problems. However, creating such representations requires experts with strong intuitions and detailed knowledge about the robot and the tasks it may need to accomplish in a given setting. Re… ▽ More Hand-crafted, logic-based state and action representations have been widely used to overcome the intractable computational complexity of long-horizon robot planning problems, including task and motion planning problems. However, creating such representations requires experts with strong intuitions and detailed knowledge about the robot and the tasks it may need to accomplish in a given setting. Removing this dependency on human intuition is a highly active research area. This paper presents the first approach for autonomously learning generalizable, logic-based relational representations for abstract states and actions starting from unannotated high-dimensional, real-valued robot trajectories. The learned representations constitute auto-invented PDDL-like domain models. Empirical results in deterministic settings show that powerful abstract representations can be learned from just a handful of robot trajectories; the learned relational representations include but go beyond classical, intuitive notions of high-level actions; and that the learned models allow planning algorithms to scale to tasks that were previously beyond the scope of planning without hand-crafted abstractions. △ Less

Submitted 4 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.08231 [pdf, ps, other]

doi 10.1109/OJVT.2023.3349151

Asynchronous Distributed Coordinated Hybrid Precoding in Multi-cell mmWave Wireless Networks

Authors: Meesam Jafri, Suraj Srivastava, Sunil Kumar, Aditya K. Jagannatham, Lajos Hanzo

Abstract: Asynchronous distributed hybrid beamformers (ADBF) are conceived for minimizing the total transmit power subject to signal-to-interference-plus-noise ratio (SINR) constraints at the users. Our design requires only limited information exchange between the base stations (BSs) of the mmWave multi-cell coordinated (MCC) networks considered. To begin with, a semidefinite relaxation (SDR)-based fully-di… ▽ More Asynchronous distributed hybrid beamformers (ADBF) are conceived for minimizing the total transmit power subject to signal-to-interference-plus-noise ratio (SINR) constraints at the users. Our design requires only limited information exchange between the base stations (BSs) of the mmWave multi-cell coordinated (MCC) networks considered. To begin with, a semidefinite relaxation (SDR)-based fully-digital (FD) beamformer is designed for a centralized MCC system. Subsequently, a Bayesian learning (BL) technique is harnessed for decomposing the FD beamformer into its analog and baseband components and construct a hybrid transmit precoder (TPC). However, the centralized TPC design requires global channel state information (CSI), hence it results in a high signaling overhead. An alternating direction based method of multipliers (ADMM) technique is developed for a synchronous distributed beamformer (SDBF) design, which relies only on limited information exchange among the BSs, thus reducing the signaling overheads required by the centralized TPC design procedure. However, the SDBF design is challenging, since it requires the updates from the BSs to be strictly synchronized. As a remedy, an ADBF framework is developed that mitigates the inter-cell interference (ICI) and also control the asynchrony in the system. Furthermore, the above ADBF framework is also extended to the robust ADBF (R-ADBF) algorithm that incorporates the CSI uncertainty into the design procedure for minimizing the the worst-case transmit power. Our simulation results illustrate both the enhanced performance and the improved convergence properties of the ADMM-based ADBF and R-ADBF schemes. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Journal ref: IEEE Open Journal of Vehicular Technology, vol. 5, pp. 200-218, 2024

arXiv:2402.08145 [pdf, other]

doi 10.1609/icaps.v34i1.31489

Epistemic Exploration for Generalizable Planning and Learning in Non-Stationary Settings

Authors: Rushang Karia, Pulkit Verma, Alberto Speranzon, Siddharth Srivastava

Abstract: This paper introduces a new approach for continual planning and model learning in relational, non-stationary stochastic environments. Such capabilities are essential for the deployment of sequential decision-making systems in the uncertain and constantly evolving real world. Working in such practical settings with unknown (and non-stationary) transition systems and changing tasks, the proposed fra… ▽ More This paper introduces a new approach for continual planning and model learning in relational, non-stationary stochastic environments. Such capabilities are essential for the deployment of sequential decision-making systems in the uncertain and constantly evolving real world. Working in such practical settings with unknown (and non-stationary) transition systems and changing tasks, the proposed framework models gaps in the agent's current state of knowledge and uses them to conduct focused, investigative explorations. Data collected using these explorations is used for learning generalizable probabilistic models for solving the current task despite continual changes in the environment dynamics. Empirical evaluations on several non-stationary benchmark domains show that this approach significantly outperforms planning and RL baselines in terms of sample complexity. Theoretical results show that the system exhibits desirable convergence properties when stationarity holds. △ Less

Submitted 6 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: To appear at ICAPS-24

Journal ref: Proceedings of the International Conference on Automated Planning and Scheduling, 34(1), 310-318, 2024

Showing 1–50 of 196 results for author: Srivastava, S