Showing 1–6 of 6 results for author: Tamarapalli, J S

  1. arXiv:2506.03194  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    HueManity: Probing Fine-Grained Visual Perception in MLLMs

    Authors: Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

    Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pat…

    Submitted 31 May, 2025; originally announced June 2025.

  2. arXiv:2506.00785  [pdf, ps, other]

    cs.AI cs.CV cs.LG

    GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

    Authors: Sahiti Yerramilli, Nilay Pande, Rynaa Grover, Jayant Sravan Tamarapalli

    Abstract: This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization acr…

    Submitted 31 May, 2025; originally announced June 2025.

  3. arXiv:2501.07957  [pdf, other]

    cs.RO cs.AI cs.CV cs.HC cs.LG

    AI Guide Dog: Egocentric Path Prediction on Smartphone

    Authors: Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Priyam Kumar, Aditi Sharma, Ben Sukboontip, Jayant Sravan Tamarapalli, Jingyi Zhang, Anirudh Koul

    Abstract: This paper presents AI Guide Dog (AIGD), a lightweight egocentric (first-person) navigation system for visually impaired users, designed for real-time deployment on smartphones. AIGD employs a vision-only multi-label classification approach to predict directional commands, ensuring safe navigation across diverse environments. We introduce a novel technique for goal-based outdoor navigation by inte…

    Submitted 16 February, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

    Comments: Accepted at the AAAI 2025 Spring Symposium on Human-Compatible AI for Well-being: Harnessing Potential of GenAI for AI-Powered Science

  4. arXiv:2404.02359  [pdf, ps, other]

    cs.LG

    Attribution Regularization for Multimodal Paradigms

    Authors: Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, Eric Nyberg

    Abstract: Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dom…

    Submitted 2 April, 2024; originally announced April 2024.

  5. arXiv:2404.02353  [pdf, other]

    cs.CV cs.AI cs.LG

    Semantic Augmentation in Images using Language

    Authors: Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, Eric Nyberg

    Abstract: Deep learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to tr…

    Submitted 2 April, 2024; originally announced April 2024.

  6. arXiv:2307.13850  [pdf, other]

    cs.LG cs.AI cs.CV cs.RO

    MAEA: Multimodal Attribution for Embodied AI

    Authors: Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Yonatan Bisk

    Abstract: Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different…

    Submitted 25 July, 2023; originally announced July 2023.