Showing 1–5 of 5 results for author: Penamakuri, A S

  1. arXiv:2412.20619

    cs.LG cs.MM cs.SD eess.AS

    Audiopedia: Audio QA with Knowledge

    Authors: Abhirama Subramanyam Penamakuri, Kiran Chhatre, Akshat Jain

    Abstract: In this paper, we introduce Audiopedia, a novel task called Audio Question Answering with Knowledge, which requires both audio comprehension and external knowledge reasoning. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Ans…

    Submitted 29 December, 2024; originally announced December 2024.

    Comments: Accepted to ICASSP 2025

  2. arXiv:2410.19144

    cs.CV cs.AI cs.CL cs.LG

    Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant

    Authors: Abhirama Subramanyam Penamakuri, Anand Mishra

    Abstract: We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL - a principled approach to perform visual text entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large mult…

    Submitted 24 October, 2024; originally announced October 2024.

    Comments: Accepted to EMNLP (Main) 2024

  3. arXiv:2306.16713

    cs.CV cs.AI cs.LG

    Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

    Authors: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra

    Abstract: We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively di…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: Accepted to IJCAI 2023

  4. arXiv:2211.12926

    cs.CV cs.AI cs.LG

    Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification

    Authors: Nakul Sharma, Abhirama S. Penamakuri, Anand Mishra

    Abstract: In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This problem setup is significantly more challenging than traditionally-studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes text appearing in the logos…

    Submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted to ICVGIP 2022

  5. arXiv:2210.08554

    cs.CV cs.CL

    COFAR: Commonsense and Factual Reasoning in Image Search

    Authors: Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, Roshni Ramnani

    Abstract: One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires o…

    Submitted 16 October, 2022; originally announced October 2022.

    Comments: Accepted to AACL-IJCNLP 2022