Skip to main content

Showing 1–9 of 9 results for author: Vayani, A

.
  1. arXiv:2506.07032  [pdf, ps, other

    cs.CL cs.CV

    A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

    Authors: Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh , et al. (4 additional authors not shown)

    Abstract: Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of vid… ▽ More

    Submitted 8 June, 2025; originally announced June 2025.

  2. arXiv:2505.11454  [pdf, ps, other

    cs.CV cs.AI

    HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

    Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

    Abstract: Large multimodal models (LMMs) now excel on many vision language benchmarks, however, they still struggle with human centered criteria such as fairness, ethics, empathy, and inclusivity, key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image question pairs, annotated via a scalable GPT4o assisted pipeline and exhaustively verified by domain expert… ▽ More

    Submitted 23 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  3. arXiv:2503.16423  [pdf, other

    cs.CV cs.LG

    GAEA: A Geolocation Aware Conversational Model

    Authors: Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, Mubarak Shah

    Abstract: Image geolocalization, in which, traditionally, an AI model predicts the precise GPS coordinates of an image is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge other than the GPS coordinate; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with tr… ▽ More

    Submitted 24 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: The dataset and code used in this submission is available at: https://ucf-crcv.github.io/GAEA/

    ACM Class: I.4; I.2.7; I.5

  4. arXiv:2502.11361  [pdf, ps, other

    cs.CL

    VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment

    Authors: Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, Syed Raza Bashir, Elham Dolatabadi, Gias Uddin, Christos Emmanouilidis, Rizwan Qureshi, Mubarak Shah

    Abstract: Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitat… ▽ More

    Submitted 30 May, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

    Comments: under review

  5. arXiv:2502.08779  [pdf, other

    cs.CV

    SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models

    Authors: Vishal Narnaware, Ashmal Vayani, Rohit Gupta, Sirnam Swetha, Mubarak Shah

    Abstract: Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets evaluating stereotype biase… ▽ More

    Submitted 17 February, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

  6. arXiv:2502.08650  [pdf, other

    cs.CY

    Who is Responsible? The Data, Models, Users or Regulations? A Comprehensive Survey on Responsible Generative AI for a Sustainable Future

    Authors: Shaina Raza, Rizwan Qureshi, Anam Zahid, Joseph Fioresi, Ferhat Sadak, Muhammad Saeed, Ranjan Sapkota, Aditya Jain, Anas Zafar, Muneeb Ul Hassan, Aizan Zafar, Hasan Maqbool, Ashmal Vayani, Jia Wu, Maged Shoman

    Abstract: Responsible Artificial Intelligence (RAI) has emerged as a crucial framework for addressing ethical concerns in the development and deployment of Artificial Intelligence (AI) systems. A significant body of literature exists, primarily focusing on either RAI guidelines and principles or the technical aspects of RAI, largely within the realm of traditional AI. However, a notable gap persists in brid… ▽ More

    Submitted 28 April, 2025; v1 submitted 15 January, 2025; originally announced February 2025.

    Comments: under review

  7. arXiv:2411.16508  [pdf, other

    cs.CV cs.CL

    All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

    Authors: Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani , et al. (44 additional authors not shown)

    Abstract: Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All La… ▽ More

    Submitted 30 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: A Multilingual Multimodal cultural benchmark for 100 languages

  8. arXiv:2403.14743  [pdf, other

    cs.CV

    VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

    Authors: Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the… ▽ More

    Submitted 9 March, 2025; v1 submitted 21 March, 2024; originally announced March 2024.

  9. arXiv:2402.16840  [pdf, other

    cs.CL

    MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

    Authors: Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan

    Abstract: "Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the chall… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Code available at : https://github.com/mbzuai-oryx/MobiLlama