Showing 1–31 of 31 results for author: Gokhale, T

  1. Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

    Authors: Naresh Kumar Devulapally, Shruti Agarwal, Tejas Gokhale, Vishnu Suresh Lokhande

    Abstract: Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has led to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of gen…

    Submitted 3 October, 2025; originally announced October 2025.

  2. arXiv:2508.15124  [pdf, ps, other]

    cs.LG cs.CV

    Side Effects of Erasing Concepts from Diffusion Models

    Authors: Shaswati Saha, Sourajit Saha, Manas Gaur, Tejas Gokhale

    Abstract: Concerns about text-to-image (T2I) generative models infringing on privacy, copyright, and safety have led to the development of concept erasure techniques (CETs). The goal of an effective CET is to prohibit the generation of undesired "target" concepts specified by the user, while preserving the ability to synthesize high-quality images of other concepts. In this work, we demonstrate that concept…

    Submitted 19 September, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

    Comments: Findings of the Association for Computational Linguistics: EMNLP 2025

  3. arXiv:2503.00043  [pdf, other]

    cs.CV cs.AI cs.CL

    VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

    Authors: Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang

    Abstract: Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs…

    Submitted 4 March, 2025; v1 submitted 25 February, 2025; originally announced March 2025.

    Comments: Accepted at ICLR 2025. Code and data: https://github.com/nlylmz/Voila

  4. arXiv:2411.02545  [pdf, other]

    cs.CV cs.CL

    TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

    Authors: Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show tha…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Accepted at: NeurIPS 2024 | Project Page: https://tripletclip.github.io

  5. arXiv:2408.02231  [pdf, other]

    cs.CV

    REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

    Authors: Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework which improves spatial fidelity in vision-language models.…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Accepted to ECCV 2024. Project Page: https://agneetchatterjee.com/revision/

  6. arXiv:2405.15961  [pdf, other]

    cs.CV

    Grounding Stylistic Domain Generalization with Quantitative Domain Shift Measures and Synthetic Scene Images

    Authors: Yiran Luo, Joshua Feinglass, Tejas Gokhale, Kuan-Cheng Lee, Chitta Baral, Yezhou Yang

    Abstract: Domain Generalization (DG) is a challenging task in machine learning that requires a coherent ability to comprehend shifts across various domains through extraction of domain-invariant features. DG performance is typically evaluated by performing image classification in domains of various image styles. However, current methodology lacks quantitative understanding about shifts in stylistic domain,…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Accepted at the 3rd CVPR Workshop on Vision Datasets Understanding

  7. arXiv:2405.09567  [pdf]

    eess.SP cs.AI cs.LG

    ECG-SMART-NET: A Deep Learning Architecture for Precise ECG Diagnosis of Occlusion Myocardial Infarction

    Authors: Nathan T. Riek, Murat Akcakaya, Zeineb Bouzid, Tanmay Gokhale, Stephanie Helman, Karina Kraevsky-Philips, Rui Qi Ji, Ervin Sejdic, Jessica K. Zègre-Hemsey, Christian Martin-Gill, Clifton W. Callaway, Samir Saba, Salah Al-Zaiti

    Abstract: Objective: In this paper we develop and evaluate ECG-SMART-NET for occlusion myocardial infarction (OMI) identification. OMI is a severe form of heart attack characterized by complete blockage of one or more coronary arteries requiring immediate referral for cardiac catheterization to restore blood flow to the heart. Two thirds of OMI cases are difficult to visually identify from a 12-lead electro…

    Submitted 24 June, 2025; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: 9 pages, 7 figures, 6 tables

    Journal ref: IEEE Transactions on Biomedical Engineering, 2025

  8. arXiv:2404.08540  [pdf, other]

    cs.CV

    On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

    Authors: Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness acros…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project webpage: https://agneetchatterjee.com/robustness_depth_lang/

  9. arXiv:2404.07410  [pdf, other]

    cs.CV cs.LG

    Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling

    Authors: Sourajit Saha, Tejas Gokhale

    Abstract: Downsampling operators break the shift invariance of convolutional neural networks (CNNs) and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is…

    Submitted 1 December, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: Accepted to WACV 2025

  10. arXiv:2404.01197  [pdf, other]

    cs.CV

    Getting it Right: Improving Spatial Consistency in Text-to-Image Models

    Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

    Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find…

    Submitted 6 August, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted to ECCV 2024. Project Page: https://spright-t2i.github.io/

  11. arXiv:2307.09520  [pdf, other]

    cs.CV

    Adversarial Bayesian Augmentation for Single-Source Domain Generalization

    Authors: Sheng Cheng, Tejas Gokhale, Yezhou Yang

    Abstract: Generalizing to unseen image domains is a challenging problem primarily due to the lack of diverse training data, inaccessible target data, and the large domain shift that may exist in many real-world settings. As such, data augmentation is a critical component of domain generalization methods that seek to address this problem. We present Adversarial Bayesian Augmentation (ABA), a novel algorithm t…

    Submitted 2 October, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

    Comments: Accepted to ICCV 2023

  12. arXiv:2306.04695  [pdf, other]

    cs.CV cs.CL cs.LG

    ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models

    Authors: Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition and realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualita…

    Submitted 22 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted at AAAI'24 | Project page: https://conceptbed.github.io

  13. arXiv:2306.00424  [pdf, other]

    cs.CL cs.CV cs.IR

    End-to-end Knowledge Retrieval with Multi-modal Queries

    Authors: Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: We investigate knowledge retrieval with multi-modal queries, i.e. queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and imag…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  14. arXiv:2303.17080  [pdf, other]

    cs.LG

    Mole Recruitment: Poisoning of Image Classifiers via Selective Batch Sampling

    Authors: Ethan Wisdom, Tejas Gokhale, Chaowei Xiao, Yezhou Yang

    Abstract: In this work, we present a data poisoning attack that confounds machine learning models without any manipulation of the image or label. This is achieved by simply leveraging the most confounding natural samples found within the training data itself, in a new form of a targeted attack coined "Mole Recruitment." We define moles as the training samples of a class that appear most similar to samples o…

    Submitted 29 March, 2023; originally announced March 2023.

  15. arXiv:2212.10015  [pdf, other]

    cs.CV cs.AI cs.CL

    Benchmarking Spatial Relationships in Text-to-Image Generation

    Authors: Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang

    Abstract: Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of…

    Submitted 27 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: preprint; Code and Data at https://github.com/microsoft/VISOR and https://huggingface.co/datasets/tgokhale/sr2d_visor

  16. arXiv:2211.03779  [pdf, other]

    cs.CV cs.CL

    CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering

    Authors: Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Accepted to EMNLP 2022; https://maitreyapatel.com/CRIPP-VQA/

  17. arXiv:2206.07736  [pdf, other]

    cs.LG cs.CV

    Improving Diversity with Adversarially Learned Transformations for Domain Generalization

    Authors: Tejas Gokhale, Rushil Anirudh, Jayaraman J. Thiagarajan, Bhavya Kailkhura, Chitta Baral, Yezhou Yang

    Abstract: To be successful in single source domain generalization, maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Many of the recent successes have come from methods that pre-specify the types of diversity that a model is exposed to during training, so that it can ultimately generalize well to new domains. However, naïve diversity based augmentations do not…

    Submitted 12 December, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: WACV 2023. Code: https://github.com/tejas-gokhale/ALT

  18. arXiv:2203.16682  [pdf, other]

    cs.CV cs.CL cs.LG

    To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo

    Authors: Yiran Luo, Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task first proposed by Cui et al. (2021) in the Who's Waldo dataset. Given an image and a caption, PCVG requires pairing up a person's name mentioned in a caption with a bounding box that points to the person in the image. We find that the original Who's Waldo dataset compiled for this task contains a large number of bias…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022 (Short Paper)

  19. arXiv:2203.07653  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness

    Authors: Tejas Gokhale, Swaroop Mishra, Man Luo, Bhavdeep Singh Sachdeva, Chitta Baral

    Abstract: Data modification, whether via additional training datasets, data augmentation, debiasing, or dataset filtering, has been proposed as an effective solution for generalizing to out-of-domain (OOD) inputs, in both natural language processing and computer vision literature. However, the effect of data modification on adversarial robustness remains unclear. In this work, we conduct a comprehensive stu…

    Submitted 15 March, 2022; originally announced March 2022.

    Comments: ACL 2022 Findings

  20. arXiv:2201.07745  [pdf, other]

    cs.IR cs.CL

    Improving Biomedical Information Retrieval with Neural Retrievers

    Authors: Man Luo, Arindam Mitra, Tejas Gokhale, Chitta Baral

    Abstract: Information retrieval (IR) is essential in search engines and dialogue systems as well as natural language processing tasks such as open-domain question answering. IR serves an important function in the biomedical domain, where content and sources of scientific knowledge may evolve rapidly. Although neural retrievers have surpassed traditional IR approaches such as TF-IDF and BM25 in standard open-…

    Submitted 19 January, 2022; originally announced January 2022.

    Comments: Accepted at AAAI 2022

  21. arXiv:2110.08438  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Natural Language Inference Using PHL Triplet Generation

    Authors: Neeraj Varshney, Pratyay Banerjee, Tejas Gokhale, Chitta Baral

    Abstract: Transformer-based models achieve impressive performance on numerous Natural Language Inference (NLI) benchmarks when trained on respective training datasets. However, in certain cases, training samples may not be available or collecting them could be time-consuming and resource-intensive. In this work, we address the above challenge and present an explorative study on unsupervised NLI, a paradigm…

    Submitted 15 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: ACL 2022 Findings

  22. arXiv:2110.07165  [pdf, other]

    cs.CV cs.CL

    Semantically Distributed Robust Optimization for Vision-and-Language Inference

    Authors: Tejas Gokhale, Abhishek Chaudhary, Pratyay Banerjee, Chitta Baral, Yezhou Yang

    Abstract: Analysis of vision-and-language models has revealed their brittleness under linguistic phenomena such as paraphrasing, negation, textual entailment, and word substitutions with synonyms or antonyms. While data augmentation techniques have been designed to mitigate against these failure modes, methods that can integrate this knowledge into the training pipeline remain under-explored. In this paper,…

    Submitted 14 March, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Findings of ACL 2022; code available at https://github.com/ASU-APG/VLI_SDRO

  23. arXiv:2109.01934  [pdf, other]

    cs.CV cs.CL cs.LG

    Weakly Supervised Relative Spatial Reasoning for Visual Question Answering

    Authors: Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e. implicitly learning the geometry of the scene. In…

    Submitted 4 September, 2021; originally announced September 2021.

    Comments: Accepted to ICCV 2021. PaperId: ICCV2021-10857. Copyright transferred to IEEE ICCV. DOI will be updated later.

  24. arXiv:2103.11263  [pdf, other]

    cs.CL cs.LG

    Self-Supervised Test-Time Learning for Reading Comprehension

    Authors: Pratyay Banerjee, Tejas Gokhale, Chitta Baral

    Abstract: Recent work on unsupervised question answering has shown that models can be trained with procedurally generated question-answer pairs and can achieve performance competitive with supervised methods. In this work, we consider the task of unsupervised reading comprehension and present a method that performs "test-time learning" (TTL) on a given context (text passage), without requiring training on l…

    Submitted 20 March, 2021; originally announced March 2021.

    Comments: Accepted to NAACL 2021

  25. arXiv:2012.02356  [pdf, other]

    cs.CV cs.CL

    WeaQA: Weak Supervision via Captions for Visual Question Answering

    Authors: Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA mod…

    Submitted 28 May, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

    Comments: Accepted in Findings of ACL 2021

  26. arXiv:2012.01806  [pdf, other]

    cs.CV cs.LG

    Attribute-Guided Adversarial Training for Robustness to Natural Perturbations

    Authors: Tejas Gokhale, Rushil Anirudh, Bhavya Kailkhura, Jayaraman J. Thiagarajan, Chitta Baral, Yezhou Yang

    Abstract: While existing work in robust deep learning has focused on small pixel-level norm-based perturbations, this may not account for perturbations encountered in several real-world settings. In many such cases, although test data might not be available, broad specifications about the types of perturbations (such as an unknown degree of rotation) may be known. We consider a setup where robustness is expe…

    Submitted 7 April, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

    Comments: AAAI 2021. Camera Ready version + Appendix

  27. arXiv:2009.08566  [pdf, other]

    cs.CV cs.CL

    MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

    Authors: Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

    Abstract: While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct…

    Submitted 15 October, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

    Comments: Accepted to EMNLP 2020, Long Papers

  28. arXiv:2004.08614  [pdf, other]

    cs.CV cs.LG

    Halluci-Net: Scene Completion by Exploiting Object Co-occurrence Relationships

    Authors: Kuldeep Kulkarni, Tejas Gokhale, Rajhans Singh, Pavan Turaga, Aswin Sankaranarayanan

    Abstract: Recently, there has been substantial progress in image synthesis from semantic labelmaps. However, methods used for this task assume the availability of complete and unambiguous labelmaps, with instance boundaries of objects, and class labels for each pixel. This reliance on heavily annotated inputs restricts the application of image synthesis techniques to real-world applications, especially unde…

    Submitted 20 May, 2021; v1 submitted 18 April, 2020; originally announced April 2020.

    Comments: Accepted to AI for Content Creation Workshop @CVPR 2021

  29. arXiv:2003.05162  [pdf, other]

    cs.CV cs.CL

    Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

    Authors: Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

    Abstract: Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes, such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to…

    Submitted 7 January, 2023; v1 submitted 11 March, 2020; originally announced March 2020.

    Comments: EMNLP 2020. V2C Website: https://asu-apg.github.io/Video2Commonsense/

  30. arXiv:2002.08325  [pdf, other]

    cs.CV cs.CL

    VQA-LOL: Visual Question Answering under the Lens of Logic

    Authors: Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

    Abstract: Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models ha…

    Submitted 15 July, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: Accepted to ECCV 2020

  31. arXiv:1905.12042  [pdf, other]

    cs.CV cs.AI

    Blocksworld Revisited: Learning and Reasoning to Generate Event-Sequences from Image Pairs

    Authors: Tejas Gokhale, Shailaja Sampat, Zhiyuan Fang, Yezhou Yang, Chitta Baral

    Abstract: The process of identifying changes or transformations in a scene, along with the ability to reason about their causes and effects, is a key aspect of intelligence. In this work we go beyond recent advances in computational perception, and introduce a more challenging task, Image-based Event-Sequencing (IES). In IES, the task is to predict a sequence of actions required to rearrange objects from…

    Submitted 28 May, 2019; originally announced May 2019.

    Comments: 10 pages, 5 figures, for associated dataset, see https://asu-active-perception-group.github.io/bird_dataset_web/