Showing 1–19 of 19 results for author: Cogswell, M

Searching in archive cs.
  1. arXiv:2312.12716

    cs.CV cs.CL cs.LG

    BloomVQA: Assessing Hierarchical Multi-modal Comprehension

    Authors: Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran

    Abstract: We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of large vision-language models on comprehension tasks. Unlike current benchmarks that often focus on fact-based memorization and simple reasoning tasks without theoretical grounding, we collect multiple-choice samples based on picture stories that reflect different levels of comprehension, as laid out in Bloom's Taxo…

    Submitted 10 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted by ACL Findings (2024). Dataset available at https://huggingface.co/datasets/ygong/BloomVQA

  2. arXiv:2312.00115

    cs.CV cs.CL

    A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

    Authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran

    Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from moment-by-moment detail to a single phrase summary. To provide a more thorough evaluation of the capabilities of long…

    Submitted 9 December, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

    Comments: 17 pages, 16 tables, 8 figures. To appear at WACV 2025

  3. arXiv:2311.10081

    cs.CV cs.CL cs.LG

    DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feed…

    Submitted 19 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: CVPR 2024. The feedback datasets are released at: https://huggingface.co/datasets/YangyiYY/LVLM_NLF

  4. arXiv:2309.04461

    cs.CL cs.CV cs.LG

    Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are…

    Submitted 19 March, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: NAACL 2024 Main Conference. The data is released at https://github.com/Yangyi-Chen/CoTConsistency

  5. arXiv:2304.03659

    cs.CV

    Probing Conceptual Understanding of Large Visual-Language Models

    Authors: Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat

    Abstract: In recent years large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work we focus on conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content und…

    Submitted 26 April, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

    Comments: All code and dataset is available at: https://tinyurl.com/vlm-robustness. Accepted in CVPRW 2024

  6. arXiv:2209.15093

    cs.CL

    Unpacking Large Language Models with Conceptual Consistency

    Authors: Pritish Sahu, Michael Cogswell, Yunye Gong, Ajay Divakaran

    Abstract: If a Large Language Model (LLM) answers "yes" to the question "Are mountains tall?" then does it know what a mountain is? Can you rely on it responding correctly or incorrectly to other questions about mountains? The success of Large Language Models (LLMs) indicates they are increasingly able to answer queries like these accurately, but that ability does not necessarily imply a general understandi…

    Submitted 29 September, 2022; originally announced September 2022.

  7. arXiv:2110.08335

    cs.CV cs.CG

    Trigger Hunting with a Topological Prior for Trojan Detection

    Authors: Xiaoling Hu, Xiao Lin, Michael Cogswell, Yi Yao, Susmit Jha, Chao Chen

    Abstract: Despite their success and popularity, deep neural networks (DNNs) are vulnerable when facing backdoor attacks. This impedes their wider adoption, especially in mission critical applications. This paper tackles the problem of Trojan detection, namely, identifying Trojaned models -- models trained with poisoned data. One popular approach is reverse engineering, i.e., recovering the triggers on a cle…

    Submitted 2 April, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: 17 pages, 10 figures

  8. arXiv:2110.06863

    cs.CV cs.AI cs.HC

    Improving Users' Mental Model with Attention-directed Counterfactual Edits

    Authors: Kamran Alipour, Arijit Ray, Xiao Lin, Michael Cogswell, Jurgen P. Schulze, Yi Yao, Giedrius T. Burachas

    Abstract: In the domain of Visual Question Answering (VQA), studies have shown improvement in users' mental model of the VQA system when they are exposed to examples of how these systems answer certain Image-Question (IQ) pairs. In this work, we show that showing controlled counterfactual image-question examples are more effective at improving the mental model of users as compared to simply showing random e…

    Submitted 15 October, 2021; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: Accepted for publication in Applied AI Letters

  9. arXiv:2106.04653

    cs.CL

    Comprehension Based Question Answering using Bloom's Taxonomy

    Authors: Pritish Sahu, Michael Cogswell, Sara Rutherford-Quach, Ajay Divakaran

    Abstract: Current pre-trained language models have lots of knowledge, but a more limited ability to use that knowledge. Bloom's Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide…

    Submitted 8 June, 2021; originally announced June 2021.

  10. arXiv:2103.14712

    cs.CV cs.AI cs.CY cs.HC

    Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models

    Authors: Arijit Ray, Michael Cogswell, Xiao Lin, Kamran Alipour, Ajay Divakaran, Yi Yao, Giedrius Burachas

    Abstract: Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we…

    Submitted 25 October, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Applied AI Letters, Wiley, 25 October 2021

  11. arXiv:2007.12750

    cs.CV cs.AI cs.CL

    Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

    Authors: Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog model…

    Submitted 24 July, 2020; originally announced July 2020.

    Comments: 19 pages, 8 figures

  12. arXiv:1904.09067

    cs.LG cs.AI cs.CL stat.ML

    Emergence of Compositional Language with Deep Generational Transmission

    Authors: Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: Recent work has studied the emergence of language among deep reinforcement learning agents that must collaborate to solve a task. Of particular interest are the factors that cause language to be compositional -- i.e., express meaning by combining words which themselves have meaning. Evolutionary linguists have found that in addition to structural priors like those already studied in deep learning,…

    Submitted 27 May, 2020; v1 submitted 19 April, 2019; originally announced April 2019.

  13. arXiv:1611.07450

    stat.ML cs.CV cs.LG

    Grad-CAM: Why did you say that?

    Authors: Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space v…

    Submitted 25 January, 2017; v1 submitted 22 November, 2016; originally announced November 2016.

    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. This is an extended abstract version of arXiv:1610.02391 (CVPR format)

  14. arXiv:1610.02424

    cs.AI cs.CL cs.CV

    Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

    Authors: Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra

    Abstract: Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion retaining only the top-B candidates - resulting in sequences that differ only slightly from each other. Producing lists of nearly identica…

    Submitted 22 October, 2018; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: 16 pages; accepted at AAAI 2018

  15. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

    Authors: Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach - Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting important regions in the image for predicting the co…

    Submitted 2 December, 2019; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: This version was published in International Journal of Computer Vision (IJCV) in 2019; A previous version of the paper was published at International Conference on Computer Vision (ICCV'17)

  16. arXiv:1606.07839

    cs.CV cs.CL

    Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles

    Authors: Stefan Lee, Senthil Purushwalkam, Michael Cogswell, Viresh Ranjan, David Crandall, Dhruv Batra

    Abstract: Many practical perception systems exist within larger processes that include interactions with users or additional components capable of evaluating the quality of predicted solutions. In these contexts, it is beneficial to provide these oracle mechanisms with multiple highly likely hypotheses rather than a single prediction. In this work, we pose the task of producing multiple outputs as a learnin…

    Submitted 5 October, 2016; v1 submitted 24 June, 2016; originally announced June 2016.

  17. arXiv:1511.06314

    cs.CV cs.LG cs.NE

    Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks

    Authors: Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, Dhruv Batra

    Abstract: Convolutional Neural Networks have achieved state-of-the-art performance on a wide range of tasks. Most benchmarks are led by ensembles of these powerful learners, but ensembling is typically treated as a post-hoc procedure implemented by averaging independently trained models with model variation induced by bagging or random initialization. In this paper, we rigorously treat ensembling as a first…

    Submitted 19 November, 2015; originally announced November 2015.

  18. arXiv:1511.06068

    cs.LG stat.ML

    Reducing Overfitting in Deep Networks by Decorrelating Representations

    Authors: Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, Dhruv Batra

    Abstract: One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between…

    Submitted 10 June, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

    Comments: 12 pages, 5 figures, 5 tables, Accepted to ICLR 2016, (v4 adds acknowledgements)

  19. arXiv:1412.4313

    cs.CV

    Combining the Best of Graphical Models and ConvNets for Semantic Segmentation

    Authors: Michael Cogswell, Xiao Lin, Senthil Purushwalkam, Dhruv Batra

    Abstract: We present a two-module approach to semantic segmentation that incorporates Convolutional Networks (CNNs) and Graphical Models. Graphical models are used to generate a small (5-30) set of diverse segmentations proposals, such that this set has high recall. Since the number of required proposals is so low, we can extract fairly complex features to rank them. Our complex feature of choice is a novel…

    Submitted 15 December, 2014; v1 submitted 14 December, 2014; originally announced December 2014.

    Comments: 13 pages, 6 figures