Showing 1–6 of 6 results for author: B.G, V K

  1. arXiv:2404.04627  [pdf, other]

    cs.CV

    Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

    Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

    Abstract: Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  2. arXiv:2310.17050  [pdf, other]

    cs.CV

    Exploring Question Decomposition for Zero-Shot VQA

    Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu

    Abstract: Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023 Camera Ready

  3. arXiv:2306.03932  [pdf, other]

    cs.CV

    Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

    Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker

    Abstract: Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challe…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: CVPR 2023

  4. arXiv:2203.14395  [pdf, other]

    cs.CV

    Single-Stream Multi-Level Alignment for Vision-Language Pretraining

    Authors: Zaid Khan, Vijay Kumar BG, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu

    Abstract: Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a si…

    Submitted 27 July, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: ECCV 2022

  5. arXiv:2011.11735  [pdf]

    cs.AI cs.CV cs.LG

    Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention

    Authors: Varnith Chordia, Vijay Kumar BG

    Abstract: Accurate and efficient product classification is significant for E-commerce applications, as it enables various downstream tasks such as recommendation, retrieval, and pricing. Items often contain textual and visual information, and utilizing both modalities usually outperforms classification utilizing either mode alone. In this paper we describe our methodology and results for the SIGIR eCom Raku…

    Submitted 23 November, 2020; originally announced November 2020.

  6. arXiv:1603.04992  [pdf, other]

    cs.CV

    Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

    Authors: Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, Ian Reid

    Abstract: A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose an unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground truth depths. We achieve this by training the network in a mann…

    Submitted 28 July, 2016; v1 submitted 16 March, 2016; originally announced March 2016.

    Comments: Accepted for publication at ECCV 2016