Skip to main content

Showing 1–5 of 5 results for author: Satzoda, R K

.
  1. arXiv:2410.03061  [pdf, other

    cs.CV cs.CL

    DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

    Authors: Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto

    Abstract: Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new fra… ▽ More

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Accepted to EMNLP 2024

  2. arXiv:2406.19150  [pdf, other

    cs.CV cs.AI cs.IR

    RAVEN: Multitask Retrieval Augmented Vision-Language Learning

    Authors: Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju

    Abstract: The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. Existing methods focus on models designed for single tasks. Furthermore, they're limited by the need for resour… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  3. arXiv:2307.07929  [pdf, other

    cs.CV

    DocTr: Document Transformer for Structured Information Extraction in Documents

    Authors: Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan

    Abstract: We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anch… ▽ More

    Submitted 15 July, 2023; originally announced July 2023.

  4. arXiv:2302.07387  [pdf, other

    cs.CV

    PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

    Authors: Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha

    Abstract: In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens… ▽ More

    Submitted 27 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: CVPR 2023. Project Page: https://polyformer.github.io/

  5. arXiv:1709.07502  [pdf, other

    cs.CV

    A Multimodal, Full-Surround Vehicular Testbed for Naturalistic Studies and Benchmarking: Design, Calibration and Deployment

    Authors: Akshay Rangesh, Kevan Yuen, Ravi Kumar Satzoda, Rakesh Nattoji Rajaram, Pujitha Gunaratne, Mohan M. Trivedi

    Abstract: Recent progress in autonomous and semi-autonomous driving has been made possible in part through an assortment of sensors that provide the intelligent agent with an enhanced perception of its surroundings. It has been clear for quite some while now that for intelligent vehicles to function effectively in all situations and conditions, a fusion of different sensor technologies is essential. Consequ… ▽ More

    Submitted 30 January, 2019; v1 submitted 21 September, 2017; originally announced September 2017.