Showing 1–2 of 2 results for author: Khemlani, S

Searching in archive cs.
  1. arXiv:2505.10453  [pdf, ps, other]

    cs.CV cs.AI

    Vision language models have difficulty recognizing virtual objects

    Authors: Tyler Tran, Sangeet Khemlani, J. G. Trafton

    Abstract: Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question about how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects t…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2504.16061  [pdf, other]

    cs.CV cs.AI

    Vision language models are unreliable at trivial spatial cognition

    Authors: Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

    Abstract: Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of rel…

    Submitted 22 April, 2025; originally announced April 2025.