Showing 1–11 of 11 results for author: Sarch, G

  1. arXiv:2505.24257  [pdf, ps, other]

    cs.CV

    Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

    Authors: Sahithya Ravi, Gabriel Sarch, Vibhav Vineet, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

    Abstract: An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluat…

    Submitted 30 May, 2025; originally announced May 2025.

  2. arXiv:2505.23678  [pdf, ps, other]

    cs.CV

    Grounded Reinforcement Learning for Visual Reasoning

    Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

    Abstract: While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language mode…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Project website: https://visually-grounded-rl.github.io/

  3. arXiv:2505.01578  [pdf, other]

    cs.CV

    Grounding Task Assistance with Multimodal Cues from a Single Demonstration

    Authors: Gabriel Sarch, Balasaravanan Thoravi Kumaravel, Sahithya Ravi, Vibhav Vineet, Andrew D. Wilson

    Abstract: A person's demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Mo…

    Submitted 2 May, 2025; originally announced May 2025.

  4. arXiv:2406.14596  [pdf, other]

    cs.CV cs.AI cs.LG

    VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

    Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

    Abstract: Large-scale LLMs and VLMs excel at few-shot learning but require high-quality examples. We introduce In-Context Abstraction Learning (ICAL), which iteratively refines suboptimal trajectories into high-quality data with optimized actions and detailed reasoning. Given an inefficient demonstration, a VLM corrects actions and annotates causal relationships, object states, subgoals, and task-relevant v…

    Submitted 20 January, 2025; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: Project website: https://ical-learning.github.io/

  5. arXiv:2406.02659  [pdf, other]

    q-bio.NC cs.AI cs.CV

    Reanimating Images using Neural Representations of Dynamic Stimuli

    Authors: Jacob Yeung, Andrew F. Luo, Gabriel Sarch, Margaret M. Henderson, Deva Ramanan, Michael J. Tarr

    Abstract: While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach, BrainNRDS (Brain-Neural Representations of Dynamic Stimuli), leverages state-o…

    Submitted 25 March, 2025; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Project Page: https://brain-nrds.github.io

    Journal ref: CVPR 2025 (oral)

  6. arXiv:2404.19065  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

    Authors: Gabriel Sarch, Sahil Somani, Raghav Kapoor, Michael J. Tarr, Katerina Fragkiadaki

    Abstract: Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELP…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Videos and code: https://helper-agent-llm.github.io/

  7. arXiv:2401.02416  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    ODIN: A Single Model for 2D and 3D Segmentation

    Authors: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki

    Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that…

    Submitted 25 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Camera Ready (CVPR 2024, Highlight)

  8. arXiv:2310.15127  [pdf, other]

    cs.AI cs.CL cs.LG cs.RO

    Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

    Authors: Gabriel Sarch, Yue Wu, Michael J. Tarr, Katerina Fragkiadaki

    Abstract: Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. To parse open-domain natural language and adapt to a user's idiosyncratic procedures, not known during prompt engineering time, fixed prompts fall short. In this paper, we introduce HELPER, an…

    Submitted 20 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Project page with code & videos: https://helper-agent-llm.github.io

  9. arXiv:2309.01782  [pdf, other]

    cs.CV cs.AI cs.LG q-bio.NC

    3D View Prediction Models of the Dorsal Visual Stream

    Authors: Gabriel Sarch, Hsiao-Yu Fish Tung, Aria Wang, Jacob Prince, Michael Tarr

    Abstract: Deep neural network representations align well with brain activity in the ventral visual stream. However, the primate visual system has a distinct dorsal processing stream with different functional properties. To test if a model trained to perceive 3D scene geometry aligns better with neural responses in dorsal visual areas, we trained a self-supervised geometry-aware recurrent neural network (GRN…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: 2023 Conference on Cognitive Computational Neuroscience

  10. arXiv:2207.10761  [pdf, other]

    cs.CV

    TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors

    Authors: Gabriel Sarch, Zhaoyuan Fang, Adam W. Harley, Paul Schydlo, Michael J. Tarr, Saurabh Gupta, Katerina Fragkiadaki

    Abstract: We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules…

    Submitted 21 July, 2022; originally announced July 2022.

  11. arXiv:2012.00057  [pdf, other]

    cs.CV cs.AI cs.LG

    Move to See Better: Self-Improving Embodied Object Detection

    Authors: Zhaoyuan Fang, Ayush Jain, Gabriel Sarch, Adam W. Harley, Katerina Fragkiadaki

    Abstract: Passive methods for object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalization to novel or difficult viewpoints thus requires additional training with lots of annotations. In contrast, humans often recognize objects by simply moving around, to get more informative viewpoints. In this paper, we pr…

    Submitted 29 March, 2021; v1 submitted 30 November, 2020; originally announced December 2020.

    Comments: First three authors contributed equally. Project Page: https://ayushjain1144.github.io/SeeingByMoving/