Showing 1–1 of 1 results for author: Ilaslan, M F

  1. arXiv:2412.11621  [pdf, other]

    cs.CV cs.MM

    VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

    Authors: Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

    Abstract: Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method, which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and vide…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted for The 39th Annual AAAI Conference on Artificial Intelligence 2025 in Main Track, 19 pages, 24 figures