Showing 1–50 of 136 results for author: Grauman, K

  1. arXiv:2506.03340  [pdf, ps, other]

    cs.CV

    Seeing the Arrow of Time in Large Multimodal Models

    Authors: Zihui Xue, Mi Luo, Kristen Grauman

    Abstract: The Arrow of Time (AoT) -- time's irreversible flow shaping physical events -- is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a cri…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Project website: https://vision.cs.utexas.edu/projects/SeeAoT

  2. arXiv:2506.00717  [pdf, ps, other]

    cs.HC cs.CV

    Vid2Coach: Transforming How-To Videos into Task Assistants

    Authors: Mina Huh, Zihui Xue, Ujjaini Das, Kumar Ashutosh, Kristen Grauman, Amy Pavel

    Abstract: People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds,…

    Submitted 31 May, 2025; originally announced June 2025.

  3. arXiv:2504.13180  [pdf, other]

    cs.CV cs.AI cs.LG

    PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

    Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, et al. (4 additional authors not shown)

    Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Technical report

  4. arXiv:2504.05451  [pdf, other]

    cs.CV

    Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation

    Authors: Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

    Abstract: Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric t…

    Submitted 7 April, 2025; originally announced April 2025.

  5. arXiv:2503.13821  [pdf, other]

    cs.CV

    Stitch-a-Recipe: Video Demonstration from Multistep Descriptions

    Authors: Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

    Abstract: When obtaining visual illustrations from text descriptions, today's methods take a description with -- a single text context caption, or an action description -- and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe composed of multiple steps. Furthermore, simply handling each step description in iso…

    Submitted 17 March, 2025; originally announced March 2025.

  6. arXiv:2503.11953  [pdf, other]

    cs.CV

    SPOC: Spatially-Progressing Object State Change Segmentation in Video

    Authors: Priyanka Mandikal, Tushar Nagarajan, Alex Stoken, Zihui Xue, Kristen Grauman

    Abstract: Object state changes in video reveal critical information about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., the unchopped avocado) versus when it has completed a state change (e.g., the chopped avocado), which limits applicability for any task requiring detailed information about the progress of the acti…

    Submitted 14 March, 2025; originally announced March 2025.

  7. arXiv:2412.18386  [pdf, other]

    cs.CV

    Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

    Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled -- but human-edited -- video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then di…

    Submitted 22 April, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

  8. arXiv:2412.02071  [pdf, other]

    cs.CV

    Progress-Aware Video Frame Captioning

    Authors: Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman

    Abstract: While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression…

    Submitted 25 March, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

    Comments: Accepted by CVPR 2025, Project website: https://vision.cs.utexas.edu/projects/ProgressCaptioner/

  9. arXiv:2412.00932  [pdf, other]

    cs.CV

    FIction: 4D Future Interaction Prediction from Video

    Authors: Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman

    Abstract: Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames -- capturing physically ungrounded predictions of "what" and ignoring the "where" and "how". We introduce FIction for 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predi…

    Submitted 11 April, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

    Comments: CVPR 2025 (Highlight)

  10. arXiv:2411.08753  [pdf, other]

    cs.CV

    Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

    Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman

    Abstract: Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis i…

    Submitted 9 April, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

    Comments: Accepted to CVPR 2025 (Highlight)

  11. arXiv:2410.14045  [pdf, other]

    cs.CV cs.LG

    Human Action Anticipation: A Survey

    Authors: Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, James M. Rehg, Kris Kitani, Kristen Grauman, Ruta Desai, Miao Liu

    Abstract: Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, goal prediction, and so on. Our survey aims to tie together this f…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 30 pages, 9 figures, 12 tables

  12. arXiv:2408.00672  [pdf, other]

    cs.CV

    ExpertAF: Expert Actionable Feedback from Video

    Authors: Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman

    Abstract: Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer…

    Submitted 11 April, 2025; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: CVPR 2025

  13. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations…

    Submitted 25 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound. ECCV 2024 camera-ready version

  14. arXiv:2406.07754  [pdf, other]

    cs.CV

    HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

    Authors: Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman

    Abstract: We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when objec…

    Submitted 8 November, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by NeurIPS 2024, Project website: https://vision.cs.utexas.edu/projects/HOI-Swap/

  15. arXiv:2405.02821  [pdf, other]

    cs.SD cs.AI cs.LG cs.RO eess.AS

    Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

    Authors: Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

    Abstract: Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans a…

    Submitted 10 September, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

    Comments: Camera ready version for IROS 2024. Project page: https://vision.cs.utexas.edu/projects/sim2real/

  16. arXiv:2404.16216  [pdf, other]

    cs.CV cs.RO cs.SD eess.AS

    ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

    Authors: Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

    Abstract: An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to inte…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/active_rir/

  17. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  18. arXiv:2403.06351  [pdf, other]

    cs.CV

    Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

    Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman

    Abstract: We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 22 pages

  19. arXiv:2401.01823  [pdf, other]

    cs.CV

    Detours for Navigating Instructional Videos

    Authors: Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman

    Abstract: We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve t…

    Submitted 4 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: CVPR 2024

  20. arXiv:2312.11782  [pdf, other]

    cs.CV

    Learning Object State Changes in Videos: An Open-World Perspective

    Authors: Zihui Xue, Kumar Ashutosh, Kristen Grauman

    Abstract: Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects, current approaches are confined to a closed vocabulary. Addressing this gap, we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC -- the object's initial state, i…

    Submitted 3 April, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by CVPR 2024, Project website: https://vision.cs.utexas.edu/projects/VidOSC/

  21. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 25 September, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Expanded manuscript (compared to arxiv v1 from Nov 2023 and CVPR 2024 paper from June 2024) for more comprehensive dataset and benchmark presentation, plus new results on v2 data release

  22. arXiv:2307.15064  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Self-Supervised Visual Acoustic Matching

    Authors: Arjun Somayazulu, Changan Chen, Kristen Grauman

    Abstract: Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised ap…

    Submitted 23 November, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: Project page: https://vision.cs.utexas.edu/projects/ss_vam/ . Accepted at NeurIPS 2023

  23. arXiv:2307.08763  [pdf, other]

    cs.CV

    Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

    Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman

    Abstract: Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a predefined sequ…

    Submitted 29 October, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  24. arXiv:2307.04760  [pdf, other]

    cs.CV cs.SD eess.AS

    Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downst…

    Submitted 5 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to CVPR 2024

  25. arXiv:2306.15850  [pdf, other]

    cs.CV

    SpotEM: Efficient Video Search for Episodic Memory

    Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

    Abstract: The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve effici…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Published in ICML 2023

  26. arXiv:2306.09324  [pdf, other]

    cs.CV

    Single-Stage Visual Query Localization in Egocentric Videos

    Authors: Hanwen Jiang, Santhosh Kumar Ramakrishnan, Kristen Grauman

    Abstract: Visual Query Localization on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital to build episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline re…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Winner of Ego4D VQ2D challenge 2023

  27. arXiv:2306.05526  [pdf, other]

    cs.CV

    Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

    Authors: Zihui Xue, Kristen Grauman

    Abstract: The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations to link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that…

    Submitted 25 November, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted by NeurIPS 2023, Project website: https://vision.cs.utexas.edu/projects/AlignEgoExo/

  28. arXiv:2302.01891  [pdf, other]

    cs.CV

    Egocentric Video Task Translation @ Ego4D Challenge 2022

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: This technical report describes the EgoTask Translation approach that explores relations among a set of egocentric video tasks in the Ego4D challenge. To improve the primary task of interest, we propose to leverage existing models developed for other related tasks and design a task translator that learns to "translate" auxiliary task features to the primary task. With no modification to the base…

    Submitted 3 February, 2023; originally announced February 2023.

    Comments: The technical report of ECCV@2022 Ego4D challenge

  29. arXiv:2301.08730  [pdf, other]

    cs.CV cs.SD eess.AS

    Novel-View Acoustic Synthesis

    Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

    Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benc…

    Submitted 24 October, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

    Comments: Accepted at CVPR 2023. Project page: https://vision.cs.utexas.edu/projects/nvas

  30. A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

    Authors: Megan M. Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien M. R. Arnold, Ese Ben-Iwhiwhu, Andrew P. Brna, Ethan Brooks, Ryan C. Brown, Zachary Daniels, Anurag Daram, Fabien Delattre, Ryan Dellana, Eric Eaton, Haotian Fu, Kristen Grauman, Jesse Hostetler, Shariq Iqbal, Cassandra Kent, Nicholas Ketz, Soheil Kolouri, George Konidaris, Dhireesha Kudithipudi, Erik Learned-Miller, Seungwon Lee, et al. (22 additional authors not shown)

    Abstract: Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through th…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: To appear in Neural Networks

  31. arXiv:2301.02311  [pdf, other]

    cs.CV

    HierVL: Learning Hierarchical Video-Language Embeddings

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos acc…

    Submitted 8 June, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: CVPR 2023

  32. arXiv:2301.02307  [pdf, other]

    cs.CV

    What You Say Is What You Show: Visual Narration Detection in Instructional Videos

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies. However, this data is extremely noisy, as the narrations do not always describe the actions demonstrated in the video. To address this problem we introduce the novel task of visual narration detection, which entails determining w…

    Submitted 18 July, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Technical Report

  33. arXiv:2301.02217  [pdf, other]

    cs.CV

    EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

    Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman

    Abstract: Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Tech report. Project page: https://vision.cs.utexas.edu/projects/egodistill

  34. arXiv:2301.02184  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

    Authors: Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu

    Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multi…

    Submitted 20 April, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: Accepted to CVPR 2023

  35. arXiv:2301.00746  [pdf, other]

    cs.CV

    NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

    Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

    Abstract: Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window output…

    Submitted 25 March, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: 13 pages, 7 figures, appearing in CVPR 2023

  36. arXiv:2212.06301  [pdf, other]

    cs.CV

    Egocentric Video Task Translation

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, naviga…

    Submitted 6 April, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023 (Highlight), Project website: https://vision.cs.utexas.edu/projects/egot2/

  37. arXiv:2212.04492  [pdf, other]

    cs.CV

    Few-View Object Reconstruction with Unknown Categories and Camera Poses

    Authors: Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, Yuke Zhu

    Abstract: While object reconstruction has made great strides in recent years, current methods typically require densely captured images and/or known camera poses, and generalize poorly to novel object categories. To step toward object reconstruction in the wild, this work explores reconstructing general real-world objects from a few images without known camera poses or object categories. The crux of our wor…

    Submitted 25 January, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

  38. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  39. arXiv:2207.11365  [pdf, other]

    cs.CV

    EgoEnv: Human-centric environment representations from egocentric video

    Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

    Abstract: First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocen…

    Submitted 9 November, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

    Comments: Published in NeurIPS 2023 (Oral)

  40. arXiv:2206.08312  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

    Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, Kristen Grauman

    Abstract: We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, m…

    Submitted 23 January, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces2

  41. arXiv:2206.04006  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Few-Shot Audio-Visual Learning of Environment Acoustics

    Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed…

    Submitted 24 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted to NeurIPS 2022

  42. arXiv:2202.06875  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Visual Acoustic Matching

    Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

    Abstract: We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal tr…

    Submitted 13 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: Project page: https://vision.cs.utexas.edu/projects/visual-acoustic-matching. Accepted at CVPR 2022

  43. arXiv:2202.02440  [pdf, other]

    cs.CV cs.AI cs.LG

    Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

    Authors: Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman

    Abstract: In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality.…

    Submitted 28 April, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: CVPR 2022. Project page: https://vision.cs.utexas.edu/projects/zsel/

  44. arXiv:2202.00850  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS eess.IV

    Active Audio-Visual Separation of Dynamic Sound Sources

    Authors: Sagnik Majumder, Kristen Grauman

    Abstract: We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs…

    Submitted 25 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: Accepted to ECCV 2022

  45. arXiv:2202.00164  [pdf, other]

    cs.RO cs.CV

    DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

    Authors: Priyanka Mandikal, Kristen Grauman

    Abstract: Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing…

    Submitted 31 January, 2022; originally announced February 2022.

  46. arXiv:2201.10029  [pdf, other]

    cs.CV cs.AI

    PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

    Authors: Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman

    Abstract: State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of `where to look?' for an object and `how to navigate to (x, y)?'. Our key insight is that…

    Submitted 17 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages + supplementary. Accepted in CVPR 2022

  47. arXiv:2111.10882  [pdf, other]

    cs.CV cs.SD eess.AS

    Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

    Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

    Abstract: Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach ex…

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: Published in BMVC 2021, project page: http://vision.cs.utexas.edu/projects/geometry-aware-binaural/

  48. arXiv:2110.07692  [pdf, other]

    cs.CV cs.RO

    Shaping embodied agent behavior with activity-context priors from egocentric video

    Authors: Tushar Nagarajan, Kristen Grauman

    Abstract: Complex physical tasks entail a sequence of object interactions, each with its own preconditions -- which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human worn cameras. For a given object, an activity-context prior represents the set of oth…

    Submitted 14 October, 2021; originally announced October 2021.

  49. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  50. arXiv:2107.02739  [pdf, other]

    econ.EM cs.CV

    Shapes as Product Differentiation: Neural Network Embedding in the Analysis of Markets for Fonts

    Authors: Sukjin Han, Eric H. Schulman, Kristen Grauman, Santhosh Ramakrishnan

    Abstract: Many differentiated products have key attributes that are unstructured and thus high-dimensional (e.g., design, text). Instead of treating unstructured attributes as unobservables in economic models, quantifying them can be important to answer interesting economic questions. To propose an analytical framework for these types of products, this paper considers one of the simplest design products-fon…

    Submitted 7 March, 2024; v1 submitted 6 July, 2021; originally announced July 2021.