Showing 1–50 of 52 results for author: Owens, A

  1. arXiv:2506.09989  [pdf, ps, other]

    cs.CV

    Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

    Authors: Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens

    Abstract: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio.…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: CVPR 2025, Project page: https://www.yimingdou.com/hearing_hands/ , Code: https://github.com/Dou-Yiming/hearing_hands/

  2. arXiv:2506.03148  [pdf, ps, other]

    cs.CV

    Self-Supervised Spatial Correspondence Across Modalities

    Authors: Ayush Shrivastava, Andrew Owens

    Abstract: We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for bot…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: CVPR 2025. Project link: https://www.ayshrv.com/cmrw . Code: https://github.com/ayshrv/cmrw

  3. arXiv:2502.18705  [pdf, other]

    cs.HC

    Understanding Children's Avatar Making in Social Online Games

    Authors: Yue Fu, Samuel Schwamm, Amanda Baughan, Nicole M Powell, Zoe Kronberg, Alicia Owens, Emily Renee Izenman, Dania Alsabeh, Elizabeth Hunt, Michael Rich, David Bickham, Jenny Radesky, Alexis Hiniker

    Abstract: Social online games like Minecraft and Roblox have become increasingly integral to children's daily lives. Our study explores how children aged 8 to 13 create and customize avatars in these virtual environments. Through semi-structured interviews and gameplay observations with 48 participants, we investigate the motivations behind children's avatar-making. Our findings show that children's avatar…

    Submitted 11 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  4. arXiv:2501.12390  [pdf, other]

    cs.CV

    GPS as a Control Signal for Image Generation

    Authors: Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens

    Abstract: We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appea…

    Submitted 22 January, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

    Comments: Project page: https://cfeng16.github.io/gps-gen/

  5. arXiv:2412.02700  [pdf, other]

    cs.CV

    Motion Prompting: Controlling Video Generation with Motion Trajectories

    Authors: Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun

    Abstract: Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion c…

    Submitted 27 March, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: CVPR 2025 camera ready. Project page: https://motion-prompting.github.io/

  6. arXiv:2411.17698  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Video-Guided Foley Sound Generation with Multimodal Controls

    Authors: Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

    Abstract: Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley all…

    Submitted 17 March, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted at CVPR 2025. Project site: https://ificl.github.io/MultiFoley/

  7. arXiv:2411.04125  [pdf, ps, other]

    cs.CV

    Community Forensics: Using Thousands of Generators to Train Fake Image Detectors

    Authors: Jeongsoo Park, Andrew Owens

    Abstract: One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior work. As part of creating this dataset, we systematically download t…

    Submitted 9 June, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: 16 pages; CVPR 2025; Project page: https://jespark.net/projects/2024/community_forensics

    Journal ref: In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 8245-8257. 2025

  8. arXiv:2410.11834  [pdf, other]

    cs.RO

    Contrastive Touch-to-Touch Pretraining

    Authors: Samanta Rodriguez, Yiming Dou, William van den Bogert, Miquel Oller, Kevin So, Andrew Owens, Nima Fazeli

    Abstract: Today's tactile sensors have a variety of different designs, making it challenging to develop general-purpose methods for processing touch signals. In this paper, we learn a unified representation that captures the shared information between different tactile sensors. Unlike current approaches that focus on reconstruction or task-specific supervision, we leverage contrastive learning to integrate…

    Submitted 15 October, 2024; originally announced October 2024.

  9. arXiv:2409.16288  [pdf, other]

    cs.CV

    Self-Supervised Any-Point Tracking by Contrastive Random Walks

    Authors: Ayush Shrivastava, Andrew Owens

    Abstract: We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allow…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: ECCV 2024. Project link: https://ayshrv.com/gmrw . Code: https://github.com/ayshrv/gmrw/

  10. arXiv:2409.14592  [pdf, other]

    cs.RO

    Tactile Functasets: Neural Implicit Representations of Tactile Datasets

    Authors: Sikai Li, Samanta Rodriguez, Yiming Dou, Andrew Owens, Nima Fazeli

    Abstract: Modern incarnations of tactile sensors produce high-dimensional raw sensory feedback such as images, making it challenging to efficiently store, process, and generalize across sensors. To address these concerns, we introduce a novel implicit function representation for tactile sensor feedback. Rather than directly using raw tactile images, we propose neural implicit functions trained to reconstruc…

    Submitted 22 September, 2024; originally announced September 2024.

  11. arXiv:2409.14340  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Self-Supervised Audio-Visual Soundscape Stylization

    Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

    Abstract: Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact…

    Submitted 22 September, 2024; originally announced September 2024.

    Comments: ECCV 2024

  12. arXiv:2409.08269  [pdf, other]

    cs.RO

    Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

    Authors: Samanta Rodriguez, Yiming Dou, Miquel Oller, Andrew Owens, Nima Fazeli

    Abstract: Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be…

    Submitted 12 September, 2024; originally announced September 2024.

  13. arXiv:2405.12221  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Images that Sound: Composing Images and Sounds on a Single Canvas

    Authors: Ziyang Chen, Daniel Geng, Andrew Owens

    Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms "images that sound". Our approach is s…

    Submitted 4 February, 2025; v1 submitted 20 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024. Project site: https://ificl.github.io/images-that-sound/

  14. arXiv:2405.08815  [pdf, other]

    cs.CV

    Efficient Vision-Language Pre-training by Cluster Masking

    Authors: Zihao Wei, Zixuan Pan, Andrew Owens

    Abstract: We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself,…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: CVPR 2024, Project page: https://zxp46.github.io/cluster-masking/ , Code: https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking

  15. arXiv:2405.04534  [pdf, other]

    cs.CV

    Tactile-Augmented Radiance Fields

    Authors: Yiming Dou, Fengyu Yang, Yi Liu, Antonio Loquercio, Andrew Owens

    Abstract: We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: CVPR 2024, Project page: https://dou-yiming.github.io/TaRF, Code: https://github.com/Dou-Yiming/TaRF/

  16. arXiv:2404.11615  [pdf, other]

    cs.CV

    Factorized Diffusion: Perceptual Illusions by Noise Decomposition

    Authors: Daniel Geng, Inbum Park, Andrew Owens

    Abstract: Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposin…

    Submitted 10 January, 2025; v1 submitted 17 April, 2024; originally announced April 2024.

    Comments: ECCV 2024 camera ready version + more readable size

  17. arXiv:2403.18821  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

    Authors: Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

    Abstract: We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthes…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024. Project site: https://facebookresearch.github.io/real-acoustic-fields/

  18. arXiv:2401.18085  [pdf, other]

    cs.CV

    Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

    Authors: Daniel Geng, Andrew Owens

    Abstract: Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, the ability to precisely edit the layout, position, pose, and shape of objects in images with diffusion models is still difficult. To this end, we propose motion guidance, a zero-shot technique that allows a…

    Submitted 31 January, 2024; originally announced January 2024.

  19. arXiv:2401.18084  [pdf, other]

    cs.CV cs.RO

    Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

    Authors: Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong

    Abstract: The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and s…

    Submitted 31 January, 2024; originally announced January 2024.

  20. arXiv:2311.17919  [pdf, other]

    cs.CV

    Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

    Authors: Daniel Geng, Inbum Park, Andrew Owens

    Abstract: We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image, and then combine these noise est…

    Submitted 2 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: CVPR 2024 camera ready

  21. arXiv:2311.17056  [pdf, other]

    cs.CV

    Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

    Authors: Zhaoying Pan, Daniel Geng, Andrew Owens

    Abstract: This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far it deviates from the given magnification facto…

    Submitted 28 November, 2023; originally announced November 2023.

    Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  22. arXiv:2309.15117  [pdf, other]

    cs.CV

    Generating Visual Scenes from Touch

    Authors: Fengyu Yang, Jiacheng Zhang, Andrew Owens

    Abstract: An emerging line of work has sought to generate plausible imagery from touch. Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and ap…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: ICCV 2023; Project site: https://fredfyyang.github.io/vision-from-touch/

  23. arXiv:2304.08490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Conditional Generation of Audio from Video via Foley Analogies

    Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

    Abstract: The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributi…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  24. arXiv:2303.17490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The k…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  25. arXiv:2303.11989  [pdf, other]

    cs.CV

    Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

    Authors: Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, Matthias Nießner

    Abstract: We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of…

    Submitted 10 September, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023 (Oral). Video: https://youtu.be/fjRnFL91EZc , Project page: https://lukashoel.github.io/text-to-room/ , Code: https://github.com/lukasHoel/text2room

  26. arXiv:2303.11329  [pdf, other]

    cs.CV cs.SD eess.AS

    Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

    Authors: Ziyang Chen, Shengyi Qian, Andrew Owens

    Abstract: The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of ima…

    Submitted 21 August, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Project site: https://ificl.github.io/SLfM/

  27. arXiv:2301.04647  [pdf, other]

    cs.CV cs.CL

    EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata

    Authors: Chenhao Zheng, Ayush Shrivastava, Andrew Owens

    Abstract: We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outp…

    Submitted 17 June, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: CVPR 2023 (Highlight). Project link: http://hellomuffin.github.io/exif-as-language

  28. arXiv:2301.01767  [pdf, other]

    cs.CV

    Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

    Authors: Chao Feng, Ziyang Chen, Andrew Owens

    Abstract: Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronizati…

    Submitted 27 March, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: CVPR 2023

  29. arXiv:2211.15058  [pdf, other]

    cs.CV

    Mix and Localize: Localizing Sound Sources in Mixtures

    Authors: Xixi Hu, Ziyang Chen, Andrew Owens

    Abstract: We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds c…

    Submitted 27 November, 2022; originally announced November 2022.

    Comments: CVPR 2022

  30. arXiv:2211.12498  [pdf, other]

    cs.CV

    Touch and Go: Learning from Human-Collected Vision and Touch

    Authors: Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, Andrew Owens

    Abstract: The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely b…

    Submitted 29 November, 2022; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: Accepted to the NeurIPS 2022 Datasets and Benchmarks Track

  31. arXiv:2205.05072  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning Visual Styles from Audio-Visual Associations

    Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao

    Abstract: From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn t…

    Submitted 10 May, 2022; originally announced May 2022.

  32. arXiv:2204.12489  [pdf, other]

    cs.CV cs.SD eess.AS

    Sound Localization by Self-Supervised Time Delay Estimation

    Authors: Ziyang Chen, David F. Fouhey, Andrew Owens

    Abstract: Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive rando…

    Submitted 28 January, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

    Comments: ECCV 2022

  33. arXiv:2201.08379  [pdf, other]

    cs.CV

    Learning Pixel Trajectories with Multiscale Contrastive Random Walks

    Authors: Zhangxing Bian, Allan Jabri, Alexei A. Efros, Andrew Owens

    Abstract: A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence. Yet, approaches that dominate each space differ. We take a step towards bridging this gap by extending the recent contrastive random walk formulation to much denser, pixel-level space-time graphs. The main contribution is introducing hierarch…

    Submitted 4 April, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

  34. arXiv:2201.07202  [pdf, other]

    cs.CV

    GANmouflage: 3D Object Nondetection with Texture Fields

    Authors: Rui Guo, Jasmine Collins, Oscar de Lima, Andrew Owens

    Abstract: We propose a method that learns to camouflage 3D objects within scenes. Given an object's shape and a distribution of viewpoints from which it will be seen, we estimate a texture that will make it difficult to detect. Successfully solving this task requires a model that can accurately reproduce textures from the scene, while simultaneously dealing with the highly conflicting constraints imposed by…

    Submitted 23 April, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

  35. arXiv:2111.05846  [pdf, other]

    cs.SD cs.CV cs.MM cs.RO eess.AS

    Structure from Silence: Learning Scene Structure from Ambient Sound

    Authors: Ziyang Chen, Xixi Hu, Andrew Owens

    Abstract: From whirling ceiling fans to ticking clocks, the sounds that we hear subtly vary as we move through a scene. We ask whether these ambient sounds convey information about 3D scene structure and, if so, whether they provide a useful learning signal for multimodal models. To study this, we collect a dataset of paired audio and RGB-D recordings from a variety of quiet indoor scenes. We then train mod…

    Submitted 10 November, 2021; originally announced November 2021.

    Comments: Accepted to CoRL 2021 (Oral Presentation)

  36. arXiv:2104.09498  [pdf, other]

    cs.CV cs.LG

    Comparing Correspondences: Video Prediction with Correspondence-wise Losses

    Authors: Daniel Geng, Max Hamilton, Andrew Owens

    Abstract: Image prediction methods often struggle on tasks that require changing the positions of objects, such as video prediction, producing blurry images that average over the many positions that objects might occupy. In this paper, we propose a simple change to existing image similarity metrics that makes them more robust to positional errors: we match the images using optical flow, then measure the vis…

    Submitted 31 March, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

    Comments: CVPR 2022 Camera Ready

  37. arXiv:2104.02687  [pdf, other]

    cs.CV cs.AI cs.MM

    Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

    Authors: Medhini Narasimhan, Shiry Ginosar, Andrew Owens, Alexei A. Efros, Trevor Darrell

    Abstract: We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance met…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: Project website at https://medhini.github.io/audio_video_textures/

  38. arXiv:2103.14644  [pdf, other]

    cs.CV

    Planar Surface Reconstruction from Sparse Views

    Authors: Linyi Jin, Shengyi Qian, Andrew Owens, David F. Fouhey

    Abstract: The paper studies planar surface reconstruction of indoor scenes from two views with unknown camera poses. While prior approaches have successfully created object-centric reconstructions of many scenes, they fail to exploit other structures, such as planes, which are typically the dominant components of indoor scenes. In this paper, we reconstruct planar surfaces from multiple views, while jointly…

    Submitted 20 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Accepted to ICCV 2021 (Oral Presentation)

  39. arXiv:2008.04237  [pdf, other]

    cs.CV cs.SD eess.AS

    Self-Supervised Learning of Audio-Visual Objects from Video

    Authors: Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

    Abstract: Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented…

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: ECCV 2020

  40. arXiv:2006.14613  [pdf, other]

    cs.CV cs.LG eess.IV

    Space-Time Correspondence as a Contrastive Random Walk

    Authors: Allan Jabri, Andrew Owens, Alexei A. Efros

    Abstract: This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines tra…

    Submitted 3 December, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

    Comments: NeurIPS 2020 camera ready version -- Code at github.com/ajabri/videowalk

  41. arXiv:1912.11035  [pdf, other]

    cs.CV

    CNN-generated images are surprisingly easy to spot... for now

    Authors: Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, Alexei A. Efros

    Abstract: In this work we ask whether it is possible to create a "universal" detector for telling apart real images from those generated by a CNN, regardless of architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN,…

    Submitted 4 April, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

    Comments: Accepted to CVPR 2020

  42. arXiv:1906.05856  [pdf, other]

    cs.CV

    Detecting Photoshopped Faces by Scripting Photoshop

    Authors: Sheng-Yu Wang, Oliver Wang, Andrew Owens, Richard Zhang, Alexei A. Efros

    Abstract: Most malicious photo manipulations are created using standard image editing tools, such as Adobe Photoshop. We present a method for detecting one very popular Photoshop manipulation (image warping applied to human faces) using a model trained entirely on fake images that were automatically generated by scripting Photoshop itself. We show that our model outperforms humans at the task of reco…

    Submitted 5 September, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

  43. arXiv:1906.04160  [pdf, other]

    cs.CV cs.LG eess.AS

    Learning Individual Styles of Conversational Gesture

    Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik

    Abstract: Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system.…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: CVPR 2019

  44. arXiv:1809.05491  [pdf, other]

    cs.HC cs.CV cs.GR

    MoSculp: Interactive Visualization of Shape and Time

    Authors: Xiuming Zhang, Tali Dekel, Tianfan Xue, Andrew Owens, Qiurui He, Jiajun Wu, Stefanie Mueller, William T. Freeman

    Abstract: We present a system that allows users to visualize complex human motion via 3D motion sculptures -- a representation that conveys the 3D structure swept by a human body as it moves through space. Given an input video, our system computes the motion sculptures and provides a user interface for rendering it in different styles, including the options to insert the sculpture back into the original vide…

    Submitted 2 January, 2019; v1 submitted 14 September, 2018; originally announced September 2018.

    Comments: UIST 2018. Project page: http://mosculp.csail.mit.edu/

  45. arXiv:1805.11085  [pdf, other]

    cs.RO cs.LG stat.ML

    More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch

    Authors: Roberto Calandra, Andrew Owens, Dinesh Jayaraman, Justin Lin, Wenzhen Yuan, Jitendra Malik, Edward H. Adelson, Sergey Levine

    Abstract: For humans, the process of grasping an object relies heavily on rich tactile feedback. Most recent robotic grasping work, however, has been based only on visual input, and thus cannot easily benefit from feedback after initiating contact. In this paper, we investigate how a robot can learn to use tactile information to iteratively and efficiently adjust its grasp. To this end, we propose an end-to…

    Submitted 26 July, 2018; v1 submitted 28 May, 2018; originally announced May 2018.

    Comments: 8 pages. Published in IEEE Robotics and Automation Letters (RAL). Website: https://sites.google.com/view/more-than-a-feeling

  46. arXiv:1805.04096  [pdf, other]

    cs.CV

    Fighting Fake News: Image Splice Detection via Learned Self-Consistency

    Authors: Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros

    Abstract: Advances in photo editing and manipulation tools have made it significantly easier to create fake imagery. Learning to detect such manipulations, however, remains a challenging problem due to the lack of sufficient amounts of manipulated training data. In this paper, we propose a learning algorithm for detecting visual image manipulations that is trained only using a large dataset of real photogra…

    Submitted 5 September, 2018; v1 submitted 10 May, 2018; originally announced May 2018.

  47. arXiv:1804.03641  [pdf, other]

    cs.CV cs.SD eess.AS

    Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

    Authors: Andrew Owens, Alexei A. Efros

    Abstract: The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-sup…

    Submitted 9 October, 2018; v1 submitted 10 April, 2018; originally announced April 2018.

  48. arXiv:1712.07271  [pdf, other]

    cs.CV

    Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

    Authors: Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

    Abstract: The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, throug…

    Submitted 19 December, 2017; originally announced December 2017.

    Comments: Journal preprint of arXiv:1608.07017 (unpublished submission to IJCV)

  49. arXiv:1710.05512  [pdf, other]

    cs.RO cs.CV cs.LG stat.ML

    The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?

    Authors: Roberto Calandra, Andrew Owens, Manu Upadhyaya, Wenzhen Yuan, Justin Lin, Edward H. Adelson, Sergey Levine

    Abstract: A successful grasp requires careful balancing of the contact forces. Deducing whether a particular grasp will be successful from indirect measurements, such as vision, is therefore quite challenging, and direct sensing of contacts through touch sensing provides an appealing avenue toward more successful and consistent robotic grasping. However, in order to fully evaluate the value of touch sensing…

    Submitted 4 March, 2025; v1 submitted 16 October, 2017; originally announced October 2017.

    Comments: 10 pages, published at the 1st Annual Conference on Robot Learning (CoRL), Code and dataset available at: https://lasr.org/research/feeling-of-success

  50. Shape-independent Hardness Estimation Using Deep Learning and a GelSight Tactile Sensor

    Authors: Wenzhen Yuan, Chenzhuo Zhu, Andrew Owens, Mandayam A. Srinivasan, Edward H. Adelson

    Abstract: Hardness is among the most important attributes of an object that humans learn about through touch. However, approaches for robots to estimate hardness are limited, due to the lack of information provided by current tactile sensors. In this work, we address these limitations by introducing a novel method for hardness estimation, based on the GelSight tactile sensor, and the method does not require…

    Submitted 12 April, 2017; originally announced April 2017.