Showing 1–50 of 164 results for author: Black, M J

  1. arXiv:2506.13040  [pdf, ps, other]

    cs.CV

    MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

    Authors: Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Michael J. Black

    Abstract: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent lear…

    Submitted 24 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

  2. arXiv:2505.06166  [pdf, other]

    cs.CV

    DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

    Authors: Radu Alexandru Rosu, Keyu Wu, Yao Feng, Youyi Zheng, Michael J. Black

    Abstract: We address the task of generating 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted to CVPR 2025

  3. arXiv:2504.17695  [pdf, other]

    cs.CV

    PICO: Reconstructing 3D People In Contact with Objects

    Authors: Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios Tzionas

    Abstract: Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: Accepted in CVPR'25. Project Page: https://pico.is.tue.mpg.de

  4. arXiv:2504.13386  [pdf, other]

    cs.GR cs.CV

    Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

    Authors: Radek Daněček, Carolin Schmitt, Senya Polikovsky, Michael J. Black

    Abstract: In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync qualit…

    Submitted 22 May, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  5. arXiv:2504.13152  [pdf, other]

    cs.CV

    St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

    Authors: Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, Angjoo Kanazawa

    Abstract: Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different momen…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Project page: https://St4RTrack.github.io/

  6. arXiv:2504.06397  [pdf, other]

    cs.CV

    PromptHMR: Promptable Human Mesh Recovery

    Authors: Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, Muhammed Kocabas

    Abstract: Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot…

    Submitted 23 May, 2025; v1 submitted 8 April, 2025; originally announced April 2025.

    Comments: CVPR 2025. Project website: https://yufu-wang.github.io/phmr-page

  7. arXiv:2504.05303  [pdf, other]

    cs.CV

    InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

    Authors: Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, Dimitrios Tzionas

    Abstract: We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual label…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: CVPR 2025

  8. arXiv:2503.17544  [pdf, other]

    cs.CV cs.AI

    PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

    Authors: Yan Zhang, Yao Feng, Alpár Cseke, Nitin Saini, Nathan Bajandas, Nicolas Heron, Michael J. Black

    Abstract: To build a motor system of the interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural m…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 18 pages

  9. arXiv:2503.10624  [pdf, other]

    cs.CV cs.AI cs.GR

    ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

    Authors: Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu

    Abstract: Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that est…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Page: https://boqian-li.github.io/ETCH/, Code: https://github.com/boqian-li/ETCH

  10. arXiv:2501.08329  [pdf, other]

    cs.CV

    Predicting 4D Hand Trajectory from Monocular Videos

    Authors: Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black

    Abstract: We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity…

    Submitted 14 January, 2025; originally announced January 2025.

  11. arXiv:2412.17811  [pdf, other]

    cs.CV

    ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

    Authors: Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J. Black, Yao Feng

    Abstract: We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text…

    Submitted 3 April, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: CVPR 2025

  12. arXiv:2412.11785  [pdf, other]

    cs.CV

    InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

    Authors: Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya

    Abstract: Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which negle…

    Submitted 4 April, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

  13. arXiv:2412.11224  [pdf, ps, other]

    cs.CV cs.GR

    GenLit: Reformulating Single-Image Relighting as Video Generation

    Authors: Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, Michael J. Black

    Abstract: Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be…

    Submitted 20 June, 2025; v1 submitted 15 December, 2024; originally announced December 2024.

  14. arXiv:2412.08101  [pdf, other]

    cs.CV cs.LG

    Generative Zoo

    Authors: Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, Peter Kulits

    Abstract: The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impo…

    Submitted 10 December, 2024; originally announced December 2024.

    Comments: 12 pages; project page: https://genzoo.is.tue.mpg.de

  15. arXiv:2411.18807  [pdf, other]

    cs.CV cs.CL

    Reconstructing Animals and the Wild

    Authors: Peter Kulits, Michael J. Black, Silvia Zuffi

    Abstract: The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consi…

    Submitted 27 November, 2024; originally announced November 2024.

    Comments: 12 pages; project page: https://raw.is.tue.mpg.de/

  16. arXiv:2411.08128  [pdf, other]

    cs.CV

    CameraHMR: Aligning People with Perspective

    Authors: Priyanka Patel, Michael J. Black

    Abstract: We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT ac…

    Submitted 12 November, 2024; originally announced November 2024.

    Comments: 3DV 2025

  17. arXiv:2409.08189  [pdf, other]

    cs.CV cs.GR

    Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photorealistic Appearance from Multi-View Video

    Authors: Boxiang Rong, Artur Grigorev, Wenbo Wang, Michael J. Black, Bernhard Thomaszewski, Christina Tsalicoglou, Otmar Hilliges

    Abstract: We introduce Gaussian Garments, a novel approach for reconstructing realistic simulation-ready garment assets from multi-view videos. Our method represents garments with a combination of a 3D mesh and a Gaussian texture that encodes both the color and high-frequency surface details. This representation enables accurate registration of garment geometries to multi-view videos and helps disentangle a…

    Submitted 12 September, 2024; originally announced September 2024.

  18. arXiv:2409.03944  [pdf, other]

    cs.CV cs.AI

    HUMOS: Human Motion Model Conditioned on Body Shape

    Authors: Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael J. Black, Daniel Holden, Carsten Stoll

    Abstract: Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics…

    Submitted 3 April, 2025; v1 submitted 5 September, 2024; originally announced September 2024.

    Comments: Accepted in ECCV'24. Project page: https://CarstenEpic.github.io/humos/

  19. arXiv:2408.08313  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Can Large Language Models Understand Symbolic Graphics Programs?

    Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf

    Abstract: Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of L…

    Submitted 27 May, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

    Comments: ICLR 2025 Spotlight (v4: 47 pages, 26 figures, project page: https://sgp-bench.github.io/)

  20. arXiv:2408.00712  [pdf, other]

    cs.CV cs.GR

    MotionFix: Text-Driven 3D Human Motion Editing

    Authors: Nikos Athanasiou, Alpár Cseke, Markos Diomataris, Michael J. Black, Gül Varol

    Abstract: The focus of this paper is on 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. In this paper, we address both challenges. We propose a methodology to semi-…

    Submitted 24 November, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: SIGGRAPH Asia 2024 Camera Ready, Project page: https://motionfix.is.tue.mpg.de

  21. arXiv:2406.08472  [pdf, other]

    cs.LG cs.AI cs.RO

    RILe: Reinforced Imitation Learning

    Authors: Mert Albaba, Sammy Christen, Thomas Langarek, Christoph Gebhardt, Otmar Hilliges, Michael J. Black

    Abstract: Acquiring complex behaviors is essential for artificially intelligent agents, yet learning these behaviors in high-dimensional settings poses a significant challenge due to the vast search space. Traditional reinforcement learning (RL) requires extensive manual effort for reward function engineering. Inverse reinforcement learning (IRL) uncovers reward functions from expert demonstrations but reli…

    Submitted 21 April, 2025; v1 submitted 12 June, 2024; originally announced June 2024.

  22. arXiv:2405.14869  [pdf, other]

    cs.CV cs.AI cs.GR

    PuzzleAvatar: Assembling 3D Avatars from Personal Albums

    Authors: Yuliang Xiu, Yufei Ye, Zhen Liu, Dimitrios Tzionas, Michael J. Black

    Abstract: Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar i…

    Submitted 14 September, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: Page: https://puzzleavatar.is.tue.mpg.de/, Code: https://github.com/YuliangXiu/PuzzleAvatar, Video: https://youtu.be/0hpXH2tVPk4

  23. ContourCraft: Learning to Resolve Intersections in Neural Multi-Garment Simulations

    Authors: Artur Grigorev, Giorgio Becherini, Michael J. Black, Otmar Hilliges, Bernhard Thomaszewski

    Abstract: Learning-based approaches to cloth simulation have started to show their potential in recent years. However, handling collisions and intersections in neural simulations remains a largely unsolved problem. In this work, we present ContourCraft, a learning-based solution for handling intersections in neural cloth simulations. Unlike conventional approaches that critically rely on intersection-free inp…

    Submitted 24 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted for publication by SIGGRAPH 2024, conference track

  24. arXiv:2405.04533  [pdf, ps, other]

    cs.CV cs.LG

    ChatHuman: Chatting about 3D Humans with Tools

    Authors: Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black

    Abstract: Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabil…

    Submitted 29 May, 2025; v1 submitted 7 May, 2024; originally announced May 2024.

    Comments: Project page: https://chathuman.github.io

  25. arXiv:2404.16752  [pdf, other]

    cs.CV

    TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

    Authors: Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, Michael J. Black

    Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an ap…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  26. arXiv:2404.15383  [pdf, other]

    cs.CV cs.AI

    WANDR: Intention-guided Human Motion Generation

    Authors: Markos Diomataris, Nikos Athanasiou, Omid Taheri, Xi Wang, Otmar Hilliges, Michael J. Black

    Abstract: Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address…

    Submitted 23 April, 2024; originally announced April 2024.

  27. arXiv:2404.15228  [pdf, other]

    cs.CV cs.CL

    Re-Thinking Inverse Graphics With Large Language Models

    Authors: Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Abrevaya, Michael J. Black

    Abstract: Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Successfully disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects of the 3D scene that produced it, requires a comprehensive understa…

    Submitted 23 August, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: TMLR camera-ready; 31 pages; project page: https://ig-llm.is.tue.mpg.de/

  28. arXiv:2404.10685  [pdf, other]

    cs.CV cs.GR

    Generating Human Interaction Motions in Scenes with Text Control

    Authors: Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe

    Abstract: We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model,…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Project Page: https://research.nvidia.com/labs/toronto-ai/tesmo/

  29. arXiv:2404.03042  [pdf, other]

    cs.CV

    AWOL: Analysis WithOut synthesis using Language

    Authors: Silvia Zuffi, Michael J. Black

    Abstract: Many classical parametric 3D shape models exist, but creating novel shapes with such models requires expert knowledge of their parameters. For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. This involves learning a mappin…

    Submitted 3 April, 2024; originally announced April 2024.

  30. arXiv:2403.14611  [pdf, other]

    cs.CV

    Explorative Inbetweening of Time and Space

    Authors: Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, Xuaner Zhang

    Abstract: We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strateg…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: project page at https://time-reversal.github.io

  31. arXiv:2401.08559  [pdf, other]

    cs.CV cs.GR cs.LG

    Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

    Authors: Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, Davis Rempe

    Abstract: Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To…

    Submitted 24 May, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

    Comments: CVPR 2024, HuMoGen Workshop

  32. arXiv:2401.00374  [pdf, other]

    cs.CV

    EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

    Authors: Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black

    Abstract: We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements,…

    Submitted 30 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: Fix typos; Conflict of Interest Disclosure; CVPR Camera Ready; Project Page: https://pantomatrix.github.io/EMAGE/

  33. arXiv:2312.16737  [pdf, other]

    cs.CV

    HMP: Hand Motion Priors for Pose and Shape Estimation from Video

    Authors: Enes Duran, Muhammed Kocabas, Vasileios Choutas, Zicong Fan, Michael J. Black

    Abstract: Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos have useful cues to address the aforementioned issues. However, existing video-based 3D hand datasets are insufficient for tr…

    Submitted 27 December, 2023; originally announced December 2023.

    Journal ref: WACV 2024

  34. arXiv:2312.14579  [pdf, other]

    cs.CV

    Synthesizing Environment-Specific People in Photographs

    Authors: Mirela Ostrek, Carol O'Sullivan, Michael J. Black, Justus Thies

    Abstract: We present ESP, a novel method for context-aware full-body generation that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing i…

    Submitted 26 September, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted at ECCV 2024, Project: https://esp.is.tue.mpg.de

  35. arXiv:2312.11666  [pdf, other]

    cs.CV cs.GR

    HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

    Authors: Vanessa Sklyarova, Egor Zakharov, Otmar Hilliges, Michael J. Black, Justus Thies

    Abstract: We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: For more results please refer to the project page https://haar.is.tue.mpg.de/

  36. arXiv:2312.07531  [pdf, other]

    cs.CV

    WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

    Authors: Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

    Abstract: The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines,…

    Submitted 18 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

  37. arXiv:2312.04466  [pdf, other]

    cs.CV

    Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

    Authors: Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart

    Abstract: Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusi…

    Submitted 1 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2024. Webpage: https://amuse.is.tue.mpg.de/

  38. arXiv:2311.18836  [pdf, other]

    cs.CV

    ChatPose: Chatting about 3D Human Pose

    Authors: Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black

    Abstract: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional huma…

    Submitted 23 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Home page: https://yfeng95.github.io/ChatPose/

  39. arXiv:2311.18448  [pdf, other]

    cs.CV

    HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

    Authors: Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges

    Abstract: Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction…

    Submitted 30 November, 2023; originally announced November 2023.

  40. arXiv:2311.06243  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

    Authors: Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf

    Abstract: Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly larg…

    Submitted 28 April, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

    Comments: ICLR 2024 (v2: 34 pages, 19 figures)

  41. FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

    Authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya

    Abstract: Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are sl…

    Submitted 27 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: 15 pages, Accepted: ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 2023

    Journal ref: Volume 42, article number 204, year 2023

  42. arXiv:2310.15168  [pdf, other]

    cs.CV cs.GR cs.LG

    Ghost on the Shell: An Expressive Representation of General 3D Shapes

    Authors: Zhen Liu, Yao Feng, Yuliang Xiu, Weiyang Liu, Liam Paull, Michael J. Black, Bernhard Schölkopf

    Abstract: The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D s…

    Submitted 24 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 Oral (v3: 30 pages, 19 figures, Project Page: https://gshell3d.github.io/)

  43. arXiv:2310.13768  [pdf, other]

    cs.CV

    PACE: Human and Camera Motion Estimation from in-the-wild Videos

    Authors: Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, Umar Iqbal

    Abstract: We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: 3DV 2024. Project page: https://nvlabs.github.io/PACE/

  44. arXiv:2310.09449  [pdf, other]

    cs.CV cs.LG

    Pairwise Similarity Learning is SimPLE

    Authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf

    Abstract: In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples w…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: Published in ICCV 2023 (Project page: https://simple.is.tue.mpg.de/)

  45. arXiv:2309.15273  [pdf, other]

    cs.CV

    DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

    Authors: Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black

    Abstract: Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. I…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted as Oral in ICCV'23. Project page: https://deco.is.tue.mpg.de

  46. arXiv:2309.07125  [pdf, other]

    cs.CV

    Text-Guided Generation and Editing of Compositional 3D Avatars

    Authors: Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black

    Abstract: Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach,…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Home page: https://yfeng95.github.io/teca

  47. arXiv:2309.06441  [pdf, other]

    cs.CV cs.AI cs.GR

    Learning Disentangled Avatars with Hybrid 3D Representations

    Authors: Yao Feng, Weiyang Liu, Timo Bolkart, Jinlong Yang, Marc Pollefeys, Michael J. Black

    Abstract: Tremendous efforts have been made to learn animatable and photorealistic human avatars. Towards this end, both explicit and implicit 3D representations are heavily studied for a holistic modeling and capture of the whole human (e.g., body, clothing, face and hair), but neither representation is an optimal choice in terms of representation efficacy since different parts of the human avatar have dif…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: home page: https://yfeng95.github.io/delta. arXiv admin note: text overlap with arXiv:2210.01868

  48. arXiv:2308.12965  [pdf, other]

    cs.CV

    POCO: 3D Pose and Shape Estimation with Confidence

    Authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

    Abstract: The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the con…

    Submitted 24 August, 2023; originally announced August 2023.

  49. arXiv:2308.11617  [pdf, other]

    cs.CV

    GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency

    Authors: Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, Michael J. Black

    Abstract: Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3…

    Submitted 15 July, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: The project has been started during Omid Taheri's internship at Adobe and as a collaboration with the Max Planck Institute for Intelligent Systems

  50. arXiv:2308.10899  [pdf, other]

    cs.AI

    TADA! Text to Animatable Digital Avatars

    Authors: Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black

    Abstract: We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent a…

    Submitted 21 August, 2023; originally announced August 2023.