
Protonated acetylene in the z=0.89 molecular absorber toward PKS1830-211

Authors: S. Muller, R. Le Gal, E. Roueff, J. H. Black, A. Faure, M. Guelin, A. Omont, M. Gerin, F. Combes, S. Aalto

Abstract: We report the first interstellar identification of protonated acetylene, C2H3+, a fundamental hydrocarbon, in the z=0.89 molecular absorber toward the gravitationally lensed quasar PKS1830-211. The molecular species is identified from clear absorption features corresponding to the 2_12-1_01 (rest frequency 494.034 GHz) and 1_11-0_00 (431.316 GHz) ground-state transitions of ortho and para forms of C2H3+, respectively, in ALMA spectra toward the southwestern image of PKS1830-211, where numerous molecules, including other hydrocarbons, have already been detected. From the simple assumption of local thermodynamic equilibrium (LTE) with cosmic microwave background photons and an ortho-to-para ratio of three, we estimate a total C2H3+ column density of 2 x 10^12 cm^-2 and an abundance of 10^-10 compared to H_2. However, formation pumping could affect the population of metastable states, yielding a C2H3+ column density higher than the LTE value by a factor of a few. We explore possible routes to the formation of C2H3+, mainly connected to acetylene and methane, and find that the methane route is more likely in PDR environment. As one of the initial hydrocarbon building blocks, C2H3+ is thought to play an important role in astrochemistry, in particular in the formation of more complex organic molecules.

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted for publication in A&A
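
As a rough illustration of the line frequencies quoted in the abstract above, the sketch below converts the two rest frequencies to the frequencies at which they would be observed, using f_obs = f_rest / (1 + z). It assumes the rounded redshift z = 0.89 given in the title; the function and variable names are hypothetical and do not come from the paper.

```python
def observed_frequency_ghz(rest_frequency_ghz, redshift):
    """Return the observed frequency of a line emitted at rest_frequency_ghz.

    Cosmological redshift stretches wavelengths by (1 + z), so frequencies
    are divided by the same factor.
    """
    return rest_frequency_ghz / (1.0 + redshift)


# Values taken from the abstract; z is the rounded value from the title.
Z_ABSORBER = 0.89
REST_FREQUENCIES_GHZ = {
    "ortho 2_12-1_01": 494.034,
    "para 1_11-0_00": 431.316,
}

observed = {
    label: observed_frequency_ghz(freq, Z_ABSORBER)
    for label, freq in REST_FREQUENCIES_GHZ.items()
}

for label, freq in observed.items():
    print(f"{label}: {freq:.2f} GHz observed")
```

With this rounded redshift, both lines fall between roughly 228 and 262 GHz; a more precise redshift would shift these values slightly.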

arXiv:2401.08559 [pdf, other]

Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

Authors: Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, Davis Rempe

Abstract: Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.

Submitted 24 May, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: CVPR 2024, HuMoGen Workshop

arXiv:2401.07473 [pdf]

doi 10.1016/j.foodchem.2024.139382

Vitamin K content of Australian-grown horticultural commodities

Authors: Eleanor Dunlop, Judy Cunningham, Paul Adorno, Georgios Dabos, Stuart K Johnson, Lucinda J Black

Abstract: Vitamin K is emerging as a multi-function vitamin that plays a role in bone, brain and vascular health. Vitamin K composition data remain limited globally and Australia has lacked nationally representative data for vitamin K1 (phylloquinone, PK) in horticultural commodities. Primary samples (n = 927) of 90 different Australian-grown fruit, vegetable and nut commodities were purchased in three Australian cities. We measured PK in duplicate in 95 composite samples using liquid chromatography with electrospray ionisation-tandem mass spectrometry. The greatest mean concentrations of PK were found in kale (565 ug/100 g), baby spinach (255 ug/100 g) and Brussels sprouts (195 ug/100 g). The data contribute to the global collection of vitamin K food composition data. They add to the evidence that PK concentrations vary markedly between geographic regions, supporting development of region-specific datasets for national food composition databases that do not yet contain data for vitamin K.

Submitted 15 January, 2024; originally announced January 2024.

Comments: 22 pages, 2 tables

Journal ref: Food Chem. 452:139382 (2024)

arXiv:2401.03296 [pdf, other]

doi 10.1038/s41586-023-06307-x

Formation of the Methyl Cation by Photochemistry in a Protoplanetary Disk

Authors: Olivier Berné, Marie-Aline Martin-Drumel, Ilane Schroetter, Javier R. Goicoechea, Ugo Jacovella, Bérenger Gans, Emmanuel Dartois, Laurent Coudert, Edwin Bergin, Felipe Alarcon, Jan Cami, Evelyne Roueff, John H. Black, Oskar Asvany, Emilie Habart, Els Peeters, Amelie Canin, Boris Trahin, Christine Joblin, Stephan Schlemmer, Sven Thorwirth, Jose Cernicharo, Maryvonne Gerin, Alexander Tielens, Marion Zannese, et al. (31 additional authors not shown)

Abstract: Forty years ago it was proposed that gas phase organic chemistry in the interstellar medium was initiated by the methyl cation CH3+, but hitherto it has not been observed outside the Solar System. Alternative routes involving processes on grain surfaces have been invoked. Here we report JWST observations of CH3+ in a protoplanetary disk in the Orion star forming region. We find that gas-phase organic chemistry is activated by UV irradiation.

Submitted 6 January, 2024; originally announced January 2024.

Comments: Published in Nature

Journal ref: Nature 621, 56-59 (2023)

arXiv:2401.00374 [pdf, other]

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Authors: Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black

Abstract: We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/

Submitted 30 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

Comments: Fix typos; Conflict of Interest Disclosure; CVPR Camera Ready; Project Page: https://pantomatrix.github.io/EMAGE/

arXiv:2312.16737 [pdf, other]

HMP: Hand Motion Priors for Pose and Shape Estimation from Video

Authors: Enes Duran, Muhammed Kocabas, Vasileios Choutas, Zicong Fan, Michael J. Black

Abstract: Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos have useful cues to address aforementioned issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models to generalize to in-the-wild scenarios. On the other hand, we have access to large human motion capture datasets which also include hand motions, e.g. AMASS. Therefore, we develop a generative motion prior specific for hands, trained on the AMASS dataset which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation following a latent optimization approach. Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios. It produces stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D. Code is available at https://hmp.is.tue.mpg.de

Submitted 27 December, 2023; originally announced December 2023.

Journal ref: WACV 2024

arXiv:2312.14579 [pdf, other]

Synthesizing Environment-Specific People in Photographs

Authors: Mirela Ostrek, Carol O'Sullivan, Michael J. Black, Justus Thies

Abstract: We present ESP, a novel method for context-aware full-body generation, that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.

Submitted 26 September, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: Accepted at ECCV 2024, Project: https://esp.is.tue.mpg.de

arXiv:2312.14056 [pdf, other]

OH as a probe of the warm water cycle in planet-forming disks

Authors: Marion Zannese, Benoît Tabone, Emilie Habart, Javier R. Goicoechea, Alexandre Zanchet, Ewine F. van Dishoeck, Marc C. van Hemert, John H. Black, Alexander G. G. M. Tielens, A. Veselinova, P. G. Jambrina, M. Menendez, E. Verdasco, F. J. Aoiz, L. Gonzalez-Sanchez, Boris Trahin, Emmanuel Dartois, Olivier Berné, Els Peeters, Jinhua He, Ameek Sidhu, Ryan Chown, Ilane Schroetter, Dries Van De Putte, Amélie Canin, et al. (30 additional authors not shown)

Abstract: Water is a key ingredient for the emergence of life as we know it. Yet, its destruction and reformation in space remains unprobed in warm gas. Here, we detect the hydroxyl radical (OH) emission from a planet-forming disk exposed to external far-ultraviolet (FUV) radiation with the James Webb Space Telescope. The observations are confronted with the results of quantum dynamical calculations. The highly excited OH infrared rotational lines are the tell-tale signs of H2O destruction by FUV. The OH infrared ro-vibrational lines are attributed to chemical excitation via the key reaction O+H2=OH+H which seeds the formation of water in the gas-phase. We infer that the equivalent of the Earth ocean's worth of water is destroyed per month and replenished. These results show that under warm and irradiated conditions water is destroyed and efficiently reformed via gas-phase reactions. This process, assisted by diffusive transport, could reduce the HDO/H2O ratio in the warm regions of planet-forming disks.

Submitted 22 December, 2023; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Version submitted to Nature Astronomy

arXiv:2312.11666 [pdf, other]

HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

Authors: Vanessa Sklyarova, Egor Zakharov, Otmar Hilliges, Michael J. Black, Justus Thies

Abstract: We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by using the 2D priors, they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures cannot be reconstructed with those methods, and they only model the "outer shell", which is not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches.

Submitted 18 December, 2023; originally announced December 2023.

Comments: For more results please refer to the project page https://haar.is.tue.mpg.de/

arXiv:2312.07531 [pdf, other]

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Authors: Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

Abstract: The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/

Submitted 18 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.04466 [pdf, other]

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Authors: Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart

Abstract: Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content, and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.

Submitted 1 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2024. Webpage: https://amuse.is.tue.mpg.de/

arXiv:2312.02767 [pdf]

doi 10.2514/6.2024-1075

Experimental Characterization of Non-Associative Plasticity Flow Rule Coefficients for the LS-DYNA MAT213 Model

Authors: Ryan Premo, Jackob Black, Michael Pereira, Robert K. Goldberg, Trenton M. Ricks, Han-Gyu Kim

Abstract: This project is focused on developing an experimental framework for characterizing non-associative plasticity flow rule coefficients through coupon-scale tests for the LS-DYNA MAT213 model. The main objective is to characterize these coefficients based on the multi-scale (i.e., both microscopic and macroscopic) full-field measurement of the evolution of strain and stress fields. This paper focuses on presenting the experimental work on characterizing the full-scale stress-strain curves of T700/LM-PAEK composites under tension, compression, and shear loads. The experimental data set was intended to build a deformation sub-model in the MAT213 model for the material. The strain data were collected using both microscopic and macroscopic digital image correlation techniques. The microscopic technique was particularly useful for fracture cases under small strains. A preliminary simulation result obtained from the MAT213 model is also presented in the paper. The experimental framework herein will be extended to characterize post-peak stress degradation in the composite material and to develop a damage sub-model for the material. This project will contribute to developing a simulation tool based on the MAT213 model for simulating the rate-dependent impact damages in composites under multi-axial loading.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.02764 [pdf]

doi 10.2514/6.2024-2083

Experimental multi-scale characterization of mode-II interlaminar fracture in geometrically scaled stitched and unstitched resin-infused composites

Authors: Dawson Ozborn, Jackob Black, Wayne Huberty, Christopher Bounds, Han-Gyu Kim

Abstract: This work is focused on investigating the impact of out-of-plane stitches on enhancing mode-II interlaminar fracture toughness (or energy) and characterizing damage progression and crack arrestment in stitched resin-infused composites. For the experimental work, End-Notched Flexure (ENF) quasi-isotropic specimens were manufactured using +/-45 non-crimp carbon-fiber fabrics through a resin-infusion process. Both stitched and unstitched specimen sets were designed for comparison. For a size effect study, the ENF specimens were geometrically scaled with three scaling levels. Based on the load-displacement data (i.e., global analysis), the fracture energy of the specimen material was analyzed using the compliance calibration method and a size effect theory. The fracture energy values were compared between the stitched and unstitched cases to characterize the enhanced fracture toughness of stitched composites. For local analysis, two types of digital image correlation (DIC) systems were employed: microscopic and macroscopic (i.e., coupon-scale) DIC systems. By analyzing in-plane displacement through the thickness, separation development was characterized along predicted fracture process zones. The impact of out-of-plane stitches on separation propagation along fracture process zones was discussed based on the DIC analysis. This work will contribute to developing a high-fidelity damage model for stitched resin-infused composites in the form of a traction-separation for high-speed aircraft applications.

Submitted 5 December, 2023; originally announced December 2023.
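
The abstract above mentions the compliance calibration method for ENF specimens. As a minimal sketch of how that method is commonly applied (not necessarily the authors' exact procedure), the code below fits the usual compliance form C = A + m a^3 by least squares and then evaluates G_II = 3 m P^2 a^2 / (2 B), which follows from G = (P^2 / 2B) dC/da. All function names and the numerical values are hypothetical illustrations, not data from the paper.

```python
def fit_compliance(crack_lengths_mm, compliances_mm_per_n):
    """Least-squares fit of C = A + m * a**3; returns (A, m)."""
    xs = [a ** 3 for a in crack_lengths_mm]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_c = sum(compliances_mm_per_n) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxc = sum(
        (x - mean_x) * (c - mean_c)
        for x, c in zip(xs, compliances_mm_per_n)
    )
    m = sxc / sxx
    return mean_c - m * mean_x, m


def mode_ii_fracture_energy(m, peak_load_n, crack_length_mm, width_mm):
    """G_II = 3 m P^2 a^2 / (2 B), in N/mm for the units used here."""
    return 3.0 * m * peak_load_n ** 2 * crack_length_mm ** 2 / (2.0 * width_mm)


# Hypothetical calibration data generated from A = 2e-3 mm/N, m = 1e-8 1/(N mm^2).
crack_lengths = [20.0, 30.0, 40.0]
compliances = [2.0e-3 + 1.0e-8 * a ** 3 for a in crack_lengths]

intercept, slope = fit_compliance(crack_lengths, compliances)
g_ii = mode_ii_fracture_energy(
    slope, peak_load_n=1000.0, crack_length_mm=30.0, width_mm=25.0
)
print(f"A = {intercept:.3e} mm/N, m = {slope:.3e} 1/(N mm^2), G_II = {g_ii:.3f} N/mm")
```

The cubic dependence on crack length is the standard beam-theory form for ENF compliance; a size effect analysis like the one the abstract describes would require additional steps not shown here.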

arXiv:2311.18836 [pdf, other]

ChatPose: Chatting about 3D Human Pose

Authors: Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black

Abstract: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.

Submitted 23 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Home page: https://yfeng95.github.io/ChatPose/

arXiv:2311.18448 [pdf, other]

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Authors: Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges

Abstract: Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos. Code: https://github.com/zc-alexfan/hold

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.17161 [pdf, other]

JOYS+: mid-infrared detection of gas-phase SO$_2$ emission in a low-mass protostar. The case of NGC 1333 IRAS2A: hot core or accretion shock?

Authors: M. L. van Gelder, M. E. Ressler, E. F. van Dishoeck, P. Nazari, B. Tabone, J. H. Black, Ł. Tychoniec, L. Francis, M. Barsony, H. Beuther, A. Caratti o Garatti, Y. Chen, C. Gieser, V. J. M. le Gouellec, P. J. Kavanagh, P. D. Klaassen, B. W. P. Lew, H. Linnartz, L. Majumdar, G. Perotti, W. R. M. Rocha

Abstract: JWST/MIRI has sharpened our infrared eyes toward the star formation process. This paper presents the first mid-infrared detection of gaseous SO$_2$ emission in an embedded low-mass protostellar system. MIRI-MRS observations of the low-mass protostellar binary NGC 1333 IRAS2A are presented from the JWST Observations of Young protoStars (JOYS+) program, revealing emission from the SO$_2~ν_3$ asymmetric stretching mode at 7.35 micron. The results are compared to those derived from high-angular resolution SO$_2$ data obtained with ALMA. The SO$_2$ emission from the $ν_3$ band is predominantly located on $\sim50-100$ au scales around the main component of the binary, IRAS2A1. A rotational temperature of $92\pm8$ K is derived from the $ν_3$ lines. This is in good agreement with the rotational temperature derived from pure rotational lines in the vibrational ground state (i.e., $ν=0$) with ALMA ($104\pm5$ K). However, the emission of the $ν_3$ lines is not in LTE given that the total number of molecules predicted by a LTE model is found to be a factor $2\times10^4$ higher than what is derived for the $ν=0$ state. This difference can be explained by a vibrational temperature that is $\sim100$ K higher than the derived rotational temperature of the $ν=0$ state. The brightness temperature derived from the continuum around the $ν_3$ band of SO$_2$ is $\sim180$ K, which confirms that the $ν_3=1$ level is not collisionally populated but rather infrared pumped by scattered radiation. This is also consistent with the non-detection of the $ν_2$ bending mode at 18-20 micron. Given the rotational temperature, the extent of the emission ($\sim100$ au in radius), and the narrow line widths in the ALMA data (3.5 km/s), the SO$_2$ in IRAS2A likely originates from ice sublimation in the central hot core around the protostar rather than from an accretion shock at the disk-envelope boundary.

Submitted 28 November, 2023; originally announced November 2023.

Comments: 19 pages, 17 figures, accepted for publication in A&A, abstract abbreviated

arXiv:2311.06243 [pdf, other]

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

Authors: Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf

Abstract: Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.

Submitted 28 April, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

Comments: ICLR 2024 (v2: 34 pages, 19 figures)

arXiv:2311.01042 [pdf]

Higher Mediterranean diet score is associated with longer time between relapses in Australian females with multiple sclerosis

Authors: Hajar Mazahery, Alison Daly, Ngoc Minh Pham, Madeleine Stephens, Eleanor Dunlop, Anne-Louise Ponsonby, Ausimmune/AusLong Investigator Group, Lucinda J Black

Abstract: A higher Mediterranean diet score has been associated with lower likelihood of multiple sclerosis. However, evidence regarding its association with disease activity and progression is limited. Using data from the AusLong Study, we tested longitudinal associations (over 10 years follow-up) between the alternate Mediterranean diet score (aMED) and aMED-Red (including moderate consumption of unprocessed red meat) and time between relapses and disability measured by Expanded Disability Status Scale (EDSS) (n=132; 27 males, 105 females). We used covariate-adjusted survival analysis for time between relapses, and time series mixed-effects negative binomial regression for EDSS. After adjusting for covariates, both higher aMED (aHR=0.94, 95%CI: 0.90, 0.99, p=0.009) and higher aMED-Red (aHR=0.93, 95%CI: 0.89, 0.97, p=0.001) were associated with significantly longer time between relapses in females. Whether specific dietary components of a Mediterranean diet are important in relation to relapses merits further study.

Submitted 2 November, 2023; originally announced November 2023.

Comments: Original article, Brief communication, 13 pages, 2 tables (one main table and one supplementary table)

arXiv:2310.17519 [pdf, other]

doi 10.1145/3618401

FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya

Abstract: Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.

Submitted 27 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: 15 pages, Accepted: ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 2023

Journal ref: Volume 42, article number 204, year 2023

arXiv:2310.15168 [pdf, other]

Ghost on the Shell: An Expressive Representation of General 3D Shapes

Authors: Zhen Liu, Yao Feng, Yuliang Xiu, Weiyang Liu, Liam Paull, Michael J. Black, Bernhard Schölkopf

Abstract: The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.

Submitted 24 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: ICLR 2024 Oral (v3: 30 pages, 19 figures, Project Page: https://gshell3d.github.io/)

arXiv:2310.13768 [pdf, other]

PACE: Human and Camera Motion Estimation from in-the-wild Videos

Authors: Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, Umar Iqbal

Abstract: We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world RICH datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 3DV 2024. Project page: https://nvlabs.github.io/PACE/

arXiv:2310.12399 [pdf, other]

A New Time Series Similarity Measure and Its Smart Grid Applications

Authors: Rui Yuan, S. Ali Pourmousavi, Wen L. Soong, Andrew J. Black, Jon A. R. Liisberg, Julian Lemos-Vinasco

Abstract: Many smart grid applications involve data mining, clustering, classification, identification, and anomaly detection, among others. These applications primarily depend on the measurement of similarity, which is the distance between different time series or subsequences of a time series. The commonly used time series distance measures, namely Euclidean Distance (ED) and Dynamic Time Warping (DTW), do not quantify the flexible nature of electricity usage data in terms of temporal dynamics. As a result, there is a need for a new distance measure that can quantify both the amplitude and temporal changes of electricity time series for smart grid applications, e.g., demand response and load profiling. This paper introduces a novel distance measure to compare electricity usage patterns. The method consists of two phases that quantify the effort required to reshape one time series into another, considering both amplitude and temporal changes. The proposed method is evaluated against ED and DTW using real-world data in three smart grid applications. Overall, the proposed measure outperforms ED and DTW in accurately identifying the best load scheduling strategy, anomalous days with irregular electricity usage, and determining electricity users' behind-the-meter (BTM) equipment.

Submitted 18 October, 2023; originally announced October 2023.

Comments: 7 pages, 6 figures conference
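
The abstract of this entry names Euclidean Distance (ED) and Dynamic Time Warping (DTW) as the baselines that fail to quantify temporal changes. As a rough illustration of those two baselines only (this is not the paper's proposed measure, which the abstract does not specify), a minimal Python sketch follows. The function names and the two sample load profiles are hypothetical, and the DTW variant shown is the textbook unconstrained form with an absolute-difference local cost.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance; requires equal-length series."""
    if len(a) != len(b):
        raise ValueError("ED needs series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Textbook DTW via dynamic programming, O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched to an already-used b[j-1]
                cost[i][j - 1],      # b[j-1] matched to an already-used a[i-1]
                cost[i - 1][j - 1],  # both indices advance together
            )
    return cost[n][m]


# Hypothetical load profiles: the same peak, shifted by one time step.
load_a = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
load_b = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]

print(euclidean_distance(load_a, load_b))
print(dtw_distance(load_a, load_b))
```

On these two profiles ED reports a nonzero distance for what is purely a time shift, while DTW warps the shift away completely. Neither number says how far the peak moved, which is one way to read the gap the abstract describes.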

arXiv:2310.09449 [pdf, other]

Pairwise Similarity Learning is SimPLE

Authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf

Abstract: In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different labels). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods.

Submitted 13 October, 2023; originally announced October 2023.

Comments: Published in ICCV 2023 (Project page: https://simple.is.tue.mpg.de/)

arXiv:2310.08720 [pdf, other]

doi 10.1051/0004-6361/202348244

PDRs4All III: JWST's NIR spectroscopic view of the Orion Bar

Authors: Els Peeters, Emilie Habart, Olivier Berne, Ameek Sidhu, Ryan Chown, Dries Van De Putte, Boris Trahin, Ilane Schroetter, Amelie Canin, Felipe Alarcon, Bethany Schefter, Baria Khan, Sofia Pasquini, Alexander G. G. M. Tielens, Mark G. Wolfire, Emmanuel Dartois, Javier R. Goicoechea, Alexandros Maragkoudakis, Takashi Onaka, Marc W. Pound, Silvia Vicente, Alain Abergel, Edwin A. Bergin, Jeronimo Bernard-Salas, Christiaan Boersma, et al. (113 additional authors not shown)

Abstract: (Abridged) We investigate the impact of radiative feedback from massive stars on their natal cloud and focus on the transition from the HII region to the atomic PDR (crossing the ionisation front (IF)), and the subsequent transition to the molecular PDR (crossing the dissociation front (DF)). We use high-resolution near-IR integral field spectroscopic data from NIRSpec on JWST to observe the Orion Bar PDR as part of the PDRs4All JWST Early Release Science Program. The NIRSpec data reveal a forest of lines including, but not limited to, HeI, HI, and CI recombination lines, ionic lines, OI and NI fluorescence lines, Aromatic Infrared Bands (AIBs including aromatic CH, aliphatic CH, and their CD counterparts), CO2 ice, pure rotational and ro-vibrational lines from H2, and ro-vibrational lines of HD, CO, and CH+, most of them detected for the first time towards a PDR. Their spatial distribution resolves the H and He ionisation structure in the Huygens region, gives insight into the geometry of the Bar, and confirms the large-scale stratification of PDRs. We observe numerous smaller scale structures whose typical size decreases with distance from Ori C and IR lines from CI, if solely arising from radiative recombination and cascade, reveal very high gas temperatures consistent with the hot irradiated surface of small-scale dense clumps deep inside the PDR. The H2 lines reveal multiple, prominent filaments which exhibit different characteristics. This leaves the impression of a "terraced" transition from the predominantly atomic surface region to the CO-rich molecular zone deeper in. This study showcases the discovery space created by JWST to further our understanding of the impact radiation from young stars has on their natal molecular cloud and proto-planetary disk, which touches on star- and planet formation as well as galaxy evolution.

Submitted 12 October, 2023; originally announced October 2023.

Comments: 52 pages, 30 figures, submitted to A&A

Journal ref: A&A 685, A74 (2024)

arXiv:2309.15273 [pdf, other]

DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

Authors: Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black

Abstract: Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.

Submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted as Oral in ICCV'23. Project page: https://deco.is.tue.mpg.de

arXiv:2309.07125 [pdf, other]

Text-Guided Generation and Editing of Compositional 3D Avatars

Authors: Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black

Abstract: Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.

Submitted 13 September, 2023; originally announced September 2023.

Comments: Home page: https://yfeng95.github.io/teca

arXiv:2309.06441 [pdf, other]

Learning Disentangled Avatars with Hybrid 3D Representations

Authors: Yao Feng, Weiyang Liu, Timo Bolkart, Jinlong Yang, Marc Pollefeys, Michael J. Black

Abstract: Tremendous efforts have been made to learn animatable and photorealistic human avatars. Towards this end, both explicit and implicit 3D representations are heavily studied for a holistic modeling and capture of the whole human (e.g., body, clothing, face and hair), but neither representation is an optimal choice in terms of representation efficacy since different parts of the human avatar have different modeling desiderata. For example, meshes are generally not suitable for modeling clothing and hair. Motivated by this, we present Disentangled Avatars (DELTA), which models humans with hybrid explicit-implicit 3D representations. DELTA takes a monocular RGB video as input, and produces a human avatar with separate body and clothing/hair layers. Specifically, we demonstrate two important applications for DELTA. For the first one, we consider the disentanglement of the human body and clothing and in the second, we disentangle the face and hair. To do so, DELTA represents the body or face with an explicit mesh-based parametric 3D model and the clothing or hair with an implicit neural radiance field. To make this possible, we design an end-to-end differentiable renderer that integrates meshes into volumetric rendering, enabling DELTA to learn directly from monocular videos without any 3D supervision. Finally, we show how these two applications can be easily combined to model full-body avatars, such that the hair, face, body and clothing can be fully disentangled yet jointly rendered. Such a disentanglement enables hair and clothing transfer to arbitrary body shapes. We empirically validate the effectiveness of DELTA's disentanglement by demonstrating its promising performance on disentangled reconstruction, virtual clothing try-on and hairstyle transfer. To facilitate future research, we also release an open-sourced pipeline for the study of hybrid human avatar modeling.

Submitted 12 September, 2023; originally announced September 2023.

Comments: home page: https://yfeng95.github.io/delta. arXiv admin note: text overlap with arXiv:2210.01868

arXiv:2308.16733 [pdf, other]

doi 10.1051/0004-6361/202346662

PDRs4All IV. An embarrassment of riches: Aromatic infrared bands in the Orion Bar

Authors: Ryan Chown, Ameek Sidhu, Els Peeters, Alexander G. G. M. Tielens, Jan Cami, Olivier Berné, Emilie Habart, Felipe Alarcón, Amélie Canin, Ilane Schroetter, Boris Trahin, Dries Van De Putte, Alain Abergel, Edwin A. Bergin, Jeronimo Bernard-Salas, Christiaan Boersma, Emeric Bron, Sara Cuadrado, Emmanuel Dartois, Daniel Dicken, Meriem El-Yajouri, Asunción Fuente, Javier R. Goicoechea, Karl D. Gordon, Lina Issa, et al. (114 additional authors not shown)

Abstract: (Abridged) Mid-infrared observations of photodissociation regions (PDRs) are dominated by strong emission features called aromatic infrared bands (AIBs). The most prominent AIBs are found at 3.3, 6.2, 7.7, 8.6, and 11.2 $μ$m. The most sensitive, highest-resolution infrared spectral imaging data ever taken of the prototypical PDR, the Orion Bar, have been captured by JWST. We provide an inventory of the AIBs found in the Orion Bar, along with mid-IR template spectra from five distinct regions in the Bar: the molecular PDR, the atomic PDR, and the HII region. We use JWST NIRSpec IFU and MIRI MRS observations of the Orion Bar from the JWST Early Release Science Program, PDRs4All (ID: 1288). We extract five template spectra to represent the morphology and environment of the Orion Bar PDR. The superb sensitivity and the spectral and spatial resolution of these JWST observations reveal many details of the AIB emission and enable an improved characterization of their detailed profile shapes and sub-components. While the spectra are dominated by the well-known AIBs at 3.3, 6.2, 7.7, 8.6, 11.2, and 12.7 $μ$m, a wealth of weaker features and sub-components are present. We report trends in the widths and relative strengths of AIBs across the five template spectra. These trends yield valuable insight into the photochemical evolution of PAHs, such as the evolution responsible for the shift of 11.2 $μ$m AIB emission from class B$_{11.2}$ in the molecular PDR to class A$_{11.2}$ in the PDR surface layers. This photochemical evolution is driven by the increased importance of FUV processing in the PDR surface layers, resulting in a "weeding out" of the weakest links of the PAH family in these layers. For now, these JWST observations are consistent with a model in which the underlying PAH family is composed of a few species: the so-called 'grandPAHs'.

Submitted 5 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: 25 pages, 10 figures, to appear in A&A

Journal ref: A&A 685, A75 (2024)

arXiv:2308.16732 [pdf, other]

doi 10.1051/0004-6361/202346747

PDRs4All II: JWST's NIR and MIR imaging view of the Orion Nebula

Authors: Emilie Habart, Els Peeters, Olivier Berné, Boris Trahin, Amélie Canin, Ryan Chown, Ameek Sidhu, Dries Van De Putte, Felipe Alarcón, Ilane Schroetter, Emmanuel Dartois, Sílvia Vicente, Alain Abergel, Edwin A. Bergin, Jeronimo Bernard-Salas, Christiaan Boersma, Emeric Bron, Jan Cami, Sara Cuadrado, Daniel Dicken, Meriem Elyajouri, Asunción Fuente, Javier R. Goicoechea, Karl D. Gordon, Lina Issa, et al. (117 additional authors not shown)

Abstract: The JWST has captured the most detailed and sharpest infrared images ever taken of the inner region of the Orion Nebula, the nearest massive star formation region, and a prototypical highly irradiated dense photo-dissociation region (PDR). We investigate the fundamental interaction of far-ultraviolet photons with molecular clouds. The transitions across the ionization front (IF), dissociation front (DF), and the molecular cloud are studied at high-angular resolution. These transitions are relevant to understanding the effects of radiative feedback from massive stars and the dominant physical and chemical processes that lead to the IR emission that JWST will detect in many Galactic and extragalactic environments. Due to the proximity of the Orion Nebula and the unprecedented angular resolution of JWST, these data reveal that the molecular cloud borders are hyper structured at small angular scales of 0.1-1" (0.0002-0.002 pc or 40-400 au at 414 pc). A diverse set of features are observed such as ridges, waves, globules and photoevaporated protoplanetary disks. At the PDR atomic to molecular transition, several bright features are detected that are associated with the highly irradiated surroundings of the dense molecular condensations and embedded young star. Toward the Orion Bar PDR, a highly sculpted interface is detected with sharp edges and density increases near the IF and DF. This was predicted by previous modeling studies, but the fronts were unresolved in most tracers. A complex, structured, and folded DF surface was traced by the H2 lines. This dataset was used to revisit the commonly adopted 2D PDR structure of the Orion Bar. JWST provides us with a complete view of the PDR, all the way from the PDR edge to the substructured dense region, and this allowed us to determine, in detail, where the emission of the atomic and molecular lines, aromatic bands, and dust originate.

Submitted 2 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Journal ref: A&A 685, A73 (2024)

arXiv:2308.12965 [pdf, other]

POCO: 3D Pose and Shape Estimation with Confidence

Authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

Abstract: The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames. Code and models will be available for research at https://poco.is.tue.mpg.de.
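To illustrate what predicting "a per-sample variance" alongside a pose estimate can mean in practice, the sketch below shows a generic heteroscedastic Gaussian negative log-likelihood. This is an assumption for illustration only: it is a standard uncertainty-regression loss, not POCO's Dual Conditioning Strategy, and the function name and scalar inputs are hypothetical simplifications.

```python
import math


def gaussian_nll(pred, target, log_var):
    """Mean Gaussian negative log-likelihood with the constant term dropped.

    pred, target: sequences of scalar predictions and ground-truth values.
    log_var: per-sample predicted log-variance (hypothetical network output).
    """
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        # A large predicted variance down-weights the squared error
        # but is itself penalised by the additive log-variance term.
        total += 0.5 * (s + (p - t) ** 2 / math.exp(s))
    return total / len(pred)


if __name__ == "__main__":
    # Same error of 2.0, scored with a confident and an uncertain prediction.
    confident = gaussian_nll([2.0], [0.0], [0.0])
    uncertain = gaussian_nll([2.0], [0.0], [math.log(4.0)])
    print(confident, uncertain)
```

Under this loss, a sample with a large error is penalised less when the model also reports a large variance, which is what makes the predicted variance usable as a confidence signal for downstream filtering.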

Submitted 24 August, 2023; originally announced August 2023.

arXiv:2308.11617 [pdf, other]

GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency

Authors: Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, Michael J. Black

Abstract: Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to enforce motion temporal consistency in the latent space (LTC), and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.

Submitted 15 July, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: The project has been started during Omid Taheri's internship at Adobe and as a collaboration with the Max Planck Institute for Intelligent Systems

arXiv:2308.10899 [pdf, other]

TADA! Text to Animatable Digital Avatars

Authors: Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black

Abstract: We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2308.10638 [pdf, other]

SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

Authors: Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart

Abstract: We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de.

Submitted 6 May, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Updated to camera ready version of CVPR 2024

arXiv:2307.09882 [pdf, other]

Adversarial Likelihood Estimation With One-Way Flows

Authors: Omri Ben-Dov, Pravir Singh Gupta, Victoria Abrevaya, Michael J. Black, Partha Ghosh

Abstract: Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; and 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require a tractable inverse function. Our experimental results show that our method converges faster, produces comparable sample quality to GANs with similar architecture, successfully avoids over-fitting to commonly used datasets and produces smooth low-dimensional latent representations of the training data.

Submitted 2 October, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

arXiv:2306.17331 [pdf, other]

doi 10.1007/s00285-024-02132-6

Computation of random time-shift distributions for stochastic population models

Authors: Dylan Morris, John Maclean, Andrew J. Black

Abstract: Even in large systems, the effect of noise arising from when populations are initially small can persist to be measurable on the macroscale. A deterministic approximation to a stochastic model will fail to capture this effect, but it can be accurately approximated by including an additional random time-shift to the initial conditions. We present an efficient numerical method to compute this time-shift distribution for a large class of stochastic models. The method relies on differentiation of certain functional equations, which we show can be effectively automated by deriving rules for different types of model rates that arise commonly when mass-action mixing is assumed. Explicit computation of the time-shift distribution can be used to build a practical tool for the efficient generation of macroscopic trajectories of stochastic population models, without the need for costly stochastic simulations. Full code is provided to implement this and we demonstrate our method on an epidemic model and a model of within-host viral dynamics.

Submitted 12 August, 2024; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: 46 pages, 10 figures

MSC Class: 60J80; 60J28; 60J22

Journal ref: J. Math. Biol. 89, 33 (2024)

arXiv:2306.16940 [pdf, other]

BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion

Authors: Michael J. Black, Priyanka Patel, Joachim Tesch, Jinlong Yang

Abstract: We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing. Achieving sufficient realism is non-trivial and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks and all images, ground truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: https://bedlam.is.tue.mpg.de/.

Submitted 29 June, 2023; originally announced June 2023.

Journal ref: CVPR 2023

arXiv:2306.08990 [pdf, other]

doi 10.1145/3610548.3618183

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Authors: Radek Daněček, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael J. Black, Timo Bolkart

Abstract: To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.

Submitted 26 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: SIGGRAPH Asia 2023 Conference Paper

arXiv:2306.07437 [pdf, other]

Instant Multi-View Head Capture through Learnable Registration

Authors: Timo Bolkart, Tianye Li, Michael J. Black

Abstract: Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow, and commonly address the problem in two separate steps: multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scan surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.

Submitted 12 June, 2023; originally announced June 2023.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2023

arXiv:2306.02850 [pdf, other]

TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments

Authors: Yu Sun, Qian Bao, Wu Liu, Tao Mei, Michael J. Black

Abstract: Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera, and world, coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.

Submitted 20 November, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: Project page: https://www.yusun.work/TRACE/TRACE.html

arXiv:2305.16449 [pdf, other]

Threshold and laser-conversion in nanostructured-resonator parametric oscillators

Authors: Haixin Liu, Grant M. Brodnik, Jizhao Zang, David R. Carlson, Jennifer A. Black, Scott B. Papp

Abstract: We explore optical parametric oscillation (OPO) in nanophotonic resonators, enabling arbitrary, nonlinear phase-matching and nearly lossless control of energy conversion. Such pristine OPO laser converters are determined by nonlinear light-matter interactions, making them both technologically flexible and broadly reconfigurable. We utilize a nanostructured inner-wall modulation in the resonator to achieve universal phase-matching for OPO-laser conversion, but coherent backscattering also induces a counterpropagating pump laser. This depletes the intra-resonator optical power in either direction, increasing the OPO threshold power and limiting laser-conversion efficiency, the ratio of optical power in target signal and idler frequencies to the pump. We develop an analytical model of this system that emphasizes an understanding of optimal laser conversion and threshold behaviors, and we use the model to guide experiments with nanostructured-resonator OPO laser-conversion circuits, fully integrated on chip and unlimited by group-velocity dispersion. Our work demonstrates the fundamental connection between OPO laser-conversion efficiency and the resonator coupling rate, subject to the relative phase and power of counterpropagating pump fields. We achieve $(40\pm4)$ mW of on-chip power, corresponding to $(41\pm4)$% conversion efficiency, and discover a path toward near-unity OPO laser conversion efficiency.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.02312 [pdf, other]

AG3D: Learning to Generate 3D Avatars from 2D Image Collections

Authors: Zijian Dong, Xu Chen, Jinlong Yang, Michael J. Black, Otmar Hilliges, Andreas Geiger

Abstract: While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections. However, learning realistic and complete 3D appearance and geometry in this under-constrained setting remains challenging, especially in the presence of loose clothing such as dresses. In this paper, we propose a new adversarial generative model of realistic 3D people from 2D images. Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator and integrating an efficient and flexible articulation module. To improve realism, we train our model using multiple discriminators while also integrating geometric cues in the form of predicted 2D normal maps. We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance. We validate the effectiveness of our model and the importance of each component via systematic ablation studies.

Submitted 3 May, 2023; originally announced May 2023.

Comments: Project Page: https://zj-dong.github.io/AG3D/

arXiv:2305.00976 [pdf, other]

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

Authors: Mathis Petrovich, Michael J. Black, Gül Varol

Abstract: In this paper, we present TMR, a simple yet effective approach for text to 3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available at https://mathis.petrovich.fr/tmr.
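The "median rank" figure quoted in the abstract is a standard retrieval metric. As a rough illustration of how such a number is typically computed, the sketch below ranks the correct motion for each text query from a similarity matrix and takes the median. The function name, the toy similarity values, and the assumption that query i matches candidate i are all hypothetical and not taken from the TMR code or its evaluation protocols.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Rank of the true match for each query.

    similarity[i][j] is the score between text query i and motion j;
    motion i is assumed to be the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly higher than the true match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


if __name__ == "__main__":
    # Toy 3x3 similarity matrix (made-up values for illustration).
    sim = [
        [0.9, 0.1, 0.3],
        [0.8, 0.5, 0.2],
        [0.1, 0.7, 0.4],
    ]
    ranks = retrieval_ranks(sim)
    print("ranks:", ranks, "median rank:", median(ranks))
```

A lower median rank means the correct motion tends to appear nearer the top of the retrieved list.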

Submitted 25 August, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: ICCV 2023 Camera Ready, project page: https://mathis.petrovich.fr/tmr/

arXiv:2304.10528 [pdf, other]

Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance

Authors: Haiwen Feng, Peter Kulits, Shichen Liu, Michael J. Black, Victoria Abrevaya

Abstract: We address the problem of fitting a parametric human body model (SMPL) to point cloud data. Optimization-based methods require careful initialization and are prone to becoming trapped in local optima. Learning-based methods address this but do not generalize well when the input pose is far from those seen during training. For rigid point clouds, remarkable generalization has been achieved by leveraging SE(3)-equivariant networks, but these methods do not work on articulated objects. In this work we extend this idea to human bodies and propose ArtEq, a novel part-based SE(3)-equivariant neural architecture for SMPL model estimation from point clouds. Specifically, we learn a part detection network by leveraging local SO(3) invariance, and regress shape and pose using articulated SE(3) shape-invariant and pose-equivariant networks, all trained end-to-end. Our novel pose regression module leverages the permutation-equivariant property of self-attention layers to preserve rotational equivariance. Experimental results show that ArtEq generalizes to poses not seen during training, outperforming state-of-the-art methods by ~44% in terms of body reconstruction accuracy, without requiring an optimization refinement step. Furthermore, ArtEq is three orders of magnitude faster during inference than prior work and has 97.3% fewer parameters. The code and model are available for research purposes at https://arteq.is.tue.mpg.de.

Submitted 19 September, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

Comments: Accepted at ICCV 2023 as an oral presentation. Project page: https://arteq.is.tue.mpg.de ; Update V2: Camera-Ready version, fix metric issues and numeric bug of ID performance

arXiv:2304.10482 [pdf, other]

Reconstructing Signing Avatars From Video Using Linguistic Priors

Authors: Maria-Paola Forte, Peter Kulits, Chun-Hao Huang, Vasileios Choutas, Dimitrios Tzionas, Katherine J. Kuchenbecker, Michael J. Black

Abstract: Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at http://sgnify.is.tue.mpg.de.

Submitted 20 April, 2023; originally announced April 2023.

arXiv:2304.10417 [pdf, other]

SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation

Authors: Nikos Athanasiou, Mathis Petrovich, Michael J. Black, Gül Varol

Abstract: Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, we find that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.

Submitted 26 March, 2024; v1 submitted 20 April, 2023; originally announced April 2023.

Comments: Teaser Fixed

arXiv:2304.09002 [pdf, ps, other]

doi 10.1051/0004-6361/202245768

Cosmo-tomography toward PKS1830-211: Variability of the quasar and of its foreground molecular absorption monitored with ALMA

Authors: S. Muller, I. Marti-Vidal, F. Combes, M. Gerin, A. Beelen, C. Horellou, M. Guelin, S. Aalto, J. H. Black, E. van Kampen

Abstract: Time variability of astronomical sources provides crude information on their typical size and on the implied physical mechanisms. PKS1830-211 is a remarkable radio-bright lensed quasar with a foreground molecular absorber at z=0.89. Small-scale morphological changes in the core-jet structure of the quasar -- which is magnified by the lensing -- result in a varying illumination of the absorber screen, which in turn causes variations in the absorption profile. We aim to study the time variations of the system [...] in order to obtain constraints on both the quasar activity and small-scale structures in the ISM of the absorber. We used ALMA to monitor the submm continuum emission, together with the absorption spectra of the H2O and CH molecules, with 17 visits spread over six months in 2016. [...] From the continuum data, we followed the evolution of the flux density, flux-density ratio, spectral index, and differential polarization between the two lensed images of the quasar; all quantities show significant variations related to the intrinsic activity of the quasar. We propose a simple parametric model of a core plus a ballistic plasmon to account for the continuum evolution, from which we constrain a time delay of 25+/-3 days between lensed images. The spectral lines reveal significant variations in the foreground absorption. A PCA highlights apparent wavy time variations, possibly linked to the helical jet precession period of the quasar.
From the deep averaged spectra towards the SW image, we detect the absorption of 13CH and estimate an abundance ratio of 12CH/13CH~150. We also measure the oxygen isotopic ratios, 16O/18O=65.3+/-0.7 and 18O/17O=11.5+/-0.5. Finally, we find a remarkable continuous shallow trough in the water absorption spanning a velocity interval of nearly 500 km/s. This broad absorption could be the signature of an extra-planar molecular component. [Abridged]

Submitted 18 April, 2023; originally announced April 2023.

Comments: Accepted for publication in A&A

Journal ref: A&A 674, A101 (2023)

arXiv:2304.05954 [pdf, other]

A rich hydrocarbon chemistry and high C to O ratio in the inner disk around a very low-mass star

Authors: B. Tabone, G. Bettoni, E. F. van Dishoeck, A. M. Arabhavi, S. L. Grant, D. Gasman, T. Henning, I. Kamp, M. Güdel, P. -O. Lagage, T. P. Ray, B. Vandenbussche, A. Abergel, O. Absil, I. Argyriou, D. Barrado, A. Boccaletti, J. Bouwman, A. Caratti o Garatti, V. Geers, A. M. Glauser, K. Justannont, F. Lahuis, M. Mueller, C. Nehmé, et al. (21 additional authors not shown)

Abstract: Carbon is an essential element for life but how much can be delivered to young planets is still an open question. The chemical characterization of planet-forming disks is a crucial step in our understanding of the diversity and habitability of exoplanets. Very low-mass stars ($<0.2~M_{\odot}$) are interesting targets because they host a rich population of terrestrial planets. Here we present the JWST detection of abundant hydrocarbons in the disk of a very low-mass star obtained as part of the MIRI mid-INfrared Disk Survey (MINDS). In addition to very strong and broad emission from C$_2$H$_2$ and its $^{13}$C$^{12}$CH$_2$ isotopologue, C$_4$H$_2$, benzene, and possibly CH$_4$ are identified, but water, PAH and silicate features are weak or absent. The lack of small silicate grains implies that we can look deep down into this disk. These detections testify to an active warm hydrocarbon chemistry with a high C/O ratio in the inner 0.1 au of this disk, perhaps due to destruction of carbonaceous grains. The exceptionally high C$_2$H$_2$/CO$_2$ and C$_2$H$_2$/H$_2$O column density ratios suggest that oxygen is locked up in icy pebbles and planetesimals outside the water iceline. This, in turn, will have significant consequences for the composition of forming exoplanets.

Submitted 12 April, 2023; originally announced April 2023.

Comments: version submitted to Nature Astronomy

arXiv:2303.18246 [pdf, other]

3D Human Pose Estimation via Intuitive Physics

Authors: Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas

Abstract: Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. Code and data are available for research at https://ipman.is.tue.mpg.de.

Submitted 24 July, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de

arXiv:2303.08133 [pdf, other]

MeshDiffusion: Score-based Generative 3D Mesh Modeling

Authors: Zhen Liu, Yao Feng, Michael J. Black, Derek Nowrouzezahrai, Liam Paull, Weiyang Liu

Abstract: We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parametrization. We demonstrate the effectiveness of our model on multiple generative tasks.

Submitted 15 April, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: ICLR 2023 (Spotlight, Notable-top-25%)

arXiv:2303.03373 [pdf, other]

Detecting Human-Object Contact in Images

Authors: Yixin Chen, Sai Kumar Dwivedi, Michael J. Black, Dimitrios Tzionas

Abstract: Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate the involved body part of the human body. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.

Submitted 4 April, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted at CVPR 2023

Showing 51–100 of 339 results for author: Black, J