Search | arXiv e-print repository

AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences

Authors: Minoru Kishi, Ryosuke Sakai, Shinnosuke Takamichi, Yusuke Kanamori, Yuki Okamoto

Abstract: We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is an important, but its implementation requires monetary costs. Therefore, objective evaluation such as mel-cepstral distortion are used, but the correlation between these objective metrics and subjecti… ▽ More We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is an important, but its implementation requires monetary costs. Therefore, objective evaluation such as mel-cepstral distortion are used, but the correlation between these objective metrics and subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between embedding of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2506.23582 [pdf, ps, other]

RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio

Authors: Yusuke Kanamori, Yuki Okamoto, Taisei Takano, Shinnosuke Takamichi, Yuki Saito, Hiroshi Saruwatari

Abstract: In text-to-audio (TTA) research, the relevance between input text and output audio is an important evaluation aspect. Traditionally, it has been evaluated from both subjective and objective perspectives. However, subjective evaluation is costly in terms of money and time, and objective evaluation is unclear regarding the correlation to subjective evaluation scores. In this study, we construct RELA… ▽ More In text-to-audio (TTA) research, the relevance between input text and output audio is an important evaluation aspect. Traditionally, it has been evaluated from both subjective and objective perspectives. However, subjective evaluation is costly in terms of money and time, and objective evaluation is unclear regarding the correlation to subjective evaluation scores. In this study, we construct RELATE, an open-sourced dataset that subjectively evaluates the relevance. Also, we benchmark a model for automatically predicting the subjective evaluation score from synthesized audio. Our model outperforms a conventional CLAPScore model, and that trend extends to many sound categories. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: Accepted to INTERSPEECH2025

arXiv:2506.23553 [pdf, ps, other]

Human-CLAP: Human-perception-based contrastive language-audio pretraining

Authors: Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari

Abstract: Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for the evaluation of the relevance between audio and text in text-to-audio. However, the relationship between CLAPScore and human subjective evaluation scores is still unclarified. We show that CL… ▽ More Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for the evaluation of the relevance between audio and text in text-to-audio. However, the relationship between CLAPScore and human subjective evaluation scores is still unclarified. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP called Human-CLAP by training a contrastive language-audio model using the subjective evaluation score. In our experiments, the results indicate that our Human-CLAP improved the Spearman's rank correlation coefficient (SRCC) between the CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2505.04052 [pdf, ps, other]

Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control

Authors: Shun Masuda, Yuki Endo, Yoshihiro Kanamori

Abstract: Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both… ▽ More Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns a depth map of the scene with the person through supervised learning, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from input data without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches by better preserving scene consistency while accurately reflecting occlusions and user-specified poses. △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.04050 [pdf, ps, other]

TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models

Authors: Kazuki Higo, Toshiki Kanai, Yuki Endo, Yoshihiro Kanamori

Abstract: 3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly ge… ▽ More 3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures. △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2503.17197 [pdf, other]

FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy

Authors: Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori

Abstract: Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages pre-trained stable diff… ▽ More Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages pre-trained stable diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios. △ Less

Submitted 21 March, 2025; originally announced March 2025.

Comments: CVPR 2025. Project: https://yangxingchao.github.io/FreeUV-page/

arXiv:2502.13987 [pdf, other]

SelfAge: Personalized Facial Age Transformation Using Self-reference Images

Authors: Taishi Ito, Yuki Endo, Yoshihiro Kanamori

Abstract: Age transformation of facial images is a technique that edits age-related person's appearances while preserving the identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-… ▽ More Age transformation of facial images is a technique that edits age-related person's appearances while preserving the identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-based method for personalized age transformation. Our diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. To reflect individual-specific features, we incorporate additional supervision using self-reference images, which are facial images of the same person at different ages. Specifically, we fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. Additionally, we design an effective prompt to enhance the performance of age editing and identity preservation. Experiments demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to existing methods. The code and the pretrained model are available at https://github.com/shiiiijp/SelfAge. △ Less

Submitted 18 February, 2025; originally announced February 2025.

arXiv:2411.00356 [pdf, other]

All-frequency Full-body Human Image Relighting

Authors: Daichi Tajima, Yoshihiro Kanamori, Yuki Endo

Abstract: Relighting of human images enables post-photography editing of lighting effects in portraits. The current mainstream approach uses neural networks to approximate lighting effects without explicitly accounting for the principle of physical shading. As a result, it often has difficulty representing high-frequency shadows and shading. In this paper, we propose a two-stage relighting method that can r… ▽ More Relighting of human images enables post-photography editing of lighting effects in portraits. The current mainstream approach uses neural networks to approximate lighting effects without explicitly accounting for the principle of physical shading. As a result, it often has difficulty representing high-frequency shadows and shading. In this paper, we propose a two-stage relighting method that can reproduce physically-based shadows and shading from low to high frequencies. The key idea is to approximate an environment light source with a set of a fixed number of area light sources. The first stage employs supervised inverse rendering from a single image using neural networks and calculates physically-based shading. The second stage then calculates shadow for each area light and sums up to render the final image. We propose to make soft shadow mapping differentiable for the area-light approximation of environment lighting. We demonstrate that our method can plausibly reproduce all-frequency shadows and shading caused by environment illumination, which have been difficult to reproduce using existing methods. △ Less

Submitted 1 November, 2024; originally announced November 2024.

Comments: project page: [this URL](https://github.com/majita06/All-frequency_Full-body_Human_Image_Relighting)

arXiv:2410.12242 [pdf, other]

EG-HumanNeRF: Efficient Generalizable Human NeRF Utilizing Human Prior for Sparse View

Authors: Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

Abstract: Generalizable neural radiance field (NeRF) enables neural-based digital human rendering without per-scene retraining. When combined with human prior knowledge, high-quality human rendering can be achieved even with sparse input views. However, the inference of these methods is still slow, as a large number of neural network queries on each ray are required to ensure the rendering quality. Moreover… ▽ More Generalizable neural radiance field (NeRF) enables neural-based digital human rendering without per-scene retraining. When combined with human prior knowledge, high-quality human rendering can be achieved even with sparse input views. However, the inference of these methods is still slow, as a large number of neural network queries on each ray are required to ensure the rendering quality. Moreover, occluded regions often suffer from artifacts, especially when the input views are sparse. To address these issues, we propose a generalizable human NeRF framework that achieves high-quality and real-time rendering with sparse input views by extensively leveraging human prior knowledge. We accelerate the rendering with a two-stage sampling reduction strategy: first constructing boundary meshes around the human geometry to reduce the number of ray samples for sampling guidance regression, and then volume rendering using fewer guided samples. To improve rendering quality, especially in occluded regions, we propose an occlusion-aware attention mechanism to extract occlusion information from the human priors, followed by an image space refinement network to improve rendering quality. Furthermore, for volume rendering, we adopt a signed ray distance function (SRDF) formulation, which allows us to propose an SRDF loss at every sample position to improve the rendering quality further. Our experiments demonstrate that our method outperforms the state-of-the-art methods in rendering quality and has a competitive rendering speed compared with speed-prioritized novel view synthesis methods. △ Less

Submitted 16 October, 2024; originally announced October 2024.

Comments: project page: https://github.com/LarsPh/EG-HumanNeRF

arXiv:2405.16443 [pdf, other]

3D View Optimization for Improving Image Aesthetics

Authors: Taichi Uchida, Yoshihiro Kanamori, Yuki Endo

Abstract: Achieving aesthetically pleasing photography necessitates attention to multiple factors, including composition and capture conditions, which pose challenges to novices. Prior research has explored the enhancement of photo aesthetics post-capture through 2D manipulation techniques; however, these approaches offer limited search space for aesthetics. We introduce a pioneering method that employs 3D… ▽ More Achieving aesthetically pleasing photography necessitates attention to multiple factors, including composition and capture conditions, which pose challenges to novices. Prior research has explored the enhancement of photo aesthetics post-capture through 2D manipulation techniques; however, these approaches offer limited search space for aesthetics. We introduce a pioneering method that employs 3D operations to simulate the conditions at the moment of capture retrospectively. Our approach extrapolates the input image and then reconstructs the 3D scene from the extrapolated image, followed by an optimization to identify camera parameters and image aspect ratios that yield the best 3D view with enhanced aesthetics. Comparative qualitative and quantitative assessments reveal that our method surpasses traditional 2D editing techniques with superior aesthetics. △ Less

Submitted 26 May, 2024; originally announced May 2024.

Comments: 10 pages

arXiv:2403.17761 [pdf, other]

Makeup Prior Models for 3D Facial Makeup Estimation and Applications

Authors: Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori

Abstract: In this work, we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. However, it retains only low-frequency information. Conversely, the StyleGAN2-based model can represent high-frequency information with relatively higher com… ▽ More In this work, we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. However, it retains only low-frequency information. Conversely, the StyleGAN2-based model can represent high-frequency information with relatively higher computational cost than the PCA-based model. Although there is a trade-off between the two models, both are applicable to 3D facial makeup estimation and related applications. By leveraging makeup prior models and designing a makeup consistency module, we effectively address the challenges that previous methods faced in robustly estimating makeup, particularly in the context of handling self-occluded faces. In experiments, we demonstrate that our approach reduces computational costs by several orders of magnitude, achieving speeds up to 180 times faster. In addition, by improving the accuracy of the estimated makeup, we confirm that our methods are highly advantageous for various 3D facial makeup applications such as 3D makeup face reconstruction, user-friendly makeup editing, makeup transfer, and interpolation. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: CVPR2024. Project: https://yangxingchao.github.io/makeup-priors-page

arXiv:2401.02804 [pdf, other]

DiffBody: Diffusion-based Pose and Shape Editing of Human Images

Authors: Yuta Okuyama, Yuki Endo, Yoshihiro Kanamori

Abstract: Pose and body shape editing in a human image has received increasing attention. However, current methods often struggle with dataset biases and deteriorate realism and the person's identity when users make large edits. We propose a one-shot approach that enables large edits with identity preservation. To enable large edits, we fit a 3D body model, project the input image onto the 3D model, and cha… ▽ More Pose and body shape editing in a human image has received increasing attention. However, current methods often struggle with dataset biases and deteriorate realism and the person's identity when users make large edits. We propose a one-shot approach that enables large edits with identity preservation. To enable large edits, we fit a 3D body model, project the input image onto the 3D model, and change the body's pose and shape. Because this initial textured body model has artifacts due to occlusion and the inaccurate body shape, the rendered image undergoes a diffusion-based refinement, in which strong noise destroys body structure and identity whereas insufficient noise does not help. We thus propose an iterative refinement with weak noise, applied first for the whole body and then for the face. We further enhance the realism by fine-tuning text embeddings via self-supervised learning. Our quantitative and qualitative evaluations demonstrate that our method outperforms other existing methods across various datasets. △ Less

Submitted 7 January, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted to WACV 2024, project page: https://www.cgg.cs.tsukuba.ac.jp/~okuyama/pub/diffbody/

arXiv:2305.16759 [pdf, other]

StyleHumanCLIP: Text-guided Garment Manipulation for StyleGAN-Human

Authors: Takato Yoshikawa, Yuki Endo, Yoshihiro Kanamori

Abstract: This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. O… ▽ More This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations reveal that our method can control generated images more faithfully to given texts than existing methods. △ Less

Submitted 20 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: VISIAPP 2024, project page: https://www.cgg.cs.tsukuba.ac.jp/~yoshikawa/pub/style_human_clip/

arXiv:2302.13279 [pdf, other]

Makeup Extraction of 3D Representation via Illumination-Aware Image Decomposition

Authors: Xingchao Yang, Takafumi Taketomi, Yoshihiro Kanamori

Abstract: Facial makeup enriches the beauty of not only real humans but also virtual characters; therefore, makeup for 3D facial models is highly in demand in productions. However, painting directly on 3D faces and capturing real-world makeup are costly, and extracting makeup from 2D images often struggles with shading effects and occlusions. This paper presents the first method for extracting makeup for 3D… ▽ More Facial makeup enriches the beauty of not only real humans but also virtual characters; therefore, makeup for 3D facial models is highly in demand in productions. However, painting directly on 3D faces and capturing real-world makeup are costly, and extracting makeup from 2D images often struggles with shading effects and occlusions. This paper presents the first method for extracting makeup for 3D facial models from a single makeup portrait. Our method consists of the following three steps. First, we exploit the strong prior of 3D morphable models via regression-based inverse rendering to extract coarse materials such as geometry and diffuse/specular albedos that are represented in the UV space. Second, we refine the coarse materials, which may have missing pixels due to occlusions. We apply inpainting and optimization. Finally, we extract the bare skin, makeup, and an alpha matte from the diffuse albedo. Our method offers various applications for not only 3D facial models but also 2D portrait images. The extracted makeup is well-aligned in the UV space, from which we build a large-scale makeup dataset and a parametric makeup model for 3D faces. Our disentangled materials also yield robust makeup transfer and illumination-aware makeup interpolation/removal without a reference image. △ Less

Submitted 26 February, 2023; originally announced February 2023.

Comments: Eurographics 2023

arXiv:2110.07272 [pdf, other]

Relighting Humans in the Wild: Monocular Full-Body Human Relighting with Domain Adaptation

Authors: Daichi Tajima, Yoshihiro Kanamori, Yuki Endo

Abstract: The modern supervised approaches for human image relighting rely on training data generated from 3D human models. However, such datasets are often small (e.g., Light Stage data with a small number of individuals) or limited to diffuse materials (e.g., commercial 3D scanned human models). Thus, the human relighting techniques suffer from the poor generalization capability and synthetic-to-real doma… ▽ More The modern supervised approaches for human image relighting rely on training data generated from 3D human models. However, such datasets are often small (e.g., Light Stage data with a small number of individuals) or limited to diffuse materials (e.g., commercial 3D scanned human models). Thus, the human relighting techniques suffer from the poor generalization capability and synthetic-to-real domain gap. In this paper, we propose a two-stage method for single-image human relighting with domain adaptation. In the first stage, we train a neural network for diffuse-only relighting. In the second stage, we train another network for enhancing non-diffuse reflection by learning residuals between real photos and images reconstructed by the diffuse-only network. Thanks to the second stage, we can achieve higher generalization capability against various cloth textures, while reducing the domain gap. Furthermore, to handle input videos, we integrate illumination-aware deep video prior to greatly reduce flickering artifacts even with challenging settings under dynamic illuminations. △ Less

Submitted 14 October, 2021; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: Accepted to Pacific Graphics 2021, project page: http://www.cgg.cs.tsukuba.ac.jp/~tajima/pub/relighting_in_the_wild/

arXiv:2106.13416 [pdf, other]

doi 10.1111/cgf.14164

Diversifying Semantic Image Synthesis and Editing via Class- and Layer-wise VAEs

Authors: Yuki Endo, Yoshihiro Kanamori

Abstract: Semantic image synthesis is a process for generating photorealistic images from a single semantic mask. To enrich the diversity of multimodal image synthesis, previous methods have controlled the global appearance of an output image by learning a single latent space. However, a single latent code is often insufficient for capturing various object styles because object appearance depends on multipl… ▽ More Semantic image synthesis is a process for generating photorealistic images from a single semantic mask. To enrich the diversity of multimodal image synthesis, previous methods have controlled the global appearance of an output image by learning a single latent space. However, a single latent code is often insufficient for capturing various object styles because object appearance depends on multiple factors. To handle individual factors that determine object styles, we propose a class- and layer-wise extension to the variational autoencoder (VAE) framework that allows flexible control over each object class at the local to global levels by learning multiple latent spaces. Furthermore, we demonstrate that our method generates images that are both plausible and more diverse compared to state-of-the-art methods via extensive experiments with real and synthetic datasets inthree different domains. We also show that our method enables a wide range of applications in image synthesis and editing tasks. △ Less

Submitted 29 June, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Accepted to Pacific Graphics 2020, codes available at https://github.com/endo-yuki-t/DiversifyingSMIS

arXiv:2103.14877 [pdf, other]

Few-shot Semantic Image Synthesis Using StyleGAN Prior

Authors: Yuki Endo, Yoshihiro Kanamori

Abstract: This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and… ▽ More This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings. △ Less

Submitted 12 May, 2021; v1 submitted 27 March, 2021; originally announced March 2021.

Comments: The source codes are available at https://github.com/endo-yuki-t/Fewshot-SMIS

arXiv:1910.07192 [pdf, other]

Animating Landscape: Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis

Authors: Yuki Endo, Yoshihiro Kanamori, Shigeru Kuriyama

Abstract: Automatic generation of a high-quality video from a single image remains a challenging task despite the recent advances in deep generative models. This paper proposes a method that can create a high-resolution, long-term animation using convolutional neural networks (CNNs) from a single landscape image where we mainly focus on skies and waters. Our key observation is that the motion (e.g., moving… ▽ More Automatic generation of a high-quality video from a single image remains a challenging task despite the recent advances in deep generative models. This paper proposes a method that can create a high-resolution, long-term animation using convolutional neural networks (CNNs) from a single landscape image where we mainly focus on skies and waters. Our key observation is that the motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes have different time scales. We thus learn them separately and predict them with decoupled control while handling future uncertainty in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially-smooth intermediate data, i.e., for motion, flow fields for warping, and for appearance, color transfer maps, via self-supervised learning, i.e., without explicitly-provided ground truth. These intermediate data are applied not to each previous output frame, but to the input image only once for each output frame. This design is crucial to alleviate error accumulation in long-term predictions, which is the essential problem in previous recurrent approaches. The output frames can be looped like cinemagraph, and also be controlled directly by specifying latent codes or indirectly via visual annotations. We demonstrate the effectiveness of our method through comparisons with the state-of-the-arts on video prediction as well as appearance manipulation. △ Less

Submitted 16 October, 2019; originally announced October 2019.

Comments: Published at SIGGRAPH Asia 2019 (ACM Transactions on Graphics)

arXiv:1908.02714 [pdf, other]

doi 10.1145/3272127.3275104

Relighting Humans: Occlusion-Aware Inverse Rendering for Full-Body Human Images

Authors: Yoshihiro Kanamori, Yuki Endo

Abstract: Relighting of human images has various applications in image synthesis. For relighting, we must infer albedo, shape, and illumination from a human portrait. Previous techniques rely on human faces for this inference, based on spherical harmonics (SH) lighting. However, because they often ignore light occlusion, inferred shapes are biased and relit images are unnaturally bright particularly at holl… ▽ More Relighting of human images has various applications in image synthesis. For relighting, we must infer albedo, shape, and illumination from a human portrait. Previous techniques rely on human faces for this inference, based on spherical harmonics (SH) lighting. However, because they often ignore light occlusion, inferred shapes are biased and relit images are unnaturally bright particularly at hollowed regions such as armpits, crotches, or garment wrinkles. This paper introduces the first attempt to infer light occlusion in the SH formulation directly. Based on supervised learning using convolutional neural networks (CNNs), we infer not only an albedo map, illumination but also a light transport map that encodes occlusion as nine SH coefficients per pixel. The main difficulty in this inference is the lack of training datasets compared to unlimited variations of human portraits. Surprisingly, geometric information including occlusion can be inferred plausibly even with a small dataset of synthesized human figures, by carefully preparing the dataset so that the CNNs can exploit the data coherency. Our method accomplishes more realistic relighting than the occlusion-ignored formulation. △ Less

Submitted 7 August, 2019; originally announced August 2019.

Comments: Published at SIGGRAPH Asia 2018 (ACM Transactions on Graphics). Project page with codes, pretrained models, and human model lists is at http://kanamori.cs.tsukuba.ac.jp/projects/relighting_human/

arXiv:1004.0599 [pdf]

Quantum Three-Pass protocol: Key distribution using quantum superposition states

Authors: Yoshito Kanamori, Seong-Moo Yoo

Abstract: This letter proposes a novel key distribution protocol with no key exchange in advance, which is secure as the BB84 quantum key distribution protocol. Our protocol utilizes a photon in superposition state for single-bit data transmission instead of a classical electrical/optical signal. The security of this protocol relies on the fact, that the arbitrary quantum state cannot be cloned, known as th… ▽ More This letter proposes a novel key distribution protocol with no key exchange in advance, which is secure as the BB84 quantum key distribution protocol. Our protocol utilizes a photon in superposition state for single-bit data transmission instead of a classical electrical/optical signal. The security of this protocol relies on the fact, that the arbitrary quantum state cannot be cloned, known as the no-cloning theorem. This protocol can be implemented with current technologies. △ Less

Submitted 5 April, 2010; originally announced April 2010.

Comments: 7Pages

Journal ref: International Journal of Network Security & Its Applications 1.2 (2009) 64-70

Showing 1–20 of 20 results for author: Kanamori, Y