-
Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
Authors:
Manuel Kansy,
Jacek Naruniec,
Christopher Schroers,
Markus Gross,
Romann M. Weber
Abstract:
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video mo…
▽ More
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
Project website: https://mkansy.github.io/reenact-anything/
△ Less
Submitted 23 May, 2025; v1 submitted 1 August, 2024;
originally announced August 2024.
-
Controllable Inversion of Black-Box Face Recognition Models via Diffusion
Authors:
Manuel Kansy,
Anton Raƫl,
Graziana Mignone,
Jacek Naruniec,
Christopher Schroers,
Markus Gross,
Romann M. Weber
Abstract:
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e. black-box setting). A variety of methods have been propose…
▽ More
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e. black-box setting). A variety of methods have been proposed in literature for this task, but they have serious shortcomings such as a lack of realistic outputs and strong requirements for the data set and accessibility of the face recognition model. By analyzing the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity both qualitatively and quantitatively, and our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process.
△ Less
Submitted 30 September, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Augmentation for small object detection
Authors:
Mate Kisantal,
Zbigniew Wojna,
Jakub Murawski,
Jacek Naruniec,
Kyunghyun Cho
Abstract:
In recent years, object detection has experienced impressive progress. Despite these improvements, there is still a significant gap in the performance between the detection of small and large objects. We analyze the current state-of-the-art model, Mask-RCNN, on a challenging dataset, MS COCO. We show that the overlap between small ground-truth objects and the predicted anchors is much lower than t…
▽ More
In recent years, object detection has experienced impressive progress. Despite these improvements, there is still a significant gap in the performance between the detection of small and large objects. We analyze the current state-of-the-art model, Mask-RCNN, on a challenging dataset, MS COCO. We show that the overlap between small ground-truth objects and the predicted anchors is much lower than the expected IoU threshold. We conjecture this is due to two factors; (1) only a few images are containing small objects, and (2) small objects do not appear enough even within each image containing them. We thus propose to oversample those images with small objects and augment each of those images by copy-pasting small objects many times. It allows us to trade off the quality of the detector on large objects with that on small objects. We evaluate different pasting augmentation strategies, and ultimately, we achieve 9.7\% relative improvement on the instance segmentation and 7.1\% on the object detection of small objects, compared to the current state of the art method on MS COCO.
△ Less
Submitted 19 February, 2019;
originally announced February 2019.
-
Face Alignment Using K-Cluster Regression Forests With Weighted Splitting
Authors:
Marek Kowalski,
Jacek Naruniec
Abstract:
In this work we present a face alignment pipeline based on two novel methods: weighted splitting for K-cluster Regression Forests and 3D Affine Pose Regression for face shape initialization. Our face alignment method is based on the Local Binary Feature framework, where instead of standard regression forests and pixel difference features used in the original method, we use our K-cluster Regression…
▽ More
In this work we present a face alignment pipeline based on two novel methods: weighted splitting for K-cluster Regression Forests and 3D Affine Pose Regression for face shape initialization. Our face alignment method is based on the Local Binary Feature framework, where instead of standard regression forests and pixel difference features used in the original method, we use our K-cluster Regression Forests with Weighted Splitting (KRFWS) and Pyramid HOG features. We also use KRFWS to perform Affine Pose Regression (APR) and 3D-Affine Pose Regression (3D-APR), which intend to improve the face shape initialization. APR applies a rigid 2D transform to the initial face shape that compensates for inaccuracy in the initial face location, size and in-plane rotation. 3D-APR estimates the parameters of a 3D transform that additionally compensates for out-of-plane rotation. The resulting pipeline, consisting of APR and 3D-APR followed by face alignment, shows an improvement of 20% over standard LBF on the challenging IBUG dataset, and state-of-theart accuracy on the entire 300-W dataset.
△ Less
Submitted 6 June, 2017;
originally announced June 2017.
-
Deep Alignment Network: A convolutional neural network for robust face alignment
Authors:
Marek Kowalski,
Jacek Naruniec,
Tomasz Trzcinski
Abstract:
In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. Thi…
▽ More
In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the state-of-the-art failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge.
△ Less
Submitted 10 August, 2017; v1 submitted 6 June, 2017;
originally announced June 2017.