-
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos
Authors:
Ekta Prashnani,
Koki Nagano,
Shalini De Mello,
David Luebke,
Orazio Gallo
Abstract:
Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their c…
▽ More
Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their consent? We term this task avatar fingerprinting. To tackle it, we first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call, where the synthetic videos are generated using the facial appearance of one person and the expressions of another. We verify the identity driving the expressions in a synthetic video, by learning motion signatures that are independent of the facial appearance shown. Our solution, the first in this space, achieves an average AUC of 0.85. Critical to its practical use, it also generalizes to new generators never seen in training (average AUC of 0.83). The proposed dataset and other resources can be found at: https://research.nvidia.com/labs/nxp/avatar-fingerprinting/.
△ Less
Submitted 4 August, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Generalizable Deepfake Detection with Phase-Based Motion Analysis
Authors:
Ekta Prashnani,
Michael Goebel,
B. S. Manjunath
Abstract:
We propose PhaseForensics, a DeepFake (DF) video detection method that leverages a phase-based motion representation of facial temporal dynamics. Existing methods relying on temporal inconsistencies for DF detection present many advantages over the typical frame-based methods. However, they still show limited cross-dataset generalization and robustness to common distortions. These shortcomings are…
▽ More
We propose PhaseForensics, a DeepFake (DF) video detection method that leverages a phase-based motion representation of facial temporal dynamics. Existing methods relying on temporal inconsistencies for DF detection present many advantages over the typical frame-based methods. However, they still show limited cross-dataset generalization and robustness to common distortions. These shortcomings are partially due to error-prone motion estimation and landmark tracking, or the susceptibility of the pixel intensity-based features to spatial distortions and the cross-dataset domain shifts. Our key insight to overcome these issues is to leverage the temporal phase variations in the band-pass components of the Complex Steerable Pyramid on face sub-regions. This not only enables a robust estimate of the temporal dynamics in these regions, but is also less prone to cross-dataset variations. Furthermore, the band-pass filters used to compute the local per-frame phase form an effective defense against the perturbations commonly seen in gradient-based adversarial attacks. Overall, with PhaseForensics, we show improved distortion and adversarial robustness, and state-of-the-art cross-dataset generalization, with 91.2% video-level AUC on the challenging CelebDFv2 (a recent state-of-the-art compares at 86.9%).
△ Less
Submitted 17 November, 2022;
originally announced November 2022.
-
LOCL: Learning Object-Attribute Composition using Localization
Authors:
Satish Kumar,
ASM Iftekhar,
Ekta Prashnani,
B. S. Manjunath
Abstract:
This paper describes LOCL (Learning Object Attribute Composition using Localization) that generalizes composition zero shot learning to objects in cluttered and more realistic settings. The problem of unseen Object Attribute (OA) associations has been well studied in the field, however, the performance of existing methods is limited in challenging scenes. In this context, our key contribution is a…
▽ More
This paper describes LOCL (Learning Object Attribute Composition using Localization) that generalizes composition zero shot learning to objects in cluttered and more realistic settings. The problem of unseen Object Attribute (OA) associations has been well studied in the field, however, the performance of existing methods is limited in challenging scenes. In this context, our key contribution is a modular approach to localizing objects and attributes of interest in a weakly supervised context that generalizes robustly to unseen configurations. Localization coupled with a composition classifier significantly outperforms state of the art (SOTA) methods, with an improvement of about 12% on currently available challenging datasets. Further, the modularity enables the use of localized feature extractor to be used with existing OA compositional learning methods to improve their overall performance.
△ Less
Submitted 7 October, 2022;
originally announced October 2022.
-
Noise-Aware Video Saliency Prediction
Authors:
Ekta Prashnani,
Orazio Gallo,
Joohwan Kim,
Josef Spjut,
Pradeep Sen,
Iuri Frosio
Abstract:
We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies with the frame, as it depends on the content of the scene. This issue is particularly pressing when a limited number of observers are available. In such cases, directly minimizing the discrepancy between the pre…
▽ More
We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies with the frame, as it depends on the content of the scene. This issue is particularly pressing when a limited number of observers are available. In such cases, directly minimizing the discrepancy between the predicted and measured saliency maps, as traditional deep-learning methods do, results in overfitting to the noisy data. We propose a noise-aware training (NAT) paradigm that quantifies and accounts for the uncertainty arising from frame-specific gaze data inaccuracy. We show that NAT is especially advantageous when limited training data is available, with experiments across different models, loss functions, and datasets. We also introduce a video game-based saliency dataset, with rich temporal semantics, and multiple gaze attractors per frame. The dataset and source code are available at https://github.com/NVlabs/NAT-saliency.
△ Less
Submitted 22 November, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
PieAPP: Perceptual Image-Error Assessment through Pairwise Preference
Authors:
Ekta Prashnani,
Hong Cai,
Yasamin Mostofi,
Pradeep Sen
Abstract:
The ability to estimate the perceptual error between images is an important problem in computer vision with many applications. Although it has been studied extensively, however, no method currently exists that can robustly predict visual differences like humans. Some previous approaches used hand-coded models, but they fail to model the complexity of the human visual system. Others used machine le…
▽ More
The ability to estimate the perceptual error between images is an important problem in computer vision with many applications. Although it has been studied extensively, however, no method currently exists that can robustly predict visual differences like humans. Some previous approaches used hand-coded models, but they fail to model the complexity of the human visual system. Others used machine learning to train models on human-labeled datasets, but creating large, high-quality datasets is difficult because people are unable to assign consistent error labels to distorted images. In this paper, we present a new learning-based method that is the first to predict perceptual image error like human observers. Since it is much easier for people to compare two given images and identify the one more similar to a reference than to assign quality scores to each, we propose a new, large-scale dataset labeled with the probability that humans will prefer one image over another. We then train a deep-learning model using a novel, pairwise-learning framework to predict the preference of one distorted image over the other. Our key observation is that our trained network can then be used separately with only one distorted image and a reference to predict its perceptual error, without ever being trained on explicit human perceptual-error labels. The perceptual error estimated by our new metric, PieAPP, is well-correlated with human opinion. Furthermore, it significantly outperforms existing algorithms, beating the state-of-the-art by almost 3x on our test set in terms of binary error rate, while also generalizing to new kinds of distortions, unlike previous learning-based methods.
△ Less
Submitted 6 June, 2018;
originally announced June 2018.