-
AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild
Authors:
Nicholas Dufour,
Arkanath Pathak,
Pouya Samangouei,
Nikki Hariri,
Shashi Deshetti,
Andrew Dudfield,
Christopher Guess,
Pablo Hernández Escayola,
Bobby Tran,
Mevan Babakar,
Christoph Bregler
Abstract:
The prevalence and harms of online misinformation is a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely-accessible methods for synthesizing realistic audio, images, video and human-li…
▽ More
The prevalence and harms of online misinformation is a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely-accessible methods for synthesizing realistic audio, images, video and human-like text, have amplified these concerns. Despite intense public interest and significant press coverage, quantitative information on the prevalence and modality of media-based misinformation remains scarce. Here, we present the results of a two-year study using human raters to annotate online media-based misinformation, mostly focusing on images, based on claims assessed in a large sample of publicly-accessible fact checks with the ClaimReview markup. We present an image typology, designed to capture aspects of the image and manipulation relevant to the image's role in the misinformation claim. We visualize the distribution of these types over time. We show the rise of generative AI-based content in misinformation claims, and that its commonality is a relatively recent phenomenon, occurring significantly after heavy press coverage. We also show "simple" methods dominated historically, particularly context manipulations, and continued to hold a majority as of the end of data collection in November 2023. The dataset, Annotated Misinformation, Media-Based (AMMeBa), is publicly-available, and we hope that these data will serve as both a means of evaluating mitigation methods in a realistic setting and as a first-of-its-kind census of the types and modalities of online misinformation.
△ Less
Submitted 21 May, 2024; v1 submitted 19 May, 2024;
originally announced May 2024.
-
Training-Free Neural Matte Extraction for Visual Effects
Authors:
Sharif Elcott,
J. P. Lewis,
Nori Kanazawa,
Christoph Bregler
Abstract:
Alpha matting is widely used in video conferencing as well as in movies, television, and social media sites. Deep learning approaches to the matte extraction problem are well suited to video conferencing due to the consistent subject matter (front-facing humans), however training-based approaches are somewhat pointless for entertainment videos where varied subjects (spaceships, monsters, etc.) may…
▽ More
Alpha matting is widely used in video conferencing as well as in movies, television, and social media sites. Deep learning approaches to the matte extraction problem are well suited to video conferencing due to the consistent subject matter (front-facing humans), however training-based approaches are somewhat pointless for entertainment videos where varied subjects (spaceships, monsters, etc.) may appear only a few times in a single movie -- if a method of creating ground truth for training exists, just use that method to produce the desired mattes. We introduce a training-free high quality neural matte extraction approach that specifically targets the assumptions of visual effects production. Our approach is based on the deep image prior, which optimizes a deep neural network to fit a single image, thereby providing a deep encoding of the particular image. We make use of the representations in the penultimate layer to interpolate coarse and incomplete "trimap" constraints. Videos processed with this approach are temporally consistent. The algorithm is both very simple and surprisingly effective.
△ Less
Submitted 29 June, 2023;
originally announced June 2023.
-
ACM Multimedia Grand Challenge on Detecting Cheapfakes
Authors:
Shivangi Aneja,
Cise Midoglu,
Duc-Tien Dang-Nguyen,
Sohail Ahmed Khan,
Michael Riegler,
Pål Halvorsen,
Chris Bregler,
Balu Adsumilli
Abstract:
Cheapfake is a recently coined term that encompasses non-AI (``cheap'') manipulations of multimedia content. Cheapfakes are known to be more prevalent than deepfakes. Cheapfake media can be created using editing software for image/video manipulations, or even without using any software, by simply altering the context of an image/video by sharing the media alongside misleading claims. This alterati…
▽ More
Cheapfake is a recently coined term that encompasses non-AI (``cheap'') manipulations of multimedia content. Cheapfakes are known to be more prevalent than deepfakes. Cheapfake media can be created using editing software for image/video manipulations, or even without using any software, by simply altering the context of an image/video by sharing the media alongside misleading claims. This alteration of context is referred to as out-of-context (OOC) misuse of media. OOC media is much harder to detect than fake media, since the images and videos are not tampered. In this challenge, we focus on detecting OOC images, and more specifically the misuse of real photographs with conflicting image captions in news items. The aim of this challenge is to develop and benchmark models that can be used to detect whether given samples (news image and associated captions) are OOC, based on the recently compiled COSMOS dataset.
△ Less
Submitted 29 July, 2022;
originally announced July 2022.
-
MMSys'21 Grand Challenge on Detecting Cheapfakes
Authors:
Shivangi Aneja,
Cise Midoglu,
Duc-Tien Dang-Nguyen,
Michael Alexander Riegler,
Paal Halvorsen,
Matthias Niessner,
Balu Adsumilli,
Chris Bregler
Abstract:
Cheapfake is a recently coined term that encompasses non-AI ("cheap") manipulations of multimedia content. Cheapfakes are known to be more prevalent than deepfakes. Cheapfake media can be created using editing software for image/video manipulations, or even without using any software, by simply altering the context of an image/video by sharing the media alongside misleading claims. This alteration…
▽ More
Cheapfake is a recently coined term that encompasses non-AI ("cheap") manipulations of multimedia content. Cheapfakes are known to be more prevalent than deepfakes. Cheapfake media can be created using editing software for image/video manipulations, or even without using any software, by simply altering the context of an image/video by sharing the media alongside misleading claims. This alteration of context is referred to as out-of-context (OOC) misuse} of media. OOC media is much harder to detect than fake media, since the images and videos are not tampered. In this challenge, we focus on detecting OOC images, and more specifically the misuse of real photographs with conflicting image captions in news items. The aim of this challenge is to develop and benchmark models that can be used to detect whether given samples (news image and associated captions) are OOC, based on the recently compiled COSMOS dataset.
△ Less
Submitted 12 July, 2021;
originally announced July 2021.
-
LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
Authors:
Avisek Lahiri,
Vivek Kwatra,
Christian Frueh,
John Lewis,
Chris Bregler
Abstract:
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D fac…
▽ More
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores. We illustrate several applications enabled by our framework.
△ Less
Submitted 8 June, 2021;
originally announced June 2021.
-
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning
Authors:
Shivangi Aneja,
Chris Bregler,
Matthias Nießner
Abstract:
Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our key insight is to leverage the grounding of image with text to distinguish out-of-c…
▽ More
Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our key insight is to leverage the grounding of image with text to distinguish out-of-context scenarios that cannot be disambiguated with language alone. We propose a self-supervised training strategy where we only need a set of captioned images. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check if both captions correspond to the same object(s) in the image but are semantically different, which allows us to make fairly accurate out-of-context predictions. Our method achieves 85% out-of-context detection accuracy. To facilitate benchmarking of this task, we create a large-scale dataset of 200K images with 450K textual captions from a variety of news websites, blogs, and social media posts. The dataset and source code is publicly available at https://shivangi-aneja.github.io/projects/cosmos/.
△ Less
Submitted 21 April, 2021; v1 submitted 15 January, 2021;
originally announced January 2021.
-
SimPose: Effectively Learning DensePose and Surface Normals of People from Simulated Data
Authors:
Tyler Zhu,
Per Karlsson,
Christoph Bregler
Abstract:
With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtained strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approac…
▽ More
With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtained strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approach outperforms the state-of-the-art methods which are trained on real images that are densely labelled. This is an important result since obtaining human manifold's intrinsic uv coordinates on real images is time consuming and prone to labeling noise. Additionally, we present our model's 3D surface normal predictions on the MSCOCO dataset that lacks any real 3D surface normal labels. The key to our approach is to mitigate the "Inter-domain Covariate Shift" with a carefully selected training batch from a mixture of domain samples, a deep batch-normalized residual network, and a modified multi-task learning objective. Our approach is complementary to existing domain-adaptation techniques and can be applied to other dense per-pixel pose estimation problems.
△ Less
Submitted 30 July, 2020;
originally announced July 2020.
-
Cross-Domain Image Manipulation by Demonstration
Authors:
Ben Usman,
Nick Dufour,
Kate Saenko,
Chris Bregler
Abstract:
In this work we propose a model that can manipulate individual visual attributes of objects in a real scene using examples of how respective attribute manipulations affect the output of a simulation. As an example, we train our model to manipulate the expression of a human face using nonphotorealistic 3D renders of a face with varied expression. Our model manages to preserve all other visual attri…
▽ More
In this work we propose a model that can manipulate individual visual attributes of objects in a real scene using examples of how respective attribute manipulations affect the output of a simulation. As an example, we train our model to manipulate the expression of a human face using nonphotorealistic 3D renders of a face with varied expression. Our model manages to preserve all other visual attributes of a real face, such as head orientation, even though this and other attributes are not labeled in either real or synthetic domain. Since our model learns to manipulate a specific property in isolation using only "synthetic demonstrations" of such manipulations without explicitly provided labels, it can be applied to shape, texture, lighting, and other properties that are difficult to measure or represent as real-valued vectors. We measure the degree to which our model preserves other attributes of a real image when a single specific attribute is manipulated. We use digit datasets to analyze how discrepancy in attribute distributions affects the performance of our model, and demonstrate results in a far more difficult setting: learning to manipulate real human faces using nonphotorealistic 3D renders.
△ Less
Submitted 3 April, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
Towards Accurate Multi-person Pose Estimation in the Wild
Authors:
George Papandreou,
Tyler Zhu,
Nori Kanazawa,
Alexander Toshev,
Jonathan Tompson,
Chris Bregler,
Kevin Murphy
Abstract:
We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the…
▽ More
We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring.
Trained on COCO data alone, our final system achieves average precision of 0.649 on the COCO test-dev set and the 0.643 test-standard sets, outperforming the winner of the 2016 COCO keypoints challenge and other recent state-of-art. Further, by using additional in-house labeled data we obtain an even higher average precision of 0.685 on the test-dev set and 0.673 on the test-standard set, more than 5% absolute improvement compared to the previous best performing method on the same dataset.
△ Less
Submitted 14 April, 2017; v1 submitted 6 January, 2017;
originally announced January 2017.
-
Efficient Object Localization Using Convolutional Networks
Authors:
Jonathan Tompson,
Ross Goroshin,
Arjun Jain,
Yann LeCun,
Christopher Bregler
Abstract:
Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architectu…
▽ More
Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient `position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC dataset and outperforms all existing approaches on the MPII-human-pose dataset.
△ Less
Submitted 9 June, 2015; v1 submitted 16 November, 2014;
originally announced November 2014.
-
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
Authors:
Arjun Jain,
Jonathan Tompson,
Yann LeCun,
Christoph Bregler
Abstract:
In this work, we propose a novel and efficient method for articulated human pose estimation in videos using a convolutional network architecture, which incorporates both color and motion features. We propose a new human body pose dataset, FLIC-motion, that extends the FLIC dataset with additional motion features. We apply our architecture to this dataset and report significantly better performance…
▽ More
In this work, we propose a novel and efficient method for articulated human pose estimation in videos using a convolutional network architecture, which incorporates both color and motion features. We propose a new human body pose dataset, FLIC-motion, that extends the FLIC dataset with additional motion features. We apply our architecture to this dataset and report significantly better performance than current state-of-the-art pose detection systems.
△ Less
Submitted 28 September, 2014;
originally announced September 2014.
-
Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
Authors:
Jonathan Tompson,
Arjun Jain,
Yann LeCun,
Christoph Bregler
Abstract:
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint tr…
▽ More
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
△ Less
Submitted 17 September, 2014; v1 submitted 11 June, 2014;
originally announced June 2014.
-
Learning Human Pose Estimation Features with Convolutional Networks
Authors:
Arjun Jain,
Jonathan Tompson,
Mykhaylo Andriluka,
Graham W. Taylor,
Christoph Bregler
Abstract:
This paper introduces a new architecture for human pose estimation using a multi- layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the cur…
▽ More
This paper introduces a new architecture for human pose estimation using a multi- layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably, that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve somewhat the overall result, but to a much lesser extent then expected. Many researchers previously argued that the kinematic structure and top-down information is crucial for this domain, but with our purely bottom up, and weak spatial model, we could improve other more complicated architectures that currently produce the best results. This mirrors what many other researchers, like those in the speech recognition, object recognition, and other domains have experienced.
△ Less
Submitted 23 April, 2014; v1 submitted 27 December, 2013;
originally announced December 2013.
-
Markerless Motion Capture in the Crowd
Authors:
Ian Spiro,
Thomas Huston,
Christoph Bregler
Abstract:
This work uses crowdsourcing to obtain motion capture data from video recordings. The data is obtained by information workers who click repeatedly to indicate body configurations in the frames of a video, resulting in a model of 2D structure over time. We discuss techniques to optimize the tracking task and strategies for maximizing accuracy and efficiency. We show visualizations of a variety of m…
▽ More
This work uses crowdsourcing to obtain motion capture data from video recordings. The data is obtained by information workers who click repeatedly to indicate body configurations in the frames of a video, resulting in a model of 2D structure over time. We discuss techniques to optimize the tracking task and strategies for maximizing accuracy and efficiency. We show visualizations of a variety of motions captured with our pipeline then apply reconstruction techniques to derive 3D structure.
△ Less
Submitted 16 April, 2012;
originally announced April 2012.