-
Artifact magnification on deepfake videos increases human detection and subjective confidence
Authors:
Emilie Josephs,
Camilo Fosco,
Aude Oliva
Abstract:
The development of technologies for easily and automatically falsifying video has raised practical questions about people's ability to detect false information online. How vulnerable are people to deepfake videos? What technologies can be applied to boost their performance? Human susceptibility to deepfake videos is typically measured in laboratory settings, which do not reflect the challenges of real-world browsing. In typical browsing, deepfakes are rare, engagement with the video may be short, participants may be distracted, or the video streaming quality may be degraded. Here, we tested deepfake detection under these ecological viewing conditions, and found that detection was lowered in all cases. Principles from signal detection theory indicated that different viewing conditions affected different dimensions of detection performance. Overall, this suggests that the current literature underestimates people's susceptibility to deepfakes. Next, we examined how computer vision models might be integrated into users' decision process to increase accuracy and confidence during deepfake detection. We evaluated the effectiveness of communicating the model's prediction to the user by amplifying artifacts in fake videos. We found that artifact amplification was highly effective at making fake video distinguishable from real, in a manner that was robust across viewing conditions. Additionally, compared to a traditional text-based prompt, artifact amplification was more convincing: people accepted the model's suggestion more often, and reported higher final confidence in their model-supported decision, particularly for more challenging videos. Overall, this suggests that visual indicators that cause distortions on fake videos may be highly effective at mitigating the impact of falsified video.
Submitted 10 April, 2023;
originally announced April 2023.
-
Overview of The MediaEval 2022 Predicting Video Memorability Task
Authors:
Lorin Sweeney,
Mihai Gabriel Constantin,
Claire-Hélène Demarty,
Camilo Fosco,
Alba G. Seco de Herrera,
Sebastian Halder,
Graham Healy,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Mushfika Sultana
Abstract:
This paper describes the 5th edition of the Predicting Video Memorability Task as part of MediaEval2022. This year we have reorganised and simplified the task in order to lubricate a greater depth of inquiry. Similar to last year, two datasets are provided in order to facilitate generalisation, however, this year we have replaced the TRECVid2019 Video-to-Text dataset with the VideoMem dataset in order to remedy underlying data quality issues, and to prioritise short-term memorability prediction by elevating the Memento10k dataset as the primary dataset. Additionally, a fully fledged electroencephalography (EEG)-based prediction sub-task is introduced. In this paper, we outline the core facets of the task and its constituent sub-tasks; describing the datasets, evaluation metrics, and requirements for participant submissions.
Submitted 13 December, 2022;
originally announced December 2022.
-
Experiences from the MediaEval Predicting Media Memorability Task
Authors:
Alba García Seco de Herrera,
Mihai Gabriel Constantin,
Claire-Hélène Demarty,
Camilo Fosco,
Sebastian Halder,
Graham Healy,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Mushfika Sultana,
Lorin Sweeney
Abstract:
The Predicting Media Memorability task in the MediaEval evaluation campaign has been running annually since 2018 and several different tasks and data sets have been used in this time. This has allowed us to compare the performance of many memorability prediction techniques on the same data and in a reproducible way and to refine and improve on those techniques. The resources created to compute media memorability are now being used by researchers well beyond the actual evaluation campaign. In this paper we present a summary of the task, including the collective lessons we have learned for the research community.
Submitted 7 December, 2022;
originally announced December 2022.
-
Deepfake Caricatures: Amplifying attention to artifacts increases deepfake detection by humans and machines
Authors:
Camilo Fosco,
Emilie Josephs,
Alex Andonian,
Allen Lee,
Xi Wang,
Aude Oliva
Abstract:
Deepfakes pose a serious threat to digital well-being by fueling misinformation. As deepfakes get harder to recognize with the naked eye, human users become increasingly reliant on deepfake detection models to decide if a video is real or fake. Currently, models yield a prediction for a video's authenticity, but do not integrate a method for alerting a human user. We introduce a framework for amplifying artifacts in deepfake videos to make them more detectable by people. We propose a novel, semi-supervised Artifact Attention module, which is trained on human responses to create attention maps that highlight video artifacts. These maps make two contributions. First, they improve the performance of our deepfake detection classifier. Second, they allow us to generate novel "Deepfake Caricatures": transformations of the deepfake that exacerbate artifacts to improve human detection. In a user study, we demonstrate that Caricatures greatly increase human detection, across video presentation times and user engagement levels. Overall, we demonstrate the success of a human-centered approach to designing deepfake mitigation methods.
Submitted 10 April, 2023; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Overview of The MediaEval 2021 Predicting Media Memorability Task
Authors:
Rukiye Savran Kiziltepe,
Mihai Gabriel Constantin,
Claire-Helene Demarty,
Graham Healy,
Camilo Fosco,
Alba Garcia Seco de Herrera,
Sebastian Halder,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Lorin Sweeney
Abstract:
This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two datasets of videos are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10K dataset in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants' submissions.
Submitted 11 December, 2021;
originally announced December 2021.
-
VA-RED$^2$: Video Adaptive Redundancy Reduction
Authors:
Bowen Pan,
Rameswar Panda,
Camilo Fosco,
Chung-Ching Lin,
Alex Andonian,
Yue Meng,
Kate Saenko,
Aude Oliva,
Rogerio Feris
Abstract:
Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames which can translate into redundancy in either temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy while videos focusing on objects tend to have more channel redundancy. Here we present a redundancy reduction framework, termed VA-RED$^2$, which is input-dependent. Specifically, our VA-RED$^2$ framework uses an input-dependent policy to decide how many features need to be computed for temporal and channel dimensions. To keep the capacity of the original model, after fully computing the necessary features, we reconstruct the remaining redundant features from those using cheap linear operations. We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism, making it highly efficient. Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves $20\% - 40\%$ reduction in computation (FLOPs) when compared to state-of-the-art methods without any performance loss. Project page: http://people.csail.mit.edu/bpan/va-red/.
Submitted 4 October, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability
Authors:
Anelise Newman,
Camilo Fosco,
Vincent Casser,
Allen Lee,
Barry McNamara,
Aude Oliva
Abstract:
A key capability of an intelligent system is deciding when events from past experience must be remembered and when they can be forgotten. Towards this goal, we develop a predictive model of human visual event memory and how those memories decay over time. We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays. Based on our findings we propose a new mathematical formulation of memorability decay, resulting in a model that is able to produce the first quantitative estimation of how a video decays in memory over time. In contrast with previous work, our model can predict the probability that a video will be remembered at an arbitrary delay. Importantly, our approach combines visual and semantic information (in the form of textual captions) to fully represent the meaning of events. Our experiments on two video memorability benchmarks, including Memento10k, show that our model significantly improves upon the best prior approach (by 12% on average).
Submitted 5 September, 2020;
originally announced September 2020.
-
We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
Authors:
Alex Andonian,
Camilo Fosco,
Mathew Monfort,
Allen Lee,
Rogerio Feris,
Carl Vondrick,
Aude Oliva
Abstract:
Identifying common patterns among events is a key ability in human and machine perception, as it underlies intelligent decision making. We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision.
Submitted 12 August, 2020;
originally announced August 2020.
-
Predicting Visual Importance Across Graphic Design Types
Authors:
Camilo Fosco,
Vincent Casser,
Amish Kumar Bedi,
Peter O'Donovan,
Aaron Hertzmann,
Zoya Bylinskii
Abstract:
This paper introduces a Unified Model of Saliency and Importance (UMSI), which learns to predict visual importance in input graphic designs, and saliency in natural images, along with a new dataset and applications. Previous methods for predicting saliency or visual importance are trained individually on specialized datasets, making them limited in application and leading to poor generalization on novel image classes, while requiring a user to know which model to apply to which input. UMSI is a deep learning-based model simultaneously trained on images from different design classes, including posters, infographics, mobile UIs, as well as natural images, and includes an automatic classification module to classify the input. This allows the model to work more effectively without requiring a user to label the input. We also introduce Imp1k, a new dataset of designs annotated with importance information. We demonstrate two new design interfaces that use importance prediction, including a tool for adjusting the relative importance of design elements, and a tool for reflowing designs to new aspect ratios while preserving visual importance. The model, code, and importance dataset are available at https://predimportance.mit.edu .
Submitted 6 August, 2020;
originally announced August 2020.
-
TurkEyes: A Web-Based Toolbox for Crowdsourcing Attention Data
Authors:
Anelise Newman,
Barry McNamara,
Camilo Fosco,
Yun Bin Zhang,
Pat Sukhum,
Matthew Tancik,
Nam Wook Kim,
Zoya Bylinskii
Abstract:
Eye movements provide insight into what parts of an image a viewer finds most salient, interesting, or relevant to the task at hand. Unfortunately, eye tracking data, a commonly-used proxy for attention, is cumbersome to collect. Here we explore an alternative: a comprehensive web-based toolbox for crowdsourcing visual attention. We draw from four main classes of attention-capturing methodologies in the literature. ZoomMaps is a novel "zoom-based" interface that captures viewing on a mobile phone. CodeCharts is a "self-reporting" methodology that records points of interest at precise viewing durations. ImportAnnots is an "annotation" tool for selecting important image regions, and "cursor-based" BubbleView lets viewers click to deblur a small area. We compare these methodologies using a common analysis framework in order to develop appropriate use cases for each interface. This toolbox and our analyses provide a blueprint for how to gather attention data at scale without an eye tracker.
Submitted 13 January, 2020;
originally announced January 2020.