-
Rich Human Feedback for Text-to-Image Generation
Authors:
Youwei Liang,
Junfeng He,
Gang Li,
Peizhao Li,
Arseniy Klimovskiy,
Nicholas Carolan,
Jiao Sun,
Jordi Pont-Tuset,
Sarah Young,
Feng Yang,
Junjie Ke,
Krishnamurthy Dj Dvijotham,
Katie Collins,
Yiwen Luo,
Yang Li,
Kai J Kohlhoff,
Deepak Ramachandran,
Vidhya Navalpakkam
Abstract:
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented in or missing from the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
Submitted 8 April, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
UniAR: A Unified model for predicting human Attention and Responses on visual content
Authors:
Peizhao Li,
Junfeng He,
Gang Li,
Rachit Bhargava,
Shaolei Shen,
Nachiappan Valliappan,
Youwei Liang,
Hongxiang Gu,
Venky Ramachandran,
Golnaz Farhadi,
Yang Li,
Kai J Kohlhoff,
Vidhya Navalpakkam
Abstract:
Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has focused on modeling implicit and explicit human behavior in isolation, and has often been limited to a specific type of visual content. We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements.
Submitted 31 October, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Differentially Private Heatmaps
Authors:
Badih Ghazi,
Junfeng He,
Kai Kohlhoff,
Ravi Kumar,
Pasin Manurangsi,
Vidhya Navalpakkam,
Nachiappan Valliappan
Abstract:
We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private (DP) algorithm for this task and demonstrate its advantages over previous algorithms on real-world datasets.
Our core algorithmic primitive is a DP procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption and that these are near-optimal.
Submitted 24 November, 2022;
originally announced November 2022.
-
GazeGAN - Unpaired Adversarial Image Generation for Gaze Estimation
Authors:
Matan Sela,
Pingmei Xu,
Junfeng He,
Vidhya Navalpakkam,
Dmitry Lagun
Abstract:
Recent research has demonstrated the ability to estimate gaze on mobile devices by performing inference on the image from the phone's front-facing camera, without requiring specialized hardware. While this offers wide potential applications such as in human-computer interaction, medical diagnosis and accessibility (e.g., hands-free gaze as input for patients with motor disorders), current methods are limited as they rely on collecting data from real users, which is a tedious and expensive process that is hard to scale across devices. There have been some attempts to synthesize eye region data using 3D models that can simulate various head poses and camera settings; however, these lack realism.
In this paper, we improve upon a recently suggested method, and propose a generative adversarial framework to generate a large dataset of high resolution colorful images with high diversity (e.g., in subjects, head pose, camera settings) and realism, while simultaneously preserving the accuracy of gaze labels. The proposed approach operates on extended regions of the eye, and even completes missing parts of the image. Using this rich synthesized dataset, and without using any additional training data from real users, we demonstrate improvements over state-of-the-art for estimating 2D gaze position on mobile devices. We further demonstrate cross-device generalization of model performance, as well as improved robustness to diverse head pose, blur and distance.
Submitted 27 November, 2017;
originally announced November 2017.
-
The Impact of Visual Appearance on User Response in Online Display Advertising
Authors:
Javad Azimi,
Ruofei Zhang,
Yang Zhou,
Vidhya Navalpakkam,
Jianchang Mao,
Xiaoli Fern
Abstract:
Display advertising has been a significant source of revenue for publishers and ad networks in the online advertising ecosystem. One of the main goals in display advertising is to maximize user response rate for advertising campaigns, such as click-through rates (CTR) or conversion rates. Although in the online advertising industry we believe that the visual appearance of ads (creatives) matters for propensity of user response, there is no published work so far to address this topic via a systematic data-driven approach. In this paper we quantitatively study the relationship between the visual appearance and performance of creatives using large scale data in the world's largest display ads exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and some are inspired by related work. We extracted these features from real creatives served on RightMedia. We also designed and conducted a series of experiments to evaluate the effectiveness of visual features for CTR prediction, ranking and performance classification. Based on the evaluation results, we selected a subset of features that have the most important impact on CTR. We believe that the findings presented in this paper will be very useful for the online advertising industry in designing high-performance creatives. It also provides the research community with the first ever data set, initial insights into visual appearance's effect on user response propensity, and evaluation benchmarks for further study.
Submitted 4 April, 2012; v1 submitted 9 February, 2012;
originally announced February 2012.