-
Exploring Consequences of Privacy Policies with Narrative Generation via Answer Set Programming
Authors:
Chinmaya Dabral,
Emma Tosch,
Chris Martens
Abstract:
Informed consent has become increasingly salient for data privacy and its regulation. Entities from governments to for-profit companies have addressed concerns about data privacy with policies that enumerate the conditions for personal data storage and transfer. However, increased enumeration of and transparency in data privacy policies has not improved end-users' comprehension of how their data m…
▽ More
Informed consent has become increasingly salient for data privacy and its regulation. Entities from governments to for-profit companies have addressed concerns about data privacy with policies that enumerate the conditions for personal data storage and transfer. However, increased enumeration of and transparency in data privacy policies has not improved end-users' comprehension of how their data might be used: not only are privacy policies written in legal language that users may struggle to understand, but elements of these policies may compose in such a way that the consequences of the policy are not immediately apparent.
We present a framework that uses Answer Set Programming (ASP) -- a type of logic programming -- to formalize privacy policies. Privacy policies thus become constraints on a narrative planning space, allowing end-users to forward-simulate possible consequences of the policy in terms of actors having roles and taking actions in a domain. We demonstrate through the example of the Health Insurance Portability and Accountability Act (HIPAA) how to use the system in various ways, including asking questions about possibilities and identifying which clauses of the law are broken by a given sequence of events.
△ Less
Submitted 13 December, 2022;
originally announced December 2022.
-
PlanAlyzer: Assessing Threats to the Validity of Online Experiments
Authors:
Emma Tosch,
Eytan Bakshy,
Emery D. Berger,
David D. Jensen,
J. Eliot B. Moss
Abstract:
Online experiments are ubiquitous. As the scale of experiments has grown, so has the complexity of their design and implementation. In response, firms have developed software frameworks for designing and deploying online experiments. Ensuring that experiments in these frameworks are correctly designed and that their results are trustworthy---referred to as *internal validity*---can be difficult. C…
▽ More
Online experiments are ubiquitous. As the scale of experiments has grown, so has the complexity of their design and implementation. In response, firms have developed software frameworks for designing and deploying online experiments. Ensuring that experiments in these frameworks are correctly designed and that their results are trustworthy---referred to as *internal validity*---can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.
We present the first approach for statically checking the internal validity of online experiments. Our checks are based on well-known problems that arise in experimental design and causal inference. Our analyses target PlanOut, a widely deployed, open-source experimentation framework that uses a domain-specific language to specify and run complex experiments. We have built a tool, PlanAlyzer, that checks PlanOut programs for a variety of threats to internal validity, including failures of randomization, treatment assignment, and causal sufficiency. PlanAlyzer uses its analyses to automatically generate *contrasts*, a key type of information required to perform valid statistical analyses over experimental results. We demonstrate PlanAlyzer's utility on a corpus of PlanOut scripts deployed in production at Facebook, and we evaluate its ability to identify threats to validity on a mutated subset of this corpus. PlanAlyzer has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it automatically generates match hand-specified data.
△ Less
Submitted 30 September, 2019;
originally announced September 2019.
-
Toybox: A Suite of Environments for Experimental Evaluation of Deep Reinforcement Learning
Authors:
Emma Tosch,
Kaleigh Clary,
John Foley,
David Jensen
Abstract:
Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, learned policies are largely opaque, and hypotheses about the behavior of deep RL agents are difficult to test in black-box environments. Considerable effort has gone into addressing opacity, but almost no effort has been devoted to producing high quality environments for experimental evaluation of agent behav…
▽ More
Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, learned policies are largely opaque, and hypotheses about the behavior of deep RL agents are difficult to test in black-box environments. Considerable effort has gone into addressing opacity, but almost no effort has been devoted to producing high quality environments for experimental evaluation of agent behavior. We present TOYBOX, a new high-performance, open-source* subset of Atari environments re-designed for the experimental evaluation of deep RL. We show that TOYBOX enables a wide range of experiments and analyses that are impossible in other environments.
*https://kdl-umass.github.io/Toybox/
△ Less
Submitted 7 May, 2019;
originally announced May 2019.
-
Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments
Authors:
Kaleigh Clary,
Emma Tosch,
John Foley,
David Jensen
Abstract:
Reproducibility in reinforcement learning is challenging: uncontrolled stochasticity from many sources, such as the learning algorithm, the learned policy, and the environment itself have led researchers to report the performance of learned agents using aggregate metrics of performance over multiple random seeds for a single environment. Unfortunately, there are still pernicious sources of variabi…
▽ More
Reproducibility in reinforcement learning is challenging: uncontrolled stochasticity from many sources, such as the learning algorithm, the learned policy, and the environment itself have led researchers to report the performance of learned agents using aggregate metrics of performance over multiple random seeds for a single environment. Unfortunately, there are still pernicious sources of variability in reinforcement learning agents that make reporting common summary statistics an unsound metric for performance. Our experiments demonstrate the variability of common agents used in the popular OpenAI Baselines repository. We make the case for reporting post-training agent performance as a distribution, rather than a point estimate.
△ Less
Submitted 12 April, 2019;
originally announced April 2019.
-
Measuring and Characterizing Generalization in Deep Reinforcement Learning
Authors:
Sam Witty,
Jun Ki Lee,
Emma Tosch,
Akanksha Atrey,
Michael Littman,
David Jensen
Abstract:
Deep reinforcement-learning methods have achieved remarkable performance on challenging control tasks. Observations of the resulting behavior give the impression that the agent has constructed a generalized representation that supports insightful action decisions. We re-examine what is meant by generalization in RL, and propose several definitions based on an agent's performance in on-policy, off-…
▽ More
Deep reinforcement-learning methods have achieved remarkable performance on challenging control tasks. Observations of the resulting behavior give the impression that the agent has constructed a generalized representation that supports insightful action decisions. We re-examine what is meant by generalization in RL, and propose several definitions based on an agent's performance in on-policy, off-policy, and unreachable states. We propose a set of practical methods for evaluating agents with these definitions of generalization. We demonstrate these techniques on a common benchmark task for deep RL, and we show that the learned networks make poor decisions for states that differ only slightly from on-policy states, even though those states are not selected adversarially. Taken together, these results call into question the extent to which deep Q-networks learn generalized representations, and suggest that more experimentation and analysis is necessary before claims of representation learning can be supported.
△ Less
Submitted 11 December, 2018; v1 submitted 6 December, 2018;
originally announced December 2018.
-
ToyBox: Better Atari Environments for Testing Reinforcement Learning Agents
Authors:
John Foley,
Emma Tosch,
Kaleigh Clary,
David Jensen
Abstract:
It is a widely accepted principle that software without tests has bugs. Testing reinforcement learning agents is especially difficult because of the stochastic nature of both agents and environments, the complexity of state-of-the-art models, and the sequential nature of their predictions. Recently, the Arcade Learning Environment (ALE) has become one of the most widely used benchmark suites for d…
▽ More
It is a widely accepted principle that software without tests has bugs. Testing reinforcement learning agents is especially difficult because of the stochastic nature of both agents and environments, the complexity of state-of-the-art models, and the sequential nature of their predictions. Recently, the Arcade Learning Environment (ALE) has become one of the most widely used benchmark suites for deep learning research, and state-of-the-art Reinforcement Learning (RL) agents have been shown to routinely equal or exceed human performance on many ALE tasks. Since ALE is based on emulation of original Atari games, the environment does not provide semantically meaningful representations of internal game state. This means that ALE has limited utility as an environment for supporting testing or model introspection. We propose ToyBox, a collection of reimplementations of these games that solves this critical problem and enables robust testing of RL agents.
△ Less
Submitted 25 January, 2019; v1 submitted 6 December, 2018;
originally announced December 2018.
-
SurveyMan: Programming and Automatically Debugging Surveys
Authors:
Emma Tosch,
Emery D. Berger
Abstract:
Surveys can be viewed as programs, complete with logic, control flow, and bugs. Word choice or the order in which questions are asked can unintentionally bias responses. Vague, confusing, or intrusive questions can cause respondents to abandon a survey. Surveys can also have runtime errors: inattentive respondents can taint results. This effect is especially problematic when deploying surveys in u…
▽ More
Surveys can be viewed as programs, complete with logic, control flow, and bugs. Word choice or the order in which questions are asked can unintentionally bias responses. Vague, confusing, or intrusive questions can cause respondents to abandon a survey. Surveys can also have runtime errors: inattentive respondents can taint results. This effect is especially problematic when deploying surveys in uncontrolled settings, such as on the web or via crowdsourcing platforms. Because the results of surveys drive business decisions and inform scientific conclusions, it is crucial to make sure they are correct.
We present SurveyMan, a system for designing, deploying, and automatically debugging surveys. Survey authors write their surveys in a lightweight domain-specific language aimed at end users. SurveyMan statically analyzes the survey to provide feedback to survey authors before deployment. It then compiles the survey into JavaScript and deploys it either to the web or a crowdsourcing platform. SurveyMan's dynamic analyses automatically find survey bugs, and control for the quality of responses. We evaluate SurveyMan's algorithms analytically and empirically, demonstrating its effectiveness with case studies of social science surveys conducted via Amazon's Mechanical Turk.
△ Less
Submitted 20 June, 2014;
originally announced June 2014.