Search | arXiv e-print repository

arXiv:2408.07892 [pdf, other]

Personhood credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online

Authors: Steven Adler, Zoë Hitzig, Shrey Jain, Catherine Brewer, Wayne Chang, Renée DiResta, Eddy Lazzarin, Sean McGregor, Wendy Seltzer, Divya Siddarth, Nouran Soliman, Tobin South, Connor Spelliscy, Manu Sporny, Varya Srivastava, John Bailey, Brian Christian, Andrew Critch, Ronnie Falcon, Heather Flanagan, Kim Hamilton Duffy, Eric Ho, Claire R. Leibowicz, Srikanth Nadhamuni, Alan Z. Rozenshtein, et al. (7 additional authors not shown)

Abstract: Anonymity is an important principle online. However, malicious actors have long used misleading identities to conduct fraud, spread disinformation, and carry out other deceptive schemes. With the advent of increasingly capable AI, bad actors can amplify the potential scale and effectiveness of their operations, intensifying the challenge of balancing anonymity and trustworthiness online. In this paper, we analyze the value of a new tool to address this challenge: "personhood credentials" (PHCs), digital credentials that empower users to demonstrate that they are real people -- not AIs -- to online services, without disclosing any personal information. Such credentials can be issued by a range of trusted institutions -- governments or otherwise. A PHC system, according to our definition, could be local or global, and does not need to be biometrics-based. Two trends in AI contribute to the urgency of the challenge: AI's increasing indistinguishability from people online (i.e., lifelike content and avatars, agentic activity), and AI's increasing scalability (i.e., cost-effectiveness, accessibility). Drawing on a long history of research into anonymous credentials and "proof-of-personhood" systems, personhood credentials give people a way to signal their trustworthiness on online platforms, and offer service providers new tools for reducing misuse by bad actors. In contrast, existing countermeasures to automated deception -- such as CAPTCHAs -- are inadequate against sophisticated AI, while stringent identity verification solutions are insufficiently private for many use-cases. After surveying the benefits of personhood credentials, we also examine deployment risks and design challenges. We conclude with actionable next steps for policymakers, technologists, and standards bodies to consider in consultation with the public.

Submitted 17 January, 2025; v1 submitted 14 August, 2024; originally announced August 2024.

Comments: 63 pages, 7 figures, 5 tables; minor additions to acknowledgments and wording changes for clarity; corrected typo; updated email address reference for author

arXiv:2306.06924 [pdf, other]

TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI

Authors: Andrew Critch, Stuart Russell

Abstract: While several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an exhaustive taxonomy of such risks. Many exhaustive taxonomies are possible, and some are useful -- particularly if they reveal new risks or practical approaches to safety. This paper explores a taxonomy based on accountability: whose actions lead to the risk, are the actors unified, and are they deliberate? We also provide stories to illustrate how the various risk types could each play out, including risks arising from unanticipated interactions of many AI systems, as well as risks from deliberate misuse, for which combined technical and policy solutions are indicated.

Submitted 14 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

MSC Class: 68T01 ACM Class: I.2.0

arXiv:2208.07006 [pdf, ps, other]

Cooperative and uncooperative institution designs: Surprises and problems in open-source game theory

Authors: Andrew Critch, Michael Dennis, Stuart Russell

Abstract: It is increasingly possible for real-world agents, such as software-based agents or human institutions, to view the internal programming of other such agents that they interact with. For instance, a company can read the bylaws of another company, or one software system can read the source code of another. Game-theoretic equilibria between the designers of such agents are called program equilibria, and we call this area open-source game theory. In this work we demonstrate a series of counterintuitive results on open-source games, which are independent of the programming language in which agents are written. We show that certain formal institution designs that one might expect to defect against each other will instead turn out to cooperate, or conversely, cooperate when one might expect them to defect. The results hold in a setting where each institution has full visibility into the other institution's true operating procedures. We also exhibit examples and ten open problems for better understanding these phenomena. We argue that contemporary game theory remains ill-equipped to study program equilibria, given that even the outcomes of single games in open-source settings remain counterintuitive and poorly understood. Nonetheless, some of these open-source agents exhibit desirable characteristics -- e.g., they can unexploitably create incentives for cooperation and legibility from other agents -- such that analyzing them could yield considerable benefits.

Submitted 15 August, 2022; originally announced August 2022.

Comments: 41 pages

MSC Class: 93A14; 93A16; 91-08; 91A11; 91A35; 91A68; 91A44; 91B06; 91B41; 91B52 ACM Class: F.3.1; F.4.1; I.2.3; J.4

arXiv:2207.10806 [pdf, other]

WordSig: QR streams enabling platform-independent self-identification that's impossible to deepfake

Authors: Andrew Critch

Abstract: Deepfakes can degrade the fabric of society by limiting our ability to trust video content from leaders, authorities, and even friends. Cryptographically secure digital signatures may be used by video streaming platforms to endorse content, but these signatures are applied by the content distributor rather than the participants in the video. We introduce WordSig, a simple protocol allowing video participants to digitally sign the words they speak using a stream of QR codes, and allowing viewers to verify the consistency of signatures across videos. This allows establishing a trusted connection between the viewer and the participant that is not mediated by the content distributor. Given the widespread adoption of QR codes for distributing hyperlinks and vaccination records, and the increasing prevalence of celebrity deepfakes, 2022 or later may be a good time for public figures to begin using and promoting QR-based self-authentication tools.

Submitted 15 July, 2022; originally announced July 2022.

MSC Class: 68P25; 68T01; 94A62 ACM Class: E.3; I.2; K.4

arXiv:2207.03470 [pdf, other]

For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria

Authors: Scott Emmons, Caspar Oesterheld, Andrew Critch, Vincent Conitzer, Stuart Russell

Abstract: Although it has been known since the 1970s that a globally optimal strategy profile in a common-payoff game is a Nash equilibrium, global optimality is a strict requirement that limits the result's applicability. In this work, we show that any locally optimal symmetric strategy profile is also a (global) Nash equilibrium. Furthermore, we show that this result is robust to perturbations to the common payoff and to the local optimum. Applied to machine learning, our result provides a global guarantee for any gradient method that finds a local optimum in symmetric strategy space. While this result indicates stability to unilateral deviation, we nevertheless identify broad classes of games where mixed local optima are unstable under joint, asymmetric deviations. We analyze the prevalence of instability by running learning algorithms in a suite of symmetric games, and we conclude by discussing the applicability of our results to multi-agent RL, cooperative inverse RL, and decentralized POMDPs.

Submitted 7 July, 2022; originally announced July 2022.

arXiv:2111.06956 [pdf, other]

Human irrationality: both bad and good for reward inference

Authors: Lawrence Chan, Andrew Critch, Anca Dragan

Abstract: Assuming humans are (approximately) rational enables robots to infer reward functions by observing human behavior. But people exhibit a wide array of irrationalities, and our goal with this work is to better understand the effect they can have on reward inference. The challenge with studying this effect is that there are many types of irrationality, with varying degrees of mathematical formalization. We thus operationalize irrationality in the language of MDPs, by altering the Bellman optimality equation, and use this framework to study how these alterations would affect inference. We find that wrongly modeling a systematically irrational human as noisy-rational performs a lot worse than correctly capturing these biases -- so much so that it can be better to skip inference altogether and stick to the prior! More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can. That is, if a robot has the correct model of a human's irrationality, it can make an even stronger inference than it ever could if the human were rational. Irrationality fundamentally helps rather than hinders reward inference, but it needs to be correctly accounted for.

Submitted 12 November, 2021; originally announced November 2021.

Comments: 12 pages, 10 figures

arXiv:2110.08058 [pdf, other]

Quantifying Local Specialization in Deep Neural Networks

Authors: Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell

Abstract: A neural network is locally specialized to the extent that parts of its computational graph (i.e. structure) can be abstractly represented as performing some comprehensible sub-task relevant to the overall task (i.e. functionality). Are modern deep neural networks locally specialized? How can this be quantified? In this paper, we consider the problem of taking a neural network whose neurons are partitioned into clusters, and quantifying how functionally specialized the clusters are. We propose two proxies for this: importance, which reflects how crucial sets of neurons are to network performance; and coherence, which reflects how consistently their neurons associate with features of the inputs. To measure these proxies, we develop a set of statistical methods based on techniques conventionally used to interpret individual neurons. We apply the proxies to partitionings generated by spectrally clustering a graph representation of the network's neurons with edges determined either by network weights or correlations of activations. We show that these partitionings, even ones based only on weights (i.e. strictly from non-runtime analysis), reveal groups of neurons that are important and coherent. These results suggest that graph-based partitioning can reveal local specialization and that statistical methods can be used to automatedly screen for sets of neurons that can be understood abstractly.

Submitted 7 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 21 pages, 6 figures. Code is available at https://github.com/thestephencasper/detecting_nn_modularity

arXiv:2103.03386 [pdf, other]

Clusterability in Neural Networks

Authors: Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

Abstract: The learned weights of a neural network have often been considered devoid of scrutable internal structure. In this paper, however, we look for structure in the form of clusterability: how well a network can be divided into groups of neurons with strong internal connectivity but weak external connectivity. We find that a trained neural network is typically more clusterable than randomly initialized networks, and often clusterable relative to random networks with the same distribution of weights. We also exhibit novel methods to promote clusterability in neural network training, and find that in multi-layer perceptrons they lead to more clusterable networks with little reduction in accuracy. Understanding and controlling the clusterability of neural networks will hopefully render their inner workings more interpretable to engineers by facilitating partitioning into meaningful clusters.

Submitted 4 March, 2021; originally announced March 2021.

Comments: 20 pages, 22 figures. arXiv admin note: text overlap with arXiv:2003.04881

arXiv:2101.10305 [pdf, other]

Accumulating Risk Capital Through Investing in Cooperation

Authors: Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell

Abstract: Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and cooperation is not severe, and you can receive exponentially large returns through cooperation from a small amount of risk. We both study an exact solution method and propose a method for training policies that targets this objective, Accumulating Risk Capital Through Investing in Cooperation (ARCTIC), and evaluate them in iterated Prisoner's Dilemma and Stag Hunt.

Submitted 20 April, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

arXiv:2012.14536 [pdf, other]

Multi-Principal Assistance Games: Definition and Collegial Mechanisms

Authors: Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell

Abstract: We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning games. We analyze in particular a generalization of apprenticeship learning in which the humans first perform some work to obtain utility and demonstrate their preferences, and then the robot acts to further maximize the sum of human payoffs. We show in this setting that if the game is sufficiently collegial, i.e. if the humans are responsible for obtaining a sufficient fraction of the rewards through their own actions, then their preferences are straightforwardly revealed through their work. This revelation mechanism is non-dictatorial, does not limit the possible outcomes to two alternatives, and is dominant-strategy incentive-compatible.

Submitted 28 December, 2020; originally announced December 2020.

Comments: arXiv admin note: text overlap with arXiv:2007.09540

arXiv:2012.02096 [pdf, other]

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Authors: Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

Abstract: A wide range of reinforcement learning (RL) problems - including robustness, transfer learning, unsupervised RL, and emergent complexity - require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent's return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.

Submitted 3 February, 2021; v1 submitted 3 December, 2020; originally announced December 2020.
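
The regret objective described in the PAIRED abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `paired_regret`, the use of plain averages over episode returns, and the sign convention (antagonist minus protagonist) are assumptions made here for illustration.

```python
from statistics import mean


def paired_regret(protagonist_returns, antagonist_returns):
    """Estimate regret for one generated environment.

    Assumption: regret is taken as the antagonist's average return minus
    the protagonist's average return over a batch of episodes.
    """
    return mean(antagonist_returns) - mean(protagonist_returns)


# Hypothetical episode returns, invented for illustration only.
# Solvable but hard: the antagonist succeeds where the protagonist struggles.
regret_solvable = paired_regret([0.2, 0.4], [0.9, 0.7])

# Unsolvable: neither agent earns anything, so the estimated regret is zero.
regret_unsolvable = paired_regret([0.0, 0.0], [0.0, 0.0])

print(regret_solvable, regret_unsolvable)
```

Under this reading, an adversary that maximizes regret gains nothing from an environment that neither agent can solve, which matches the abstract's stated motivation for preferring regret over a pure minimax objective.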

arXiv:2011.00401 [pdf, other]

The MAGICAL Benchmark for Robust Imitation

Authors: Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

Abstract: Imitation Learning (IL) algorithms are typically evaluated in the same environment that was used to create demonstrations. This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings. This paper presents the MAGICAL benchmark suite, which permits systematic evaluation of generalisation by quantifying robustness to different kinds of distribution shift that an IL algorithm is likely to encounter in practice. Using the MAGICAL suite, we confirm that existing IL algorithms overfit significantly to the context in which demonstrations are provided. We also show that standard methods for reducing overfitting are effective at creating narrow perceptual invariances, but are not sufficient to enable transfer to contexts that require substantially different behaviour, which suggests that new approaches will be needed in order to robustly generalise demonstrator intent. Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/.

Submitted 31 October, 2020; originally announced November 2020.

Comments: NeurIPS 2020 conference paper (poster)

arXiv:2008.02275 [pdf, other]

Aligning AI With Shared Human Values

Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Abstract: We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Submitted 17 February, 2023; v1 submitted 5 August, 2020; originally announced August 2020.

Comments: ICLR 2021; the ETHICS dataset is available at https://github.com/hendrycks/ethics/

arXiv:2006.04948 [pdf, other]

AI Research Considerations for Human Existential Safety (ARCHES)

Authors: Andrew Critch, David Krueger

Abstract: Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species. In negative terms, we ask what existential risks humanity might face from AI development in the next century, and by what principles contemporary technical research might be directed to address those risks. A key property of hypothetical AI technologies is introduced, called prepotence, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as AI paradigms might shift. A set of contemporary research directions are then examined for their potential benefit to existential safety. Each research direction is explained with a scenario-driven motivation, and examples of existing work from which to build. The research directions present their own risks and benefits to society that could occur at various scales of impact, and in particular are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. As such, each direction is accompanied by a consideration of potentially negative side effects.

Submitted 29 May, 2020; originally announced June 2020.

MSC Class: 68T01 ACM Class: I.2.0

arXiv:2003.04881 [pdf, other]

Pruned Neural Networks are Surprisingly Modular

Authors: Daniel Filan, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

Abstract: The learned weights of a neural network are often considered devoid of scrutable internal structure. To discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate the modular structure of MLPs trained on datasets of small images. Our notion of modularity comes from the graph clustering literature: a "module" is a set of neurons with strong internal connectivity but weak external connectivity. We find that training and weight pruning produces MLPs that are more modular than randomly initialized ones, and often significantly more modular than random MLPs with the same (sparse) distribution of weights. Interestingly, they are much more modular when trained with dropout. We also present exploratory analyses of the importance of different modules for performance and how modules depend on each other. Understanding the modular structure of neural networks, when such structure exists, will hopefully render their inner workings more interpretable to engineers. Note that this paper has been superseded by "Clusterability in Neural Networks", arxiv:2103.03386 and "Quantifying Local Specialization in Deep Neural Networks", arxiv:2110.08058!

Submitted 7 February, 2022; v1 submitted 10 March, 2020; originally announced March 2020.

Comments: 25 pages, 12 figures

arXiv:1912.01683 [pdf, other]

Optimal Policies Tend to Seek Power

Authors: Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

Abstract: Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.

Submitted 28 January, 2023; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: Accepted to NeurIPS 2021 as spotlight paper. 12 pages, 44 pages with appendices. Since the 2021 acceptance, we updated the paper to point out that optimal policies can be qualitatively divorced from real-world learned policies

arXiv:1711.00363 [pdf, ps, other]

Servant of Many Masters: Shifting priorities in Pareto-optimal sequential decision-making

Authors: Andrew Critch, Stuart Russell

Abstract: It is often argued that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto-optimal policy, i.e., a policy that cannot be improved upon for one agent without making sacrifices for another. A famous theorem of Harsanyi shows that, when the principals have a common prior on the outcome distributions of all policies, a Pareto-optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals' utilities. In this paper, we show that Harsanyi's theorem does not hold for principals with different priors, and derive a more precise generalization which does hold, which constitutes our main result. In this more general case, the relative weight given to each principal's utility should evolve over time according to how well the agent's observations conform with that principal's prior. The result has implications for the design of contracts, treaties, joint ventures, and robots.

Submitted 31 October, 2017; originally announced November 2017.

Comments: 10 pages. arXiv admin note: substantial text overlap with arXiv:1701.01302

arXiv:1707.08747 [pdf, ps, other]

doi 10.4204/EPTCS.251.16

A Formal Approach to the Problem of Logical Non-Omniscience

Authors: Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, Jessica Taylor

Abstract: We present the logical induction criterion for computable algorithms that assign probabilities to every logical statement in a given formal language, and refine those probabilities over time. The criterion is motivated by a series of stock trading analogies. Roughly speaking, each logical sentence phi is associated with a stock that is worth $1 per share if phi is true and nothing otherwise, and we interpret the belief-state of a logically uncertain reasoner as a set of market prices, where pt_N(phi)=50% means that on day N, shares of phi may be bought or sold from the reasoner for 50¢. A market is then called a logical inductor if (very roughly) there is no polynomial-time computable trading strategy with finite risk tolerance that earns unbounded profits in that market over time. We then describe how this single criterion implies a number of desirable properties of bounded reasoners; for example, logical inductors outpace their underlying deductive process, perform universal empirical induction given enough time to think, and place strong trust in their own reasoning process.

Submitted 27 July, 2017; originally announced July 2017.

Comments: In Proceedings TARK 2017, arXiv:1707.08250

ACM Class: F.4.0; G.3

Journal ref: EPTCS 251, 2017, pp. 221-235

arXiv:1701.01302 [pdf, ps, other]

Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making

Authors: Andrew Critch

Abstract: Existing multi-objective reinforcement learning (MORL) algorithms do not account for objectives that arise from players with differing beliefs. Concretely, consider two players with different beliefs and utility functions who may cooperate to build a machine that takes actions on their behalf. A representation is needed for how much the machine's policy will prioritize each player's interests over time. Assuming the players have reached common knowledge of their situation, this paper derives a recursion that any Pareto optimal policy must satisfy. Two qualitative observations can be made from the recursion: the machine must (1) use each player's own beliefs in evaluating how well an action will serve that player's utility function, and (2) shift the relative priority it assigns to each player's expected utilities over time, by a factor proportional to how well that player's beliefs predict the machine's inputs. Observation (2) represents a substantial divergence from naïve linear utility aggregation (as in Harsanyi's utilitarian theorem, and existing MORL algorithms), which is shown here to be inadequate for Pareto optimal sequential decision-making on behalf of players with different beliefs.

Submitted 13 May, 2017; v1 submitted 5 January, 2017; originally announced January 2017.
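
Illustrative note: observation (2) in the abstract above says each player's priority shifts by a factor proportional to how well that player's beliefs predicted the machine's inputs. The following is a minimal sketch of that kind of multiplicative reweighting, not the paper's recursion. The function name update_priorities, the starting weights, and the likelihood values are all hypothetical and chosen only for illustration.

```python
def update_priorities(weights, likelihoods):
    """Scale each player's weight by the probability that player's beliefs
    assigned to the latest observation, then renormalize to sum to 1."""
    scaled = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("no player assigned positive probability to the observation")
    return [s / total for s in scaled]


# Hypothetical setup: two players start with equal priority.
weights = [0.5, 0.5]

# Hypothetical data: for each observation, the probability that
# (player 1, player 2) had assigned to it beforehand.
observation_likelihoods = [(0.9, 0.3), (0.8, 0.4)]

for likelihoods in observation_likelihoods:
    weights = update_priorities(weights, likelihoods)

print(weights)
```

In this toy run the player whose beliefs predicted the observations better ends up with the larger weight, which is the qualitative behavior the abstract contrasts with a fixed linear combination of utilities.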

arXiv:1609.03543 [pdf, ps, other]

Logical Induction

Authors: Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, Jessica Taylor

Abstract: We present a computable algorithm that assigns probabilities to every logical statement in a given formal language, and refines those probabilities over time. For instance, if the language is Peano arithmetic, it assigns probabilities to all arithmetical statements, including claims about the twin prime conjecture, the outputs of long-running computations, and its own probabilities. We show that our algorithm, an instance of what we call a logical inductor, satisfies a number of intuitive desiderata, including: (1) it learns to predict patterns of truth and falsehood in logical statements, often long before having the resources to evaluate the statements, so long as the patterns can be written down in polynomial time; (2) it learns to use appropriate statistical summaries to predict sequences of statements whose truth values appear pseudorandom; and (3) it learns to have accurate beliefs about its own current beliefs, in a manner that avoids the standard paradoxes of self-reference. For example, if a given computer program only ever produces outputs in a certain range, a logical inductor learns this fact in a timely manner; and if late digits in the decimal expansion of $π$ are difficult to predict, then a logical inductor learns to assign $\approx 10\%$ probability to "the $n$th digit of $π$ is a 7" for large $n$.
Logical inductors also learn to trust their future beliefs more than their current beliefs, and their beliefs are coherent in the limit (whenever $φ\implies ψ$, $\mathbb{P}_\infty(φ) \le \mathbb{P}_\infty(ψ)$, and so on); and logical inductors strictly dominate the universal semimeasure in the limit. These properties and many others all follow from a single logical induction criterion, which is motivated by a series of stock trading analogies. Roughly speaking, each logical sentence $φ$ is associated with a stock that is worth \$1 per share if [...]

Submitted 7 December, 2020; v1 submitted 12 September, 2016; originally announced September 2016.

arXiv:1602.04184 [pdf, ps, other]

Parametric Bounded Löb's Theorem and Robust Cooperation of Bounded Agents

Authors: Andrew Critch

Abstract: Löb's theorem and Gödel's theorems make predictions about the behavior of systems capable of self-reference with unbounded computational resources with which to write and evaluate proofs. However, in the real world, systems capable of self-reference will have limited memory and processing speed, so in this paper we introduce an effective version of Löb's theorem which is applicable given such bounded resources. These results have powerful implications for the game theory of bounded agents who are able to write proofs about themselves and one another, including the capacity to out-perform classical Nash equilibria and correlated equilibria, attaining mutually cooperative program equilibrium in the Prisoner's Dilemma. Previous cooperative program equilibria studied by Tennenholtz (2004) and Fortnow (2009) have depended on tests for program equality, a fragile condition, whereas "Löbian" cooperation is much more robust and agnostic of the opponent's implementation.

Submitted 24 August, 2016; v1 submitted 12 February, 2016; originally announced February 2016.

Comments: Corrected typos, added grant acknowledgement, updated citation style to author-year

arXiv:1210.2812 [pdf, other]

doi 10.3842/SIGMA.2014.095

Algebraic Geometry of Matrix Product States

Authors: Andrew Critch, Jason Morton

Abstract: We quantify the representational power of matrix product states (MPS) for entangled qubit systems by giving polynomial expressions in a pure quantum state's amplitudes which hold if and only if the state is a translation invariant matrix product state or a limit of such states. For systems with few qubits, we give these equations explicitly, considering both periodic and open boundary conditions. Using the classical theory of trace varieties and trace algebras, we explain the relationship between MPS and hidden Markov models and exploit this relationship to derive useful parameterizations of MPS. We make four conjectures on the identifiability of MPS parameters.

Submitted 9 September, 2014; v1 submitted 10 October, 2012; originally announced October 2012.

MSC Class: 81R05; 81R50; 20C35; 22E70; 13P25; 13A50; 14J70; 14J81; 14L30; 14Q15; 14R20

Journal ref: SIGMA 10 (2014), 095, 10 pages

arXiv:1206.0500 [pdf, ps, other]

Binary hidden Markov models and varieties

Authors: Andrew J. Critch

Abstract: The technological applications of hidden Markov models have been extremely diverse and successful, including natural language processing, gesture recognition, gene sequencing, and Kalman filtering of physical measurements. HMMs are highly non-linear statistical models, and just as linear models are amenable to linear algebraic techniques, non-linear models are amenable to commutative algebra and algebraic geometry. This paper closely examines HMMs in which all the hidden random variables are binary. Its main contributions are (1) a birational parametrization for every such HMM, with an explicit inverse for recovering the hidden parameters in terms of observables, (2) a semialgebraic model membership test for every such HMM, and (3) minimal defining equations for the 4-node fully binary model, comprising 21 quadrics and 29 cubics, which were computed using Gröbner bases in the cumulant coordinates of Sturmfels and Zwiernik. The new model parameters in (1) are rationally identifiable in the sense of Sullivant, Garcia-Puente, and Spielvogel, and each model's Zariski closure is therefore a rational projective variety of dimension 5. Gröbner basis computations for the model and its graph are found to be considerably faster using these parameters. In the case of two hidden states, item (2) supersedes a previous algorithm of Schönhuth which is only generically defined, and the defining equations (3) yield new invariants for HMMs of all lengths $\geq 4$.
Such invariants have been used successfully in model selection problems in phylogenetics, and one can hope for similar applications in the case of HMMs.

Submitted 3 September, 2012; v1 submitted 3 June, 2012; originally announced June 2012.

MSC Class: 14Q15

arXiv:1203.6431 [pdf, ps, other]

A note on the proportionality between some consistency indices in the AHP

Authors: Matteo Brunelli, Andrew Critch, Michele Fedrizzi

Abstract: Analyzing the consistency of preferences is an important step in decision making with pairwise comparison matrices, and several indices have been proposed in order to estimate it. In this paper we prove the proportionality between some consistency indices in the framework of the Analytic Hierarchy Process. Knowing such equivalences eliminates redundancy in the consideration of evidence for consistent preferences.

Submitted 29 March, 2012; originally announced March 2012.

Comments: 9 pages

MSC Class: 90B50 (Primary) 13P25 (Secondary)

Showing 1–24 of 24 results for author: Critch, A