Adversarial Training for High-Stakes Reliability

Ziegler, Daniel M.; Nix, Seraphina; Chan, Lawrence; Bauman, Tim; Schmidt-Nielsen, Peter; Lin, Tao; Scherlis, Adam; Nabeshima, Noa; Weinstein-Raun, Ben; de Haas, Daniel; Shlegeris, Buck; Thomas, Nate

arXiv:2205.01663 (cs)

[Submitted on 3 May 2022 (v1), last revised 10 Nov 2022 (this version, v5)]

Abstract: In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance.
In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques, including a tool that assists human adversaries, to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on, doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes), without affecting in-distribution performance.
We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
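
The filtering setup described in the abstract, where a classifier screens completions proposed by a generator and rejects anything scoring above a conservative threshold, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `filter_completions` and `toy_injury_score`, the keyword-based scorer, and the threshold value `0.01` are all hypothetical stand-ins for a trained classifier and a tuned threshold.

```python
from typing import Callable, Iterable, Optional


def filter_completions(
    prompt: str,
    candidates: Iterable[str],
    injury_score: Callable[[str, str], float],
    threshold: float = 0.01,  # hypothetical value; a lower threshold is more conservative
) -> Optional[str]:
    """Return the first candidate scoring below the threshold, or None if all are rejected."""
    for completion in candidates:
        if injury_score(prompt, completion) < threshold:
            return completion
    return None


def toy_injury_score(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for a learned classifier: 1.0 if a flagged word appears, else 0.0."""
    flagged = {"hurt", "wound", "broke"}
    words = (w.strip(".,!?") for w in completion.lower().split())
    return 1.0 if any(w in flagged for w in words) else 0.0


if __name__ == "__main__":
    candidates = ["He fell and broke his arm.", "He sat down and read a book."]
    # The first candidate is rejected by the toy scorer; the second passes the filter.
    print(filter_completions("He walked into the room.", candidates, toy_injury_score))
```

With a real classifier, lowering `threshold` rejects more completions; the abstract's claim is that, on the authors' task, a very conservative setting did not significantly reduce the quality of the outputs that pass.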

Comments:	30 pages, 7 figures, NeurIPS camera-ready
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2205.01663 [cs.LG]
	(or arXiv:2205.01663v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2205.01663

Submission history

From: Daniel M. Ziegler
[v1] Tue, 3 May 2022 17:50:06 UTC (581 KB)
[v2] Wed, 4 May 2022 17:58:20 UTC (581 KB)
[v3] Thu, 15 Sep 2022 17:36:48 UTC (589 KB)
[v4] Fri, 7 Oct 2022 01:30:53 UTC (695 KB)
[v5] Thu, 10 Nov 2022 01:02:29 UTC (720 KB)
