Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Jucys, Karolis; Adamopoulos, George; Hamidi, Mehrab; Milani, Stephanie; Samsami, Mohammad Reza; Zholus, Artem; Joseph, Sonia; Richards, Blake; Rish, Irina; Şimşek, Özgür

arXiv:2407.12161 (cs)

[Submitted on 16 Jul 2024]

Authors:Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake Richards, Irina Rish, Özgür Şimşek

Abstract: Understanding the mechanisms behind decisions taken by large foundation models in sequential decision-making tasks is critical to ensuring that such systems operate transparently and safely. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft-playing agent, one of the largest open-source vision-based agents. We aim to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task: crafting a diamond pickaxe. The agent pays attention to the last four frames and several key-frames further back in its six-second memory. This is a possible mechanism for maintaining coherence in a task that takes 3-10 minutes, despite the short memory span. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk when the villager is standing still under green tree leaves, and punches it to death.
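
The attention finding in the abstract (most weight on the last four frames, plus a few key-frames further back) can be illustrated with a minimal sketch. This is not the authors' analysis code and does not use any VPT API. The function name `summarize_frame_attention`, the `keyframe_factor` threshold, the ten-frame memory, and the synthetic weights are all hypothetical choices made for illustration only.

```python
def summarize_frame_attention(attn, n_recent=4, keyframe_factor=2.0):
    """Summarize how attention is spread over remembered frames.

    attn: list of per-head weight lists; attn[h][i] is the attention that the
          current frame's query places on remembered frame i (oldest first).
          This layout is an assumption, not VPT's actual tensor format.
    Returns (share of attention on the n_recent newest frames,
             indices of older frames that stand out as key-frame candidates).
    """
    n_heads = len(attn)
    n_frames = len(attn[0])
    if not 0 < n_recent <= n_frames:
        raise ValueError("n_recent must be between 1 and the number of frames")

    # Average over heads, then renormalize so the weights sum to 1.
    per_frame = [sum(head[i] for head in attn) / n_heads for i in range(n_frames)]
    total = sum(per_frame)
    per_frame = [w / total for w in per_frame]

    split = n_frames - n_recent
    recent_mass = sum(per_frame[split:])

    # Hypothetical criterion: an older frame is a key-frame candidate when its
    # weight exceeds keyframe_factor times the uniform share 1 / n_frames.
    uniform = 1.0 / n_frames
    keyframes = [i for i in range(split) if per_frame[i] > keyframe_factor * uniform]
    return recent_mass, keyframes


# Synthetic example: 2 heads, 10 remembered frames, one strong older frame.
weights = [0.02] * 10
weights[2] = 0.3            # an older frame that receives unusual attention
weights[6:] = [0.15] * 4    # the four most recent frames
attn = [weights, list(weights)]

recent_mass, keyframes = summarize_frame_attention(attn)
print(f"share on last 4 frames: {recent_mass:.2f}, key-frame candidates: {keyframes}")
```

On real attention maps the threshold would need tuning, and averaging over heads can hide heads that specialize in distant frames, so a per-head breakdown may be more informative.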

Comments:	Mechanistic Interpretability Workshop at ICML 2024
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.12161 [cs.AI]
	(or arXiv:2407.12161v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2407.12161

Submission history

From: Karolis Jucys
[v1] Tue, 16 Jul 2024 20:38:08 UTC (14,990 KB)
