-
When Bad Data Leads to Good Models
Authors:
Kenneth Li,
Yida Chen,
Fernanda Viégas,
Martin Wattenberg
Abstract:
In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
Submitted 7 May, 2025;
originally announced May 2025.
-
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Authors:
Andrew Lee,
Lihao Sun,
Chris Wendler,
Fernanda Viégas,
Martin Wattenberg
Abstract:
How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect". Bottom-up, we find that "previous-token heads" are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.
Submitted 11 May, 2025; v1 submitted 19 April, 2025;
originally announced April 2025.
-
Shared Global and Local Geometry of Language Model Embeddings
Authors:
Andrew Lee,
Melanie Weber,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Researchers have recently suggested that models share common representations. In our work, we find that token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we introduce Emb2Emb, a simple method to transfer steering vectors from one language model to another, despite the two models having different dimensions.
Submitted 23 April, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Authors:
Thomas Fel,
Ekdeep Singh Lubana,
Jacob S. Prince,
Matthew Kowal,
Victor Boutin,
Isabel Papadimitriou,
Binxu Wang,
Martin Wattenberg,
Demba Ba,
Talia Konkle
Abstract:
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover "true" classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
Submitted 23 May, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Open Problems in Mechanistic Interpretability
Authors:
Lee Sharkey,
Bilal Chughtai,
Joshua Batson,
Jack Lindsey,
Jeff Wu,
Lucius Bushnaq,
Nicholas Goldowsky-Dill,
Stefan Heimersheim,
Alejandro Ortega,
Joseph Bloom,
Stella Biderman,
Adria Garriga-Alonso,
Arthur Conmy,
Neel Nanda,
Jessica Rumbelow,
Martin Wattenberg,
Nandi Schoots,
Joseph Miller,
Eric J. Michaud,
Stephen Casper,
Max Tegmark,
William Saunders,
David Bau,
Eric Todd,
Atticus Geiger
, et al. (4 additional authors not shown)
Abstract:
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Submitted 27 January, 2025;
originally announced January 2025.
-
ICLR: In-Context Learning of Representations
Authors:
Core Francisco Park,
Andrew Lee,
Ekdeep Singh Lubana,
Yongyi Yang,
Maya Okawa,
Kento Nishi,
Martin Wattenberg,
Hidenori Tanaka
Abstract:
Recent work has demonstrated that semantics specified by pretraining data influence how representations of different concepts are organized in a large language model (LLM). However, given the open-ended nature of LLMs, e.g., their ability to in-context learn, we can ask whether models alter these pretraining semantics to adopt alternative, context-specified ones. Specifically, if we provide in-context exemplars wherein a concept plays a different role than what the pretraining data suggests, do models reorganize their representations in accordance with these novel semantics? To answer this question, we take inspiration from the theory of conceptual role semantics and define a toy "graph tracing" task wherein the nodes of the graph are referenced via concepts seen during training (e.g., apple, bird, etc.) and the connectivity of the graph is defined via some predefined structure (e.g., a square grid). Given exemplars that indicate traces of random walks on the graph, we analyze intermediate representations of the model and find that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Further, we find that when reference concepts have correlations in their semantics (e.g., Monday, Tuesday, etc.), the context-specified graph structure is still present in the representations, but is unable to dominate the pretrained structure. To explain these results, we analogize our task to energy minimization for a predefined graph topology, providing evidence towards an implicit optimization process to infer context-specified semantics. Overall, our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities.
Submitted 2 May, 2025; v1 submitted 29 December, 2024;
originally announced January 2025.
-
Relational Composition in Neural Networks: A Survey and Call to Action
Authors:
Martin Wattenberg,
Fernanda B. Viégas
Abstract:
Many neural nets appear to represent data as linear combinations of "feature vectors." Algorithms for discovering these vectors have seen impressive recent success. However, we argue that this success is incomplete without an understanding of relational composition: how (or whether) neural nets combine feature vectors to represent more complicated relationships. To facilitate research in this area, this paper offers a guided tour of various relational mechanisms that have been proposed, along with preliminary analysis of how such mechanisms might affect the search for interpretable features. We end with a series of promising areas for empirical research, which may help determine how neural networks represent structured data.
Submitted 19 July, 2024;
originally announced July 2024.
-
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
Authors:
Kenneth Li,
Yiming Wang,
Fernanda Viégas,
Martin Wattenberg
Abstract:
We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4's performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.
Submitted 17 June, 2024;
originally announced June 2024.
-
Designing a Dashboard for Transparency and Control of Conversational AI
Authors:
Yida Chen,
Aoyu Wu,
Trevor DePodesta,
Catherine Yeh,
Kenneth Li,
Nicholas Castillo Marin,
Oam Patel,
Jan Riecke,
Shivam Raval,
Olivia Seow,
Martin Wattenberg,
Fernanda Viégas
Abstract:
Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype, connecting interpretability techniques with user experience design, that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a "user model": examining the internal state of the system, we can extract data related to a user's age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system's behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at https://bit.ly/talktuner-project-page
Submitted 14 October, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
Authors:
Kenneth Li,
Samy Jelassi,
Hugh Zhang,
Sham Kakade,
Martin Wattenberg,
David Brandfonbrener
Abstract:
We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes we consider either reward modeling or a class of novel direct policy learning objectives based on importance weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe .
Submitted 2 June, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Authors:
Kenneth Li,
Tianle Liu,
Naomi Bashkansky,
David Bau,
Fernanda Viégas,
Hanspeter Pfister,
Martin Wattenberg
Abstract:
System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated instructions for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating instruction stability via self-chats between two instructed chatbots. Testing popular models like LLaMA2-chat-70B and GPT-3.5, we reveal a significant instruction drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and instruction drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
Submitted 25 July, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Authors:
Andrew Lee,
Xiaoyan Bai,
Itamar Pres,
Martin Wattenberg,
Jonathan K. Kummerfeld,
Rada Mihalcea
Abstract:
While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
Submitted 3 January, 2024;
originally announced January 2024.
-
Interactive AI Alignment: Specification, Process, and Evaluation Alignment
Authors:
Michael Terry,
Chinmay Kulkarni,
Martin Wattenberg,
Lucas Dixon,
Meredith Ringel Morris
Abstract:
Modern AI enables a high-level, declarative form of interaction: Users describe the intended outcome they wish an AI to produce, but do not actually create the outcome themselves. In contrast, in traditional user interfaces, users invoke specific operations to create the desired outcome. This paper revisits the basic input-output interaction cycle in light of this declarative style of interaction, and connects concepts in AI alignment to define three objectives for interactive alignment of AI: specification alignment (aligning on what to do), process alignment (aligning on how to do it), and evaluation alignment (assisting users in verifying and understanding what was produced). Using existing systems as examples, we show how these user-centered views of AI alignment can be used descriptively, prescriptively, and as an evaluative aid.
Submitted 16 September, 2024; v1 submitted 23 October, 2023;
originally announced November 2023.
-
ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Authors:
Ian Arawjo,
Chelse Swoopes,
Priyan Vaithilingam,
Martin Wattenberg,
Elena Glassman
Abstract:
Evaluating outputs of large language models (LLMs) is challenging, requiring making -- and making sense of -- many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
Submitted 3 May, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
Authors:
Neel Nanda,
Andrew Lee,
Martin Wattenberg
Abstract:
How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
Submitted 7 September, 2023; v1 submitted 2 September, 2023;
originally announced September 2023.
-
Linearity of Relation Decoding in Transformer Language Models
Authors:
Evan Hernandez,
Arnab Sen Sharma,
Tal Haklay,
Kevin Meng,
Martin Wattenberg,
Jacob Andreas,
Yonatan Belinkov,
David Bau
Abstract:
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
Submitted 15 February, 2024; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
Authors:
Yida Chen,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process, well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output. Project page: https://yc015.github.io/scene-representation-diffusion-model/
Submitted 4 November, 2023; v1 submitted 9 June, 2023;
originally announced June 2023.
-
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Authors:
Kenneth Li,
Oam Patel,
Fernanda Viégas,
Hanspeter Pfister,
Martin Wattenberg
Abstract:
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
Submitted 26 June, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.
-
AttentionViz: A Global View of Transformer Attention
Authors:
Catherine Yeh,
Yida Chen,
Aoyu Wu,
Cynthia Chen,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Transformer models are revolutionizing machine learning, but their inner workings remain mysterious. In this work, we present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers that allows these models to learn rich, contextual relationships between elements of a sequence. The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention. Unlike previous attention visualization techniques, our approach enables the analysis of global patterns across multiple input sequences. We create an interactive visualization tool, AttentionViz (demo: http://attentionviz.com), based on these joint query-key embeddings, and use it to study attention mechanisms in both language and vision transformers. We demonstrate the utility of our approach in improving model understanding and offering new insights about query-key interactions through several application scenarios and expert feedback.
Submitted 9 August, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
The System Model and the User Model: Exploring AI Dashboard Design
Authors:
Fernanda Viégas,
Martin Wattenberg
Abstract:
This is a speculative essay on interface design and artificial intelligence. Recently there has been a surge of attention to chatbots based on large language models, including widely reported unsavory interactions. We contend that part of the problem is that text is not all you need: sophisticated AI systems should have dashboards, just like all other complicated devices. Assuming the hypothesis that AI systems based on neural networks will contain interpretable models of aspects of the world around them, we discuss what data such dashboards might display. We conjecture that, for many systems, the two most important models will be of the user and of the system itself. We call these the System Model and User Model. We argue that, for usability and safety, interfaces to dialogue-based AI systems should have a parallel display based on the state of the System Model and the User Model. Finding ways to identify, interpret, and display these two models should be a core part of interface research for AI.
Submitted 3 May, 2023;
originally announced May 2023.
-
Investigating How Practitioners Use Human-AI Guidelines: A Case Study on the People + AI Guidebook
Authors:
Nur Yildirim,
Mahima Pushkarna,
Nitesh Goyal,
Martin Wattenberg,
Fernanda Viegas
Abstract:
Artificial intelligence (AI) presents new challenges for the user experience (UX) of products and services. Recently, practitioner-facing resources and design guidelines have become available to ease some of these challenges. However, little research has investigated if and how these guidelines are used, and how they impact practice. In this paper, we investigated how industry practitioners use the People + AI Guidebook. We conducted interviews with 31 practitioners (i.e., designers, product managers) to understand how they use human-AI guidelines when designing AI-enabled products. Our findings revealed that practitioners use the guidebook not only for addressing AI's design challenges, but also for education, cross-functional communication, and for developing internal resources. We uncovered that practitioners desire more support for early phase ideation and problem formulation to avoid AI product failures. We discuss the implications for future resources aiming to help practitioners in designing AI products.
Submitted 20 April, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.
-
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Authors:
Kenneth Li,
Aspen K. Hopkins,
David Bau,
Fernanda Viégas,
Hanspeter Pfister,
Martin Wattenberg
Abstract:
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
Submitted 26 June, 2024; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Toy Models of Superposition
Authors:
Nelson Elhage,
Tristan Hume,
Catherine Olsson,
Nicholas Schiefer,
Tom Henighan,
Shauna Kravec,
Zac Hatfield-Dodds,
Robert Lasenby,
Dawn Drain,
Carol Chen,
Roger Grosse,
Sam McCandlish,
Jared Kaplan,
Dario Amodei,
Martin Wattenberg,
Christopher Olah
Abstract:
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Submitted 21 September, 2022;
originally announced September 2022.
-
Interpreting a Machine Learning Model for Detecting Gravitational Waves
Authors:
Mohammadtaher Safarzadeh,
Asad Khan,
E. A. Huerta,
Martin Wattenberg
Abstract:
We describe a case study of translational research, applying interpretability techniques developed for computer vision to machine learning models used to search for and find gravitational waves. The models we study are trained to detect black hole merger events in non-Gaussian and non-stationary advanced Laser Interferometer Gravitational-wave Observatory (LIGO) data. We produced visualizations of the response of machine learning models when they process advanced LIGO data that contains real gravitational wave signals, noise anomalies, and pure advanced LIGO noise. Our findings shed light on the responses of individual neurons in these machine learning models. Further analysis suggests that different parts of the network appear to specialize in local versus global features, and that this difference appears to be rooted in the branched architecture of the network as well as noise characteristics of the LIGO detectors. We believe efforts to whiten these "black box" models can suggest future avenues for research and help inform the design of interpretable machine learning models for gravitational wave astrophysics.
Submitted 15 February, 2022;
originally announced February 2022.
-
An Interpretability Illusion for BERT
Authors:
Tolga Bolukbasi,
Adam Pearce,
Ann Yuan,
Andy Coenen,
Emily Reif,
Fernanda Viégas,
Martin Wattenberg
Abstract:
We describe an "interpretability illusion" that arises when analyzing the BERT model. Activations of individual neurons in the network may spuriously appear to encode a single, simple concept, when in fact they are encoding something far more complex. The same effect holds for linear combinations of activations. We trace the source of this illusion to geometric properties of BERT's embedding space as well as the fact that common text corpora represent only narrow slices of possible English sentences. We provide a taxonomy of model-learned concepts and discuss methodological implications for interpretability research, especially the importance of testing hypotheses on multiple data sets.
Submitted 14 April, 2021;
originally announced April 2021.
-
"A cold, technical decision-maker": Can AI provide explainability, negotiability, and humanity?
Authors:
Allison Woodruff,
Yasmin Asare Anderson,
Katherine Jameson Armstrong,
Marina Gkiza,
Jay Jennings,
Christopher Moessner,
Fernanda Viegas,
Martin Wattenberg,
Lynette Webb,
Fabian Wrede,
Patrick Gage Kelley
Abstract:
Algorithmic systems are increasingly deployed to make decisions in many areas of people's lives. The shift from human to algorithmic decision-making has been accompanied by concern about potentially opaque decisions that are not aligned with social values, as well as proposed remedies such as explainability. We present results of a qualitative study of algorithmic decision-making, comprised of five workshops conducted with a total of 60 participants in Finland, Germany, the United Kingdom, and the United States. We invited participants to reason about decision-making qualities such as explainability and accuracy in a variety of domains. Participants viewed AI as a decision-maker that follows rigid criteria and performs mechanical tasks well, but is largely incapable of subjective or morally complex judgments. We discuss participants' consideration of humanity in decision-making, and introduce the concept of 'negotiability,' the ability to go beyond formal criteria and work flexibly around the system.
Submitted 1 December, 2020;
originally announced December 2020.
-
The What-If Tool: Interactive Probing of Machine Learning Models
Authors:
James Wexler,
Mahima Pushkarna,
Tolga Bolukbasi,
Martin Wattenberg,
Fernanda Viegas,
Jimbo Wilson
Abstract:
A key challenge in developing and deploying Machine Learning (ML) systems is understanding their performance across a wide range of inputs. To address this challenge, we created the What-If Tool, an open-source application that allows practitioners to probe, visualize, and analyze ML systems, with minimal coding. The What-If Tool lets practitioners test performance in hypothetical situations, analyze the importance of different data features, and visualize model behavior across multiple models and subsets of input data. It also lets practitioners measure systems according to multiple ML fairness metrics. We describe the design of the tool, and report on real-life usage at different organizations.
Submitted 3 October, 2019; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Visualizing and Measuring the Geometry of BERT
Authors:
Andy Coenen,
Emily Reif,
Ann Yuan,
Been Kim,
Adam Pearce,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
Submitted 28 October, 2019; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Neural Networks Trained on Natural Scenes Exhibit Gestalt Closure
Authors:
Been Kim,
Emily Reif,
Martin Wattenberg,
Samy Bengio,
Michael C. Mozer
Abstract:
The Gestalt laws of perceptual organization, which describe how visual elements in an image are grouped and interpreted, have traditionally been thought of as innate despite their ecological validity. We use deep-learning methods to investigate whether natural scene statistics might be sufficient to derive the Gestalt laws. We examine the law of closure, which asserts that human visual perception tends to "close the gap" by assembling elements that can jointly be interpreted as a complete figure or object. We demonstrate that a state-of-the-art convolutional neural network, trained to classify natural images, exhibits closure on synthetic displays of edge fragments, as assessed by similarity of internal representations. This finding provides support for the hypothesis that the human perceptual system is even more elegant than the Gestaltists imagined: a single law---adaptation to the statistical structure of the environment---might suffice as fundamental.
Submitted 29 June, 2020; v1 submitted 3 March, 2019;
originally announced March 2019.
-
Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making
Authors:
Carrie J. Cai,
Emily Reif,
Narayan Hegde,
Jason Hipp,
Been Kim,
Daniel Smilkov,
Martin Wattenberg,
Fernanda Viegas,
Greg S. Corrado,
Martin C. Stumpe,
Michael Terry
Abstract:
Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.
Submitted 8 February, 2019;
originally announced February 2019.
-
TensorFlow.js: Machine Learning for the Web and Beyond
Authors:
Daniel Smilkov,
Nikhil Thorat,
Yannick Assogba,
Ann Yuan,
Nick Kreeger,
Ping Yu,
Kangyi Zhang,
Shanqing Cai,
Eric Nielsen,
David Soergel,
Stan Bileschi,
Michael Terry,
Charles Nicholson,
Sandeep N. Gupta,
Sarah Sirajuddin,
D. Sculley,
Rajat Monga,
Greg Corrado,
Fernanda B. Viégas,
Martin Wattenberg
Abstract:
TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases.
Submitted 27 February, 2019; v1 submitted 16 January, 2019;
originally announced January 2019.
-
GAN Lab: Understanding Complex Deep Generative Models using Interactive Visual Experimentation
Authors:
Minsuk Kahng,
Nikhil Thorat,
Duen Horng Chau,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Recent success in deep learning has generated immense interest among practitioners and students, inspiring many to learn about this new technology. While visual and interactive approaches have been successfully developed to help people more easily learn deep learning, most existing tools focus on simpler models. In this work, we present GAN Lab, the first interactive visualization tool designed for non-experts to learn and experiment with Generative Adversarial Networks (GANs), a popular class of complex deep learning models. With GAN Lab, users can interactively train generative models and visualize the dynamic training process's intermediate results. GAN Lab tightly integrates a model overview graph that summarizes GAN's structure, and a layered distributions view that helps users interpret the interplay between submodels. GAN Lab introduces new interactive experimentation features for learning complex deep learning models, such as step-by-step training at multiple levels of abstraction for understanding intricate training dynamics. Implemented using TensorFlow.js, GAN Lab is accessible to anyone via modern web browsers, without the need for installation or specialized hardware, overcoming a major practical challenge in deploying interactive tools for deep learning.
Submitted 5 September, 2018;
originally announced September 2018.
-
Adversarial Spheres
Authors:
Justin Gilmer,
Luke Metz,
Fartash Faghri,
Samuel S. Schoenholz,
Maithra Raghu,
Martin Wattenberg,
Ian Goodfellow
Abstract:
State-of-the-art computer vision models have been shown to be vulnerable to small adversarial perturbations of the input. In other words, most images in the data distribution are both correctly classified by the model and are very close to a visually similar misclassified image. Despite substantial research interest, the cause of the phenomenon is still poorly understood and remains unsolved. We hypothesize that this counterintuitive behavior is a naturally occurring result of the high dimensional geometry of the data manifold. As a first step towards exploring this hypothesis, we study a simple synthetic dataset of classifying between two concentric high dimensional spheres. For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size $O(1/\sqrt{d})$. Surprisingly, when we train several different architectures on this dataset, all of their error sets naturally approach this theoretical bound. As a result of the theory, the vulnerability of neural networks to small adversarial perturbations is a logical consequence of the amount of test error observed. We hope that our theoretical analysis of this very simple case will point the way forward to explore how the geometry of complex real-world data sets leads to adversarial examples.
Submitted 10 September, 2018; v1 submitted 8 January, 2018;
originally announced January 2018.
-
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Authors:
Been Kim,
Martin Wattenberg,
Justin Gilmer,
Carrie Cai,
James Wexler,
Fernanda Viegas,
Rory Sayres
Abstract:
The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
Submitted 7 June, 2018; v1 submitted 30 November, 2017;
originally announced November 2017.
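The directional-derivative idea in the abstract above can be sketched in a few lines. The abstract says only that TCAV uses directional derivatives to quantify how important a user-defined concept is to a classification result. The sketch assumes that the gradients of a class score with respect to a layer's activations are already available as plain lists, that the concept activation vector `cav` lives in the same space, and that sensitivity is summarized as the fraction of inputs with a positive directional derivative. Both function names and that summary statistic are illustrative assumptions.

```python
def directional_derivative(gradient, cav):
    """Dot product of a class-score gradient with a concept direction."""
    if len(gradient) != len(cav):
        raise ValueError("gradient and cav must have the same length")
    return sum(g * c for g, c in zip(gradient, cav))


def concept_sensitivity_score(gradients, cav):
    """Fraction of inputs whose class score increases along the concept.

    gradients: one gradient vector per input (hypothetical precomputed data).
    cav:       concept activation vector in the same activation space.
    """
    if not gradients:
        raise ValueError("at least one gradient is required")
    positive = sum(1 for g in gradients if directional_derivative(g, cav) > 0)
    return positive / len(gradients)


if __name__ == "__main__":
    # Three made-up gradients for a "zebra" score and a made-up "stripes" CAV.
    zebra_gradients = [[1.0, 0.0], [0.5, -1.0], [-1.0, 2.0]]
    stripes_cav = [1.0, 0.0]
    print(concept_sensitivity_score(zebra_gradients, stripes_cav))
```

A score near 1 would suggest that moving activations toward the concept raises the class score for most inputs; a score near 0 would suggest the opposite.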
-
Direct-Manipulation Visualization of Deep Networks
Authors:
Daniel Smilkov,
Shan Carter,
D. Sculley,
Fernanda B. Viégas,
Martin Wattenberg
Abstract:
The recent successes of deep learning have led to a wave of interest from non-experts. Gaining an understanding of this technology, however, is difficult. While the theory is important, it is also helpful for novices to develop an intuitive feel for the effect of different hyperparameters and structural variations. We describe TensorFlow Playground, an interactive, open sourced visualization that allows users to experiment via direct manipulation rather than coding, enabling them to quickly build an intuition about neural nets.
Submitted 12 August, 2017;
originally announced August 2017.
-
SmoothGrad: removing noise by adding noise
Authors:
Daniel Smilkov,
Nikhil Thorat,
Been Kim,
Fernanda Viégas,
Martin Wattenberg
Abstract:
Explaining the output of a deep network remains a challenge. In the case of an image classifier, one type of explanation is to identify pixels that strongly influence the final decision. A starting point for this strategy is the gradient of the class score function with respect to the input image. This gradient can be interpreted as a sensitivity map, and there are several techniques that elaborate on this basic idea. This paper makes two contributions: it introduces SmoothGrad, a simple method that can help visually sharpen gradient-based sensitivity maps, and it discusses lessons in the visualization of these maps. We publish the code for our experiments and a website with our results.
Submitted 12 June, 2017;
originally announced June 2017.
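The abstract above does not spell out the procedure, but the title suggests one reading: add random noise to the input several times and average the resulting gradient maps. The sketch below follows that reading and should be taken as an assumption, not as the authors' implementation. The gradient function `grad_fn`, the Gaussian noise model, and the default parameter values are all hypothetical.

```python
import random


def smooth_grad(grad_fn, x, noise_std=0.1, n_samples=50, seed=0):
    """Average the gradient of a score over noisy copies of the input.

    grad_fn:   callable returning the gradient (list of floats) at an input.
    x:         input as a flat list of floats (for example, pixel values).
    noise_std: standard deviation of the Gaussian noise added to each value.
    n_samples: number of noisy copies to average over.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rng = random.Random(seed)  # seeded so repeated runs give the same map
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        gradient = grad_fn(noisy)
        total = [t + g for t, g in zip(total, gradient)]
    return [t / n_samples for t in total]


if __name__ == "__main__":
    # Toy score f(x) = x0**2 + 3*x1, so the gradient is [2*x0, 3].
    def toy_gradient(point):
        return [2.0 * point[0], 3.0]

    print(smooth_grad(toy_gradient, [1.0, -2.0]))
```

For a score that is linear in the input the averaged map equals the plain gradient, so any visual sharpening can only come from nonlinear regions of the model.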
-
Embedding Projector: Interactive Visualization and Interpretation of Embeddings
Authors:
Daniel Smilkov,
Nikhil Thorat,
Charles Nicholson,
Emily Reif,
Fernanda B. Viégas,
Martin Wattenberg
Abstract:
Embeddings are ubiquitous in machine learning, appearing in recommender systems, NLP, and many other applications. Researchers and developers often need to explore the properties of a specific embedding, and one way to analyze embeddings is to visualize them. We present the Embedding Projector, a tool for interactive visualization and interpretation of embeddings.
Submitted 16 November, 2016;
originally announced November 2016.
-
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Authors:
Melvin Johnson,
Mike Schuster,
Quoc V. Le,
Maxim Krikun,
Yonghui Wu,
Zhifeng Chen,
Nikhil Thorat,
Fernanda Viégas,
Martin Wattenberg,
Greg Corrado,
Macduff Hughes,
Jeffrey Dean
Abstract:
We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English$\rightarrow$French and surpasses state-of-the-art results for English$\rightarrow$German. Similarly, a single multilingual model surpasses state-of-the-art results for French$\rightarrow$English and German$\rightarrow$English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
Submitted 21 August, 2017; v1 submitted 14 November, 2016;
originally announced November 2016.
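The input convention described in the abstract above, an artificial token at the start of the source sentence that names the target language, can be shown with a small preprocessing sketch. The token spelling `<2xx>` and the function name are assumptions made for illustration; the abstract does not give the exact token format.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend an artificial target-language token to a source sentence.

    target_lang is a short language code such as "de" or "fr". The "<2xx>"
    spelling is a hypothetical format chosen for this sketch.
    """
    code = target_lang.strip().lower()
    if not code:
        raise ValueError("target_lang must be a non-empty language code")
    return f"<2{code}> {source_sentence}"


if __name__ == "__main__":
    # The same English sentence, routed to two different target languages.
    print(add_target_token("Hello, how are you?", "de"))
    print(add_target_token("Hello, how are you?", "fr"))
```

Because only the input text changes, the same encoder, decoder, and attention parameters serve every language pair, which is the point the abstract makes about not increasing the parameter count.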
-
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Authors:
Martín Abadi,
Ashish Agarwal,
Paul Barham,
Eugene Brevdo,
Zhifeng Chen,
Craig Citro,
Greg S. Corrado,
Andy Davis,
Jeffrey Dean,
Matthieu Devin,
Sanjay Ghemawat,
Ian Goodfellow,
Andrew Harp,
Geoffrey Irving,
Michael Isard,
Yangqing Jia,
Rafal Jozefowicz,
Lukasz Kaiser,
Manjunath Kudlur,
Josh Levenberg,
Dan Mane,
Rajat Monga,
Sherry Moore,
Derek Murray,
Chris Olah
, et al. (15 additional authors not shown)
Abstract:
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Submitted 16 March, 2016; v1 submitted 14 March, 2016;
originally announced March 2016.
-
Semiclassical Geometry of 4D Reduced Supersymmetric Yang-Mills Integrals
Authors:
Zdzislaw Burda,
Bengt Petersson,
Marc Wattenberg
Abstract:
We investigate semiclassical properties of space-time geometry of the low energy limit of reduced four-dimensional supersymmetric Yang-Mills integrals using Monte-Carlo simulations. The limit is obtained by a one-loop approximation of the original Yang-Mills integrals leading to an effective model of branched polymers. We numerically determine the behaviour of the gyration radius, the two-point correlation function and the Polyakov-line operator in the effective model and discuss the results in the context of the large-distance behaviour of the original matrix model.
Submitted 13 April, 2005; v1 submitted 3 March, 2005;
originally announced March 2005.
-
From 4D Reduced SYM Integrals to Branched-Polymers
Authors:
Zdzislaw Burda,
Bengt Petersson,
Marc Wattenberg
Abstract:
We derive analytically one-loop corrections to the effective Polyakov-line operator in the branched-polymer approximation of the reduced four-dimensional supersymmetric Yang-Mills integrals.
Submitted 29 August, 2003; v1 submitted 28 August, 2003;
originally announced August 2003.
-
Exotic trees
Authors:
Z. Burda,
J. Erdmann,
B. Petersson,
M. Wattenberg
Abstract:
We discuss the scaling properties of free branched polymers. The scaling behaviour of the model is classified by the Hausdorff dimensions for the internal geometry: d_L and d_H, and for the external one: D_L and D_H. The dimensions d_H and D_H characterize the behaviour for long distances while d_L and D_L for short distances. We show that the internal Hausdorff dimension is d_L=2 for generic and scale-free trees, contrary to d_H which is known to be equal to two for generic trees and to vary between two and infinity for scale-free trees. We show that the external Hausdorff dimension D_H is directly related to the internal one as D_H = α d_H, where α is the stability index of the embedding weights for the nearest-vertex interactions. The index is α=2 for weights from the Gaussian domain of attraction and 0<α<2 for those from the Lévy domain of attraction. If the dimension D of the target space is larger than D_H one finds D_L=D_H, or otherwise D_L=D. The latter result means that the fractal structure cannot develop in a target space which has too low dimension.
Submitted 18 July, 2002;
originally announced July 2002.