-
Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction
Authors:
Gustaw Opiełka,
Hannes Rosenbusch,
Claire E. Stevenson
Abstract:
Analogical reasoning relies on conceptual abstractions, but it is unclear whether Large Language Models (LLMs) harbor such internal representations. We explore distilled representations from LLM activations and find that function vectors (FVs; Todd et al., 2024) - compact representations for in-context learning (ICL) tasks - are not invariant to simple input changes (e.g., open-ended vs. multiple-choice), suggesting they capture more than pure concepts. Using representational similarity analysis (RSA), we localize a small set of attention heads that encode invariant concept vectors (CVs) for verbal concepts like "antonym". These CVs function as feature detectors that operate independently of the final output - meaning that a model may form a correct internal representation yet still produce an incorrect output. Furthermore, CVs can be used to causally guide model behaviour. However, for more abstract concepts like "previous" and "next", we do not observe invariant linear representations, a finding we link to generalizability issues LLMs display within these domains.
Submitted 5 March, 2025;
originally announced March 2025.
-
Pencils to Pixels: A Systematic Study of Creative Drawings across Children, Adults and AI
Authors:
Surabhi S Nath,
Guiomar del Cuvillo y Schröder,
Claire E. Stevenson
Abstract:
Can we derive computational metrics to quantify visual creativity in drawings across intelligent agents, while accounting for inherent differences in technical skill and style? To answer this, we curate a novel dataset consisting of 1338 drawings by children, adults and AI on a creative drawing task. We characterize two aspects of the drawings -- (1) style and (2) content. For style, we define measures of ink density, ink distribution and number of elements. For content, we use expert-annotated categories to study conceptual diversity, and image and text embeddings to compute distance measures. We compare the style, content and creativity of children, adults and AI drawings and build simple models to predict expert and automated creativity scores. We find significant differences in style and content in the groups -- children's drawings had more components, AI drawings had greater ink density, and adult drawings revealed maximum conceptual diversity. Notably, we highlight a misalignment between creativity judgments obtained through expert and automated ratings and discuss its implications. Through these efforts, our work provides, to the best of our knowledge, the first framework for studying human and artificial creativity beyond the textual modality, and attempts to arrive at the domain-agnostic principles underlying creativity. Our data and scripts are available on GitHub.
Submitted 9 February, 2025;
originally announced February 2025.
-
Can Large Language Models generalize analogy solving like people can?
Authors:
Claire E. Stevenson,
Alexandra Pafford,
Han L. J. van der Maas,
Melanie Mitchell
Abstract:
When we solve an analogy we transfer information from a known context to a new one through abstract rules and relational similarity. In people, the ability to solve analogies such as "body : feet :: table : ?" emerges in childhood, and appears to transfer easily to other domains, such as the visual domain "( : ) :: < : ?". Recent research shows that large language models (LLMs) can solve various forms of analogies. However, can LLMs generalize analogy solving to new domains like people can? To investigate this, we had children, adults, and LLMs solve a series of letter-string analogies (e.g., a b : a c :: j k : ?) in the Latin alphabet, in a near transfer domain (Greek alphabet), and a far transfer domain (list of symbols). As expected, children and adults easily generalized their knowledge to unfamiliar domains, whereas LLMs did not. This key difference between human and AI performance is evidence that these LLMs still struggle with robust human-like analogical transfer.
Submitted 11 March, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Evaluating Creative Short Story Generation in Humans and Large Language Models
Authors:
Mete Ismayilzada,
Claire Stevenson,
Lonneke van der Plas
Abstract:
Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.
Submitted 10 May, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Characterising the Creative Process in Humans and Large Language Models
Authors:
Surabhi S. Nath,
Peter Dayan,
Claire Stevenson
Abstract:
Large language models appear quite creative, often performing on par with the average human on creative tasks. However, research on LLM creativity has focused solely on products, with little attention on the creative process. Process analyses of human creativity often require hand-coded categories or exploit response times, which do not apply to LLMs. We provide an automated method to characterise how humans and LLMs explore semantic spaces on the Alternate Uses Task, and contrast with behaviour in a Verbal Fluency Task. We use sentence embeddings to identify response categories and compute semantic similarities, which we use to generate jump profiles. Our results corroborate earlier work in humans reporting both persistent (deep search in few semantic spaces) and flexible (broad search across multiple semantic spaces) pathways to creativity, where both pathways lead to similar creativity scores. LLMs were found to be biased towards either persistent or flexible paths, which varied across tasks. Though LLMs as a population match human profiles, their relationship with creativity is different, where the more flexible models score higher on creativity. Our dataset and scripts are available on GitHub (https://github.com/surabhisnath/Creative_Process).
Submitted 5 June, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Do Large Language Models Solve ARC Visual Analogies Like People Do?
Authors:
Gustaw Opiełka,
Hannes Rosenbusch,
Veerle Vijverberg,
Claire E. Stevenson
Abstract:
The Abstraction Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Results show that both children and adults outperform most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, where part of the analogy is simply copied. In addition, we found two other error types, one based on seemingly grasping key concepts (e.g., Inside-Outside) and the other based on simple combinations of analogy input matrices. On the whole, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and the extent to which we can use error analyses and comparisons with human development to understand how LLMs solve visual analogies.
Submitted 13 May, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method
Authors:
Luca H. Thoms,
Karel A. Veldkamp,
Hannes Rosenbusch,
Claire E. Stevenson
Abstract:
Analogical reasoning derives information from known relations and generalizes this information to similar yet unfamiliar situations. One of the first generalized ways in which deep learning models were able to solve verbal analogies was through vector arithmetic of word embeddings, essentially relating words that were mapped to a vector space (e.g., king - man + woman = __?). In comparison, most attempts to solve visual analogies are still predominantly task-specific and less generalizable. This project focuses on visual analogical reasoning and applies the initial generalized mechanism used to solve verbal analogies to the visual realm. Taking the Abstraction and Reasoning Corpus (ARC) as an example to investigate visual analogy solving, we use a variational autoencoder (VAE) to transform ARC items into low-dimensional latent vectors, analogous to the word embeddings used in the verbal approaches. Through simple vector arithmetic, underlying rules of ARC items are discovered and used to solve them. Results indicate that the approach works well on simple items with fewer dimensions (i.e., few colors used, uniform shapes), similar input-to-output examples, and high reconstruction accuracy on the VAE. Predictions on more complex items showed stronger deviations from expected outputs, although predictions still often approximated parts of the item's rule set. Error patterns indicated that the model works as intended. On the official ARC paradigm, the model achieved a score of 2% (cf. the current world record of 21%) and on ConceptARC it scored 8.8%. Although the methodology proposed involves basic dimensionality reduction techniques and standard vector arithmetic, this approach demonstrates promising outcomes on ARC and can easily be generalized to other abstract visual reasoning tasks.
Submitted 14 November, 2023;
originally announced November 2023.
-
Do large language models solve verbal analogies like children do?
Authors:
Claire E. Stevenson,
Mathilde ter Veen,
Rochelle Choenni,
Han L. J. van der Maas,
Ekaterina Shutova
Abstract:
Analogy-making lies at the heart of human cognition. Adults solve analogies such as "Horse belongs to stable like chicken belongs to ...?" by mapping relations ("kept in") and answering "chicken coop". In contrast, children often use association, e.g., answering "egg". This paper investigates whether large language models (LLMs) solve verbal analogies in A:B::C:? form using associations, similar to what children do. We use verbal analogies extracted from an online adaptive learning environment, where 14,002 7-12 year-olds from the Netherlands solved 622 analogies in Dutch. The six tested Dutch monolingual and multilingual LLMs performed around the same level as children, with MGPT performing worst, around the 7-year-old level, and XLM-V and GPT-3 the best, slightly above the 11-year-old level. However, when we control for associative processes this picture changes and each model's performance level drops 1-2 years. Further experiments demonstrate that associative processes often underlie correctly solved analogies. We conclude that the LLMs we tested indeed tend to solve verbal analogies by association with C like children do.
Submitted 31 October, 2023;
originally announced October 2023.
-
Local Sharing and Sociality Effects on Wealth Inequality in a Simple Artificial Society
Authors:
John C. Stevenson
Abstract:
Redistribution of resources within a group as a method to reduce wealth inequality is a current area of debate. The evolutionary path to or away from wealth sharing is also a subject of active research. In order to investigate effects and evolution of wealth sharing, societies are simulated using a minimal model of a complex adapting system. These simulations demonstrate, for this artificial foraging society, that local sharing of resources reduces the economy's total wealth and increases wealth inequality. Evolutionary pressures strongly select against local sharing, whether globally or within an individual's clan, and select for asocial behaviors. By holding constant the gene for sharing resources among neighbors, from rich to poor, either with everyone or only within members of the same clan, social behavior is selected but total wealth and mean age are substantially reduced relative to non-sharing societies. The Gini coefficient is shown to be ineffective in measuring these changes in total wealth and wealth distributions, and, therefore, individual well-being. Only with sociality do strategies emerge that allow sharing clans to exclude or coexist with non-sharing clans. These strategies are based on spatial effects, emphasizing the importance of modeling movement-mediated community assembly and coexistence as well as sociality.
Submitted 26 May, 2023;
originally announced May 2023.
-
The Struggle for Existence: Time, Memory and Bloat
Authors:
John C Stevenson
Abstract:
Combining a spatiotemporal, multi-agent based model of a foraging ecosystem with linear, genetically programmed rules for the agents' behaviors results in implicit, endogenous, objective functions and selection algorithms based on "natural selection". Use of this implicit optimization of genetic programs for study of biological systems is tested by application to an artificial foraging ecosystem, and compared with established biological, ecological, and stochastic gene diffusion models. Limited program memory and execution time constraints emulate real-time and concurrent properties of physical and biological systems, and stress test the optimization algorithms. Relative fitness of the agents' programs and efficiency of the resultant populations as functions of these constraints gauge optimization effectiveness and efficiency. Novel solutions confirm the creativity of the optimization process and provide a unique opportunity to experimentally test the neutral code bloating hypotheses. Use of this implicit, endogenous, evolutionary optimization of spatially interacting, genetically programmed agents is thus shown to be novel, consistent with biological systems, and effective and efficient in discovering fit and novel solutions.
Submitted 30 May, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Putting GPT-3's Creativity to the (Alternative Uses) Test
Authors:
Claire Stevenson,
Iris Smal,
Matthijs Baas,
Raoul Grasman,
Han van der Maas
Abstract:
AI large language models have (co-)produced amazing written works from newspaper articles to novels and poetry. These works meet the criteria of the standard definition of creativity: being original and useful, sometimes with the additional element of surprise. But can a large language model designed to predict the next text fragment provide creative, out-of-the-box responses that still solve the problem at hand? We put OpenAI's generative natural language model, GPT-3, to the test. Can it provide creative solutions to one of the most commonly used tests in creativity research? We assessed GPT-3's creativity on Guilford's Alternative Uses Test and compared its performance to previously collected human responses on expert ratings of originality, usefulness and surprise of responses, flexibility of each set of ideas, as well as an automated method to measure creativity based on the semantic distance between a response and the AUT object in question. Our results show that -- on the whole -- humans currently outperform GPT-3 when it comes to creative output. But we believe it is only a matter of time before GPT-3 catches up on this particular task. We discuss what this work reveals about human and AI creativity, creativity testing and our definition of creativity.
Submitted 10 June, 2022;
originally announced June 2022.
-
Dynamics of Wealth Inequality in Simple Artificial Societies
Authors:
John C. Stevenson
Abstract:
A simple generative model of a foraging society generates significant wealth inequalities from identical agents on an equal opportunity landscape. These inequalities arise in both equilibrium and non-equilibrium regimes, with some societies essentially never reaching equilibrium. Reproduction costs mitigate inequality beyond their effect on intrinsic growth rate. The highest levels of inequality are found during non-equilibrium regimes. Inequality in dynamic regimes is driven by factors different than those driving steady state inequality. Evolutionary pressures drive the intrinsic growth rate as high as possible, leading to a tragedy of the commons.
Submitted 14 October, 2021; v1 submitted 26 August, 2021;
originally announced August 2021.
-
Agentization of Two Population-Driven Models of Mathematical Biology
Authors:
John C. Stevenson
Abstract:
Single species population models and discrete stochastic gene frequency models are two standards of mathematical biology important for the evolution of populations. An agent based model is presented which reproduces these models and then explores where these models agree and disagree under relaxed specifications. For the population models, the requirement of homogeneous mixing prevents prediction of extinctions due to local resource depletion. These models also suggest equilibrium based on attainment of constant population levels though underlying population characteristics may be nowhere close to equilibrium. The discrete stochastic gene frequency models assume well mixed populations at constant levels. The models' predictions for non-constant populations in strongly oscillating and chaotic regimes are surprisingly good, only diverging from the ABM at the most chaotic levels.
Submitted 17 September, 2021; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Population and Inequality Dynamics in Simple Economies
Authors:
John C. Stevenson
Abstract:
While the use of spatial agent-based and individual-based models has flourished across many scientific disciplines, the complexities these models generate are often difficult to manage and quantify. This research reduces population-driven, spatial modeling of individuals to the simplest configurations and parameters: an equal resource opportunity landscape with equally capable individuals; and asks the question, "Will valid complex population and inequality dynamics emerge from this simple economic model?" Two foraging economies are modeled: subsistence and surplus. The resulting, emergent population dynamics are characterized by their sensitivities to agent and landscape parameters. The various steady and oscillating regimes of single-species population dynamics are generated by appropriate selection of model growth parameters. These emergent dynamics are shown to be consistent with the equation-based, continuum modeling of single-species populations in biology and ecology. The intrinsic growth rates, carrying capacities, and delay parameters of these models are implied for these simple economies. Aggregate measures of individual distributions are used to understand the sensitivities to model parameters. New local measures are defined to describe complex behaviors driven by spatial effects, especially extinctions. This simple economic model is shown to generate significantly complex population and inequality dynamics. Model parameters generating the intrinsic growth rate have strong effects on these dynamics, including large variations in inequality. Significant inequality effects are shown to be caused by birth costs above and beyond their contribution to the intrinsic growth rate. The highest levels of inequality are found during the initial non-equilibrium period and are driven by factors different than those driving steady state inequality.
Submitted 16 August, 2021; v1 submitted 24 January, 2021;
originally announced January 2021.