-
A Neural Model for Word Repetition
Authors:
Daniel Dager,
Robin Sobczyk,
Emmanuel Chemla,
Yair Lakretz
Abstract:
It takes several years for the developing brain of a baby to fully master word repetition, the task of hearing a word and repeating it aloud. Repeating a new word, such as one from a new language, can also be challenging for adults. Additionally, brain damage, such as from a stroke, may lead to systematic speech errors with specific characteristics dependent on the location of the brain damage. Cognitive sciences suggest a model with various components for the different processing stages involved in word repetition. While some studies have begun to localize the corresponding regions in the brain, the neural mechanisms and how exactly the brain performs word repetition remain largely unknown. We propose to bridge the gap between the cognitive model of word repetition and neural mechanisms in the human brain by modeling the task using deep neural networks. Neural models are fully observable, allowing us to study the detailed mechanisms in their various substructures and make comparisons with human behavior and, ultimately, the brain. Here, we make first steps in this direction by: (1) training a large set of models to simulate the word repetition task; (2) creating a battery of tests to probe the models for known effects from behavioral studies in humans; and (3) simulating brain damage through ablation studies, where we systematically remove neurons from the model, and repeat the behavioral study to examine the resulting speech errors in the "patient" model. Our results show that neural models can mimic several effects known from human research, but might diverge in other aspects, highlighting both the potential and the challenges for future research aimed at developing human-like neural models.
Submitted 16 June, 2025;
originally announced June 2025.
-
A Minimum Description Length Approach to Regularization in Neural Networks
Authors:
Matan Abudy,
Orr Well,
Emmanuel Chemla,
Roni Katzir,
Nur Lan
Abstract:
State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
Submitted 19 May, 2025;
originally announced May 2025.
-
fastabx: A library for efficient computation of ABX discriminability
Authors:
Maxime Poli,
Emmanuel Chemla,
Emmanuel Dupoux
Abstract:
We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at https://github.com/bootphon/fastabx.
Submitted 5 May, 2025;
originally announced May 2025.
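To make the ABX measure mentioned in this abstract concrete, here is a minimal pure-Python sketch of the general idea: for every triple (A, B, X) in which A and X come from one category and B from another, check whether X lies closer to A than to B. This is an illustration only and does not use or reproduce the fastabx API; the function name `abx_score`, the Euclidean distance, and the half-credit rule for ties are assumptions made for the example.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def abx_score(cat_a, cat_b, distance=dist):
    """Fraction of (A, B, X) triples in which X is closer to A than to B.

    A and X are distinct items of `cat_a`; B is any item of `cat_b`.
    Ties count as half a success (an assumption of this sketch).
    """
    correct = 0.0
    total = 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue  # A and X must be different tokens of the same category
            for b in cat_b:
                d_ax = distance(a, x)
                d_bx = distance(b, x)
                if d_ax < d_bx:
                    correct += 1.0
                elif d_ax == d_bx:
                    correct += 0.5
                total += 1
    if total == 0:
        raise ValueError("need at least two items in cat_a and one in cat_b")
    return correct / total


# Hypothetical 2-D "representations" of two well-separated categories.
separated = abx_score([(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)])
# A category whose members are far apart, with a competitor near one of them.
mixed = abx_score([(0.0, 0.0), (10.0, 0.0)], [(10.0, 1.0)])
```

A score of 1.0 means the two categories are perfectly separated under the chosen distance, and values near 0.5 indicate little separation. Results are often reported as an error rate, which is one minus this score.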
-
Large Language Models as Proxies for Theories of Human Linguistic Cognition
Authors:
Imry Ziv,
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
We consider the possible role of current large language models (LLMs) in the study of human linguistic cognition. We focus on the use of such models as proxies for theories of cognition that are relatively linguistically-neutral in their representations and learning but differ from current LLMs in key ways. We illustrate this potential use of LLMs as proxies for theories of cognition in the context of two kinds of questions: (a) whether the target theory accounts for the acquisition of a given pattern from a given corpus; and (b) whether the target theory makes a given typologically-attested pattern easier to acquire than another, typologically-unattested pattern. For each of the two questions we show, building on recent literature, how current LLMs can potentially be of help, but we note that at present this help is quite limited.
Submitted 11 February, 2025;
originally announced February 2025.
-
Disentanglement and Compositionality of Letter Identity and Letter Position in Variational Auto-Encoder Vision Models
Authors:
Bruno Bianchi,
Aakash Agrawal,
Stanislas Dehaene,
Emmanuel Chemla,
Yair Lakretz
Abstract:
Human readers can accurately count how many letters are in a word (e.g., 7 in "buffalo"), remove a letter from a given position (e.g., "bufflo") or add a new one. Readers' brains must therefore have learned to disentangle information related to the position of a letter and its identity. Such disentanglement is necessary for the compositional, unbounded ability of humans to create and parse new strings, with any combination of letters appearing in any positions. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art disentanglement of features in visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained beta variational autoencoders ($\beta$-VAEs) to reconstruct images of letter strings and evaluated their disentanglement performance using CompOrth, a new benchmark that we created for studying compositional learning and zero-shot generalization in visual models for orthography. The benchmark suggests a set of tests, of increasing complexity, to evaluate the degree of disentanglement between orthographic features of written words in deep neural models. Using CompOrth, we conducted a set of experiments to analyze the generalization ability of these models, in particular, to unseen word length and to unseen combinations of letter identities and letter positions. We found that while models effectively disentangle surface features, such as horizontal and vertical 'retinal' locations of words within an image, they dramatically fail to disentangle letter position and letter identity and lack any notion of word length. Together, this study demonstrates the shortcomings of state-of-the-art $\beta$-VAE models compared to humans and proposes a new challenge and a corresponding benchmark to evaluate neural models.
Submitted 11 December, 2024;
originally announced December 2024.
-
A polar coordinate system represents syntax in large language models
Authors:
Pablo Diego-Simón,
Stéphane D'Ascoli,
Emmanuel Chemla,
Yair Lakretz,
Jean-Rémi King
Abstract:
Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a 'Structural Probe' can find a subspace of neural activations, where syntactically related words are relatively close to one another. However, this syntactic code remains incomplete: the distance between the Structural Probe word embeddings can represent the existence but not the type and direction of syntactic relations. Here, we hypothesize that syntactic relations are, in fact, coded by the relative direction between nearby embeddings. To test this hypothesis, we introduce a 'Polar Probe' trained to read syntactic relations from both the distance and the direction between word embeddings. Our approach reveals three main findings. First, our Polar Probe successfully recovers the type and direction of syntactic relations, and outperforms the Structural Probe nearly twofold. Second, we confirm that this polar coordinate system exists in a low-dimensional subspace of the intermediate layers of many LLMs and becomes increasingly precise in the latest frontier models. Third, we demonstrate with a new benchmark that similar syntactic relations are coded similarly across the nested levels of syntactic trees. Overall, this work shows that LLMs spontaneously learn a geometry of neural activations that explicitly represents the main symbolic structures of linguistic theory.
Submitted 7 December, 2024;
originally announced December 2024.
-
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
Authors:
Maxime Poli,
Emmanuel Chemla,
Emmanuel Dupoux
Abstract:
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
Submitted 30 October, 2024; v1 submitted 16 September, 2024;
originally announced October 2024.
-
No Such Thing as a General Learner: Language models and their dual optimization
Authors:
Emmanuel Chemla,
Ryan M. Nefdt
Abstract:
What role can the otherwise successful Large Language Models (LLMs) play in the understanding of human cognition, and in particular in terms of informing language acquisition debates? To contribute to this question, we first argue that neither humans nor LLMs are general learners, in a variety of senses. We make a novel case for how in particular LLMs follow a dual-optimization process: they are optimized during their training (which is typically compared to language acquisition), and modern LLMs have also been selected, through a process akin to natural selection in a species. From this perspective, we argue that the performance of LLMs, whether similar or dissimilar to that of humans, does not weigh easily on important debates about the importance of human cognitive biases for language.
Submitted 21 August, 2024; v1 submitted 18 August, 2024;
originally announced August 2024.
-
What Makes Two Language Models Think Alike?
Authors:
Jeanne Salle,
Louis Jalouzot,
Nur Lan,
Emmanuel Chemla,
Yair Lakretz
Abstract:
Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous methods, MLEMs offer a transparent comparison, by identifying the specific linguistic features responsible for similarities and differences. More generally, the method uses formal, symbolic descriptions of a domain and uses these to compare neural representations. As such, the approach can straightforwardly be extended to other domains, such as speech and vision, and to other neural systems, including human brains.
Submitted 24 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation
Authors:
Nicolas Guerin,
Shane Steinert-Threlkeld,
Emmanuel Chemla
Abstract:
Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show, however, that even a crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is another proof that languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.
Submitted 26 March, 2024;
originally announced March 2024.
-
Metric-Learning Encoding Models Identify Processing Profiles of Linguistic Features in BERT's Representations
Authors:
Louis Jalouzot,
Robin Sobczyk,
Bastien Lhopitallier,
Jeanne Salle,
Nur Lan,
Emmanuel Chemla,
Yair Lakretz
Abstract:
We introduce Metric-Learning Encoding Models (MLEMs) as a new approach to understand how neural systems represent the theoretical features of the objects they process. As a proof-of-concept, we apply MLEMs to neural representations extracted from BERT, and track a wide variety of linguistic features (e.g., tense, subject person, clause type, clause embedding). We find that: (1) linguistic features are ordered: they separate representations of sentences to different degrees in different layers; (2) neural representations are organized hierarchically: in some layers, we find clusters of representations nested within larger clusters, following successively important linguistic features; (3) linguistic features are disentangled in middle layers: distinct, selective units are activated by distinct linguistic features. Methodologically, MLEMs are superior (4) to multivariate decoding methods, being more robust to type-I errors, and (5) to univariate encoding methods, in being able to predict both local and distributed representations. Together, this demonstrates the utility of Metric-Learning Encoding Models for studying how linguistic features are neurally encoded in language models and the advantage of MLEMs over traditional methods. MLEMs can be extended to other domains (e.g., vision) and to other neural systems, such as the human brain.
Submitted 18 February, 2024;
originally announced February 2024.
-
Bridging the Empirical-Theoretical Gap in Neural Network Formal Language Learning Using Minimum Description Length
Authors:
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
Neural networks offer good approximations for many tasks but consistently fail to reach perfect generalization, even when theoretical work shows that such perfect solutions can be expressed by certain architectures. Using the task of formal language learning, we focus on one simple formal language and show that the theoretically correct solution is in fact not an optimum of commonly used objectives, even with regularization techniques that according to common wisdom should lead to simple weights and good generalization (L1, L2) or other meta-heuristics (early stopping, dropout). On the other hand, replacing standard targets with the Minimum Description Length objective (MDL) results in the correct solution being an optimum.
Submitted 6 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Minimum Description Length Hopfield Networks
Authors:
Matan Abudy,
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
Associative memory architectures are designed for memorization but also offer, through their retrieval method, a form of generalization to unseen inputs: stored memories can be seen as prototypes from this point of view. Focusing on Modern Hopfield Networks (MHN), we show that a large memorization capacity undermines the generalization opportunity. We offer a solution to better optimize this tradeoff. It relies on Minimum Description Length (MDL) to determine during training which memories to store, as well as how many of them.
Submitted 11 November, 2023;
originally announced November 2023.
-
Benchmarking Neural Network Generalization for Grammar Induction
Authors:
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
How well do neural networks generalize? Even for grammar induction tasks, where the target generalization is fully known, previous works have left the question open, testing very limited ranges beyond the training set and using different success criteria. We provide a measure of neural network generalization based on fully specified formal languages. Given a model and a formal grammar, the method assigns a generalization score representing how well a model generalizes to unseen samples in inverse relation to the amount of data it was trained on. The benchmark includes languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^mc^{n+m}$, and Dyck-1 and 2. We evaluate selected architectures using the benchmark and find that networks trained with a Minimum Description Length objective (MDL) generalize better and use less data than networks trained using standard loss functions. The benchmark is available at https://github.com/taucompling/bliss.
Submitted 25 August, 2023; v1 submitted 16 August, 2023;
originally announced August 2023.
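As a rough illustration of the formal languages named in this abstract, the sketch below generates strings of $a^nb^n$, $a^nb^nc^n$ and $a^nb^mc^{n+m}$ and checks membership in $a^nb^n$ and Dyck-1. It is not the benchmark's code and does not compute its generalization score; all function names are hypothetical, and the choice to accept the empty string (n = 0) is an assumption of this example.

```python
import re


def anbn(n):
    """String of a^n b^n, e.g. n=2 -> 'aabb'."""
    return "a" * n + "b" * n


def anbncn(n):
    """String of a^n b^n c^n, e.g. n=2 -> 'aabbcc'."""
    return "a" * n + "b" * n + "c" * n


def anbmcnm(n, m):
    """String of a^n b^m c^(n+m), e.g. n=1, m=2 -> 'abbccc'."""
    return "a" * n + "b" * m + "c" * (n + m)


def in_anbn(s):
    """True if s is a block of a's followed by an equally long block of b's."""
    match = re.fullmatch(r"(a*)(b*)", s)
    return match is not None and len(match.group(1)) == len(match.group(2))


def in_dyck1(s):
    """True if s is a balanced string over a single bracket pair '(' and ')'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with nothing open
                return False
        else:
            return False  # symbol outside the alphabet
    return depth == 0


# Test lengths well beyond a hypothetical training range of n <= 20,
# mirroring the idea of probing generalization far outside the training set.
held_out = [anbn(n) for n in range(21, 30)]
all_accepted = all(in_anbn(s) for s in held_out)
```

Because the target language is fully specified, a reference checker like `in_anbn` can label arbitrarily long unseen strings, which is what makes exact evaluation outside the training range possible.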
-
Minimum Description Length Recurrent Neural Networks
Authors:
Nur Lan,
Michal Geyer,
Emmanuel Chemla,
Roni Katzir
Abstract:
We train neural networks to optimize a Minimum Description Length score, i.e., to balance between the complexity of the network and its accuracy at a task. We show that networks optimizing this objective function master tasks involving memory challenges and go beyond context-free languages. These learners master languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^{2n}$, $a^nb^mc^{n+m}$, and they perform addition. Moreover, they often do so with 100% accuracy. The networks are small, and their inner workings are transparent. We thus provide formal proofs that their perfect accuracy holds not only on a given test set, but for any input sequence. To our knowledge, no other connectionist model has been shown to capture the underlying grammars for these languages in full generality.
Submitted 31 March, 2022; v1 submitted 31 October, 2021;
originally announced November 2021.
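The balance between network complexity and accuracy described in this abstract can be pictured as a two-part code: bits needed to describe the model plus bits needed to describe the data given the model. The sketch below is a generic illustration under stated assumptions, not the encoding used in the paper: weights are taken to be integers, each is charged a sign bit plus an Elias-gamma-style code length, and the data cost is the usual sum of negative log2 probabilities. All names are hypothetical.

```python
from math import inf, log2


def integer_code_length(k):
    """Bits for a signed integer k: one sign bit plus an Elias-gamma code for |k| + 1.

    Assumed encoding for this sketch only.
    """
    n = abs(k) + 1
    gamma_bits = 2 * (n.bit_length() - 1) + 1
    return 1 + gamma_bits


def model_bits(weights):
    """Description length of the model: total code length of its integer weights."""
    return sum(integer_code_length(w) for w in weights)


def data_bits(probabilities):
    """Description length of the data given the model.

    `probabilities` holds the probability the model assigned to each observed symbol.
    """
    if any(p <= 0 for p in probabilities):
        return inf  # an observed symbol judged impossible cannot be encoded
    return -sum(log2(p) for p in probabilities)


def mdl_score(weights, probabilities):
    """Two-part MDL score: model cost plus data-given-model cost, in bits."""
    return model_bits(weights) + data_bits(probabilities)


# A small model that predicts the data reasonably well...
small = mdl_score([0, 1, 3], [0.5, 0.5, 0.25])
# ...versus a larger model that predicts the same data perfectly.
large = mdl_score([7, 7, 7, 7, 7, 7], [1.0, 1.0, 1.0])
```

Under this toy encoding the smaller model wins despite its imperfect predictions, which is the kind of trade-off an MDL objective is meant to arbitrate.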
-
On the Spontaneous Emergence of Discrete and Compositional Signals
Authors:
Nur Geffen Lan,
Emmanuel Chemla,
Shane Steinert-Threlkeld
Abstract:
We propose a general framework to study language emergence through signaling games with neural agents. Using a continuous latent space, we are able to (i) train using backpropagation, (ii) show that discrete messages nonetheless naturally emerge. We explore whether categorical perception effects follow and show that the messages are not compositional.
Submitted 30 April, 2020;
originally announced May 2020.
-
Suszko's Problem: Mixed Consequence and Compositionality
Authors:
Emmanuel Chemla,
Paul Egré
Abstract:
Suszko's problem is the problem of finding the minimal number of truth values needed to semantically characterize a syntactic consequence relation. Suszko proved that every Tarskian consequence relation can be characterized using only two truth values. Malinowski showed that this number can equal three if some of Tarski's structural constraints are relaxed. By so doing, Malinowski introduced a case of so-called mixed consequence, allowing the notion of a designated value to vary between the premises and the conclusions of an argument. In this paper we give a more systematic perspective on Suszko's problem and on mixed consequence. First, we prove general representation theorems relating structural properties of a consequence relation to their semantic interpretation, uncovering the semantic counterpart of substitution-invariance, and establishing that (intersective) mixed consequence is fundamentally the semantic counterpart of the structural property of monotonicity. We use those to derive maximum-rank results proved recently in a different setting by French and Ripley, as well as by Blasio, Marcos and Wansing, for logics with various structural properties (reflexivity, transitivity, none, or both). We strengthen these results into exact rank results for non-permeable logics (roughly, those which distinguish the role of premises and conclusions). We discuss the underlying notion of rank, and the associated reduction proposed independently by Scott and Suszko. As emphasized by Suszko, that reduction fails to preserve compositionality in general, meaning that the resulting semantics is no longer truth-functional. We propose a modification of that notion of reduction, allowing us to prove that over compact logics with what we call regular connectives, rank results are maintained even if we request the preservation of truth-functionality and additional semantic properties.
Submitted 9 February, 2019; v1 submitted 25 July, 2017;
originally announced July 2017.