EMTeC: A Corpus of Eye Movements on Machine-Generated Texts

Authors: Lena Sophia Bolliger, Patrick Haller, Isabelle Caroline Rose Cretton, David Robert Reich, Tannon Kew, Lena Ann Jäger

Abstract: The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times.
The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/.

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2406.04988 [pdf, other]

Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences

Authors: Patrick Haller, Lena S. Bolliger, Lena A. Jäger

Abstract: To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power of surprisal and entropy measures estimated from a range of language models (LMs) on data of human reading times as a measure of processing effort by incorporating information of language users' cognitive capacities. To do so, we assess the predictive power of surprisal and entropy estimated from generative LMs on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subject a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability estimates.

Submitted 2 August, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: ACL 2024 Findings

arXiv:2310.15587 [pdf, other]

ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts

Authors: Lena S. Bolliger, David R. Reich, Patrick Haller, Deborah N. Jakobi, Paul Prasse, Lena A. Jäger

Abstract: Eye movements in reading play a crucial role in psycholinguistic research studying the cognitive mechanisms underlying human language processing. More recently, the tight coupling between eye movements and cognition has also been leveraged for language-related machine learning tasks such as the interpretability, enhancement, and pre-training of language models, as well as the inference of reader- and text-specific properties. However, scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research. Initially, this problem was tackled by resorting to cognitive models for synthesizing eye movement data. However, for the sole purpose of generating human-like scanpaths, purely data-driven machine-learning-based methods have proven to be more suitable. Following recent advances in adapting diffusion processes to discrete data, we propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts. By leveraging pre-trained word representations and jointly embedding both the stimulus text and the fixation sequence, our model captures multi-modal interactions between the two inputs. We evaluate ScanDL within- and across-dataset and demonstrate that it significantly outperforms state-of-the-art scanpath generation methods. Finally, we provide an extensive psycholinguistic analysis that underlines the model's ability to exhibit human-like reading behavior. Our implementation is made available at https://github.com/DiLi-Lab/ScanDL.

Submitted 24 October, 2023; originally announced October 2023.

Comments: EMNLP 2023

arXiv:1301.4793 [pdf, other]

LMMSE Estimation and Interpolation of Continuous-Time Signals from Discrete-Time Samples Using Factor Graphs

Authors: Lukas Bolliger, Hans-Andrea Loeliger, Christian Vogel

Abstract: The factor graph approach to discrete-time linear Gaussian state space models is well developed. The paper extends this approach to continuous-time linear systems/filters that are driven by white Gaussian noise. By Gaussian message passing, we then obtain MAP/MMSE/LMMSE estimates of the input signal, or of the state, or of the output signal from noisy observations of the output signal. These estimates may be obtained with arbitrary temporal resolution. The proposed input signal estimation does not seem to have appeared in the prior Kalman filtering literature.

Submitted 21 January, 2013; originally announced January 2013.

Showing 1–4 of 4 results for author: Bolliger, L