-
Natural Language Statistical Features of LSTM-generated Texts
Authors:
Marco Lippi,
Marcelo A Montemurro,
Mirko Degli Esposti,
Giampaolo Cristadoro
Abstract:
Long Short-Term Memory (LSTM) networks have recently shown remarkable performance in several tasks dealing with natural language generation, such as image captioning or poetry composition. Yet, only few works have analyzed text generated by LSTMs in order to quantitatively evaluate to which extent such artificial texts resemble those generated by humans. We compared the statistical structure of LS…
▽ More
Long Short-Term Memory (LSTM) networks have recently shown remarkable performance in several tasks dealing with natural language generation, such as image captioning or poetry composition. Yet, only few works have analyzed text generated by LSTMs in order to quantitatively evaluate to which extent such artificial texts resemble those generated by humans. We compared the statistical structure of LSTM-generated language to that of written natural language, and to those produced by Markov models of various orders. In particular, we characterized the statistical structure of language by assessing word-frequency statistics, long-range correlations, and entropy measures. Our main finding is that while both LSTM and Markov-generated texts can exhibit features similar to real ones in their word-frequency statistics and entropy measures, LSTM-texts are shown to reproduce long-range correlations at scales comparable to those found in natural language. Moreover, for LSTM networks a temperature-like parameter controlling the generation process shows an optimal value---for which the produced texts are closest to real language---consistent across all the different statistical features investigated.
△ Less
Submitted 15 April, 2019; v1 submitted 10 April, 2018;
originally announced April 2018.
-
FPGA Hardware Acceleration of Monte Carlo Simulations for the Ising Model
Authors:
Francisco Ortega-Zamorano,
Marcelo A. Montemurro,
Sergio A. Cannas,
José M. Jerez,
Leonardo Franco
Abstract:
A two-dimensional Ising model with nearest-neighbors ferromagnetic interactions is implemented in a Field Programmable Gate Array (FPGA) board.Extensive Monte Carlo simulations were carried out using an efficient hardware representation of individual spins and a combined global-local LFSR random number generator. Consistent results regarding the descriptive properties of magnetic systems, like ene…
▽ More
A two-dimensional Ising model with nearest-neighbors ferromagnetic interactions is implemented in a Field Programmable Gate Array (FPGA) board.Extensive Monte Carlo simulations were carried out using an efficient hardware representation of individual spins and a combined global-local LFSR random number generator. Consistent results regarding the descriptive properties of magnetic systems, like energy, magnetization and susceptibility are obtained while a speed-up factor of approximately 6 times is achieved in comparison to previous FPGA-based published works and almost $10^4$ times in comparison to a standard CPU simulation. A detailed description of the logic design used is given together with a careful analysis of the quality of the random number generator used. The obtained results confirm the potential of FPGAs for analyzing the statistical mechanics of magnetic systems.
△ Less
Submitted 9 February, 2016;
originally announced February 2016.
-
Complexity and universality in the long-range order of words
Authors:
Marcelo A Montemurro,
Damián H Zanette
Abstract:
As is the case of many signals produced by complex systems, language presents a statistical structure that is balanced between order and disorder. Here we review and extend recent results from quantitative characterisations of the degree of order in linguistic sequences that give insights into two relevant aspects of language: the presence of statistical universals in word ordering, and the link b…
▽ More
As is the case of many signals produced by complex systems, language presents a statistical structure that is balanced between order and disorder. Here we review and extend recent results from quantitative characterisations of the degree of order in linguistic sequences that give insights into two relevant aspects of language: the presence of statistical universals in word ordering, and the link between semantic information and the statistical linguistic structure. We first analyse a measure of relative entropy that assesses how much the ordering of words contributes to the overall statistical structure of language. This measure presents an almost constant value close to 3.5 bits/word across several linguistic families. Then, we show that a direct application of information theory leads to an entropy measure that can quantify and extract semantic structures from linguistic samples, even without prior knowledge of the underlying language.
△ Less
Submitted 3 March, 2015;
originally announced March 2015.
-
Towards the quantification of the semantic information encoded in written language
Authors:
Marcelo A. Montemurro,
Damian Zanette
Abstract:
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the sem…
▽ More
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information is larger, are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.
△ Less
Submitted 27 July, 2009; v1 submitted 9 July, 2009;
originally announced July 2009.
-
Long-range fractal correlations in literary corpora
Authors:
Marcelo A. Montemurro,
Pedro A. Pury
Abstract:
In this paper we analyse the fractal structure of long human-language records by mapping large samples of texts onto time series. The particular mapping set up in this work is inspired on linguistic basis in the sense that is retains {\em the word} as the fundamental unit of communication. The results confirm that beyond the short-range correlations resulting from syntactic rules acting at sente…
▽ More
In this paper we analyse the fractal structure of long human-language records by mapping large samples of texts onto time series. The particular mapping set up in this work is inspired on linguistic basis in the sense that is retains {\em the word} as the fundamental unit of communication. The results confirm that beyond the short-range correlations resulting from syntactic rules acting at sentence level, long-range structures emerge in large written language samples that give rise to long-range correlations in the use of words.
△ Less
Submitted 9 January, 2002;
originally announced January 2002.
-
Entropic analysis of the role of words in literary texts
Authors:
Marcelo A. Montemurro,
Damian H. Zanette
Abstract:
Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic statistical analysis of the use of words in literary English corpora. We show that there is a quantitative relation between the role of content words in literary…
▽ More
Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic statistical analysis of the use of words in literary English corpora. We show that there is a quantitative relation between the role of content words in literary English and the Shannon information entropy defined over an appropriate probability distribution. Without assuming any previous knowledge about the syntactic structure of language, we are able to cluster certain groups of words according to their specific role in the text.
△ Less
Submitted 12 September, 2001;
originally announced September 2001.
-
Beyond the Zipf-Mandelbrot law in quantitative linguistics
Authors:
Marcelo A. Montemurro
Abstract:
In this paper the Zipf-Mandelbrot law is revisited in the context of linguistics. Despite its widespread popularity the Zipf--Mandelbrot law can only describe the statistical behaviour of a rather restricted fraction of the total number of words contained in some given corpus. In particular, we focus our attention on the important deviations that become statistically relevant as larger corpora a…
▽ More
In this paper the Zipf-Mandelbrot law is revisited in the context of linguistics. Despite its widespread popularity the Zipf--Mandelbrot law can only describe the statistical behaviour of a rather restricted fraction of the total number of words contained in some given corpus. In particular, we focus our attention on the important deviations that become statistically relevant as larger corpora are considered and that ultimately could be understood as salient features of the underlying complex process of language generation. Finally, it is shown that all the different observed regimes can be accurately encompassed within a single mathematical framework recently introduced by C. Tsallis.
△ Less
Submitted 9 July, 2001; v1 submitted 3 April, 2001;
originally announced April 2001.