-
Modeling news spread as an SIR process over temporal networks
Authors:
Elisa Mussumeci,
Flávio Codeço Coelho
Abstract:
News spread in internet media outlets can be seen as a contagious process generating temporal networks representing the influence between published articles. In this article we propose a methodology based on the application of natural language analysis of the articles to reconstruct the spread network. From the reconstructed network, we show that the dynamics of the news spread can be approximated by classical SIR epidemiological dynamics on the network. From the results obtained we argue that the proposed methodology can be used to make predictions about media repercussion, and also to detect viral memes in news streams.
Submitted 14 December, 2016;
originally announced January 2017.
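To illustrate the kind of SIR dynamics on a network that the abstract above refers to, here is a minimal sketch. It is not the authors' method: the function name `simulate_sir`, the discrete-time update rule, the parameters `beta` (per-edge transmission probability) and `gamma` (per-step recovery probability), and the three-article toy network are all assumptions made for illustration.

```python
import random


def simulate_sir(edges, seeds, beta, gamma, rng, max_steps=1000):
    """Discrete-time SIR on a directed graph given as (source, target) pairs.

    Hypothetical helper: each step, every infected node infects each
    susceptible successor with probability beta, then recovers with
    probability gamma. Returns a list of (S, I, R) counts per step.
    """
    nodes = {n for edge in edges for n in edge} | set(seeds)
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)

    state = {n: "S" for n in nodes}
    for n in seeds:
        state[n] = "I"

    history = []
    for _ in range(max_steps):
        counts = {"S": 0, "I": 0, "R": 0}
        for s in state.values():
            counts[s] += 1
        history.append((counts["S"], counts["I"], counts["R"]))
        if counts["I"] == 0:
            break  # the outbreak has died out

        # Update synchronously: read from `state`, write to `new_state`.
        new_state = dict(state)
        for n in sorted(k for k, s in state.items() if s == "I"):
            for m in successors[n]:
                if state[m] == "S" and rng.random() < beta:
                    new_state[m] = "I"
            if rng.random() < gamma:
                new_state[n] = "R"
        state = new_state
    return history


# Toy influence chain between three hypothetical articles: a -> b -> c.
edges = [("a", "b"), ("b", "c")]
history = simulate_sir(edges, ["a"], beta=1.0, gamma=1.0, rng=random.Random(0))
print(history)
```

With `beta` and `gamma` both set to 1.0 the run is deterministic and the infection simply walks down the chain; smaller values give stochastic trajectories whose (S, I, R) counts can be compared against observed publication counts.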
-
Using Artificial Intelligence to Identify State Secrets
Authors:
Renato Rocha Souza,
Flavio Codeco Coelho,
Rohan Shah,
Matthew Connelly
Abstract:
Whether officials can be trusted to protect national security information has become a matter of great public controversy, reigniting a long-standing debate about the scope and nature of official secrecy. The declassification of millions of electronic records has made it possible to analyze these issues with greater rigor and precision. Using machine-learning methods, we examined nearly a million State Department cables from the 1970s to identify features of records that are more likely to be classified, such as international negotiations, military operations, and high-level communications. Even with incomplete data, algorithms can use such features to identify 90% of classified cables with <11% false positives. But our results also show that there are longstanding problems in the identification of sensitive information. Error analysis reveals many examples of both overclassification and underclassification. This indicates both the need for research on inter-coder reliability among officials as to what constitutes classified material and the opportunity to develop recommender systems to better manage both classification and declassification.
Submitted 1 November, 2016;
originally announced November 2016.
-
Financial contagion in investment funds
Authors:
Leonardo dos Santos Pinheiro,
Flavio Codeco Coelho
Abstract:
Many new models for measuring financial contagion have been presented recently. While these models have not been specified for investment funds directly, there are many similarities that could be explored to extend the models. In this work we explore ideas developed about financial contagion to create a network of investment funds using both cross-holding of quotas and a bipartite network of funds and assets. Using data from the Brazilian asset management market we analyze not only the contagion pattern but also the structure of this network and how this model can be used to assess the stability of the market.
Submitted 8 March, 2016;
originally announced March 2016.
-
Computable Compressed Matrices
Authors:
Crysttian Arantes Paixão,
Flávio Codeço Coelho
Abstract:
The biggest cost of computing with large matrices in any modern computer is related to memory latency and bandwidth. The average latency of modern RAM reads is 150 times greater than a clock step of the processor. Throughput is a little better but still 25 times slower than the CPU can consume. The application of bitstring compression allows for larger matrices to be moved entirely to the cache memory of the computer, which has much better latency and bandwidth (average latency of L1 cache is 3 to 4 clock steps). This allows for massive performance gains as well as the ability to simulate much larger models efficiently. In this work, we propose a methodology to compress matrices in such a way that they retain their mathematical properties. Considerable compression of the data is also achieved in the process, thus allowing the computation of much larger linear problems within the same memory constraints when compared with the traditional representation of matrices.
Submitted 1 March, 2013;
originally announced March 2013.
-
PyPLN: a Distributed Platform for Natural Language Processing
Authors:
Flávio Codeço Coelho,
Renato Rocha Souza,
Álvaro Justen,
Flávio Amieiro,
Heliana Mello
Abstract:
This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of Linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other software for specific tasks as long as a Linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy-to-use web interface. In the current (beta) release, it supports English and Portuguese, with support for other languages planned for future releases. To support the Portuguese language, PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed under GPL-v3.
Submitted 19 February, 2013; v1 submitted 31 January, 2013;
originally announced January 2013.
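Two of the features listed in the abstract above, token frequency and n-gram extraction, can be sketched in a few lines of standard-library Python. This is only an illustration of the concepts, not PyPLN's API: the helper names `tokenize` and `ngrams`, the whitespace tokenizer, and the sample sentence are assumptions.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The cat saw the cat")
token_freq = Counter(tokens)              # token frequency table
bigram_freq = Counter(ngrams(tokens, 2))  # bigram frequency table

print(token_freq.most_common(2))
print(bigram_freq.most_common(1))
```

A real pipeline would replace the whitespace split with a language-aware tokenizer, since punctuation and contractions are not handled here.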