tabulapdf: An R Package to Extract Tables from PDF Documents
Authors:
Mauricio Vargas Sepúlveda,
Thomas J. Leeper,
Tom Paskhalis,
Manuel Aristarán,
Jeremy B. Merrill,
Mike Tigas
Abstract:
tabulapdf is an R package that uses the Tabula Java library to import tables from PDF files directly into R. This tool can reduce the time and effort spent on data extraction in fields like investigative journalism. It supports both automatic and manual table extraction; the latter is provided through a Shiny interface in which the user selects table areas with the mouse.
Submitted 25 August, 2024;
originally announced September 2024.
The Research Space: using the career paths of scholars to predict the evolution of the research output of individuals, institutions, and nations
Authors:
Miguel R. Guevara,
Dominik Hartmann,
Manuel Aristarán,
Marcelo Mendoza,
César A. Hidalgo
Abstract:
In recent years scholars have built maps of science by connecting the academic fields that cite each other, are cited together, or that cite a similar literature. But since scholars cannot always publish in the fields they cite, or that cite them, these science maps are only rough proxies for the potential of a scholar, organization, or country to enter a new academic field. Here we use a large dataset of scholarly publications disambiguated at the individual level to create a map of science (the research space) in which links connect pairs of fields based on the probability that an individual has published in both of them. We find that the research space is a significantly more accurate predictor of the fields that individuals and organizations will enter in the future than citation-based science maps. At the country level, however, the research space and citation-based science maps are equally accurate. These findings show that data on career trajectories (the set of fields that individuals have previously published in) provide more accurate predictors of future research output for more focalized units, such as individuals or organizations, than citation-based science maps.
Submitted 14 April, 2016; v1 submitted 26 February, 2016;
originally announced February 2016.
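The abstract above describes links between fields based on the probability that an individual has published in both. A minimal Python sketch of one way such a link weight could be computed follows. The abstract does not give the exact formula, so the weight used here (the smaller of the two conditional probabilities, i.e. co-occurrences divided by the larger field count) is an assumption, and the function name `field_proximity` and the toy `careers` data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def field_proximity(careers):
    """Return a link weight for every pair of fields that share an author.

    careers: dict mapping an author id to the set of fields that author
    has published in (hypothetical input format).

    Assumed weight: min(P(a | b), P(b | a)), which equals the number of
    authors active in both fields divided by the larger of the two
    per-field author counts. The source abstract does not specify this.
    """
    field_counts = Counter()
    pair_counts = Counter()
    for fields in careers.values():
        unique = sorted(set(fields))  # sorted so each pair has one key order
        field_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    return {
        (a, b): both / max(field_counts[a], field_counts[b])
        for (a, b), both in pair_counts.items()
    }


# Toy data, invented purely for illustration.
careers = {
    "author_1": {"physics", "mathematics"},
    "author_2": {"physics", "mathematics", "computer science"},
    "author_3": {"computer science", "mathematics"},
    "author_4": {"physics"},
}
links = field_proximity(careers)
```

Pair keys are stored in alphabetical order, so a lookup such as `links[("mathematics", "physics")]` works while the reversed tuple does not.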