A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

Tekumalla, Ramya; Banda, Juan M.

arXiv:2003.13900 (cs)

[Submitted on 31 Mar 2020]

Abstract: Deep learning models are increasingly popular for natural language processing (NLP) tasks in pharmacovigilance, in particular the identification of adverse drug reactions (ADRs), which creates a need for large-scale social media datasets aimed at these tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then annotate them manually; these approaches do not scale as more data keeps flowing into Twitter. In this work we repurpose a publicly available archived dataset of more than 9.4 billion tweets to create a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance with machine learning methods. The end result is a publicly available dataset of 1,181,993 tweets. We provide all code, a detailed procedure for extracting this dataset, and the selected tweet IDs for researchers to use.
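The filtering step described in the abstract can be pictured with a minimal sketch. This is not the authors' code: the lexicon `DRUG_TERMS`, the function `filter_drug_tweets`, and the tweet dictionary layout (`id`, `text`) are all hypothetical names chosen for illustration, and the abstract does not say which term list or matching rule the paper uses. The sketch assumes simple case-insensitive whole-word matching against a small drug lexicon.

```python
import re

# Hypothetical drug lexicon; the abstract does not specify the actual term list.
DRUG_TERMS = ["ibuprofen", "metformin", "prozac"]

# One case-insensitive pattern with word boundaries, so a term only matches
# as a whole word and not inside a longer token.
DRUG_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DRUG_TERMS) + r")\b",
    re.IGNORECASE,
)


def filter_drug_tweets(tweets):
    """Return (tweet_id, matched_terms) for tweets that mention a lexicon term.

    Assumes each tweet is a dict with "id" and "text" keys (an illustrative
    layout, not the archive's real schema).
    """
    kept = []
    for tweet in tweets:
        # Deduplicate and lowercase the matches, then sort for stable output.
        matches = sorted({m.lower() for m in DRUG_PATTERN.findall(tweet["text"])})
        if matches:
            kept.append((tweet["id"], matches))
    return kept


sample = [
    {"id": 1, "text": "Took Ibuprofen for my headache, now my stomach hurts"},
    {"id": 2, "text": "Great weather today"},
    {"id": 3, "text": "metformin and prozac together?"},
]
result = filter_drug_tweets(sample)
print(result)
```

Only the tweet IDs and matched terms are kept here, mirroring the abstract's statement that the released dataset consists of selected tweet IDs. The machine learning relevance check mentioned in the abstract would be a separate step and is not shown.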

Comments:	8 tables, 2 figures, 7 pages, accepted after peer review as a workshop paper at the ACM Conference on Health, Inference, and Learning (CHIL) 2020
Subjects:	Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Cite as:	arXiv:2003.13900 [cs.IR]
	(or arXiv:2003.13900v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2003.13900

Submission history

From: Juan Banda
[v1] Tue, 31 Mar 2020 01:30:24 UTC (394 KB)
