The impact of imbalanced training data on machine learning for author name disambiguation

Kim, Jinseok; Kim, Jenna

doi:10.1007/s11192-018-2865-9

arXiv:1808.00525 (cs)

[Submitted on 30 Jul 2018 (v1), last revised 3 Aug 2018 (this version, v2)]

Title:The impact of imbalanced training data on machine learning for author name disambiguation

Authors:Jinseok Kim, Jenna Kim

Abstract: In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative ratios. Results show that increasing the amount of negative training data can improve disambiguation performance, but only by a few percent, and sometimes degrades it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even at a base ratio (1:1) of positive to negative training data, while the performance improvement of Random Forest tends to saturate quickly after roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on a subset of the negative training data without much loss in disambiguation performance while gaining computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
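To make the positive-to-negative ratio concrete, here is a minimal sketch of how such ratios might be constructed. It assumes a pairwise setup, in which each pair of records is labeled positive when both belong to the same author and negative otherwise; the abstract does not spell out this detail, so it is an assumption. The function names `make_pairs` and `downsample_negatives`, the toy records, and the fixed seed are all hypothetical, and random undersampling is only one possible way to reduce the negative set.

```python
import random
from itertools import combinations


def make_pairs(records):
    """Split all record pairs into positives (same author) and negatives.

    `records` is a list of (record_id, author_id) tuples. The pairwise
    labeling scheme is an assumption, not taken from the paper.
    """
    positives, negatives = [], []
    for (id_a, author_a), (id_b, author_b) in combinations(records, 2):
        pair = (id_a, id_b)
        if author_a == author_b:
            positives.append(pair)
        else:
            negatives.append(pair)
    return positives, negatives


def downsample_negatives(positives, negatives, ratio, seed=0):
    """Randomly keep at most `ratio` negatives per positive (1:ratio)."""
    target = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(negatives, target)


# Toy data (hypothetical): 4 authors with 3 records each.
records = [(f"r{a}{i}", f"author{a}") for a in range(4) for i in range(3)]
positives, negatives = make_pairs(records)

for ratio in (1, 2, 10):
    kept = downsample_negatives(positives, negatives, ratio)
    print(f"1:{ratio} -> {len(positives)} positives, {len(kept)} negatives")
```

Even in this toy case the 12 records yield 66 pairs, of which only 12 are positive, and the imbalance grows quickly as more distinct authors share a name block. A classifier could then be trained on `positives` plus each downsampled negative set to compare performance across ratios.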

Comments:	17 pages, 3 figures, and 3 tables
Subjects:	Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1808.00525 [cs.IR]
	(or arXiv:1808.00525v2 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1808.00525
Journal reference:	Kim, J. & Kim, J. (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics
Related DOI:	https://doi.org/10.1007/s11192-018-2865-9

Submission history

From: Jinseok Kim
[v1] Mon, 30 Jul 2018 14:29:27 UTC (1,204 KB)
[v2] Fri, 3 Aug 2018 02:59:12 UTC (1,206 KB)
