DBLP-derived labeled data for author name disambiguation

2018-07-19T15:20:54Z (GMT) by Jinseok Kim
This is a DBLP-derived labeled dataset originally created by Dr. C. Lee Giles at Penn State University, then filtered to remove duplicates and correct errors by Dr. Jinseok Kim at the University of Michigan. For more details, see the references below.

1. Kim, Jinseok (2018). Evaluating author name disambiguation for digital libraries: a case of DBLP. Scientometrics. doi:10.1007/s11192-018-2824-5

2. Kim, Jinseok & Kim, Jenna (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics. doi:10.1007/s11192-018-2865-9

Each row represents one author name instance, with the following fields separated by tabs.

author name: full name string extracted from DBLP
unique author id: labels assigned manually by Dr. C. Lee Giles's team
paper id: assigned by Dr. Jinseok Kim
author list: names of authors in the byline of the paper
year: publication year
venue: conference or journal names
title: paper title, with stopwords removed and the remaining words stemmed with the Porter stemmer
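
As a rough illustration of how rows in this layout might be read, here is a minimal Python sketch. It assumes that the columns appear in the same order as the field list above and that the file has no header row; neither is confirmed by this description. The names FIELDS, read_rows and group_by_author are hypothetical helpers introduced only for this example. The author list is kept as a raw string because its internal delimiter is not stated here.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Hypothetical field names, in the order the fields are listed above.
# That the file's columns follow this exact order is an assumption.
FIELDS = [
    "author_name",
    "author_id",
    "paper_id",
    "author_list",
    "year",
    "venue",
    "title",
]


def read_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-separated row; values are kept as strings."""
    # QUOTE_NONE treats quote characters in names or titles as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(FIELDS):
            raise ValueError(
                f"row {number}: expected {len(FIELDS)} fields, got {len(row)}"
            )
        yield dict(zip(FIELDS, row))


def group_by_author(rows: Iterable[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group name instances by their manually assigned author id."""
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        groups[row["author_id"]].append(row)
    return dict(groups)
```

To use it on a local copy, open the file with open(path, encoding="utf-8", newline="") and pass the file object to read_rows; the path and the encoding are likewise assumptions. Grouping by author id gives the ground-truth clusters against which a disambiguation result can be compared.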

If you use this dataset, please consider citing the papers below.

For the original dataset: Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two Supervised Learning Approaches for Name Disambiguation in Author Citations. JCDL 2004: Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries, 296-305. doi:10.1145/996350.996419

For the filtered dataset: 1. Kim, Jinseok (2018). Evaluating author name disambiguation for digital libraries: a case of DBLP. Scientometrics. doi:10.1007/s11192-018-2824-5

or

2. Kim, Jinseok & Kim, Jenna (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics. doi:10.1007/s11192-018-2865-9