DBLP-derived labeled data for author name disambiguation

dataset

posted on 2018-07-19, 15:20 authored by Jinseok Kim

This is a DBLP-derived labeled dataset originally created by Dr. C. Lee Giles at Penn State University and subsequently filtered by Dr. Jinseok Kim at the University of Michigan to remove duplicates and correct errors. For more details, see the references below.

1. Kim, Jinseok (2018). Evaluating author name disambiguation for digital libraries: a case of DBLP. Scientometrics. doi:10.1007/s11192-018-2824-5

2. Kim, Jinseok & Kim, Jenna (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics. doi:10.1007/s11192-018-2865-9

Each row represents one author name instance, with the following fields separated by tabs:

author name: full name string extracted from DBLP

unique author id: labels assigned manually by Dr. C. Lee Giles's team

paper id: assigned by Dr. Jinseok Kim

author list: names of authors in the byline of the paper

year: publication year

venue: conference or journal name

title: paper title, with stopwords removed and the remaining words stemmed by the Porter stemmer
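
As a rough illustration of the row layout described above, the following Python sketch parses tab-separated rows and groups paper ids by unique author id. It rests on several assumptions that are not stated in the dataset description: that the columns appear in the order listed above, that there is no header row, and that the column labels in FIELDS (our own names) are acceptable stand-ins. The sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Column labels are our own; the order is ASSUMED to follow the field list above.
FIELDS = ["author_name", "author_id", "paper_id", "author_list", "year", "venue", "title"]


def read_instances(handle):
    """Yield one dict per author name instance from a tab-separated text stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip malformed rows rather than guess at their layout
        yield dict(zip(FIELDS, row))


def group_by_author(instances):
    """Map each unique author id to the paper ids labeled with it."""
    clusters = defaultdict(list)
    for inst in instances:
        clusters[inst["author_id"]].append(inst["paper_id"])
    return dict(clusters)


# Made-up sample rows for illustration only; NOT taken from the dataset.
sample = (
    "Jane Doe\tA1\tP1\tJane Doe|John Roe\t2001\tJCDL\tname disambigu\n"
    "J. Doe\tA1\tP2\tJ. Doe\t2003\tScientometrics\tauthor cit\n"
    "Jane Doe\tA2\tP3\tJane Doe\t2005\tSIGIR\tweb search\n"
)
clusters = group_by_author(read_instances(io.StringIO(sample)))
print(clusters)
```

To read the actual file, the same functions could be given an open file handle instead of the in-memory sample. The file name and text encoding are not given in the description, so both would need to be checked against the downloaded file. The author list is kept as a single string here because its internal delimiter is not documented above.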

If you use this dataset, please consider citing the papers below.

For the original dataset: Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two Supervised Learning Approaches for Name Disambiguation in Author Citations. JCDL 2004: Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries, 296-305. doi:10.1145/996350.996419

For the filtered dataset: 1. Kim, Jinseok (2018). Evaluating author name disambiguation for digital libraries: a case of DBLP. Scientometrics. doi:10.1007/s11192-018-2824-5

2. Kim, Jinseok & Kim, Jenna (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics. doi:10.1007/s11192-018-2865-9

Keywords

labeled data; training data; author name disambiguation; Information Retrieval and Web Search; Library and Information Studies; Natural Language Processing

Licence

CC BY 4.0
