Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale


Posted on 2021-02-14, 00:52. Authored by Jinseok Kim and Jason Owen-Smith.

This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in press).

1. AUT_ORC.zip: this contains a list of 3M author name instances in MEDLINE linked to ORCID iDs.

2. AUT_NIH.zip: this contains a list of 313K author name instances in MEDLINE linked to NIH PI ID.

3. AUT_SCT_pairs.zip: this contains a list of 6.2M paper pairs and author byline positions in self-citation relation.

4. AUT-SCT_info.zip: this contains a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs. Each author name instance in AUT_SCT_pairs can be matched to its record in AUT-SCT_info using the combination of PMID and byline position as a key.
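The key-based connection between the pairs file and the info file can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' own procedure. The column names (pmid, byline_position, author_name, pmid_a, position_a, pmid_b, position_b) and the toy rows are assumptions made for the example; the actual files may use different headers and delimiters, so check them before adapting it.

```python
import csv
import io

# Toy stand-ins for the two files. All column names and rows here are
# hypothetical; replace them with the headers found in the real data.
INFO_CSV = """pmid,byline_position,author_name
1001,1,Kim J
1001,2,Lee S
2002,3,Kim J
"""

PAIRS_CSV = """pmid_a,position_a,pmid_b,position_b
1001,1,2002,3
"""


def load_info(handle):
    """Index info rows by the composite key (PMID, byline position)."""
    return {
        (row["pmid"], row["byline_position"]): row
        for row in csv.DictReader(handle)
    }


def join_pairs(pairs_handle, info_index):
    """Yield, for each pair, the info rows of both author name instances."""
    for row in csv.DictReader(pairs_handle):
        key_a = (row["pmid_a"], row["position_a"])
        key_b = (row["pmid_b"], row["position_b"])
        # .get() returns None when a key has no matching info row.
        yield info_index.get(key_a), info_index.get(key_b)


info_index = load_info(io.StringIO(INFO_CSV))
joined = list(join_pairs(io.StringIO(PAIRS_CSV), info_index))
for left, right in joined:
    print(left["author_name"], "<->", right["author_name"])
```

With the real files, the io.StringIO objects would be replaced by file handles opened on the extracted contents of the two archives. A dictionary keyed on the (PMID, byline position) tuple keeps each lookup constant-time, which matters when millions of pairs are resolved.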

Please see the paper for details on how the datasets were created.

Kim, J., & Owen-Smith, J. (in press). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

The uploaded datasets were created by combining the data sources listed below.

1. ORCID data were downloaded from the link below for the 2018 version.

Please refer to the policies on the use of ORCID data.

https://info.orcid.org/public-data-file-use-policy/

2. MEDLINE baseline data were downloaded from the link below for the 2016 version.

Please refer to the policies on the use of MEDLINE data.

https://www.nlm.nih.gov/databases/download/pubmed_medline.html

3. Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below.

Please refer to the policies on the use of those datasets.

https://databank.illinois.edu/datasets/IDB-9087546

Please cite the three papers below to give proper credit to the creators of the original datasets.

Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

Torvik, V. I., & Agarwal, S. (2016). Ethnea: An instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

4. The dataset of NIH ID linked to Author-ity2009 was downloaded from the link below.

https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

Please cite the paper below to give proper credit to the creators of the original dataset.

Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731