Identifying seminal works most important for research fields: Software for the Reference Publication Year Spectroscopy (RPYS)

Reference Publication Year Spectroscopy (RPYS) was proposed by Marx, Bornmann, Barth, and Leydesdorff (2014, [18]) to identify the seminal publications in a research field, that is, those most important in a historical context. We refined our RPYS toolbox by adding features to the existing programs and by developing two new routines. First, the results of different RPYSs can now be compared directly, because the software standardizes the results by transforming them into percentiles. Second, we added routines that help the user retrieve the most-cited publications in the specific years indicated by peaks in the spectrograms. Cited references can be aggregated across misspellings and variants. In this paper, two examples from the humanities and the natural sciences demonstrate the functionalities and results of the programs. A more technical description of the usage of the programs can be found at http://www.leydesdorff.net/software/rpys/.


L. Bornmann, A. Thor, W. Marx and L. Leydesdorff

Introduction
In the current atmosphere of research evaluation at all levels of scientific activity (single researchers, research groups, institutions, and countries), bibliometrics can be considered an instrument for evaluation purposes (Pendlebury, 2008, [20]). For example, bibliometric data is one of the most important data sources for university rankings: the Leiden Ranking (Waltman et al., 2012, [23]) is based solely on bibliometric data, and this data plays an important role in the Academic Ranking of World Universities (http://www.shanghairanking.com) and the Times Higher Education World University Rankings (http://www.timeshighereducation.co.uk/world-university-rankings). The focus on evaluation in the use of bibliometrics may obscure other possible uses of this data. A good example of another use is the VOSviewer tool (http://www.vosviewer.com), which allows visualizations of scientific activities beyond evaluation (citation relations among scientific fields, university profiles and collaborations, co-citations of journals, etc.).
In this study, we present significantly improved software for another tool, which can be used to examine the seminal works, or historical roots, of scientific fields. The method for examining historical roots was developed on the basis of a proposal by Bornmann and Marx (2013, [3]) to conduct cited reference analysis (Garfield, Pudovkin, & Istomin, 2003, [6]; Garfield, Sher, & Torpie, 1964, [7]; Leydesdorff, 2010, [13]). Bornmann and Marx (2013, [3]) argue for broadening the perspective in bibliometrics by complementing the (standard) times-cited analysis with cited reference analyses using field-specific impact measurements. For such a cited reference analysis, they propose to extract all cited references from a field-specific publication set and to analyze which papers (scientists or journals) have been cited most often and in which years.
A specific application of cited reference analysis is Reference Publication Year Spectroscopy (RPYS), which was introduced by Marx et al. (2014, [18]): "RPYS is based on the analysis of the frequency with which references are cited in the publications of a specific research field in terms of the publication years of these cited references. The origins show up in the form of more or less pronounced peaks mostly caused by individual publications that are cited particularly frequently" (p. 751).
Recently, RPYS has been used to examine the historical roots of several research fields. Marx et al. (2014, [18]) used research on graphene and on solar cells to illustrate how RPYS functions. Leydesdorff, Bornmann, Marx, and Milojevic (2014, [15]) investigated the historical origins of iMetrics (information metrics, bibliometrics, and scientometrics) in the scholarly literature. For example, they found that Lotka (1926, [16]) can be considered the first source, but that the intellectual program of iMetrics was shaped especially in the early 1960s. Whereas Barth, Marx, Bornmann, and Mutz (2014, [1]) examined the origins of Higgs boson research and combined RPYS with a segmented regression analysis, Wray and Bornmann (2014, [24]) took a closer look at the roots of the philosophy of science. As the results of Marx and Bornmann (2014, [17]) show, RPYS can be applied not only to identify origins but also to reveal scientific legends: "Charles Darwin, the originator of evolutionary theory, was given credit for finches he did not see and for observations and insights about the finches he never made" (p. 839). The analysis validated bibliometrically the known fact that a book from 1947 is the origin of the term "Darwin finches" (Lack, 1947, [12]). Comins and Hussey (2015, [4]) used RPYS to investigate the impact of the Viterbi algorithm, first published by Andrew Viterbi in 1967. They extended the RPYS method with heat maps in order to compare the results of different RPYSs. Such a comparison is only possible when the results are standardized. In this paper, we present new RPYS software that uses percentiles for this standardization. In our opinion, this transformation into percentiles improves on the rank transformation proposed by Comins and Hussey (2015, [4]). Furthermore, we add routines that help the user retrieve the most-cited papers in the specific years indicated by peaks in the spectrogram. Cited references can be aggregated across misspellings and variants.

Datasets used
As the first example in this study, we use the dataset of Wray and Bornmann (2014, [24]), who investigated the origins of the philosophy of science field. Their study is based on papers published in the journals Philosophy of Science, British Journal for the Philosophy of Science, Studies in History and Philosophy of Science, and Erkenntnis (n=8,757 records). Since the data comes from four journals, it can be used to compare the impact that papers important for the field have in each of these journals. For example, it can be revealed whether Thomas Kuhn's Structure of Scientific Revolutions (Kuhn, 1962, [11]) has the same citation impact in all four journals. The relevant numbers of papers and cited references are provided in Table 1.
A second example deals with four research topics within the field of the natural sciences. The analysis is based on papers (n=18,451 records) investigating light scattering at small particles of four different materials: atmospheric aerosols, cosmic dust, colloids, and nanoparticles. Here, the data can be used to compare the impact of publications shared among these research topics and thus indicated as most important at the level of the field. The corresponding numbers of papers and cited references are provided in Table 2.
The searches were undertaken in the Web of Science (WoS) in October 2013 (first example) and March 2015 (second example).

Software
The programs rpys.exe, yearcr.exe, and RefMatchCluster.jar can be used to generate a RPYS of any document set downloaded from WoS. The procedure for using the routines is described in detail at http://www.leydesdorff.net/software/rpys/.

First example
The starting point of a RPYS is the publication set representing a specific research field. In this study we use as a first example four journals in the philosophy of science. The cited references of the publications in these sets are analyzed in terms of how often they have been cited. The results of the RPYS analyses for each journal are shown in Figure 1.
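The counting step behind such a spectrogram can be illustrated with a minimal Python sketch. It is not the code of rpys.exe; it assumes that each cited reference is a string of the form "author, year, source, ..." (as in the examples quoted in this paper), and the function names `reference_year` and `count_by_year` are hypothetical.

```python
import re
from collections import Counter


def reference_year(cited_reference):
    """Return the publication year of a cited-reference string, or None.

    Assumption: the string follows the layout "author, year, source, ..."
    with a four-digit year as its second comma-separated field.
    """
    fields = [field.strip() for field in cited_reference.split(",")]
    if len(fields) > 1 and re.fullmatch(r"\d{4}", fields[1]):
        return int(fields[1])
    return None


def count_by_year(cited_references):
    """Count cited references per referenced publication year (RPY)."""
    counts = Counter()
    for reference in cited_references:
        year = reference_year(reference)
        if year is not None:  # references without a usable year are skipped
            counts[year] += 1
    return dict(sorted(counts.items()))


refs = [
    "kuhn ts, 1970, structure sci revolu, p102",
    "kuhn ts, 1970, structure sci revolu, p115",
    "popper k., 1959, logic sci discovery",
    "einstein a, 1905, ann physberlin, v17, p891",
    "anonymous, no year given",  # made-up entry without a year
]
spectrum = count_by_year(refs)
print(spectrum)
```

Plotting the resulting counts against the years would give the raw curve of a spectrogram for this toy input.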
Each graph in the figure visualizes the number of cited references per referenced publication year (blue line "RPYS") in a journal. In order to identify those publication years with significantly more cited references than other years, the deviation of the number of cited references in each year from the median of the number of cited references in the two previous, the current, and the two following years (t−2, t−1, t, t+1, t+2) is visualized, too (brown line "Median"). This deviation from the five-year median provides a curve smoother than the one in terms of absolute numbers. For the complete philosophy of science publication set, Wray and Bornmann (2014, [24]) identified the highest peaks in 1905, 1949, 1950, 1957, 1962, 1965, and 1970. As the results in Figure 1 show, these peaks are not equally visible in the graphs of the different journals. For example, the 1905 peak is visible in the British Journal for the Philosophy of Science and Studies in History and Philosophy of Science, but not in Erkenntnis and Philosophy of Science. In other words, visual inspection is not a satisfactory method for comparing the results for the four journals. For this reason, Comins and Hussey (2015, [4]) proposed a rank-transformation procedure: "To overcome the difficulty in making cross RPYS comparisons, we rank-transformed our results. To rank transform data, one takes the complete set of n values from a given RPYS analyses, say X1, X2, …, Xn. These observations are then sorted by the magnitude of the observed values. These values are then substituted in such a manner that the largest value takes the value n, the second largest take the value n-1, and so on." Using this rank-transformation procedure, however, a comparison is no longer possible when the RPYSs cover different numbers of years.
For this reason, we recommend transforming the results into percentiles ([2]). We use the method proposed by Hazen (1914, p. 1550), because it accounts for the uncertainty in small sets (Leydesdorff, 2012, [14]) and is widely used today. The formula for the transformation is (i − 0.5)/n × 100, where i is the rank (all years are ranked in decreasing order by their number of cited references) and n is the total number of years. The resulting quantiles are comparable across results produced on the basis of different numbers of referenced publication years. The quantile values are provided in an additional column of the file "median.dbf" generated by "rpys.exe".
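The two transformations described above, the deviation from the five-year median and the Hazen percentile, can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation in rpys.exe: years missing from the input are counted as zero, tied years share the mean of their ranks, and rank 1 is given to the year with the fewest cited references so that heavily cited years receive high percentiles (the direction implied by the heat map, where darker cells mean higher quantiles). The function names and the toy counts are hypothetical.

```python
from statistics import median


def median_deviation(counts):
    """Deviation of each year's count from the median of t-2 .. t+2.

    `counts` maps year -> number of cited references. Years missing from
    the mapping count as 0 (assumption; edge handling is not specified).
    """
    return {
        t: counts[t] - median(counts.get(t + d, 0) for d in range(-2, 3))
        for t in counts
    }


def hazen_percentiles(counts):
    """Hazen percentile (i - 0.5) / n * 100 for every year.

    Assumptions: rank i = 1 goes to the year with the fewest cited
    references, and tied years share the mean of their ranks.
    """
    years = sorted(counts, key=lambda year: counts[year])
    n = len(years)
    ranks = {}
    pos = 0
    while pos < n:
        end = pos
        # extend the run over all years tied with years[pos]
        while end + 1 < n and counts[years[end + 1]] == counts[years[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # ranks are 1-based
        for year in years[pos:end + 1]:
            ranks[year] = mean_rank
        pos = end + 1
    return {year: (ranks[year] - 0.5) * 100 / n for year in sorted(counts)}


# Made-up counts with a single pronounced peak in 1905.
toy_counts = {1903: 2, 1904: 3, 1905: 40, 1906: 4, 1907: 2}
deviations = median_deviation(toy_counts)
percentiles = hazen_percentiles(toy_counts)
print(deviations)
print(percentiles)
```

Because the percentile is divided by n, the top year of a 5-year series (90) and the top year of a 50-year series (99) both fall near the upper end of the same 0 to 100 scale, whereas their raw ranks (5 and 50) would not be comparable.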
Comins and Hussey (2015, [4]) proposed to visualize a comparison of RPYS using heatmaps. We followed their recommendation and produced Figure 2 based on quantiles. The higher the quantile for a certain journal and publication year, respectively, the darker is the corresponding cell. For example, the results shown in Figure 2 validate the results of visual inspection of Figure 1: the 1905 peak is pronounced for the British Journal for the Philosophy of Science and Studies in History and Philosophy of Science, but not for Erkenntnis and Philosophy of Science.
After the most important years for a research field have been identified, the most important publications in these years can be identified in a second step of the analysis. These publications are often the ones that produce the peaks. Especially in early publication years, single publications may produce pronounced peaks. Two programs can be used to identify the most important publications: (1) The program "yearcr.exe" produces "yearcr.dbf", in which the cited references are listed for specific years (column "RPY", which contains the publication years of the cited references) by their number of occurrences (column "N_CR"). Two further columns give a cited reference's number of occurrences as a percentage of the total number of occurrences of all cited references within that year (column "PERC_YR") and across all years (column "PERC_ALL"). (2) Since many cited references appear with variants in the list of cited references in "yearcr.dbf", we wrote the program RefMatchCluster.jar in Java. This program is able to identify, unify, and aggregate cited-reference data.
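How the three columns relate to each other can be shown with a small Python sketch. It mirrors the column names N_CR, PERC_YR, and PERC_ALL from the description above but is not the code of yearcr.exe; the function name `yearly_shares`, the input layout (one (RPY, reference) pair per occurrence), and the counts are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def yearly_shares(occurrences):
    """Return rows (RPY, reference, N_CR, PERC_YR, PERC_ALL).

    `occurrences` holds one (rpy, reference) pair per cited-reference
    occurrence (hypothetical input layout).
    """
    n_cr = Counter(occurrences)
    per_year = defaultdict(int)
    for (rpy, _reference), n in n_cr.items():
        per_year[rpy] += n
    total = sum(n_cr.values())
    rows = [
        (rpy, reference, n, 100 * n / per_year[rpy], 100 * n / total)
        for (rpy, reference), n in n_cr.items()
    ]
    # sort by year, then by descending number of occurrences
    rows.sort(key=lambda row: (row[0], -row[2], row[1]))
    return rows


# Made-up occurrences; only the reference strings are taken from the paper.
occurrences = (
    [(1905, "einstein a, 1905, ann physberlin, v17, p891")] * 3
    + [(1905, "other a, 1905, j x")]  # hypothetical second 1905 reference
    + [(1916, "henning h, 1916, z psychologie, v73")]
)
rows = yearly_shares(occurrences)
for row in rows:
    print(row)
```

In this toy input the single 1916 reference obtains a PERC_YR of 100 although it occurs only once, which is the small-numbers effect to keep in mind when reading within-year percentages.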
Using these two programs, one can identify the most frequently cited references in the British Journal for the Philosophy of Science, Studies in History and Philosophy of Science, Erkenntnis, and Philosophy of Science. As a first step, "yearcr.exe" is used to produce "yearcr.dbf" with aggregated cited-reference information for the RPYs 1900 to 1970. As the alphabetical sorting of the cited references shows, many cited references appear with several variants in "yearcr.dbf". For example, "kuhn ts, 1970, structure sci revolu, p102" and "kuhn ts, 1970, structure sci revolu, p115" are listed among the cited references in the dataset. Both references, however, refer to the same cited publication and should be counted jointly.
One can use RefMatchCluster.jar in a second step of the analysis to detect the variants of the same cited reference, cluster them, and aggregate their occurrences (number of cited references). Note that the program does not merge occurrences across publication years. Thus, a reference to "kuhn ts, 1962" (the first edition of The Structure of Scientific Revolutions) and a reference to a later edition are not merged automatically. However, the user can make these adjustments manually and then run the program again. The program needs some arguments. For the results presented in the following, the Levenshtein similarity function is used to determine the similarity value between character strings in a 0 to 1 range (here we used 0.75 as a threshold) for the journal/book title and author name fields in order to produce the unified and aggregated cited-references data. In our experience, a similarity value of 0.75 produces useful aggregated results. The results of the unification and aggregation process based on the philosophy of science journals' datasets are shown in Table 3: for each journal, the most cited publications are shown over all publication years, within a publication year, and in 1905.
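The idea of threshold-based matching can be sketched in Python as follows. This is not the algorithm of RefMatchCluster.jar, whose internals are not described here: the normalization of the Levenshtein distance (1 minus the distance divided by the length of the longer string), the greedy clustering, and the names `similarity`, `parse`, and `aggregate` are assumptions. As in the description above, variants are only merged within the same publication year.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def parse(reference):
    """Split "author, year, source, ..." into (author, year, source)."""
    fields = [field.strip() for field in reference.split(",")] + ["", "", ""]
    return fields[0], fields[1], fields[2]


def aggregate(counts, threshold=0.75):
    """Greedily merge variants that share the year and are similar enough
    in both author and source; return {representative: summed count}."""
    clusters = []  # each entry: [representative reference, total count]
    for reference, n in sorted(counts.items(), key=lambda item: -item[1]):
        author, year, source = parse(reference)
        for cluster in clusters:
            c_author, c_year, c_source = parse(cluster[0])
            if (year == c_year
                    and similarity(author, c_author) >= threshold
                    and similarity(source, c_source) >= threshold):
                cluster[1] += n
                break
        else:  # no existing cluster matched
            clusters.append([reference, n])
    return {reference: n for reference, n in clusters}


# Made-up counts; the third variant is a hypothetical misspelling.
variant_counts = {
    "kuhn ts, 1970, structure sci revolu, p102": 4,
    "kuhn ts, 1970, structure sci revolu, p115": 3,
    "kuhn t s, 1970, structure sci revol": 1,
    "kuhn ts, 1962, structure sci revolu": 5,
    "popper k., 1959, logic sci discovery": 6,
}
merged = aggregate(variant_counts)
print(merged)
```

In this toy run the three 1970 variants collapse into one entry, while the 1962 edition stays separate because the publication years differ.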
For example, the reference "popper k., 1959, logic sci discovery" accounts for 0.8% of all cited references in the papers published in the British Journal for the Philosophy of Science. If percentages within single publication years are calculated, "einstein a, 1905, ann physberlin, v17, p891" is the most frequently cited reference with 38%.
In Table 3, it is interesting to note that the journals differ in terms of their most frequently cited publications: for example, publications of Karl Popper seem to be important for authors in the British Journal for the Philosophy of Science, whereas publications of Thomas Kuhn seem to be especially relevant for authors in Studies in History and Philosophy of Science. However, there is one paper published by Albert Einstein which is prominently cited in three of the four journals (British Journal for the Philosophy of Science, Philosophy of Science, and Studies in History and Philosophy of Science). The title of this paper, which introduced Einstein's Special Theory of Relativity (STR), is as follows: "Zur Elektrodynamik bewegter Körper" [On the electrodynamics of moving bodies] (Einstein, 1905, [9]).
In the interpretation of the percentages for the journal Erkenntnis one should consider that the cited reference counts can be low. With a total of 15,973 cited references, this journal has the lowest number of cited references compared to the other journals (see Table 1). In case of low numbers of cited references, publications with one or only a few cited references may contribute greatly within a single publication year. For example, with only one single citation the cited reference "henning h, 1916, z psychologie, v73" achieved a 100% share in 1916 (see Table 3).

Second example
As a second example to demonstrate the utility of our software, we analyze four research topics within the field of the natural sciences: light scattering at small particles of different materials (atmospheric aerosols, cosmic dust, colloids, nanoparticles). Again, the cited references of the publications in the sets of these research topics have been analyzed in terms of how often they have been cited. The results of the RPYS analysis are shown in Figure 3.
As in our first example, there are some common reference publication year peaks when comparing the four cases, in particular among earlier publication years. The quantiles of the numbers of cited references for the four natural sciences research topics are shown in Figure 4. Figure 3 and Figure 4 reveal a distinct peak for the reference publication year 1908, which can be assigned to a paper by the German physicist Gustav Mie (1908, [19]). This paper deals with the scattering of electromagnetic waves by small particles and was published in Annalen der Physik under the title "Beiträge zur Optik trüber Medien, speziell kolloidaler Metallösungen" [Contributions to the optics of turbid media, particularly of colloidal metal solutions]. Mie (1908, [19]) applied Maxwell's electromagnetic theory to spherical gold colloids and provided a theoretical treatment of this phenomenon. The eponym "Mie scattering" is still in use: Mie scattering occurs when the diameters of particles are similar to the wavelengths of the scattered light. For particles much larger or much smaller, there are approximations. But for objects with sizes similar to the wavelength, e.g., aerosols or water droplets in the atmosphere, cosmic dust, colloids, or nanoparticles, a more exact approach is needed. In contrast to Rayleigh scattering, Mie scattering is largely independent of the wavelength of the scattered light. This kind of light scattering occurs, for example, in the lower parts of the atmosphere and explains why clouds and the sky near the horizon appear white (and not blue like the upper part of the sky, which is subject to Rayleigh scattering). Mie's (1908, [19]) paper was hardly cited before 1950 but has since received more than 6,300 citations in total.
Other important early reference publication years for the four research topics analyzed here are 1904 and 1941. The peaks corresponding to the reference publication year 1904 in Figure 3 and Figure 4 can be attributed to a paper by Garnett (1904, [8]) entitled "Colours in metal glasses and in metallic films", which appeared in the Philosophical Transactions of the Royal Society of London A. The peaks corresponding to the reference publication year 1941 can be assigned to a paper by Henyey and Greenstein (1941, [10]) entitled "Diffuse radiation in the galaxy". As in the first example of our analysis, the peaks corresponding to these most frequently referenced early publication years are not equally visible in the four research topics. For example, the paper by Garnett (1904, [8]) is less important for the colloids research topic, and the paper by Henyey and Greenstein (1941, [10]) is particularly important for the cosmic dust (astrophysics) research topic, which is not surprising considering its title.
Table 4. Top-three most cited publications of four research topics within the field of the natural sciences (light scattering at aerosols, cosmic dust, colloids, and nanoparticles). The table shows the most cited publications over all publication years (1900-1970), within a publication year, and in 1908. Each cited reference is followed by the corresponding percentage.

Again, the unified and aggregated cited references are based on the use of the Levenshtein similarity function (with a similarity value of 0.75 as the threshold) for the journal/book title and the author names, as follows:
java -jar RefMatchCluster.jar -input=yearcr.dbf -matcher=journal_short,Levenshtein,0.75 -matcher=lastname,Levenshtein,0.75 -match=yearcr_match.csv -cluster=yearcr_cluster.dbf -aggregate=cleaned.csv
The results of the unification and aggregation process are shown in Table 4. For each of the four research topics, the three most frequently cited publications are shown: over all publication years, within a publication year, and in 1908.
Like the journals of the first example, the research topics of the second example show important differences (not surprising in consideration of the very different research topics) but also notable similarities. For example, four publications appear twice as the most cited publications over all publication years. Again, one publication is striking: The paper published by Mie (1908, [19]) is prominently cited by papers in all four research topics. It appears in all topics both in the category of the most cited publications during all publication years as well as in the category of the most cited publication in its publication year 1908.
Concerning the interpretation of the percentages of the most cited publications in Table 4, one should consider that, similar to our first example, the cited reference counts are frequently low. In the case of low cited reference counts, publications with only one or a few cited reference counts may have very high percentile values within a single publication year. For example, with only one single citation the cited reference "harries c., 1907, ber dtsch chem ges, v40, p165" is ranked in the 100th percentile within 1907.

Discussion
Bibliometric data can be used not only for evaluation purposes, but also for scientific investigations in a historical context. The bibliometric cited-reference data in WoS can be traced back to the first ever published documents. RPYS was proposed by Marx et al. (2014, [18]) to identify seminal publications in the knowledge base of a research field which are most important in a historical context. Based on ideas around the RPYS conducted by Comins and Hussey (2015, [4]), we refined our RPYS toolbox by adding some features to the existing programs and by developing two new routines. In this paper, two examples from the humanities and the natural sciences have been used to demonstrate the functionalities and results of the programs. A more technical description of the usage of the programs can be found at http://www.leydesdorff.net/software/rpys/.
The toolbox can be used flexibly in different contexts: (1) the historical roots of research fields can be identified; (2) scientific myths can be uncovered (Marx & Bornmann, 2014, [17]); (3) the most important publications for authors of specific journals can be identified (e.g., Leydesdorff et al., 2014, [15]); (4) the importance of single publications can be compared between different research fields, journals, and researchers. We would like to encourage scientists (bibliometricians) to follow the example of Comins and Hussey (2015, [4]) and to use the toolbox for their own examples (i.e., for their own research fields). We would be glad to receive suggestions for further improvements of the toolbox.
In our next project, we plan to develop a graphical user interface integrating the three programs. The user interface will be designed similarly to VOSviewer (van Eck & Waltman, 2010, [21]) and CitNetExplorer (van Eck & Waltman, 2014, [22]). The planned RPYS interface will allow the user to show, interlink, and adapt the results of a RPYS (graphs with RPY peaks and lists with cited references) in a flexible way.