In the previous parts of this series (part 1, part 2), we reviewed how the contents of 24 journals of Russian origin are represented in CrossRef, Scopus and the Lens, with a special focus on comprehensiveness and accuracy.
This part is devoted to funding information.
Scopus data contains two fields: Funding text and Funding details. The former is a copy of the text from the acknowledgment section of the original publication, exactly as it appears there. It may contain expressions of gratitude to colleagues, sponsors, and mom (of course). The latter holds the funder names and award numbers that the Scopus team extracted from the Funding text. One may wonder why Scopus keeps both raw and structured data. The only answer that seems reasonable to me is that the quality of name recognition for funding data is no better than that for affiliation data, as funder names are also highly variable.
CrossRef relies on the content provided by the publishers and instructs them to submit funding metadata in a well-structured format, tied to the Open Funder Registry, which contains 13,000+ international funding organizations. The service allows publishers to update existing records with funding (and other) metadata at no extra charge. In other words, the presence of funding metadata in CrossRef depends only on the publisher’s will and skills.
The Lens seems to use only CrossRef metadata for funding information.
Let’s look into the real data.
The current problem with the Scopus CSV data is that the “Funding details” column uses a special delimiter between the individual funders (a double line break, judging by the pattern in the code below), which interferes with reading the CSV file. Therefore, we first read the CSV files as raw text, replace this delimiter with ‘|’, save the result, and read it again.
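library(tidyverse)  # readr, dplyr, tidyr, purrr, stringr, ggplot2
# `dir` (the project folder) is assumed to be defined as in the previous parts
ff <- list.files(paste0(dir, "/data/scopus/"))
ff <- ff[grepl("scopus", ff) & grepl("csv", ff)]
sc.data <- data.frame()
for (i in seq_along(ff)) {
  # read the file as raw text and replace the delimiter between funders with "|"
  sc.fund <- read_file(paste0(dir, "/data/scopus/", ff[i]))
  sc.fund <- gsub(pattern = "\\\n\\\n", replacement = "\\|", sc.fund)
  write_lines(sc.fund, paste0(dir, "/data/scopus/test.csv"))
  data <- read_csv(paste0(dir, "/data/scopus/test.csv"))
  # merge all "Funding Text" columns into one
  nm <- names(data)
  datax <- data %>% unite(col = "Fund_text", nm[grepl("Funding Text", nm)], sep = "_")
  sc.data <- rbind(sc.data, datax)
}
sc.data %>% write_excel_csv(paste0(dir, "/data/scopus/sc.data.csv"))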
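To see the effect of this substitution in isolation, here is a minimal sketch in base R on a made-up CSV string. It assumes that the delimiter is a double line break, as the pattern in the code above suggests; the DOI and the funder names are hypothetical.

```r
# Toy CSV text (hypothetical values): two funders in one quoted cell,
# separated by a double line break
raw <- "DOI,Funding Details\n\"10.1/a\",\"Funder A: 1\n\nFunder B: 2\"\n"

# Replace the double line break with "|" before parsing
fixed <- gsub("\n\n", "|", raw, fixed = TRUE)

# Parse the cleaned text; check.names = FALSE keeps the column name as is
toy <- read.csv(text = fixed, stringsAsFactors = FALSE, check.names = FALSE)
toy[["Funding Details"]]
```

After the substitution each record occupies a single physical line, and the funders can later be split on "|".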
Using the same approach as in the previous parts, we unnest the funder names in the Funding details column in order to have one funder per row.
sc.fund <- read_csv(paste0(dir, "/data/scopus/sc.data.csv"),
                    col_types = cols(ISSN = col_character())) %>%
  select(DOI, "year" = Year, ISSN, "src.title" = `Source title`,
         eid = EID, auth.affils = `Authors with affiliations`,
         fun.details = `Funding Details`, fun.text = Fund_text) %>%
  mutate(DOI = trimws(tolower(DOI)), ISSN = toupper(ISSN),
         src.title = trimws(toupper(src.title))) %>%
  # one funder per row
  mutate(fun.details = strsplit(fun.details, split = "\\|")) %>%
  unnest(fun.details) %>%
  # drop the "NA" placeholders left after uniting the Funding Text columns
  mutate(fun.text = gsub("_NA|NA_|NA_NA", "", fun.text)) %>%
  mutate(fun.text = ifelse(nchar(fun.text) < 2, NA, fun.text))
sc.fund %>% write_excel_csv(paste0(dir, "/data/scopus/sc.fund.csv"))
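The split-and-unnest step can be illustrated on a toy table. This is only a sketch in base R (so it runs without extra packages), not the pipeline itself; the eid values, funder names and award numbers are hypothetical.

```r
# Toy records (hypothetical): e1 has two funders, e2 has one, e3 has none
toy <- data.frame(
  eid = c("e1", "e2", "e3"),
  fun.details = c("Funder A: 11-22-33333|Funder B: 44-55-66666",
                  "Funder C: X1",
                  NA),
  stringsAsFactors = FALSE
)

# Split each cell on "|"; a missing cell stays a single NA
parts <- strsplit(toy$fun.details, split = "|", fixed = TRUE)

# Repeat the record id once per funder to get one funder per row
long <- data.frame(
  eid = rep(toy$eid, lengths(parts)),
  fun.details = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Records without funding details are kept as a single row with NA, which matters later when the share of publications with funding information is computed.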
Now we have a dataset with one funder (one entry of the Funding details field) per row.
## Observations: 3,418
## Variables: 8
## $ DOI <chr> "10.21517/0202-3822-2018-41-3-93-104", "10.21517/0...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20...
## $ ISSN <chr> "02023822", "02023822", "02023822", "02023822", "0...
## $ src.title <chr> "PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, SERIES...
## $ eid <chr> "2-s2.0-85056665925", "2-s2.0-85056647908", "2-s2....
## $ auth.affils <chr> "Dokuka, V.N., State Research Center of Russian Fe...
## $ fun.text <chr> NA, NA, NA, NA, NA, NA, NA, "This research was sup...
## $ fun.details <chr> NA, NA, NA, NA, NA, NA, NA, "Russian Science Found...
Next, we process the Lens JSON file exported in the first part.
# my use of purrr functions may seem amateurish, and so it is, sorry
ls.data <- jsonlite::fromJSON(paste0(dir, "/data/lens/lens-export.json"), flatten = TRUE)
ls.fund <- ls.data %>%
  select(funding, lens_id) %>%
  # replace NULL entries with empty tibbles so that unnest() can handle them
  mutate(funding = map_if(funding, is.null, ~ tibble())) %>%
  unnest(funding)
ls.meta <- read_csv(paste0(dir, "/data/lens/ls.meta.csv")) %>%
  select(lens_id, DOI = "doi", src.title = "source.title_full",
         "year" = year_published, "type" = publication_type,
         print, electronic) %>%
  filter(type != "journal-issue" & type != "journal issue") %>%
  mutate(DOI = trimws(tolower(DOI)),
         print = toupper(print),
         electronic = toupper(electronic),
         src.title = trimws(toupper(src.title))) %>%
  # keep the ISSN variant that is present in the Scopus data
  mutate(ISSN = ifelse(
    print %in% sc.fund$ISSN, print,
    ifelse(electronic %in% sc.fund$ISSN, electronic, NA))) %>%
  unique()
ls.meta %>%
  # ls.fund loses the records without funding on unnesting, we recover them via left_join
  left_join(ls.fund, by = c("lens_id")) %>%
  write_excel_csv(paste0(dir, "/data/lens/ls.fund.csv"))
This is a Lens dataset with funding info that we are going to use further.
## Observations: 3,414
## Variables: 11
## $ lens_id <chr> "000-101-321-426-866", "000-117-017-693-042", "000-...
## $ DOI <chr> "10.17580/gzh.2018.11.15", "10.15389/agrobiology.20...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 201...
## $ type <chr> "journal article", "journal article", "journal arti...
## $ print <chr> "00172278", "01316397", "2311911X", "00310301", "01...
## $ electronic <chr> NA, "23134836", "23136871", "15556174", NA, NA, "20...
## $ ISSN <chr> "00172278", "01316397", "2311911X", "00310301", "01...
## $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ funding_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ org <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ src.title <chr> "GORNYI ZHURNAL", "SEL'SKOKHOZYAISTVENNAYA BIOLOGIY...
CrossRef JSONs are unnested and merged into one dataset in the same way as the Lens JSON, except that we have to merge a few files.
ff <- list.files(paste0(dir, "/data/crossref/"))
ff <- ff[grepl("works", ff)]
cr.data <- data.frame()
for (i in seq_along(ff)) {
  cr.json <- jsonlite::fromJSON(paste0(dir, "/data/crossref/", ff[i]), flatten = TRUE)
  cr.funds <- cr.json$message$items %>%
    select(DOI, funder) %>%
    # replace empty entries with empty tibbles so that unnest() can handle them
    mutate(funder = map_if(funder, is_empty, ~ tibble())) %>%
    unnest(funder)
  cr.data <- rbind(cr.data, cr.funds)
}
# collapse multiple award numbers of one funder into a single string
cr.data <- cr.data %>%
  mutate(award = sapply(award, function(x) paste0(unlist(x), collapse = "|")))
cr.meta <- read_csv(paste0(dir, "/data/crossref/cr.meta.csv")) %>%
  select(DOI, "src.title" = source.title, type,
         "year" = pubdate, print, electronic) %>%
  filter(type != "journal-issue" & type != "journal issue") %>%
  mutate(DOI = trimws(tolower(DOI)),
         print = toupper(print),
         electronic = toupper(electronic),
         src.title = trimws(toupper(src.title))) %>%
  unique() %>%
  # keep the ISSN variant that is present in the Scopus data
  mutate(ISSN = ifelse(
    print %in% sc.fund$ISSN, print,
    ifelse(electronic %in% sc.fund$ISSN, electronic, NA))) %>%
  unique()
cr.meta %>%
  left_join(cr.data, by = "DOI") %>%
  write_excel_csv(paste0(dir, "/data/crossref/cr.fund.csv"))
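The award-collapsing step can be tried on a small list. A minimal sketch in base R, assuming each funder entry carries a name and zero or more award numbers; the values are hypothetical.

```r
# Toy funder entries (hypothetical): one with two awards, one with none
funders <- list(
  list(name = "Funder A", award = c("11-22-33333", "44-55-66666")),
  list(name = "Funder B", award = character(0))
)

# Collapse the award numbers of each funder into one "|"-separated string;
# a funder without awards yields an empty string
awards <- sapply(funders, function(f) paste0(unlist(f$award), collapse = "|"))
awards
```

Keeping the awards as one string per funder gives exactly one row per {publication, funder} pair in the flat table.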
Below is a summary of the dataset with the funding metadata from CrossRef records that we are going to use further.
cr.fund <- read_csv(paste0(dir, "/data/crossref/cr.fund.csv")) %>%
select(-src.title) %>% left_join(sc.labs) # making the universal src.title names
glimpse(cr.fund)
## Observations: 3,527
## Variables: 11
## $ DOI <chr> "10.32607/2075-8251-2017-9-4-84-91", "10.326...
## $ type <chr> "journal-article", "journal-article", "journ...
## $ year <dbl> 2018, 2018, 2019, 2019, 2018, 2018, 2018, 20...
## $ print <chr> "20758251", "20758251", "10637869", "1063786...
## $ electronic <chr> NA, NA, "14684780", "14684780", "15738493", ...
## $ ISSN <chr> "20758251", "20758251", "10637869", "1063786...
## $ DOI1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `doi-asserted-by` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ award <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ src.title <chr> "ACTA NATURAE", "ACTA NATURAE", "PHYSICS-USP...
The three datasets are compared by the share of publications that contain funding info.
library(scales)  # percent()
# scopus
sc_x <- sc.fund %>%
  mutate(status = !is.na(fun.text)) %>%
  group_by(src.title, ISSN, eid) %>%
  summarize(status = sum(status)) %>%
  mutate(status = ifelse(status != 0, "present", "missing")) %>%
  add_count(src.title) %>%
  group_by(src.title, ISSN, status, n) %>%
  summarize(pubs = n_distinct(eid)) %>%
  ungroup()
# crossref
cr_x <- cr.fund %>%
  mutate(status = !is.na(name)) %>%
  group_by(src.title, ISSN, DOI) %>%
  summarize(status = sum(status)) %>%
  mutate(status = ifelse(status != 0, "present", "missing")) %>%
  add_count(src.title) %>%
  group_by(src.title, ISSN, status, n) %>%
  summarize(pubs = n_distinct(DOI)) %>%
  ungroup()
# lens
ls_x <- ls.fund %>%
  mutate(status = !is.na(org)) %>%
  group_by(src.title, ISSN, lens_id) %>%
  summarize(status = sum(status)) %>%
  mutate(status = ifelse(status != 0, "present", "missing")) %>%
  add_count(src.title) %>%
  group_by(src.title, ISSN, status, n) %>%
  summarize(pubs = n_distinct(lens_id)) %>%
  ungroup()
x_x <- rbind(ls_x %>% mutate(source = "Lens"),
             cr_x %>% mutate(source = "CrossRef"),
             sc_x %>% mutate(source = "Scopus")) %>%
  ungroup() %>%
  mutate(label2 = percent(pubs / n, accuracy = 1)) %>%
  mutate(label = ifelse(nchar(src.title) < 64,
                        str_wrap(src.title, 25),
                        str_wrap(paste0(substr(src.title, 1, 64), "..."), 25)))
#chart1 ####
# `mytheme` is the ggplot2 theme defined in the previous parts
x_x %>%
  ggplot(aes(x = reorder(ISSN, n), y = pubs, fill = status)) +
  geom_col() +
  geom_text(inherit.aes = FALSE, data = x_x[x_x$status == "present", ],
            aes(x = reorder(ISSN, n), y = n + 2,
                label = scales::percent(pubs / n, accuracy = 1)),
            size = 2.7, fontface = "bold", hjust = 0) +
  facet_wrap(~source, ncol = 3) +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0.15))) +
  coord_flip() +
  scale_fill_manual(values = c("#fc8d62", "#66c2a5"), name = "SOURCE") +
  labs(title = "PRESENCE OF FUNDING INFO",
       subtitle = "PUBYEAR: 2018-2019",
       caption = "Accessed: June 29, 2019", y = "NUMBER OF PUBLICATIONS",
       x = "ISSN") +
  mytheme
# ggsave() is called separately: recent ggplot2 versions refuse to add it with `+`
ggsave(paste0(dir, "/funding_long.png"),
       width = 20, height = 12, units = "cm", dpi = 300)
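The logic behind the "present" and "missing" labels is easier to follow on a toy table. Below is a minimal sketch in base R with hypothetical records: a publication counts as having funding info if at least one of its rows has a non-empty funding text.

```r
# Toy records (hypothetical): e1 appears twice because it has two funders
toy <- data.frame(
  eid = c("e1", "e1", "e2", "e3", "e4"),
  fun.text = c("Supported by A", "Supported by A", NA, "Thanks to B", NA),
  stringsAsFactors = FALSE
)

# One flag per publication: TRUE if any of its rows has funding text
has_funding <- tapply(!is.na(toy$fun.text), toy$eid, any)

# Share of publications with funding info
share_present <- mean(has_funding)
share_present
```

Counting distinct publications rather than rows is the important part: after unnesting, a publication with several funders occupies several rows.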
Another projection of the same data, stressing the ratios.
The Scopus dataset contains some funding information for 22 (out of 24) journals, and 10 journals have more than half of their publications with non-empty Funding text fields. Some journals have a very low proportion of publications with funding information (HSE ECONOMIC JOURNAL, GORNYI ZHURNAL). This can be explained, but only partially, by the low level of competitive funding typical for some subject areas in Russia. Some journals may simply lack funding information in English, so Scopus is likely to miss it.
CrossRef and the Lens have funding data for only 1 journal (out of 24), Russian Mathematical Surveys, with a comparable proportion of publications that report funding (46%). It is worth noting that Scopus has a 2x higher ratio for this journal (90%). The Lens also contains 1 publication with funding info in Acta Naturae, but that’s all.
At this point, we finish the comparison part, as there is nothing to compare.
We found that Scopus has a much (much-much) higher proportion of publications with funding metadata (44%) than CrossRef and the Lens (both <1%). Let’s assess its quality, for example, how accurately Scopus recognizes the funder names.
In cases when Scopus knows the funder name & award number, these details are present in the Funding details field in funder_name: award_number format. When the funder name is not recognized, the Funding details field contains only the award number, which can be ambiguous when the pattern of the award number is not unique.
For instance, the two largest Russian funding agencies (RFBR, RSF) have a similar pattern of award numbers: dd-dd-ddddd (d = digit), so the awards of these funders cannot be distinguished without a funder name. Let’s check how many records with this pattern have properly recognized funder names.
sc.fund %>%
  filter(!is.na(fun.text)) %>%
  select(eid, fun.text) %>%
  unique() %>%
  # TRUE if the funding text contains an award number like dd-dd-ddddd
  mutate(rus.pattern = str_detect(fun.text, "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>%
  summarize(n = n_distinct(eid),
            rt = sum(rus.pattern == TRUE) / n,
            rf = sum(rus.pattern == FALSE) / n) %>%
  ungroup() %>%
  gather() %>%
  filter(key != "n") %>%
  ggplot(aes(x = 1, y = value, fill = key)) +
  geom_bar(position = "fill", stat = "identity") +
  coord_flip() +
  scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1),
                     labels = percent_format(),
                     expand = expand_scale(mult = c(0, 0))) +
  scale_fill_manual(breaks = c("rf", "rt"),
                    labels = c("Other funders", "RFBR or RSF"),
                    values = c("#fc8d62", "#66c2a5"), name = "") +
  labs(title = "PRESENCE OF RFBR-RSF AWARDS WITHIN FUNDING TEXT",
       subtitle = "PUBYEAR: 2018-2019",
       caption = "Accessed: June 29, 2019",
       y = "SHARE OF PUBLICATIONS WITH {DD-DD-DDDDD} PATTERNS", x = NULL) +
  guides(fill = guide_legend(title.position = "left", title.hjust = 0, title.vjust = 0.5,
                             label.position = "right", label.hjust = 0, label.vjust = 0.5)) +
  mytheme +
  theme(legend.position = "right",
        legend.justification = c("right", "top"),
        axis.text.y = element_blank(),
        strip.text = element_text(size = rel(0.6)),
        strip.background = element_rect(fill = "lightyellow", color = NA))
# ggsave() is called separately: recent ggplot2 versions refuse to add it with `+`
ggsave(paste0(dir, "/funds_DD_DD_DDDDD.png"),
       width = 20, height = 4, units = "cm", dpi = 300)
Among the 1241 Scopus records with non-empty funding information, 60% contain award numbers similar to those of RFBR and RSF in the Funding text field.
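The behaviour of the dd-dd-ddddd pattern can be checked on a few strings. This is a small sketch in base R (perl = TRUE is needed for the lookarounds); all award numbers except the 17-52-50038-NP-a example discussed in this post are hypothetical.

```r
# The same pattern as in the code above: dd-dd-ddddd, not glued to other digits
award_pattern <- "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)"

texts <- c(
  "Supported by RFBR, grant 18-29-12345",  # matches
  "Grant 17-52-50038-NP-a",                # matches: the pattern is a prefix
  "Contract 2018-12-123456",               # no match: extra digits on both sides
  "No award number here"                   # no match
)

hits <- grepl(award_pattern, texts, perl = TRUE)
hits
```

The lookarounds only guard against longer digit runs; a suffix such as -NP-a still passes, which is why the pattern alone cannot prove that an award belongs to RFBR or RSF.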
To assess the accuracy of Scopus name recognition, we are going to count how often this pattern is accompanied by a recognized funder name. The Funding text field may contain multiple mentions of the same funder, so we have to count the mentions of each individual award in the Funding details field.
library(DT)  # datatable(), formatStyle()
sc.fund %>%
  select(fun.details) %>%
  # filtering only those that contain the pattern
  mutate(rus.pattern = str_detect(fun.details,
                                  "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>%
  filter(rus.pattern == TRUE) %>%
  # extracting the funder names (the text before ": ")
  mutate(funder.name = str_extract(fun.details, ".+(?=: )")) %>%
  mutate(funder.name = ifelse(is.na(funder.name), "FUNDER NAME IS ABSENT", funder.name)) %>%
  count(funder.name) %>%
  arrange(desc(n)) %>%
  datatable(escape = FALSE, rownames = FALSE, filter = 'top',
            options = list(pageLength = 10,
                           lengthMenu = c(5, 10, 20),
                           autoWidth = TRUE,
                           columnDefs = list(list(width = '100px', targets = c(1))))) %>%
  formatStyle(2, `text-align` = "center")
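The name extraction relies on the funder_name: award_number layout of the Funding details entries. Here is a minimal base R sketch of the same idea on three toy entries; the award numbers are hypothetical.

```r
# Toy Funding details entries (hypothetical award numbers);
# the second one has no funder name, only an award number
details <- c(
  "Russian Science Foundation, RSF: 18-11-12345",
  "18-29-54321",
  "Russian Foundation for Basic Research, RFBR: 17-01-00001"
)

# Everything before the ": " separator is taken as the funder name
m <- regexpr(".+(?=: )", details, perl = TRUE)
funder <- rep(NA_character_, length(details))
funder[m > 0] <- regmatches(details, m)

# Entries without a separator get an explicit label
funder[is.na(funder)] <- "FUNDER NAME IS ABSENT"
table(funder)
```

Because the names are taken verbatim, every spelling variant of a funder ends up in its own group of the resulting table.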
Below I review the data aberrations hidden behind the rows of this table:
Both RFBR and RSF appear in several forms. RFBR names are Российский Фонд Фундаментальных Исследований (РФФИ), Russian Foundation for Basic Research, RFBR, Российский Фонд Фундаментальных Исследований (РФФИ), RFBR, Russian Foundation for Fundamental Investigations, Russian Humanitarian Foundation (the latter is a subdivision of RFBR). RSF names are Russian Science Foundation and Russian Science Foundation, RSF. It is hard to guess what Russian Science Support Foundation is. The minor names represent ca. 10% of cases.
There are 48 records (6%) without any funder name. Those are the cases when the Funding text contains an unconventional funder name, so Scopus fails to recognize the funder and extracts only the award number. Illustrative examples of such names are Russian Scientific Foundation or Russian fundamental research fund.
There are also some strange names (1%). One may think that the selected pattern is not as unique as we thought, and it is true: an award granted by the Japan Society for the Promotion of Science has a similar pattern, 17-52-50038-NP-a, but this explains only a few cases. Some names are caused by errors in the original publication: this publication mentions an award from RSCF (rscf.ru is the web site of RSF), which Scopus recognized as Richmond County Savings Foundation. But there are also records with names that Scopus seems to have made up by confusion. You can check this article, which is credited with support from Canon Foundation for Scientific Research with no sign of that in the Funding text. Or this article, for which Scopus created an extra funding line for The Ministry of Economic Affairs and Employment.
The presence of wrong funder names is not a big problem if one can see the full Scopus records via the UI or in a dataset. The Funding text field shows the original acknowledgment text, which more experienced users may decide to parse using their own algorithms.
But it can become a huge problem if the extracted records in the Funding details field are used as a data source for further analysis. This record contains the text …and AAS was supported by the Grant of the President…, where AAS is an abbreviation of one author’s name. Scopus took AAS for an abbreviation of a funding agency, so AAS became African Academy of Sciences. In addition, RSF was not recognized as a funder in this publication, though its name was spelled correctly. As a result, according to Scopus, this publication is a part of Russian-African scientific collaboration. Or, rather, due to Scopus.
A few more examples of such transformations:
the German Research Council (DFG) became California Department of Fish and Game
CGS became Canadian Geriatrics Society, while it should be China Geological Survey
ISF became Iowa Science Foundation, while it should be Israel Science Foundation
NSF (in just 1 case) was transformed into National Sleep Foundation
FANO (which is a transliteration of the Russian name of the Federal Agency of Scientific Organizations) became Captain Fanourakis Foundation
The Funding details section of the Scopus records contains missed and wrong names. Any analytical solution built on such data will inherit all these fallacies and translate them into fake evidence of co-funding.
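One simple sanity check that a user of the Funding details field could run is to test whether the extracted funder name literally occurs in the Funding text. The sketch below is only an illustration in base R, not a method used in this report; flag_unsupported is a hypothetical helper, and the second toy record (including its award number) is made up.

```r
# Hypothetical helper: TRUE when the funder name does not occur
# (case-insensitively) in the corresponding funding text
flag_unsupported <- function(fun.text, funder.name) {
  found <- mapply(
    function(txt, nm) grepl(tolower(nm), tolower(txt), fixed = TRUE),
    fun.text, funder.name, USE.NAMES = FALSE
  )
  !found
}

toy <- data.frame(
  fun.text = c(
    "AAS was supported by the Grant of the President",
    "Supported by the Russian Science Foundation, grant 18-11-12345"
  ),
  funder.name = c("African Academy of Sciences", "Russian Science Foundation"),
  stringsAsFactors = FALSE
)

toy$suspicious <- flag_unsupported(toy$fun.text, toy$funder.name)
toy$suspicious
```

Such a check would also flag legitimate cases where the text mentions only a well-known abbreviation, so it can only shortlist records for manual review.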
In our experiment Scopus outperformed the Lens and CrossRef as a source of funding information. For the 24 selected journals, Scopus contained funding information in 44% of documents, the Lens and CrossRef in just 1%. The reported ratios are specific to the selected journals.
Scopus can be used as a source of the original acknowledgments (stored in the field named Funding text), which requires an additional work to parse the texts and extract the individual awards.
The Funding details column in the Scopus records contains errors, the proportion of which, we suggest, depends on the variability of the funder name and its prominence. Wellcome Trust awards will appear in the Funding details with higher precision and recall than the awards of a less known funding agency, especially if the latter is present in an abbreviated form.
This report does not evaluate the completeness of funding data in Scopus. Due to the lack of funding information in CrossRef and the Lens (for the selected journals!), we were not able to check whether there are any missing records in Scopus. And we did not scan the original full texts in order to detect the acknowledgments sections and check whether those are complete in Scopus. We can only speculate that Scopus may fail to grab funding information that is present in Russian. This can explain the very low share of articles with funding information in journals like GORNYI ZHURNAL, FIBRE CHEMISTRY.
This report calls for actions designed to improve the quality of funding information in open data sources like CrossRef. The funding agencies may need to adapt their policies to ensure that the funding information is mentioned in the papers in a proper way (i.e. conventional names in both English and Russian).
I am grateful to the Lens & CrossRef teams for what they do. Many thanks to all the experts who care about sharing their experience and contribute to the community with free tutorials and kind advice. Love to the dearest R community.
Lutay, A. (2019, July 17). Funding Information in 24 Russian Journals in the Lens, Scopus, and CrossRef (Version 1). figshare. <https://doi.org/10.6084/m9.figshare.8942132>
Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
Hadley Wickham (2018). scales: Scale Functions for Visualization. R package version 1.0.0. https://CRAN.R-project.org/package=scales
Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont
Bob Rudis and Dave Gandy (2019). waffle: Create Waffle Chart Visualizations. R package version 1.0.0. https://github.com/hrbrmstr/waffle/tree/cran
Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.
Yihui Xie, Joe Cheng and Xianying Tan (2019). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.7. https://CRAN.R-project.org/package=DT