This dataset contains results from an analysis of 9 Center for Open Science (COS) preprint systems. Data were collected in December 2018 using the Open Science Framework (OSF) Application Programming Interface (API, https://developer.osf.io/). The 9 preprint systems analyzed were eartharxiv, engrxiv, lawarxiv, lissa, marxiv, mindrxiv, paleorxiv, psyarxiv, and socarxiv. These systems were chosen because they met the following three conditions:
1. The service must host at least 100 manuscripts, so that summary statistics are meaningful.
2. The service must provide English-language manuscripts, so that topic analysis is possible.
3. The manuscripts must be accessible through the OSF API, so that the service can be analyzed programmatically.
Each of the 9 preprint services has its own subdirectory containing two files: a .log file (e.g. eartharxiv.log, engrxiv.log) and a keyword count log file (e.g. eartharxiv_keywords_count.log, engrxiv_keywords_count.log). The former is a delimited text file that uses a semicolon as the delimiter. Paper titles often contain commas, so separating columns with semicolons keeps the titles intact. The semicolon-delimited columns are:
identifier is a unique identifier supplied by the COS preprint system at the time of manuscript submission.
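As an illustration of why the semicolon delimiter matters, the following is a minimal Python sketch for reading one of the .log files. It assumes that identifier is the first column; the second column (a title) and the sample rows are hypothetical and are not taken from the dataset. Quote handling is disabled so that quotation marks inside titles are read literally, which is also an assumption about how the files were written.

```python
import csv
import io


def read_preprint_log(handle):
    """Yield one list of column values per non-empty row of a
    semicolon-delimited log file."""
    # QUOTE_NONE: treat quotation marks in titles as ordinary characters.
    reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield [field.strip() for field in row]


# Hypothetical sample rows (identifier;title), not real dataset content.
sample = io.StringIO(
    "abc12;Sediment, soil, and water: a review\n"
    "def34;A short title\n"
)
rows = list(read_preprint_log(sample))
for row in rows:
    print(row)
```

For a real file, the same function could be called on a handle opened with open("eartharxiv/eartharxiv.log", newline="", encoding="utf-8"); the path layout and the encoding are assumptions here. Commas inside a title stay within a single field because only semicolons split columns.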
The keyword count files summarize how many times each COS keyword is used. Each file has two columns, again delimited by semicolons: keyword and count.
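A keyword count file could be loaded along the following lines. This is a sketch only: it assumes the count column holds non-negative integers, and it skips any row whose second field is not numeric, which covers a possible header row (whether the files have one is not stated here). The sample keywords and counts are hypothetical.

```python
import csv
import io


def read_keyword_counts(handle):
    """Return a dict mapping keyword -> count from a two-column,
    semicolon-delimited keyword count file."""
    counts = {}
    for row in csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # ignore blank or malformed rows
        keyword, count = row[0].strip(), row[1].strip()
        if not count.isdigit():
            continue  # e.g. a header row such as "keyword;count"
        counts[keyword] = counts.get(keyword, 0) + int(count)
    return counts


# Hypothetical sample content, not real dataset values.
sample = io.StringIO("keyword;count\nGeology;12\nHydrology;7\n")
counts = read_keyword_counts(sample)

# Keywords ordered from most to least used.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```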
RatioData.csv summarizes results returned from the Unpaywall API. It contains preprint-to-postprint ratios for each of the 9 COS preprint systems.
The software used to generate this dataset is available at: https://doi.org/10.5281/zenodo.2649580