Probability of observed third-party data reuse

Posted on 2012-07-16, 17:15, authored by Heather Piwowar

Using the same methods as our Nature letter-to-the-editor analysis, I’ve looked for reuse of gene expression microarray data in PubMed Central by searching for dataset ID numbers in the full text of studies. Studies that mention a dataset accession number but share author last names with those who deposited the dataset are excluded.
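As a rough illustration of the matching step described above, here is a minimal Python sketch. It is not the code used for the analysis. The accession pattern (GSE or GDS followed by digits), the function name `find_third_party_mentions`, and the `depositors` lookup structure are all assumptions made for the example.

```python
import re

# Assumption: dataset IDs look like "GSE1234" or "GDS567".
ACCESSION_RE = re.compile(r"\b(?:GSE|GDS)\d+\b")


def find_third_party_mentions(paper_text, paper_author_last_names, depositors):
    """Return accession IDs mentioned in paper_text that count as
    third-party reuse.

    depositors is a hypothetical dict mapping accession ID -> set of
    depositor last names. A mention is dropped when the paper shares
    any author last name with the dataset's depositors.
    """
    paper_names = {name.lower() for name in paper_author_last_names}
    reused = set()
    for accession in set(ACCESSION_RE.findall(paper_text)):
        if accession not in depositors:
            continue  # not a dataset we are tracking
        depositor_names = {name.lower() for name in depositors[accession]}
        if paper_names.isdisjoint(depositor_names):
            reused.add(accession)
    return reused


# Toy input, invented for illustration only.
text = "We reanalysed GSE1234 and GDS567 (see also GSE99)."
depositors = {"GSE1234": {"Smith"}, "GDS567": {"Jones"}}
mentions = find_third_party_mentions(text, ["Jones", "Lee"], depositors)
```

In this toy case only GSE1234 counts: GDS567 is dropped because "Jones" appears among both the paper's authors and the depositors, and GSE99 is not in the lookup table. Matching on last names alone is deliberately strict, so it will also discard some genuine third-party reuse by authors who happen to share a common surname with a depositor.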

The new results look at datasets deposited into the Gene Expression Omnibus (GEO) repository between 2001 and 2009.

The figure below has one panel per year; each panel covers the datasets deposited into GEO that year. The line shows the cumulative probability of the number of reuses we observed. As you can see in the first panel, almost every dataset deposited in 2001 has been mentioned in a PMC paper at least once, and most of them many times. The line quickly veers right: the probability of a 2001 dataset being reused only once or twice by 2010 is very small. In 2009, in contrast, the line goes mostly straight up: 90% of the datasets deposited in GEO in 2009 had 0 reuses observed by our conservative method, so the probability that a 2009 dataset has 0 observed reuses by 2010 is very high!
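For readers who want to see how a curve like the one in each panel can be computed, here is a minimal Python sketch of an empirical cumulative distribution over per-dataset reuse counts. The function names and the toy counts are assumptions for illustration; they are not the analysis code or the real data.

```python
from collections import Counter


def cumulative_reuse_distribution(reuse_counts):
    """Map each observed reuse count k to the fraction of datasets
    with at most k observed reuses."""
    total = len(reuse_counts)
    if total == 0:
        return {}
    tally = Counter(reuse_counts)
    cumulative = {}
    running = 0
    for k in sorted(tally):
        running += tally[k]
        cumulative[k] = running / total
    return cumulative


def fraction_reused(reuse_counts):
    """Fraction of datasets with at least one observed reuse."""
    if not reuse_counts:
        return 0.0
    return sum(1 for c in reuse_counts if c > 0) / len(reuse_counts)


# Made-up counts for ten datasets from a single deposit year:
# nine with no observed reuse, one reused three times.
toy_counts = [0] * 9 + [3]
curve = cumulative_reuse_distribution(toy_counts)
share = fraction_reused(toy_counts)
```

With these invented counts the curve already reaches 0.9 at zero reuses, which is the "straight up" shape described for the 2009 panel. A year whose datasets have mostly been reused would instead start low at zero and climb gradually as the reuse count increases.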

Results for the middle years are particularly important, since by then GEO had a lot of datasets, and between then and now there has been enough time for reuse to accumulate. We observed reuse of more than 20% of the datasets deposited in 2003 and 17% of the datasets deposited in 2007.

Note: the method used to detect reuse here is VERY CONSERVATIVE so these are minimum estimates. It only finds reuses by papers that are in PubMed Central, and only those that are attributed by mentioning the accession number (it misses those attributed by citation to the article, for example). Nonetheless, it does serve as a lower bound.

Analysis of the accession number mentions revealed that data reuse was driven by a broad base of datasets: about 20% of the datasets deposited between 2003 and 2007 have been reused by third parties. We note these proportions are gross underestimates since they only include reuses we observed as accession number mentions in PubMed Central; no attempt has been made to extrapolate these distribution statistics to all of PubMed, or to reflect attributions through citations. Further, many important instances of data reuse do not leave a trace in the published literature, such as those in education and training. Nonetheless, even these conservative estimates suggest that reuse finds value in a wide range of datasets, not simply a “very reusable” elite.

(manuscript-in-progress with co-author Todd Vision)