We at Digital Science have been looking at the Data Citation Corpus to dig deeper into data citation counts.
The first release is based on a seed file that includes data citations from the following sources:
Data citations from DataCite and Crossref DOI metadata, via Event Data.
Data citations from the CZI Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of journal articles and preprints in Europe PMC.
In short, we are looking at papers that link to a dataset through a DataCite DOI or an accession number.
By combining this dataset with Dimensions.ai data in Google BigQuery, we were able to add more dimensions to the dataset (pardon the pun), such as funder or institution. Only about 70% of the paper links in the Data Citation Corpus were resolvable DOIs, though this should improve over time.
The combined data allows us to track how well things like the NIH open data policy are encouraging linking to datasets from papers.
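As a rough illustration of the join described above, here is a minimal Python sketch of matching corpus links to publication metadata by DOI. Everything in it is an assumption made for illustration: the sample rows, the field names, the `normalize_doi`, `doi_share` and `enrich` helpers, and the funder and institution values are hypothetical and do not reflect the actual Data Citation Corpus or Dimensions schemas. It also only checks whether an identifier is shaped like a DOI, not whether it actually resolves.

```python
import re
from collections import Counter

# Loose DOI shape check (an assumption; it does not test whether the DOI resolves).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(identifier):
    """Return a lowercase bare DOI, or None if the identifier is not DOI-shaped."""
    if not identifier:
        return None
    value = identifier.strip().lower()
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            value = value[len(prefix):]
            break
    return value if DOI_PATTERN.match(value) else None


def doi_share(links):
    """Fraction of links whose paper identifier is DOI-shaped."""
    if not links:
        return 0.0
    matched = sum(1 for paper_id, _ in links if normalize_doi(paper_id))
    return matched / len(links)


def enrich(links, publications):
    """Attach publication metadata to each link whose paper DOI is known."""
    enriched = []
    for paper_id, dataset_id in links:
        doi = normalize_doi(paper_id)
        meta = publications.get(doi) if doi else None
        if meta is None:
            continue  # not a DOI, or no matching publication record
        enriched.append({"paper_doi": doi, "dataset_id": dataset_id, **meta})
    return enriched


# Hypothetical corpus rows: (paper identifier, dataset identifier).
corpus_links = [
    ("https://doi.org/10.1234/paper.1", "10.5061/dryad.abc123"),
    ("10.1234/PAPER.2", "GSE12345"),
    ("PMC1234567", "10.5281/zenodo.1"),
    ("doi:10.1234/paper.1", "PRJNA000001"),
]

# Hypothetical publication metadata keyed by normalized DOI.
publications = {
    "10.1234/paper.1": {"funder": "Funder A", "institution": "University X"},
    "10.1234/paper.2": {"funder": "Funder B", "institution": "University Y"},
}

enriched_links = enrich(corpus_links, publications)
links_per_funder = Counter(row["funder"] for row in enriched_links)
```

Normalizing before the join matters because the same paper can appear as a bare DOI, a `doi:` string, or a full URL, and with different letter case. In BigQuery the same idea would be expressed as a join on a normalized DOI column followed by a group-by on funder or institution.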