Uberization Module: A life saver for manually entered dirty citation data in a faculty reporting tool (VIVO 2018)

2018-11-27T17:36:45Z (GMT) by Muhammad Javed
Cornell University is a decentralized institution where every college and school uses its own means and procedures to record its faculty’s publications. Some rely on institutional repositories such as Digital Commons from bepress, while others use faculty reporting tools such as Activity Insight from Digital Measures or Symplectic Elements from Digital Science. In this presentation, I will discuss a case study of the College of Agriculture and Life Sciences (CALS), which currently uses Activity Insight (AI) for its faculty reporting needs.

Every year during faculty reporting season, faculty report their research contributions of the past year. In the College of Agriculture and Life Sciences (CALS), different strategies are used to collect publication data from faculty: i) faculty provide their up-to-date CVs, and an admin staff member from the college reads the CVs and manually enters the publication data into the reporting system; ii) faculty copy and paste the publication list from their CVs into a free-text template provided by the CALS administration, as a single text blob; or iii) faculty themselves log in to the reporting system and enter their publications in a publication template form. In all three options, publications are entered manually into the faculty reporting system. Such manually entered data is prone to errors, and many examples have been found where the manually entered citation data does not reflect the actual publication. Noticed errors include incorrect journal names, incorrect ISSN/EISSN numbers, mistakes in DOIs, and an incorrect list or order of authors. Such dirty citation data cannot be used for data analysis or future strategic discussions. In the Scholars@Cornell project, we use an uberization module to clean such dirty data.

First, we load the dirty publication data from Activity Insight (AI) into Symplectic Elements as an institutional feed. In cases where the loaded publication has already been harvested by Symplectic Elements via upstream sources (such as WoS, PubMed, or Scopus), the AI publication becomes another record in the existing publication object. In scenarios where the AI publication is the first record in Elements, one may re-run the search for the faculty member so that citation data for the same publication is harvested from the upstream sources as well. Once this step is completed, the next step is to extract the publication objects from Elements, merge the data from the different sources (i.e., one record from each source), and create a single record – an “uber record” – for each article. To create an uber record, we ranked the citation data sources based on the experience and intuition of two senior Cornell librarians and started with the metadata from the source they considered best. The uberization module merges the citation data from the different publication records (including the AI record) into a single record that is clean and comprises the best available citation data. After passing data validation, the uber records are transformed into an RDF graph and loaded into Scholars@Cornell.
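The source-ranked merging step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the source ranking, source names, and metadata field names are assumptions for the example, not the actual Scholars@Cornell configuration.

```python
# Sketch of source-ranked citation-record merging ("uberization").
# The ranking below is a hypothetical example; the real ranking was
# chosen by two senior Cornell librarians.
SOURCE_RANK = ["wos", "scopus", "pubmed", "ai"]  # best source first (assumed)

def uberize(records):
    """Merge per-source records for one publication into a single
    'uber record', preferring values from higher-ranked sources."""
    # Order records so the most trusted source is consulted first;
    # unknown sources fall to the end.
    ordered = sorted(
        records,
        key=lambda r: SOURCE_RANK.index(r["source"])
        if r["source"] in SOURCE_RANK
        else len(SOURCE_RANK),
    )
    uber = {}
    for record in ordered:
        for field, value in record.items():
            if field == "source":
                continue
            # Keep the first (best-ranked) non-empty value for each field.
            if field not in uber and value:
                uber[field] = value
    return uber

# Toy example: the manually entered AI record has a wrong DOI, which the
# higher-ranked WoS record overrides; PubMed fills in the missing ISSN.
records = [
    {"source": "ai", "title": "Soil Microbes", "doi": "10.999/wrong", "issn": ""},
    {"source": "wos", "title": "Soil Microbes and Crop Yield", "doi": "10.1000/xyz123"},
    {"source": "pubmed", "issn": "1234-5678"},
]
print(uberize(records))
```

Field-level merging like this is one plausible design; the real module may merge at record level or apply per-field source preferences.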