WheatVIVO: Integrating diverse data sources for an international perspective on wheat funding and research activities

WheatVIVO is being developed by The Wheat Initiative[1] as a showcase of information about researchers and projects across the global public-private wheat community. WheatVIVO aims to serve the needs of researchers looking to develop collaborations, students and postdocs seeking to identify labs in which they would like to work, and policy makers and funding agencies working to better understand research priorities in different countries. WheatVIVO harvests linked open data provided by existing VIVO installations as well as various non-RDF sources. While data integration is fully automated, WheatVIVO also makes it possible for non-programmers to configure data retrieval, the resolution of common entities, and the merging of possibly contradictory or duplicate data, as well as to provide manual corrections. The VIVO software is extended both in the public website and in a separate admin application, where administrators can view data with their provenance information and set configuration options such as the times and dates at which different data sources should be harvested and the order in which sources should be used when they offer data about the same entity. Through the admin application, Wheat Initiative personnel can add and edit patterns and associated weightings for automatically matching entities across the sources, and iteratively test the resulting merged data in a staging VIVO before scheduling the merge process to run automatically at desired intervals. The WheatVIVO website allows visitors to flag errors discovered in the data and to provide feedback to project staff, who are then prompted either to review the associated matching rules or to forward the feedback to the original data providers. Statistics are recorded about how frequently data from different sources are viewed, in order to help original providers quantify the benefit of making their data open and available.
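As an illustration only, the weighted matching patterns and the source ordering described above might be sketched as follows. Everything in this sketch is hypothetical: the record layout, the property names, the weights, the threshold, and the function names are assumptions made for the example and are not taken from the WheatVIVO code.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str   # name of the data source the record was harvested from
    values: dict  # property name -> value


# Hypothetical matching rules: (property, weight). An administrator could
# edit such a list without writing code.
MATCH_RULES = [("orcid", 1.0), ("email", 0.8), ("name", 0.4)]
MATCH_THRESHOLD = 0.8  # assumed cut-off for treating two records as one entity


def normalize(value):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def match_score(a, b):
    """Sum the weights of all rules whose property agrees in both records."""
    score = 0.0
    for prop, weight in MATCH_RULES:
        va, vb = a.values.get(prop), b.values.get(prop)
        if va and vb and normalize(va) == normalize(vb):
            score += weight
    return score


def is_same_entity(a, b):
    return match_score(a, b) >= MATCH_THRESHOLD


def merge(records, source_priority):
    """Merge records describing one entity.

    Sources listed earlier in source_priority win when values conflict.
    Returns the merged values and, per property, the source it came from.
    """
    rank = {name: i for i, name in enumerate(source_priority)}
    merged, provenance = {}, {}
    for rec in sorted(records, key=lambda r: rank.get(r.source, len(rank))):
        for prop, value in rec.values.items():
            if value and prop not in merged:
                merged[prop] = value
                provenance[prop] = rec.source
    return merged, provenance


a = Record("source_a", {"name": "Jane Doe", "email": "jane@example.org",
                        "affiliation": "Institute X"})
b = Record("source_b", {"name": "jane  doe", "email": "JANE@example.org",
                        "affiliation": "Institute Y", "country": "France"})

if is_same_entity(a, b):
    merged, provenance = merge([a, b], ["source_a", "source_b"])
    print(merged)
    print(provenance)
```

In this sketch the conflicting affiliation is taken from the higher-priority source, while the country, which only the second source supplies, is kept along with a note of where it came from. Reversing the priority list would flip the affiliation without any change to the matching rules.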
VIVO’s browsing and visualization capabilities are adapted to highlight the international aspects of coauthorship and project participation. Challenges include issues of data normalization and comparison, such as where funding cycles and salary support differ across countries, as well as the integration of open but unstructured data. It is also anticipated that improvements to the data correction and feedback interfaces will be identified after the system’s production launch in late spring 2017, and that future updates will permit the data ingest processes to learn from these corrections to prevent recurrence of errors. The WheatVIVO admin application, portal and core data ingest code are being developed by private contractor Ontocale SRL. The INRA DIST[2] team contributes to the project by developing connectors to download data from data sources. WheatVIVO code is open source and available on GitHub[3]. The INRA DIST project leader oversees the development of the project together with the Wheat Initiative International Scientific Coordinator.

[1] http://www.wheatinitiative.org
[2] Institut National de la Recherche Agronomique - Délégation Information Scientifique et Technique
[3] http://github.com/wheatvivo