Using Shape Expressions (ShEx) to model, validate and curate Wikidata

<div>Wikidata is a public-domain data repository and access protocol that is open to the web and ready for use by both humans and machines. While the scientific community increasingly uses this valuable infrastructure to distribute findings to a larger audience, Wikidata is really a jack of all trades. </div><div>However, as the figure of speech continues, a “jack of all trades” is also a “master of none.” As a truly open data infrastructure, Wikidata is exposed to community issues such as disagreement, bias, human error, and vandalism. </div><div>From a curator's perspective, it can be challenging to filter through the different views represented in Wikidata while maintaining one's own definitions and standards. Whether the cause is a benign difference of opinion, malicious vandalism, or the introduction of low-quality evidence, public databases face extra challenges in ensuring data quality. </div><div>Here we propose the use of W3C Shape Expressions (ShEx: https://shexspec.github.io/spec/ ) as a toolkit to model, validate and filter the interactions between designated public resources and Wikidata. ShEx is a language for expressing constraints on RDF graphs, and Wikidata is available as an RDF graph. ShEx can be used to validate documents, communicate expected graph patterns, and generate user interfaces and interface code.</div><div>It will also allow us to efficiently: (1) exchange and understand each other’s models, (2) express a shared model of our footprint in Wikidata, (3) develop and test that model against sample data in an agile way and evolve it, and (4) catch disagreements, inconsistencies or errors at input time or in batch inspections.</div><div>Shape Expressions has already performed this function in the development of FHIR/RDF (https://www.hl7.org/fhir/rdf.html). 
The language was expressive enough to capture the constraints in FHIR, and its intuitive syntax helped people quickly grasp the range of conformant documents. The FHIR publication workflow tests all of the FHIR examples against the ShEx schemas, catching errors and inconsistencies before they reach the public.</div>
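<div>To make the idea of validating graph data against a shape more concrete, the following is a minimal sketch in plain Python. It is not ShEx syntax and not a ShEx processor: it is a toy stand-in that checks cardinality and value constraints on a handful of triples. All names in it (the ex: identifiers, GENE_SHAPE, validate) are hypothetical and chosen only for illustration.</div>

```python
import re

# Toy graph as (subject, predicate, object) triples.
# The "ex:" identifiers are hypothetical placeholders, not real Wikidata IDs.
TRIPLES = [
    ("ex:item1", "ex:instanceOf", "ex:Gene"),
    ("ex:item1", "ex:geneId", "1017"),
    ("ex:item2", "ex:instanceOf", "ex:Gene"),
    ("ex:item2", "ex:geneId", "not-a-number"),
    ("ex:item3", "ex:instanceOf", "ex:Gene"),
]

# A "shape" here is a mapping: predicate -> (min count, max count, value check).
# This mimics the spirit of a shape (expected predicates, cardinalities and
# value constraints) but is a deliberate simplification of what ShEx offers.
GENE_SHAPE = {
    "ex:instanceOf": (1, 1, lambda v: v == "ex:Gene"),
    "ex:geneId": (1, 1, lambda v: re.fullmatch(r"[0-9]+", v) is not None),
}


def validate(triples, focus, shape):
    """Return a list of human-readable problems for one focus node."""
    errors = []
    for predicate, (lo, hi, ok) in shape.items():
        # Collect every object attached to the focus node via this predicate.
        values = [o for s, p, o in triples if s == focus and p == predicate]
        if not lo <= len(values) <= hi:
            errors.append(
                f"{predicate}: expected {lo}..{hi} values, found {len(values)}"
            )
        errors.extend(
            f"{predicate}: unexpected value {v!r}" for v in values if not ok(v)
        )
    return errors


# Batch inspection: validate several focus nodes against the same shape.
REPORT = {
    node: validate(TRIPLES, node, GENE_SHAPE)
    for node in ("ex:item1", "ex:item2", "ex:item3")
}

if __name__ == "__main__":
    for node, problems in REPORT.items():
        print(node, "conforms" if not problems else problems)
```

<div>In this sketch a node conforms when its list of problems is empty. Predicates that the shape does not mention are ignored. A real ShEx schema would express the same expectations declaratively and could be shared between curators instead of being embedded in code.</div>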