28 files

Wikidata Constraint Violations - July 2018 - extended

posted on 2020-12-07, 19:35 authored by Thomas Pellissier Tanon
This dataset is a cleaned-up and annotated version of another dataset shared previously: https://figshare.com/articles/dataset/Wikidata_Constraints_Violations_-_July_2017/7712720

This dataset contains corrections for Wikidata constraint violations extracted from the July 1st 2018 Wikidata full history dump.
It was created as part of the work "Neural Knowledge Base Repairs" by Thomas Pellissier Tanon and Fabian Suchanek.
An example of code making use of this dataset is available on GitHub: https://github.com/Tpt/bass-materials/blob/master/corrections_learning.ipynb

The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type

The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints).
It is a tab-separated file with the following columns:
1. constraint id: the URI of the Wikidata statement describing the constraint
2. property id: the URI of the property that is constrained
3. type id: the URI of the constraint type (type, value type...). It is a Wikidata item.
4. 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell but separated by a space. The columns are:
* regex: https://www.wikidata.org/wiki/Property:P1793
* exceptions: https://www.wikidata.org/wiki/Property:P2303
* group by: https://www.wikidata.org/wiki/Property:P2304
* items: https://www.wikidata.org/wiki/Property:P2305
* property: https://www.wikidata.org/wiki/Property:P2306
* namespace: https://www.wikidata.org/wiki/Property:P2307
* class: https://www.wikidata.org/wiki/Property:P2308
* relation: https://www.wikidata.org/wiki/Property:P2309
* minimal date: https://www.wikidata.org/wiki/Property:P2310
* maximum date: https://www.wikidata.org/wiki/Property:P2311
* maximum value: https://www.wikidata.org/wiki/Property:P2312
* minimal value: https://www.wikidata.org/wiki/Property:P2313
* status: https://www.wikidata.org/wiki/Property:P2316
* separator: https://www.wikidata.org/wiki/Property:P4155
* scope: https://www.wikidata.org/wiki/Property:P5314
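The column layout above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the file is assumed to have no header row, the 15 attribute columns are assumed to appear in the order listed, and the function name `parse_constraints` and the dictionary keys are hypothetical names chosen for illustration. Splitting every attribute cell on spaces follows the description above, but it may break a regex that itself contains a space, so treat the `regex` values with care.

```python
# Hypothetical short names for the 15 attribute columns, in the listed order.
ATTRIBUTE_COLUMNS = [
    "regex", "exceptions", "group_by", "items", "property", "namespace",
    "class", "relation", "minimal_date", "maximum_date", "maximum_value",
    "minimal_value", "status", "separator", "scope",
]


def parse_constraints(lines):
    """Parse lines of constraints.tsv into a list of dictionaries.

    Assumption: no header row, 3 fixed columns followed by 15 attribute columns.
    """
    constraints = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        row = line.split("\t")
        # Pad short rows so that missing trailing cells read as empty.
        row += [""] * (3 + len(ATTRIBUTE_COLUMNS) - len(row))
        attributes = {
            # Multiple values share one cell, separated by a space.
            name: cell.split(" ") if cell else []
            for name, cell in zip(ATTRIBUTE_COLUMNS, row[3:])
        }
        constraints.append({
            "constraint_id": row[0],
            "property_id": row[1],
            "type_id": row[2],
            "attributes": attributes,
        })
    return constraints
```

A typical call would be `parse_constraints(open("constraints.tsv", encoding="utf-8"))`, assuming the file is UTF-8 encoded.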

The other files provide, for each constraint type, the list of all corrections extracted from the edit history. Each file contains one line per correction, with the following tab-separated values:

1. constraint id
2. revision that fixed the constraint violation
3. first violation triple subject
4. first violation triple predicate
5. first violation triple object
6. second violation triple subject (blank if no second violation triple)
7. second violation triple predicate (blank if no second violation triple)
8. second violation triple object (blank if no second violation triple)
9. separator (not useful)
10. subject of the first triple in the correction
11. predicate of the first triple in the correction
12. object of the first triple in the correction
13. is the first triple in the correction an addition or a deletion (`` for a deletion and `` for an addition)
14. subject of the second triple in the correction (might not exist)
15. predicate of the second triple in the correction (might not exist)
16. object of the second triple in the correction (might not exist)
17. is the second triple in the correction an addition or a deletion (`` for a deletion and `` for an addition) (might not exist)
18. Description of the subject of the first violation triple encoded in JSON
19. Description of the object of the first violation triple encoded in JSON (might be empty for literals)
20. Description of the term of the second triple that has not already been described by the two previous descriptions (might be empty for literals or if there is no second triple)
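
The 20-value layout above can be unpacked as in the following sketch. It rests on assumptions that should be checked against the actual files: values that "might not exist" are assumed to be present as empty cells, so that every line still has 20 tab-separated positions, and the addition/deletion marker is returned as the raw cell without interpretation. The function name `parse_correction` and the dictionary keys are hypothetical.

```python
import json


def parse_correction(line):
    """Split one correction line into its violation, correction and descriptions.

    Assumption: optional values are empty cells, so positions stay fixed.
    """
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (20 - len(fields))  # pad if trailing cells are missing

    def triple(start):
        # Return (subject, predicate, object), or None if all three are blank.
        s, p, o = fields[start:start + 3]
        return (s, p, o) if (s or p or o) else None

    def description(index):
        # Descriptions are JSON-encoded; an empty cell means no description.
        cell = fields[index]
        return json.loads(cell) if cell else None

    # Each correction entry pairs a triple with its raw addition/deletion marker.
    correction = [(triple(9), fields[12])]
    if triple(13) is not None:
        correction.append((triple(13), fields[16]))

    return {
        "constraint_id": fields[0],
        "revision": fields[1],
        "violation": [t for t in (triple(2), triple(5)) if t is not None],
        "correction": correction,
        "descriptions": [description(i) for i in (17, 18, 19)],
    }
```

Column 9 (the separator) is skipped on purpose, since the description marks it as not useful.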