WTR

WTR contains 409 Wikidata triple-reference pairs, covering 32 groups of text-rich web domains commonly used as sources and 76 distinct Wikidata properties. 43% of the references were obtained through external identifiers and 57% through direct URLs. Each entry has the attributes listed below; short, illustrative loading and parsing sketches follow each attribute list.


Reference attributes:

  • reference_id: A unique identifier issued to the reference by Wikidata;
  • reference_property_id: The unique identifier of the Wikidata property through which the reference encodes the URL we retrieve for it;
  • reference_datatype: Whether the reference's URL was retrieved as a direct URL or as a URL formatted with an external identifier;
  • url: The URL retrieved for this reference;
  • netloc: The actual web domain of this URL;
  • netloc_agg: The web domain of this URL after less common domains were grouped under the RARE and OTHER groups;
  • final_url: The URL reached after redirects, whose HTML and text were extracted by ProVe into sentences for annotation;
  • html: The name of the HTML file in the HTML folder containing the code extracted from the reference's final URL.
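The exact serialization of the released files is not spelled out above, so the following is a minimal loading sketch assuming the 409 entries ship as a single CSV file named WTR.csv inside the archive (the file name and format are assumptions; adjust for JSON/pickle as needed):

```python
# Minimal sketch: load the entries and inspect the reference attributes.
# "WTR.csv" is an assumed file name/format; adapt to the actual release.
import pandas as pd

wtr = pd.read_csv("WTR.csv")

# Share of references retrieved via external IDs vs. direct URLs
# (should roughly match the reported 43%/57% split).
print(wtr["reference_datatype"].value_counts(normalize=True))

# Number of domain groups after aggregation into RARE/OTHER.
print(wtr["netloc_agg"].nunique(), "domain groups")
```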

Claim attributes:

  • claim_id: A unique identifier issued to the claim by Wikidata;
  • rank: The claim's rank, either normal or preferred;
  • datatype: The datatype of the claim's object, e.g. quantity, string, etc.;
  • datavalue: The claim's object as retrieved from Wikidata;
  • Component IDs (entity_id, property_id): The Wikidata IDs of the claim's subject and property;
  • Component labels (entity_label, property_label, object_label): Main labels of the subject, property, and object;
  • Component aliases (entity_alias, property_alias, object_alias): Lists of aliases for the subject, property, and object;
  • Component descriptions (entity_desc, property_desc, object_desc): Wikidata descriptions of the subject, property, and object (if the object is a Wikidata item).
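As an illustration, the component labels can be joined into a human-readable triple. This sketch reuses the wtr DataFrame from above and assumes the label columns hold plain strings:

```python
# Minimal sketch: render one entry's claim as a human-readable triple.
row = wtr.iloc[0]
print(row["entity_label"], row["property_label"], row["object_label"])
print("object datatype:", row["datatype"], "| rank:", row["rank"])
```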

Annotations for evaluation:

crowd_annotations_T1: 

The evidence-level annotations describing the individual TER stance of each piece of evidence towards the claim, where the evidence set is the five most relevant passages collected from the URL. This consists of a list of JSON objects, each with the following attributes (a parsing sketch follows the list):


  • evidence: The textual evidence collected and being annotated (one of the 5 passages);
  • worker_id: A list of anonymized worker IDs denoting the crowd workers who provided annotations/votes;
  • assignment_id: An MTurk internal ID denoting the task in which the worker provided their annotations;
  • relation: The list of TER stances voted by the workers, where 0 = SUPP, 1 = REF, 2 = NEI, 3 = Not Sure;
  • reason_not_sure: If a worker voted 'Not Sure' as their relation, the reason why, out of the list of options given in the paper's appendix;
  • reason_not_sure_other: If none of the listed reasons fit, the worker could write one as free text;
  • times: The times in seconds taken by each worker to provide their full annotations;
  • relation_maj: The majority-voting aggregation of the workers' individual TER stance annotations;
  • relation_maj_tie: Whether the vote was tied, in which case the authors broke the tie;
  • reason_not_sure_maj: The majority-voting aggregation of the workers' reason_not_sure annotations, if relation_maj = 3.
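A minimal parsing sketch, assuming the column is stored as a JSON string (if it is already deserialized, drop the json.loads call); it recomputes the per-evidence majority and checks it against relation_maj:

```python
# Minimal sketch: recompute the majority TER stance per piece of evidence.
import json
from collections import Counter

STANCES = {0: "SUPP", 1: "REF", 2: "NEI", 3: "Not Sure"}

t1 = json.loads(wtr.iloc[0]["crowd_annotations_T1"])
for item in t1:
    votes = Counter(item["relation"])
    label, count = votes.most_common(1)[0]
    tie = list(votes.values()).count(count) > 1  # mirrors relation_maj_tie
    print(STANCES[label],
          "(tie)" if tie else "",
          "matches relation_maj:", label == item["relation_maj"])
```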

crowd_annotations_T2: 

The set-level annotations describing the collective TER stance of the entire evidence set towards the claim, where the evidence set is the five most relevant passages collected from the URL. This consists of a single JSON object with the following attributes (a comparison sketch follows the list):

  • evidence: The entire textual evidence set collected and being collectively annotated;
  • worker_id: A list of anonymized worker IDs denoting the crowd workers who provided annotations/votes;
  • assignment_id: An MTurk internal ID denoting the task in which the worker provided their annotations;
  • relation: The list of TER stances voted by the workers, where 0 = SUPP, 1 = REF, 2 = NEI, 3 = Not Sure;
  • reason_not_sure: If a worker voted 'Not Sure' as their relation, the reason why, out of the list of options given in the paper's appendix;
  • reason_not_sure_other: If none of the listed reasons fit, the worker could write one as free text;
  • times: The times in seconds taken by each worker to provide their full annotations;
  • relation_maj: The majority-voting aggregation of the workers' individual TER stance annotations;
  • relation_maj_tie: Whether the vote was tied, in which case the authors broke the tie;
  • reason_not_sure_maj: The majority-voting aggregation of the workers' reason_not_sure annotations, if relation_maj = 3.
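Continuing the sketch above (same JSON-string assumption), the five evidence-level majorities from T1 can be contrasted with the single set-level majority from T2:

```python
# Minimal sketch: per-evidence (T1) vs. evidence-set (T2) majorities.
t1 = json.loads(wtr.iloc[0]["crowd_annotations_T1"])
t2 = json.loads(wtr.iloc[0]["crowd_annotations_T2"])

print("T1 per-evidence majorities:",
      [STANCES[item["relation_maj"]] for item in t1])
print("T2 evidence-set majority:  ", STANCES[t2["relation_maj"]])
```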

author_annotations: 

The sentence-level annotation, provided by the authors, representing the stance of the entire reference towards the triple.
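For evaluation, the crowd's set-level stance can be compared against the authors' annotation. The encoding of author_annotations is not specified above (it may use the same 0-3 codes or string labels), so the comparison below rests on an assumption; inspect a few values before relying on it:

```python
# Minimal sketch: agreement between crowd T2 majorities and author labels.
# Assumes author_annotations uses the same 0-3 stance codes (unverified).
import json

crowd = wtr["crowd_annotations_T2"].map(lambda c: json.loads(c)["relation_maj"])
print(f"crowd/author agreement: {(crowd == wtr['author_annotations']).mean():.1%}")
```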