Deleted Wikipedia articles (spam/vandalism/attack)
Dataset posted on 17.12.2019, 07:58 by Aaron Halfaker, Jacob Rogers
This dataset contains a random sample of deleted articles from English Wikipedia where the deletion reason was explicitly spam, vandalism, or attack. 25 articles were sampled for each deletion reason, for a total of 75 articles. The text of the articles was censored to remove identifying information.
The dataset contains the following columns:
· page_title -- The title of the deleted page
· rev_id -- The rev_id of the first revision
· creation_timestamp -- The time that the page was created
· archived -- 1 if the page was deleted, 0 if not (always 1)
· draft_quality -- The deletion reason (spam|vandalism|attack)
· censored_text -- The censored text of the deleted page
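The column layout above can be sanity-checked with a short script. The following is a minimal sketch, not part of the dataset: it assumes the rows have already been loaded as dictionaries keyed by column name (the file format is not stated here), and the names `check_rows`, `EXPECTED_COLUMNS` and `EXPECTED_REASONS` are hypothetical.

```python
from collections import Counter

# Column names and deletion reasons as listed in the dataset description.
EXPECTED_COLUMNS = [
    "page_title",
    "rev_id",
    "creation_timestamp",
    "archived",
    "draft_quality",
    "censored_text",
]
EXPECTED_REASONS = {"spam", "vandalism", "attack"}


def check_rows(rows):
    """Validate each row and return a Counter of deletion reasons.

    `rows` is assumed to be an iterable of dicts keyed by column name.
    """
    counts = Counter()
    for row in rows:
        missing = [col for col in EXPECTED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        # The description says archived is always 1.
        if str(row["archived"]) != "1":
            raise ValueError(f"unexpected archived value: {row['archived']!r}")
        if row["draft_quality"] not in EXPECTED_REASONS:
            raise ValueError(f"unexpected reason: {row['draft_quality']!r}")
        counts[row["draft_quality"]] += 1
    return counts
```

On the full dataset, the returned counter would be expected to show 25 rows for each of the three reasons, matching the sampling described above.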
Censored blocks are noted with a comment block in the censored_text column of the form "Censored: [reason]([explanation])" -- e.g. "Censored: PII(phone number)".
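The censor markers can be extracted with a regular expression. This is a minimal sketch that assumes only the "Censored: [reason]([explanation])" form described above; the exact comment delimiters around the marker are not specified here, so the pattern ignores them. The names `CENSOR_PATTERN` and `find_censored_blocks` are hypothetical.

```python
import re

# Matches "Censored: REASON(EXPLANATION)"; the reason may not contain
# parentheses, and the explanation ends at the first closing parenthesis.
CENSOR_PATTERN = re.compile(r"Censored:\s*([^()\n]+?)\s*\(([^)]*)\)")


def find_censored_blocks(text):
    """Return a list of (reason, explanation) tuples found in `text`."""
    return [(m.group(1), m.group(2)) for m in CENSOR_PATTERN.finditer(text)]
```

For example, calling `find_censored_blocks` on a `censored_text` value containing "Censored: PII(phone number)" yields a single pair with reason "PII" and explanation "phone number".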