Deleted Wikipedia articles (spam/vandalism/attack)

posted on 17.12.2019, 07:58 by Aaron Halfaker, Jacob Rogers
This dataset contains a random sample of deleted articles from English Wikipedia where the deletion reason was explicitly spam, vandalism, or attack. For each deletion reason, 25 articles were sampled, for a total of 75 articles. The text of the articles was censored to remove identifying information.

The dataset contains the following columns:

 · page_title -- The title of the deleted page
 · rev_id -- The revision ID (rev_id) of the page's first revision
 · creation_timestamp -- The time that the page was created
 · archived -- 1 if the page was deleted, 0 if not (always 1 in this dataset)
 · draft_quality -- The deletion reason (spam|vandalism|attack)
 · censored_text -- The censored text of the deleted page
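
As a rough illustration of how the column descriptions above could be checked after loading, here is a minimal sketch. It assumes the rows have already been read into dictionaries keyed by column name; the file format is not stated here, and the function name check_rows and the sample row values are hypothetical.

```python
from collections import Counter

# Column names and allowed deletion reasons, as listed in the description.
EXPECTED_COLUMNS = [
    "page_title", "rev_id", "creation_timestamp",
    "archived", "draft_quality", "censored_text",
]
DELETION_REASONS = {"spam", "vandalism", "attack"}


def check_rows(rows):
    """Check each row against the column descriptions.

    Returns a Counter of rows per deletion reason; raises ValueError
    on the first row that does not match the description.
    """
    counts = Counter()
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i}: missing columns {missing}")
        # archived is described as always 1; accept 1 or "1".
        if str(row["archived"]) != "1":
            raise ValueError(f"row {i}: archived is {row['archived']!r}, expected 1")
        if row["draft_quality"] not in DELETION_REASONS:
            raise ValueError(f"row {i}: unexpected draft_quality {row['draft_quality']!r}")
        counts[row["draft_quality"]] += 1
    return counts


# Hypothetical sample row, for illustration only (not taken from the dataset).
sample = {
    "page_title": "Example page",
    "rev_id": 1,
    "creation_timestamp": "placeholder",
    "archived": 1,
    "draft_quality": "spam",
    "censored_text": "",
}
print(check_rows([sample]))
```

On the full dataset, the returned counts would be expected to show 25 rows for each of the three deletion reasons.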

Censored blocks are marked in the censored_text column by a comment block of the form "Censored: [reason]([explanation])", e.g. "Censored: PII(phone number)".
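
The censoring markers can be pulled out of censored_text with a regular expression. The sketch below is one possible approach, not the authors' tooling: it matches only the documented "Censored: reason(explanation)" text and assumes that the reason contains no spaces or parentheses and that the explanation contains no nested parentheses. The helper name find_censored_blocks is hypothetical, and the HTML-style comment delimiters in the sample string are an assumption, since the exact comment syntax is not specified here.

```python
import re

# Matches the documented marker text "Censored: reason(explanation)".
# Assumptions: reason has no whitespace or parentheses; explanation has
# no nested parentheses. Surrounding comment delimiters are ignored.
CENSORED_RE = re.compile(r"Censored:\s*([^\s()]+)\s*\(([^()]*)\)")


def find_censored_blocks(text):
    """Return a list of (reason, explanation) pairs found in text."""
    return [(m.group(1), m.group(2).strip()) for m in CENSORED_RE.finditer(text)]


# Hypothetical sample text; the <!-- --> delimiters are an assumption.
sample_text = "Call me at <!-- Censored: PII(phone number) --> today."
print(find_censored_blocks(sample_text))  # [('PII', 'phone number')]
```

Because the pattern keys on the marker text rather than on the comment delimiters, it should keep working whichever comment syntax the column actually uses, as long as the marker itself follows the documented form.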