Wikipedia Talk Labels: Toxicity

dataset

posted on 2017-02-22, 18:50 authored by Nithum Thain, Lucas Dixon, Ellery Wulczyn

This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via CrowdFlower as either a toxic or a healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file, and our research paper for documentation of the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook.
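Because each comment carries labels from several annotators, a common first step is to collapse them into one label per comment. The following is a minimal sketch of a majority vote, not the authors' method. The row layout (comment id, worker id, 0/1 toxicity flag), the sample values, and the function name `majority_labels` are all assumptions made for illustration; consult the schema documentation for the real file and column names.

```python
from collections import defaultdict

# Hypothetical annotation rows: (comment_id, worker_id, is_toxic).
# These values are made up for illustration and are not taken from the dataset.
annotations = [
    (1, "a", 0), (1, "b", 0), (1, "c", 1),
    (2, "a", 1), (2, "b", 1), (2, "c", 1),
    (3, "a", 0), (3, "b", 1),
]


def majority_labels(rows):
    """Map each comment id to True when more than half of its annotators
    marked it toxic. Ties resolve to False (not toxic)."""
    votes = defaultdict(list)
    for comment_id, _worker_id, is_toxic in rows:
        votes[comment_id].append(is_toxic)
    return {cid: sum(v) / len(v) > 0.5 for cid, v in votes.items()}


labels = majority_labels(annotations)
print(labels)
```

Resolving ties to "not toxic" is an arbitrary choice in this sketch; keeping the raw fraction of toxic votes instead preserves annotator disagreement for later analysis.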

Keywords

Wikipedia, Online Comments, Natural Language Processing, Knowledge Representation and Machine Learning

Licence

CC0
