Social Media Corpus: Stigma Identification in Vaccination Discourse
This work introduces an annotated gold-standard dataset of 2,663 comments from Facebook (Meta). The comments are manually labelled as "stigma", "not stigma", or "ambiguous". Each comment is labelled three times (four times in case of disagreement) by independent expert annotators. On the three-label annotation task ("stigma", "not stigma", and "ambiguous"), the overall observed share of agreement is 68% and Fleiss' kappa is 0.62. On the two-label task ("stigma", "not stigma"), the share of agreement is 89% and Fleiss' kappa is 0.84. The labels are subsequently propagated from the annotated Facebook (Meta) comments to a dataset of 40,084 comments discussing COVID vaccines, drawn from Twitter, Reddit, and YouTube corpora. In addition, the corpora are annotated with linguistic features from LIWC (Linguistic Inquiry and Word Count) [1], [2] and three additional features: the number of characters in the comment string, a sentiment score, and a subjectivity score.
1. Pennebaker, J. W., Francis, M. E. & Booth, R. J. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Assoc. 71, 2001 (2001).
2. Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29, 24–54 (2010).
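The Fleiss' kappa values reported in the description can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `fleiss_kappa`, the `LABELS` tuple, and the toy ratings are hypothetical, and the sketch assumes every comment has the same number of ratings, whereas in this dataset some comments received a fourth label.

```python
from collections import Counter

# Label names follow the dataset description; the tuple itself is an assumption.
LABELS = ("stigma", "not stigma", "ambiguous")


def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item has the same number of ratings, which is the
    standard setting for Fleiss' kappa.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][j] = number of raters who gave item i the label labels[j]
    counts = [[Counter(item)[label] for label in labels] for item in ratings]
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("found a rating outside the given label set")

    # Observed agreement: mean share of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall label proportions.
    p_labels = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(labels))
    ]
    p_expected = sum(p * p for p in p_labels)

    if p_expected == 1.0:
        return float("nan")  # only one label ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy ratings, invented for illustration only (not taken from the dataset).
toy = [
    ["stigma", "stigma", "stigma"],
    ["not stigma", "not stigma", "not stigma"],
    ["stigma", "not stigma", "ambiguous"],
    ["stigma", "stigma", "not stigma"],
]
print(round(fleiss_kappa(toy), 3))
```

Restricting `labels` to the two categories "stigma" and "not stigma" and passing only the comments rated with those labels would give a two-label variant under the same assumptions.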
Categories
- Data engineering and data science
- Other health sciences not elsewhere classified
- Applications in health
- Applications in social sciences and education
- Natural language processing
- Modelling and simulation
- Information modelling, management and ontologies
- Information systems for sustainable development and the public good
- Deep learning
- Neural networks
- Semi- and unsupervised learning
- Data mining and knowledge discovery
- Data quality
- Graph, social and multimedia data
- Information extraction and fusion
- Data models, storage and indexing