TM-Senti

dataset

posted on 2021-08-25, 13:11 authored by Wenjie Yin, Rabab Alkhalifa, Arkaitz Zubiaga

This is a large-scale, multilingual and longitudinal Twitter sentiment dataset sampled through distant supervision from the Twitter Stream Grab archive (https://archive.org/details/twitterstream). It covers the time period between January 2013 and June 2020 for 7 languages:

- Arabic (ar)

- German (de)

- English (en)

- Spanish (es)

- French (fr)

- Italian (it)

- Chinese (zh)

The files in this repository provide tweet IDs, which can be used to rehydrate the dataset from the files available in the Twitter Stream Grab.
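As a rough illustration of rehydration, the sketch below filters tweet records by ID. It assumes, which this page does not state, that the Twitter Stream Grab files are bz2-compressed text with one JSON object per line, and that each tweet carries an `id_str` or `id` field. The function names `matching_tweets` and `tweets_from_bz2` are hypothetical, and the sample records are made up.

```python
import bz2
import json
from typing import Iterable, Iterator, Set


def matching_tweets(lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[dict]:
    """Yield the JSON objects whose tweet ID is in wanted_ids."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(obj, dict):
            continue
        # Assumption: tweets expose "id_str" (string) or "id" (integer).
        # Records without either field (e.g. deletion notices) map to "".
        tweet_id = obj.get("id_str") or str(obj.get("id", ""))
        if tweet_id in wanted_ids:
            yield obj


def tweets_from_bz2(path: str, wanted_ids: Set[str]) -> Iterator[dict]:
    """Scan one bz2-compressed JSON-lines file (assumed archive layout)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from matching_tweets(handle, wanted_ids)


# Made-up records, for illustration only.
sample_lines = [
    '{"id": 1, "id_str": "1", "text": "hello :)"}',
    '{"delete": {"status": {"id": 2}}}',
    '{"id_str": "3", "text": "other"}',
]
found = list(matching_tweets(sample_lines, {"1"}))
```

Tweet IDs are compared as strings so that large IDs are never altered by numeric conversion.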

Files are in TSV (tab-separated values) format, with the following columns:

date \t tweetid \t sentiment \t evidence

where:

- date is the day on which the tweet was posted.

- tweetid is the ID of the tweet.

- sentiment is either pos or neg.

- evidence is the set of emojis or emoticons used to determine whether the tweet was positive or negative.
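The column layout above can be read with a few lines of Python. This is a minimal sketch rather than an official loader: the `Row` type and `read_rows` function are hypothetical names, the sample rows are invented, and because this page specifies neither the date format nor how multiple emojis are separated in the evidence column, both fields are kept as raw strings.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Row(NamedTuple):
    date: str       # raw string; the exact date format is not specified here
    tweetid: str    # kept as a string so the ID is never altered
    sentiment: str  # "pos" or "neg"
    evidence: str   # raw emoji/emoticon field, left unsplit


def read_rows(handle: TextIO) -> Iterator[Row]:
    """Yield one Row per well-formed line of a TSV file."""
    # QUOTE_NONE keeps quote characters inside emoticons from being
    # interpreted as CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        row = Row(*fields)
        if row.sentiment not in ("pos", "neg"):
            continue  # e.g. a header line, if a file has one
        yield row


# Invented sample rows, for illustration only.
sample = "2013-01-01\t123456789\tpos\t:)\n2013-01-02\t987654321\tneg\t:(\n"
rows = list(read_rows(io.StringIO(sample)))
wanted_ids = {row.tweetid for row in rows}
```

For a real file, pass a handle opened with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.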

More details about the dataset can be found in the following paper (please cite the paper if you use the dataset):

TBA

Keywords

sentiment analysis, twitter, longitudinal, multilingual, Natural Language Processing

Licence

CC BY 4.0
