This is a large-scale, multilingual and longitudinal Twitter sentiment dataset sampled through distant supervision from the Twitter Stream Grab archive (https://archive.org/details/twitterstream). It covers the time period between January 2013 and June 2020 for 7 languages:
- Arabic (ar)
- German (de)
- English (en)
- Spanish (es)
- French (fr)
- Italian (it)
- Chinese (zh)
The files in this repository contain tweet IDs, which can be used to rehydrate the dataset from the files available in the Twitter Stream Grab.
Files are in TSV format, with the following columns:
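As a rough illustration of rehydration, the sketch below filters line-delimited tweet JSON down to the tweets whose IDs are wanted. It rests on assumptions not stated in this README: that each line of an extracted Twitter Stream Grab file is one JSON object, and that the tweet ID is stored in an `id_str` field (or a numeric `id`). The function name `select_tweets` is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Set


def select_tweets(json_lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[Dict]:
    """Yield the tweet objects whose ID is in wanted_ids (IDs given as strings)."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(record, dict):
            continue
        # Assumption: the ID lives in "id_str", or in a numeric "id" field.
        # Records without either (e.g. non-tweet notices) get an empty ID.
        tweet_id = record.get("id_str") or str(record.get("id", ""))
        if tweet_id in wanted_ids:
            yield record
```

If the extracted archive files are bz2-compressed (also an assumption), the function could be fed from `bz2.open(path, "rt", encoding="utf-8")`, since an open text file is itself an iterable of lines.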
date \t tweetid \t sentiment \t evidence
where:
- date is the day on which the tweet was posted.
- tweetid is the ID of the tweet.
- sentiment is either pos or neg.
- evidence is the set of emojis or emoticons used to determine if the tweet was positive or negative.
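The four columns above can be read with a short script such as the following. This is a minimal sketch, not an official loader: the names `Row` and `read_rows` are hypothetical, every field is kept as a raw string because the date format and the way multiple evidence items are separated are not specified here, and rows whose sentiment is not pos or neg are skipped (which would also drop a header row, if the files have one).

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    date: str       # kept as a string; the exact date format is not assumed
    tweetid: str    # kept as a string for direct comparison with ID strings
    sentiment: str  # "pos" or "neg"
    evidence: str   # raw emoji/emoticon field, left unsplit


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Parse tab-separated lines into Row tuples."""
    # QUOTE_NONE keeps quote characters literal, since emoticons may contain them.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 4:
            continue  # skip lines that do not have exactly four columns
        row = Row(*fields)
        if row.sentiment not in ("pos", "neg"):
            continue  # skip unexpected labels (and any header row)
        yield row
```

For example, with a hypothetical file name, `open("en.tsv", encoding="utf-8", newline="")` can be passed directly to `read_rows`, and the resulting `tweetid` values collected into a set for rehydration.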
More details about the dataset can be found in the following paper (please cite the paper if you use the dataset):