Version 2: 2016-04-22, 16:06
Version 1: 2016-04-22, 11:38
dataset
Posted on 2016-04-22, 16:06. Authored by Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, Adam Tsakalidis
Tweet-geolocation-5m is a dataset of more than 5 million geolocated tweets. Each tweet is associated with fine-grained location information, collected from OpenStreetMap [1] using the reverse geocoding feature in Nominatim [2]. The dataset was originally created for country-level classification of tweets, but finer-grained location information is also provided. Country codes follow the ISO 3166-1 alpha-2 standard [3].
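To illustrate how a coordinate pair can be mapped to a country code with Nominatim's reverse geocoding endpoint, here is a minimal sketch. It is not the authors' collection pipeline: the function names, the `zoom` value, and the `User-Agent` string are assumptions made for illustration. Nominatim returns `country_code` in lowercase, so the sketch uppercases it to match the usual ISO 3166-1 alpha-2 presentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Nominatim reverse geocoding endpoint.
NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon):
    """Build a reverse geocoding URL for one coordinate pair."""
    params = {
        "lat": lat,
        "lon": lon,
        "format": "jsonv2",
        "zoom": 18,            # assumed: most detailed address level
        "addressdetails": 1,   # include the structured "address" object
    }
    return NOMINATIM_REVERSE + "?" + urlencode(params)


def country_code_from_response(payload):
    """Return the upper-cased country code, or None if it is absent."""
    code = payload.get("address", {}).get("country_code")
    return code.upper() if code else None


def reverse_geocode(lat, lon, user_agent="example-app (hypothetical)"):
    """Query Nominatim once; the User-Agent value here is a placeholder."""
    request = Request(build_reverse_url(lat, lon),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

A call such as `country_code_from_response(reverse_geocode(51.5074, -0.1278))` would perform one network request. The public Nominatim instance is rate limited, so processing millions of tweets would in practice call for a locally hosted instance.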
The dataset was collected in two separate week-long periods: TC2014, collected in October 2014, and TC2015, collected in October 2015.
Two files are provided here:
* tweet-geolocation-5m.tar.bz2, which is the actual dataset, providing the tweet IDs and ground-truth country IDs needed to conduct further experiments.
* vectors-and-folds.tar.bz2, which is provided for the purposes of reproducibility. With the information provided in this file, you should be able to reproduce the results we presented in the paper.
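Both files are bzip2-compressed tar archives, which Python's standard `tarfile` module can read directly. The sketch below only lists the regular files inside an archive; the internal layout of the archives is not described here, so no file names or column formats are assumed, and `list_archive_members` is a hypothetical helper name.

```python
import tarfile


def list_archive_members(path):
    """Return the names of the regular files stored in a .tar.bz2 archive."""
    with tarfile.open(path, mode="r:bz2") as archive:
        return [member.name for member in archive.getmembers()
                if member.isfile()]
```

For example, `list_archive_members("tweet-geolocation-5m.tar.bz2")` shows what the archive contains before extracting anything to disk.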