Tweet geolocation 5m
dataset
posted on 2016-04-22, 16:06 authored by Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, Adam Tsakalidis

Tweet-geolocation-5m is a dataset of more than 5 million geolocated tweets. Each tweet is associated with fine-grained location information, collected from OpenStreetMap [1] using the reverse geocoding feature in Nominatim [2]. It was originally created for country-level classification of tweets, but finer-grained location information is also provided with the dataset. Country codes follow the ISO 3166-1 alpha-2 standard [3].
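For readers unfamiliar with reverse geocoding, the sketch below shows one way a coordinate pair could be mapped to an ISO 3166-1 alpha-2 country code through the public Nominatim reverse endpoint. It is an illustration only, not the authors' collection pipeline: the function names and the `user_agent` value are hypothetical, and the public endpoint is assumed to return the country code in lower case under `address.country_code`.

```python
import json
import urllib.parse
import urllib.request

# Public Nominatim reverse-geocoding endpoint (assumed here; the dataset
# authors may have used a different instance or client).
NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon):
    """Build a reverse-geocoding request URL for a latitude/longitude pair."""
    query = urllib.parse.urlencode({"lat": lat, "lon": lon, "format": "jsonv2"})
    return f"{NOMINATIM_REVERSE}?{query}"


def country_code_from_response(payload):
    """Extract the country code from a decoded response and upper-case it.

    Returns None when the response carries no address (for example, a
    coordinate in the open sea).
    """
    code = payload.get("address", {}).get("country_code")
    return code.upper() if code else None


def reverse_geocode_country(lat, lon, user_agent="tweet-geolocation-example"):
    """Query the endpoint and return the country code, or None.

    The user_agent value is a placeholder; identify your own application
    when sending real requests.
    """
    request = urllib.request.Request(
        build_reverse_url(lat, lon), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return country_code_from_response(json.load(response))
```

Parsing is kept separate from the network call so that the extraction logic can be checked offline. Bulk lookups against the public endpoint are rate-limited, so a large collection would more plausibly rely on a local Nominatim instance.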
The dataset was collected in two week-long periods: TC2014, collected in October 2014, and TC2015, collected in October 2015.
Two files are provided here:
* tweet-geolocation-5m.tar.bz2, which is the actual dataset, providing the tweet IDs and ground-truth country IDs needed to conduct further experiments.
* vectors-and-folds.tar.bz2, which is provided for the purposes of reproducibility. With the information provided in this file, you should be able to reproduce the results we presented in the paper.
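
Both files are bzip2-compressed tar archives. As a minimal sketch, their contents can be inspected with Python's standard `tarfile` module before extraction. The internal file layout of the archives is not described here, so the helper below (a hypothetical name) only lists whatever regular files it finds rather than assuming any particular structure.

```python
import tarfile


def list_archive_members(path):
    """Return (name, size in bytes) for each regular file in a .tar.bz2 archive."""
    with tarfile.open(path, mode="r:bz2") as archive:
        return [
            (member.name, member.size)
            for member in archive.getmembers()
            if member.isfile()
        ]


# Example usage, assuming the archive has been downloaded to the working directory:
# for name, size in list_archive_members("tweet-geolocation-5m.tar.bz2"):
#     print(f"{size:>12}  {name}")
```

Listing members first is a cheap way to check what an archive contains and how large the extracted files will be.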