figshare

Tweet geolocation 5m

Version 2 2016-04-22, 16:06
Version 1 2016-04-22, 11:38
dataset
posted on 2016-04-22, 16:06, authored by Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, Adam Tsakalidis
Tweet-geolocation-5m is a dataset of more than 5 million geolocated tweets. Each tweet is paired with fine-grained location information, collected from OpenStreetMap [1] using the reverse geocoding feature of Nominatim [2]. The dataset was originally created for country-level classification of tweets, but the finer-grained location information provided with it also supports finer-grained classification. Country codes follow the ISO 3166-1 alpha-2 standard [3].
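
As a rough illustration of the reverse geocoding step described above, the following Python sketch builds a request URL for the public Nominatim reverse endpoint and reads the country code from a parsed response. It is not the authors' collection code. The function names `build_reverse_url` and `country_code` are hypothetical, and the sample payload is hand-written for illustration rather than taken from the dataset. Nominatim returns `country_code` in lower case, so the sketch upper-cases it to match the usual ISO 3166-1 alpha-2 notation.

```python
from urllib.parse import urlencode

# Public Nominatim endpoint; check its usage policy before sending bulk requests.
NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon):
    """Build a reverse geocoding URL for a latitude/longitude pair (hypothetical helper)."""
    query = urlencode({"format": "jsonv2", "lat": lat, "lon": lon})
    return f"{NOMINATIM_REVERSE}?{query}"


def country_code(payload):
    """Return the upper-cased country code from a parsed response, or None if absent."""
    code = payload.get("address", {}).get("country_code")
    return code.upper() if code else None


# Hand-written payload in the general shape of a Nominatim response (illustrative only).
sample = {"address": {"country": "United Kingdom", "country_code": "gb"}}

print(build_reverse_url(52.38, -1.56))
print(country_code(sample))
```

The sketch deliberately performs no network request; fetching the URL and decoding the JSON body is left to whichever HTTP client is in use.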

The dataset was collected over two week-long periods: TC2014, collected in October 2014, and TC2015, collected in October 2015.

Two files are provided here:
* tweet-geolocation-5m.tar.bz2, the actual dataset, which provides the tweet IDs and ground-truth country IDs needed to conduct further experiments.
* vectors-and-folds.tar.bz2, which is provided for reproducibility. With the information in this file, you should be able to reproduce the results we presented in the paper.
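
Both files are bzip2-compressed tar archives, so one way to inspect them before unpacking is Python's standard `tarfile` module. The sketch below only lists entry names. The internal layout of the archives is not described here, so the code makes no assumption about it, and the helper name `list_members` is hypothetical.

```python
import os
import tarfile


def list_members(archive_path):
    """Return the names of all entries in a bzip2-compressed tar archive."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        return archive.getnames()


# Assumes the archive was downloaded to the current directory; skipped otherwise.
archive_path = "tweet-geolocation-5m.tar.bz2"
if os.path.exists(archive_path):
    for name in list_members(archive_path):
        print(name)
```

Listing the entries first shows what the archive contains and how much will be written to disk before anything is extracted.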

Funding

PHEME FP7 project (grant no. 611233)

History