
Twitter News Dataset

posted on 28.06.2016, 16:20 by Mauricio Quezada, jkalyana@ucsd.edu, bpoblete@dcc.uchile.cl, gert@ece.ucsd.edu

This dataset consists of 5234 news events obtained from Twitter, along with the tweets that discuss them.

The file tweets.csv.gz contains a CSV file, called tweets.csv, with the IDs of all the tweets belonging to each event in events.csv. Each line of the file has the following format:

tweet_ID, event_ID


  • tweet_ID is a long integer giving the Twitter ID of the tweet. All the information about the tweet can be retrieved through the Twitter REST API.
  • event_ID is the ID of the event that the tweet belongs to.
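
As an illustration of the format above, the following is a minimal sketch that groups tweet IDs by event. The function name load_tweets_by_event is hypothetical, and the code assumes that fields may be followed by a space after the comma and that the file may or may not start with a header row, since neither point is stated here.

```python
import csv
import gzip
from collections import defaultdict


def load_tweets_by_event(path):
    """Map each event_ID to the list of tweet_IDs assigned to it."""
    tweets_by_event = defaultdict(list)
    # "rt" opens the gzip stream as text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, skipinitialspace=True):
            # Skip blank lines and a possible header row (assumption: the
            # description does not say whether the file has one).
            if len(row) < 2 or not row[0].strip().isdigit():
                continue
            tweet_id, event_id = int(row[0]), int(row[1])
            tweets_by_event[event_id].append(tweet_id)
    return dict(tweets_by_event)
```

Calling load_tweets_by_event("tweets.csv.gz") would then return a dictionary whose keys are event IDs. Python integers have arbitrary precision, so the long tweet IDs are kept exactly.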

The file events.csv.gz contains a CSV file, called events.csv, with all the news events captured from Twitter from August 2013 to June 2014. Each line of the file has the following format:

event_ID, date, total_keywords, total_tweets, keywords


  • event_ID is an integer that identifies the corresponding event. There are 5234 events, so event_ID ranges from 1 to 5234.
  • date is the date of the event or connected component. The format is YYYY-MM-DD.
  • total_keywords is an integer indicating how many keywords are in the event or connected component.
  • total_tweets is an integer indicating how many tweets belong to this event.
  • keywords is a string containing total_keywords keywords, separated by semicolons.
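
The field descriptions above can be turned into a small parser. This is only a sketch: the names Event, parse_event_row and read_events are hypothetical, the column order (event_ID, date, total_keywords, total_tweets, keywords) is assumed to follow the order of the list above, and keywords are assumed not to contain commas.

```python
import csv
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Event:
    event_id: int
    day: date
    total_keywords: int
    total_tweets: int
    keywords: list


def parse_event_row(row):
    # ASSUMED column order: event_ID, date, total_keywords, total_tweets, keywords.
    event_id, day, total_keywords, total_tweets, keywords = (f.strip() for f in row)
    return Event(
        event_id=int(event_id),
        day=datetime.strptime(day, "%Y-%m-%d").date(),  # YYYY-MM-DD
        total_keywords=int(total_keywords),
        total_tweets=int(total_tweets),
        # Keywords are separated by semicolons.
        keywords=[k.strip() for k in keywords.split(";") if k.strip()],
    )


def read_events(lines):
    """Parse an iterable of CSV lines, for example an open text file."""
    events = []
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines, malformed rows and a possible header row.
        if len(row) != 5 or not row[0].strip().isdigit():
            continue
        events.append(parse_event_row(row))
    return events
```

A simple sanity check after loading is that len(event.keywords) equals event.total_keywords for every event.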

The files cluster_labels.txt and time_resolutions.txt contain the cluster labels for each event and the time resolutions learned from all events, respectively.

  • cluster_labels.txt contains one integer per line, from 0 to 19. The label on line i corresponds to the event whose event_ID is i.
  • time_resolutions.txt contains one floating-point number per line, giving a time resolution, in minutes, learned from all events. There are 20 numbers in the file, in increasing order, with at most 13 digits after the decimal point.
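
The two text files can be read with a few lines of code. The sketch below uses hypothetical function names. Note in particular that resolution_for_event rests on an assumption that is not stated in this description: that cluster label k refers to the k-th value (counting from 0) in time_resolutions.txt.

```python
def load_cluster_labels(lines):
    """Return {event_ID: cluster label}; line i (counting from 1) is event i."""
    return {i: int(line) for i, line in enumerate(lines, start=1) if line.strip()}


def load_time_resolutions(lines):
    """Return the time resolutions, in minutes, in file order."""
    return [float(line) for line in lines if line.strip()]


def resolution_for_event(event_id, labels, resolutions):
    # ASSUMPTION (not confirmed by the dataset description): label k selects
    # the k-th entry, 0-based, of time_resolutions.txt.
    return resolutions[labels[event_id]]
```

Both loaders accept any iterable of lines, so they can be called as load_cluster_labels(open("cluster_labels.txt")) and load_time_resolutions(open("time_resolutions.txt")).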


National Science Foundation CCF 0830535; National Science Foundation, IIS 1054960; Fondo Nacional de Desarrollo Científico y Tecnológico, 11121511; Millennium Nucleus Center for Semantic Web Research, NC120004; CONICYT 2015/21151445