Virality Measures of "Data Tweets"

posted on 05.03.2020, 12:32 by Leslie Carr, Elena Simperl
This dataset consists of two files in TSV format derived from 16,754,250 tweets that were identified as containing different forms of "numeric data". The tweets come from an extended collection drawn from Twitter's 1% public sample over 11 months from September 2018.

Both files have a key column labelled "TweetID", which is the Twitter API ID that can be used to retrieve the full Twitter data (retrieval via TWARC is recommended).
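Since retrieval tools generally expect one tweet ID per line, it may help to extract the distinct TweetID values first. The following is a minimal sketch, assuming the key is the first column and that the file starts with a header row (set has_header=False if it does not). The function name and the sample rows are made up for illustration and are not part of the dataset.

```python
import csv
import io


def unique_tweet_ids(lines, has_header=True):
    """Return the distinct first-column values, in first-seen order."""
    # QUOTE_NONE: tweet substrings may contain quote characters that
    # should be read literally rather than treated as CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    seen = {}
    for row in reader:
        if row and row[0].strip():
            # IDs are kept as strings so they are never rounded or reformatted.
            seen.setdefault(row[0].strip(), None)
    return list(seen)


# Invented sample rows in the three-column layout of the occurrences file.
sample = io.StringIO(
    "TweetID\tNumericDataString\tNumericType\n"
    "1001\t500 billion\t[cardinal]\n"
    "1001\t24 years\t[time]\n"
    "1002\t3 days\t[time]\n"
)
ids = unique_tweet_ids(sample)
print("\n".join(ids))
```

With a real file, pass an open file handle instead of the StringIO object and write the resulting IDs to a text file, one per line.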

The file "datatweet-numeric-occurrences.txt" consists of three columns:
1 TweetID
2 NumericDataString - the actual substring from the tweet which was recognised as numeric e.g. "500 billion" or "24 years"
3 NumericType - one of a set of identified numeric types e.g. "[cardinal]" or "[time]".
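The three-column layout above can be read with the Python standard library. This is a sketch only: it assumes tab-separated fields in the order listed, a header row (has_header=True), and no quoting conventions in the file. The function names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def read_occurrences(lines, has_header=True):
    """Parse rows of (TweetID, NumericDataString, NumericType)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    rows = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        rows.append((row[0], row[1], row[2]))
    return rows


def count_types(rows):
    """Count how often each NumericType label occurs."""
    return Counter(numeric_type for _, _, numeric_type in rows)


# Invented sample rows; a tweet may contribute more than one row.
sample = io.StringIO(
    "TweetID\tNumericDataString\tNumericType\n"
    "1001\t500 billion\t[cardinal]\n"
    "1001\t24 years\t[time]\n"
    "1002\t3 days\t[time]\n"
)
rows = read_occurrences(sample)
type_counts = count_types(rows)
for numeric_type, n in type_counts.most_common():
    print(numeric_type, n)
```

Because one tweet can contain several numeric strings, TweetID is not necessarily unique in this file; grouping by TweetID before joining to the virality file avoids double counting.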

The "virality" associated with the tweets in which the numeric data has been found is given in the file "datatweet-virality.txt".
Its columns are as follows:
1 id of the tweet
2 retweet_count
3 favorite_count
4 followers_count (of the user who made the tweet)

If this tweet is a retweet of another (original) tweet, the following columns are non-empty:
5 id of the original tweet
6 favorite_count of the original tweet
7 followers_count of the original tweet's author

NB: if column 2 (retweet_count) is 0, columns 5-7 will be blank.
If column 2 is greater than 0, it contains the number of retweets of the original tweet, not the number of times that this retweet has itself been retweeted.
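The column rules above can be applied when reading the virality file. The sketch below is illustrative and rests on several assumptions: tab-separated fields in the order listed, a header row (has_header=True), blank cells represented as empty strings, and rows that may be shorter than seven fields if trailing blanks were dropped. The dictionary keys, function names, and sample rows are hypothetical.

```python
import csv
import io


def _to_int(text):
    """Convert a cell to int, or None if the cell is blank."""
    text = text.strip()
    return int(text) if text else None


def read_virality(lines, has_header=True):
    """Parse the seven-column virality rows into dictionaries."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    records = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        row = row + [""] * (7 - len(row))  # pad rows missing trailing blanks
        original_id = row[4].strip() or None
        records.append({
            "id": row[0].strip(),
            # For a retweet, this is the count for the original tweet.
            "retweet_count": _to_int(row[1]),
            "favorite_count": _to_int(row[2]),
            "followers_count": _to_int(row[3]),
            "is_retweet": original_id is not None,
            "original_id": original_id,
            "original_favorite_count": _to_int(row[5]),
            "original_followers_count": _to_int(row[6]),
        })
    return records


# Invented sample: one original tweet and one retweet of tweet 1999.
sample = io.StringIO(
    "id\tretweet_count\tfavorite_count\tfollowers_count\t"
    "orig_id\torig_favorite_count\torig_followers_count\n"
    "2001\t0\t0\t150\t\t\t\n"
    "2002\t12\t0\t80\t1999\t30\t5000\n"
)
records = read_virality(sample)
retweets = [r for r in records if r["is_retweet"]]
print(len(records), "rows,", len(retweets), "retweet(s)")
```

Testing column 5 for a value, rather than testing whether retweet_count is positive, keeps the retweet check tied directly to the presence of an original tweet ID.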


UKRI EPSRC EP/P025676/1 "Data Stories"