Virality Measures of "Data Tweets"

dataset

posted on 2020-03-05, 12:32 authored by Leslie Carr, Elena Simperl

This dataset consists of two files in TSV format derived from 16,754,250 tweets that were identified as containing different forms of "numeric data". The tweets come from an extended collection drawn from Twitter's 1% public sample over 11 months from September 2018.

Both files have a key column labelled "TweetID", which is the Twitter API ID that can be used to retrieve the full Twitter data (retrieval via twarc is recommended).
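As an illustration of preparing IDs for retrieval, the following minimal sketch collects the unique values of the "TweetID" column into a one-ID-per-line list, which is the input format that twarc's hydrate command reads. It assumes that the file starts with a header row containing "TweetID"; the function name and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io


def unique_tweet_ids(tsv_file, id_column="TweetID"):
    """Return the distinct tweet IDs in first-seen order.

    Assumes a tab-separated file whose first row is a header
    that includes `id_column` (an assumption, not confirmed here).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    seen = set()
    ids = []
    for row in reader:
        tweet_id = row[id_column]
        if tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "TweetID\tNumericDataString\tNumericType\n"
    "1001\t500 billion\t[cardinal]\n"
    "1001\t24 years\t[time]\n"
    "1002\t3 days\t[time]\n"
)

ids = unique_tweet_ids(sample)
id_list = "\n".join(ids) + "\n"  # one ID per line, ready to save to a file
print(id_list, end="")
```

IDs are kept as strings so that they are written out exactly as they appear in the file.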

The file "datatweet-numeric-occurrences.txt" consists of three columns:

1 TweetID

2 NumericDataString - the actual substring from the tweet which was recognised as numeric e.g. "500 billion" or "24 years"

3 NumericType - one of a set of identified numeric types e.g. "[cardinal]" or "[time]".
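A minimal sketch of reading this three-column layout and counting occurrences per NumericType is given below. It assumes tab-separated fields and an optional header row; the function name, the `has_header` flag, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_numeric_types(tsv_file, has_header=True):
    """Count rows per NumericType (third column) in a tab-separated file."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if one is present
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        tweet_id, numeric_string, numeric_type = row[:3]
        counts[numeric_type] += 1
    return counts


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "TweetID\tNumericDataString\tNumericType\n"
    "1001\t500 billion\t[cardinal]\n"
    "1001\t24 years\t[time]\n"
    "1002\t3 days\t[time]\n"
)

counts = count_numeric_types(sample)
for numeric_type, n in counts.most_common():
    print(numeric_type, n)
```

Quoting is disabled because the matched substrings come from free text and may contain quote characters.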

The "virality" associated with the tweets in which the numeric data has been found is given in the file "datatweet-virality.txt".

Its columns are as follows:

1 id of the tweet

2 retweet_count

3 favorite_count

4 followers_count (of the user who made the tweet)

If this tweet is a retweet of another (original) tweet, the following columns are non-empty:

5 id of the original tweet

6 favorite_count of the original tweet

7 followers_count of the original tweet's author

NB: if col 2 is 0, then cols 5-7 will be blank.

If col 2 > 0, then it contains the number of retweets of the original tweet, not the number of times that this retweet has been retweeted.
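The column rules above can be sketched as a small parser. This is an illustration under stated assumptions, not the authors' code: fields are assumed to be tab-separated in the order listed, blank cols 5-7 are taken to mean that the tweet is not a retweet, and all class, field, and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class ViralityRow(NamedTuple):
    tweet_id: str
    retweet_count: int
    favorite_count: int
    followers_count: int
    original_id: Optional[str]
    original_favorite_count: Optional[int]
    original_followers_count: Optional[int]

    @property
    def is_retweet(self) -> bool:
        # Cols 5-7 are filled only when the tweet is a retweet.
        return self.original_id is not None


def _optional_int(text: str) -> Optional[int]:
    text = text.strip()
    return int(text) if text else None


def parse_virality_line(line: str) -> ViralityRow:
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (7 - len(fields))  # pad rows that omit trailing blanks
    return ViralityRow(
        tweet_id=fields[0],
        retweet_count=int(fields[1]),
        favorite_count=int(fields[2]),
        followers_count=int(fields[3]),
        original_id=fields[4].strip() or None,
        original_favorite_count=_optional_int(fields[5]),
        original_followers_count=_optional_int(fields[6]),
    )


def retweet_count_subject(row: ViralityRow) -> str:
    """ID of the tweet that retweet_count describes.

    For a retweet, the count belongs to the original tweet.
    """
    return row.original_id if row.is_retweet else row.tweet_id


# Invented sample rows: an ordinary tweet and a retweet of tweet 900.
plain = parse_virality_line("1001\t0\t3\t250\t\t\t\n")
retweet = parse_virality_line("1002\t57\t0\t80\t900\t120\t15000\n")

print(plain.is_retweet, retweet_count_subject(plain))
print(retweet.is_retweet, retweet_count_subject(retweet))
```

Keeping the original tweet's ID next to the count makes it harder to attribute the retweet count to the wrong tweet when aggregating virality per original tweet.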

Funding

UKRI EPSRC EP/P025676/1 "Data Stories"

Keywords

twitter data natural language numeric information Natural Language Processing Information Systems

Licence

CC BY 4.0