Word counts per US county in geo-tagged Tweets posted between 2015 and 2021
The zip file contains fourteen Parquet [1] files, one of each of the following two kinds for every year from 2015 to 2021 inclusive:
- region_counts: for every word found, gives how many times it appeared, regardless of capitalization ("count" column), how many times it appeared with at least one capitalized letter ("count_upper"), in how many different counties it appeared ("nr_cells"), and whether we considered it to be a proper noun ("is_proper")
- raw_cell_counts: gives the count for every word by county, regardless of capitalization.
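As a rough illustration of how the two kinds of files relate, the per-county counts can in principle be summed back into a per-word total and a number of counties. The sketch below is not taken from the project's code. The (word, county id, count) row layout, the rule that only strictly positive counts mark a county as containing a word, and the helper names `load_table` and `aggregate_cells` are all assumptions made for illustration. Check the actual column names in the files before relying on it.

```python
from collections import defaultdict


def load_table(path):
    """Read one Parquet file into a pandas DataFrame.

    Needs pandas plus a Parquet engine such as pyarrow or fastparquet.
    """
    import pandas as pd  # imported lazily so the helper below runs without pandas

    return pd.read_parquet(path)


def aggregate_cells(rows):
    """Sum per-county counts into per-word totals.

    `rows` is an iterable of (word, county_id, count) tuples. This layout is an
    assumption, not the documented schema of the raw_cell_counts files.
    Returns {word: {"count": total, "nr_cells": counties with count > 0}}.
    """
    totals = defaultdict(int)
    counties = defaultdict(set)
    for word, county_id, count in rows:
        if count > 0:
            totals[word] += count
            counties[word].add(county_id)
    return {w: {"count": totals[w], "nr_cells": len(counties[w])} for w in totals}


# Toy rows standing in for a raw_cell_counts table (made-up values).
toy_rows = [
    ("coffee", "06037", 12),
    ("coffee", "36061", 8),
    ("y'all", "48201", 5),
]
summary = aggregate_cells(toy_rows)
print(summary["coffee"])  # {'count': 20, 'nr_cells': 2}
```

With the real data, a DataFrame returned by `load_table` could be passed in via `df.itertuples(index=False, name=None)`, provided it has exactly those three columns in that order, which is again an assumption.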
These counts were obtained from geo-tagged Tweets posted in those years within the contiguous US, which were collected through Twitter's streaming API, more specifically using the “statuses/filter” endpoint [2]. See the project's paper for more details on the methodology, and the code repository to reproduce the analysis.
The two text files are our lists of excluded word forms.