Word counts per US county in geo-tagged Tweets posted between 2015 and 2021

Version 2 2023-03-23, 16:34

Version 1 2023-02-28, 10:07

dataset

posted on 2023-03-23, 16:34 authored by Thomas Louf

The zip file contains fourteen Parquet [1] files of two kinds, one of each kind for each of the seven years from 2015 to 2021 inclusive:

- region_counts: for every word found, gives how many times it appeared, regardless of capitalization ("count" column), how many times it appeared with at least one capitalized letter ("count_upper"), in how many different counties it appeared ("nr_cells"), and whether we considered it to be a proper noun ("is_proper")

- raw_cell_counts: gives the count for every word by county, regardless of capitalization.
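As an illustration of how the region_counts columns described above might be used, here is a minimal sketch in Python with pandas. It makes several assumptions that this description does not confirm: that words are stored as the table index, that "is_proper" is a boolean column, and that the files can be read with pandas.read_parquet (which needs pyarrow or fastparquet installed). The helper names, the file path argument, and the toy words and numbers are hypothetical.

```python
import pandas as pd


def load_region_counts(path):
    """Read one region_counts Parquet file (the path is supplied by the user)."""
    return pd.read_parquet(path)


def common_words(region_counts, min_cells=2):
    """Keep words not flagged as proper nouns that appear in at least
    `min_cells` counties, and add the share of capitalized occurrences."""
    mask = ~region_counts["is_proper"] & (region_counts["nr_cells"] >= min_cells)
    kept = region_counts[mask].copy()
    kept["upper_share"] = kept["count_upper"] / kept["count"]
    return kept.sort_values("count", ascending=False)


# Toy table with the documented columns; words and numbers are made up.
toy = pd.DataFrame(
    {
        "count": [120, 45, 30],
        "count_upper": [12, 45, 3],
        "nr_cells": [10, 4, 1],
        "is_proper": [False, True, False],
    },
    index=pd.Index(["coffee", "texas", "hella"], name="word"),
)
result = common_words(toy)
```

With a real file, the same filter would be applied to the output of load_region_counts, after checking the actual column types and whether the word is an index or a regular column.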

These counts were obtained from geo-tagged Tweets posted in those years within the contiguous US, which were collected through the streaming API of Twitter, more specifically using the “statuses/filter” endpoint [2]. See the project's paper for more details on the methodology, and the code repository to reproduce the analysis.

The two text files are our lists of excluded word forms.

Funding

This work was partially supported by the Agencia Estatal de Investigación of Spain (MCIN/AEI/10.13039/501100011033) and the Fondo Europeo de Desarrollo Regional (FEDER, UE) under Project PACSS (RTI2018-093732-B-C22), the María de Maeztu Program for units of Excellence in R&D, grant MDM-2017-0711 of MCIN/AEI/10.13039/501100011033, by the Government of the Balearic Islands CAIB Grant No. PDR2020/51, and by the Arts and Humanities Research Council (UK), the Economic and Social Research Council (UK), Jisc (UK) (Jisc grant reference number 3154), and the Institute of Museum and Library Services (US), as part of the Digging into Data Challenge (Round 3).

Keywords

Computational Linguistics; word frequencies; Twitter; United States; counties; English Language; Language in Time and Space (incl. Historical Linguistics, Dialectology); Language in Culture and Society (Sociolinguistics); Language, Communication and Culture not elsewhere classified; Communication Technology and Digital Media Studies; Social and Cultural Geography

Licence

CC BY 4.0
