Socioeconomic status classification of social media users

Version 2 2015-12-12, 12:34

Version 1 2015-12-08, 20:30

dataset

posted on 2015-12-12, 12:34 authored by Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou, Ingemar J. Cox

This data set accompanies the following paper:

Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou and Ingemar J. Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language. Proceedings of the 38th European Conference on Information Retrieval (ECIR), 2016.

Data description

- Time period covered: February 1, 2014 to March 21, 2015

- data_matrix.csv: Main input file. Each line represents a user (1342 users in total). See below for the interpretation of the dimensions (columns) related to textual content. Dimensions 1284 to 1287 contain, respectively, the ratios of user replies, mentions (of other accounts), retweets (of tweets from other accounts) and unique mentions (of other accounts) over the total number of tweets of a particular user. Dimensions 1288 to 1291 contain, respectively, log(followers + 1), log(followees + 1), log(listings + 1) and the impact score of a particular user. The definition of the impact score has been adopted from the following paper: V. Lampos, N. Aletras, D. Preotiuc-Pietro and T. Cohn. Predicting and Characterising User Impact on Twitter. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 405–413, 2014.

- sec_labels.txt: Socioeconomic status class labels for each user; 1, 2 and 3 denote the upper, middle and lower socioeconomic classes, respectively. Each line of sec_labels.txt corresponds to the same line of data_matrix.csv.

- voc_1grams.txt: Vocabulary index of frequent 1-grams extracted from the users' tweets. Represents dimensions 1 to 560 from data_matrix.csv.

- voc_bio_1grams.txt: Vocabulary index of 1-grams in the bio description of the users. Represents dimensions 561 to 786 from data_matrix.csv.

- voc_bio_2grams.txt: Vocabulary index of 2-grams in the bio description of the users. Represents dimensions 787 to 1083 from data_matrix.csv.

- voc_clusters.txt: Vocabulary index used in the formation of clusters.

- voc_clusters_ids.csv: Each line contains the 1-gram ids (line numbers) from voc_clusters.txt that are members of a cluster. In total we have derived 200 clusters, represented by dimensions 1084 to 1283 in data_matrix.csv.
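
The column layout described above can be sketched in Python as follows. This is only an illustrative sketch, not code distributed with the data set: the block names, the helper functions and the assumptions that data_matrix.csv is comma-separated, has no header line and holds exactly 1291 numeric values per row are ours and should be checked against the actual files.

```python
import csv

# 1-based, inclusive column ranges as given in the data description.
# The block names are hypothetical labels chosen for this sketch.
FEATURE_BLOCKS = {
    "tweet_1grams": (1, 560),               # voc_1grams.txt
    "bio_1grams": (561, 786),               # voc_bio_1grams.txt
    "bio_2grams": (787, 1083),              # voc_bio_2grams.txt
    "clusters": (1084, 1283),               # voc_clusters_ids.csv
    "behaviour_ratios": (1284, 1287),       # replies, mentions, retweets, unique mentions
    "log_counts_and_impact": (1288, 1291),  # followers, followees, listings, impact
}
N_DIMS = 1291  # assumed total number of columns


def block_slice(name):
    """Convert a 1-based inclusive range into a 0-based Python slice."""
    first, last = FEATURE_BLOCKS[name]
    return slice(first - 1, last)


def split_row(row):
    """Split one user's feature vector into the named blocks."""
    if len(row) != N_DIMS:
        raise ValueError(f"expected {N_DIMS} values, got {len(row)}")
    return {name: row[block_slice(name)] for name in FEATURE_BLOCKS}


def load_users(matrix_path="data_matrix.csv", labels_path="sec_labels.txt"):
    """Pair each user's feature blocks with the class label on the same line.

    Assumes comma-separated numeric rows with no header line.
    """
    with open(matrix_path, newline="") as f:
        rows = [[float(v) for v in r] for r in csv.reader(f) if r]
    with open(labels_path) as f:
        labels = [int(line) for line in f if line.strip()]
    if len(rows) != len(labels):
        raise ValueError("data_matrix.csv and sec_labels.txt differ in length")
    return [(split_row(r), y) for r, y in zip(rows, labels)]
```

Calling load_users() from the directory that holds the two files would, under these assumptions, return one (feature blocks, label) pair per user, where label 1, 2 or 3 is the upper, middle or lower class.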