Discovering health topics in social media using topic models

dataset

posted on 2014-06-26, 17:53 authored by Michael Paul

Data set for M. Paul and M. Dredze, "Discovering health topics in social media using topic models".

This includes the set of tweets used in the experiments, and the words associated
with ailments discovered by the Ailment Topic Aspect Model (ATAM).

Contact: Michael Paul (mpaul39@gmail.com)
Released June 26, 2014

atam.topwords.csv
- The most probable words for each ailment. The first column is the ailment ID.
The second column indicates if it is a general (G), symptom (S), or treatment (T) word.
The third column is the word. The fourth column is the probability. Words are shown
in descending order of probability until 90% of the probability mass is accumulated
for each ailment or until probabilities drop below 1.0e-4.
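
As an illustration of the four-column layout described above, here is a minimal Python sketch that groups the rows by ailment ID. It assumes the file is comma-delimited with no quoting surprises, and it skips any row whose fourth column is not a number (for example, a header row, if one exists). The function name `read_topwords` and the sample rows are made up for this example and are not taken from the actual file.

```python
import csv
from collections import defaultdict


def read_topwords(rows):
    """Group (kind, word, probability) triples by ailment ID.

    `rows` is any iterable of text lines, e.g. an open file object.
    """
    by_ailment = defaultdict(list)
    for row in csv.reader(rows):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        ailment_id, kind, word, prob = row[0], row[1], row[2], row[3]
        try:
            p = float(prob)
        except ValueError:
            continue  # not a data row (assumption: e.g. a header line)
        by_ailment[ailment_id].append((kind, word, p))
    return dict(by_ailment)


# Made-up rows in the documented column order: ailment ID, G/S/T, word, probability.
sample = [
    "1,G,sick,0.05",
    "1,S,cough,0.04",
    "1,T,tylenol,0.01",
    "2,S,headache,0.08",
]
topwords = read_topwords(sample)

# Symptom words listed for ailment "1".
symptoms = [word for kind, word, _ in topwords["1"] if kind == "S"]

# With the real file (path assumed):
# with open("atam.topwords.csv", newline="") as f:
#     topwords = read_topwords(f)
```

Because each word list is truncated as described above, the probabilities read for an ailment should not be expected to sum to 1.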

atam.tweets.x.csv (for x=[0-9])
- The tweets used in the study. The first column is the tweet ID. The second column
indicates the ailment ID for the ailment sampled for that tweet.
(See the atam.topwords.csv file for the most probable words associated with each ailment ID.)
Full tweets can be downloaded using the tweet ID through the Twitter API
(https://dev.twitter.com/docs/api/1.1).
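
The ten tweet files can be read the same way. The sketch below is one possible approach, assuming the files are comma-delimited and sit together in one directory. It collects (tweet ID, ailment ID) pairs and counts tweets per ailment. The helper names and the sample IDs are hypothetical, and fetching the tweet text from the Twitter API is deliberately left out.

```python
import csv
from collections import Counter
from pathlib import Path


def tweet_file_names():
    """File names atam.tweets.0.csv through atam.tweets.9.csv."""
    return [f"atam.tweets.{x}.csv" for x in range(10)]


def read_tweet_ailments(rows):
    """Return (tweet_id, ailment_id) pairs from an iterable of text lines.

    IDs are kept as strings so that large tweet IDs are never rounded.
    Rows whose first column is not numeric are skipped (assumption).
    """
    pairs = []
    for row in csv.reader(rows):
        if len(row) >= 2 and row[0].strip().isdigit():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def load_all(directory):
    """Read every tweet file that exists in `directory`."""
    pairs = []
    for name in tweet_file_names():
        path = Path(directory) / name
        if path.exists():
            with path.open(newline="") as f:
                pairs.extend(read_tweet_ailments(f))
    return pairs


# Made-up rows: tweet ID, then the ailment ID sampled for that tweet.
sample = ["1234567890,3", "1234567891,3", "1234567892,7"]
counts = Counter(ailment for _, ailment in read_tweet_ailments(sample))
```

The ailment IDs in `counts` can then be looked up in atam.topwords.csv to see which words characterize each ailment.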

keywords.txt
- The set of 269 health-related keywords used to filter the Twitter stream that makes up part of our dataset.

keywords_x.txt (for x={diseases,symptoms,treatments})
- The set of approximately 20,000 keyphrases crawled from wrongdiagnosis.com describing
the names of diseases, symptoms, and treatments and medications. These keyword lists are
used to create input for ATAM (which requires phrases to be labeled as symptoms or treatments),
and also to initially filter our dataset when constructing our health classifiers.
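
To show how phrase lists like these could be used to label text, here is a rough Python sketch. It assumes each keywords_x.txt file holds one phrase per line, and it matches phrases on whitespace boundaries after lowercasing. This is a simple stand-in for illustration, not the preprocessing used in the paper. The function names and sample phrases are hypothetical.

```python
def load_phrases(lines):
    """Build a set of lowercased phrases from an iterable of text lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def label_text(text, lists):
    """Return sorted (label, phrase) pairs for every phrase found in `text`.

    `lists` maps a label such as "symptoms" to a set of phrases.
    Matching is on whole whitespace-delimited words only, so punctuation
    attached to a word will prevent a match.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    found = []
    for label, phrases in lists.items():
        for phrase in phrases:
            if " " + phrase + " " in padded:
                found.append((label, phrase))
    return sorted(found)


# Made-up phrase lists standing in for keywords_symptoms.txt and
# keywords_treatments.txt.
lists = {
    "symptoms": load_phrases(["Sore throat\n", "headache\n"]),
    "treatments": load_phrases(["ibuprofen\n"]),
}
labels = label_text("Took ibuprofen for my sore throat", lists)

# With the real files (paths assumed):
# with open("keywords_symptoms.txt") as f:
#     lists["symptoms"] = load_phrases(f)
```

Scanning every phrase against every text is slow for lists of this size; an index keyed on the first word of each phrase would be a natural refinement.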