Discovering health topics in social media using topic models
Michael Paul
10.6084/m9.figshare.1007712.v5
https://figshare.com/articles/dataset/Discovering_health_topics_in_social_media_using_topic_models/1007712
<p>Data set for M. Paul and M. Dredze, "Discovering health topics in social media using topic models".</p>
<p>This includes the set of tweets used in the experiments and the words associated with ailments discovered by the Ailment Topic Aspect Model (ATAM).</p>
<p>Contact: Michael Paul (mpaul39@gmail.com)<br>Released June 26, 2014</p>
<p>atam.topwords.csv<br>- The most probable words for each ailment. The first column is the ailment ID. The second column indicates whether the word is a general (G), symptom (S), or treatment (T) word. The third column is the word. The fourth column is the probability. Words are listed in descending order of probability until 90% of the probability mass has accumulated for each ailment or until probabilities drop below 1.0e-4.</p>
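<p>The four-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the original release: it assumes the file is comma-delimited with no header row, the function name <code>load_topwords</code> is hypothetical, and the sample rows are invented purely to illustrate the format.</p>

```python
import csv
import io
from collections import defaultdict

# Word-type codes as described for the second column of atam.topwords.csv.
WORD_TYPES = {"G": "general", "S": "symptom", "T": "treatment"}


def load_topwords(lines):
    """Group (word type, word, probability) triples by ailment ID.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumes comma-delimited rows with no header (an assumption,
    not confirmed by the dataset description).
    """
    ailments = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        ailment_id, word_type, word, prob = (field.strip() for field in row[:4])
        ailments[ailment_id].append(
            (WORD_TYPES.get(word_type, word_type), word, float(prob))
        )
    return dict(ailments)


# Invented sample rows, for illustration only (not taken from the dataset).
sample = io.StringIO("1,G,sick,0.05\n1,S,cough,0.03\n2,T,advil,0.02\n")
topwords = load_topwords(sample)
```

<p>With the real file, the same function could be called as <code>load_topwords(open("atam.topwords.csv", newline=""))</code>. Ailment IDs are kept as strings so that they can be matched directly against the second column of the tweet files.</p>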
<p>atam.tweets.x.csv (for x=[0-9])<br>- The tweets used in the study. The first column is the tweet ID. The second column gives the ID of the ailment sampled for that tweet. (See the atam.topwords.csv file for the most probable words associated with each ailment ID.) Full tweets can be downloaded by tweet ID through the Twitter API (https://dev.twitter.com/docs/api/1.1).</p>
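<p>As an illustration of the two-column tweet files, the following sketch tallies how many tweets were assigned to each ailment ID. It assumes comma-delimited rows with no header, which the description does not confirm; the function name <code>count_tweets_per_ailment</code> and the sample rows are hypothetical.</p>

```python
import csv
import io
from collections import Counter

# The ten tweet files named in the description: atam.tweets.0.csv ... atam.tweets.9.csv
TWEET_FILES = [f"atam.tweets.{x}.csv" for x in range(10)]


def count_tweets_per_ailment(lines, counts=None):
    """Count tweets per ailment ID from rows of (tweet ID, ailment ID).

    Pass the same Counter back in via `counts` to accumulate
    totals across several files.
    """
    counts = Counter() if counts is None else counts
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        counts[row[1].strip()] += 1
    return counts


# Invented sample rows, for illustration only (not real tweet IDs).
sample = io.StringIO("1001,3\n1002,3\n1003,7\n")
counts = count_tweets_per_ailment(sample)
```

<p>To cover the full dataset, one could loop over <code>TWEET_FILES</code>, open each file, and pass the same counter to every call. Tweet IDs are left as strings here, since they are only needed as lookup keys for the Twitter API.</p>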
<p>keywords.txt<br>- The set of 269 health-related keywords used to filter the Twitter stream that forms part of our dataset.</p>
<p>keywords_x.txt (for x={diseases,symptoms,treatments})<br>- The set of approximately 20,000 keyphrases crawled from wrongdiagnosis.com describing the names of diseases, symptoms, and treatments and medications. These keyword lists are used to create input for ATAM (which requires phrases to be labeled as symptoms or treatments), and also to initially filter our dataset when constructing our health classifiers.</p>
2014-06-26 17:53:41
social media
Public health
Twitter
topic modeling
Applied Computer Science