Discovering health topics in social media using topic models

2014-06-26T17:53:41Z (GMT) by Michael Paul
<p>Data set for M. Paul and M. Dredze, "Discovering health topics in social media using topic models".</p> <p>This includes the set of tweets used in the experiments and the words associated with ailments discovered by the Ailment Topic Aspect Model (ATAM).</p> <p>Contact: Michael Paul (<br>Released June 26, 2014</p> <p>atam.topwords.csv<br>- The most probable words for each ailment. The first column is the ailment ID. The second column indicates whether the word is a general (G), symptom (S), or treatment (T) word. The third column is the word. The fourth column is the probability. Words are listed in descending order of probability until 90% of the probability mass is accumulated for each ailment or until probabilities drop below 1.0e-4.</p> <p>atam.tweets.x.csv (for x=[0-9])<br>- The tweets used in the study. The first column is the tweet ID. The second column indicates the ailment ID for the ailment sampled for that tweet. (See the atam.topwords.csv file for the most probable words associated with each ailment ID.)<br>Full tweets can be downloaded using the tweet ID through the Twitter API<br>(</p> <p>keywords.txt<br>- The set of 269 health-related keywords used in our keyword-filtered Twitter stream as part of our dataset.</p> <p>keywords_x.txt (for x={diseases,symptoms,treatments})<br>- The set of approximately 20,000 keyphrases crawled from describing the names of diseases, symptoms, and treatments and medications. These keyword lists are used to create input for ATAM (which requires phrases to be labeled as symptoms or treatments), and also to initially filter our dataset when constructing our health classifiers.</p>
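<p>The column layouts described above can be read with a few lines of Python. This is a minimal sketch, not part of the released data set: it assumes the files are plain comma-separated text without quoting surprises, and the function names <code>read_topwords</code> and <code>read_tweet_labels</code> are hypothetical helpers introduced here for illustration. The sample rows are made up and are not taken from the data.</p>

```python
import csv
import io
from collections import defaultdict

# Labels for the second column of atam.topwords.csv, as described above.
WORD_TYPES = {"G": "general", "S": "symptom", "T": "treatment"}


def read_topwords(lines):
    """Group rows of (ailment ID, word type, word, probability) by ailment ID."""
    by_ailment = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        ailment_id, word_type, word, prob = (field.strip() for field in row)
        try:
            probability = float(prob)
        except ValueError:
            continue  # e.g. a header row, if the file turns out to have one
        label = WORD_TYPES.get(word_type, word_type)
        by_ailment[ailment_id].append((label, word, probability))
    return dict(by_ailment)


def read_tweet_labels(lines):
    """Map tweet ID -> ailment ID from rows of (tweet ID, ailment ID)."""
    labels = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        labels[row[0].strip()] = row[1].strip()
    return labels


# Made-up rows for illustration only; they are not taken from the data set.
sample_topwords = io.StringIO("1,G,sick,0.05\n1,S,cough,0.03\n2,T,aspirin,0.02\n")
sample_tweets = io.StringIO("1000001,1\n1000002,2\n")

topwords = read_topwords(sample_topwords)
labels = read_tweet_labels(sample_tweets)
print(topwords["1"])
print(labels["1000002"])
```

<p>With the real files, the same functions could be given an open file object, for example <code>open("atam.topwords.csv", newline="")</code>. IDs are kept as strings so that tweet IDs can be joined against the ailment IDs without any numeric conversion.</p>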