RECOGNIZING HEALTH CONCEPTS IN TWITTER DATA USING LARGE LANGUAGE MODEL’S

thesis
posted on 2025-05-08, 20:44, authored by Soniya Sagar Chavan

This thesis presents a structured framework leveraging large language models (LLMs)—GPT-4-0613 via LangChain, GPT-4 Turbo, and Gemini 2.0 Flash—for extracting, normalizing, and categorizing COVID-19 symptoms from informal Twitter posts. Using a pre-annotated dataset of 635 tweets as ground truth, the study evaluates each model’s ability to identify symptoms and temporal references expressed through varied, often non-clinical language.

To address LLM non-determinism, the framework introduces a consensus mechanism across three inference runs per model. Outputs are semantically matched, normalized, and categorized using prompt-driven Gemini 2.0 Flash models to ensure consistency across all stages. The evaluation metrics include accuracy, precision, recall, and F1-score, with GPT-4-0613 demonstrating the highest overall performance.
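As a rough illustration of the consensus and scoring steps described above, the following minimal Python sketch keeps a symptom only when it appears in at least two of three runs, then scores the result against a gold annotation. The abstract does not state the voting rule or the matching procedure, so the majority threshold, the exact-string matching after lowercasing, and the names `normalize`, `consensus`, and `precision_recall_f1` are all assumptions made for this example; the thesis itself uses semantic matching driven by Gemini 2.0 Flash prompts.

```python
from collections import Counter


def normalize(symptom: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for semantic matching)."""
    return " ".join(symptom.lower().split())


def consensus(runs, min_votes=2):
    """Keep symptoms reported by at least `min_votes` runs (assumed rule)."""
    votes = Counter()
    for run in runs:
        # Use a set so a symptom counts at most once per run.
        votes.update({normalize(s) for s in run})
    return {s for s, n in votes.items() if n >= min_votes}


def precision_recall_f1(predicted: set, gold: set):
    """Set-based precision, recall, and F1 for one tweet or one corpus."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative inputs only, not data from the thesis.
runs = [["Fever", "cough"], ["fever", "loss of  smell"], ["cough", "fever"]]
agreed = consensus(runs)
scores = precision_recall_f1(agreed, {"fever", "cough", "fatigue"})
```

Here "loss of smell" is dropped because only one run reported it, which is the effect a consensus step is meant to have on unstable outputs.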

The study further visualizes results through a 3D symptom-day-category data cube to support trend analysis. Findings highlight the potential of LLMs, when combined with prompt engineering and ensemble strategies, to enhance public health surveillance from social media data streams. This reproducible pipeline offers a scalable solution for timely health monitoring and can generalize to other diseases and platforms.
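One way to picture the symptom-day-category data cube is as a table of counts keyed by those three dimensions. The sketch below is an assumption about the general idea, not the thesis's implementation: the record layout, the integer day index, the category labels, and the helper names `build_cube`, `slice_by_day`, and `roll_up_category` are hypothetical.

```python
from collections import Counter


def build_cube(records):
    """Count mentions for each (symptom, day, category) cell."""
    cube = Counter()
    for symptom, day, category in records:
        cube[(symptom, day, category)] += 1
    return cube


def slice_by_day(cube, day):
    """Fix the day dimension and return counts per (symptom, category)."""
    return {(s, c): n for (s, d, c), n in cube.items() if d == day}


def roll_up_category(cube):
    """Aggregate over symptom and day, leaving totals per category."""
    totals = Counter()
    for (_, _, category), n in cube.items():
        totals[category] += n
    return totals


# Illustrative records only, not data from the thesis.
records = [
    ("fever", 1, "systemic"),
    ("fever", 1, "systemic"),
    ("cough", 1, "respiratory"),
    ("cough", 3, "respiratory"),
]
cube = build_cube(records)
day_one = slice_by_day(cube, 1)
by_category = roll_up_category(cube)
```

Slicing and rolling up along one dimension at a time is what makes a cube of this kind convenient for trend analysis.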

Degree Type

  • Master of Science

Department

  • Computer and Information Technology

Campus location

  • Hammond

Advisor/Supervisor/Committee Chair

Keyuan Jiang

Additional Committee Member 2

Ashok Vardhan Raja

Additional Committee Member 3

George Stefanek