Predicting Stroke Risk Dataset
This study treats stroke as a critical global health issue and takes a data-driven approach to improving early risk prediction and intervention. Using a dataset of 5,110 records, it combines statistical analysis, machine learning (ML) classification, clustering, and survival modeling to identify key predictors of stroke. Descriptive analysis highlights age, average glucose level, BMI, hypertension, and heart disease as the most significant risk factors: stroke prevalence reaches 13.25% among hypertensive individuals and 17.03% among those with heart disease. Former and current smokers also show elevated stroke risk. Cluster analysis on PCA and t-SNE projections reveals high-risk groups characterized by older age and high glucose levels. In the ML evaluation, XGBoost provides the best precision-recall balance, while Naïve Bayes achieves the highest recall (0.404), making it the most sensitive to stroke cases. Feature importance analysis consistently ranks glucose, BMI, and age as the dominant predictors, and XGBoost also assigns high weight to cardiovascular conditions. Survival analysis with Kaplan-Meier estimates and Cox regression shows that stroke risk increases sharply after age 60 and that hypertension is associated with a 31.9% higher risk. The results underscore the value of early screening and targeted intervention, and point to future improvements through class-balancing techniques and real-time clinical tools.
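Because the abstract compares models by precision and recall rather than accuracy, it may help to see how those metrics are computed from a confusion matrix. The following is a minimal sketch, not the study's evaluation code: the function name `precision_recall_f1` and the counts passed to it are hypothetical and chosen only for illustration. They are not results from the dataset.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (stroke) class.

    tp: stroke cases correctly flagged
    fp: non-stroke cases incorrectly flagged
    fn: stroke cases the model missed
    """
    # Precision: of all cases flagged as stroke, the fraction that really were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): of all true stroke cases, the fraction that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not taken from the study).
precision, recall, f1 = precision_recall_f1(tp=20, fp=60, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On a dataset where stroke cases are a small minority, a model can score high accuracy while missing most positives, which is why recall on the stroke class is reported separately.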