Stroke Risk Prediction | Jay Skaria's Portfolio

Machine Learning Approaches for Predicting Stroke Risk in Primary Care

Overview:
Evaluated machine learning models for predicting stroke risk in adult primary care patients using EMR data. The project aimed to support early intervention and prevention strategies by identifying high-risk individuals automatically within clinical workflows.

Tools & Technologies:

Google Colab, Python, scikit-learn, pandas, numpy, matplotlib, SHAP, logistic regression, random forest, XGBoost, Neural Network, Naive Bayes

Background:

Stroke is a leading cause of mortality and long-term disability worldwide, imposing a significant burden on patients, healthcare systems, and economies (Alanazi et al., 2021). In Canada, stroke accounts for an estimated $3.6 billion in yearly costs and nearly 14,000 annual deaths, and it is a major driver of healthcare utilization (Connected by the Numbers, 2023). According to recent research, while stroke deaths have declined, stroke incidence is rising among younger adults, highlighting the need for better prevention strategies (Scott et al., 2022). Early identification of individuals at heightened risk is critical for timely intervention and prevention. Traditional risk assessment tools, such as the Framingham Stroke Risk Profile, provide estimates of stroke probability but often lack the granularity needed for individualized patient management (Chahine et al., 2023). They also often require manual calculation by providers, which limits their widespread adoption. Advancements in machine learning (ML) offer an opportunity to enhance stroke risk prediction through more precise, personalized, and automated risk assessments.

Research Questions:

1. For an adult primary care patient in Canada, can we calculate a ‘stroke risk indicator’ (probability of stroke) based on commonly collected primary care data (e.g., age, sex, diagnoses, laboratory values, etc.)?

2. Which modifiable health factor(s) offer the biggest opportunity to reduce a patient’s stroke risk?

Rationale:

Predicting stroke risk using primary care data is essential for proactive management. Many risk factors for stroke, including hypertension, diabetes, dyslipidemia, and smoking, are already documented in electronic medical records (EMRs), yet they are often underutilized for real-time risk stratification. Further, the primary care setting is the ideal care environment for screening and advanced mitigation strategies. An ML-based stroke risk model could allow primary care providers to automatically synthesize the multiplicity of patient-level data in the EMR and systematically identify high-risk patients, enabling targeted interventions. Additionally, understanding which modifiable risk factors have the most significant impact on stroke prevention would enable clinicians to prioritize the most effective treatments.

This project aimed to develop an ML-based stroke risk prediction model that could be deployed within a primary care EMR as a clinical decision support system (CDSS). In implementation, if a patient's ‘stroke risk indicator’ surpasses a threshold for stroke likelihood, an automatic alert would notify the primary care provider, prompting increased vigilance in risk management. This could include intensified blood pressure and lipid control, lifestyle counseling, referral to specialists, and focused patient education on recognizing early stroke symptoms. By integrating such a system into routine care, primary care providers could intervene earlier, reducing stroke incidence and improving long-term patient outcomes.
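The alerting rule described above can be sketched in a few lines. This is only an illustration: the 20% cutoff, the `RiskAlert` type, and the `check_stroke_risk` function are hypothetical, since the project does not report a specific threshold or EMR interface.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for illustration; the project does not state a threshold.
ALERT_THRESHOLD = 0.20


@dataclass
class RiskAlert:
    patient_id: str
    risk: float
    message: str


def check_stroke_risk(patient_id: str, risk: float,
                      threshold: float = ALERT_THRESHOLD) -> Optional[RiskAlert]:
    """Return a RiskAlert when the predicted probability exceeds the threshold."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk > threshold:
        message = f"Stroke risk indicator {risk:.0%} exceeds threshold {threshold:.0%}"
        return RiskAlert(patient_id, risk, message)
    return None  # below threshold: no alert is raised


alert = check_stroke_risk("patient-001", 0.35)
no_alert = check_stroke_risk("patient-002", 0.05)
```

In practice the threshold would be tuned against the recall and false-positive rates that providers consider acceptable.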

Dataset:

The dataset, obtained from Toronto Metropolitan University in collaboration with the University of Toronto, comprises primary care records collected from 1998 to 2015 across Canada and is considered broadly representative of the national population. Initially, it contained over 800,000 observations and 175 features (e.g., demographics, clinical measurements, medications, comorbidities), with 58% female patients, 42% male, and ages ranging from 18 to 90.

Methods:

After removing approximately 706,000 records that were missing the target variable “stroke,” about 100,000 observations remained. Although this was a significant loss of data, the number of retained observations was considered acceptable for analysis. The target label, a binary variable, showed the expected class imbalance, with stroke patients (coded as “1”) representing only 5% of observations. Features were first reduced according to clinical relevance, retaining age, sex, BMI, key lab results, and relevant comorbidities. Binary (0 or 1) comorbidity features were then kept only if their mean within the ‘stroke == 1’ class was greater than 0.05 (that is, present in more than 5% of stroke cases), which served as an initial cutoff for feature inclusion. Features were also reduced to limit clinical redundancy. The resulting smaller dataset was saved separately to make further data handling more computationally manageable.
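As a rough sketch of the prevalence cutoff described above, the following filters binary comorbidity columns by their mean among stroke cases. The column names and toy data are hypothetical and are not taken from the project dataset.

```python
import pandas as pd


def select_comorbidities(df, comorbidity_cols, target="stroke", min_prevalence=0.05):
    """Keep binary comorbidity columns whose mean among stroke cases exceeds the cutoff."""
    stroke_rows = df[df[target] == 1]
    # For 0/1 columns, the mean equals the share of stroke patients with the condition
    prevalence = stroke_rows[comorbidity_cols].mean()
    return prevalence[prevalence > min_prevalence].index.tolist()


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "stroke":       [1, 1, 1, 1, 0, 0, 0, 0],
    "hypertension": [1, 1, 0, 1, 0, 1, 0, 0],
    "rare_dx":      [0, 0, 0, 0, 1, 0, 0, 0],
})
kept = select_comorbidities(toy, ["hypertension", "rare_dx"])
print(kept)
```

Here `hypertension` is retained because it appears in 75% of the toy stroke cases, while `rare_dx` never appears among them and is dropped.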

During initial data exploration, four features (LDL, HDL, hemoglobin A1c, and triglycerides) were found to have missing values. Additionally, patients were unequally represented within the dataset, with some patients having over 100 observations and others only a single observation. The first data cleaning step was therefore to represent each patient uniformly by handling the multiple observations per patient. The last observation for each patient was used for two reasons: aggregating all of a patient's records proved computationally demanding, and the final observation was considered more representative of a patient’s final status than the first. This step reduced the dataset from over 100,000 observations to about 17,000. Even after filtering, the dataset remained highly imbalanced, with 16,626 observations labeled as “no stroke” versus 620 labeled “stroke.” The only categorical variable requiring re-encoding was sex, which was converted to a binary variable (0 = female, 1 = male).
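One way to keep the last observation per patient and re-encode sex is sketched below. It assumes a visit date column is available for ordering and that sex is stored as "F"/"M". Both assumptions, along with all column names, are hypothetical.

```python
import pandas as pd


def last_observation_per_patient(df, id_col="patient_id", date_col="visit_date"):
    """Sort each patient's records chronologically and keep only the most recent one."""
    ordered = df.sort_values([id_col, date_col])
    return ordered.drop_duplicates(subset=id_col, keep="last").reset_index(drop=True)


# Toy records with hypothetical column names and values
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "visit_date": pd.to_datetime([
        "2010-01-01", "2012-06-01", "2011-03-15",
        "2009-05-05", "2014-02-02", "2013-01-01",
    ]),
    "sex": ["F", "F", "M", "F", "F", "F"],
    "ldl": [3.1, 2.8, 3.5, 4.0, 3.2, 3.6],
})

latest = last_observation_per_patient(toy)
# Binary encoding as described in the text: 0 = female, 1 = male
latest["sex"] = latest["sex"].map({"F": 0, "M": 1})
```

Sorting before deduplicating matters here: patient 3's rows are out of chronological order in the toy data, and the 2014 record is still the one retained.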

Next, scatterplots, boxplots, and summary statistics were used to examine continuous features, identify outliers, and remove values deemed clinically implausible or erroneous. Patient ID was removed because it is not predictive. Hemoglobin A1c was removed because it had over 50% missing values and was redundant with the diagnosis of ‘Diabetes’. The remaining three features with missingness (LDL, HDL, and TG) had a low percentage of missing values (< 5%) and were retained for imputation. Additionally, a correlation matrix was used to identify collinear features above a threshold of 0.9, although no features were found to need removal. This resulted in the final cleaned data for modeling.
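The collinearity check can be sketched as follows. The toy values and the deliberately redundant `ldl_scaled` column are hypothetical and exist only to show a pair crossing the 0.9 threshold; in the project itself no pair exceeded it.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, |r|) for pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) fail the comparison and are skipped
    return [(a, b, float(r)) for (a, b), r in pairs.items() if r > threshold]


# Toy data; 'ldl_scaled' is an intentionally redundant, hypothetical column
toy = pd.DataFrame({
    "ldl": [2.0, 3.0, 4.0, 3.5, 2.5],
    "hdl": [1.5, 1.0, 1.2, 0.9, 1.4],
    "tg":  [1.0, 2.2, 1.1, 1.9, 1.6],
})
toy["ldl_scaled"] = toy["ldl"] * 2 + 0.1

pairs = highly_correlated_pairs(toy, threshold=0.9)
```

Only the `ldl` / `ldl_scaled` pair is flagged; one member of such a pair would then be a candidate for removal.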

A range of models was selected for machine learning analysis to represent a diverse set of learning architectures: Logistic Regression to represent linear models, Random Forest and XGBoost to represent tree-based models, a Neural Network to represent deep learning, and Naive Bayes to represent generative probabilistic classifiers. While all of these models can predict binary outcomes (stroke vs. no stroke), the aim of comparing them was to determine which offered the best balance of performance, explainability, and computational cost. Each model began with a common pipeline: setting the target label, splitting the data into 80% training and 20% testing sets, using MICE imputation for the three features with missing values (to maintain relationships between variables), and scaling continuous features.
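A minimal sketch of that shared preprocessing is shown below on synthetic data. It makes several assumptions: scikit-learn's `IterativeImputer` (a MICE-inspired imputer) stands in for whatever MICE implementation the project used, the split is stratified, and the imputer and scaler are fit on the training set only. None of these details are confirmed by the source.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 continuous features, roughly 10% positive class
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.1).astype(int)
X[rng.random(n) < 0.05, 0] = np.nan  # about 5% missing values in one feature

# 80/20 split; stratifying on y is an assumption that keeps the class ratio similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the imputer and scaler on the training data only, then apply to the test data
imputer = IterativeImputer(max_iter=10, random_state=0)
scaler = StandardScaler()
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))
```

Fitting the transformers on the training portion alone avoids leaking test-set information into the imputed and scaled values.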

To handle class imbalance, each of the models utilized synthetic sampling techniques. SMOTE was used to oversample the minority class in all models except Logistic Regression, which was found to benefit more from SMOTE-ENN, a combination of oversampling and undersampling. Stratified k-fold cross-validation (k=5) was used to preserve class representation in each validation fold and prevent overfitting. Additionally, each model used GridSearchCV for hyperparameter tuning to limit overfitting, for example by tuning regularization in logistic regression and adjusting tree depth in random forest and XGBoost. For the Naive Bayes model, a hybrid approach combining Gaussian and Bernoulli methods was used to fit the mix of continuous and binary features. For each model run, performance was logged using Accuracy, Precision, Recall, and F1. Additionally, computational cost was calculated for each model based upon the time required for computation and the size of GPU disk space utilized.
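The source does not describe how the hybrid Naive Bayes model was assembled. One common construction, sketched here as an assumption, fits `GaussianNB` on the continuous features and `BernoulliNB` on the binary features, then adds their log-posteriors and subtracts the class log-prior once so it is not counted twice. The `HybridNaiveBayes` class and the synthetic data are hypothetical; resampling and grid search are omitted for brevity.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB


class HybridNaiveBayes:
    """Combine Gaussian and Bernoulli Naive Bayes over disjoint feature groups."""

    def __init__(self, continuous_idx, binary_idx):
        self.continuous_idx = list(continuous_idx)
        self.binary_idx = list(binary_idx)

    def fit(self, X, y):
        self.gnb_ = GaussianNB().fit(X[:, self.continuous_idx], y)
        self.bnb_ = BernoulliNB().fit(X[:, self.binary_idx], y)
        self.classes_ = self.gnb_.classes_
        return self

    def predict_proba(self, X):
        # Each log-posterior already includes the class prior, so subtract it once
        log_post = (
            self.gnb_.predict_log_proba(X[:, self.continuous_idx])
            + self.bnb_.predict_log_proba(X[:, self.binary_idx])
            - np.log(self.gnb_.class_prior_)
        )
        # Normalize in log space so each row sums to 1 after exponentiation
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
        return np.exp(log_post)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Synthetic data: 2 continuous and 2 binary features, roughly 30% positive class
rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.3).astype(int)
X_cont = rng.normal(loc=y[:, None] * 1.0, size=(n, 2))
X_bin = (rng.random((n, 2)) < (0.2 + 0.5 * y[:, None])).astype(int)
X = np.hstack([X_cont, X_bin])

model = HybridNaiveBayes(continuous_idx=[0, 1], binary_idx=[2, 3]).fit(X, y)
proba = model.predict_proba(X)
preds = model.predict(X)
```

Because both sub-models are fit on the same labels, they share the same empirical class prior, which is what makes the single subtraction valid.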

Results:

In analyzing the final results of the machine learning models, each demonstrated strong accuracy (ranging from Hybrid Naive Bayes at nearly 70% up to XGBoost at 92%), although this metric was less meaningful because the substantial class imbalance inflated its values. The analysis focused primarily on recall as the most important metric, given the importance of correctly predicting stroke cases and the acceptability of some false positives in the setting of primary care screening.
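To illustrate why accuracy is inflated under this imbalance, the sketch below scores a trivial classifier that predicts "no stroke" for every patient. It uses the class counts from the Methods section (the full filtered dataset, not the held-out test set). The helper function is illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Class counts after filtering, as reported in the Methods section
n_no_stroke, n_stroke = 16_626, 620

# A trivial model that labels everyone "no stroke" misses every stroke case
baseline = classification_metrics(tp=0, fp=0, fn=n_stroke, tn=n_no_stroke)
print(f"accuracy={baseline['accuracy']:.3f}, recall={baseline['recall']:.3f}")
```

The baseline reaches about 96.4% accuracy with zero recall, which is why recall rather than accuracy was used to compare models.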

Logistic Regression and Hybrid Naive Bayes emerged as the top models primarily because of their high recall rates (62.9% and 59.7%, respectively), significantly outperforming other models in accurately identifying positive stroke cases. Both models also offered high explainability, making them particularly attractive choices for clinical use cases, where transparency and interpretability are essential. However, Hybrid Naive Bayes was ultimately considered superior due to its computational simplicity and efficiency, whereas Logistic Regression required a substantial computational load due to the use of SMOTE-ENN.

In contrast, tree-based models such as Random Forest and XGBoost demonstrated marginally better F1 scores (16.0% and 17.6%, respectively), indicating a more balanced precision-recall tradeoff, but their recall was too low for clinical utility. While the Neural Network model had the highest F1 score, its low recall and lack of explainability made it undesirable for clinical deployment. Finally, a SHAP analysis of the Logistic Regression model indicated that age, cancer, LDL, anxiety, and depression were the features with the highest influence on the model’s output.

Figure: SHAP analysis of the top 5 features in the Logistic Regression model
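For a linear model such as logistic regression, SHAP values in log-odds space have a simple closed form when features are treated as independent: each feature's contribution is its coefficient times the feature's deviation from the background mean. The sketch below uses that closed form with numpy rather than the SHAP library the project used. The coefficients, feature names, and data are hypothetical and do not reproduce the project's results.

```python
import numpy as np


def linear_shap_values(coef, X, background):
    """SHAP values (log-odds scale) for a linear model, assuming independent features."""
    coef = np.asarray(coef, dtype=float)
    baseline = np.asarray(background, dtype=float).mean(axis=0)
    return (np.asarray(X, dtype=float) - baseline) * coef


def rank_features(shap_values, names, top_k=5):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_k]
    return [(names[i], float(importance[i])) for i in order]


# Hypothetical standardized features and coefficients, for illustration only
names = ["age", "ldl", "hdl"]
coef = np.array([0.8, 0.3, -0.1])
X = np.array([
    [1.0, -1.0, 0.5],
    [-1.0, 1.0, -0.5],
    [0.0, 0.0, 0.0],
    [2.0, 1.0, 1.0],
    [-2.0, -1.0, -1.0],
])

shap_values = linear_shap_values(coef, X, background=X)
ranking = rank_features(shap_values, names)
```

The per-row SHAP values sum to the difference between that row's log-odds and the average log-odds over the background data, which is the additivity property that makes the ranking interpretable.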

Discussion:

This study demonstrates that stroke risk prediction models using machine learning can be effective for deployment in the primary care setting to enhance screening, patient education, and early intervention. In this analysis of machine learning approaches for predicting stroke risk, the Hybrid Naive Bayes classifier emerged as the most favorable model overall due to its ability to predict actual stroke cases, its explainability, and its low computational cost. Logistic Regression also performed well, achieving the highest recall overall, but was comparatively more computationally intensive due to the use of SMOTE-ENN to handle class imbalance. These results support the utility of simple, interpretable models for use in primary care, where timely and resource-efficient decision-making is essential.

The value of these models lies in their ability to flag high-risk patients before a stroke occurs, allowing providers to initiate first-line interventions such as lifestyle counseling, blood pressure management, and patient education. These measures are low-cost, non-invasive, and widely accepted in primary care, and when targeted at high-risk patients they can reduce the incidence of stroke while improving long-term outcomes. For example, a stroke risk prediction model embedded within the EMR could generate an alert during pre-visit charting, guiding providers to initiate risk reduction strategies at the point of care. Even in cases of false positives, these first-line interventions have low risks and are generally beneficial, especially when weighed against the high costs of stroke-related morbidity and mortality.

Nevertheless, this analysis encountered several limitations affecting model performance and clinical application that must be considered. A major limiting factor was the significant target class imbalance and low representation of actual stroke cases, which accounted for only 3.6% of the training set (496 observations). This imbalance likely contributed to relatively low precision across models, reducing their positive predictive value and increasing the risk of false positives. Although recall was prioritized as the most important metric given the importance of identifying true stroke cases for screening, the low precision may diminish provider confidence in the model’s results. For example, frequent false positives could lead to unnecessary diagnostic testing, increase patient anxiety, and place additional strain on already burdened healthcare resources. A more balanced approach to precision and recall would be ideal in future model refinement.

Another limitation is the absence of a time horizon for stroke risk prediction. The models in this study estimate the risk of a stroke without specifying whether that risk applies over one year, five years, or a lifetime. This lack of temporal specificity can make clinical decision-making more challenging, as the urgency and intensity of interventions depend heavily on the timeframe of predicted risk. Incorporating time-bound predictions into future models could greatly enhance their clinical utility by helping providers tailor care based on how soon a patient is likely to be at risk.

The SHAP analysis of the Logistic Regression model provided insights into feature importance and model behavior. Predictors such as age, LDL, and cancer were consistent with established clinical knowledge. However, the finding that anxiety and depression were among the top predictors highlights a potentially under-recognized modifiable factor for reducing stroke risk. While this analysis cannot establish causality, the association between mental health and stroke risk warrants further investigation. Whether these conditions are precursors to stroke or sequelae of it, the possibility that psychological health plays a causal role suggests a need to integrate mental health interventions, such as psychological treatment, into broader stroke prevention programs.

From a systems-level perspective, the implementation of ML-based stroke risk prediction models in primary care could yield substantial public health and economic benefits. Early identification and management of high-risk patients may reduce the incidence of stroke, especially amongst younger populations in which stroke is less expected. By enabling targeted interventions before the onset of acute stroke, these tools can support more efficient resource allocation, reduce downstream care costs (e.g., hospitalization and rehabilitation), and improve overall health outcomes for patients.

Conclusion:

This study demonstrates that machine learning can be effectively applied to primary care data to identify individuals at elevated risk of stroke, offering a practical path toward earlier intervention and prevention. Unlike most existing ML research in stroke care, which primarily focuses on hospital-based data and post-stroke predictions, this project offers a novel emphasis on primary care data and a pre-stroke perspective, aligning with upstream prevention goals. The key takeaway is that even relatively simple models, such as Naive Bayes and Logistic Regression, can achieve clinically meaningful performance and maintain transparency, making them viable for integration into electronic medical records as real-time decision support tools.

Future research should prioritize two major enhancements: the incorporation of time-bound risk predictions (e.g., 1-year or 5-year risk windows) and the inclusion of more diverse, representative datasets to improve generalizability and equity. Additionally, the prominence of mental health variables in the SHAP analysis underscores the need for interdisciplinary studies examining the relationship between psychological factors and stroke risk, which may open new avenues for preventive care.

For patients and providers, the implications are significant. Machine learning-driven risk prediction tools could shift the primary care paradigm from reactive to proactive care, empowering providers to intervene earlier with low-risk, evidence-based strategies. For researchers and developers, this work highlights the importance of model interpretability, bias mitigation, and contextual alignment with real-world clinical workflows. As the healthcare system continues to adopt AI technologies, the goal must remain not only to improve predictive accuracy, but to ensure models enhance care quality, promote health equity, and support meaningful clinical action.

Keywords:
Stroke Risk Prediction, EMR, Clinical Decision Support, Explainable AI, Risk Stratification

References

Alanazi, E. M., Abdou, A., & Luo, J. (2021). Predicting risk of stroke from lab tests using machine learning algorithms: Development and evaluation of prediction models. JMIR Formative Research, 5(12), e23440. https://doi.org/10.2196/23440

Chahine, Y., Magoon, M. J., Maidu, B., del Álamo, J. C., Boyle, P. M., & Akoum, N. (2023). Machine learning and the conundrum of stroke risk prediction. Arrhythmia & Electrophysiology Review, 12, e07. https://doi.org/10.15420/aer.2022.34

Connected by the numbers. (2023). Heart and Stroke Foundation of Canada. https://www.heartandstroke.ca/articles/connected-by-the-numbers

Daidone, M., Ferrantelli, S., & Tuttolomondo, A. (2023). Machine learning applications in stroke medicine: Advancements, challenges, and future prospectives. Neural Regeneration Research, 19(4), 769–773. https://doi.org/10.4103/1673-5374.382228

Dong, Q. (2024). Exploring the application of machine learning algorithms in stroke prediction. In Y. Wang (Ed.), Proceedings of the 2024 2nd International Conference on Image, Algorithms and Artificial Intelligence (ICIAAI 2024) (pp. 92–99). Advances in Computer Science Research. https://doi.org/10.2991/978-94-6463-540-9_12

Fu, Y. (2024). A machine learning approach for predicting stroke. Medical Data Mining, 7(3), 15. https://doi.org/10.53388/MDM202407015

Govindaiah, A., Bhuiyan, T., Smith, R. T., Dhamoon, M. S., & Bhuiyan, A. (2025). A machine learning prediction model to identify individuals at risk of 5-year incident stroke based on retinal imaging. Sensors, 25, 1917. https://doi.org/10.3390/s25061917

Guo, Y. (2022). A new paradigm of “real-time” stroke risk prediction and integrated care management in the digital health era: Innovations using machine learning and artificial intelligence approaches. Thrombosis and Haemostasis, 122, 5–7. https://doi.org/10.1055/a-1508-7980

Gupta, A., Mishra, N., Jatana, N., Malik, S., Gepreel, K. A., Asmat, F., & Mohanty, S. N. (2024). Predicting stroke risk: An effective stroke prediction model based on neural networks. Journal of Neurorestoratology, 13, 100156. https://doi.org/10.1016/j.jnrt.2024.100156

Heseltine-Carp, W., Courtman, M., Browning, D., Kasabe, A., Allen, M., Streeter, A., Ifeachor, E., James, M., & Mullin, S. (2025). Machine learning to predict stroke risk from routine hospital data: A systematic review. International Journal of Medical Informatics, 196, 105811. https://doi.org/10.1016/j.ijmedinf.2025.105811

Hunter, E., & Kelleher, J. D. (2022). A review of risk concepts and models for predicting the risk of primary stroke. Frontiers in Neuroinformatics, 16, 883762. https://doi.org/10.3389/fninf.2022.883762

Jung, S., Song, M. K., Lee, E., Bae, S., Kim, Y. Y., Lee, D., Lee, M. J., & Yoo, S. (2022). Predicting ischemic stroke in patients with atrial fibrillation using machine learning. Frontiers in Bioscience (Landmark Edition), 27(3), 80. https://doi.org/10.31083/j.fbl2703080

Lavanya, S. J. M., & Subbulakshmi, P. (2024). Unveiling the potential of machine learning approaches in predicting the emergence of stroke at its onset: A predicting framework. Scientific Reports, 14, 20053. https://doi.org/10.1038/s41598-024-70354-1

Moulaei, K., Afshari, L., Moulaei, R., Sabet, B., Mousavi, S. M., & Afrash, M. R. (2024). Explainable artificial intelligence for stroke prediction through comparison of deep learning and machine learning models. Scientific Reports, 14, 31392. https://doi.org/10.1038/s41598-024-82931-5

Qiu, Y., Cheng, S., Wu, Y., Yan, W., Hu, S., Chen, Y., Xu, Y., Chen, X., Yang, J., Chen, X., & Zheng, H. (2023). Development of rapid and effective risk prediction models for stroke in the Chinese population: A cross-sectional study. BMJ Open, 13, e068045. https://doi.org/10.1136/bmjopen-2022-068045

Samsel, K. (2025, March 19). Machine learning in healthcare: Navigating ethics, bias, and real-world use [Lecture slides]. HAD7001S3 – Applied Machine Learning for Health Data, University of Toronto, Dalla Lana School of Public Health.

Scott, C. A., Li, L., & Rothwell, P. M. (2022). Diverging temporal trends in stroke incidence in younger vs older people: A systematic review and meta-analysis. JAMA Neurology, 79(10), 1036–1048. https://doi.org/10.1001/jamaneurol.2022.2216

Ugbomeh, O., Yiye, V., Ibeke, E., Ezenkwu, C. P., Sharma, V., & Alkhayyat, A. (2024). Machine learning algorithms for stroke risk prediction leveraging on explainable artificial intelligence techniques (XAI). In 2024 International Conference on Electrical, Electronics and Computing Technologies (ICEECT 2024) (Article 10739320). IEEE. https://doi.org/10.1109/ICEECT61758.2024.10739320

Xie, S., Peng, S., Zhao, L., Yang, B., Qu, Y., & Tang, X. (2025). A comprehensive analysis of stroke risk factors and development of a predictive model using machine learning approaches. Molecular Genetics and Genomics, 300, 18. https://doi.org/10.1007/s00438-024-02217-3

Yang, Y., Zheng, J., Du, Z., Li, Y., & Cai, Y. (2021). Accurate prediction of stroke for hypertensive patients based on medical big data and machine learning algorithms: Retrospective study. JMIR Medical Informatics, 9(11), e30277. https://doi.org/10.2196/30277