The Strands of Health

This project estimates your potential risk of stroke and diabetes.
Chenwei Pan
Grade 9

Problem

My project's main objective is to investigate different health factors that may contribute to stroke and diabetes.

Stroke risk increases with age, but strokes can occur at any age. Early intervention is crucial, as strokes are a leading cause of serious long-term disability [1]. Diabetes is also widespread: approximately 830 million people worldwide have diabetes, with the majority living in low- and middle-income countries, and more than half of those affected do not receive treatment [2].

Based on research, hypertension may be one of the primary factors contributing to stroke, diabetes and many other chronic diseases [3]. In Canada, the prevalence of hypertension in adults is 22.6%, with an additional 20% classified as prehypertensive. More than 70% of adults over 80 have hypertension. If the average lifespan is reached, over 90% of Canadians are estimated to develop hypertension [4]. 

On a personal level, I am deeply invested in this issue due to a long family history of hypertension. To better understand what might happen to my family, or myself, I aim to determine whether this condition significantly increases the risk of stroke and diabetes. To explore this, I developed a machine-learning model that analyzes patient data to predict the risks of stroke and diabetes. By identifying these risks, I hope to contribute to a broader understanding of hypertension’s impact on overall health.

My project not only examines the relationship between hypertension and both stroke and diabetes but also identifies other key factors that may influence their development. The findings could provide insights into how a healthy lifestyle affects health outcomes, potentially encouraging individuals to adopt healthier habits.

Background info:

First of all, let's look at how stroke and diabetes occur.

A stroke occurs when blood flow to the brain is disrupted. Blood supplies oxygen and nutrients to brain cells, and if blood flow is blocked, these cells suffer damage and eventually die. Once brain cells die, they cannot be repaired, leading to difficulties with speech, cognition, and mobility [5]. 

There are two types of stroke: ischemic and hemorrhagic. Ischemic strokes account for approximately 87% of cases. They occur when blood flow to the brain is obstructed, either by a clot forming in the brain (thrombotic stroke) or a clot travelling from elsewhere in the body (embolic stroke). Hemorrhagic strokes, comprising about 13% of cases, result from bleeding in or around the brain. The two subtypes are intracerebral hemorrhage and subarachnoid hemorrhage, caused by factors such as high blood pressure or aneurysms. Hemorrhagic strokes require urgent medical intervention to control bleeding and mitigate risks [6].

Diabetes occurs when blood glucose (sugar) levels are too high. It develops when the body does not produce enough insulin or cannot use it effectively, causing glucose to remain in the bloodstream rather than being used for energy. Glucose, the body's primary energy source, comes from both internal production and dietary intake. Insulin, a hormone produced by the pancreas, facilitates glucose absorption into cells. Diabetes increases the risk of complications such as eye, kidney, and nerve damage, cardiovascular disease, and certain cancers. Preventive measures and disease management can significantly reduce these risks [7].

Hypertension is reported to be a significant factor, alongside many others, that can lead to stroke, diabetes, heart disease, kidney damage, and other serious complications [8]. Hypertension, or high blood pressure, occurs when the force of blood against artery walls is consistently too high. The term "hypertension" derives from "hyper-" (over/beyond) and "tension" (stretching/straining), meaning "straining beyond." This condition places excessive strain on blood vessels [9].

There are two types of hypertension: primary and secondary. Primary hypertension has a gradual onset and develops slowly over time. Secondary hypertension is caused by other underlying medical conditions and is often more sudden and severe [3].

My goal for this study is to use machine learning to model the relationships between health factors, including hypertension, and two diseases: stroke and diabetes. I also aim to develop an app that uses these models to predict the likelihood of stroke and diabetes from different health and lifestyle factors.


 

Method

Machine learning involves the development of computer systems that learn and adapt without explicit programming, using algorithms and statistical models to analyze patterns and draw inferences from data (Oxford Dictionary).

Predictive machine learning applies data analytics and machine-learning algorithms to historical data to find patterns and trends. In general, the more data available to the algorithms, the better the predictions tend to be [10].

Various predictive models and algorithms exist. The two primary types relevant to my project are regression and classification models. Both fall under supervised learning, where models are trained on existing data with labelled outcomes. The key difference is that regression predicts continuous values, while classification assigns discrete labels.

A simple classification algorithm is logistic regression. Despite its name, logistic regression is designed for classification tasks. It calculates a probability for each data point and then assigns a class based on a threshold value (e.g., a probability of 50% or more = class 1, below 50% = class 0). Logistic regression is effective for binary classification problems, such as determining whether an email is spam or whether a patient is likely to develop a condition [11, 12].
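To make the threshold idea concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a probability and then into a class. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not taken from the trained models in this project.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one data point."""
    # Weighted sum of the features plus a bias term
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    # Class 1 if the probability reaches the threshold, otherwise class 0
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical weights for two made-up features (assumption, not model output)
example_weights = [0.8, -0.5]
example_bias = -0.2
prob, label = predict_class([2.0, 1.0], example_weights, example_bias)
print(f"probability={prob:.2f}, class={label}")
```

Raising or lowering `threshold` trades false positives against false negatives, which matters when one kind of mistake is more costly than the other.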

A more advanced model, random forest, is an ensemble learning method that constructs multiple decision trees and aggregates their predictions for improved accuracy. Each tree functions as an independent "expert," and the collective decision-making process reduces errors and overfitting. This algorithm has been widely adopted for medical predictions, such as assessing disease risks, and performs well with large datasets [13].
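The "many experts voting" idea can be sketched in a few lines. The three one-rule "trees" below and their cut-off values are hypothetical and have no medical meaning; a real random forest learns its trees from bootstrap samples of the training data, and some libraries (scikit-learn, for example) average the trees' predicted probabilities instead of counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a prediction, then aggregate by majority vote."""
    votes = [tree(sample) for tree in trees]
    return majority_vote(votes)


# Hypothetical one-rule "trees"; the thresholds are illustrative assumptions
def tree_glucose(sample):
    return 1 if sample["avg_glucose"] > 150 else 0


def tree_age(sample):
    return 1 if sample["age"] > 65 else 0


def tree_bmi(sample):
    return 1 if sample["bmi"] > 30 else 0


patient = {"avg_glucose": 180, "age": 70, "bmi": 24}
prediction = forest_predict([tree_glucose, tree_age, tree_bmi], patient)
print("predicted class:", prediction)
```

Because a single tree can be wrong without changing the majority, the ensemble is usually more stable than any one of its members.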

Both logistic regression and random forest were used in my project. Currently, my most accurate model employs random forest to predict the likelihood of stroke and diabetes based on health factors. By refining this model, I aim to provide a clearer understanding of how hypertension influences these conditions.

My Dataset Source [14]:

  • The Behavioral Risk Factor Surveillance System (BRFSS) is a state-based health survey run by the Centers for Disease Control and Prevention (CDC), designed to collect data on health-related risk behaviours, chronic conditions, and preventive service use among U.S. adults.
  • It uses telephone surveys, including both landline and cell phones, to gather information from a representative sample of adults. 
  • The BRFSS consists of core questions, optional modules, and state-specific questions, covering topics like health status, access to care, chronic diseases, and health behaviours. Its primary purpose is to monitor public health trends, guide policy development, and support research to improve health outcomes.

In this study, I gathered two datasets from the CDC. One records stroke outcomes and the other records diabetes outcomes, and both include hypertension along with other relevant health factors. Using this data, I trained and validated machine-learning models to predict the likelihood of stroke or diabetes, aiming for high accuracy. The models' performance was assessed based on their ability to capture the relationships between the two diseases and the health factors, including hypertension.

The features in each dataset are listed below:

Stroke:

  • Sex
  • Age
  • Hypertension
  • Heart disease
  • Ever married
  • Work type
  • Residence type
  • Average glucose level
  • BMI
  • Smoking status
  • Stroke (yes or no)

Diabetes:

  • Age
  • Sex
  • High cholesterol
  • Cholesterol check
  • BMI
  • Smoker
  • Heart disease or attack
  • Physical activity
  • Fruits
  • Vegetables
  • Heavy alcohol consumption
  • General health
  • Mental health
  • Physical health
  • Difficulty walking
  • Stroke
  • High blood pressure (hypertension)
  • Diabetes (yes or no)

The following steps were implemented in my Python code:

  1. Import all necessary libraries (NumPy, pandas, etc.)
  2. Upload and Prepare Data
  3. Process data
  4. Find missing values and replace them with a median
  5. Turn everything into numerical data
  6. Split data into training and testing
  7. Create and train the model pipeline (SMOTE, classifier, preprocessor)
  8. Perform a grid search for the best hyperparameters
  9. Cross-validate and print results
  10. Generate predictions and evaluate
  11. Generate a confusion matrix (true positive, true negative, etc.)
  12. Plot feature importance
  13. Calculate and print metrics (accuracy, precision, etc)
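Steps 4 to 6 in the list above can be illustrated with a small sketch that uses only the Python standard library. This is not the project's actual code, which works with pandas and related libraries; the function names, the example values, and the choice to keep the class ratio similar in both splits (stratification) are assumptions made for illustration.

```python
import random
from statistics import median


def impute_median(values):
    """Step 4: replace missing entries (None) with the median of known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Step 5: turn category strings into 0/1 columns, one per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Step 6: return (train_indices, test_indices) with a similar class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Keep at least one example of every class in the test set
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Made-up example values (not rows from the real datasets)
bmi = [22.0, None, 31.5, 27.0, None]
smoking = ["never", "smokes", "formerly", "never"]
labels = [0] * 8 + [1] * 2

print(impute_median(bmi))
print(one_hot(smoking))
print(stratified_split(labels))
```

The median is often preferred over the mean for filling gaps because a few extreme values do not pull it far from the typical value.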

At first, I used the logistic regression model to predict the risks. The accuracy was 68% and 74% for stroke and diabetes respectively. 

Then, I switched the algorithm to random forest and, because the dataset was imbalanced, used SMOTE to improve accuracy. SMOTE stands for Synthetic Minority Oversampling Technique. It addresses imbalanced datasets by creating new synthetic examples of the minority class (such as the rare disease cases in a medical dataset), so that the model trains on a more balanced set of examples [15].
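The core idea of SMOTE can be sketched as follows: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. This is a simplified illustration under stated assumptions (a tiny made-up dataset, `k=2` neighbours, a fixed random seed), not the library implementation used in a real pipeline.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position between the base point and the chosen neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up minority-class points with two numeric features each
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
new_points = smote_sketch(minority_points, n_new=4)
for point in new_points:
    print([round(v, 2) for v in point])
```

Oversampling of this kind is normally applied to the training split only, so that the test set still reflects the real class balance.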

After many attempts and bug fixes, I got both models to work correctly using random forest. My stroke prediction model improved dramatically, reaching an accuracy of 99%, and cross-validation showed no sign of overfitting. My diabetes prediction model now has an accuracy of 75%, which is not a big improvement, but it still shows how switching the algorithm can affect the results.
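Cross-validation checks for overfitting by training and validating the model several times on different slices of the data; if the validation scores stay close to the training score, overfitting is less likely. Below is a minimal sketch of how the data can be divided into folds. The number of folds (`k=5`) and the sample count are assumptions for illustration, and the sketch does not shuffle or stratify the data, which library implementations typically can.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=5))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {number}: train on {len(train)} samples, validate on {validation}")
```

Each sample is used for validation exactly once, so the averaged score reflects performance on data the model did not see during training in that round.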

Since my stroke prediction model appears more satisfactory, I decided to build only a stroke prediction app using Streamlit, leaving a diabetes app for the future, once I have improved the accuracy of the diabetes prediction model.

Streamlit is a Python framework for building interactive web apps. I learned Streamlit through a combination of YouTube videos and hours of practice. The stroke prediction app has the following features:

  • a form for inputting patient details
  • a step to double-check the entered information
  • a health tips section
  • a link for more information
  • a user reviews section

Figure 1 and Figure 2 below show screenshots of my app (the screenshots are partially cut off).

Figure 1. Input screen of Streamlit app

Figure 2. The resulting risk prediction screen with health tips

Analysis

The accuracy of model prediction from this work is summarized in Table 1 below. In this comparison, random forest outperformed logistic regression on both datasets, with a large gain for stroke and a small gain for diabetes. This finding may help in choosing models for future medical-related projects.

Table 1. The accuracy of model prediction by logistic regression and random forest.

Dataset  | Accuracy with Logistic Regression | Accuracy with Random Forest
Stroke   | 68%                               | 99%
Diabetes | 74%                               | 75%
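Accuracy, the metric reported in Table 1, is one of several metrics that can be computed from a confusion matrix (steps 11 and 13 of the method). The sketch below shows how accuracy, precision, recall, and F1 follow from the four confusion-matrix counts. The labels are made up to illustrate an imbalanced case and are not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced example: 95 negatives, 5 positives, model always predicts 0
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(metrics(y_true, y_pred))
```

In this made-up case the accuracy is high even though no positive case is found, which is why precision and recall are worth reading alongside accuracy when a dataset is imbalanced.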

Moreover, the feature importance from the random forest model has provided some insights into the most important factors for stroke and diabetes. The results are given below in Figure 3 and Figure 4.

Figure 3. Top 15 most important factors for stroke prediction.

Figure 4. Top 15 most important factors for diabetes prediction.

The results suggest that glucose level and BMI are the two most important factors for stroke prediction. It is reported that high blood sugar, also called hyperglycemia, occurs when glucose levels exceed normal ranges. While transient hyperglycemia may not cause long-term issues, prolonged elevation can result in complications such as stroke, cardiovascular disease, vision impairment, kidney damage and nerve damage [16].

My model results are consistent with the reported relationship between blood sugar level and stroke risk. This could point to an elevated stroke risk for patients with diabetes.

For diabetes prediction, high blood pressure, general health, and BMI are the top three factors. BMI also ranks highly for stroke prediction, suggesting that body weight and general health affect your chances of developing chronic illnesses and other damaging health conditions. Awareness of health conditions and lifestyle changes remain crucial for lowering the risks of stroke and diabetes and for living a long, healthy life.

Based on this study, hypertension is the third most important factor for stroke and the second most important factor for diabetes. This is consistent with reports that hypertension can lead to serious medical complications. However, I believe hypertension could have ranked even higher. One possible reason it did not is that many people with hypertension take medication to lower their blood pressure, which makes them less likely to experience the complications it can cause.

The results suggest that machine learning can be used to help guide preventive healthcare.

Impact:

This project could provide insights into the predictive power of machine learning in healthcare, highlighting how analyzing health factors can assist in early detection and intervention. By demonstrating the potential of such models, this project underscores the value of integrating technology into healthcare to improve outcomes and encourage proactive health management.

My app could be used for future patient assessment surveys, helping to guide individuals in assessing their potential risk for stroke. When a person suspects they may be at risk, they can use my app as an initial screening tool before seeking medical attention. This could save time and money, as individuals with an extremely low risk may be able to avoid an unnecessary doctor's visit.

However, I acknowledge that no predictive model is 100% accurate, and my model will always have limitations. If this app is used in the future, it should not be solely relied upon for medical decisions. Instead, it should serve as a supplementary tool alongside professional medical advice.

It is important to recognize that similar technologies have already been developed. Many existing models use machine learning and artificial intelligence to predict stroke risks based on various health factors. However, my project aims to ensure that such tools are more user-friendly and practical for early health assessments.


 

Conclusion

  • Having a healthy lifestyle is essential for preventing serious health conditions like stroke and diabetes. Hypertension can lead to serious medical implications. My machine-learning model identifies key risk factors, reinforcing the importance of maintaining good health habits.
  • To make these insights more accessible, I created a Streamlit app using my stroke prediction model. I invited my family members, close friends, and classmates to test the app, and their feedback was mostly positive. Through this process, I gained valuable insights into how machine learning can help raise awareness and encourage proactive health decisions.
  • Additionally, my analysis showed that the random forest algorithm generally performs better than logistic regression, suggesting that more complex models may be more effective in predicting health risks when sufficient data are available. This reinforces the potential of AI-driven tools in medical prediction and prevention.

Limitations:

  • Initially, I was hoping to use local health data from Alberta and Canada. I emailed many health organizations (AHS, CIHI, community health, infostats, etc.) and hospitals, but either the data I needed required an ethics-approved study (I applied for approval but received no response) or they did not have the data. Private datasets also raise ethical issues, which is another reason I ultimately chose an ethics-approved public dataset.
  • However, limited data was available, and I could only find public data from the U.S. I hope more people will use my app so that I can collect more local data to improve my model.

Improvements:

While my model provides valuable insights, there are several ways it could be improved:

  • More Diverse Data: Expanding the dataset to include a wider range of demographics and medical histories could enhance accuracy.
  • Feature Refinement: Incorporating additional health metrics, such as sleep patterns, may improve predictions.
  • Model Optimization: Testing other machine-learning algorithms or fine-tuning hyperparameters could further boost performance.
  • User Experience: Enhancing my Streamlit app’s interface and adding explanations for predictions could make it more user-friendly.
  • Real-World Validation: Collaborating with healthcare professionals to validate predictions and explore practical applications would strengthen credibility.

My next step:

Introducing Phase 2: Using vessel segmentation to determine hypertensive retinopathy.

Objective: Enhance hypertension prediction by analyzing retinal images and improving segmentation techniques.

Key Steps:

Goal: Improve diagnostic accuracy of hypertensive retinopathy detection using advanced image preprocessing.

 

Citations

[1] CDC. (2024, October 24). Stroke Facts. Stroke. https://www.cdc.gov/stroke/data-research/facts-stats/index.html

[2] World Health Organization. (2019, May 13). Diabetes. Who.int. https://www.who.int/health-topics/diabetes#:~:text=About%20830%20million%20people%20worldwide,diabetes%20are%20not%20receiving%20treatment

[3] Mayo Clinic. (2024). High blood pressure (hypertension): Symptoms & causes. https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/symptoms-causes/syc-20373410

[4] Hypertension in Canada: High blood pressure (hypertension) is the leading risk for death and disability worldwide. (2016). https://hypertension.ca/wp-content/uploads/2018/12/HTN-Fact-Sheet-2016_FINAL.pdf.

[5] National Institute on Aging. (2023, February 9). Stroke: Signs, Causes, and Treatment. https://www.nia.nih.gov/health/stroke/stroke-signs-causes-and-treatment#:~:text=A%20stroke%20happens%20when%20there%27s,oxygen%20suffer%20and%20eventually%20die

[6] Types of Stroke. (2022, December 13). Hopkinsmedicine.org. https://www.hopkinsmedicine.org/health/conditions-and-diseases/stroke/types-of-stroke.

[7] National Institute of Diabetes and Digestive and Kidney Diseases. (2025, January 23). What Is Diabetes? NIDDK. https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes

[8] Mayo Clinic. (2024). High blood pressure (hypertension): Symptoms & causes. https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/symptoms-causes/syc-20373410

[9] hypertension. (2025). Vocabulary.com. https://www.vocabulary.com/dictionary/hypertension#:~:text=Hyper%2D%20is%20a%20prefix%20that,strain%20on%20your%20blood%20vessels

[10] IBM. (2024, August 12). Predictive AI. Ibm.com. https://www.ibm.com/think/topics/predictive-ai#:~:text=Predictive%20AI%20uses%20big%20data,biases%20in%20predictive%20AI%20models

[11] Keita, Z. (2022, September 21). Classification in Machine Learning: An Introduction. Datacamp.com; DataCamp. https://www.datacamp.com/blog/classification-machine-learning

[12] Dawson, C. (2021, February 11). A Guide to Logistic Regression for Beginners - Christa Dawson - Medium. Medium. https://dawsonc96.medium.com/a-guide-to-logistic-regression-for-beginners-c53632fea4e4

[13] What is Random Forest? [Beginner’s Guide + Examples]. (2020, October 21). CareerFoundry. https://careerfoundry.com/en/blog/data-analytics/what-is-random-forest/ 

[14] Behavioral Risk Factor Surveillance System. (2024, November 22). Cdc.gov. https://www.cdc.gov/brfss/index.html

[15] SWASTIK. (2020, October 6). SMOTE for Imbalanced Classification with Python. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/

[16] Hyperglycemia: Symptoms, Causes, and Treatments. (2023, November). Yale Medicine. https://www.yalemedicine.org/conditions/hyperglycemia-symptoms-causes-treatments#:~:text=Hyperglycemia%20is%20a%20condition%20in,also%20develop%20in%20non%2Ddiabetics

Acknowledgement

I acknowledge Ms. Lai who supported and guided me through this exciting process.

I acknowledge my mentors from Juniotech, Tim and Irada, for sharing their opinions and helping improve my project with their expertise in computer science.

Finally, I acknowledge Dr. Leyla Baghirzada, clinical assistant professor at the University of Calgary, for her great help and guidance in the medical aspect of my project.