Before the Blockage
Gurneet Kaur Grewal, Gurshaan Singh
Gobind Sarvar School Guru Nanak Gate Campus
Grade 10
Problem
How do long-term cardiovascular risk factors, particularly high cholesterol and high blood pressure, contribute to coronary artery blockage, and how accurately can an AI model use these factors to predict the likelihood of severe coronary artery disease?
Method
- Synthetic Data Generation
- A synthetic dataset of 1,000,000 individuals was created using Python in Google Colab to avoid privacy issues. The dataset included realistic cardiovascular risk factors such as age, sex, BMI, cholesterol levels, blood pressure, smoking status, diabetes, family history, and diet quality. This approach allows controlled analysis of cardiovascular risk patterns without using real patient data.
- Target Variable Creation
- A risk score for severe CAD was calculated using weighted risk factors. Higher-risk variables increased the score, while protective factors decreased it. A sigmoid function then converted this score into a probability between 0 and 1, from which a binary outcome was assigned (severe CAD: yes or no). This method simulates how multiple risk factors combine to influence disease likelihood.
- Data Preprocessing
- Categorical data was converted to numerical form, and all features were standardized so that the model treated each variable on an equal footing. Standardization ensures that differences in measurement scales do not bias the analysis.
- Model Training and Testing
- The dataset was split into 80% training data and 20% testing data. A logistic regression model was trained to identify patterns linking risk factors to severe CAD. Logistic regression was chosen for its interpretability and common use in medical risk analysis.
- Model Evaluation
- Predictions were generated for the test set.
- Model performance was evaluated using accuracy, precision, recall, and F1 score.
- Predicted probabilities were also used to create a ROC curve and calculate the AUC score, giving a complete view of model performance across thresholds. These metrics provide insight into both overall accuracy and the model’s ability to correctly identify severe CAD cases.
- Feature Importance Analysis
- Model coefficients were analyzed to determine which risk factors had the strongest influence on predictions. LDL cholesterol, blood pressure, age, and family history were among the most important predictors. This analysis helps identify key factors most strongly associated with severe CAD risk.
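The steps above can be sketched end to end in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the project's actual notebook: it uses only four of the risk factors, 2,000 individuals instead of 1,000,000, made-up distributions and weights, and a hand-written gradient-descent logistic regression (standard library only) in place of whatever library the original code used. All function and variable names are hypothetical.

```python
import math
import random

FEATURES = ["ldl", "systolic_bp", "age", "smoker"]  # illustrative subset only

def sigmoid(z):
    """Map a real-valued risk score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def make_person(rng):
    """Draw one synthetic individual. Ranges and weights are assumptions."""
    ldl = rng.gauss(130, 35)       # LDL cholesterol, mg/dL
    sbp = rng.gauss(125, 18)       # systolic blood pressure, mmHg
    age = rng.uniform(30, 80)      # years
    smoker = 1 if rng.random() < 0.2 else 0
    # Weighted risk score (hypothetical weights), then sigmoid -> probability.
    score = 0.02 * (ldl - 130) + 0.03 * (sbp - 125) + 0.04 * (age - 55) + 0.7 * smoker
    # Binary outcome drawn from that probability (one possible way to assign it).
    label = 1 if rng.random() < sigmoid(score) else 0
    return [ldl, sbp, age, smoker], label

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def predict_proba(w, b, row):
    """Predicted probability of severe CAD for one standardized row."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression with plain batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            err = predict_proba(w, b, row) - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def evaluate(w, b, X, y, threshold=0.5):
    """Accuracy, precision, recall and F1 at a single decision threshold."""
    tp = fp = tn = fn = 0
    for row, target in zip(X, y):
        pred = 1 if predict_proba(w, b, row) >= threshold else 0
        if pred == 1 and target == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif target == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y), "precision": precision,
            "recall": recall, "f1": f1}

rng = random.Random(42)
people = [make_person(rng) for _ in range(2000)]
X = standardize([features for features, _ in people])
y = [label for _, label in people]

split = int(0.8 * len(X))                      # 80% training, 20% testing
w, b = train_logistic(X[:split], y[:split])
metrics = evaluate(w, b, X[split:], y[split:])

# Larger absolute coefficients on standardized features mean stronger influence.
ranked = sorted(zip(FEATURES, w), key=lambda pair: abs(pair[1]), reverse=True)
print(metrics)
for name, weight in ranked:
    print(f"{name}: {weight:+.3f}")
```

Because the features are standardized before training, the fitted coefficients are on a common scale, which is what makes ranking them by absolute value a reasonable stand-in for the feature importance analysis.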
Analysis
After testing the model, our hypothesis proved to be wrong. We predicted that the AI model’s accuracy would be 90%; the measured accuracy was 79.44%. Further analysis using the confusion matrix shows that the model says “yes” to suspected severe CAD too often, which in turn lowers its accuracy. Of the 186,570 individuals the model predicted to have severe coronary artery disease, only 150,732 were predicted correctly. The remaining 35,838 “yes” predictions were false positives: the individual being tested did not actually have severe CAD.
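As a quick check on these figures, the share of “yes” predictions that were correct (the precision) can be worked out directly from the reported counts. The variable names below are ours, chosen for illustration.

```python
# Counts reported in the analysis above.
predicted_yes = 186_570      # individuals the model flagged as severe CAD
true_positives = 150_732     # flagged and actually severe CAD
false_positives = 35_838     # flagged but not severe CAD

# The two groups should add up to every "yes" prediction.
assert true_positives + false_positives == predicted_yes

precision = true_positives / predicted_yes
false_alarm_share = false_positives / predicted_yes

print(f"Precision: {precision:.2%}")                        # Precision: 80.79%
print(f"Incorrect 'yes' predictions: {false_alarm_share:.2%}")  # 19.21%
```

Roughly one in five “yes” predictions was therefore a false alarm.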
Conclusion
In conclusion, this project demonstrated that an AI-based logistic regression model can predict the likelihood of severe coronary artery disease (CAD) using long-term cardiovascular risk factors with moderate to high accuracy (79.44% on the test set). The model was able to correctly identify high-risk individuals on unseen data, and feature analysis confirmed that LDL cholesterol, blood pressure, age, and family history were the strongest predictors. These results highlight the value of traditional statistical modelling techniques when applied to large-scale health data. Additionally, the interpretability of the model allows for a clearer understanding of how individual risk factors contribute to disease likelihood.
Through analysis of the predicted probabilities and the ROC curve, we were able to assess the model’s performance across all thresholds, and the AUC score further confirmed that it significantly outperformed random chance. This evaluation provides a more comprehensive view of model performance than accuracy alone. Overall, the results support the use of AI as an ancillary tool for early cardiovascular risk assessment. However, since this study was conducted on artificial data, the performance of the model on real data may differ, and predictions made by AI should be used in conjunction with professional medical evaluation, not as a replacement for it.
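For readers unfamiliar with the ROC curve and AUC mentioned above, the idea can be sketched on a tiny hand-made example. The six labels and scores below are invented for illustration and are not results from this project; the sketch also assumes that no two scores are tied.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one score at a time, from strictest to most lenient.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the ROC curve (0.5 is chance, 1.0 is perfect)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true outcomes (1 = severe CAD) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

An AUC well above 0.5 means that a randomly chosen severe case usually receives a higher predicted probability than a randomly chosen non-severe case, regardless of which threshold is used to say “yes”.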
Citations
- https://www.heart.org/en/health-topics/cholesterol/about-cholesterol
- https://youtu.be/UaolDzxn-vE
- https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800
- https://www.heart.org/en/health-topics/cholesterol/causes-of-high-cholesterol
- https://youtu.be/CZa8mmJ7ZD8
- https://my.clevelandclinic.org/health/diagnostics/17649-blood-pressure
- https://myhealth.alberta.ca/Health/pages/conditions.aspx?hwid=stb117053&
- https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/symptoms-causes/syc-20373410
- https://my.clevelandclinic.org/health/body/circulatory-and-cardiovascular-system
- https://my.clevelandclinic.org/health/diseases/17400-pulmonary-embolism
- https://my.clevelandclinic.org/health/diseases/16911-deep-vein-thrombosis-dvt
- https://my.clevelandclinic.org/health/body/21486-pulmonary-arteries
- https://my.clevelandclinic.org/health/body/21704-heart
- https://my.clevelandclinic.org/health/articles/11918-cholesterol-high-cholesterol-diseases
- https://www.cdc.gov/heart-disease/risk-factors/index.html
- https://www.mayoclinic.org/diseases-conditions/coronary-artery-disease/symptoms-causes/syc-20350613
- https://www.nhlbi.nih.gov/health/heart-attack#:~:text=A%20heart%20attack%2C%20also%20known,muscle%20will%20begin%20to%20die
- https://www.aurorahealthcare.org/services/heart-vascular/conditions/coronary-artery-disease/types
- https://cvrti.utah.edu/imaging-techniques-for-early-detection-of-atherosclerosis/#:~:text=Atherosclerosis%20is%20the%20process%20of,Cardiac%20MRIs
- https://www.theguardian.com/society/2025/nov/12/young-people-high-blood-pressure-doubled-globally-obesity?CMP=share_btn_url
- https://www.nia.nih.gov/health/vascular-dementia/vascular-dementia-causes-symptoms-and-treatments
- https://www.heart.org/en/health-topics/heart-disease
- https://www.cdc.gov/heartdisease/risk_factors.htm
- https://www.nejm.org/doi/full/10.1056/NEJMra1814259
- https://www.escardio.org/
- https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p020/artificial-intelligence/thyroid_cancer
- OpenAI. (2025). ChatGPT (Version GPT-based large language model). https://chat.openai.com
Acknowledgement
- Anantjeet - Grade 7C
- Provided emotional support through the project
- Mrs. Grewal - Gurmukhi/Gurmat Teacher & Gurneet’s Mom
- Provided supplies
- Dr. Kaur - Kirtan Teacher
- Guided us through the project
- Gurshaan’s Mom
- Provided emotional support through the project
- Dr. Anitha - Science Teacher
- Guided us through the project
- Mrs. Sharma - Science Coordinator
- Guided us through the project
- Jasnoor - Grade 12
- Helped us register for CYSF
- Mr. Eng - Homeroom Teacher
- Guided us through the project
- Prabhleen - Grade 10
- Provided emotional support through the project
