Predicting Acute GvHD in Pediatric Bone Marrow Transplants using Machine Learning
Priscilla Wang
Westmount Mid/High School
Grade 10
Presentation
Problem
To improve current HCT (hematopoietic cell transplantation) treatment, we need screening that is fast but also safe. In the past, checking patients for GvHD (graft-versus-host disease) risk required many pre-clinical safety screenings, which often took weeks of testing. To make the process faster, the next step is to integrate applied Machine Learning to predict whether a patient has a high or low risk of developing GvHD. Predicting risk ahead of time could save many lives. ML models can rapidly filter out patients who are unsuitable for the treatment, speeding up the process significantly. In a clinical setting, doctors may only be able to test a few factors to determine whether a patient is considered safe for HCT treatment. An ML model can analyse many factors at once and come up with a decision for doctors to then double-check.
Method
I started by looking for a public, clean dataset suited to my project and chose a GvHD prediction dataset named “Bone Marrow Transplant: Children”. I loaded the data into Google Colab and set up my digital workspace, which included Pandas, Matplotlib, SciPy, and more. The dataset had missing values and needed cleaning. I started by replacing all missing values with NaN (Not a Number). Then, I replaced any malformed entries with NaN. For example, if someone had entered 25yo as the patient age instead of 25, the cleaning step would replace it with NaN. This keeps the model from failing on values it cannot read.
Next, I got to know the data better by using df.head() and analysing the output. I plotted a feature correlation map and focused on the row for my target, aGvHDIIIIV. To read this chart, you check which boxes are brightly colored along the target row: those features are more strongly related to the target, and the rest have little to no relation. Then I removed my “cheater columns,” which are features that are only recorded after the doctor checks whether GvHD has occurred. Removing these columns lowers the accuracy of the model, but it makes the model truer to clinical settings, where that information is not yet available when the prediction is made. It also greatly reduced overfitting, which is something that I struggled with when building.
Then, I split the data into 80% training data and 20% testing data. The split is crucial in ML because a model must not be evaluated on data it has already seen during training. I used the median strategy on the dataset, which fills each missing value with the median of its column so that the data has no empty cells. I scaled the data only for the linear regression model, as random forests split on thresholds and do not need scaled features. I trained each model, stored its predictions in variables such as y_lr_train_pred, and evaluated the performance of the linear regression model and the random forest model using R2 and MSE.
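The cleaning and “cheater” removal steps above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's actual notebook: only aGvHDIIIIV and time_to_aGvHD_III_IV are names from this write-up, while recipient_age, donor_age, the "?" placeholder, and all the values are assumptions made up for the example.

```python
import pandas as pd

# Toy stand-in for the real dataset (all values are invented).
raw = pd.DataFrame({
    "recipient_age": ["25yo", "7", "?", "12", "3", "15"],  # hypothetical column
    "donor_age": ["30", "41", "28", "?", "35", "22"],      # hypothetical column
    "time_to_aGvHD_III_IV": ["20", "90", "15", "95", "99", "32"],
    "aGvHDIIIIV": ["1", "0", "1", "0", "0", "1"],
})

# errors="coerce" turns anything that is not a number ("?", "25yo") into NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

# Correlation of every feature with the target, strongest first.
target = "aGvHDIIIIV"
corr_with_target = clean.corr()[target].drop(target)
print(corr_with_target.abs().sort_values(ascending=False))

# Drop columns that are only known after the outcome has been observed.
leaky_columns = ["time_to_aGvHD_III_IV"]
features = clean.drop(columns=leaky_columns + [target])
labels = clean[target]
print(features.columns.tolist())
```

On the real data, the list of leaky columns would hold the features that were actually removed in this project.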
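Finally, I compared the R2 and MSE values of both models and found that the random forest works better in this case. I created a plot with a trend line and jitter (small random offsets that keep overlapping points visible) to show how closely my model's predictions match the actual outcomes.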
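The split, median imputation, scaling, and evaluation steps in this section could look something like the sketch below. It runs on randomly generated data rather than the real dataset, and it assumes scikit-learn's LinearRegression and RandomForestRegressor as the two models. The settings (100 trees, the random seeds) are illustrative choices, not the project's actual values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 patients, 5 features, a 0/1 outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out about 10% of the cells

# 80% training data, 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Median imputation for both models; scaling only for the linear model.
lr = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LinearRegression()
)
rf = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=100, random_state=42),
)

results = {}
for name, model in {"linear_regression": lr, "random_forest": rf}.items():
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    results[name] = {
        "train_r2": r2_score(y_train, y_train_pred),
        "test_r2": r2_score(y_test, y_test_pred),
        "test_mse": mean_squared_error(y_test, y_test_pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Putting the imputer and scaler inside a pipeline means the medians and scaling statistics are learned from the training split only, so nothing from the testing split leaks into training.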
Analysis
Initially, my model was overfit: it scored far better on the training data than on the testing data. This essentially means that the Machine Learning Model memorized the training data and recited it instead of learning patterns that carry over to new patients. When predicting, the model relied solely on the time_to_aGvHD_III_IV feature, meaning that it was taking a shortcut. To fix the overfit, I cut three features that were allowing the model to cheat, because they are only known after GvHD has already been checked for. I believe that the Machine Learning Model that I built is a good baseline that can be expanded and made more complex in order to achieve higher accuracy. The linear regression model failed with a negative R2, which means it predicted worse than simply guessing the average outcome for every patient. This suggests that the biology is complex and not linear. In this case, the random forest worked much better. While an R2 of 15.5% (0.155) may be low in a laboratory setting, it is a considerable amount given the real-world noise and variation in this dataset. There is also an outlier in my dataset and graph. It is a false negative, which I believe is due to unusual biological circumstances.
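One way to spot a shortcut feature like the one described above is to look at a random forest's feature importances. The sketch below uses made-up data in which a hypothetical column, post_outcome_leak, is almost a copy of the target. All feature names and numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 300

# Three "honest" features that are known before the outcome.
honest = rng.normal(size=(n, 3))
y = (honest[:, 0] + rng.normal(size=n) > 0).astype(float)

# A hypothetical leaky feature: recorded after the outcome, so it
# is nearly identical to the target itself.
leak = y + 0.01 * rng.normal(size=n)

X = np.column_stack([honest, leak])
names = ["feat_a", "feat_b", "feat_c", "post_outcome_leak"]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by how much the forest relies on them.
ranked = sorted(
    zip(names, model.feature_importances_), key=lambda p: p[1], reverse=True
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# Refit without the leaky column so the model must use real signal.
honest_model = RandomForestRegressor(n_estimators=100, random_state=0)
honest_model.fit(honest, y)
```

When one feature takes nearly all of the importance and it would not be available before the outcome is known, that is a strong hint the model is cheating rather than learning.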
Conclusion
ML models can assist doctors and healthcare professionals with decision-making in a clinical setting. They can significantly decrease the time needed to assess whether a patient is suitable for a treatment and help limit human mistakes. In the future, I want to apply a threshold that decides when to report a positive or negative result to doctors. For example, I can set a threshold of 0.5, and any score over 0.5 would be considered severe. This way, patients can get both a severity score for GvHD and a classification output. Other ways to improve the model are to filter out features with low impact or with many missing values, or to use K-Fold Cross-Validation to measure accuracy more reliably and catch overfitting.
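The two future steps mentioned above, a 0.5 threshold and K-Fold Cross-Validation, might be sketched like this. The classify helper, the label strings, the five folds, and the synthetic data are all hypothetical choices for the example and not part of the current project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def classify(scores, threshold=0.5):
    """Turn severity scores into labels; only scores over the threshold are severe."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > threshold, "severe", "not severe")


# A score of exactly 0.50 is not over the threshold, so it is "not severe".
labels = classify([0.12, 0.50, 0.51, 0.93])
print(labels)

# Synthetic stand-in data for the cross-validation example.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + rng.normal(size=150) > 0).astype(float)

# 5-fold cross-validation: every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="r2",
)
print(fold_r2, fold_r2.mean())
```

Averaging R2 over several folds gives a steadier picture than a single 80/20 split, which matters for a small dataset where one unlucky split can swing the score.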
Citations
AboutKidsHealth. (2010, March 6). Maladie du greffon contre l’hôte après une greffe de sang et de moelle osseuse [Graft-versus-host disease after a blood and marrow transplant]. https://www.aboutkidshealth.ca/fr/santeaz/haematology/maladie-du-greffon-contre-lhote-apres-une-greffe-de-sang-et-de-moelle-osseuse/?language=en#:~:text=Key%20points,it%20is%20called%20acute%20GVHD.
Bone Marrow Transplant Acute Graft vs. Host Disease. (n.d.). https://www.nationwidechildrens.org/family-resources-education/health-wellness-and-safety-resources/helping-hands/bone-marrow-transplant-acute-graft-vs-host-disease#:~:text=After%20someone%20has%20a%20bone,someone%20other%20than%20the%20patient
Data Professor. (2022, May 27). Build your first machine learning model in Python [Video]. YouTube. https://www.youtube.com/watch?v=29ZQ3TDGgRQ
Goldsmith, S. R., Ghobadi, A., Dipersio, J. F., Hill, B., Shadman, M., & Jain, T. (2022). Chimeric Antigen Receptor T Cell Therapy versus Hematopoietic Stem Cell Transplantation: An Evolving Perspective. Transplantation and Cellular Therapy, 28(11), 727–736. https://doi.org/10.1016/j.jtct.2022.07.015
Google. (Year). Gemini (Version) [Large language model]. https://gemini.google.com/
Tech With Tim. (2025, August 6). Learn Pandas in 30 minutes - Python Pandas tutorial [Video]. YouTube. https://www.youtube.com/watch?v=EXIgjIBu4EU
UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/565/bone+marrow+transplant+children
Acknowledgement
I want to acknowledge my director Alexa, who guided me through everything from ideation to final results. I also want to thank Ms. Lai, who has done an amazing job running Science Club, and my parents, who have helped me through all the ups and downs. Additionally, I would like to acknowledge the kind judges who volunteered to judge my project.
