Predicting Thyroid Cancer Recurrence with Machine Learning
Emaan Khosa, Gurshaan Singh
Grade 9
Hypothesis
Advances in machine learning offer promising avenues for improving healthcare, particularly by providing more accurate and effective prognosis techniques for cancer. In this project, we hypothesize that the three most influential factors in predicting thyroid cancer recurrence are the stage at which the cancer is initially diagnosed, the patient’s response to treatment, and demographic factors such as age and gender. By analyzing these factors, machine learning models can potentially improve the accuracy and timeliness of thyroid cancer recurrence predictions, providing valuable insights to healthcare professionals and demonstrating the practical applications of these models in everyday settings.
Research
What is Thyroid Cancer?
Thyroid cancer is a growth of cells that starts in the thyroid. The thyroid is a butterfly-shaped gland that produces hormones that regulate metabolism, heart rate, blood pressure, body temperature and weight.
Thyroid cancer might not cause any symptoms at first, but as it grows, it can cause signs and symptoms, such as swelling in the neck, voice changes and difficulty swallowing.
Several types of thyroid cancer exist. Most types grow slowly, though some can be very aggressive. Most thyroid cancers can be cured with treatment.
While the exact cause of thyroid cancer is not fully understood, certain risk factors may increase the likelihood of developing the disease, including a family history of thyroid cancer, exposure to radiation, or certain genetic conditions.
Treatment options for thyroid cancer generally include surgery to remove the thyroid, radioactive iodine therapy, and sometimes external radiation or chemotherapy, depending on the type and stage of cancer.
Where is the Thyroid Gland Located?
The thyroid gland is found at the front of the neck, under the voice box. It is butterfly-shaped, and the two lobes on either side lie against and around the windpipe (trachea). They are connected at the front by a narrow strip of tissue known as the isthmus.
How common is Thyroid Cancer?
According to the Cleveland Clinic, close to 53,000 Americans receive a thyroid cancer diagnosis every year. Treatments for most thyroid cancers are very successful. Still, about 2,000 people die from the disease every year.
Gender Differences: It is more common in women than men, with women being 3 times more likely to be diagnosed.
Age Group: It is often diagnosed in younger and middle-aged adults, with a peak incidence in individuals between 30 and 50 years old.
The frequency of thyroid cancer also depends on the type: papillary is the most common, followed by follicular and medullary, with anaplastic being the least common.
What are the types of Thyroid Cancer?
Types of thyroid cancer include:
Papillary Thyroid Cancer: This is the most common type of thyroid cancer, making up about 80% of cases, according to the Cleveland Clinic. It grows slowly and often spreads to the lymph nodes in the neck. However, it responds very well to treatment, making it highly curable and rarely life-threatening.
Follicular Thyroid Cancer: This type accounts for around 15% of thyroid cancer cases. It is more likely to spread to other parts of the body, like the bones or lungs. When it spreads, it can be harder to treat.
Medullary Thyroid Cancer: About 2% of thyroid cancer cases are this type. For some people, it runs in families due to a genetic mutation.
Anaplastic Thyroid Cancer: This is a rare and aggressive form of thyroid cancer, making up about 2% of cases. It grows quickly and often spreads to nearby tissues or other parts of the body, making it very difficult to treat.
What are the Warning Signs of Thyroid Cancer?
Thyroid cancer can cause any of the following signs or symptoms:
A lump in the front or side of the neck, sometimes growing quickly
Swelling in the neck
Pain in the front of the neck, sometimes going up to the ears
Hoarseness or other voice changes that do not go away
Trouble swallowing
Trouble breathing
A constant cough that is not due to a cold
Lumps in the thyroid are common and are usually not cancer. Still, if you have any of these symptoms, it’s important to see a doctor so the cause can be found and treated, since thyroid cancer is highly treatable in early stages.
What are the Signs that Thyroid Cancer has Spread?
When thyroid cancer spreads beyond the thyroid gland, it may cause additional symptoms depending on where the cancer has spread. The most common places for thyroid cancer to spread are the lymph nodes, lungs, and bones. Here are the signs that thyroid cancer may have spread:
Fatigue
Loss of appetite
Nausea and vomiting
Unexpected weight loss
What is Artificial Intelligence and Machine Learning?
AI refers to systems designed to perform tasks that require human-like intelligence, such as learning, reasoning, and decision-making.
It involves simulating human cognitive functions, enabling machines to handle tasks typically requiring human intervention.
Machine learning is a subset of AI focused on algorithms that allow systems to learn from data and improve over time.
Unlike traditional programming, machine learning models learn patterns from data to make predictions or decisions.
AI and machine learning are not explicitly programmed for every scenario, but rather improve based on data-driven insights.
How does Machine Learning Work?
Machine learning is a subset of artificial intelligence that uses datasets to gain insights and predict future values.
Data Collection: The quality of the data determines the accuracy of the predictions. The data could be images, text, numbers, or even sensory data. In general, the more high-quality data, the better.
Preprocessing: Before feeding the data into a model, we pre-process it to remove duplicates, handle missing values, deal with outliers, and standardize formats. This enhances the quality of the dataset and improves accuracy by addressing possible sources of error before modeling.
Choosing a Model: A model is a mathematical framework that learns from data. There are different types of models (e.g., decision trees, neural networks, support vector machines) that are chosen depending on the type of problem you're solving (classification, regression, clustering, etc.).
Model Training: The model learns by looking at examples in the training data. It tries to understand patterns and relationships in the data. We usually divide our dataset into two parts: training and testing sets.
Model Evaluation: Once trained, the model is tested on new, unseen data to check how well it performs. Metrics like accuracy, precision, and recall are used to measure performance.
Prediction or Decision Making: After the model is trained and validated, it can make predictions or decisions based on new data.
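To make the evaluation step more concrete, here is a minimal sketch of how accuracy, precision, and recall can be computed from predictions. The labels below are made-up values for illustration only (1 = recurred, 0 = did not recur), and the helper name confusion_counts is our own, not part of any library.

```python
# Hypothetical true labels and model predictions (made-up values).
# 1 = cancer recurred, 0 = cancer did not recur.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted recurrences that were real
recall = tp / (tp + fn)             # share of real recurrences that were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In a medical setting, recall is often watched closely, because a missed recurrence (a false negative) can be more costly than a false alarm.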
Types of Machine Learning
Supervised Learning: In supervised learning, the model is trained on labeled data, where both input data (features) and the correct output (labels) are provided.
Unsupervised Learning: In unsupervised learning, the model is given data without labeled outputs. The goal is to identify underlying patterns or structures within the data.
Reinforcement Learning: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties.
Semi-Supervised Learning: Combines both labeled and unlabeled data. The model uses the small amount of labeled data to guide learning from a larger pool of unlabeled data.
Self-Supervised Learning: A type of unsupervised learning where the data itself provides the supervision.
Why is Machine Learning Important?
Improving Decision Making: ML can help make more accurate predictions, guiding better decision-making. For example, in healthcare, it can predict disease outbreaks or assist doctors in diagnosing conditions more accurately.
Automation: ML allows automation of repetitive tasks, which increases efficiency and reduces human error. This is seen in many industries, for example in self-driving cars.
Innovation: Machine learning has unlocked possibilities in fields like healthcare, autonomous systems, robotics, and even art. Researchers are using it to create new medicines, scientists are leveraging it for climate modeling, and artists are using it for creating innovative digital works.
Enhancing Security: In the financial sector, ML is essential for detecting fraudulent activities. By analyzing transaction patterns, ML systems can identify anomalies that may indicate fraud, enabling quicker responses and reducing financial losses. ML is also pivotal in cybersecurity, where it helps detect and respond to threats in real time.
Pros and Cons of Machine Learning
Pros of Machine Learning:
Automation of Tasks – Machine learning can handle repetitive tasks quickly and accurately, reducing human effort and saving time.
Improved Decision Making – AI can analyze large amounts of data to find patterns and provide insights that help in making better decisions.
Personalization – Machine learning allows businesses to offer customized experiences, such as recommending products or services based on user preferences.
Cons of Machine Learning:
Data Dependence – Machine learning models need a lot of high-quality data to work well, and poor data can lead to inaccurate results.
Lack of Transparency – AI systems can be complex and difficult to understand, making it hard to explain how they make decisions.
High Costs – Developing and maintaining machine learning systems can be expensive, requiring skilled professionals and powerful computing resources.
Thyroid Cancer Recurrence: Why does it Return?
People with thyroid cancer typically have high survival rates and a positive outlook when their condition is diagnosed early, but the cancer can sometimes recur. There are several reasons why thyroid cancer can return:
If the initial surgery didn't remove all of the cancerous thyroid tissue or any cancerous lymph nodes, some cancer cells can remain in the body. These remaining cells can grow and cause recurrence.
Different types of thyroid cancer behave differently. For example, papillary thyroid cancer, the most common type, usually grows slowly and often has a very good prognosis, but it can recur even years after initial treatment.
Even when thyroid cancer is undetectable on scans after surgery, tiny, microscopic clusters of cancer cells might remain in the body. These cells can eventually grow and lead to a recurrence over time.
Recurrence is more likely if you have a more aggressive type of thyroid cancer, a tumor that wasn’t completely removed during surgery, or cancer that has spread beyond the thyroid gland.
How is Thyroid Cancer Recurrence Monitored?
Blood tests to check levels of thyroglobulin (Tg), thyroglobulin antibodies (TgAb), thyroid-stimulating hormone (TSH), thyroxine (T4), and triiodothyronine (T3). In the case of medullary thyroid cancer, additional tests can include calcium, calcitonin, and carcinoembryonic antigen (CEA).
Levels of thyroglobulin, a protein produced by thyroid cells, are often monitored after treatment. Elevated levels may indicate a recurrence, particularly in patients who have had their thyroid removed.
Ultrasound of the neck to check for a local recurrence or spread of the cancer to lymph nodes
Radioactive iodine scan to monitor response to radioactive iodine (RAI) therapy
CT scan or MRI of the neck or chest to look for cancer that has recurred or spread
Lifelong follow-up with an endocrinologist or oncologist is often recommended to monitor for recurrence and to manage any long-term side effects of treatment.
Variables
Controlled
- Data preprocessing techniques and the functions that contain the Python code
- The Random Forest model
- The technique used to split the data into training and testing sets
Manipulated
- The specific patient factors being analyzed, such as age, tumor size, or hormone levels
- The number of features selected (top three predictive factors)
Responding
- Factors influencing thyroid cancer recurrence
Procedure
- Create a shared Python environment on Google Colab. Download the thyroid_cancer_recurrence.ipynb file and thyroid_data.csv data from Science Buddies.
- Run the first code block to import libraries. Load the data into a Pandas DataFrame.
- Using the pandas read_csv function, load and preprocess the thyroid_data.csv file to interpret all patient records.
- Use label encoding to convert categorical variables (e.g., smoking, gender) into numerical values.
- Use one-hot encoding for non-ordinal variables (e.g., Thyroid Function).
- Separate the dataset into input features and the target (Recurred). Split data into training and testing sets. ‘Target’ contains the ‘Recurred’ column, which tells us whether a patient had recurring thyroid cancer or not.
- Train a Random Forest classifier using the training set. Evaluate accuracy using the test set and visualize feature importance with decision trees and bar plots.
- Select the top 3 features based on importance and repeat steps to preprocess, split, train, and evaluate the model. Compare the accuracy of the simplified model to the original.
- Explore decision trees and analyze key predictors.
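The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Science Buddies notebook itself: the data is a small synthetic stand-in for thyroid_data.csv, and the column names, category values, and model settings (such as n_estimators=100 and random_state=42) are our own choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for thyroid_data.csv. Column names and values are
# illustrative assumptions, not the real dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Gender": rng.choice(["F", "M"], size=n),
    "Smoking": rng.choice(["No", "Yes"], size=n),
    "Thyroid Function": rng.choice(
        ["Euthyroid", "Hyperthyroid", "Hypothyroid"], size=n),
    "Response": rng.choice(["Excellent", "Incomplete"], size=n),
})
# Make the made-up target depend mostly on "Response", with 10% noise.
noise = rng.random(n) < 0.1
df["Recurred"] = np.where((df["Response"] == "Incomplete") ^ noise, "Yes", "No")

# Label encoding: turn two-category text columns into 0/1 numbers.
for col in ["Gender", "Smoking", "Response", "Recurred"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# One-hot encoding: one column per category for a non-ordinal variable.
df = pd.get_dummies(df, columns=["Thyroid Function"])

# Separate the input features from the target, then split 80/20.
X = df.drop(columns=["Recurred"])
y = df["Recurred"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train on all features and evaluate on the held-out test set.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
full_acc = accuracy_score(y_test, model.predict(X_test))

# Rank features by importance and keep the top three.
importances = pd.Series(model.feature_importances_, index=X.columns)
top3 = importances.sort_values(ascending=False).head(3).index.tolist()

# Retrain on the top three features only and compare accuracy.
small_model = RandomForestClassifier(n_estimators=100, random_state=42)
small_model.fit(X_train[top3], y_train)
small_acc = accuracy_score(y_test, small_model.predict(X_test[top3]))

print("Top 3 features:", top3)
print(f"All features: {full_acc:.3f}  Top 3 only: {small_acc:.3f}")
```

Because the data here is random, the printed accuracies say nothing about real patients. The point is only the order of the steps: encode, split, train, rank features, then retrain on the reduced set.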
Observations
During our experiment, we observed two issues that can affect a single Decision Tree:
Overfitting: Decision Trees can memorize the training data too well, which hurts their performance on new data.
Feature Bias: Decision Trees can be biased toward features with more categories, even if they're not important.
How we addressed these issues by using a Random Forest:
Reduce Overfitting: By averaging predictions from many trees, Random Forests are less likely to overfit.
Avoid Feature Bias: Random Forests consider a random subset of features at each split, so no single feature dominates the decision-making.
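The difference between a single Decision Tree and a Random Forest can be explored with a short sketch like the one below. It uses randomly generated data from scikit-learn's make_classification rather than the thyroid dataset, and the settings (400 samples, 19 features, 10% label noise, 200 trees) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), so memorizing is harmful.
X, y = make_classification(
    n_samples=400, n_features=19, n_informative=5,
    flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A single, fully grown tree can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest averages many trees, each split looking at a random subset
# of features (max_features="sqrt").
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=0
).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_train = forest.score(X_train, y_train)
forest_test = forest.score(X_test, y_test)

print(f"Decision Tree  train={tree_train:.3f}  test={tree_test:.3f}")
print(f"Random Forest  train={forest_train:.3f}  test={forest_test:.3f}")
```

A large gap between training and test accuracy is the usual sign of overfitting. On noisy data like this, the single tree typically shows a bigger gap than the forest, although the exact numbers depend on the data and the random seed.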
Analysis
Data Splitting
To train and evaluate the model, the dataset was split into training (80%) and testing (20%) sets:
Training set: 306 samples, 19 features
Testing set: 77 samples, 19 features
This ensures that the model learns from the majority of the data while preserving a separate set for evaluation.
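The 306/77 split can be checked with a little arithmetic. The sketch below assumes the split rounds the test set up to a whole number of patients and gives the remainder to training, which is our understanding of how scikit-learn's train_test_split treats a fractional test_size. The variable names are our own.

```python
import math

n_samples = 306 + 77      # total patient records implied by the two sets
test_fraction = 0.20      # 20% held out for testing

# 20% of 383 is 76.6, which cannot be a number of patients,
# so the test set is rounded up and training gets the rest.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_samples, n_train, n_test)
print(f"actual test share: {n_test / n_samples:.3f}")
```

This is why the test set is slightly larger than exactly 20% of the data.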
After training the Random Forest model on all 19 features, we evaluated it on the test dataset (77 samples) to measure its performance in predicting thyroid cancer recurrence. The model achieved an accuracy score of 98.7%, meaning it correctly predicted cancer recurrence or non-recurrence for 98.7% of the test cases. This high accuracy suggests that the model generalizes well to unseen data and distinguishes between patients who experienced recurrence and those who did not.
Evaluating the Model
After training the Random Forest model using only the top three features—Response to Treatment, Nodes, and Adenopathy—the model was evaluated using the test dataset of 77 samples. The model achieved an accuracy score of 0.974 (97.4%), meaning it correctly predicted cancer recurrence or non-recurrence for 97.4% of the test cases. This is a slightly lower accuracy compared to the full 19-feature model, but it still shows strong performance in predicting thyroid cancer recurrence based on the most critical factors.
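To see what these percentages mean in terms of patients, the accuracy figures can be converted back into counts. The counts below are inferred from the rounded percentages and the 77-sample test set; they were not reported directly, so treat them as an assumption.

```python
n_test = 77  # size of the test set

# Convert each reported accuracy back into a number of correct predictions.
full_correct = round(0.987 * n_test)   # model trained on all 19 features
top3_correct = round(0.974 * n_test)   # model trained on the top 3 features

full_acc = full_correct / n_test
top3_acc = top3_correct / n_test

print(f"All features: {full_correct}/{n_test} = {full_acc:.3f}")
print(f"Top 3 only:   {top3_correct}/{n_test} = {top3_acc:.3f}")
print("Difference in correct predictions:", full_correct - top3_correct)
```

Under this reading, the gap between the two models comes down to a single test case, which is worth keeping in mind when comparing them on a test set this small.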
This result highlights the model's ability to make accurate predictions even with a reduced set of features, which matters from a practical standpoint. By focusing on these top three features, healthcare professionals could reduce the need for additional tests, ultimately easing the burden of unnecessary procedures on patients.
Conclusion
In conclusion, this project used machine learning to predict thyroid cancer recurrence and identify the most influential factors. Our findings revealed that response to treatment, number of affected lymph nodes (N), and adenopathy were the top predictors, showing the significance of follow-up care. While our hypothesis included stage at diagnosis, response to treatment, and demographic factors such as age and gender, the results showed that lymph node involvement and adenopathy were more significant than age and gender. These insights can help healthcare professionals improve patient evaluation. Moreover, ML models like Random Forest offer promising applications in improving cancer prognosis and patient outcomes. Future work could expand on this research by incorporating larger datasets and testing additional models for enhanced accuracy. Overall, this project showcases the potential of machine learning-based approaches in advancing healthcare.
Application
- Early Detection and Screening – AI models help detect cancer at early stages by analyzing medical images and identifying high-risk patients based on clinical data.
- Medical Imaging Analysis – Machine learning enhances the accuracy of radiology scans (e.g., CT, MRI, and X-rays) by identifying patterns and distinguishing between non-cancerous and cancerous tumors.
- Histopathological Analysis – Deep learning algorithms assist in analyzing tissue samples, improving the detection of cancerous cells and reducing diagnostic errors.
- Genomics and Precision Medicine – AI analyzes genetic data to identify cancer-associated mutations, enabling personalized treatment plans and targeted therapies.
- Prognosis and Risk Assessment – Machine learning models predict cancer recurrence and survival rates by analyzing patient records, lifestyle factors, and treatment responses.
- Clinical Decision Support Systems (CDSS) – AI-powered tools support oncologists by integrating patient data, providing treatment recommendations, and improving decision-making accuracy.
Sources Of Error
- Because we were new to the Python environment and short on time, our model’s performance was not evaluated using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) score. The ROC and AUC evaluate a model’s performance across all possible classification thresholds.
- Furthermore, the model might perform differently in real-life situations than in the experiment, as it may be too simple to capture them. More complexity may be needed.
- We did not manually check the output of our machine learning models, which could have helped confirm whether the model's results are accurate and reliable.
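As a pointer for future work, here is a small sketch of what the AUC score measures. It is written in plain Python so the idea is visible: the AUC is the chance that a randomly chosen patient who recurred gets a higher predicted score than a randomly chosen patient who did not. The function name roc_auc and the labels and scores are made-up for illustration. In practice, scikit-learn's roc_auc_score would normally be used instead.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label.")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = recurred) and predicted recurrence probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 1.0 would mean every recurring patient is scored above every non-recurring patient, while 0.5 is no better than guessing. Unlike accuracy, this does not depend on choosing one particular threshold.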
Citations
- https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p020/artificial-intelligence/thyroid_cancer#:~:text=Random%20Forests%20help%20reduce%20overfitting,will%20have%20a%20recurring%20thyroid%20cancer
- https://my.clevelandclinic.org/health/diseases/12210-thyroid-cancer
- https://www.mayoclinic.org/diseases-conditions/thyroid-cancer/symptoms-causes/syc-20354161
- https://www.cancerresearchuk.org/about-cancer/thyroid-cancer/stages-types/types
- https://cancer.ca/en/cancer-information/cancer-types/thyroid/staging#:~:text=The%20most%20common%20staging%20system,more%20the%20cancer%20has%20spread
- https://www.cancer.org/cancer/types/thyroid-cancer/detection-diagnosis-staging/signs-symptoms.html
- https://www.ncbi.nlm.nih.gov/books/NBK279388/#:~:text=The%20thyroid%20gland%20is%20found,tissue%20known%20as%20the%20isthmus.&text=The%20thyroid%20typically%20weighs%20between%2020%20and%2060%20grams
- https://myhealth.alberta.ca/Health/pages/conditions.aspx?hwid=aa125286
- https://cancer.ca/en/cancer-information/cancer-types/thyroid/treatment/follow-up
- https://www.geeksforgeeks.org/how-does-machine-learning-works/
- https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-machine-learning
- https://cancer.ca/en/cancer-information/cancer-types/thyroid/treatment/hormone-therapy#:~:text=Hormone%20therapies%20used%20for%20thyroid%20cancer&text=Taking%20levothyroxine%20makes%20sure%20there,a%20pill%20once%20a%20day.
- https://pmc.ncbi.nlm.nih.gov/articles/PMC6381772/
- https://www.canada.ca/content/dam/phac-aspc/migration/phac-aspc/publicat/hpcdp-pspmc/34-1/assets/pdf/CDIC_MCC_Vol34_1_9_Shaw_E.pdf
- https://pubmed.ncbi.nlm.nih.gov/34782918/
- https://www.aitude.com/decision-tree-vs-random-forest-in-machine-learning/#:~:text=Decision%20trees%20are%20simple%20but,reduces%20the%20chances%20of%20overfitting
- https://www.cancernetwork.com/view/monitoring-thyroid-cancer-recurrence
- https://www.frontiersin.org/journals/surgery/articles/10.3389/fsurg.2022.862322/full
- https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-023-04698-z
- https://www.thelancet.com/journals/landig/article/PIIS2589-7500(21)00041-8/fulltext
- https://www.cancerresearchuk.org/about-cancer/thyroid-cancer/stages-types/tnm-staging
Acknowledgement
We would like to acknowledge the following people for their support and guidance throughout our science fair project. Hargun assisted us in our overall analysis for the project, helping us refine our approach and ensuring our project was well-structured. Mrs. Fauzia provided valuable feedback on our procedure, guiding us in improving our research and data analysis. Mrs. Preetpal reviewed our work and offered constructive feedback, which helped us enhance both our project and presentation. Their support and guidance were significant in helping us successfully complete our project, and we truly appreciate their time and effort.