Machine Learning vs. Thyroid Cancer: Predicting Recurrence

My science fair project aims to predict the recurrence of thyroid cancer in patients who are in remission by utilizing machine learning, specifically a Random Forest model. The project involves preprocessing a dataset containing various risk factors.
Vasundhara Joshi
Grade 10

Problem

Question: 

How can machine learning and artificial intelligence (AI) technologies be utilized to develop a predictive model for thyroid cancer recurrence by analyzing clinical, pathological, and molecular data to identify patterns and improve patient outcomes?

Purpose:

The goal of this project is to develop a Random Forest model to predict the likelihood of thyroid cancer recurrence by analyzing relevant risk factor data. By identifying key factors that influence recurrence, the model aims to optimize follow-up care and improve patient outcomes through early detection and more targeted testing.

 

Research: (From Sources)

  •  I learned that thyroid cancer, though generally treatable, can recur even after successful treatment, highlighting the importance of regular follow-up appointments for early detection. Factors like patient age, cancer stage, and cancer type can influence the likelihood of recurrence, making tailored follow-up care essential to avoid unnecessary procedures. Machine learning, particularly Random Forest models, can help predict recurrence by analyzing key risk factors, guiding more personalized care. Additionally, AI and Decision Trees mimic human decision-making, serving as powerful tools for classifying data and improving healthcare predictions.
  • Thyroid cancer occurs when mutations in the DNA of thyroid cells lead to uncontrolled growth, forming tumors that can spread to lymph nodes and other parts of the body, such as the lungs, bones, liver, brain, and skin. The exact causes of these mutations are not well understood, but several risk factors can increase the likelihood of developing thyroid cancer, such as female sex, exposure to radiation, and certain inherited genetic syndromes. These genetic syndromes, such as familial medullary thyroid cancer and multiple endocrine neoplasia, increase the risk of thyroid cancer and other types of cancers.
  • There are different types of thyroid cancer, including papillary, follicular, Hurthle cell, anaplastic, and medullary thyroid cancers, each with distinct characteristics, behaviors, and prognoses. Some types, like papillary and follicular thyroid cancers, tend to have a favorable prognosis and respond well to treatment, while others, like anaplastic and poorly differentiated thyroid cancers, are more aggressive and difficult to treat. Recurrence is more likely with aggressive cancers or if they have spread beyond the thyroid gland. Thyroid cancer that recurs can often be treated successfully, and most patients will have positive outcomes, especially if detected early.
  • Follow-up care, including periodic blood tests, thyroid scans, and imaging, is essential for detecting recurrence or metastasis. Symptoms of recurrence may include neck pain, lumps in the neck, difficulty swallowing, hoarseness, or changes in voice. While there is no clear way to prevent thyroid cancer in those with average risk, individuals at high genetic risk, such as those with inherited gene mutations, may consider prophylactic thyroid surgery. Additionally, people living near nuclear power plants might be given potassium iodide to block radiation effects on the thyroid in the case of an accident. Preventive measures for high-risk individuals are crucial for reducing the likelihood of developing thyroid cancer.
  • AI is revolutionizing health care by improving diagnosis, creating personalized treatment plans, and enhancing patient outcomes. Key AI technologies in health care include machine learning, deep learning, natural language processing, and robotic process automation, which help analyze big data, interpret medical documentation, automate workflows, and assist in surgeries. AI applications include healthcare analytics, precision medicine, disease prediction, and interpreting medical tests like MRIs and X-rays. In mental health care, AI could help by identifying patterns in patient data to assist with diagnoses and monitor patient well-being, offering valuable support to health care professionals without replacing them.
  • Machine learning (ML) is a branch of artificial intelligence (AI) that allows computers and machines to mimic human learning, perform tasks autonomously, and improve their performance and accuracy by learning from experience and data exposure.

Hypothesis: 

If machine learning and AI technologies are applied to clinical, pathological, and molecular data from thyroid cancer patients, then a predictive model can be developed that accurately identifies patterns associated with cancer recurrence, leading to improved prediction accuracy and more personalized treatment strategies for better patient outcomes.

 

Variables

Manipulated

- The type of data used to train the model (clinical, pathological, and molecular data).

Responding

- The likelihood or probability of cancer recurrence

Controlled

- Patient demographics (e.g., age, sex) or treatment methods (e.g., surgery type, radiation therapy) can be controlled to prevent them from skewing the analysis.

- Data quality, model settings, and training/testing procedures

Method

Procedure: Thyroid Cancer Recurrence Analysis Using Python on Replit

Materials Needed

  • A computer with internet access
  • A Replit account (free to create)
  • The thyroid_data (1).csv file
  • Python programming knowledge (basic understanding)

Setting Up the Replit Environment

Create a Replit Account

  1. Open a web browser (Google Chrome, Mozilla Firefox, Microsoft Edge, Safari).
  2. Go to Replit.
  3. If you don’t have an account, click Sign Up and create one using an email or Google account.
  4. If you already have an account, click Log In and enter your credentials.

Create a New Python Project

  1. After logging in, you will be on the Replit dashboard.
  2. Click on the “+ Create Repl” button (blue button at the top left).
  3. In the “Create a Repl” window:
    • Language: Type "Python" and select it.
    • Title: Name your project "Thyroid_Cancer_Analysis".
    • Click Create Repl to set up the environment.

Download & Upload the Dataset

Download the Dataset

  1. Ensure that you have the thyroid_data (1).csv file stored on your computer.
  2. If you do not have the file, download it from your source (Science Buddies, Kaggle, or provided dataset).
  3. Locate the file in your Downloads folder to ensure it is ready for upload.

Upload the CSV File to Replit

  1. In Replit, look at the left sidebar and click on the “Files” tab (it looks like a folder icon).
  2. Click the "Upload File" button at the top of the Files section.
  3. Navigate to your thyroid_data (1).csv file on your computer.
  4. Select the file and upload it.
  5. Once uploaded, confirm that thyroid_data (1).csv appears in the Files tab.

Installing Required Libraries

Accessing the Shell

  1. In the Replit left sidebar, click on “Shell” (terminal icon).

  2. Type the following command to install necessary Python libraries and press Enter:

pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn

Understanding the Libraries Installed

  • pandas – Handles data processing (reading CSV files, organizing datasets).
  • numpy – Supports numerical calculations.
  • matplotlib & seaborn – Generate visualizations and graphs.
  • scikit-learn – Provides machine learning models and preprocessing tools.
  • imbalanced-learn – Fixes unbalanced datasets using SMOTE (Synthetic Minority Over-sampling Technique).

Opening the Main Python File

Locate and Open main.py

  1. In the Files tab, locate the file named main.py (automatically created by Replit).
  2. Click on it to open the code editor.
  3. Erase any existing code so you can start fresh.

Step 1: Loading & Preprocessing the Data

1.1 Import Required Libraries

  1. In main.py, type the following:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE 

1.2 Load the Dataset

  1. Add the following code to read the uploaded CSV file:
    csv_path = "thyroid_data (1).csv"
    df = pd.read_csv(csv_path) 

1.3 Check the Data Structure

  1. Type the following code:
    print(df.head())
  2. Verify that all columns are correctly loaded.

1.4 Check for Missing Values

  1. Run this code to check for missing values in the dataset:
    def check_missing_values(df):
        missing = df.isnull().sum()
        if missing.any():
            print("Missing Values Found:")
            print(missing[missing > 0])
        else:
            print("No missing values detected.")
    
    check_missing_values(df) 

Step 2: Encoding Categorical Variables

2.1 Convert Yes/No Variables to Numeric Values

  1. Encode binary categorical values into 0 (No) and 1 (Yes):
    df['Recurred'] = df['Recurred'].map({'No': 0, 'Yes': 1})
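
One thing to watch with .map is that any label missing from the dictionary (for example a lowercase "yes" or a value with a stray space) silently becomes NaN. The sketch below uses a made-up Series as a stand-in for the Recurred column; the messy values are hypothetical and only illustrate the check.

```python
import pandas as pd

# Toy stand-in for the real 'Recurred' column; the messy values are hypothetical.
recurred = pd.Series(["No", "Yes", " yes", "NO ", "No"])

# Mapping the raw labels leaves every unexpected spelling as NaN.
raw_unmapped = int(recurred.map({"No": 0, "Yes": 1}).isna().sum())

# Normalise whitespace and capitalisation first, then map.
cleaned = recurred.str.strip().str.capitalize()
encoded = cleaned.map({"No": 0, "Yes": 1})

# Any label the dictionary still does not cover would be counted here.
unmapped = int(encoded.isna().sum())
print("unmapped before cleaning:", raw_unmapped)
print("unmapped after cleaning: ", unmapped)
```

If the count after cleaning is not zero, printing the unique values of the column usually shows which label is unexpected.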
     

2.2 One-Hot Encode Multi-Category Columns

  1. Identify categorical columns that need encoding:
    categorical_columns = ['Gender', 'Smoking', 'Hx Smoking', 'Hx Radiotherapy', 
                          'Thyroid Function', 'Physical Examination', 'Adenopathy',
                          'Pathology', 'Focality', 'T', 'N', 'M', 'Stage', 'Response'] 
  2. Apply one-hot encoding:

df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
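
To make the effect of drop_first=True concrete, here is a minimal sketch on a toy table. Only the column names Age, Gender and N come from the procedure; the values and the extra "Notes" column are hypothetical. The last lines show one way to spot columns that are still text after encoding, since SMOTE and StandardScaler expect numeric input.

```python
import pandas as pd

# Toy rows with made-up values; 'Notes' is a hypothetical text column
# that was NOT included in the list of columns to encode.
toy = pd.DataFrame({
    "Age": [34, 51, 27],
    "Gender": ["F", "M", "F"],
    "N": ["N0", "N1b", "N1a"],
    "Notes": ["a", "b", "c"],
})

# drop_first=True removes the first category of each column (here F and N0),
# so that category becomes the baseline the other columns are compared to.
toy_encoded = pd.get_dummies(toy, columns=["Gender", "N"], drop_first=True)
print(toy_encoded.columns.tolist())

# Anything that is neither numeric nor boolean is still text.
leftover_text = toy_encoded.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print("Still non-numeric:", leftover_text)
```

Running the same select_dtypes check on df_encoded is a quick way to confirm that every categorical column in the CSV was listed in categorical_columns.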
 


Step 3: Splitting the Data for Machine Learning

3.1 Define Features and Target Variable

  1. Separate input features (X) from the target (y):
    X = df_encoded.drop('Recurred', axis=1)
    y = df_encoded['Recurred'] 

3.2 Handle Class Imbalance Using SMOTE

  1. Apply SMOTE to balance the dataset:
    smote = SMOTE(random_state=42)
    X, y = smote.fit_resample(X, y) 

3.3 Split the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 


Step 4: Training the Model

4.1 Standardize the Features

  1. Standardize the features using StandardScaler (the scaler is fitted on the training set only, then applied to the test set):
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test) 

4.2 Train the Random Forest Classifier

  1. Initialize and train the model:
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
     

Step 5: Evaluating Model Performance

5.1 Model Accuracy Score

  1. Compute and print the model’s accuracy:
    accuracy = rf.score(X_test, y_test)
    print(f"Model Accuracy: {accuracy * 100:.2f}%")
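
Accuracy alone can hide missed recurrences, so it may help to look at the confusion matrix, recall, and precision as well. The sketch below uses ten hypothetical patients with made-up labels: accuracy is 70%, yet half of the true recurrences are missed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten patients (1 = recurred, 0 = did not recur).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
# Recall: of the patients who truly recurred, how many did the model catch?
recall_demo = recall_score(y_true_demo, y_pred_demo)
# Precision: of the patients flagged as recurrence, how many really recurred?
precision_demo = precision_score(y_true_demo, y_pred_demo)
print(f"accuracy={accuracy_demo:.2f} recall={recall_demo:.2f} precision={precision_demo:.2f}")
```

In the project script, the predictions would come from rf.predict(X_test) and the true labels from y_test.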
     

5.2 Visualizing Feature Importance

  1. Generate a bar plot of the most important factors in predicting recurrence:
    feature_importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x=feature_importances[:10], y=feature_importances.index[:10])
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.title("Top 10 Factors Influencing Thyroid Cancer Recurrence")
    plt.savefig("feature_importance.png")
    plt.show()
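
The built-in feature_importances_ values are computed from the training data and can favor features with many distinct values. A complementary check, sketched here on synthetic stand-in data rather than the thyroid dataset, is scikit-learn's permutation importance: each column of the held-out data is shuffled in turn and the drop in accuracy is recorded. The data, sizes, and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; with shuffle=False the 2 informative features are columns 0 and 1.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out data and record the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

Features whose shuffled score barely changes contribute little to predictions on new data, even if their built-in importance is not zero.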
     

Running the Code

  1. Click "Run" (green button at the top).
  2. The script will execute and print the model accuracy.
  3. The feature importance graph will be generated.

Final Steps

Downloading Results

  1. Locate feature_importance.png in the Files tab.
  2. Click on it and select Download.

 

Overall Explanation:

This code analyzes a thyroid cancer dataset, with a focus on class imbalance, feature importance, and relationships between variables. It begins by importing the necessary libraries: pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for visualization, and scikit-learn and imbalanced-learn for modeling. The dataset is then loaded from a CSV file into a pandas DataFrame. A function checks for missing values and, if any are found, prints the count of missing values per column.

The target variable (Recurred) is mapped to binary values (0/1), and the remaining categorical columns are one-hot encoded with pd.get_dummies, which turns each category into its own binary column. To handle class imbalance, the code uses SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class. The data is then split into training and testing sets, and the features are standardized using StandardScaler.

A Random Forest model is trained on the training set, its accuracy is measured on the testing set, and the importance of each feature is visualized in a bar plot of the top 10 factors influencing thyroid cancer recurrence. The full script also creates a correlation heatmap of the features and histograms of the first five numerical features to show their distributions; that plotting code is not listed in the procedure above. The visualizations are saved as PNG files so the results are easy to interpret.

 

Analysis

1. Correlation Heatmap Analysis

The correlation heatmap provides an overview of the relationships between the variables in the dataset. Each cell shows the Pearson correlation coefficient between two features, with values ranging from -1 to 1. A coefficient close to 1 (dark red) indicates a strong positive correlation: as one variable increases, the other tends to increase as well. A coefficient near -1 (dark blue) indicates a strong negative correlation, where an increase in one variable tends to go with a decrease in the other. Correlation describes association only and does not show that one variable causes the other. The diagonal of the heatmap is uniformly red because every feature is perfectly correlated with itself. The heatmap reveals that while some features are moderately correlated, no extreme dependencies are observed, suggesting little risk of multicollinearity (which can distort feature importance scores and, in some models, hurt performance). However, certain stage-related features (T, N, M classifications) show moderate positive correlations, which is expected given their hierarchical relationship in cancer staging. Features directly related to thyroid function, pathology, and response to treatment show varied correlation levels with recurrence, which points to their potential predictive value. The "Recurred" variable, representing the recurrence of thyroid cancer, does not display strong correlations with any single feature, indicating that recurrence is likely influenced by a combination of factors rather than a single dominant predictor.
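
The plotting code for the heatmap is not part of the procedure, so the following is only a sketch of how such a figure could be produced. It uses a tiny made-up table whose column names imitate one-hot output, pandas' corr() (Pearson by default), and matplotlib's imshow with a red-to-blue colormap; seaborn's heatmap function would be an alternative. All values and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file; no display window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 0/1 columns named like one-hot output; the values are made up.
toy = pd.DataFrame({
    "N_N1b":         [1, 1, 0, 0, 1, 0],
    "Adenopathy_No": [0, 0, 1, 1, 0, 1],
    "Gender_M":      [0, 1, 0, 1, 0, 1],
    "Recurred":      [1, 1, 0, 0, 1, 0],
})

corr = toy.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))

fig, ax = plt.subplots(figsize=(5, 4))
# coolwarm maps +1 to red and -1 to blue, matching the description above.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson r")
fig.tight_layout()
fig.savefig("toy_correlation_heatmap.png")
```

In this toy table N_N1b matches Recurred exactly (r = 1) and Adenopathy_No is its exact opposite (r = -1), which makes the two ends of the color scale easy to see.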


2. Feature Importance Bar Plot Analysis

The feature importance bar plot, derived from a Random Forest model, highlights the top 10 most influential features in predicting thyroid cancer recurrence. The x-axis quantifies feature importance based on the Gini importance score, a measure of how much each variable contributes to reducing impurity in decision trees. The highest-ranking feature, Response_Excellent, shows that the response to initial treatment carries the most predictive weight. Importance scores do not indicate direction on their own, but an excellent response is clinically associated with a lower chance of recurrence, which reinforces the critical role of early treatment efficacy. The second most important factor, Structural Incomplete, indicates that residual thyroid tissue or an incomplete structural response after treatment is a major predictor of recurrence, aligning with clinical findings that incomplete resection or persistent disease increases the likelihood of recurrence. The presence or absence of adenopathy (lymph node involvement) also plays a crucial role, as shown by the high ranking of Adenopathy_No, which is consistent with patients without lymph node involvement having a lower recurrence risk. Age emerges as another key predictor, which is consistent with existing oncological research indicating that younger patients may have different recurrence patterns compared to older individuals. Features related to tumor size and staging (T_T2, T_T3a, and N_N1b) also rank high, in line with the established relationship between larger or more invasive tumors and recurrence risk. The inclusion of Focality and Multinodular goiter further suggests that specific structural characteristics of the thyroid gland influence recurrence, possibly due to their impact on surgical outcomes and treatment efficacy.
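
To show what "reducing impurity" means, here is a small worked sketch in plain Python. The ten patients and the split are hypothetical; the point is only the arithmetic behind a single split in one decision tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# Hypothetical node with 10 patients: 5 did not recur (0) and 5 recurred (1).
parent = [0] * 5 + [1] * 5
# Hypothetical split on one yes/no feature.
left = [0, 0, 0, 0, 1]
right = [0, 1, 1, 1, 1]

# Impurity of the two child nodes, weighted by how many patients each holds.
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
decrease = gini(parent) - weighted_children
print(f"parent={gini(parent):.2f} children={weighted_children:.2f} decrease={decrease:.2f}")
```

A Random Forest adds up these decreases for every split that uses a feature (weighted by the share of samples reaching that split), averages them over all trees, and scales the totals to sum to 1. The score therefore says how useful a feature is for separating the classes, not whether it raises or lowers the risk.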


3. Age Distribution Histogram Analysis

The histogram displaying age distribution provides insights into the demographics of the dataset, showing how often each age group appears in the study. The shape of the histogram suggests a right-skewed distribution, meaning there are more younger patients with a gradual decrease in frequency as age increases. The peak at around 30-40 years indicates that the largest share of patients in this dataset fall within this age range, which aligns with epidemiological data indicating that thyroid cancer is more common in younger to middle-aged adults, particularly women. The density curve overlay is smooth, with no significant gaps or abrupt changes in age representation. The long right tail (patients aged 60 and above) suggests that while thyroid cancer can occur at older ages, it is less frequent in this population. From a clinical standpoint, this distribution may also reflect screening and diagnostic trends, where younger individuals are more likely to be diagnosed due to routine checkups or incidental findings, whereas older patients may present with more advanced or recurrent disease. Additionally, the dataset’s age distribution can influence model performance, as age-dependent patterns might emerge in recurrence predictions, so age groups should be represented carefully when the data is split for model development.

4. Histogram of Gender_M

The histogram of the Gender_M variable illustrates the distribution of gender in the dataset, where 0 represents females and 1 represents males. The graph indicates a significant gender imbalance, with a much higher number of female patients compared to males. This aligns with existing medical research, which shows that thyroid cancer is more prevalent in women than men. The density curve further emphasizes this skewed distribution, peaking sharply at 0 and tapering off at 1. This imbalance suggests that any analysis or predictive modeling should account for the gender disparity to avoid biased conclusions.

5. Histogram of Hx Radiotherapy_Yes

The histogram of Hx Radiotherapy_Yes shows the distribution of patients based on whether they have a history of radiotherapy treatment, with 0 indicating no prior radiotherapy and 1 indicating a history of radiotherapy. The graph reveals that the majority of patients in the dataset have not undergone radiotherapy, as indicated by the dominant peak at 0. The density curve highlights this imbalance, with very few patients falling into the "Yes" category. Since prior radiation exposure is a known risk factor for thyroid cancer, the limited number of radiotherapy patients means that additional statistical considerations may be needed to accurately assess its impact on recurrence.

6. Histogram of Smoking_Yes

The Smoking_Yes histogram represents the distribution of smoking status among patients, where 0 corresponds to non-smokers and 1 represents smokers. The histogram shows a clear imbalance, with a significantly larger proportion of non-smokers compared to smokers. This is evident from the strong peak at 0 and the density curve, which quickly declines after 0 and remains low near 1. Given that smoking is a well-known risk factor for various cancers, the low number of smokers in the dataset could indicate a population with generally lower smoking rates or one that was more health-conscious. This imbalance should be considered when analyzing the effects of smoking on thyroid cancer recurrence.


7. Histogram of Hx Smoking_Yes

The histogram depicting the distribution of Hx Smoking_Yes reveals a significant disparity between patients with and without a history of smoking. The x-axis represents whether a patient has a history of smoking, with 0 indicating no smoking history and 1 representing patients who have smoked. The y-axis shows the frequency of patients within each category. The overwhelming majority of individuals in the dataset fall into the 0 category, meaning they have never smoked, while only a small fraction is classified as 1, indicating a past smoking history. The kernel density estimation (KDE) curve further emphasizes the severe right-skew in the data, reinforcing that the dataset is predominantly composed of non-smokers. This distribution suggests that smoking history may not be a highly influential variable in this particular dataset, as there are relatively few smokers to draw statistically significant conclusions. However, given the known links between smoking and various types of cancers, further statistical tests would be required to assess whether smoking history has any meaningful correlation with thyroid cancer recurrence. Additionally, the imbalance in smoking history could pose challenges for predictive modeling, as the model might not learn enough from the limited data on smokers.


8. Scatterplot of Thyroid Cancer Recurrence

The scatterplot illustrating thyroid cancer recurrence shows how patients are classified by recurrence status. The x-axis represents the sample index, essentially distinguishing individual patients, while the y-axis represents the binary outcome of recurrence (0 for no recurrence and 1 for recurrence). The two distinct bands at the top and bottom of the plot, with no intermediate values, reflect that recurrence is recorded as an either/or event. The blue points correspond to patients who did not experience recurrence, while the red points represent those who did. The even distribution of both classes suggests that the dataset has been balanced, likely due to the application of SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance. This balance helps keep the machine learning model from becoming biased toward the majority class. Because the outcome is binary by definition, the separation between the two bands does not by itself show that the features are predictive; that is assessed through the model's test performance and the feature importance scores. Moving forward, further analysis, such as feature importance evaluation and statistical significance testing, would be essential to identify the key factors driving this outcome.

The overall dataset provides a comprehensive view of factors influencing thyroid cancer recurrence, with various visualizations highlighting key trends. The correlation heatmap indicates that while some clinical features, particularly cancer staging variables (T, N, M classifications), exhibit moderate correlations, no extreme dependencies suggest severe multicollinearity. The feature importance bar plot reinforces that early treatment response, tumor staging, and structural completeness post-surgery are the strongest predictors of recurrence. Demographic histograms reveal a skewed age distribution, with most patients falling within the 30-40 age range, and a notable gender imbalance favoring females, reflecting real-world epidemiological trends. The Hx Smoking_Yes and Hx Radiotherapy_Yes histograms show that a majority of patients are non-smokers and have not undergone prior radiotherapy, limiting the ability to assess their impact on recurrence. The scatterplot of recurrence confirms a well-balanced dataset, likely due to oversampling techniques like SMOTE, ensuring reliable predictive modeling. Overall, the dataset suggests that thyroid cancer recurrence is driven by a complex interplay of multiple factors rather than a single dominant predictor, warranting further multivariate analysis to refine predictive models.

Conclusion

My research did or did not support my hypothesis

My research did support my hypothesis that machine learning and AI technologies can be applied to clinical, pathological, and molecular data to develop a predictive model for thyroid cancer recurrence. The Random Forest model successfully identified key predictive factors, such as response to initial treatment, structural completeness post-surgery, and tumor staging, reinforcing existing medical knowledge. The balanced dataset, aided by SMOTE, allowed for reliable classification of recurrence versus non-recurrence cases. However, some variables, like smoking history and radiotherapy exposure, had skewed distributions, which may have limited their predictive power.

My research is important to society because:

Thyroid cancer recurrence can be life-altering, and early detection is crucial for effective treatment and improved patient outcomes. My research contributes to the medical community by offering a data-driven approach to risk assessment, helping physicians personalize follow-up care based on a patient's likelihood of recurrence. By leveraging machine learning, I can move toward more proactive and efficient monitoring strategies, reducing unnecessary testing for low-risk patients while ensuring high-risk individuals receive timely interventions.

This is what I would change if I did my experiment again:

If I were to conduct this research again, I would:

  • Expand the dataset to include a more balanced representation of gender and smoking history, reducing potential biases in model predictions.
  • Refine feature selection by integrating more molecular and genetic markers, which could provide deeper insights into recurrence risk beyond traditional clinical factors.
  • Test additional machine learning models such as gradient boosting algorithms (e.g., XGBoost, LightGBM) to compare predictive performance.
  • Optimize hyperparameters further to enhance model accuracy and generalizability.
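
As a rough idea of what the hyperparameter step could look like, the sketch below runs a small cross-validated grid search with scikit-learn's GridSearchCV on synthetic stand-in data. The grid values, the choice of recall as the scoring metric, and the variable names are assumptions for illustration, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the thyroid dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately tiny, hypothetical grid so the search finishes quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,              # 5-fold cross-validation
    scoring="recall",  # favors catching recurrences; other metrics are possible
)
search.fit(X_demo, y_demo)

print("best settings:", search.best_params_)
print(f"best cross-validated recall: {search.best_score_:.3f}")
```

Cross-validation scores each setting on several different train/validation splits, which gives a steadier estimate than a single 80/20 split.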

This is what I recommend for future or different experiments:

  • Future research should incorporate larger, more diverse datasets to mitigate demographic biases.
  • Longitudinal studies could track patients over time to refine the recurrence prediction model with real-world clinical outcomes.
  • Explainable AI (XAI) techniques should be applied to increase transparency in predictions, making the model more interpretable for clinicians.
  • Exploring deep learning approaches, such as neural networks, might reveal more complex patterns in patient data.

Sources of Error

Throughout my research, I encountered several challenges that impacted the study:

  • Coding platform issues: I had to switch platforms due to compatibility problems and debugging complexities, which slowed down progress and required modifications to my code.
  • Multiple errors in implementation: The initial model faced issues with feature scaling, data preprocessing, and missing values, requiring several iterations to resolve.
  • Bias in the dataset: The significant gender imbalance (predominantly female patients) may have led to skewed predictions, potentially underrepresenting recurrence risks in male patients.
  • Limited data in certain categories: The small number of patients with a history of smoking or prior radiotherapy could have led to misleading conclusions due to insufficient representation in those groups.

In conclusion, my research successfully supported my hypothesis that machine learning and AI technologies can be utilized to develop a predictive model for thyroid cancer recurrence by analyzing clinical, pathological, and molecular data. Through the use of a Random Forest model, I was able to identify key factors influencing recurrence, such as response to initial treatment, tumor staging, and structural completeness post-surgery, which align with established medical knowledge. The model's ability to classify recurrence and non-recurrence cases was enhanced by data balancing techniques like SMOTE, ensuring reliable predictions.

However, the presence of biases within the dataset, particularly the significant gender imbalance and the limited number of patients with smoking history or prior radiotherapy, may have influenced the model’s overall accuracy and generalizability. Additionally, several challenges arose during the research process, including having to switch coding platforms due to technical issues and debugging multiple errors in data preprocessing and feature engineering, which impacted workflow efficiency.

Despite these obstacles, my findings demonstrate the potential of AI-driven models in improving thyroid cancer recurrence prediction, thereby contributing to more personalized patient care and follow-up strategies. Moving forward, expanding the dataset to include a more diverse patient population, incorporating additional molecular markers, and experimenting with other advanced machine learning algorithms could further refine prediction accuracy. Moreover, implementing explainable AI techniques would help bridge the gap between machine learning models and clinical application, ensuring that healthcare professionals can trust and interpret model predictions effectively.

While my research highlights the feasibility of using AI for recurrence prediction, it also emphasizes the need for continuous improvements in data quality and model development. Ultimately, this study serves as a foundation for future research in leveraging machine learning for better cancer prognosis, potentially leading to earlier interventions and improved patient outcomes.

Acknowledgement

I would like to acknowledge my science fair coordinator, Mr. Morgan, as well as my computer science tutor and those who supported me in making this project successful. Thank you!

Attachments

No Log Book Provided