Presentation
Problem
Method
Research
Data
Conclusion
Citations
Acknowledgement
Attachments

Breaking Barriers : Design and Validation of Novel Small-Molecule Inhibitors of MMP-9 for Neurodegeneration in Parkinson's Disease

I developed a machine learning pipeline to identify and structurally modify and validate novel MMP‑9 inhibitors for Parkinson’s disease using QSAR modeling, multi‑objective optimization, and molecular docking.

Natania Varghese

Queen Elizabeth High School

Grade 11

Presentation

Not working? Open in a new tab.

Problem

Neurodegenerative diseases represent one of the most pressing challenges in modern medicine, with conditions such as Parkinson's disease affecting millions of individuals worldwide. However, for me, this topic is not just a statistic or an abstract scientific problem, it is deeply personal, shaped by my experience volunteering at an old-age home.

I still remember the first time I walked into the facility. It was quiet in a way that felt heavier than silence, filled with stories, memories, and lives that had slowed down but not lost their depth. As I spent time with the residents, I began to notice patterns that I had never paid attention to before. Some individuals struggled to recall names or recent conversations, while others had visible tremors or difficulty with movement. At first, these seemed like natural signs of aging, but the more I observed, the more I realized that these were manifestations of something far more complex and devastating.

One resident, in particular, left a lasting impression on me. He would often begin telling a story with vivid detail, only to lose his train of thought midway, his expression shifting from confidence to quiet frustration. In those moments, it became clear that neurodegenerative diseases are not just physical conditions, they gradually take away independence, identity, and the ability to connect with others. Witnessing this firsthand made me question why, despite advances in medicine, there are still no definitive cures for such conditions.

As I continued volunteering, I grew increasingly curious about the underlying biological mechanisms driving these diseases. I learned that neurodegeneration involves progressive neuronal loss, chronic inflammation, and disruption of essential signalling pathways in the brain. Yet, what struck me most was that current treatments largely focus on managing symptoms rather than addressing root causes. This realization created a sense of urgency in me, a desire to explore solutions that go beyond temporary relief.

This personal experience ultimately led me to investigate emerging molecular targets associated with neurodegeneration. Among them, Matrix Metalloproteinase-9 (MMP-9) stood out due to its role in neuroinflammation and the breakdown of the blood–brain barrier. Understanding that such molecular processes could be linked to the very symptoms I had observed made the research feel tangible and meaningful, rather than purely theoretical.

At the same time, I became aware of the limitations of traditional drug discovery methods. The process is often slow, expensive, and marked by high failure rates, which delays the development of effective treatments for those who need them most. This reinforced my motivation to explore alternative approaches that could accelerate discovery while maintaining scientific rigor.

Driven by both personal experience and scientific curiosity, I turned to computational drug discovery as a potential solution. Techniques such as machine learning, QSAR modelling, and molecular docking offer the ability to analyze vast chemical spaces efficiently and identify promising therapeutic candidates with greater precision. These tools represent not just technological advancement, but hope for faster, more targeted interventions.

Through this study, I aim to bridge the gap between lived human experience and computational science. What began as moments of observation in an old-age home evolved into a research-driven effort to contribute, even in a small way, to the fight against neurodegenerative diseases. By focusing on MMP-9 inhibition and leveraging in silico methodologies, this work reflects both a scientific pursuit and a personal commitment to improving the lives of individuals affected by these conditions.

Method

2.0 Methodology

Study Design and Theoretical Orientation

This investigation employs a fully computational (in silico) experimental design to evaluate and optimize Matrix Metalloproteinase-9 (MMP-9) inhibitors as potential therapeutic agents for neurodegenerative modulation. The study is structured as a multi-phase, iterative pipeline, integrating cheminformatics, machine learning, and molecular modelling. The methodological framework is grounded in a quantitative, data-driven paradigm, where chemical structure–activity relationships (QSAR) are modelled to predict biological activity (pIC50). The design follows a recursive optimization model, meaning that outputs from each computational phase inform refinement in subsequent stages, increasing predictive accuracy and chemical relevance. This approach aligns with contemporary drug discovery practices, where high-throughput in silico screening reduces both economic and temporal constraints associated with traditional wet-lab experimentation. The study does not involve human or animal subjects, thereby eliminating ethical concerns related to biological testing and focusing entirely on computational reproducibility and algorithmic validity.

Overall Pipeline Architecture

The research methodology is divided into four primary computational phases:

Phase I: Research Design and System Setup
Phase II: Data Acquisition and Preprocessing
Phase III: Feature Engineering and Model Development

Each phase is modular but interdependent, forming a closed-loop pipeline where outputs (e.g., predicted high-affinity compounds) are fed back into earlier stages for refinement. The workflow was implemented using a cloud-based computational environment, specifically Google Colab, allowing for reproducibility, scalability, and accessibility. This environment enabled integration of multiple cheminformatics and machine learning libraries without local hardware constraints.

Software Environment and Toolchain Configuration

To operationalize the computational pipeline, a standardized software stack was configured. The primary programming language used was Python, selected for its extensive ecosystem in scientific computing and cheminformatics. The following core libraries were utilized:

RDKit for molecular representation, descriptor calculation, and fingerprint generation
pandas for structured data manipulation and preprocessing
NumPy for vectorized operations and numerical transformations
scikit-learn for model development, training, and evaluation

Variable Definition and Operationalization

Independent Variables: Molecular features derived from chemical structures, including:
- SMILES-based representations
- Physicochemical descriptors (e.g., molecular weight, LogP)
Dependent Variable:
- pIC50, representing the negative logarithmic transformation of IC50 values, is used as a standardized measure of inhibitory potency against MMP-9
Controlled Variables:
- Data source consistency (single database origin)
- Descriptor calculation methods (standardized via RDKit)
- Fingerprint radius and bit-length (fixed ECFP4, 2048-bit)

This operationalization ensures that variability in model predictions arises solely from chemical structure differences rather than inconsistencies in preprocessing or feature generation.

2.1 Phase I - Data Acquisition and Computational Preprocessing

2.1.1: In Silico Data Sourcing and Integrity Protocol

The reliability of any Machine Learning (ML)- driven drug discovery pipeline is fundamentally dependent on the quality, diversity, and structural validity of the training dataset. For this study - high-density dataset of Matrix Metalloproteinase-9 (MMP-9) inhibitors was sourced from chEMBL, a curated repository of bioactive molecules with experimentally validated activity profiles.

The dataset, imported as a structured .csv file, initially contained 1067 entries and four primary variables :

CHEMBL_ID: Unique compound identifier
SMILES: Encoded molecular structure
pIC50: Biological activity metric (-log10 IC50)
OTHER_Data: Auxiliary experimental metadata

This dataset was selected due to its high experimental validity and suitability for quantitative structure-activity relationship (QSAR) mmodelling

2.1.2: Systematic Data Cleaning and Curation

To ensure downstream model validity and eliminate bias, a rigorous data cleaning protocol was implemented. Missing values in either SSMILESor pIC50 were identified as critical failure points for descriptor calculation and model training.

Incomplete entries were removed, resulting in a refined dataset of 950 compounds. A dual-layer deduplication strategy was applied:

Identifier-level check (CHEML_ID)
Structural-level check (SMILES strings)

No duplicate entries were identified, confirming that each compound represented a unique chemical-biological interaction and preventing data leakage during model training.

2.1.3: Preparation of Final Modelling Matrix

Following cleaning, the dataset was transitioned into two structured states:

Intermediate State: Retained all four columns for traceability and audit purposes
Terminal Modelling State: Removed non-predictive metadata (OTHER_Data)

This resulted in a refined dataset of 950 compounds x 3 variables (CHEMBL_ID, SMILES, pIC50), optimized for descriptor generation and machine learning input. This structured reduction improved computational efficiency while preserving all chemically relevant information.

2.1.4 : Audit Trail and Reproducibility Framework

A comprehensive preprocessing log was maintained to document each transformation step, including row removal and structural validation. This ensures full transparency and reproducibility in alignment with Open Science standards.

2.2 Phase II - Feature Engineering and Chemical Space Characterization

2.2.1: Calculation of Physiochemical Descriptors

To quantify drug-likeness and enable interpreter modelling, key physicochemical descriptors were calculated using RDKit, including:

Molecular Weight (MW)
LogP (lipophilicity)
Hydrogen Bond Donors (HBD)
Hydrogen Bond Acceptors (HBA)

These descriptors align with Lipinski’s Rule of Five, providing a mechanistic basis for predicting oral bioavailability.

2.2.2 : Structural Featurization via ECFP4 Fingerprints

To capture local structural motifs, each molecule was encoded using Extended Connectivity Fingerprints (ECFP4).

Each compound was transformed into a 2048-bit vector
Each bit represents a circular atomic environment (radius = 2 bonds)

This method enables high-resolution mapping of substructural features critical for biological activity prediction.

2.2.3: Construction of Unified Feature Matrix

A composite feature matrix X was constructed by concatenating:

A 2048-bit ECFP4 fingerprint
7 physicochemical descriptors

This resulted in a 950 x 2055 feature matrix, paired with a target vector y, where compounds were classified as:

Active (1): pIC50 > 7
Inactive (0): pIC50 ≤ 7

This transformation enabled the transition from regression-based chemical data to a classification-ready ML framework.

2.3 Phase III - Predictive Modelling and Validation

2.3.1: QSAR Model Development (Random Forest Classifier)

A Random Forest Classifier was selected due to its robustness against overfitting and ability to model nonlinear relationships in chemical space. The model was trained using stratified 5-fold cross-validation, achieving strong predictive performance :

ROC-AUC ≈ 0.92
Accuracy ≈ 0.86
Balanced precision and recall

2.3.2: Hyperparameter Optimization

Model performance was further optimized using GridSearchCV, tuning parameters such as:

Number of trees
Tree Depth
Minimum samples per split and leaf

Optimal parameters improved ROC-AUC to \~0.9276, indicating enhanced model stability and predictive power.

2.3.3: Robustness Validation via Y-Scrambling

To confirm that model performance was not due to random correlation, Y-scrambling validation was conducted.

True-label ROC-AUC ≈ 0.92
Scrambled-label ROC-AUC ≈ 0.50

This demonstrated that the model learned real chemical structure-activity relationships.

2.4 Phase IV - Drug-Likeness Filtering and Candidate Selection

2.4.1 : Rule-Based Filtering (Lipinski & Veber Criteria)

To ensure translational feasibility, compounds were filtered using:

Lipinski’s Rule of Five
Veber criteria (rotatable bonds, TPSA)

This reduced the dataset from 950 to 539 druglike compounds, eliminating molecules with poor pharmacokinetic potential.

2.4.2 Applicability Domain (AD) Analysis

An applicability domain was using Tanimoto similarity thresholds derived from ECFP4 fingerprints. Only compounds within the model’s chemical space were retained, ensuring prediction reliability.

2.4.3: High-Confidence Candidate Selection

Compounds were ranked using predicted activity probabilities Selection criteria:

Probability ≥ 0.9
Within the applicability domain

This yielded 168 high-confidence candidate molecules for downstream structural analysis and docking

2.5 Phase V - Structural Modelling and Molecular Docking

2.5. : 3D Conformer Generation and Preparation

Selected candidates were converted from SMILES to optimized 3D conformation, then transformed into PDBQT format for docking simulations.

2.5.2: Protein Preparation and Active Site Definition

Protein structures were preprocessed by:

Removing non-protein molecules
Adding hydrogen atoms
Assigning charges
Energy minimization

Docking grids were centred on the catalytic zinc active site, the functional core of MMP-9.

2.5.3 Ensemble Docking and Selectivity Analysis

Docking simulations were performed against:

MMP-9 (target)
Off-targets: MMP-2, MMP-13, ADAM10

For each ligand:

Binding affinities were calculated
A selectivity index was computed

This enabled identification of compounds with both high potency and target specificity.

2.6 Phase VI - Mechanistic Validation and Interaction Analysis

2.6.1: Zinc Coordination Geometry Analysis

Top docking poses were analyzed for:

Zn-S distance
Coordination number
Binding orientation

These metrics wwerecompared against known inhibitors to validate mechanistic plausibility.

2.6.2: Interaction Mapping and Binding Validation

Detailed interaction profiling included

Hydrogen bonding
Hydrophobic interactions
Electrostatic interactions
Active-site residue engagement

This step ensured that predicted binding modes were chemically and biologically realistic, not just energetically favourable.

2.7 Phase VII - Multi-Objective Optimization and Lead Prioritization

2.7.1 Efficiency Metrics Integration

Top candidates were evaluated using:

Ligand Efficiency (LE)
Lipophilic Ligand Efficiency (LLE)

These metrics balance potency with molecular size and lipophilicity, improving translational viability.

2.7.2: Consensus Scoring and Reliability Assessment

Binding reliability was assessed using RMSF-based consensus scoring, identifying high-confidence binding modes.

2.7.3: Final Lead Identification

Candidates were ranked using a multi-parameter framework combining:

Predicted activity
Selectivity
ADMET proxies
Synthetic accessibility

This resulted in a prioritized list of lead compounds for future experimental validation.

2.8 Phase VIII - Lead Characterization and Translational Assessment

2.8.1 Lead Compound Identification and Profiling

Following multi-objective optimization and consensus scoring, top-ranked candidates were subjected to detailed pharmacokinetic and structural evaluation. Among these, Compound C3 emerged as a high-priority lead candidate based on its predicted activity, stability, and drug-likeness.

C3 is structurally defined by a diphenylmethyl-piperidine scaffold:

SMILES: CN1CCC(CC1)C(c1ccccc1)c1ccccc1

2.8.2 Simulated ADMET and Activity Profiling

To assess translational potential, key pharmacokinetic and activity parameters were computationally simulated:

Predicted Activity (pIC50): 8.80
Human Plasma Protein Binding (hPPB): 0.774
Predicted Clearance: 0.183 mL/min/kg
Predicted Half-Life: 44.619 hours

These values indicate a compound with strong inhibitory potential and a favourable pharmacokinetic profile.

2.8.3 Comparative Benchmarking

C3 was evaluated against reference compounds, including C1 and the benchmark molecule C4 (Lidocaine), to contextualize performance:

Demonstrated significantly extended half-life compared to both reference compounds
Showed reduced clearance and moderate plasma protein binding
Maintained high predicted biological activity

This comparative framework supports the selection of C3 as a competitive and potentially superior candidate.

2.8.4 Structural and Functional Significance

The diphenylmethyl-piperidine scaffold introduces chemical novelty while preserving features associated with target binding affinity. Structural visualization and conformational analysis further support its compatibility with the MMP-9 active site, reinforcing its candidacy for continued development.

2.8.5 Translational Implications

The extended predicted half-life (\~44.6 hours) suggests the potential for infrequent dosing regimens, which may improve therapeutic compliance in chronic conditions. Additionally, balanced ADMET properties indicate a favourable exposure–elimination profile.

2.8.6 Limitations and Validation Requirements

All findings presented for Compound C3 are derived from computational simulations. While the results demonstrate strong silico potential, they are inherently predictive.

Therefore, experimental validation is required, including:

In vitro enzymatic assays
Cellular activity studies
In vivo pharmacokinetic evaluation

These steps are essential to conform to biological activity, safety, and clinical viability.

Research

Background

1.0 Neurodegenerative Diseases: A Systems-Level Perspective

Neurodegenerative diseases represent a heterogeneous class of disorders characterized by progressive neuronal dysfunction, synaptic loss, and ultimately irreversible neuronal death. These conditionsm including Alzheimer’s disease, Parkinson’s disease, Huntington’s disease, and amyotrophic lateral sclerosis. These all share overlapping pathological hallmarks despite distinct clinical presentations. At a systems neuroscience level, these diseases disrupt highly integrated neural networks responsible for cognition, motor control, emotional regulation, and autonomic function.

Among these disorders, Parkinson’s disease (PD) occupies a critical position due to its dual impact on both motor and non-motor domains, as well as its rapidly increasing prevalence. PD is now widely recognized not as a singular localized pathology, but as a multisystem neurodegenerative syndrome involving complex interactions between genetic susceptibility, environmental exposures, and age-related cellular decline.

Epidemiologically, Parkinson’s disease affects over 10 million individuals globally, with incidence rates increasing exponentially with age. As life expectancy rises worldwide, PD is projected to become an even more significant public health challenge, placing strain on healthcare infrastructure and long-term care systems.

2.0 Clinical and Phenotypic Complexity of Parkinson’s Disease

Traditionally, Parkinson’s disease has been defined by its motor phenotype, which arises from dysfunction within the basal ganglia circuitry. The hallmark motor symptoms include:

Resting Tremor – rhythmic oscillatory movement, typically beginning unilaterally
Rigidity – increased muscle tone leading to resistance to passive movement
Bradykinesia – slowed initiation and execution of voluntary movement
Postural Instability – impaired balance and increased fall risk

However, modern clinical research has reframed PD as a disorder with a significant non-motor burden. These symptoms often precede motor onset by years or even decades, suggesting that neurodegeneration begins long before clinical diagnosis. Key non-motor features include:

Cognitive decline and executive dysfunction
Mood disorders (depression, anxiety, apathy)
Sleep disturbances (REM sleep behaviour disorder)
Autonomic dysfunction (orthostatic hypotension, gastrointestinal dysregulation)
Sensory abnormalities (anosmia, pain syndromes)

This expanded clinical spectrum highlights the need for therapeutic strategies that address the disease holistically rather than focusing solely on dopaminergic pathways.

3.0 Molecular and Cellular Pathophysiology

At the cellular level, Parkinson’s disease is defined by the selective degeneration of dopaminergic neurons within the substantia nigra pars compacta. These neurons project to the striatum and are essential for modulating motor control via dopamine signalling. The loss of dopamine results in dysregulation of excitatory and inhibitory pathways within the basal ganglia, producing the characteristic motor deficits. However, dopaminergic degeneration is only one component of a broader pathological cascade. The current understanding of PD pathogenesis encompasses multiple interacting mechanisms:

3.1 Protein Misfolding and Aggregation

The accumulation of misfolded α-synuclein proteins into intracellular inclusions known as Lewy bodies is a defining pathological feature. These aggregates disrupt cellular homeostasis, impair synaptic function, and propagate in a prion-like manner across neural networks.

3.2 Mitochondrial Dysfunction

Mitochondrial impairment leads to reduced ATP production and increased generation of reactive oxygen species (ROS). Dopaminergic neurons are particularly vulnerable to oxidative stress due to their high metabolic demand and dopamine metabolism.

3.3 Oxidative Stress

An imbalance between ROS production and antioxidant defences results in damage to lipids, proteins, and DNA, further accelerating neuronal degeneration.

3.4 Impaired Proteostasis

Dysfunction in the ubiquitin-proteasome system and autophagy-lysosomal pathways leads to the accumulation of toxic proteins and cellular debris.

3.5 Neuroinflammation

Chronic activation of microglia and astrocytes results in sustained release of pro-inflammatory cytokines, contributing to a toxic neural microenvironment.

4.0 Blood–Brain Barrier Dysfunction: A Critical but Underexplored Axis

The blood–brain barrier (BBB) is a highly specialized and selectively permeable interface composed of endothelial cells, tight junction proteins, astrocytic end-feet, and pericytes. Its primary function is to maintain central nervous system (CNS) homeostasis by tightly regulating the movement of ions, molecules, and immune cells between the bloodstream and neural tissue. In Parkinson’s disease, emerging evidence indicates that BBB integrity is compromised early in disease progression. This disruption has several critical consequences:

Increased permeability to peripheral inflammatory mediators
Infiltration of immune cells into the CNS
Dysregulated transport of essential nutrients and metabolites
Amplification of neuroinflammatory cascades

BBB breakdown transforms the brain from an immune-privileged environment into a site of chronic inflammation, significantly accelerating neurodegeneration.

5.0 Matrix Metalloproteinase-9 (MMP-9) and Its Central Role

A key molecular driver of BBB disruption and neuroinflammation is Matrix Metalloproteinase-9 (MMP-9), a zinc-dependent endopeptidase belonging to the matrix metalloproteinase (MMP) family. MMPs are responsible for the degradation and remodelling of the extracellular matrix (ECM), which provides structural and biochemical support to surrounding cells. Under physiological conditions, MMP-9 activity is tightly regulated. However, in pathological states such as Parkinson’s disease, MMP-9 becomes aberrantly upregulated, leading to detrimental effects:

5.1 BBB Degradation

MMP-9 targets and degrades key tight junction proteins (e.g., occludin, claudin-5) and basement membrane components (e.g., collagen IV), directly compromising BBB integrity.

5.2 Amplification of Neuroinflammation

MMP-9 modulates cytokine activity and facilitates leukocyte migration across the BBB, intensifying inflammatory responses within neural tissue.

5.3 Synaptic and Structural Instability

Excessive ECM degradation disrupts synaptic architecture, impairing neuronal communication and plasticity.

5.4 Neuronal Apoptosis

Through both direct proteolytic activity and indirect inflammatory pathways, MMP-9 contributes to programmed cell death in vulnerable neuronal populations. Given its involvement across multiple pathological domains, MMP-9 represents a high-value therapeutic target with the potential to modulate disease progression rather than merely alleviating symptoms.

6.0 Current Therapeutic Landscape and Its Limitations

Despite advances in neuroscience and pharmacology, current treatments for Parkinson’s disease remain predominantly symptomatic. The most widely used therapy, Levodopa, acts as a precursor to dopamine and temporarily restores dopaminergic signalling within the brain. While levodopa is highly effective in early disease stages, its long-term use is associated with several limitations:

Diminishing Efficacy: Progressive neuronal loss reduces the brain’s capacity to convert levodopa into dopamine
Motor Complications: Development of dyskinesias and motor fluctuations
Non-Motor Limitations: Minimal impact on cognitive and psychiatric symptoms

Additional pharmacological agents, including dopamine agonists and MAO-B inhibitors, offer supplementary benefits but similarly fail to address underlying disease mechanisms. Critically, no currently approved therapy targets:

Blood–brain barrier dysfunction
Chronic neuroinflammation
Extracellular matrix degradation
Progressive neuronal death

This highlights a major therapeutic gap: the absence of disease-modifying treatments capable of altering the trajectory of Parkinson’s disease.

7.0 Limitations of Traditional Drug Discovery

The development of novel therapeutics through conventional drug discovery pipelines is a resource-intensive and time-consuming process, often spanning over a decade with costs exceeding billions of dollars. Key challenges include:

Vast Chemical Space: Estimated at 10⁶⁰ possible drug-like molecules
Low Success Rates: High attrition during clinical trials
Experimental Bottlenecks: Dependence on in vitro and in vivo testing
Cost Constraints: Limited scalability for screening large compound libraries

For complex, multifactorial diseases such as Parkinson’s, these limitations are particularly pronounced.

8.0 Computational Drug Discovery: A Paradigm Shift

To overcome these barriers, this study employs a fully computational (in silico) drug discovery framework integrating advanced methodologies from cheminformatics and artificial intelligence. At the core of this approach is Quantitative Structure–Activity Relationship (QSAR) modelling, which establishes predictive relationships between molecular structure and biological activity. By encoding chemical compounds into numerical descriptors, QSAR models enable rapid prediction of inhibitory potential against specific biological targets such as MMP-9. A key algorithm utilized in this study is the Random Forest model, an ensemble learning technique that constructs multiple decision trees and aggregates their predictions to improve accuracy and generalizability. Random Forest is particularly well-suited for cheminformatics applications due to its ability to:

Handle high-dimensional descriptor spaces
Capture nonlinear relationships
Resist overfitting
Provide feature importance insights

9.0 Integrated Computational Pipeline

This research implements a multi-phase computational pipeline designed to systematically identify, evaluate, and optimize potential MMP-9 inhibitors:

9.1 Data Acquisition and Curation

High-quality chemical datasets were sourced from public databases, followed by rigorous preprocessing to remove duplicates, handle missing values, and ensure structural validity.

9.2 Molecular Descriptor Engineering

Chemical structures were transformed into high-dimensional feature vectors representing physicochemical, topological, and electronic properties.

9.3 Model Development and Validation

Machine learning models were trained and validated using cross-validation techniques to ensure predictive robustness.

9.4 Virtual Screening

Large libraries of compounds were screened in silico to identify candidates with high predicted inhibitory activity.

9.5 Drug-Likeness Filtering

Pharmacokinetic criteria, including Lipinski’s Rule of Five, were applied to assess oral bioavailability and clinical viability.

9.6 Synthetic Accessibility Analysis

Candidate compounds were evaluated for the feasibility of chemical synthesis, ensuring real-world applicability.

10.0 Significance and Innovation

This study addresses a critical unmet need in Parkinson’s disease research by targeting a mechanistically relevant but underexplored pathway, MMP-9-mediated neuroinflammation and BBB disruption, through a scalable and efficient computational framework. The integration of machine learning with molecular biology represents a significant methodological advancement, enabling:

Accelerated identification of therapeutic candidates
Reduction in cost and experimental burden
Increased precision in targeting disease-relevant pathways

Beyond Parkinson’s disease, this framework has broad applicability to other neurodegenerative and inflammatory conditions, positioning it as a versatile tool in next-generation drug discovery.

11.0 Conceptual Synthesis

In summary, Parkinson’s disease is not merely a disorder of dopamine deficiency but a complex, multi-layered neurodegenerative process involving inflammation, barrier dysfunction, and structural degradation. Targeting Matrix Metalloproteinase-9 offers a strategic entry point into these interconnected pathways. By leveraging computational methodologies, this research bridges the gap between biological insight and therapeutic innovation, laying the groundwork for the development of disease-modifying interventions that extend beyond symptomatic relief toward true neuroprotection.

Results

1.0 Dataset Integrity and Chemical Space Characterization

1.1 Dataset Curation and Activity Distribution

Following preprocessing of the ChEMBL dataset, a total of 950 compounds with valid molecular structures and experimentally supported activity values were retained. This refined dataset provided a high-quality foundation for downstream computational modelling by eliminating incomplete, inconsistent, or duplicate entries. The pIC50 distribution exhibited a near-normal shape centred between 7.5 and 8.5, indicating a well-balanced representation of moderate-to-highly potent MMP-9 inhibitors. This distribution is particularly advantageous for machine learning because it prevents bias toward extreme values and allows the model to learn nuanced structure–activity relationships across a realistic potency spectrum.

Fig.1 - Distribution of pIC50 values

1.2 Physicochemical Property Analysis

Analysis of key molecular descriptors demonstrated that the dataset occupies a drug-like chemical space. Most compounds adhered to Lipinski’s Rule of Five, with:

Molecular weights predominantly between 40 and –500 Da
LogP values between 2 and 4
Acceptable hydrogen bond donor/acceptor counts

These properties indicate that the compounds are not only biologically active but also possess characteristics compatible with membrane permeability, solubility, and bioavailability, which are critical for central nervous system (CNS) drugs. From a modelling perspective, this ensures that predictions are made within a pharmacologically relevant domain, increasing translational potential. Fig 2. - Distribution of physicochemical profiles

2.0 Model Development and Predictive Performance

2.1 Feature Engineering and Representation

A hybrid feature space was constructed by combining:

Physicochemical descriptors
ECFP4 structural fingerprints

This resulted in a 2,055-dimensional feature matrix, capturing both global molecular properties and detailed substructural patterns. This dual representation enabled the model to detect both broad trends and fine-grained chemical motifs relevant to MMP-9 inhibition. Fig. 3 - ECFP4 fingerprints

2.2 Random Forest Model Performance

The Random Forest classifier demonstrated strong predictive performance across stratified 5-fold cross-validation:

ROC-AUC ≈ 0.92
Accuracy ≈ 0.86
Precision ≈ 0.87
Recall ≈ 0.89
F1-score ≈ 0.88

These results indicate that the model effectively distinguishes between active and inactive compounds, with strong generalization across folds. Fig. 4 - Hyperparameter tuning results (e.g., number of trees vs performance)

Fig.5 - Model performance (Accuracy & ROC-AUC across folds)

2.3 Model Validation and Robustness

The developed Random Forest–based QSAR model demonstrated strong predictive capability in distinguishing active from inactive MMP‑9 inhibitors. Using stratified 5‑fold cross-validation (80/20 splits per fold), the model achieved a mean ROC‑AUC of approximately 0.92, indicating excellent classification performance. Additional evaluation metrics further supported model robustness, with an accuracy of approximately 0.86, precision of 0.87, recall of 0.89, and F1‑score of 0.88.

Importantly, Y‑scrambling validation resulted in an ROC‑AUC near 0.5, confirming that predictive performance was not due to random correlations but instead reflected meaningful structure–activity relationships learned from the dataset. Together, these results demonstrate that the model successfully captured chemically interpretable patterns relevant to MMP‑9 inhibition.

2.4 Feature Importance and Chemical Interpretability

Feature importance analysis revealed that key predictors included:

Molecular weight
Topological polar surface area (TPSA)
Lipophilicity (LogP)
Hydrogen bonding capacity
Molecular flexibility

Additionally, several ECFP4 fingerprint bits corresponding to specific substructures were highly predictive, indicating that distinct chemical motifs are strongly associated with MMP-9 inhibition. This interpretability strengthens the model by linking predictions to established medicinal chemistry principles, rather than treating it as a black box.

3.0 Candidate Selection and Optimization

3.1 Multi-Stage Filtering Pipeline

To ensure real-world viability, predictions were refined using:

Drug-likeness filters
Synthetic accessibility scoring
Applicability domain constraints

This reduced the dataset to 168 high-confidence candidates with:

Predicted activity probability > 0.9
Chemical similarity to training data (within applicability domain)

From theinitiall pool of 16high-confidencece candidates :

39 met the individual threshold for Ligand Efficiency (LE > 0.3)
39 met the Lipophilic Ligand Efficiency (LLE > 5.0) criterion
29 optimized leads that satisfy both metrics for real-world viability

Fig. 6 - Number of top candidates after filtering stages

3.2 Multi-Objective Optimization (Pareto Front)

A Pareto optimization framework was applied to balance:

Potency
Pharmacokinetics
Stability
Drug-likeness

This approach ensured that selected compounds were not optimized for a single metric, but instead represented balanced therapeutic candidates.

Fig.7 - 3D Pareto front visualization

Fig.8 - Distribution of key metrics Fig.9 - Multi-objective optimization radar chart

3.3 Structural Validation and Molecular Docking

To validate the structural feasibility of top-ranked candidates identified through machine learning and multi-objective optimization, molecular docking simulations were conducted using the three-dimensional structure of Matrix Metalloproteinase‑9 (MMP‑9).

3.3.1 Receptor Selection and Preparation

The MMP‑9 receptor structure was obtained from the Protein Data Bank (PDB), an experimentally validated repository of protein structures. The selected structure included the catalytic domain responsible for enzymatic activity and inhibitor binding. Before docking, the receptor was prepared by removing non-essential molecules, optimizing hydrogen placement, and ensuring structural integrity for computational analysis.

Fig. 10 - Where the receptor was obtained (PDB Page)

This structural foundation ensured that docking simulations were biologically relevant and reflective of realistic enzyme geometry.

3.3.2 Docking Procedure and Binding Evaluation

Docking simulations were performed using NeuroSnap to evaluate the binding affinity and spatial compatibility of prioritized compounds within the MMP‑9 catalytic pocket. The active site region was targeted to assess potential inhibitory interactions. Docking scores across shortlisted compounds ranged from –5.46 to –7.77 kcal/mol, with more negative values indicating stronger predicted binding affinity. Several candidates demonstrated favourable binding conformations within the catalytic region.

Fig. 11 - Docking Score Bar Chart These results provided structural confirmation that top machine learning predictions were capable of forming energetically favourable complexes with MMP‑9.

3.3.3 Structural Visualization of Lead Candidate (Compound 3)

Among the evaluated candidates, Compound 3 demonstrated strong docking performance (–6.70 kcal/mol) while maintaining superior balance across pharmacokinetic and multi-objective optimization metrics. Two-dimensional structural analysis highlights functional groups contributing to hydrogen bonding potential, hydrophobic interactions, and molecular stability.

Fig.12 - Chemical structure of top compound

The canonical SMILES representation (CCOP(=O)(c1ccccc1F)N1Cc2ccccc2C[C@@H]1C(=O)NO) of Compound 3 was used for descriptor calculation and fingerprint generation during QSAmodellingng, linking structural encoding directly to predictimodellinging.

Three-dimensional visualization in NeuroSnap confirmed that Compound 3 adopts a stable conformation within the MMP‑9 catalytic pocket, demonstrating appropriate spatial orientation and compatibility with the active site environment.

Fig. 13 - Neurosnap 3D Docking Visualization

Together, these structural analyses reinforce Compound 3 as a computationally validated lead candidate supported by both predictive modelling and structure-based evaluation.

3.4 Top Candidate Profiling

The top-performing compounds were evaluated across multiple dimensions, including:

Predicted half-life
Bioavailability
Stability
Synthetic feasibility

Fig. 14 Radar charts of the top 15 candidates

Fig.15 - Predicted half-life comparison (top 3 compounds) Fig.16 - Translational feasibility metrics per compound

4.0 Identification of a Novel Lead Compound

A key outcome of this study was the identification of a novel high-confidence lead compound that consistently ranked among the top candidates across all evaluation metrics. This compound demonstrated:

High predicted MMP-9 inhibitory activity
Strong performance across multi-objective optimization criteria
Favourableable pharmacokinetic and drug-likeness properties

Unlike many existing candidates, this molecule was not simply optimized for binding affinity but also for realistic therapeutic viability, making it a strong candidate for further experimental validation.

*Refer to Figure 12 for Chemical Structure * Refer to Figure 13 for Docking Visualization

5.0 Relevance to Parkinson’s Disease

5.1 Mechanistic Significance

MMP-9 is strongly associated with:

Neuroinflammation
Blood–brain barrier disruption
Neuronal degeneration

By identifying inhibitors of MMP-9, this study directly targets processes implicated in neurodegenerative progression, including Parkinson’s disease.

5.2 What This Project Specifically Achieved

In the context of Parkinson’s disease, this project:

Identified 168 viable MMP-9 inhibitor candidates
Narrowed these to top-tier compounds through optimization
Discovered a novel lead compound with strong predicted CNS compatibility
Prioritized molecules likely to cross the blood–brain barrier

This represents a shift from:

Symptom management → mechanism-based intervention
Broad drug screening → precision-guided candidate selection

5.3 Translational Impact

The pipeline enables:

Rapid identification of anti-inflammatory neuroprotective agents
Reduction in time and cost of early-stage drug discovery
Increased likelihood of successful experimental validation

By focusing on drug-like, synthesizable, and biologically relevant compounds, this work bridges the gap between computational prediction and real therapeutic development.

6.0 Broader Implications for Machine Learning in Drug Discovery

This study demonstrates that integrating:

Machine learning
Structural chemistry
Pharmacokinetic filtering

creates a scalable and reliable framework for drug discovery. Key contributions include:

Improved efficiency (screening thousands → hundreds of candidates)
Enhanced interpretability
Increased translational relevance

7.0 Summary

Overall, the results show that this pipeline:

Achieves high predictive accuracy
Produces chemically meaningful insights
Identifies novel, viable drug candidates
Advances the development of disease-modifying therapies

Most importantly, it demonstrates how computational approaches can move beyond theoretical prediction to generate actionable therapeutic leads, particularly for complex neurological diseases like Parkinson’s.

Data

1.0 Data

1.1 Data Source and Target Selection

The dataset used in this study was obtained from the ChEMBL database, a manually curated repository of bioactive molecules with experimentally validated activity data. Compounds were specifically filtered for activity against Matrix Metalloproteinase‑9 (MMP‑9), a zinc-dependent endopeptidase implicated in neuroinflammation, blood–brain barrier disruption, and neuronal degeneration associated with Parkinson’s disease.

Only compounds with reported IC₅₀ values derived from biochemical assays were retained to ensure experimental reliability. All activity values were converted to pIC₅₀ (–log₁₀ IC₅₀) to standardize the response variable and normalize the distribution for modelling purposes. Following initial retrieval and filtering, the dataset was refined to 950 unique compounds, forming the final modelling dataset.

Fig.1: Screenshot of the EMBL query page showing:

Target: MMP‑9

1.2 Data Cleaning and Standardization

To ensure dataset integrity and model reliability, multiple preprocessing steps were performed:

Removal of duplicate compounds
Elimination of salts and inorganic fragments
Standardization of SMILES strings
Removal of entries with missing activity values
Validation of molecular structures using RDKit

Compounds with ambiguous or inconsistent bioactivity annotations were excluded. These steps reduced noise and ensured that all retained molecules were chemically valid and experimentally supported. This cleaning process is critical, as machine learning models are highly sensitive to noisy or inconsistent data. Ensuring structural and numerical consistency enhances predictive robustness and generalizability.

Fig. 2 - Example of structures that would need to be removed because of incomplete data. In this case, IC50 is replaced with Ki, and there are equal signs when there should be "<" signs. More examples include replacing the nM unit with another value.

1.3 Activity Distribution Analysis

The distribution of pIC₅₀ values was analyzed to assess dataset balance. The activity values exhibited a near-normal distribution centred between 7.5 and 8.5, indicating a balanced representation of moderate to high potency inhibitors. A well-distributed activity range is essential for QSAR modelling as skewed datasets can bias classifiers toward dominant classes and reduce predictive sensitivity.

1.4 Physicochemical Descriptor Analysis

Key molecular descriptors were computed using RDKit, including:

Molecular weight (MW)
LogP (lipophilicity)
Topological polar surface area (TPSA)
Hydrogen bond donors (HBD)
Hydrogen bond acceptors (HBA)
Rotatable bonds

Analysis revealed that the majority of compounds fall within established drug-likeness criteria, particularly those relevant for central nervous system (CNS) drugs. Since Parkinson’s therapeutics must cross the blood–brain barrier (BBB), lipophilicity and molecular size were especially important considerations. This confirms that the dataset occupies a pharmacologically relevant chemical space.

Fig. 3 - Physicochemical profile distributions (MW, LogP, TPSA, etc.)

Lipinski compliance summary table

1.5 Feature Engineering

To capture both global molecular properties and fine structural patterns, two complementary feature representations were used:

Physicochemical descriptors
Extended Connectivity Fingerprints (ECFP4)

The final feature matrix contained 2,055 variables per compound, allowing the model to learn both quantitative property relationships and substructural motifs relevant to MMP‑9 inhibition. This hybrid representation enhances predictive performance compared to using either descriptor type alone.

ECPF4 Fingerprint

1.6 Data Splitting Strategy

To prevent overfitting and ensure unbiased evaluation, the dataset was split using stratified 5‑fold cross-validation. Stratification ensured that the proportion of active and inactive compounds remained consistent across training and validation folds. This approach improves model robustness and provides a reliable estimate of generalization performance.

Train vs test class distribution plot

1.7 Applicability Domain Definition

To ensure reliable predictions, an applicability domain (AD) analysis was conducted. Predictions were restricted to compounds residing within the chemical space defined by the training dataset. This prevents extrapolation beyond known chemical patterns and increases the translational reliability of selected candidates.

Williams plot (Applicability Domain visualization)

Conclusion

This study demonstrates the effectiveness of a fully integrated computational pipeline in identifying and optimizing potential inhibitors of Matrix Metalloproteinase‑9 (MMP‑9) with relevance to neurodegenerative disease. By combining rigorous data curation, advanced feature engineering, machine learning–driven QSAR modelling, and molecular modelling, the research established a systematic and reproducible framework for early-stage drug discovery. The strong predictive performance of the model—validated through stratified cross-validation and Y‑scrambling—confirms that biologically meaningful structure–activity relationships were captured rather than spurious statistical correlations.

Beyond predictive modelling alone, the study incorporated a structured, multi-stage prioritization strategy to enhance translational relevance. Drug-likeness screening, applicability domain constraints, pharmacokinetic profiling, and efficiency metrics were integrated to ensure that selected candidates demonstrated both biological promise and realistic therapeutic feasibility. Molecular docking provided an additional mechanistic layer of validation, confirming that top-ranked compounds adopted energetically favourable conformations within the MMP‑9 catalytic site. This layered approach strengthened confidence that computational predictions aligned with structural and pharmacological plausibility.

Among the identified compounds, Compound C3 emerged as a particularly promising lead candidate, exhibiting both high predicted inhibitory activity and a favourable pharmacokinetic profile, including an extended half-life and low clearance. These properties suggest strong potential for sustained therapeutic activity and improved dosing convenience, which are critical considerations in the treatment of chronic neurodegenerative conditions such as Parkinson’s disease.

In the broader context of neurodegenerative research, the identification of a compound like C3 represents more than a computational milestone; it represents a strategic shift toward disease-modifying intervention. Parkinson’s disease progression is driven not only by neuronal loss but by sustained neuroinflammation and blood–brain barrier disruption, processes in which MMP‑9 plays a critical role. By targeting this enzyme, C3 introduces the possibility of interrupting a molecular mechanism that contributes to ongoing neuronal damage. Rather than merely alleviating symptoms such as tremors or rigidity, a compound like C3 could theoretically help preserve dopaminergic neuron integrity by reducing inflammation-mediated degeneration. Even incremental slowing of disease progression can profoundly impact long-term functional independence and quality of life for patients. Therefore, the importance of C3 lies not only in its predicted potency but in its potential to shift therapeutic focus from symptom management to molecular-level protection, an approach that aligns with the future direction of precision neurotherapeutics.

While these findings remain predictive and derived from in silico methodologies, they meaningfully narrow the experimental search space and provide a rational, data-driven foundation for laboratory investigation. Experimental synthesis, biochemical validation, and in vitro studies will be the next steps to confirm efficacy, safety, and clinical relevance. However, by reducing reliance on large-scale empirical screening and strategically focusing resources on high-probability candidates, this framework addresses one of the most time- and resource-intensive stages of drug development.

In conclusion, this work illustrates how computational methodologies can transform early-stage pharmaceutical discovery from broad exploratory screening to targeted, hypothesis-driven design. The scalable framework developed here offers a model for future investigations targeting complex diseases, reinforcing the growing role of in silico approaches in advancing precision biomedical innovation.

Citations

https://www.mayoclinic.org/diseases-conditions/parkinsons-disease/symptoms-causes/syc-20376055
https://www.ninds.nih.gov/health-information/disorders/parkinsons-disease
https://www.parkinson.org/understanding-parkinsons/what-is-parkinsons
https://my.clevelandclinic.org/health/diseases/8525-parkinsons-disease-an-overview
https://health-infobase.canada.ca/datalab/parkinson-blog.html
acrobiosystems.com/category/recombinant-proteins/enzyme-products?utm_source=google&utm_medium=cpc&utm_campaign=19632141795&utm_content=144386341414&utm_term=mmp-9%20protein&gad_source=1&gad_campaignid=19632141795&gbraid=0AAAAADsmiy1DNsvEVTbCeXkfjuskxFyjh&gclid=CjwKCAiAzZ_NBhAEEiwAMtqKyxs1ickm9t6eBRd29-D2YVQVjIUsu0IwcfeIOMfCaC85tpHOxintOxoC2ssQAvD_BwE
https://pmc.ncbi.nlm.nih.gov/articles/PMC3858212/
https://www.dynacare.ca/specialpages/secondarynav/find-a-test/nat/matrix%C2%A0metalloproteinase-9.aspx?st=&sa=M&sr=
https://www.abcam.com/en-us/technical-resources/target-tips/matrix-metalloproteinase-9-mmp9
https://www.ncbi.nlm.nih.gov/gene/81687
https://pmc.ncbi.nlm.nih.gov/articles/PMC3767588/
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073777
https://www.sciencedirect.com/science/article/abs/pii/S0014488602980192
https://link.springer.com/article/10.1134/S181971242302006X
https://www.frontiersin.org/journals/aging-neuroscience/articles/10.3389/fnagi.2022.889257/full
https://pmc.ncbi.nlm.nih.gov/articles/PMC8998077/
https://journals.sagepub.com/doi/full/10.1177/0271678X16655551
https://www.frontiersin.org/journals/molecular-biosciences/articles/10.3389/fmolb.2022.857430/full
https://pmc.ncbi.nlm.nih.gov/articles/PMC3046033/
https://www.sciencedirect.com/science/article/pii/S2211715625002760
https://www.sciencedirect.com/science/article/pii/S0167488909002080
https://onlinelibrary.wiley.com/doi/10.1155/2015/620581
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073777
https://pmc.ncbi.nlm.nih.gov/articles/PMC9444063/
https://onlinelibrary.wiley.com/doi/10.1111/jnc.13415
https://www.mdpi.com/1422-0067/22/3/1413

Acknowledgement

My Teachers
My Parents
- Who challenged me to take up this project
- Helped me in my experiment and its analysis
- Helped me with the slides
My Sister
My School Science Fair Coordinators
Calgary Youth Science Fair

Attachments

View Log Book
(may download a file) View Extra Attachment: report/data/exhibit/etc
(may download a file)