Generating Novel Linker Structures in Antibody-Drug Conjugates Using Diffusion Models

Linkers play a critical role in the stability, efficacy, and safety of antibody-drug conjugates (ADCs). This project leverages a diffusion-based generative model to design novel linker structures tailored to the specific components of existing ADCs.
April Cao
Grade 11

Problem

In 2022, there were nearly 10 million deaths from cancer, and over 35 million new cancer cases are predicted for 2050 [1]. The rapidly growing global cancer burden reflects both population aging and growth, as well as changes in exposure to risk factors. Despite progress in the early detection and treatment of cancer, there is still significant room for improvement. A current standard of treatment is chemotherapy, which uses drugs to kill cancer cells or stop them from growing [2]. It is effective, but it often causes serious side effects because of its mechanism of action: it targets fast-reproducing cells and stops them from multiplying by interrupting specific parts of the cell cycle. Since the drugs act throughout the body, they also affect healthy fast-dividing cells in the skin, hair, intestines, mouth, and bone marrow. This lack of selectivity is one of the main disadvantages of chemotherapy.
In recent years, a newer form of treatment has developed rapidly: antibody-drug conjugates (ADCs). ADCs are drug delivery systems whose main advantage over chemotherapy is their precision [3]. An ADC has three main components: the monoclonal antibody, the payload, and the linker. The monoclonal antibody provides the precision because it targets a specific antigen that is typically expressed only in tumor cells. The focus of this project, however, is the linker. The linker is a short chemical structure that connects the antibody and the payload, and it is crucial for regulating stability and releasing the payload in the tumor. The linker can also affect the ADC’s pharmacokinetics, efficacy, and toxicity. In Mylotarg, the first FDA-approved ADC, an unstable N-acylhydrazone linker led to severe hepatotoxicity and the drug's subsequent withdrawal, which demonstrates the importance of good linker design [4]. There are two primary types of linkers: cleavable and non-cleavable. Cleavable linkers release the payload in response to the conditions of a target cell. For instance, extracellular acidity is a common feature of many solid tumors, so some ADC linkers are designed to be cleaved at low pH. Non-cleavable linkers, on the other hand, require lysosomal degradation of the antibody to release the cytotoxic agent. Both have advantages and disadvantages: cleavable linkers offer broader efficacy and faster activation and release of the cytotoxic drug, whereas non-cleavable linkers can provide increased plasma stability and reduced off-target toxicity [4]. The linker in an ADC is therefore not simply a connector; it carries engineered functionality of its own. One of the key hurdles in linker design is the number of parameters that must be investigated.
There are numerous factors that the linker can influence, such as stability at physiological pH and in serum, the ability to release the payload where desired, sufficient hydrophilicity, and a practical manufacturing process [5]. In addition, linker requirements differ from case to case, meaning that a broader, more generalized approach is required.
One proposed way to make the development of novel linker structures more effective and efficient is machine learning. Machine learning models rapidly analyze datasets to identify patterns, and within this field, generative AI creates original data using existing data as a reference. Specifically, this project uses diffusion models. Through their iterative process of adding Gaussian noise to data and then removing it, such a model can be trained on existing linker structures to generate novel ones. Diffusion models generate artificial yet realistic data from input parameters. They have several advantages over other generative models, including the ability to learn complex distributions smoothly, to handle high-dimensional data, and to adapt easily to different tasks with diverse datasets. They have previously been used to address bioinformatics problems such as protein design [6]. Given the success of diffusion models in other fields, our approach is to design a novel, diffusion-based generative model for generating unique linker structures in ADCs. One key idea in linker development is that there is ‘no perfect linker’. Instead, the choice of linker is always dictated by the antibody and payload. In some cases, for example, the linker must be more hydrophilic to balance a payload’s hydrophobicity. For this reason, the inputs to the model are the other two components (payload and antibody), so that the model takes into account the context of the overall drug when generating a novel linker. Ultimately, the goal of this project is to use AI and machine learning as a tool to improve the efficiency of drug development while enhancing the safety and efficacy of ADCs through improved linker stability.

 

Method

1.1 Dataset collection

The dataset was scraped from the public online ADCdb database (accessible at: https://adcdb.idrblab.net) [7]. ADCdb is an online database that provides pharmaceutical information and biological activity data for ADCs. It was built through a comprehensive review of literature databases, with the pharmaceutical information for each ADC and its main components systematically extracted. For each ADC, we collected its component structures: linker, payload, and antibody. The linkers and payloads were represented as SMILES strings, a common text format for chemical structures, and the antibodies were broken down into their heavy and light chain amino acid sequences. For model training, the dataset was cleaned by removing duplicate entries and entries with missing information. One key limitation is that, although the database contains 6000+ ADCs, the vast majority are missing one or more key components and so could not be included in the training dataset. This limited dataset makes it more difficult to apply machine learning to ADCs, since machine learning typically requires a lot of data.
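The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the project's actual script: the field names (`linker_smiles`, `payload_smiles`, `heavy_chain`, `light_chain`) are hypothetical, and the real ADCdb export may be organized differently.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical field names; the actual ADCdb export columns may differ.
REQUIRED_FIELDS = ("linker_smiles", "payload_smiles", "heavy_chain", "light_chain")


def clean_records(records: List[Record]) -> List[Record]:
    """Drop entries with a missing component, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat None and whitespace-only strings as missing.
        values = tuple((rec.get(field) or "").strip() for field in REQUIRED_FIELDS)
        if not all(values):
            continue  # at least one component is missing
        if values in seen:
            continue  # same linker, payload, and antibody chains already kept
        seen.add(values)
        cleaned.append(dict(zip(REQUIRED_FIELDS, values)))
    return cleaned
```

Keying the duplicate check on all four components means two ADCs that share a linker but differ in payload or antibody are both kept, which matches the idea that the linker is always considered in the context of the whole drug.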

1.2 Feature representation

For feature representation, the protein language model ESM-2 (freely available at https://github.com/facebookresearch/esm) was used for the antibodies, because they are proteins [8]. ESM-2 is a transformer-based language model that uses an attention mechanism to learn interaction patterns between pairs of amino acids in input sequences. It can extract features from antibodies through transfer learning, and it outputs 1280-dimensional feature vectors. For linkers and payloads, which are small molecules, a novel NLP-based method (freely available at https://github.com/rahulsharma-rs/drug-smile-fet.git) was used [9]. In this method, SMILES strings are interpreted as natural language sequences. N-grams (a tool derived from NLP) are used to isolate significant and interpretable features from the SMILES strings, so that the molecular structures can be analyzed and converted to feature vectors.
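To make the N-gram idea concrete, the sketch below counts character N-grams in SMILES strings and turns each molecule into a fixed-length count vector. It is a simplified illustration and not the implementation from the cited repository; the function names, the choice of N values, and the `top_k` cutoff are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def char_ngrams(smiles: str, n: int) -> List[str]:
    """All contiguous substrings of length n."""
    return [smiles[i:i + n] for i in range(len(smiles) - n + 1)]


def build_vocabulary(corpus: Sequence[str], n_values: Sequence[int] = (1, 2, 3),
                     top_k: int = 50) -> List[str]:
    """Keep the top_k most frequent N-grams across the corpus."""
    counts = Counter()
    for smiles in corpus:
        for n in n_values:
            counts.update(char_ngrams(smiles, n))
    # Sort by descending frequency, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def featurize(smiles: str, vocabulary: Sequence[str],
              n_values: Sequence[int] = (1, 2, 3)) -> List[int]:
    """Count how often each vocabulary N-gram occurs in one SMILES string."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(smiles, n))
    return [counts[gram] for gram in vocabulary]
```

For example, with the corpus `["CCO", "CCN"]` and N values of 1 and 2, the three most frequent N-grams are `C`, `CC`, and `CN`, so `CCN` maps to `[2, 1, 1]`.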

1.3 Diffusion model

The framework used for the diffusion model was a DDPM (Denoising Diffusion Probabilistic Model). A DDPM has two Markov chains (stochastic processes in which each step depends only on the previous one). The forward chain gradually adds noise to the data, whereas the reverse chain removes noise to recover the original data. Developing the diffusion model involves two stages. The first is training: a network is designed that takes data to which random noise has been gradually added and learns to denoise it through the reverse process to recover the original molecule. This is repeated over many iterations so that the model improves its predictions of the added noise. The second is inference: after the molecular constraints (the other components of the ADC) are input, the model predicts the noise, and a new linker is generated by removing it. The dataset was split into training and validation sets.
In constructing the model, a two-layer fully connected network was defined to map the high-dimensional context vectors to a lower-dimensional space (256 dimensions). The linker generator was defined as a three-layer fully connected network with ReLU (rectified linear unit) activations in between, which generates a new linker embedding. The ReLU activations were included to introduce non-linearity; limiting overfitting was a particular concern because of the small dataset size. Then, a two-layer fully connected network was defined to map the one-dimensional time step tensor to a higher-dimensional embedding. In the forward pass, the input embeddings (heavy chain, light chain, payload) are concatenated and mapped to a 256-dimensional context vector by the two-layer fully connected network. This context vector is then concatenated with the time step embedding, and the combined vector is passed through the noise predictor to produce the predicted linker noise.
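For reference, the two chains can be written in the standard DDPM notation as follows. The symbol c for the conditioning context (antibody and payload embeddings) is notation introduced here for illustration, and fixing the reverse variance to a schedule value is one common choice rather than something the text specifies.

```latex
% Forward chain: add Gaussian noise with variance \beta_t at step t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Reverse chain: learned denoising step, conditioned on the context c
p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\right)
```

Here x_0 is the original linker embedding, x_t is its noised version after t steps, and the network is trained so that the reverse mean undoes one step of the forward noise.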
The loss function uses alphas_bar_sqrt and 1-alphas_bar_sqrt (in standard DDPM notation, the square roots of alphas_bar and of 1 - alphas_bar) as coefficients in the diffusion process to generate noised data. For model training, the parameters were set to epochs = 50, batch size = 32, and learning rate = 0.0001. Several higher and lower values of the epoch count and learning rate were tried, and these gave the best training results. Batch size was kept relatively low due to hardware constraints. The dataset was divided into training and validation sets, with 20% held out for evaluation.
For the noise calculation, betas (a linear schedule from 1e-4 to 0.02) gives the variance of the noise added at each step. alphas is computed as 1 - betas, and together with its cumulative product, alphas_bar, it is used to construct the denoising process. The square-root terms derived from alphas_bar (alphas_bar_sqrt and 1-alphas_bar_sqrt) are then needed to scale the data and the noise in the diffusion process. For each batch, the forward pass is called and mean squared error (MSE) is used to compute the loss between the predicted noise and the actual noise. MSE was chosen because it matches the Gaussian noise assumption of the diffusion process, making it a natural measure of noise prediction accuracy. A context loss between the noised data and the original data is added to the MSE loss to give the total loss. Backpropagation then computes the gradient of the total loss with respect to the network’s weights, the gradients are clipped (to prevent exploding gradients), and the parameters are updated so that the model gradually learns to predict the noise. The end result is a trained noise prediction model.
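The noise schedule and the noising step described above can be sketched with plain Python lists. This is an illustrative sketch rather than the project's training code: the number of diffusion steps is not stated in the text, so `NUM_STEPS = 1000` is an assumption, and `one_minus_alphas_bar_sqrt` is the name used here for what the text calls 1-alphas_bar_sqrt.

```python
import math
from typing import List, Sequence, Tuple

NUM_STEPS = 1000  # assumed; the number of diffusion steps is not given in the text


def linear_betas(num_steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Linear variance schedule from start to end."""
    step = (end - start) / (num_steps - 1)
    return [start + i * step for i in range(num_steps)]


def noise_coefficients(betas: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sqrt(alphas_bar) and sqrt(1 - alphas_bar) for every time step."""
    alphas_bar_sqrt, one_minus_alphas_bar_sqrt = [], []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # cumulative product of alphas = 1 - betas
        alphas_bar_sqrt.append(math.sqrt(alpha_bar))
        one_minus_alphas_bar_sqrt.append(math.sqrt(1.0 - alpha_bar))
    return alphas_bar_sqrt, one_minus_alphas_bar_sqrt


def q_sample(x0: Sequence[float], t: int, noise: Sequence[float],
             alphas_bar_sqrt: Sequence[float],
             one_minus_alphas_bar_sqrt: Sequence[float]) -> List[float]:
    """Noised data at step t: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return [alphas_bar_sqrt[t] * x + one_minus_alphas_bar_sqrt[t] * e
            for x, e in zip(x0, noise)]


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and actual noise."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(predicted)
```

During training, a random step t and a Gaussian noise vector would be drawn for each sample, `q_sample` would produce the noised linker embedding, and `mse` would compare the network's predicted noise with the noise that was actually added.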
The next step is model inference, where new inputs are passed to the trained model. First, we load the trained noise prediction model. Then, we initialize a random tensor x with the same batch size as the input embeddings, ensuring that its shape matches the linker dimensions. To simulate the reverse diffusion process, we iterate through the time steps t in reverse order. At each step, we call the model with the current noisy tensor x, the time step t, and the context to predict the noise, and then update x according to the denoising formula of the diffusion model (provided below) to gradually remove the predicted noise.

Denoising formula
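The formula itself is not reproduced in this text. As a reference point, the standard DDPM sampling update (Ho et al., listed in the Citations) has the form below; whether the project used exactly this variant, and the choice of variance, are assumptions. The context symbol c is notation introduced here.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,
                 \epsilon_\theta(x_t, t, c) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I), \quad \sigma_t^2 = \beta_t
```

Here \alpha_t = 1 - \beta_t, \bar{\alpha}_t is the cumulative product of the \alpha values up to step t, \epsilon_\theta is the trained noise predictor, and z is set to zero at the final step so that no fresh noise is added to the output.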

After the denoising of all time steps is complete, the original heavy chain embedding, light chain embedding, payload embedding, and generated linker vector x are concatenated to form the final embedding. For novel linker generation, samples were randomly chosen from the validation set, and a new linker was obtained for each by gradually denoising through model inference. These were saved to an Excel file for subsequent decoding.
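The inference loop and the final concatenation can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: `predict_noise` stands in for the trained noise prediction model, the update follows the standard DDPM sampling rule with variance equal to beta at each step, and all function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Stand-in signature for the trained model: (noisy x, time step, context) -> predicted noise
NoisePredictor = Callable[[Vector, int, Sequence[float]], Vector]


def sample_linker(predict_noise: NoisePredictor, context: Sequence[float], dim: int,
                  betas: Sequence[float], rng: random.Random) -> Vector:
    """Start from Gaussian noise and denoise step by step, from t = T-1 down to 0."""
    alphas = [1.0 - beta for beta in betas]
    alphas_bar, running = [], 1.0
    for alpha in alphas:
        running *= alpha
        alphas_bar.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random starting tensor
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, context)
        coef = betas[t] / math.sqrt(1.0 - alphas_bar[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no fresh noise is added at the last step
    return x


def assemble_embedding(heavy: Sequence[float], light: Sequence[float],
                       payload: Sequence[float], linker: Sequence[float]) -> Vector:
    """Concatenate the three input embeddings with the generated linker vector."""
    return list(heavy) + list(light) + list(payload) + list(linker)
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which makes it easier to regenerate a specific candidate linker later.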

 

1.4 Decoder

The decoding process is intended to convert the generated linker embedding from the diffusion model back into a SMILES sequence, as is typically used to represent chemical structures. Based on the ADCdb database, a dataset of linkers with their corresponding embeddings and SMILES was collected, and the goal was to train the vector decoder to decode embedding vectors of molecules into SMILES strings.

Linker embeddings and SMILES string dataset

First, the SMILES strings are broken down into basic units (atoms, bonds, symbols, etc.) to construct a character set dictionary, which gives each unit an unambiguous index. Each SMILES string is then tokenized and converted to a fixed-length index sequence. Next, a decoder model was built with Keras, a high-level neural network API for building and training deep learning models with less code, which is integrated into TensorFlow 2. The Keras-based decoder maps an input embedding to logits over a sequence of characters. The logits form a vector with the same number of elements as the model’s vocabulary. Each value in the vector is a raw score (“logit”) that indicates how strongly the model favors that specific character as the next one in the sequence.
In this model, an input layer receives the input data and is followed by a dense (fully connected) layer with 512 neurons. L2 regularization with a coefficient of 0.01 was applied to reduce overfitting by penalizing large weights. Next comes a LeakyReLU activation with a negative slope of 0.01, which alleviates the “dead neuron” problem of the standard ReLU in the negative region. Because the model initially overfit significantly, leading to poor results on the validation set, a dropout layer was added that randomly drops 50% of neurons during training, which improved the model’s generalization. For the loss function, cross-entropy loss was computed between the model-predicted SMILES character sequences and the true sequences. The SMILES strings were also checked for validity using RDKit, with invalid SMILES adding a penalty to the loss.
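The tokenization and fixed-length encoding step can be sketched as below. The regular expression, the padding token, and the function names are illustrative assumptions; the project's actual tokenizer may split SMILES strings differently.

```python
import re
from typing import Dict, List, Sequence

# Bracket atoms such as [N+], two-letter halogens, two-digit ring closures,
# then any remaining single character (atoms, bonds, digits, parentheses).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
PAD = "<pad>"


def tokenize(smiles: str) -> List[str]:
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(corpus: Sequence[str]) -> Dict[str, int]:
    """Map every token in the corpus to an integer index; index 0 is padding."""
    vocab = {PAD: 0}
    for token in sorted({tok for smiles in corpus for tok in tokenize(smiles)}):
        vocab[token] = len(vocab)
    return vocab


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a SMILES string to a fixed-length index sequence (truncate or pad)."""
    ids = [vocab[tok] for tok in tokenize(smiles)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


def decode(ids: Sequence[int], vocab: Dict[str, int]) -> str:
    """Convert an index sequence back to a SMILES string, skipping padding."""
    inverse = {index: tok for tok, index in vocab.items()}
    return "".join(inverse[i] for i in ids if i != vocab[PAD])
```

Treating bracket atoms and two-letter halogens as single tokens avoids ambiguity, for example between `Cl` (chlorine) and a carbon followed by another symbol.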
A beam search algorithm was used to generate the SMILES sequence: for each candidate sequence, each index was converted back to its corresponding character, and the resulting string was checked for bracket matching and SMILES validity. For model compilation, the Adam optimizer (a common choice for deep learning tasks) was used. The loss function was as previously defined, including the penalty for invalid SMILES strings, and the evaluation metric was the accuracy of the predicted sequence with respect to the true sequence. For model training, early stopping monitored the loss on the validation set with a patience of 10, so training stopped if the validation loss did not improve for 10 consecutive epochs, and the weights with the best validation loss were then restored. Other preset parameters were epochs = 30, batch size = 32, and shuffle = True (to shuffle the training data before the start of each epoch). As with the diffusion model, the number of epochs was chosen based on what produced the best training results, whereas the batch size was a result of hardware constraints.
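A beam search with a bracket-matching filter can be sketched as below. It assumes the decoder emits one logit vector per sequence position and that the padding token decodes to an empty string; both are assumptions for illustration. The RDKit validity check mentioned in the text is left out so that the sketch has no external dependencies.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def brackets_balanced(text: str) -> bool:
    """Check that every ( and [ is closed in the right order."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def beam_search(step_logits: Sequence[Sequence[float]], vocab: Sequence[str],
                beam_width: int = 3) -> List[Tuple[str, float]]:
    """Keep the beam_width best partial sequences at every position."""
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]
    for logits in step_logits:
        log_probs = log_softmax(logits)
        candidates = [(score + lp, seq + [idx])
                      for score, seq in beams
                      for idx, lp in enumerate(log_probs)]
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:beam_width]
    results = []
    for score, seq in beams:
        text = "".join(vocab[idx] for idx in seq)  # indices back to characters
        if brackets_balanced(text):
            results.append((text, score))
    return results
```

The filter is what makes keeping several candidates useful: if the highest-scoring string has an unclosed bracket, the next-best balanced candidate can be returned instead.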

 

Analysis

Results

The diffusion model successfully generated novel linker structures conditioned on the specific ADC components, namely the antibody and payload.

Diffusion model
Figure 1: Diffusion model training results

Model performance analysis

  1. Noise Loss

The final values of the training and validation noise losses are both close to their minimum, indicating that the model predicts noise well. The gap between the two curves is also very small, indicating strong generalization. Both curves are smooth, without drastic fluctuations, which is a sign of a stable training process.

  2. Context Loss

Both the training loss and the validation loss gradually decrease as the number of epochs increases. The gap is also relatively small (typically within 10%-20%), indicating strong generalization.

  3. Problems

Convergence occurs extremely quickly, with the loss curve flattening out after about 10 epochs, which could indicate that the model has reached a local minimum. This is likely due to the small dataset and the simple structure of the network.

Figure 2: Decoder training results

Model performance analysis

  1. Training set loss and validation set loss

As the number of epochs increases, the training loss and validation loss gradually decrease until they stabilize at a minimum value. However, the gap between them widens as training continues: the loss keeps improving on the training set but not on the validation set. This is likely related to the small dataset and nonuniform data distribution.

  2. Training set accuracy and validation set accuracy

The training accuracy and validation accuracy gradually increase as the number of epochs increases, and both level off near their maximum values. However, the final training accuracy is greater than 90%, showing strong performance on seen data, whereas the validation accuracy is only ~75%. This suggests the model has some ability to generalize to unseen data, but it performs worse there than on the training set, indicating a certain degree of overfitting.

Next, we’ll evaluate the linkers that were generated by the model. 

Figure 3: 2D UMAP projection of linker embeddings with novel generated linkers from the diffusion model

We used UMAP, a dimensionality reduction technique that helps visualize and analyze high dimensional data by mapping the linker embeddings into a lower dimensional space. UMAP analysis was done to determine the “nearest neighbours” of the generated linkers. The “nearest neighbours” are the closest data points to a given point in a dataset. One particular linker, a promising candidate, was chosen along with its three closest neighbours for further analysis. 
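The nearest-neighbour lookup can be illustrated with a plain Euclidean distance search. Note that this is a swapped-in simplification: the analysis above identifies neighbours through the UMAP projection, whereas the sketch below ranks reference linkers by distance between their embedding vectors directly. The function name and data layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def nearest_neighbours(query: Sequence[float],
                       reference: Dict[str, Sequence[float]],
                       k: int = 3) -> List[Tuple[str, float]]:
    """Return the k reference linkers closest to the query embedding."""
    distances = [(name, math.dist(query, vector)) for name, vector in reference.items()]
    distances.sort(key=lambda item: item[1])  # smallest distance first
    return distances[:k]
```

The same function could be applied to the 2D UMAP coordinates instead of the raw embeddings, which would more closely mirror the analysis described above.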

Table 1: ProTox-3.0 and SwissADME evaluations of the novel linker and its nearest neighbours

| Linker | SMILES | LD50 Value | Toxicity Class | Lipophilicity | Bioavailability Score | Synthetic Accessibility |
| --- | --- | --- | --- | --- | --- | --- |
| Thresholds | | LD50 > 500 mg/kg (moderate toxicity) | 1-6 (6 is least toxic) | LogP 1-3 (favorable lipophilicity) | > 0.55 (acceptable bioavailability) | SA < 5 (practical synthesis) |
| New Linker 3 (molecule 17) | O=C(NCCN1C(=O)C=CC1=O)CCCSSc1ccccn1 | 790 mg/kg | 4 | 1.3 | 0.55 | 3.16 |
| Mal-Me3Lys-Pro | C[N+](C)(C)CCCCC(NC(=O)C1CCCN1)C(=O)NCCNC(=O)CCN1C(=O)C=CC1=O | 215 mg/kg | 3 | -2.29 | 0.55 | 4.76 |
| Maleamic methyl ester-based linker 12A | CC(NC(=O)C(NC(=O)CCNC(=O)/C=C/C(=O)O)C(C)C)C(=O)Nc1ccc(CO)cc1 | 1000 mg/kg | 4 | 0.41 | 0.11 | 3.89 |


This table compares the LD50 values, toxicity classes, lipophilicity, bioavailability, and synthetic accessibility of a novel linker and its nearest neighbours. Toxicity analysis was performed using ProTox-3.0, which provided the LD50 values and toxicity classes (the LD50 Value and Toxicity Class columns of the table) [10]. ProTox is based on 61 models for the prediction of toxicity endpoints, built on Random Forest machine learning and deep neural network algorithms. It incorporates molecular similarity, fragment propensities, most frequent features, and fragment-similarity-based CLUSTER cross-validation machine learning. LD50, the median lethal dose, is the dose at which 50% of test subjects die upon exposure. Although the new linker's LD50 is above the threshold of 500 mg/kg, class 4 toxicity (300 < LD50 <= 2000) still indicates that it is harmful if swallowed. However, most ADCs are administered intravenously, so this is an acceptable level of toxicity for the linker. The new linker outperforms Mal-Me3Lys-Pro, which is more toxic and has a lower LD50, while having values similar to those of the other linker.
The other three values were determined with SwissADME [11]. Lipophilicity, the ability of a compound to dissolve in lipids, is reported as a consensus value from five different models. Drug-likeness can be assessed in multiple ways; the table shows the Abbot bioavailability score, which estimates the probability that a compound has at least 10% oral bioavailability in rats or measurable Caco-2 permeability. Here, bioavailability refers to the ability of the drug to be absorbed or used by the body. The generated linker outperforms Maleamic methyl ester-based linker 12A, which has a bioavailability score below the threshold, while once again matching the other linker. Lastly, synthetic accessibility is a score between 1 and 10, where 1 is the easiest to make in a lab and 10 is the most difficult (and therefore very costly).
The novel linker has the lowest synthetic accessibility score of the three, meaning it is predicted to be the easiest to synthesize. This highlights the improvements in properties achieved through the diffusion model, and the results underscore its potential to generate novel linkers with optimized properties for further refinement in ADC development.
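As a quick check on the table, the sketch below maps each LD50 value to its toxicity class and picks out the linker with the lowest synthetic accessibility score. The class 4 boundaries come from the text above; the other class boundaries are the standard GHS acute oral toxicity cut-offs, which are assumed here to be the ones ProTox applies. The dictionary layout and names are illustrative.

```python
from typing import Dict

# Upper LD50 bounds (mg/kg) for classes 1 to 5; anything above 5000 is class 6.
# Class 4 (300 < LD50 <= 2000) matches the text; the rest are GHS cut-offs (assumed).
CLASS_UPPER_BOUNDS = [5, 50, 300, 2000, 5000]


def toxicity_class(ld50_mg_per_kg: float) -> int:
    """Return the toxicity class (1 = most toxic, 6 = least toxic)."""
    for cls, bound in enumerate(CLASS_UPPER_BOUNDS, start=1):
        if ld50_mg_per_kg <= bound:
            return cls
    return 6


# LD50 (mg/kg) and synthetic accessibility values copied from Table 1.
TABLE_1: Dict[str, Dict[str, float]] = {
    "New Linker 3 (molecule 17)": {"ld50": 790, "sa": 3.16},
    "Mal-Me3Lys-Pro": {"ld50": 215, "sa": 4.76},
    "Maleamic methyl ester-based linker 12A": {"ld50": 1000, "sa": 3.89},
}

classes = {name: toxicity_class(row["ld50"]) for name, row in TABLE_1.items()}
easiest_to_synthesize = min(TABLE_1, key=lambda name: TABLE_1[name]["sa"])
```

Under these cut-offs the computed classes are 4, 3, and 4, which agrees with the Toxicity Class column, and the linker with the lowest synthetic accessibility score is the new linker.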

 

Conclusion

These findings demonstrate the potential of diffusion-based generative models to accelerate the design of linkers in ADCs. Encoding the entire ADC context, including antibody sequences and payloads, allows the model to generate linkers that are structurally valid and adapted to the specific requirements of the other ADC components. Applying diffusion models can streamline the drug discovery process and help produce a linker that balances efficacy and safety. These results highlight the model’s capacity to address key challenges in ADC development, such as improving linker stability, minimizing off-target toxicity, and enhancing overall therapeutic efficacy.

In the future, there are several ways this project can be improved to optimize linker generation. In generative models, a larger dataset generally results in a model that learns and generalizes better. Currently, there is a lack of data in the ADC field, with the majority of ADCs on ADCdb lacking key components. The dataset could be enriched by searching for more ADC datasets that provide all components of each ADC, as well as by augmenting the data. Additionally, the model can be extended to broader ADC design challenges, which would accelerate the entire drug design process. For example, more complex models such as residual diffusion models would provide an opportunity to explore more fully the associations between the different components of ADC drugs. Ultimately, these advancements hold potential for accelerating preclinical and clinical testing, reducing costs, and enabling the development of more effective cancer therapies.

 

Citations

  1. World Health Organization. Global cancer burden growing, amidst mounting need for services. 2024.
  2. WebMD. Chemotherapy: Types, how it works, procedure and side effects. n.d.
  3. Maecker H, Jonnalagadda V, Bhakta S, Jammalamadaka V, Junutula JR. Exploration of the antibody-drug conjugate clinical landscape. MAbs. 2023;15(1):2229101.
  4. Challener C. Optimization of Linker Chemistries for Antibody-Drug Conjugates. BioPharm International. 2023;36(11):12-15.
  5. Ho J, Jain A, Abbeel P. Denoising Diffusion Probabilistic Models. arXiv. 2020.
  6. Baah S, Laws M, Rahman KM. Antibody-Drug Conjugates-A Tutorial Review. Molecules. 2021;26(10):2943.
  7. Shen L, et al. ADCdb: the database of antibody-drug conjugates. Nucleic Acids Res. 2023.
  8. Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123-1130.
  9. Sharma R, Saghapour E, Chen JY. An NLP-based technique to extract meaningful features from drug SMILES. iScience. 2024;27(3).
  10. Banerjee P, Kemmler E, Dunkel M, Preissner R. ProTox 3.0: a webserver for the prediction of toxicity of chemicals. Nucleic Acids Res. 2024.
  11. Daina A, Michielin O, Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep. 2017;7:42717.

Acknowledgement

I would like to thank many people who helped me in the completion of this project. First, I would like to thank Dr. Jake Chen from the University of Alabama at Birmingham for guiding the direction of my project and providing advice at various stages of my project. I would also like to thank Yongna Dai from the Beijing University of Technology, College of Computer Science for her support in the technical aspects of the model and helping me with some of the coding. Lastly, I would like to thank my parents. Without their help and support, I would not have been able to complete this project.