Generating Novel Linker Structures in Antibody-Drug Conjugates Using Diffusion Models
April Cao
Grade 11
Method
1.1 Dataset collection
The dataset was scraped from the public online ADCdb database (accessible at https://adcdb.idrblab.net) [7], an online database focused on providing pharmacological information and biological activities of ADCs. The database was built through comprehensive literature review, with the pharmacological information of each ADC and its main components systematically extracted. For each ADC, we collected its corresponding structures: linker, payload, and antibody. The linkers and payloads were represented as SMILES sequences, a common format for representing chemical species, and the antibodies were broken down into their heavy- and light-chain amino acid sequences. For model training, the dataset was standardized by removing entries with missing information and duplicate entries. A key limitation of this dataset is that, despite containing over 6000 ADCs, the vast majority are missing one or more key components, preventing them from being included in the training set. This limited dataset makes machine learning harder to apply to ADCs, since machine learning typically requires large amounts of data.
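As a rough illustration, the standardization step might look like the following pandas sketch; the file name and column names are assumptions, not the actual ADCdb schema.

```python
# A minimal sketch (pandas) of the dataset standardization described above.
# The file name and column names are hypothetical placeholders.
import pandas as pd

adc = pd.read_csv("adcdb_export.csv")  # hypothetical export of ADCdb records
required = ["linker_smiles", "payload_smiles", "heavy_chain", "light_chain"]

adc = adc.dropna(subset=required)           # drop entries missing key components
adc = adc.drop_duplicates(subset=required)  # drop duplicate entries
```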
1.2 Feature representation
For feature representation, the protein language model ESM-2 (freely available at https://github.com/facebookresearch/esm) was used for the antibodies, since they are proteins [8]. ESM-2 is a transformer-based language model that uses an attention mechanism to learn interaction patterns between pairs of amino acids in input sequences. Through transfer learning, it can perform feature extraction on antibodies, outputting 1280-dimensional feature vectors. For the linkers and payloads, which are small molecules, a novel NLP-based method (freely available at https://github.com/rahulsharma-rs/drug-smile-fet.git) was used [9]. In this method, SMILES strings are interpreted as natural-language sequences: an N-gram-based feature extraction technique (borrowed from NLP) isolates significant, interpretable features from the SMILES, allowing each drug molecule's structure to be analyzed and converted into a fixed-length feature vector.
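As an illustration of the antibody branch, the following minimal sketch extracts a 1280-dimensional embedding with ESM-2 using the repository's documented API; the sequence is a truncated placeholder, and mean pooling over residues is an assumption about how per-residue features are aggregated.

```python
# A minimal sketch of antibody feature extraction with ESM-2.
import torch
import esm

# esm2_t33_650M_UR50D has 33 layers and a 1280-dimensional hidden state
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAAS"  # placeholder heavy-chain fragment
_, _, tokens = batch_converter([("heavy", heavy_chain)])

with torch.no_grad():
    results = model(tokens, repr_layers=[33])

# Mean-pool the per-residue representations (skipping BOS/EOS tokens)
# into a single 1280-dimensional feature vector
reps = results["representations"][33]
embedding = reps[0, 1 : len(heavy_chain) + 1].mean(dim=0)  # shape: (1280,)
```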
1.3 Diffusion model
The framework used for the diffusion model was a DDPM (Denoising Diffusion Probabilistic Model) [5]. A DDPM consists of two Markov chains (stochastic processes in which each step depends only on the current state). The forward chain gradually adds noise to the data, while the reverse chain removes noise to recover the original data.

The model development process has two parts. First, in training, the network gradually adds random noise to a linker embedding and learns to reverse the process to recover the original molecule. This is iterated many times so the model improves its prediction of the added noise. Second, in inference, molecular constraints (the other components of the ADC) are provided as input, the model predicts the noise, and a new linker is generated by removing it.

In constructing the model, a two-layer fully connected network was defined to map the high-dimensional context vectors to a lower-dimensional space (256 dimensions). The linker generator was defined as three fully connected layers with ReLU (rectified linear unit) activations in between, which generates a new linker embedding; the ReLU layers introduce non-linearity into the network. A further two-layer fully connected network maps the one-dimensional time tensor to a higher-dimensional embedding. In the forward propagation process, the input data (heavy chain, light chain, and payload embeddings) are concatenated and mapped into a 256-dimensional context vector, which is then concatenated with the time-step embedding. The merged vector is passed through the noise predictor to produce the predicted linker noise.

For the noise schedule, betas (linearly spaced from 1e-4 to 0.02) give the variance of the noise added at each step. Their complement, alphas = 1 - betas, and the cumulative product alphas_bar are used to construct the denoising process, and the square roots of alphas_bar and of 1 - alphas_bar serve as the coefficients used to generate noised data in the forward diffusion process.

For model training, the parameters were set to: epochs = 50, batch size = 32, and learning rate = 0.0001. Higher and lower values of the epochs and learning rate were tried, with these producing the best training results; the batch size was kept relatively low due to hardware constraints. The dataset was split into training and validation sets, with 20% held out for validation. Each batch of data calls the forward propagation, and mean squared error (MSE) is used to compute the loss between the predicted and actual noise. MSE was chosen because it aligns with the properties of Gaussian distributions, making it a natural fit for measuring noise-prediction accuracy. A context loss between the noised data and the original data is added to the MSE to obtain the total loss. Using the total loss, backpropagation computes the gradients, clips them (to prevent exploding gradients), and updates the parameters so that the model gradually learns to predict the noise.
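To make the schedule and training step concrete, here is a minimal PyTorch sketch under the settings described above. The number of diffusion steps T, the model signature, and the optimizer setup are assumptions, and the context-loss term is omitted for brevity.

```python
# A minimal PyTorch sketch of the noise schedule and one training step;
# variable names mirror the text. T and the model interface are assumed.
import torch
import torch.nn.functional as F

T = 100  # number of diffusion steps (assumed; not stated in the text)
betas = torch.linspace(1e-4, 0.02, T)        # noise variance at each step
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)    # cumulative product of alphas
alphas_bar_sqrt = torch.sqrt(alphas_bar)
one_minus_alphas_bar_sqrt = torch.sqrt(1.0 - alphas_bar)

def training_step(model, optimizer, x0, context):
    """Noise a clean linker embedding x0, predict the noise, and update."""
    t = torch.randint(0, T, (x0.shape[0],))              # random time step
    noise = torch.randn_like(x0)
    # Forward diffusion using the schedule coefficients above
    x_t = (alphas_bar_sqrt[t, None] * x0
           + one_minus_alphas_bar_sqrt[t, None] * noise)
    pred = model(x_t, t, context)                        # predicted noise
    loss = F.mse_loss(pred, noise)                       # MSE noise loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients
    optimizer.step()
    return loss.item()
```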
Backpropagation is the fundamental algorithm for training artificial neural networks: it computes the gradient of the loss function with respect to the network's weights, allowing the optimizer to minimize the model's errors and improve its predictions, the end result being a well-trained noise-prediction model. The next step is model inference, where new data is fed into the trained model. First, the trained noise-prediction model is loaded. Then, a random tensor x is initialized with the same batch size as the input embeddings, with its shape matching the linker dimensions. To simulate the backward diffusion process, the time steps t are iterated in reverse order. At each step, the model is called with the current noise-containing tensor x, the time step t, and the contextual information to predict the noise to remove, and x is updated according to the denoising formula of the diffusion model (provided below) to gradually remove the predicted noise.
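The denoising formula referenced above is the standard DDPM reverse-step update from Ho et al. [5]; written in terms of the schedule quantities defined earlier, with a conditional noise predictor $\epsilon_\theta$ that also receives the context $c$, it is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

where $\bar{\alpha}_t$ is the cumulative product of the $\alpha$ values (alphas_bar), $\sigma_t = \sqrt{\beta_t}$, and no noise $z$ is added at the final step.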
After all time steps have been denoised, the original heavy chain embedding, light chain embedding, payload embedding, and the generated linker vector x are concatenated to form the final embedding. For novel linker generation, samples were randomly chosen from the validation set and gradually denoised through model inference to obtain new linkers, which were saved to an Excel file for subsequent decoding.
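A minimal sketch of this sampling loop, consistent with the update formula above, might look as follows; `model` (the trained noise predictor) and `context` (the concatenated antibody and payload embeddings) are assumed names based on the surrounding text.

```python
# A minimal sketch of the reverse (denoising) loop described above.
import torch

def generate_linker(model, context, linker_dim, T, betas, alphas, alphas_bar):
    x = torch.randn(context.shape[0], linker_dim)  # start from pure noise
    for t in reversed(range(T)):                   # time steps in reverse order
        t_batch = torch.full((context.shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, context)           # predicted noise
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean                               # no noise at the final step
    return x                                       # generated linker embedding
```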
1.4 Decoder
The decoding process converts the linker embedding generated by the diffusion model back into a SMILES sequence, the format typically used to represent chemical structures. Based on the ADCdb database, a dataset of linkers with their corresponding embeddings and SMILES strings was collected, and the goal was to train a vector decoder that maps molecular embedding vectors to SMILES strings.
First, the SMILES strings are broken down into basic units (atoms, bonds, and other symbols) to construct a character-set dictionary, which disambiguates the SMILES strings. Each SMILES string is then segmented and converted into a fixed-length index sequence.

Next, a decoder model was built on Keras, a high-level neural network API fully integrated into TensorFlow 2 that allows deep learning models to be built and trained with less code and a higher level of abstraction. The decoder maps an input embedding to the logits of a character sequence. The logits at each position form a vector with the same number of elements as the model's vocabulary, where each value is a raw score ("logit") indicating how strongly the model favors that character as the next one in the sequence. In this model, an input layer receives the input data, followed by a dense (fully connected) layer with 512 neurons. L2 regularization with a weight-decay coefficient of 0.01 was applied to prevent overfitting, which it does by keeping the weights small and more evenly distributed. Next is a LeakyReLU activation with a negative slope of 0.01; LeakyReLU alleviates the "dead neuron" problem that traditional ReLU suffers in the negative region. Because the model initially overfit significantly, leading to poor results on the validation set, a dropout layer was added that randomly drops 50% of neurons during training, which improved the model's generalization ability.

For the loss function, a base cross-entropy loss was computed between the model-predicted SMILES character sequences and the true sequences. Generated SMILES strings were also checked for validity using RDKit, with invalid SMILES incurring an additional penalty in the loss. A beam search algorithm was used to generate the SMILES sequences: for each candidate sequence, every index was converted back to its corresponding character, and the resulting strings were checked for bracket matching and SMILES validity.

For model compilation, the Adam optimizer (a common algorithm for deep learning tasks) was used with the loss function defined above, including the SMILES-validity penalty, and the accuracy of predicted sequences with respect to the true sequences was the metric used to evaluate the model's performance. For model training, early stopping was implemented by monitoring the validation loss with a patience of 10, stopping training if the validation loss did not improve for 10 consecutive epochs and restoring the weights with the best validation loss. Other preset parameters were: epochs = 30, batch size = 32, and shuffle = True (shuffling the training data before each epoch). As with the diffusion model, the number of epochs was chosen based on what produced the best training results, while the batch size reflected hardware constraints.
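A minimal Keras sketch consistent with this description is shown below; the vocabulary size, sequence length, and input embedding size are assumptions, and the beam search and validity penalty are reduced to a simple RDKit parse check.

```python
# A minimal Keras sketch of the decoder described above.
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from rdkit import Chem

VOCAB_SIZE = 64   # size of the SMILES character dictionary (assumed)
MAX_LEN = 120     # fixed length of the index sequences (assumed)
EMBED_DIM = 256   # linker embedding size from the diffusion model

inputs = keras.Input(shape=(EMBED_DIM,))
x = layers.Dense(512, kernel_regularizer=regularizers.l2(0.01))(inputs)
x = layers.LeakyReLU(alpha=0.01)(x)   # negative slope of 0.01
x = layers.Dropout(0.5)(x)            # added to curb overfitting
# One logit vector per sequence position over the character vocabulary
x = layers.Dense(MAX_LEN * VOCAB_SIZE)(x)
outputs = layers.Reshape((MAX_LEN, VOCAB_SIZE))(x)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)
# model.fit(embeddings, index_sequences, epochs=30, batch_size=32,
#           shuffle=True, validation_split=0.2, callbacks=[early_stop])

def smiles_is_valid(s: str) -> bool:
    """RDKit parse check used when penalizing invalid generated SMILES."""
    return Chem.MolFromSmiles(s) is not None
```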
Analysis
Results
The diffusion model successfully generated novel linker sequences conditioned on the other components of a specific ADC, including the antibody and payload.
Diffusion model performance analysis
- Noise Loss
The final values of the training noise loss and validation noise loss are both close to their minimum, indicating the model's strong ability to predict noise. The gap between the two curves is also very small, indicating strong generalization. Both curves are smooth, without drastic fluctuations, a sign of a stable training process.
- Context Loss
Both the training loss and validation loss gradually decrease as the number of epochs increases. The gap between them is relatively small (typically within 10-20%), indicating strong generalization ability.
- Problems
Convergence occurs very quickly, with the loss curves flattening out after about 10 epochs, which could indicate the model settling into a local minimum. The likely causes are the small dataset and the simple network structure.
Decoder model performance analysis
- Training set loss and validation set loss
As the number of epochs increases, the training loss and validation loss gradually decrease until stabilizing near a minimum. However, with more epochs the gap between them widens: performance keeps improving on the training set while worsening on the validation set. This is likely related to the small dataset and its nonuniform distribution.
- Training set accuracy and validation set accuracy
The training and validation accuracy gradually increase with the number of epochs, and both final values approach a maximum, indicating the model fits both the training and validation data well. However, the final training accuracy exceeds 90%, performing very well on seen data, while the validation accuracy is only ~75%. This suggests the model has some generalization ability on unseen data, but performs worse than on the training data, indicating a degree of overfitting.
Next, we evaluate the linkers generated by the model.
We used UMAP, a dimensionality-reduction technique that helps visualize and analyze high-dimensional data, to map the linker embeddings into a lower-dimensional space. UMAP analysis was used to determine the "nearest neighbours" of the generated linkers, i.e., the data points closest to a given point in the dataset. One particularly promising generated linker was chosen, along with its closest neighbours, for further analysis.
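A minimal sketch of this analysis using the umap-learn and scikit-learn libraries might look as follows; the arrays are random placeholders standing in for the real linker embeddings.

```python
# A minimal sketch of the UMAP projection and nearest-neighbour search.
import numpy as np
import umap
from sklearn.neighbors import NearestNeighbors

# Placeholder data standing in for the real linker embeddings (assumed shapes)
rng = np.random.default_rng(0)
linker_embeddings = rng.normal(size=(500, 256))   # known linkers
generated = rng.normal(size=(1, 256))             # one generated linker

# 2-D projection for visualization
projection = umap.UMAP(n_components=2, random_state=42).fit_transform(
    linker_embeddings
)

# Nearest neighbours of the generated linker in the embedding space
nn = NearestNeighbors(n_neighbors=3).fit(linker_embeddings)
distances, indices = nn.kneighbors(generated)     # indices of closest linkers
```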
| Linker | SMILES | LD50 Value | Toxicity Class | Lipophilicity | Bioavailability Score | Synthetic Accessibility |
| --- | --- | --- | --- | --- | --- | --- |
| Thresholds | | LD50 > 500 mg/kg (moderate toxicity) | 1-6 (6 is least toxic) | LogP 1-3 (favorable lipophilicity) | > 0.55 (acceptable bioavailability) | SA < 5 (practical synthesis) |
| New Linker 3 (molecule 17) | O=C(NCCN1C(=O)C=CC1=O)CCCSSc1ccccn1 | 790 mg/kg | 4 | 1.3 | 0.55 | 3.16 |
| Mal-Me3Lys-Pro | C[N+](C)(C)CCCCC(NC(=O)C1CCCN1)C(=O)NCCNC(=O)CCN1C(=O)C=CC1=O | 215 mg/kg | 3 | -2.29 | 0.55 | 4.76 |
| Maleamic methyl ester-based linker 12A | CC(NC(=O)C(NC(=O)CCNC(=O)/C=C/C(=O)O)C(C)C)C(=O)Nc1ccc(CO)cc1 | 1000 mg/kg | 4 | 0.41 | 0.11 | 3.89 |
This table compares the LD50 values, toxicity classes, lipophilicity, bioavailability, and synthetic accessibility of the novel linker and its nearest neighbours. Toxicity analysis was performed with ProTox-3.0, which provided the LD50 values and toxicity classes (the first two data columns of the table) [10]. ProTox is based on 61 models for predicting toxicity endpoints, built on Random Forest machine learning and deep neural network algorithms, and it incorporates molecular similarity, fragment propensities, most frequent features, and fragment-similarity-based CLUSTER cross-validation. LD50, the median lethal dose, is the dose at which 50% of test subjects die upon exposure. The new linker's LD50 of 790 mg/kg clears the > 500 mg/kg threshold, and its class 4 toxicity (300 < LD50 <= 2000) indicates it is harmful if swallowed; however, most ADCs are administered intravenously, so this is an acceptable toxicity for a linker. The new linker outperforms Mal-Me3Lys-Pro, which is more toxic with a lower LD50, while showing values similar to the other neighbour.

The other three values were determined with the SwissADME model [11]. Lipophilicity, the ability of a compound to dissolve in lipids, is reported as a consensus of five different models. Druglikeness can be shown in multiple ways; the table reports the Abbott bioavailability score, which predicts the probability that a compound has at least 10% oral bioavailability in rats. Here, bioavailability refers to the ability of the drug to be absorbed and used by the body. The generated linker outperforms Maleamic methyl ester-based linker 12A, whose bioavailability score falls below the threshold, while once again matching the other neighbour.

Lastly, synthetic accessibility is a score between 1 and 10, where 1 is the easiest to synthesize in a lab and 10 the most difficult (and therefore costly). The novel linker outperforms both neighbours in synthetic accessibility. This highlights the improvements in properties achieved through the diffusion model, with the results underscoring its potential to generate novel linkers with optimized properties for further refinement in ADC development.
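As a quick cross-check of the lipophilicity column, RDKit's Wildman-Crippen LogP (related to WLOGP, one of the estimators SwissADME averages) can be computed directly from the generated linker's SMILES; the single-method value will not exactly match the consensus LogP of 1.3 in the table.

```python
# Sanity-check the generated linker's lipophilicity with RDKit's Crippen LogP.
from rdkit import Chem
from rdkit.Chem import Crippen

mol = Chem.MolFromSmiles("O=C(NCCN1C(=O)C=CC1=O)CCCSSc1ccccn1")  # New Linker 3
print(round(Crippen.MolLogP(mol), 2))  # one estimator; compare with table value
```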
Conclusion
These findings demonstrate the potential of diffusion-based generative models to accelerate the design of linker sequences in ADCs. Encoding the entire ADC, including the antibody sequences and payload, allows the model to generate linkers that are structurally valid, pharmacologically stable, and adapted to the specific requirements of the other ADC components. The application of diffusion models streamlines the drug discovery process toward linkers that balance efficacy and safety profiles. These results highlight the model's capacity to address key challenges in ADC development, such as improving linker stability, minimizing off-target toxicity, and enhancing overall therapeutic efficacy.
In the future, there are several ways this project could be improved to optimize linker generation. In generative modeling, a larger dataset generally yields a model that learns and generalizes better. Currently, data in the ADC field is scarce, with the majority of ADCs on ADCdb lacking key components. The dataset could be enriched by searching for additional ADC datasets that provide all components of the ADC, as well as by augmenting the existing data. Additionally, the model could be extended to broader ADC design challenges, which would accelerate the entire drug design process; for example, more complex architectures such as residual diffusion models would provide an opportunity to more fully explore the associations between the different components of ADC drugs. Ultimately, these advancements hold potential for accelerating preclinical and clinical testing, reducing costs, and enabling the development of more effective cancer therapies.
Citations
- World Health Organization. Global cancer burden growing, amidst mounting need for services. 2024.
- WebMD. Chemotherapy: types, how it works, procedure and side effects. n.d.
- Maecker H, Jonnalagadda V, Bhakta S, Jammalamadaka V, Junutula JR. Exploration of the antibody-drug conjugate clinical landscape. MAbs. 2023;15(1):2229101.
- Challener C. Optimization of linker chemistries for antibody-drug conjugates. BioPharm International. 2023;36(11):12-15.
- Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. arXiv. 2020.
- Baah S, Laws M, Rahman KM. Antibody-drug conjugates: a tutorial review. Molecules. 2021;26(10):2943.
- Shen L, et al. ADCdb: the database of antibody-drug conjugates. Nucleic Acids Res. 2023.
- Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123-1130.
- Sharma R, Saghapour E, Chen JY. An NLP-based technique to extract meaningful features from drug SMILES. iScience. 2024;27(3).
- Banerjee P, Kemmler E, Dunkel M, Preissner R. ProTox 3.0: a webserver for the prediction of toxicity of chemicals. Nucleic Acids Res. 2024.
- Daina A, Michielin O, Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep. 2017;7:42717.
Acknowledgement
I would like to thank many people who helped me in the completion of this project. First, I would like to thank Dr. Jake Chen from the University of Alabama at Birmingham for guiding the direction of my project and providing advice at various stages of my project. I would also like to thank Yongna Dai from the Beijing University of Technology, College of Computer Science for her support in the technical aspects of the model and helping me with some of the coding. Lastly, I would like to thank my parents. Without their help and support, I would not have been able to complete this project.