Development of a Software-based Method to Predict the Binding Strength of Covid-19 Mutations
Can the strength of binding between a SARS-CoV-2 (Covid-19) spike protein and ACE2 receptor protein be predicted using easily-accessible computer models and softwares?
When a Covid-19 spike protein binds to a receptor protein in the human body, a large number of residues will become buried and inaccessible to solvent. This burial of nonpolar surface area is energetically favourable and can be explained by the hydrophobic effect. Nonpolar interactions are the primary contributor of protein binding and predominantly occur at the protein interface (Sowmya and Ranganathan, 2015). The solvent-accessible surface areas (surface area of a biomolecule that is exposed and accessible to a solvent) of the virus and receptor before binding and the complex formed after binding can be quantified on the GetArea server (Chakravarty and Varadarajan, 1999). Comparing the difference in the burial of nonpolar surface area of different mutants to the wild type will give the relative predicted strength of binding. Hydrogen bonds and salt bridges are also primary contributors to protein binding and stability (Koirala RP et al., 2020), and the lengths and number of them in each mutant will also be compared to the wild type. These interactions when scaled and calculated together as energy contributions will give the relative strength of binding of each mutant.
(1) Fifteen amino acid mutations (wild type and thirteen mutations) of the SARS-CoV-2 spike protein were made from the FASTA sequence (amino acid sequence) of 6M0J, a crystal protein structure file found on the Protein Data Bank (Wang, et al., 2020). Alanine, proline, and charged amino acids were excluded from the mutations made; histidine was included. These mutations were made at the 453th amino acid in the FASTA sequence. All protein models were made through Swiss Model (Bienert, et al., 2017; Bertoni, et al., 2017; Benkert, et al., 2011). Models of the mutated virus sequence bound to the ACE2 receptor were made and saved as PDB files. The input template in Swiss Model was a pdb file of 6m0j made with an input sequence 100% identical to the wild type made in Swiss Model. N-Acetyl-D-Glucosamine proteins (NAGs) and the chloride ion ligands were removed from the template pdb file. Models of the mutated virus by itself (not bound to the receptor) were made using an input template of a Swiss Model of 7JMO with an input sequence 100% identical to the wild type. The NAGs were also removed from the template and the models were also saved as pdb files. A model of the ACE2 receptor was made with a Swiss Model template made from a sequence 100% identical to the wild type virus of 6m0j.
(2) The methodology of Krissinel and Henrick (2007) was modified as described below. The server GetArea (Fraczkiewicz, R. and Braun, W., 1998) was used to calculate the nonpolar accessible surface area (NASA) for the nonpolar atoms of the complex (virus + receptor), the virus by itself, and the receptor by itself (ACE2). Nine water probe radii (1.36, 1.37, 1.38. 1.39, 1.40, 1.41, 1.42, 1.43, 1.44 Å) were used in GetArea for each mutant and NASA values obtained were averaged to decrease the noise. Instead of the default radii of atoms in GetArea, the individual atom radii used were taken from the publication of Chernyshov et al. (2020). The advanced options of atomic solvation parameters (ASP) on GetArea were modified so only nonpolar surface areas (NSA) were calculated (see attachments). Sulfurs, carbonyl carbons, and alpha-carbons were not considered nonpolar as they are very polarizable or slightly polar and would not contribute to the nonpolar interactions. Excel spreadsheets (Microsoft Corporation, 2018) were used for all calculations. When NaN (not a number) errors occurred for a specific radii in GetArea, 0.001Å below and above the probe radii were inputted and averaged. These averages are indicated with red font. See images below.
(3) For each mutant, the radii and its corresponding NASA was graphed, as these variables should be linearly related. The correlation of the graph was calculated using Excel, and a smoothing process was completed for more accurate results. A correlation of at least -0.999 was required for all mutant NASAs. In the case the correlation was lower, data points that fell above/below the trendline were further averaged by using probe radii of 0.001Å below and above. These averages are indicated with yellow highlighting. See image below.
(4) The averaged NASA for the virus and receptor before binding were added together. Then, the averaged NASA of the virus-receptor complex was subtracted from this value to find how much NSA was buried after binding. The buried NSA area of the wild type was subtracted from each of the mutant buried NSA surface areas to give the relative increase or decrease in buried NSA. Positive values indicate more buried NSA while negative values indicate less buried NSA. The decrease in nonpolar accessible surface area NASA upon forming the complex is then taken to be linearly related to the energy contribution of the hydrophobic effect. Krissinel and Henrick (2007) use a conversion factor of 16 cal/mol Å² for buried NSA to energy conversion, however this includes both polar and nonpolar surface area when calculating buried NSA. Their buried NASA includes contact of polar and nonpolar atoms.
The conversion factor for buried NSA to calories used in this study was determined by analysis of free energy for the solubility of five hydrocarbons (butane, pentane, hexane, heptane, octane) at 298K which was obtained from PubChem (Kim, et al., 2019). The free energy of X(aq) → X(lq) was obtained from the solubility of each hydrocarbon. The free energy of dissolving one mole of X in nonpolar liquid cyclohexane was calculated (assuming ideal mixing). The NSA was calculated using the Get Area server. ΔGtotal was divided by NSA to give an average of 28 cal/mol Å². Plotting ΔGtotal versus buried NSA yielded a slope of 31 cal/mol Å². Thus, an approximately average conversion factor of 30 cal/mol Å² (0.030 kcal/mol Å²) was used. This value is positive as we are just observing the magnitude of energy contribution, so for simplicity sake positive scores indicate more energy contribution and negative scores equal less energy contribution.
(5) Hydrogen bonds (HB) and salt bridges (SB) were identified for each mutant using the PDBePISA analysis (Krissinel and Henrick, 2007), which provides the length of each HB under 4Å. Pace et al. (2014), state that hydrogen bonds within proteins contribute -1.0 kcal/mol in protein folding. The HB studied had lengths of around 2.50Å. This free energy conversion factor was improved based on these factors:
- HB at interfaces such as virus-receptor interactions are likely weaker than the buried HBs that exist within an average protein as:
- HB at the interface of the virus-receptor complex (in this study) were all longer than 2.50Å; none were shorter than 2.50Å
- Mutations may change the length of HBs without breaking them (leading to a different energy contribution with the same number of HBs)
The free energy contribution for hydrogen bonds formed as the virus binds to the receptor was calculated using the equation below.
Equation 1 : ΔG = 1.00 kcal/mole * [(4.00Å-L)/2.00Å]
Like the conversion factor above, this equation is also for the magnitude of energy contribution and all values are positive. This formula was applied to HB in the wild type complex, and the average energy of a HB is 0.45 kcal/mol. This value is in agreement with the average HB energy contribution of -0.444 kcal/mol obtained by fitting a large database of such protein-protein interfaces (Krissinel and Henrick, 2007).
Salt bridges (SB) were identified and scaled the same as HBs with the equation above. The average length of a SB in this study is 3.0 Å, which based on our calculation gives an energy of 0.50 kcal/mol. This is a larger value than the PDBePISA server value of 0.37 kcal/mol. Double mutant studies state that a SB contributes more energy to protein folding than a HB. This higher energy is accounted for in the method because SB are identified as having both a HB and a SB. Based on preliminary calculations, the addition of a HB and SB equal to around the same energy contribution that the PDBePISA uses.
The energy contributions of HB and SB of the wild type were subtracted from the energy contributions calculated for each mutation to give the relative energy contributions.
(6) To calculate the energy contribution of each complex formation, the relative NSA was multiplied by 0.030 kcal/mol Å². Then, the HB and SB values calculated from equation 1 were added. As seen in the image below, these values are all relative to the wild type. These values all have units of kcal/mol.
The Swiss model made of mutant Y453A had an error in the matching of the template to the model. These model may not be accurate and thus mutant Y453A was omitted from this study.
No available number (NaN) errors were scattered across probe radii as well as mutants during the GetArea step. This noise is a result of a glitch in the software and the water probe may have gotten stuck at a particular location. If NaN errors appeared, 0.001Å above and below the non-functioning radii were used and the values were averaged. GetArea suggests using different radii if one gives a NaN error. Additionally, nine water probe radii (1.36, 1.37, 1.38. 1.39, 1.40, 1.41, 1.42, 1.43, 1.44 Å) were used and averaged for more accurate results. As the relationship between probe radii and NASA is observed to be linear, a smoothing process was then done to improve results (see step 3).
Because of the unique cyclic structure of proline, it tends to form bends in proteins. In Covid-19 protein mutations observed, it almost always leads to a weaker binding affinity. This was experimented with many proline substitutions at various positions of the spike protein. Including proline in the calculations actually would have increased the correlation of the method to the experimental results. Additionally, there would not be much study needed to be done about proline as the outcome is already predicted based on current trends. Thus, proline was omitted from the amino acid mutations.
During preliminary trials, it was discovered that different templates used in Swiss Model resulted in slightly different models. Models are based off of the template inputted, but ideally should be independent of the template used. Each time Swiss Model is used, the model made is “relaxed”, and the server uses ideal rotational angles in the new model. This “relaxing” of the model may be more accurate to what the proteins are like in solution, and thus, templates of 6m0j and 7JMO were made in Swiss model. Additionally, using the PDBePISA server, analysis was conducted and it was determined that the templates made through swiss model opposed to the templates taken from the PDB were more accurate to the virus and complex in solution.
Swiss Model makes slightly different models depending on the template used, determined from previous observations. Template 6m0j (ran through Swiss Model once) was used, and the original crystal structure of 6m0j was viewed using FirstGlance in Jmol (FirstGlance in Jmol, 2006). There were three ligands present: NAG, Zn, and Cl, which are not part of the virus-receptor interface. Each pdb file of a Covid spike protein and ACE2 receptor has different ligands, which may interfere with the accuracy of the model produced. That may change the accessible surface area calculated. The crystal structure 6m0j was chosen as it has been cited by numerous studies and was one of the earliest available structures. In PDBePISA the NAG and the Cl were determined to be weakly bound, and have little effect on the conformation of the protein mutations. As a result, they were completely removed from the template to eliminate any possible interference.. The Zn was left in as it was tightly bound to the complex and would have made a larger impact on the model.
The template 6m0j (ran through swiss model once) was used to make the models of the virus by itself. However, as the template had the virus bound to the ACE2 receptor, the model made was not as accurate to the virus alone in solution and had a shape more similar to one bound to the receptor. After Excel analysis of different pdb files of a covid spike protein, 7JMO was determined to have the least interactions with other proteins and was thus used.
In the original GetArea parameters, sulfurs were considered polar atoms; however, numerous studies report that sulfurs within proteins can form weak hydrogen bonds (Zhou, et al., 2008; Burley, S. and Petsko, 1989) and thus they were omitted from the GetArea parameters used in this study.
In this study the scores of the mutations are found as a measure of relative free energy change. By correlating this to published binding affinities of Covid-19 mutants at position 453 (Starr et al., 2020) it can be determined whether this method can accurately predict the binding strength. A Pearson’s Correlation Coefficient test was conducted to determine the statistical significance of the results. The two cases in this study are independent of each other and should be linearly related and the correlation of the two variables is also symmetric. The Social Science Statistics website (Stangroom, 2021) was used to calculate the significance of the datasets by determining the p value from the R score and N value. In this study, the significance level was set to 0.05.
Initial trails were conducted following the procedure without the improvements listed above. The P-Value was 0.093635, and not significant at p < .05.
Improvements and modifications were made throughout the study, and the final method used is detailed under “Methods.” The final trial had a P-Value of 0.043854, giving a significant result at p < .05.
To restate the research question: can the strength of binding between a SARS-CoV-2 (Covid-19) spike protein and ACE2 receptor protein be predicted using easily-accessible computer models and softwares?
The results of this study imply that from using servers accessible on the internet and calculations on Excel, the strength of binding between a SARS-CoV-2 spike protein and ACE2 receptor protein can be predicted with statistically significant correlation based on reported scores from the first-available published binding affinities of Covi-19 mutants. 3D protein structures of Covid-19 mutations can be made through Swiss Model. The energy change of the binding of a SARS-CoV-2 spike protein and ACE2 receptor protein can be calculated from buried non polar surface area the difference in hydrogen bonds and salt bridge lengths during the formation of the complex compared to the wild type. Non polar areas of the Covid-19 spike protein, ACE2 receptor, and virus-receptor complex are found from the GetArea server. The difference in hydrogen bond and salt bridge number and length from the wild type can be found on PDBePISA. For each mutation, the buried non polar surface area and HB and SB were converted to energy contribution.
Sources of Error
(1) Using structures made on Swiss Model result in rigid structures of models, which is not how these proteins actually exist. These proteins have flexibility in solution, and may flex wider or narrower. This would possibly result in less or more calculated NASA depending on the protein’s position in solution.
(2) In this method developed only nonpolar interactions, hydrogen bonds, and salt bridges are considered for interactions that would contribute to binding. However, there exist energy contributions from vanderwaal interactions, electrostatic interactions, and torsion angle twisting between the virus and the receptor that are not taken into account. Although these are minor interactions, including them in the calculations will give more accurate and realistic results. (Chakravarty and Varadarajan, 1999).
(3) The hydrogen bond and salt bridge scaling is not accurate enough to use for charged amino acid mutations. The focus of calculations has been on non polar interactions, as they are the primary factor in these types of interactions. For uncharged amino acids, there are very little differences in the salt bridges and hydrogen bonds, and the calculation used slightly improved the results. However, with the mutation of a charged amino acid, there are more changes in the salt bridges of the complex (analyzed in PDBePISA). There are not only more salt bridges formed, but more interference with existing salt bridges as well.
This method was developed to support current experiments being conducted in labs that determine the binding affinity of Covid-19 mutants. While this method and current technologies are not developed enough to completely substitute these experiments, they are much faster and can be a preliminary step to identifying virus variants of concern. Current in-lab experiments take several months to test mutations at one position, and this method can help narrow down which positions are of concern and significant in the virus-receptor interface. As of now, the calculations and softwares are being done by hand and through Excel; mutation binding strengths at one position can be approximated in less than an hour. Additionally, this method is able to determine which interactions are playing a role in binding and how significant they are, whereas current experiments are just testing for binding affinity. This method can be used to calculate approximate binding affinities of single, double, triple, or even more mutations in one variant. I hope this method can be further developed as modeling technologies advance to be a viable replacement to the experiments that are currently being done.
Wang, X., Lan, J., Ge, J., Yu, J., Shan, S. (2020) Crystal structure of SARS-CoV-2 spike receptor-binding domain bound with ACE2 Nature 215:220
Wu, N.C., Yuan, M., Liu, H., Zhu, X., Wilson, I.A. (2020) Crystal structure of SARS-CoV-2 receptor binding domain in complex with neutralizing antibody COVA2-04 Biorxiv
Krissinel, E., & Henrick, K. (2007). Inference of macromolecular assemblies from crystalline state. Journal of Molecular Biology, 372(3), 774-797. doi:10.1016/j.jmb.2007.05.022
Chernyshov, I. Y., Ananyev, I. V., & Pidko, E. A. (2020). Revisiting Van der Waals Radii: From COMPREHENSIVE structural analysis TO KNOWLEDGE‐BASED classification of interatomic contacts. ChemPhysChem, 21(5), 370-376. doi:10.1002/cphc.201901083
Fraczkiewicz, R. and Braun, W. (1998) "Exact and Efficient Analytical Calculation of the Accessible Surface Areas and Their Gradients for Macromolecules" J. Comp. Chem., 19, 319-333.
E. Krissinel and K. Henrick (2007). 'Inference of macromolecular assemblies from crystalline state.'. J. Mol. Biol. 372, 774--797.
Bienert, S., Waterhouse, A., de Beer, T.A.P., Tauriello, G., Studer, G., Bordoli, L., Schwede, T. The SWISS-MODEL Repository - new features and functionality. Nucleic Acids Res. 45, D313-D319 (2017).
Bertoni, M., Kiefer, F., Biasini, M., Bordoli, L., Schwede, T. Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Scientific Reports 7 (2017).
Benkert, P., Biasini, M., Schwede, T. Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27, 343-350 (2011).
Microsoft Corporation. (2018). Microsoft Excel. Retrieved from https://office.microsoft.com/excel
Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., Zaslavsky, L., Zhang, J., & Bolton, E. E. (2019). PubChem in 2021: new data content and improved web interfaces. Nucleic acids research, 49(D1), D1388–D1395. https://doi.org/10.1093/nar/gkaa971
Pace, C. N., Fu, H., Lee Fryar, K., Landua, J., Trevino, S. R., Schell, D., . . . Grimsley, G. R. (2014). Contribution of hydrogen bonds to protein stability. Protein Science, 23(5), 652-661. doi:10.1002/pro.2449
FirstGlance in Jmol. (2006). FirstGlance in Jmol. https://proteopedia.org/wiki/fgij/
Zhou, P., Tian, F., Lv, F., & Shang, Z. (2008). Geometric characteristics of hydrogen bonds involving sulfur atoms in proteins. Wiley Online Library. Retrieved from https://onlinelibrary.wiley.com/doi/full/10.1002/prot.22327?casa_token=V54SUDYa4jwAAAAA:BanrZ6Iz9Up88ghpdy6-s47AYvd4ZJ3RyfHjlpsA8uvQ5EoloYITIwDDyh_rOmWvFjhQs-qmXNBIQew.
Burley, S., & Petsko, G. (1989). Electrostatic interactions in aromatic oligopeptides contribute to protein stability. ScienceDirect. Retrieved from https://www.sciencedirect.com/science/article/abs/pii/016777998990036X.
Starr, T. N., Greaney, A. J., Hilton, S. K., Crawford, K. H., Navarro, M. J., Bowen, J. E., . . . Bloom, J. D. (2020). Deep mutational scanning OF SARS-CoV-2 receptor binding domain REVEALS constraints on folding AND ace2 binding. doi:10.1101/2020.06.17.157982
Stangroom, J. (2021). Quick P Value from Pearson (R) Score Calculator. Social Science Statistics. https://www.socscistatistics.com/pvalues/pearsondistribution.aspx
Chakravarty, S., & Varadarajan, R. (1999). Residue depth: a novel parameter for the analysis of protein structure and stability. Structure, 7(7), 723–732. https://doi.org/10.1016/s0969-2126(99)80097-5
Sowmya, G., Ranganathan, S. Discrete structural features among interface residue-level classes. BMC Bioinformatics 16, S8 (2015). https://doi.org/10.1186/1471-2105-16-S18-S8
Allen, J., Almukhtar, S., Aufrichtig, A., Barnard, A., Bloch, M., & Cahalan, S. (2020, January 28). Coronavirus world map: Tracking the global outbreak. Retrieved March 19, 2021, from https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html
Rafael Bayarri-Olmos, Anne Rosbjerg, Laust Bruun Johnsen, Charlotte Helgstrand, Theresa Bak-Thomsen, Peter Garred, Mikkel-Ole Skjoedt
bioRxiv 2021.01.29.428834; doi: https://doi.org/10.1101/2021.01.29.428834
Koirala RP, Thapa B, Khanal SP, et al. Binding mechanism of SARS-CoV-2 spike protein with human ACE2 receptor. Research Square; 2020. DOI: 10.21203/rs.3.rs-71923/v1.
I would like to acknowledge my mentor Dr. Robert Allan Edwards (University of Calgary) for initial ideas and methods as well as suggestions. I thank my science fair coordinator Ms. Jensen for supporting me in the registration process.