Genomic mutations and changes in protein secondary structure and solvent accessibility of SARS-CoV-2 (COVID-19 virus)

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to make changes in protein secondary structure and relative solvent accessibility. Based on 6324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 in many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are most stable, potentially suitable to be targeted for vaccine and drug development.


Related works
Since the first genomes were collected in December 2019, there have been many findings on the mutations of SARS-CoV-2. For example, Phan 16 analysed 86 genomes of SARS-CoV-2 downloaded from the the Global Initiative on Sharing All Influenza Data (GISAID) database (https ://www.gisai d.org/) and found 93 mutations over the entire viral genome sequences. Among them, there are three mutations occurring in the RBD region of the spike surface glycoprotein S, including N354D, D364Y and V367F, with the numbers showing amino acid (AA) positions in the protein. That study also reveals three deletions in the genomes of SARS-CoV-2 obtained from Japan, USA and Australia. Two of these deletion mutations are in the ORF1ab polyprotein and one deletion occurs in the 3' end of the genome. Likewise, a study in 17 shows that the SARS-CoV-2 genomes may have undergone recurrent, independent mutations at 198 sites with 80% of these are of the nonsynonymous type. Alternatively, a SNP genotyping study in 18 discovered highly frequent mutations in the genes encoding the S protein, RNA polymerase, RNA primase, and nucleoprotein. Those high-frequency SNP mutations are worth further investigations for vaccine development because they may be linked to the virus transmissibility and virulence. Tang et al. 19 investigated 103 genomes of COVID-19 patients and discover mutations in 149 sites of these genomes. The study also shows that the spike gene S consistently has larger dS values (synonymous substitutions per synonymous site) than other genes. In addition, two major lineages of the virus, denoted as L and S, have been specified based on two tightly linked SNPs. The L lineage is found more prevalent than the S lineage among the examined sequences.
Korber et al. 20 tracked the mutations of spike protein S of SARS-CoV-2 because it plays an important role in mediating infection of targeted cells and is the focus of vaccine and antibody therapy development efforts 21 . They detected 14 mutations in the spike protein that are growing, especially the mutation D614G that rapidly becomes the dominant form when spread to a new geographical region. Likewise, Hashimi 22 analysed the mutation frequency in the spike protein S of 796 SARS-CoV-2 genomes downloaded from the GISAID and GenBank databases. The study found 64 mutations occurring in the S protein sequences obtained from multiple countries. It suggests that the virus is spreading in two forms, the D614 form (residue D at position 614 in the S protein) takes 68.5% while the G614 form takes 31.5% proportion of the examined isolates. Koyama et al. 23 on the other hand found several variants of SARS-CoV-2 that may cause drifts and escape from immune recognition by using the prediction results of B-cell and T-cell epitopes in 24 . Typically, the mutation D614G occurring in the spike www.nature.com/scientificreports/ protein is found prevalent in the European population. This mutation may have caused antigenic drift, resulting in vaccine mismatches that lead to a high mortality rate of this population. A recent situation report 25 by Nextstrain 26 on genomic epidemiology of novel coronavirus using 5193 publicly shared COVID-19 genomes shows that SARS-CoV-2 on average accumulates changes at a rate of 24 substitutions per year. This is approximately equivalent to 1 mutation per 1000 bases in a year. This evolutionary rate of SARS-CoV-2 is typical for a coronavirus, and it is smaller than that of influenza (average 2 mutations per 1000 bases per year) and HIV (average 4 mutations per 1000 bases per year). Shen et al. 12 conducted metatranscriptome sequencing for bronchoalveolar lavage fluid samples obtained from 8 patients with COVID-19 and found no evidence for the transmission of intrahost variants as well as a high evolution rate of the virus with the number of intrahost variants ranged from 0 to 51 around a median number of 4. Pachetti et al. 27 examined 220 genomic sequences of COVID-19 patients from the GISAID database and discovered 8 novel recurrent mutations at nucleotide locations 1397, 2891, 14408, 17746, 17857, 18060, 23403 and 28881. Mutations at locations 2891, 3036, 14408, 23403 and 28881 are mostly found in Europe while those at locations 17746, 17857 and 18060 occur in sequences obtained from patients in North America. Likewise, a study in 28 on 95 SARS-CoV-2 complete genome sequences discovered 116 mutations. Among them, the mutations at position C8782T in the ORF1ab gene, T28144C in the ORF8 gene and C29095T in the N gene are common.

Materials and methods
SARS-CoV-2 mutation detection. We use 6324 sequence records downloaded from the NCBI GenBank database on 2020-06-17. The latest collection date for the samples from which the sequences were derived was on 2020-06-05. The data, which were collected in 45 countries, include both nucleotide sequences and protein translations of coding genes. A proportion of the 6324 records have sequences of only few proteins, i.e. these records do not annotate all 11 proteins (ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF10, S, E, M and N). The number of available sequences is thus different from one protein to another (see column "Available" in Table 1). Genome sequences that do not specify country or AA sequences that contain letter "X" representing an unknown AA are excluded in our calculations. We use the genome obtained from the isolate Wuhan-Hu-1, accession number NC_045512 as the reference genome. For the mutation detection purpose, we apply a dynamic programming algorithm to protein AA sequences to get global pairwise alignments between a reference sequence and a query sequence. Specifically, we use the Python Bio.pairwise2.align.globalms function (https ://biopy thon. org/docs/dev/api/Bio.pairw ise2.html) where a match is given 2 points, a mismatch is deducted 0.5 points, 2 points are deducted when opening a gap, and 1 point is deducted when extending it. Gaps are then inserted into nucleotide sequences corresponding to the resulted protein sequence alignments. Using the resulted pairwise alignments, we are able to compare query sequences and the reference sequences at each position and identify locations of insertion, deletion, synonymous and nonsynonymous mutations.
Secondary structure and solvent accessibility prediction. Virus protein structure plays a key role in its functions and a change in structure shape may affect its functions, virulence, infectivity and transmissibility, possibly resulting in non-functional proteins. Protein secondary structure is defined by hydrogen bonding patterns, which make an intermediate form before the protein folds into a three-dimensional shape composing its tertiary structure. Eight types of protein secondary structure defined by the Dictionary of Protein Secondary Structure (DSSP) include 3 10 helix (G), α helix (H), π helix (I), hydrogen bonded turn (T), extended strand in parallel and/or anti-parallel β-sheet conformation (E), residue in isolated β-bridge (B), bend (S) and coil (C). The DSSP tool assigns every residue to one of the eight possible states. In a reduced form, these 8 conformational states can be diminished to 3 states: H = {H, G, I}, E = {E, B} and C = {S, T, C} 29 . The protein secondary structure represents interactions between neighboring or near-by AAs as its functional three-dimensional shape is created through the polypeptide folding. We thus determine a change in protein secondary structure if any change happens in the structures of the mutated AA and its 10 neighboring AAs compared to those of the reference Table 1. Summary of SARS-CoV-2 mutations on each protein and secondary structure and relative solvent accessibility changes. www.nature.com/scientificreports/ sequence. In detail, we consider 5 AAs ahead and 5 AAs behind the mutated AA. The same approach is applied when considering a change of the protein relative solvent accessibility. Solvent-exposed area represents the area of a biomolecule on a surface that is accessible to a solvent. Accordingly, a residue is considered as exposed if at least 25% of that residue must be exposed, denoted as the "e" state. Alternatively, the residue is determined as buried, i.e. the "b" state. There have been various protein secondary structure prediction programs in the literature and many of those were developed based on artificial intelligence models using protein AA sequences such as JPred4 30 , Spider2 31 , Porter 5 32 , RaptorX 33 , PSSpred 34 , YASSPP 35 and SSpro 14 . In this paper, we use the protein secondary structure and relative solvent accessibility prediction methods SSpro/ACCpro 5 14 within the SCRATCH-1D software suite (release 1.2, 2018) 15 . These predictors were built using the bidirectional recursive neural networks and a combination of the sequence similarity and sequence-based structural similarity to sequences in the Protein Data Bank 36 . Prediction results of 8-class structure (SSpro8 predictor) and 25%-threshold relative solvent accessibility (ACCpro predictor) are used for statistics on protein secondary structure and accessibility changes. We however also report in the spreadsheet supplemental information prediction results of 3-class structure (SSpro predictor) and relative solvent accessibility on 20 thresholds, ranging from 0 to 95% with a 5% step (the ACCpro20 predictor within the SCRATCH-1D software). Table 1 summarizes statistics of SARS-CoV-2 mutations so far. "AA Length" indicates the length of the protein AA sequence derived from the SARS-CoV-2 reference genome. "Available" denotes the number of records among 6324 NCBI GenBank records that have the complete sequence of the corresponding protein. "No mutation" refers to the number of sequences that do not have any mutations compared to the reference sequence. "No mutation rate" is the ratio between "No mutation" and "Available". "Delete" means the number of deletion mutations occurring in the AA sequences of the protein. This number may be larger than the number of sequences having deletion mutations because an AA sequence may have more than one deletion. Likewise, "Insert", "Nonsyn" and "Syn" show the number of insertion, nonsynonymous and synonymous mutations occurring in the protein AA sequences. "Nonsyn/Syn" demonstrates a ratio between the number of nonsynonymous mutations versus the number of synonymous mutations. "Structure change" means the number of nonsynonymous mutations that have protein secondary structure change potential based on the SSpro8 predictor of the SCRATCH-1D software. Similarly, "Accessibility change" refers to the number of nonsynonymous mutations that have potential to change the protein relative solvent accessibility based on the ACCpro predictor of the SCRATCH-1D software. Insertion and deletion mutations alter protein secondary structure and solvent accessibility by default so that they are not included in the structure and solvent accessibility change statistics. Table 1 shows that the ORF3a and ORF8 proteins have the number of nonsynonymous mutations significantly larger than that of the synonymous mutations. In contrast, this ratio in proteins E, M, ORF7b and ORF10 are very small (less than 1). These proteins could be targeted for vaccine and drug development as they have less variations than other proteins. These findings are supported by results presented in Figs. 2, 3, 4 where we plot the number of insertion, deletion and nonsynonymous mutations against different locations in the proteins. A spike in these figures demonstrates a large number of insertion, deletion and nonsynonymous mutations. Regions between spikes are stable and can be useful for further research for vaccine and drug development. For example, in protein ORF1ab (Fig. 2 (Fig. 3), entire regions before and after the spike at position 614 are almost unchanged. Figure 4 presents variations of multiple proteins. In addition to proteins E, M, ORF7b and ORF10, we find from Fig. 4 that proteins ORF6 and ORF7a are also relatively stable without a large number of variations at any particular locations. This is justified by data in the column "No mutation rate" in Table 1, which shows the ratio between "No mutation" and "Available", i.e. the ratio between the number of sequences 99. This is because the size of these proteins is relatively small (see data in the column "AA Length" in Table 1). These proteins are considered as most stable when compared to other proteins, which have the ratios of less than 0.8. These results are consistent with the data shown in Figs. 2, 3, 4. Protein N has 1927 nonsynonymous mutations but 1678 of them are likely to make changes in protein secondary structure, making a ratio of 87.08%. This is considerably larger than those of protein S (4.42%), protein M (24.79%) and protein E (64.29%). The number of solvent accessibility changes of protein S is larger than its structure changes: 184 vs 164. This however is opposite in other structural proteins: E (7 vs 18), M (8 vs 30) and N (37 vs 1,678).

Details of mutations in nonstructural ORF genes
Gene ORF1ab. The ORF1ab polyprotein has 7096 AAs. Among 6324 records deposited to the NCBI Gen-Bank database, only 3726 genomes have the complete coding sequence (CDS) of protein ORF1ab, with 1024 unique AA sequences. This is quite a large number compared to other proteins but understandable because ORF1ab is the longest protein of SARS-CoV-2 and thus has a large number of variations. Only 119 sequences have no mutation or synonymous mutations while the rest 3607 sequences have insertion, deletion or nonsyn-   Table 2. Figure 5 shows an alignment of sequences having a large number of deletions in the ORF1ab protein. Alternatively, 8330 nonsynonymous mutations are found, in which 1067 mutations are distinct. Table 3 Table 4. Notably, the mutation Q57H occurs in 2795 sequences collected in many countries. This is an emerging and active mutation, which requires further investigation as the latest case of this mutation was on 2020-06-05, same as the latest collection date of the entire downloaded dataset. The mutation G251V occurring in 206 sequences is also a prevalent mutation in the ORF3a protein.  Fig. 6. The isolate MT547814 thus may have transmitted the virus to MT609561 but this implication needs to be corroborated by patients' travel history. There are 23 distinct nonsynonymous mutations and those occurring in 2 or more sequences are presented in Table 5.   Table 9.      www.nature.com/scientificreports/  Figure 6. Insertion and deletion mutations in protein ORF6. The GenBank accession numbers, collected dates and isolate names are presented on the left. One synonymous mutation D61K and two insertions − 62R and − 63T at the end of isolate USA/MA_MGH_00184/2020 (MT520188) is interesting while there is a high chance that HKG/VM20001061/2020 has spread to USA/VA-DCLS-0294/2020.  www.nature.com/scientificreports/ USA: CA on 2020-03-27 and 2020-04-28, respectively) similarly have four deletions L141-, G142-, V143-and Y144-. The virus transmission may have happened between these two isolates but this needs further investigation. Alignment of these sequences is shown in Fig. 9. The number of nonsynonymous mutations in gene S is 3711, with 240 distinct mutations. Mutations that occur in 10 or more cases are reported in Table 10. The number of synonymous mutations is 670, making a     in Thailand on 2020-01-04. It is however important to note that the first patient having the D614G mutation and his/her location may never be known because genome of that patient might not be sequenced and reported. Therefore, information reported here can support for further investigation.
On the other hand, there are 37 A829T mutations that all occur in Thailand. The first case of this mutation was collected on 2020-01-23 and its latest case was on 2020-04-07. This may indicate that the first case had probably transmitted to other cases having the same mutation A829T in Thailand. Alternatively, mutations H146Y (24 cases), V483A (11 cases), E554D (14 cases), P681L (16 cases) and S939F (11 cases) all occur only in USA   www.nature.com/scientificreports/ or mutation L8V (4 cases) occurs only in Hong Kong (refer to the spreadsheet data, which can be found in the Data Availability section). We identify the RBD region within the residue range Arg319-Phe541 of protein S based on a study in 38 . In the RBD region only, the number of nonsynonymous mutations is 53 and that of synonymous is 46, making a ratio of 1.15. This is much smaller than the ratio of 5.54 for the entire gene S, suggesting that the RBD region may have been optimized for binding to a receptor of a host cell. This is complemented by Fig. 9 showing all deletion mutations in gene S being outside the RBD region. Note that the difference of these ratios is partly due to the large number of D614G mutations (3089), which is outside the RBD region. Table 11 summarizes nonsynonymous mutations in the RBD region occurring in 2 or more sequences. Notable mutation in this region is V483A occurring in 11 isolates all collected in USA. The first and latest collected dates of these isolates were respectively 2020-03-05 and 2020-04-05, suggesting that the first isolate may have spread to others having the same mutation V483A. Likewise, the mutation G476S occurs in 6 isolates all collected in USA: WA from 2020-03-10 to 2020-03-25. Alternatively, the mutation Y453F occurs in 5 sequences all in Netherlands but the first collected date was on 2020-04-25 and the latest collected date was on 2020-04-29. These dates are too close, indicating that all the reported Y453F cases may have been infected from another case, whose genome had not been sequenced and reported to the NCBI GenBank. It is important to note that all the transmission implications need further investigation with more data from other aspects such as travel history, physical contacts and so on.
In For the entire protein S, 134 nonsynonymous mutations (48 unique) have both structure and solvent accessibility change potentials. These mutations occurring in 2 or more sequences are reported in Table 12. Mutation H146Y occurs in 24 cases and mutation P681L occurs in 16 cases, which are all collected in USA. The most common mutation D614G does not have the potential to change either protein secondary structure or relative solvent accessibility. Gene E. The envelope protein E has 75 AAs, found in 5852 GenBank records with 15 unique AA sequences.
Among them, 5824 sequences have no mutation or only synonymous mutations while 28 sequences have nonsynonymous mutations. Gene E is thus relatively stable and could be targeted for vaccine and drug development. This is supported by the fact that no insertion or deletion mutations are found within gene E. There are 14 distinct nonsynonymous mutations in gene E and those occur in 2 or more sequences are presented in Table 13.
Five distinct nonsynonymous mutations in gene E have protein structure change potential: S68C, S68F, P71L, D72Y and L73F. Alternatively, 4 distinct mutations have potential to change relative solvent accessibility: L37H, L37R, D72Y and L73F. Therefore, D72Y and L73F are two mutations in gene E that have a potential to change both protein structure and solvent accessibility.
Gene M. The M protein has 222 AAs and its complete CDS appears in 5677 GenBank records, with 37 unique AA sequences. There are 5557 sequences having no mutation or only synonymous mutations while other 120 sequences have nonsynonymous mutations. No insertion or deletion mutations are found in gene M. The number of distinct nonsynonymous mutations in gene M is 37, with those occurring in 5 or more sequences shown in Table 14. Among these, 10 mutations are likely to make changes in protein secondary structure: C64F, A69S, A69V, V70F, N113B, R158L, V170I, D190N, D209Y and S214I. Alternatively, 6 mutations have the solvent accessibility change potential: N113B, P123L, P132S, H155Y, D190N and T208I. N113B and D190N are thus two mutations having potential to change both protein structure and solvent accessibility in gene M. Table 11. Gene S-RBG region only-nonsynonymous mutations that occur in 2 or more sequences.   Table 15. Notable mutations Table 12. Gene S-nonsynonymous mutations that have both structure and solvent accessibility change potentials occuring in 2 or more sequences. The "Query Structure" (and "Query Accessibility") shows the unique structure (and accessibility) changes based on on prediction results. Structure letter in parentheses is the predicted structure of the residue at the corresponding mutation position. Five letters before and after parentheses are structures of neighbouring residues. Likewise, letter "b" or "e" in parentheses shows the accessibility status of the residue at the mutation position.

Discussions
The proposed analysis approach has been able to detect all point mutations so far of SARS-CoV-2 and report them in a spreadsheet dataset, which can be found in the Data Availability section. The generated data can facilitate investigations about the virus in many perspectives. For example, using the mutations found, we can observe the possible virus transmissions between patients. This is showcased through Figs. 5, 6 and 8 where similar mutations are detected from different isolates. In Fig. 5 for instance, a group of isolates have similar deletion mutations at positions M85-, V86-or K141-, S142-, F143-in protein ORF1ab. Genome sequence of isolate USA-CA6/2020 collected in USA on 2020-01-27 is found to be the first having these consecutive mutations. There could be a connection between this isolate with other isolates obtained in USA, Greece and Australia. Information presented in Tables 2, 3 , 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15 is extracted from the generated mutation data. In these summary tables, "First Date" and "First Country" information is useful for identifying when and where the mutations were possibly originated while the "Distribution" information shows how such mutations have spread to different countries. The generated mutation data also allow us to observe the evolution of the virus. We can point out which mutations are still active or no longer active based on the "Latest Date" information presented in Tables 2, 3, 4, 5 , 6, 7, 8, 9, 10, 11, 13, 14, 15. For example, in gene S (Table 10), the latest date of D614G was on 2020-06-05 (same as the latest collection date of the entire genome dataset used in this study), indicating that this mutation is still active. The latest date of P681L was on 2020-04-03, showing that this mutation may no longer occur. Mutations are still active mean they have evolutionarily adapted to varying environments while inactive mutations may have not. This kind of information is useful for further research on vaccine and drug development as ongoing changes of the viral proteins need to be focused and addressed rather than inactive mutations. Our analysis shows that the G form at location 614 in protein S becomes increasingly popular compared to the D form. Among 4434 sequences of the S protein, 3089 sequences have the mutation D614G, taking 69.67%. This number has considerably increased compared to 31.5% in the previous analysis in 22 on a dataset downloaded on 2020-03-22.
Through the mutation analysis, we are able to detect regions of the viral genomes that are stable and can be targeted for vaccine and drug development, such as those coding for proteins E, M, ORF6, ORF7a, ORF7b and ORF10. In addition, this study presents results obtained by the use of deep learning recurrent neural networks for protein secondary structure and solvent accessibility predictions. These results are useful for further research on SARS-CoV-2 protein structure changes. In particular, among 3089 D614G mutations, our prediction results show that none of these mutations is likely to make changes in the protein secondary structure and relative solvent accessibility. This mutation has attracted much attention of researchers as it may affect the virulence and infectivity of the virus and our finding has contributed to understanding this mutation.

Conclusions
Analysing the virus genome sequences and their proteins is crucial for understanding the virus and proposing appropriate approaches to respond to and control the pandemic. This paper has reported all point mutations of SARS-CoV-2 since the virus's first genomes were obtained in December 2019. A SARS-CoV-2 mutation spreadsheet dataset is built using a large number of genome sequences (6324) obtained across 45 countries. This dataset Table 15. Gene N-nonsynonymous mutations occurring in 10 or more sequences. www.nature.com/scientificreports/ can enable scientists to monitor the evolution and spread of the virus although the use of these data needs to be corroborated with patients' clinical data and travel history for substantiated confirmations. We also predict the secondary structure and relative solvent accessibility of the virus proteins to evaluate whether the detected mutations have a potential to change the virus characteristics. The mutation D614G in protein S is unlikely to change either protein secondary structure or relative solvent accessibility based on the prediction results. These protein secondary structure and solvent accessibility change potentials are predicted results based on deep learning recurrent neural networks, which need to be experimentally verified. They however provide important insights about the virus and prompt further experimental biochemistry and molecular biology research into the genomic regions of these mutations. A future work focusing on impacts of the mutations on protein functions would be worth investigating. The function impacts can be determined via wet lab experiments or using low-cost computational tool such as PROVEAN 39 .