Characterization of the SARS-CoV-2 genomes in Egypt in first and second waves of infection

The SARS-CoV-2 outbreak was detected in Wuhan in December 2019 and has since become a worldwide pandemic. This study aims to investigate mutations in the SARS-CoV-2 genome sequence and to characterize the mutation patterns in Egyptian COVID-19 patients during different waves of infection. Samples were collected from 250 COVID-19 patients, and whole genome sequencing was conducted using next generation sequencing. Viral sequence analysis showed 1115 different genome mutations across all Egyptian samples in the second wave, including 613 missense mutations, 431 synonymous mutations, 25 upstream gene mutations, 24 downstream gene mutations, 10 frameshift deletions, and 6 stop-gained mutations. The Egyptian strains sequenced in the second wave of infection differ from those of the first wave. We observe a shift of lineage prevalence from B.1 to B.1.1.1. Only one case belonged to the new English lineage B.1.1.7. Few samples have one or two mutations of interest from the Brazil and South Africa isolates. The new clade 20B appeared by March 2020, and clade 20D appeared by May 2020 and persisted until January 2021.


Mutations in SARS-CoV-2 genomes in the second wave of infection in Egypt.
Mutation analysis shows a total of 1115 unique mutations (synonymous vs non-synonymous ratio = 1.6:1) in all Egyptian SARS-CoV-2 samples compared to the reference Wuhan-Hu-1 sequence (Accession NC_045512). More than half of the mutations were in the ORF1ab polyprotein (60.5%). The fewest mutations were in the ORF6 and ORF8 protein sequences (0.7%) (Table 1). Of the 1115 mutations, there are 613 missense mutations, 431 synonymous mutations, 25 upstream gene mutations, 24 downstream gene mutations, 10 frameshift mutations, 6 stop-gained mutations, 2 conservative in-frame deletions, 2 disruptive in-frame deletions, 1 splice region & synonymous mutation and 1 start-lost mutation (Table 1).
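The category counts listed above can be tallied to confirm that they add up to the reported total of 1115. The following is a minimal sketch that only restates the numbers from the text; the dictionary keys and the helper name `share` are illustrative and not part of the study's pipeline.

```python
# Mutation counts by annotation category, as listed in the text above.
counts = {
    "missense": 613,
    "synonymous": 431,
    "upstream gene": 25,
    "downstream gene": 24,
    "frameshift": 10,
    "stop gained": 6,
    "conservative in-frame deletion": 2,
    "disruptive in-frame deletion": 2,
    "splice region & synonymous": 1,
    "start lost": 1,
}

# Sum over all categories; this should match the reported 1115 mutations.
total = sum(counts.values())


def share(category):
    """Percentage of all mutations in one category, rounded to 0.1."""
    return round(100 * counts[category] / total, 1)


print(total)
print(share("missense"), share("synonymous"))
```

Under these counts, missense mutations make up roughly 55% of all mutations and synonymous mutations roughly 39%.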
When the frequency of the mutations in the Egyptian samples was compared with that in the world samples, no mutation was specific to the Egyptian samples in either the first or the second wave of infection. Tables 2 and 3 include the most frequent mutations in the Egyptian samples.
Geographical distribution of the SARS-CoV-2 mutations characterizing the variants of interest in Egyptian samples (first and second wave of infection). We collected the mutations of related The D614G and other top frequent mutations. The most frequent mutation in the Egyptian samples in the second wave was found in 176 of the 183 viral genome samples. It changes the amino acid from aspartic acid (D) to glycine (G). The D614G amino acid change was found in the spike region of the Egyptian GR strain in both the first and the second waves (Tables 2, 3). This amino acid change was accompanied by the silent mutation C241T in a non-coding region, by C3037T in ORF1a, and by the missense mutation C14408T (P214L) in ORF1b. www.nature.com/scientificreports/ The most frequent mutations in the second wave of SARS-CoV-2 infection were also observed in the first wave of infection. Of the top 12 mutations observed in the second wave of infection, only one was not present in the first wave. These mutations included two mutations in the S region, two mutations in the N region and four mutations in ORF1. Tables 2 and 3 include the most frequent mutations in the Egyptian samples. In both waves, no mutation was specific to the Egyptian samples.
The missense mutations G28881A, G28882A, and G28883C, which result in the amino acid changes R202K and G203R, and G28908T, which results in the amino acid change G212V in N, were observed in the second wave. As shown in Table 2, the spike region contained three nucleotide mutations. In addition to the D614G mutation, the C23731T and G23593T mutations in the spike region correspond to T723T and Q677H, respectively. ORF1ab is translated into a polyprotein that is subsequently cleaved into 16 non-structural proteins (NSPs). The missense mutation C14408T and the synonymous mutation C13536T result in P4715L and Y4424Y, respectively, in the RNA-dependent RNA polymerase region. One synonymous mutation, C3037T (F924F), lies in the NSP3 region.
Lineage and phylogenetic analysis. One hundred eighty-three whole genome sequences from the second wave of infection and 282 from the first wave of infection, with > 99% of reads mapped to the reference genome, were generated, with an average coverage depth of 992×. All Egyptian whole genome sequences available in GISAID were added to the analysis, making a total of 465 Egyptian sequences.
For the evaluation of lineages, the Pangolin (Phylogenetic Assignment of Named Global Outbreak LiNeages) COVID-19 lineage assigner was used, and nearly 22 different lineages were found to be circulating in Egypt. Twenty-two lineage groups were identified in the 183 Egyptian sequences from the second wave of infection, and 17 lineage groups were identified in the 282 Egyptian sequences from the first wave. To better determine the most likely clade in Egypt during the period between January 2020 and January 2021, we performed a phylo-geographical analysis using all available SARS-CoV-2 sequences and related global sequences from GISAID (Global Initiative on Sharing All Influenza Data, https://www.gisaid.org). These results indicate that the most likely clades in January 2020 were 19A and 20A. The new clade 20B appeared by March 2020, and clade 20D appeared by May 2020 and persisted until January 2021 (Fig. 4). Both clades 19A and 20A had declined by January 2021.

Discussion
The SARS-CoV-2 outbreak was identified in Wuhan in December 2019, and the worldwide spread of SARS-CoV-2 is now a 21st-century pandemic 18. Globally, 111,279,860 confirmed cases of COVID-19, including 2,466,639 deaths, had been reported to WHO as of 23 February 2021. At that time, Egypt ranked second in Africa, after South Africa, with 178,774 confirmed cases and 10,404 deaths. This study reveals the molecular features and mutation patterns of SARS-CoV-2 strains circulating in Egyptian COVID-19 patients from January 2020 to the end of January 2021.
CoVs are RNA viruses in which mutations can enable rapid host switching. The Wuhan SARS-CoV-2 strain shares over 80% identity with SARS-CoV and over 50% with the MERS-CoV strain that was found in bats 19. SARS-CoV-2 seems to have resulted from several mutations, which supports the idea that virus evolution is a continuous process that forms new strains 20. The viral genome encodes two polyproteins that give rise to 16 Nsps. SARS-CoV-2 structural proteins are translated from subgenomic RNAs. The Nsps regulate virus replication, while the structural proteins are involved in receptor binding and virion assembly 21. The receptor-binding domain (RBD) of the S protein selects specific mutations that improve its binding to the ACE2 receptor and improve virus entry into the host cell 22.
In this study, high-frequency mutations were reported in SARS-CoV-2 genome sequences from Egyptian COVID-19 patients. ORF1ab carried the largest group of mutations, followed by the S gene, N gene and ORF3a. M, E, ORF7b, ORF7b and ORF10 have the lowest mutation rate. Of these mutations, 613 were missense mutations, 431 synonymous mutations, 25 upstream gene mutations, 24 downstream gene mutations, 10 frameshift mutations, 6 stop-gained mutations, 2 conservative in-frame deletions, 2 disruptive in-frame deletions, 1 splice region & synonymous mutation and 1 start-lost mutation. A similar study on 4254 SARS-CoV-2 sequences has shown that mutations are most commonly found within ORF1a, ORF1b, and the S and N genes, as opposed to the ORF7b and E genes, which showed a low mutation frequency 23,24. The genome's mutational frequency may be related to the increase in the infection rate in the Egyptian population and the appearance of the second wave of infection.
In the current study, 176 of 183 viral genome samples carried D614G, the most frequent mutation in the Egyptian samples, in which aspartic acid (D) changes to glycine (G). The D614G amino acid change was found in the spike region of the Egyptian GR strain in both the first and second waves. This amino acid change was accompanied by the silent mutation C241T in a non-coding region, by C3037T in ORF1a, and by the missense mutation C14408T (P214L) in ORF1b. ORF1ab is translated into a polyprotein that is then cleaved into 16 non-structural proteins (NSPs). The missense mutation C14408T and the synonymous mutation C13536T result in P4715L and Y4424Y, respectively, in the RNA-dependent RNA polymerase region. One synonymous mutation, C3037T (p.Phe924Phe), lies in the NSP3 region. The most frequent mutations of SARS-CoV-2 were observed in both waves of infection. The top 12 mutations in the second wave include two mutations in the N region, four mutations in ORF1ab, and two mutations in the S region. Only one mutation was not present in the first wave of infection (RG203KR). In a further study carried out by Islam et al. 2020, 1,247 nt mutations were observed in the ORF regions, and 503 of them were missense mutations 25. NSP3, NSP4, NSP2, NSP12, and NSP5 have 120, 33, 57, 44, and 11 AA substitutions in the ORF1ab polyprotein, respectively. In the spike protein, 11 AA substitutions were discovered in the RBD, at residues 331 to 524 of the S1 subunit (in Wales, the United Kingdom, Shenzhen, Hong Kong/France, Shanghai, Guangdong, Finland, and France), three of which occurred in positions 424 and 494, which comprise the receptor-binding motif (RBM). A single mutation in the S protein of SARS-CoV-2, which was lacking in other SARS-CoV-2 strains of different geographic regions, was identified 26-29.
Changes in ORF8 appear to be strongly linked to adaptation to the new species, as substantial changes were found in ORF8 during the transition from civet to human host 30. Among all viral proteins, the SARS-CoV-2 ORF8 protein shares the lowest homology with SARS-CoV; it interacts with major histocompatibility complex class I (MHC-I) molecules and down-regulates the surface expression of MHC-I on various cells 31,32.
In the present study, comparison of the genome mutations in the first and second waves of infection with the global mutations shows an annual average of 4 genome mutations during Egypt's first wave and 26 during its second wave, compared to 22,88 annual mutations globally. In the second infection wave, there is so far no mutation specific to the Egyptian samples. The presence of mutations similar to those found in other parts of the world suggests that they facilitate the adaptation of the virus to the human host. These mutations are found in NSP3, NSP6, RdRp, helicase, ORF3a, ORF8, as well as the S and N proteins. Interestingly, these are the same proteins that showed the highest mutation rate in our study. These proteins are essential for adsorption, replication and polyprotein processing during coronavirus replication. A total of sixteen mutations, located in different domains, were identified in the S protein 33.
Both ORF3 and ORF8 encoded proteins are type I interferon inhibitors that promote virus replication by interfering with antiviral defense 34. In a similar study, changes in the genes coding for the N protein, ORF3a and ORF8 contributed to the epidemic's virulence, transmission and pathogenicity 47. In this study, the genes coding for NSP7, NSP9, NSP10, NSP11, and the ORF7b accessory protein were not found to be mutated during the second wave of infection. A similar study analysed the mutation accumulation rate of the SARS-CoV-2 genome over an 11-week period and found that the majority of the viral genes accumulated mutations, including those for the NSP2, NSP3, RdRp, helicase, Spike, ORF3a, ORF8 and N proteins, although at varying rates. Sixteen mutations accumulated in the Spike protein, four of which are located in the receptor-binding domain. Interestingly, a number of viral proteins did not accumulate any mutation (NSP7, NSP9, NSP10, Envelope, ORF6 and ORF7b proteins) 35. Similar to our findings, no mutations were found in NSP9, while only two amino acid substitutions were identified in NSP10 36.
Several non-canonical nucleic acid structures, such as G-quadruplexes, have been shown to be essential for genome regulatory activities 37. Although only a few G-quadruplex sequences have been identified in the SARS-CoV-2 genome, inverted repeats (IRs) are abundant in the genome 38. Two conserved SARS-CoV-2 regions form stem-loops that protect viral RNA against rapid degradation, thereby increasing the stability of the viral RNA genome and the efficiency of viral replication and virulence 39. In the current study, to investigate the geographical distribution of SARS-CoV-2 hotspot mutations in Egyptian samples, the presence of IRs in the entire SARS-CoV-2 genome was analyzed and overlaid with 29 high-frequency nucleotide positions identified as hot spots based on their GISAID frequency. In the SARS-CoV-2 genome, potential G-quadruplex-forming sequences that regulate viral RNA synthesis occur very rarely 40,41. A report showed that SARS-CoV-2 genomes exhibit CpG depletion, and therefore hot-spot mutations in the SARS-CoV-2 genome were important 6.
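To make the notion of an inverted repeat concrete, the sketch below scans an RNA string for a short arm that is followed, after a spacer, by its reverse complement, which is the arrangement that can fold into a stem-loop. It is an illustration only and not the tool used in this study; the arm length, the maximum spacer, and the toy sequence are assumptions chosen for the example.

```python
# Watson-Crick complement table for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(rna, arm=6, max_spacer=10):
    """List (start, spacer) pairs where an arm of length `arm` starting at
    0-based index `start` is followed, after `spacer` bases, by its
    reverse complement."""
    hits = []
    for start in range(len(rna) - 2 * arm + 1):
        target = reverse_complement(rna[start:start + arm])
        for spacer in range(max_spacer + 1):
            second = start + arm + spacer
            if second + arm > len(rna):
                break  # the second arm would run past the sequence end
            if rna[second:second + arm] == target:
                hits.append((start, spacer))
    return hits


# Toy sequence (hypothetical): arm GGACUC, 4-base loop, arm GAGUCC.
toy = "AAGGACUCUUUUGAGUCCAA"
print(find_inverted_repeats(toy))
```

In the toy sequence the two arms can pair to form a stem with a four-base loop, so the scan reports a single hit at index 2 with a spacer of 4.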
SARS-CoV-2 hot-spot mutations are significantly abundant in IR sequences and CpG islands, suggesting a possible survival strategy of the SARS-CoV-2 genome and/or an evolutionary benefit to the virus in adapting to the human host, modulating the cellular immune response, or even increasing virulence and pathogenicity. IRs are generally very important for ssRNA genome organization 41-43. In the present study, 29 mutations of interest were identified in the Egyptian sequences. Of these, 18 mutations related to the variants (lineages) of interest were found in the S protein, coming from the UK B.1.1.7 lineage. Four mutations were found in the ORF1ab polyprotein, distributed over two regions: one in the region coding for NSP6 (S367S) and three in the region coding for NSP3 (T1001I, A1798D and S1188L), coming from England B. According to WHO, measures to combat epidemics and pandemics caused by highly pathogenic viruses may necessitate timely efforts from all or at least the majority of countries around the world. Egypt, for example, has taken unprecedented anti-epidemic measures to halt the spread of SARS-CoV-2 infection.

Material and methods
Ethics statement. The study was approved by the Ethics Committee of the Ministry of Health and Populations, Training and Research Sector, with number OHRP: FWA00016183 23 March 2020, IORG0005704/ IRB0000687 31 May 2020. The study was conducted in accordance with the principles of the 1975 Declaration of Helsinki, as revised in 2008. The study was approved by the National Institute of Cancer Ethics Committee. Before enrolling, all patients provided informed consent. After standard SARS-CoV-2 diagnostic tests were performed, next generation sequencing of SARS-CoV-2 was performed on positive samples.
Research protocol. Confirmatory laboratory tests were conducted in conformity with WHO recommendations. All 250 samples were collected during the period of November to December 2020. Patients had a high copy number of SARS-CoV-2 (between 1.2 × 10^4 and 2 × 10^6 copies/µl) by real-time PCR. Sequencing QC thresholds were met in only 183 samples (172 from the National Cancer Institute and 11 from the Egyptian Army). No information was available regarding the source of infection of the isolates. The QIAamp Viral RNA Mini Kit (Qiagen, Hilden, Germany), with internal PCR controls as instructed by the manufacturer, was used with 250 to 300 µL of each nasopharyngeal swab sample for viral RNA extraction. The extracted RNA was used directly for detection of SARS-CoV-2 using the Genesig Real-Time PCR Detection Kit.
Next generation sequencing of SARS-CoV-2. The extracted RNA was quantified with a high-sensitivity Qubit RNA kit (Invitrogen, USA). Whole genome sequencing was performed as previously described 44. In brief, the genomic RNA was reverse-transcribed using the VILO-cDNA Synthesis Kit (Cat. No. 11754050; Invitrogen, USA). The Ion AmpliSeq Library Kit Plus (Thermo Fisher Scientific) was used to prepare the libraries. The libraries were clonally amplified by emulsion PCR with the Ion-PI-Hi-Q Sequencing 200 Kit (Thermo Fisher Scientific). The Ion PI Hi-Q Sequencing 200 Kit - Chef Kit (Thermo Fisher Scientific) was used on the Ion Proton Sequencer for whole genome sequencing.
Data analysis. We used the bioinformatics analysis pipeline previously described 44 for viral assembly and mutation calling. Briefly, the pipeline uses the Torrent Suite package (v.5.12) for alignment of the reads to the reference sequence (RefSeq; NC_045512.2) and for mutation calling. The IRMA (v0.9.3) workflow was used for de novo assembly. The de novo assembly was compared against the reference-based assembly (based on alignment of the reads to the reference genome) to ensure consistency of the results. In fact, for this targeted amplicon-based panel, we see, as in our first paper 44, that the reference-based assembly is enough to reconstruct the viral sequence.
As the acceptance threshold, samples with > 99% coverage and with gap lengths of less than 30 bp were retained for further analysis. The final successful set included 183 complete genome sequences, which were uploaded to the NCBI/GISAID repositories (Supplementary File S1).
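The acceptance rule can be written as a small predicate. This is a minimal sketch, assuming that "gap length" refers to the longest uncovered stretch in an assembly; the class name, field names, and example values are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass

MIN_COVERAGE_PCT = 99.0  # samples need > 99% coverage
MAX_GAP_BP = 30          # gaps must be shorter than 30 bp


@dataclass
class AssemblyQC:
    sample_id: str        # hypothetical sample label
    coverage_pct: float   # percent of the reference covered
    longest_gap_bp: int   # assumed: length of the longest gap


def passes_qc(qc):
    """Apply the coverage and gap thresholds stated in the text."""
    return qc.coverage_pct > MIN_COVERAGE_PCT and qc.longest_gap_bp < MAX_GAP_BP


# Made-up example values, for illustration only.
samples = [
    AssemblyQC("S1", 99.8, 12),   # passes both thresholds
    AssemblyQC("S2", 98.5, 5),    # fails on coverage
    AssemblyQC("S3", 99.9, 45),   # fails on gap length
]
retained = [s.sample_id for s in samples if passes_qc(s)]
print(retained)
```

With these made-up values only the first sample is retained.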
Lineage and phylogeny. Based on a literature review, we collected the mutations of, and double-checked for, the emerging strains from the UK, Brazil and South Africa. To assign a lineage to each sequence, the Pangolin system was used. For phylogenetic analysis, we used MAFFT (v7.450) to compute the multiple sequence alignment 45. The IQ-TREE package was then used to compute the phylogeny, selecting the best nucleotide substitution model and using bootstrapping to ensure high confidence in the tree topology.
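For readers who wish to reproduce a comparable workflow, the sketch below assembles typical command lines for the three tools. The exact options used in this study are not stated, so the file names, the MAFFT `--auto` strategy, and the choice of ultrafast bootstrap with 1000 replicates (`-bb`, IQ-TREE 1.6 syntax; IQ-TREE 2 uses `-B`) are assumptions.

```python
import shlex


def pangolin_command(fasta):
    # Basic Pangolin invocation: assign a lineage to each sequence in `fasta`.
    return ["pangolin", fasta]


def mafft_command(fasta):
    # --auto lets MAFFT pick an alignment strategy; the alignment is written
    # to standard output, so the caller has to redirect it to a file.
    return ["mafft", "--auto", fasta]


def iqtree_command(alignment, bootstrap_replicates=1000):
    # -m MFP selects the best-fitting substitution model before tree search;
    # -bb requests ultrafast bootstrap (an assumption, see the text above).
    return ["iqtree", "-s", alignment, "-m", "MFP",
            "-bb", str(bootstrap_replicates)]


# Hypothetical file names, for illustration only.
for cmd in (pangolin_command("egypt.fasta"),
            mafft_command("egypt.fasta"),
            iqtree_command("egypt.aln.fasta")):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`; for MAFFT, the standard output would additionally be redirected to the alignment file.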
Variation analysis. World dataset. GISAID public sequences (until 15th of January 2021) were collected and aligned to the reference viral sequence using the nucmer program 46. The output file was parsed to extract the variations and convert them to VCF format using an in-house script. The snpEff package 47 was then used to annotate the VCF file (snpEff_v4_5covid19_core.zip). All the VCFs were then processed to compute the frequency of each variation in the world population.
Egyptian dataset. To determine the characteristics of genomic variation, we analyzed the 183 whole SARS-CoV-2 genomes collected in the second wave between November 2020 and mid-January 2021. The variations (mutations) in the Egyptian genomes were examined for quality and depth. A variation is filtered out if its depth is less than 50 reads. We also checked whether each variation occurs in a homopolymer region, especially if it appears only once in our dataset and is not present in the world population. (Homopolymer errors are frequent and well-known sequencing errors of the Ion Torrent technology.) The final set of variations was then annotated with snpEff. Moreover, the variations were annotated with their frequencies in both the Egyptian and the world populations.
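The depth filter and the homopolymer check can be sketched as follows. The text does not say how a flagged variation was handled, so discarding singletons that fall in a homopolymer run and are absent from the world data is an assumption made here, as are the minimum run length of 4 and all class, field, and function names.

```python
from dataclasses import dataclass

MIN_DEPTH = 50        # variations with depth below 50 reads are filtered out
MIN_HOMOPOLYMER = 4   # assumed run length that counts as a homopolymer


@dataclass
class Variation:
    pos: int           # 1-based position on the reference
    depth: int         # number of reads covering the position
    egypt_count: int   # Egyptian genomes carrying the variation
    world_count: int   # world genomes carrying the variation


def in_homopolymer(reference, pos, min_run=MIN_HOMOPOLYMER):
    """True if the 1-based position lies in a run of identical bases of
    length >= min_run."""
    i = pos - 1
    start = end = i
    while start > 0 and reference[start - 1] == reference[i]:
        start -= 1
    while end < len(reference) - 1 and reference[end + 1] == reference[i]:
        end += 1
    return end - start + 1 >= min_run


def keep(variation, reference):
    if variation.depth < MIN_DEPTH:
        return False
    # Assumed rule: a singleton that is unseen worldwide and sits in a
    # homopolymer run is treated as a likely sequencing error.
    singleton = variation.egypt_count == 1 and variation.world_count == 0
    return not (singleton and in_homopolymer(reference, variation.pos))


# Toy reference and variations (hypothetical values).
reference = "ACGTTTTTACGA"
variations = [
    Variation(pos=6, depth=120, egypt_count=1, world_count=0),
    Variation(pos=2, depth=30, egypt_count=5, world_count=10),
    Variation(pos=10, depth=200, egypt_count=1, world_count=0),
    Variation(pos=6, depth=300, egypt_count=40, world_count=1000),
]
kept = [v.pos for v in variations if keep(v, reference)]
print(kept)
```

In this toy input, the first variation is dropped as a homopolymer singleton and the second for low depth, while the last two are kept.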
We also analyzed the complete SARS-CoV-2 genomes of 265 samples (available on GISAID, https://www.gisaid.org) from the first wave of infection in Egypt, collected between March and April 2020 from 7 different institutes in Egypt, namely, National Cancer Institute (n = 85), Cancer Children Hospital (n = 90), Egyptian Army (n = 36), Ain Shams Medical Institute (n = 30), Ministry of Health (n = 19), Pathogen Genomics Center, National Institute of Infectious Diseases (n = 2), National Research Center (n = 2), Vaccine Research Institute (n = 1).