SARS-CoV-2 genome variations and evolution patterns in Egypt: a multi-center study

A serious global public health emergency emerged late November 2019 in Wuhan City, China, by a new highly pathogenic virus, SARS-CoV-2. The virus evolution spread has been tracked by three developing databases: GISAID, Nextstrain and PANGO to understand its circulating variants. In this study, 110 diagnosed positive COVID-19 patient’s samples, were collected from Kasr Al-Aini Hospital and the Children Cancer Hospital Egypt 57357 between May 2020 and January 2021, with clinical severity ranging from mild to severe. The viral genomes were sequenced by next generation sequencing, and phylogenetic analysis was performed to understand viral transmission dynamics. According to Nextstrain clades, most of our sequenced samples belonged to clades 20A and 20D, which in addition to clade 20B were present from the beginning of sample collection in May 2020. Clades 19A and 19B, on the other hand, appeared in the mid and late 2020 respectively, followed by the disappearance of clade 20B at the end of 2020. We identified a relatively high prevalence of the D614G spike protein variant and novel patterns of mutations associated together and with different clades. We also identified four mutations, spike H49Y, ORF3a H78Y, ORF8 E64stop and nucleocapsid E378V, associated with higher disease severity. Altogether, our study contributes genetic, phylogenetic, and clinical correlation data about the spread of the SARS-CoV-2 pandemic in Egypt.

www.nature.com/scientificreports/ such as viral pneumonia through infecting the lower respiratory systems, and/or diarrhea and vomiting via the gastrointestinal tract, with varying severity 6 . Coronaviruses are single-stranded, positive-sense RNA (+ ssRNA) contained in an envelope. The coronavirus genome is approximately 26-32 kb which is currently identified as the largest RNA virus genome size 7 . The SARS-CoV-2 genome comprises 10 functional ORFs; ORF1ab which encodes the viral replication and transcription complex (RTC), 4 structural proteins-spike (S) protein, nucleocapsid (N) protein, envelope (E) protein and membrane (M) protein, and 5 accessory proteins-ORF3a, ORF6, ORF7a, ORF7b and ORF8. ORF1ab comprises two-thirds of the viral genome and encodes two polyproteins (PP1ab and PP1a). PP1ab and PP1a are then cleaved into 16 non-structural proteins nsp1-16; nsp1 (leader protein), nsp2-11 which provide supporting functions for the RTC, RNA-dependent RNA polymerase (nsp12), helicase (nsp13), 3′-to 5′ endonuclease (nsp14), endoRNase (nsp15) and 2′-O-ribose methyltransferase (nsp16). Viral entry into infected cells depends on interaction between the surface spike S protein with angiotensin-converting enzyme 2 (ACE2) host cellular receptor, followed by cleavage of S protein between the S1 and S2 domains, and subsequent internalization of the virus inside the cell. The functions of the S1 and S2 domain is to mediate the binding and downstream membrane fusion, respectively 8 . A subdomain of S1 folds independently and acts as a receptor binding domain (RBD) 9 , binding with high affinity to ACE2. The immune-dominance of the RBD makes it the primary target of natural and vaccine-elicited immunity 7 .
Viral genome sequencing provides a powerful approach to monitor introductions into a country and anticipate spread and evolution of the virus. Certain variants of concern have been defined by the world health organization (WHO) as variants causing increased transmissibility, virulence, or reduced vaccine effectiveness. These variants have notable mutations affecting the spike protein, in addition to mutations in other protein coding sequences. Genetic variations in viral genome can result in a better or worse prognosis, thus studying these mutations is critical to enhance patients' outcome 10 .
Three software and databases were developed for the real-time tracking of the SARS-CoV-2 evolution through analysis of genomic sequences and the assignment of phylogenetic clades and lineages; Nextstrain, Global Initiative on Sharing Avian Influenza Data (GISAID) and Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO). Analyses done for the S-protein on more than 28,000 spike gene sequences revealed the emergence of a non-synonymous mutation at position 614 (D614G) that was rare before March 2020 and then became more common as the pandemic spread by June 2020 11 . Three other mutations accompanied the D614G substitution: a synonymous/silent C to T mutation at position 3037 in ORF1ab, a C to T mutation at position 241 in the 5` untranslated region and a nonsynonymous C to T mutation at position 14,408 in ORF1ab resulting in mutation P3715L (or P314L) in the hydrophobic cleft near the active site of the RNA-dependent RNA polymerase gene (RdRp/nsp12) 12 .
In this study, 110 SARS-CoV-2 viral isolates from Kasr Al-Aini Hospital and the Children's Cancer Hospital Egypt (CCHE 57357) were sequenced using short read sequencing of total RNA content. The genomic variation and phylogenetic diversity of SARS-CoV-2 were investigated, as well as mutation patterns and correlation with clinical severity. Finally, co-occuring mutations were investigated and several groups of co-occurring mutations were identified.

Materials and methods
Ethical and IRB approval. All procedures performed in the study involving human participants were in accordance with the ethical standards of the institutional research committee of CCHE 57357 and with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. A signed informed consent was obtained from the patient (or patient's guardian for pediatric patients) as an assent to participate in the current study, all of which were approved by the Institutional Review Board at CCHE 57357.
Sample collection and RNA extraction. Naso-pharyngeal swabs were collected in viral transport media and RNA was extracted using QIAamp® Viral RNA Mini kit (Qiagen). Confirmatory qualitative commercial RT-PCR kits were used for diagnosis and screening (depending on critical availability during the outbreak).
Library preparation and next generation sequencing. Library preparation was performed using the TruSeq stranded total RNA (Illumina, USA). Samples were then normalized, pooled and subjected to 150-base paired-end reads sequencing using Illumina NextSeq system with a minimum of 2.4 Gb sequencing depth per sample.
Raw reads processing and mapping. Reads were processed in which low reads quality were filtered out using Trimmomatic 13 followed by host reads removal through mapping trimmed reads against Homo sapiens genome reference (GRCh38) using Burrows-Wheeler Alignment tool (BWA) mem 14 then unmapped reads were extracted using SAMtools 15 . Unmapped reads were mapped later using BWA mem to Wuhan-Hu-1 (MN908947.3) reference genome. Both mapping runs were aligned using paired-end mode and default parameters. Mapped reads were sorted and indexed using SAMtools.

Variant detection.
Variant calling was performed using Lofreq V2.1.2 16 starting with realign of aligned reads and indel quality assignment using Viterbi and indelqual commands from LoFreq package. Finally, LoFreq was used to call low-frequency variants.
Generated variants were filtered using LoFreq filter and BCFtools 17 with default parameters. Filtered variant were annotated with SnpEff 18 and extracted using SnpSift 19 . Spearman correlation coefficient analysis between selected mutations using cor function in R 4.1.2 20  www.nature.com/scientificreports/ Phylogenetic analysis. Consensus sequence for each sample was computed using mpileup and consensus tools from Bcftools. MAFFT was used to perform multiple sequence alignment 22 using the following parameters (--6merpair --maxambiguous 0.05 --addfragments) followed by phylogenetic tree calculation using IQ-TREE 23 which was visualized by iTol 24 .

Study population and clinical parameters. A total of 110 diagnosed positive COVID-19 patients
were included in the study; 13 from Kasr Al-Aini Hospital and 97 from the outpatient COVID clinic at the CCHE 57357. Samples were collected in the period between May 2020 and January 2021, and included 60 males (54.5%), and 50 females (45.5%). 88 patients were adults (average age 39.6), and 22 were pediatric patients (average age 10.2). Clinical severity ranged from mild (60.9%, n = 67) to moderate (24.6%, n = 27) and severe (14.5%, n = 16), resulting in hospitalization in 36 patients (32.7%) and a survival rate of 97.3%. Clinical data is summarized in Table 1.

Phylogenetic analysis of our samples.
To determine phylogenetic characteristics of our SARS-CoV-2 isolates, we performed maximum likelihood tree based on aligned full length sequence of our genomic sequences using IQ-tree and Nextclade online tool (https:// clades. nexts train. org/) (Fig. 1). Nextstrain clade, PANGO lineage, date of sampling, hospitalization and clinical severity are color indicated on the phylogenetic tree in Fig. 1a. Nextstrain classifies strains as follows; 19A is considered the parent strain with no mutations compared to the Wuhan-1 reference strain, C8782T and T28144C (ORF8 L84S) define clade 19B, and C3037T, C14408T (ORF1ab P4715L) and A23403G (S D614G) define clade 20A. Clades 20B and 20D are derived from clade 20A and have further clade defining mutations; GG28881-28882AA (N R203K) and G28883C (G204R) for clade 20B and C4002T and G10097A (ORF1ab T1246I and G3278S) for clade 20D. Most of our samples belong to Nextstrain clades 20A and 20D, 34.5% and 40% respectively (Fig. 1a, b) and are equally distributed across our sampling dates from May 2020 to January 2021. Clade 19A and 19B appears in our samples mid and late 2020 respectively. Clade 20B disappears by end of 2020 ( Supplementary Fig. 1).

Discussion
Since In this study, samples collected in Dec 2020 belonged to clade 19B from patients suffering mild symptoms with some reported neutropenia. Samples belonging to clade 20A were collected from May 2020 to January 2021 with all levels of clinical severity ranges from mild to severe symptoms and reports of hypoxemia, neutropenia, and lymphopenia in some patients. Clade 20B circulated from May to August 2020 with patients showing mild to moderate clinical severity and reports of neutropenia and lymphopenia in some. The last clade 20D circulated from May 2020 to January 2021 with patients showing all levels of clinical severity from mild to severe and reports of neutropenia, lymphopenia, and hypoxemia in some patients.
Clade 19B appears late in our samples and harbors D614 in the spike protein. This is in contrast to studies showing the D614 as the original ancestor, and G614 appearing late in the pandemic. D614G occurs at the B-cell epitope, and was reported to enhance the viral replication in human lung epithelial cells and primary human airway tissues thus increasing the infectivity and stability of virions 12 and was speculated to reduce the efficacy of the vaccines. Our results show an inverse correlation between the occurrence of D614G which occurs in most of the strains, belonging to clades 20A, 20B and 20D and other mutations in the spike N501T, H655Y and IHV68I, which appear in strains belonging to clade 19B.
PANGO lineage A.28 was mostly reported in France, and harbored many key mutations in the spike protein; N501T, H655Y and IHV68I. The N501T spike mutation was predicted to increase the ACE2 binding 31 . IHV68I (or del 69-70) was also reported in the Alpha variant (B.1.1.7) and interferes with viral PCR test accuracy 32 . Other notable spike mutations were reported in our samples. S477N, which lies in the RBD domain, was only found in a few samples in our study and was reported to slightly increase the ACE2 binding. P681H mutation, which is near the furin cleavage site, was found in eight samples in our study belonging to Nextstrain clade 20D and PANGO lineage C.36, and is considered one of the clade defining mutations for the highly contagious delta variant-Nextstrain clade 20I. S477N and P681H were found to confer resistance to antibody therapy 29,33 , whereas contradictory reports were reported for IHV68I 34,35 . A study on 176 viral genome sequences from Egypt showed mutation patterns similar to those found in our data 36 .
In our study, we identified four mutations with association to moderate and severe clinical severity, spike H49Y, ORF3a H78Y, ORF8 E64stop and nucleocapsid E378V. ORF8 gene was reported to be involved in the innate immunity evasion 37 , and similar to our data E64stop mutation was associated with severity of clinical symptoms 38 .
In conclusion, through SARS-CoV-2 viral sequencing, our study identified several lineages circulating in Egypt between May 2020 and January 2021. We identified a large range of mutations throughout the SARS-CoV-2 genome, including four mutations, spike H49Y, ORF3a H78Y, ORF8 E64stop and nucleocapsid E378V, that were associated with higher disease severity. Additionally, we identified several mutation groups that were associated together and in specific clades. These results could provide a starting point for in vitro and in vivo analysis for the functions of these mutations, and are vital for virus tracking and the development of novel vaccines.

Data availability
All data generated and analyzed during this study are included in this article and published online on NCBI with BioProject ID PRJNA818451 https:// www. ncbi. nlm. nih. gov/ biopr oject/ PRJNA 818451.