Emergence of B.1.524(G) SARS-CoV-2 in Malaysia during the third COVID-19 epidemic wave

The COVID-19 pandemic first emerged in Malaysia in Jan 2020. As of 12th Sept 2021, 1,979,698 COVID-19 cases had been confirmed over three major epidemic waves. The viruses contributing to the three epidemic waves have not been well studied. We sequenced the genomes of 22 SARS-CoV-2 strains detected in Malaysia during the second and the ongoing third wave of the COVID-19 epidemic. Detailed phylogenetic and genetic variation analyses of the SARS-CoV-2 isolate genomes were performed using these newly determined sequences and all other available sequences. Results from the analyses suggested multiple independent introductions of SARS-CoV-2 into Malaysia. A new B.1.524(G) lineage with the S-D614G mutation was detected in Sabah, East Malaysia and Selangor, Peninsular Malaysia on 7th October 2020 and 14th October 2020, respectively. This new B.1.524(G) group was not the direct descendant of any of the previously detected lineages. The new B.1.524(G) carried a set of genetic variations, including A701V (position variant frequency = 0.0007) in the Spike protein and a novel G114T mutation in the 5'UTR. The biological importance of these specific mutations remains unknown. The sequential appearance of the mutations, however, suggests that the spread of the new B.1.524(G) lineage likely began in Sabah and then extended to Selangor. The findings presented here support the importance of SARS-CoV-2 full genome sequencing as a tool to establish an epidemiological link between cases or clusters of COVID-19 worldwide.

www.nature.com/scientificreports/

manufacturer's recommendation amplification protocol. The adaptors and barcodes were ligated to the pooled amplicon libraries using the Ion AmpliSeq™ Library Kit Plus (Ion Torrent, Thermo Scientific). The barcode-ligated amplicon libraries were subjected to another round of amplification and clean-up prior to sequencing template preparation. The sequencing template was prepared using the Ion PGM™ Hi-Q™ View OT2 Kit, and full genome sequencing was performed on the Ion Personal Genome Machine (PGM) with 550 flows. The generated reads were mapped to the SARS-CoV-2 reference strain Wuhan-Hu-1 (GenBank accession number: MN908947). The level of sequence coverage of the target genome regions was assessed using coverageAnalysis v5.12.0.0, as implemented in Torrent Suite™ Software 5.12. Genome assembly was conducted using IRMAreport v1.2.1.0 (Ion Torrent, Thermo Scientific).
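As a rough illustration of what a coverage summary of mapped reads involves, the following Python sketch computes per-base depth and breadth of coverage from read alignment intervals. It is not the coverageAnalysis plugin used here; the function names, the interval input format, and the `min_depth` threshold are hypothetical choices made only for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def depth_profile(reads: Iterable[Tuple[int, int]], genome_length: int) -> List[int]:
    """Per-base depth from 1-based, inclusive (start, end) read alignments."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 2)
    for start, end in reads:
        start = max(start, 1)
        end = min(end, genome_length)
        if start > end:
            continue  # read lies entirely outside the reference
        diff[start] += 1
        diff[end + 1] -= 1

    depth = []
    running = 0
    for pos in range(1, genome_length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def coverage_summary(depth: List[int], min_depth: int = 1) -> Dict[str, float]:
    """Mean depth and fraction of positions covered at >= min_depth (assumed cutoff)."""
    covered = sum(1 for d in depth if d >= min_depth)
    return {
        "mean_depth": sum(depth) / len(depth),
        "breadth": covered / len(depth),
    }


# Toy example on a 10-base reference with two overlapping reads.
toy_depth = depth_profile([(1, 5), (3, 8)], genome_length=10)
toy_summary = coverage_summary(toy_depth)
```

For a real run, `genome_length` would be the length of the mapped reference (29,903 nucleotides for Wuhan-Hu-1) and the intervals would come from the read alignments.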

Phylogenetic analysis of Malaysian SARS-CoV-2.
The near-complete genome sequences (nucleotides 66-29,674) were extracted from the multiple sequence alignment (MSA) of the Malaysian SARS-CoV-2 sequences. Sequences with low coverage and gaps were removed from the analysis. The resulting dataset, consisting of 106 Malaysian SARS-CoV-2 strains, was used for phylogenetic tree reconstruction (Supplementary Table 1).
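The trimming and filtering step can be sketched as follows, assuming the alignment is held as a dictionary of equal-length aligned strings that includes the reference. The function names and the `max_missing_fraction` cutoff are hypothetical; the exact criterion used to exclude low-coverage or gapped sequences is not stated in the text.

```python
from typing import Dict


def ref_pos_to_column(aligned_ref: str) -> Dict[int, int]:
    """Map 1-based reference coordinates to 0-based alignment columns."""
    mapping = {}
    pos = 0
    for col, base in enumerate(aligned_ref):
        if base != "-":  # gap columns in the reference carry no coordinate
            pos += 1
            mapping[pos] = col
    return mapping


def trim_and_filter(
    alignment: Dict[str, str],
    ref_name: str,
    start: int,
    end: int,
    max_missing_fraction: float = 0.0,  # assumed cutoff, not from the paper
) -> Dict[str, str]:
    """Cut every sequence to reference positions start..end and drop
    sequences whose trimmed region has too many gaps or N bases."""
    cols = ref_pos_to_column(alignment[ref_name])
    lo, hi = cols[start], cols[end]
    kept = {}
    for name, seq in alignment.items():
        region = seq[lo:hi + 1].upper()
        missing = region.count("-") + region.count("N")
        if name == ref_name or missing / len(region) <= max_missing_fraction:
            kept[name] = region
    return kept


# Toy alignment; a real call would use start=66 and end=29674.
toy_alignment = {
    "ref": "ACGTACGTAC",
    "good": "ACGTTCGTAC",
    "gappy": "AC--ACNTAC",
}
toy_trimmed = trim_and_filter(toy_alignment, "ref", start=3, end=8)
```

Mapping through the aligned reference keeps the coordinates correct even when the alignment contains insertion columns relative to Wuhan-Hu-1.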

Results
Genome structure, geographical distribution and clade assignment of SARS-CoV-2 strains. In the current study, 22 complete and near-complete genome sequences of SARS-CoV-2 strains detected in Malaysia were generated (Table 1). The genome sequences encompassed at least positions 26 in the 5'-untranslated region (5'UTR) to 29,847 in the 3'-untranslated region (3'UTR), numbered according to the Wuhan-Hu-1 genome sequence (MN908947.3). All complete SARS-CoV-2 genome sequences in this study (16 out of 22) possessed a genome structure similar to that of the reference sequence, Wuhan-Hu-1, with no insertion or deletion detected between nucleotides 26 and 29,847. Among these 22 genome sequences, 12 were detected between April and May 2020, representing SARS-CoV-2 strains circulating during the second wave of the COVID-19 epidemic in Malaysia. Most of the second-wave SARS-CoV-2 strains were detected in Selangor, a state in Peninsular Malaysia. Only one sample, 4Apr20-3-Hu/2020, was detected in Negeri Sembilan, a state neighboring Selangor (Table 1). The other ten SARS-CoV-2 strains were detected in Sabah and Selangor and represent a subset of the strains circulating during the initial phase of the third COVID-19 epidemic wave. The clade/lineage of each SARS-CoV-2 strain was assigned based on the GISAID clade and PANGOLIN lineage assignments. Currently, there are seven GISAID-assigned SARS-CoV-2 clades based on a list of ten genetic markers (Table 1 and Supplementary Table 2). Samples that did not cluster under these seven major GISAID clades are denoted as others (O). Based on the GISAID clade assignment, the samples detected in early April 2020 (4Apr20-3-Hu/2020 and 5Apr20-64-Hu/2020) and all samples detected in October 2020 (third epidemic wave) were assigned to clade G (Table 1) and possessed the genetic variations C241T, C3037T, and A23403G.
Using the Pangolin COVID-19 Lineage Assigner, these clade G strains were assigned to the same lineage, B.1.524 (Table 1). Samples obtained on 21 April 2020 and 2 May 2020 did not cluster under any of the seven major GISAID clades and were hence denoted as O. All samples obtained on 21 April 2020 carried two nucleotide changes, C241T and G11083T, while the sample obtained on 2 May 2020 (2May20-132-Hu/2020) possessed only G11083T. Based on the Pangolin assignment, all samples detected on 21 April 2020 were assigned to lineage B.6.1, and the sample detected on 2 May 2020 was assigned to B.6.6.
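Marker-based assignment of this kind can be sketched in a few lines of Python. The example below labels single-nucleotide differences in the `C241T` style and tests for the three clade G markers named above. It assumes the query is already aligned to the reference without insertions or deletions, and it does not reproduce the GISAID or Pangolin tools; `call_snvs` and `carries_markers` are hypothetical helper names.

```python
from typing import Set

# Clade G markers as listed in the text.
CLADE_G_MARKERS = {"C241T", "C3037T", "A23403G"}


def call_snvs(ref: str, query: str, offset: int = 1) -> Set[str]:
    """Label substitutions between two aligned, equal-length sequences.

    `offset` is the reference coordinate of the first base supplied.
    Positions where either base is not A/C/G/T (gaps, N) are skipped.
    """
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    snvs = set()
    for i, (r, q) in enumerate(zip(ref.upper(), query.upper())):
        if r != q and r in "ACGT" and q in "ACGT":
            snvs.add(f"{r}{offset + i}{q}")
    return snvs


def carries_markers(snvs: Set[str], markers: Set[str]) -> bool:
    """True when every marker in `markers` is present in `snvs`."""
    return markers <= snvs


# Toy fragment starting at reference position 240: C->T at 241, N ignored.
toy_snvs = call_snvs("ACGTA", "ATGTN", offset=240)

# A genotype with the clade G markers plus C14408T still satisfies the test.
is_clade_g = carries_markers({"C241T", "C3037T", "A23403G", "C14408T"}, CLADE_G_MARKERS)
```

A sample that carries none of the defined marker sets would fall through every such test and be labelled O, as described for the 21 April and 2 May 2020 samples.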
Phylogenetic relationships and molecular signature of Malaysian SARS-CoV-2. The phylogenetic relationships of SARS-CoV-2 detected in Malaysia were examined using a phylogenetic tree reconstructed from near-complete genome sequences spanning nucleotides 66 to 29,674. The tree consisted of 106 SARS-CoV-2 strains (16 generated in this study and 90 retrieved from GISAID) detected between 28 January 2020 [...] in Malaysia after the first wave. Lineage B (GISAID clade L) was closely related to the first sequenced SARS-CoV-2 strain, Wuhan-Hu-1 (Fig. 1). Malaysia's lineage B(L) strains detected between 6 and 12 February clustered closely to form a distinct subgroup. These strains shared two missense mutations: C1758T, which encodes a substitution of alanine with valine at position 318 of nsp2, and C10604T, which encodes a substitution of proline with serine at position 184 of nsp5 (Table 2). These mutations, however, were not observed in other B(L) strains detected after the first wave, consistent with epidemiological data indicating that the early introduction of SARS-CoV-2 during the first wave was successfully contained. We looked into the time of emergence of these mutations (https://bigd.big.ac.cn/ncov/); the first strain carrying C10604T (nsp5-P184S) was hCoV-19/Beijing/BJ53/2020, detected on 24 January 2020 in Beijing, while the first strain carrying both C1758T and C10604T was hCoV-19/Malaysia/MKAK-CL-2020-7554/2020, detected in Malaysia on 6 February 2020. [...] (Fig. 1). All the samples clustered under lineage B. Our phylogenetic analyses showed that the samples segregated into two major groups before they were delineated into multiple sub-lineages. The first group clustered closely with B(L) strains detected during the first COVID-19 epidemic phase (Fig. 1) [...] (Table 2).
These two mutations were first reported in hCoV-19/Japan/PG-0015/2020 (EPI_ISL_479799), a strain detected in Hokkaido, Japan, on 20 January 2020. This was consistent with our analysis that the Most Recent Common Ancestor (MRCA) of Malaysia's B.12 strains could have dated back to 21 January 2020 (95% HPD: 12 February-25 February 2020). These two linked mutations were common genetic traits of strains detected in Hokkaido, Japan, from January to March 2020, suggesting a possible Japanese origin of Malaysia's B.12 strains. Unlike the B.12 subgroup, the other second-wave strains clustered under lineage B did not share any common genetic variation, indicating that they could have been imported independently from different sources. The second subgroup was a group of lineage B.1 strains and their descendants, comprising the strains detected between 21 March and 14 October 2020 (second and third COVID-19 epidemic waves, Fig. 1). All strains clustered under this group, including those sequenced in the current study, possessed the three GISAID clade G genetic variants, C241T, C3037T, and A23403G, with an additional mutation, C14408T. C14408T is a common mutation used to define B.1 in the Pangolin system, and it is also used to define haplogroup A2a4 (another SARS-CoV-2 clustering system) 10. C14408T and A23403G were missense mutations (Table 2). C14408T encodes a substitution of proline with leucine at position 323 of nsp12 in ORF1ab (nsp12-P323L). A23403G encodes a substitution of aspartic acid with glycine at position 614 of the Spike protein (S-D614G). So far, only strains that fell within this B.1-associated lineage carried the S-D614G amino acid substitution. Strains possessing all four mutations (C241T-C3037T-C14408T-A23403G) were actively circulating, especially in Europe (https://bigd.big.ac.cn/ncov/), before their first documented detection in Malaysia in late March 2020 (Fig. 1).
Our data showed that this B.1-associated lineage further segregated into four subgroups, representing strains of the B.1.1.X(GR), B.1(GH), B.1(G), and B.1.524(G) groups. Strains that possessed the three additional genetic markers G28881A, G28882A, and G28883C clustered under clade GR, strains that possessed G25563T were assigned to clade GH, and strains without these additional genetic variations remained in clade G. Malaysia's B.1.1.X(GR) group comprised six strains detected between 21 March and 29 May 2020. No additional mutations were shared among this group besides the clade-specific mutations. Malaysia's GH group comprised two strains, hCoV-19/Malaysia/MGI-G873/2020, detected on 7 April 2020, and hCoV-19/Malaysia/IMR WC94764/2020, detected on 29 May 2020. Both strains shared an additional genetic variation, C18877T, a synonymous mutation. The assigned B.1(G) strains segregated into two distinct subgroups. The first subgroup consisted of two strains sequenced in this study, hCoV-19/Malaysia/4Apr20-3-Hu/2020, detected in Negeri Sembilan on 4 April 2020, and hCoV-19/Malaysia/5Apr20-64-Hu/2020, detected in Selangor on 5 April 2020. These strains shared an additional genetic variation, G25429T, which encodes a substitution of valine with leucine at position 13 of ORF3a (ORF3a-V13L). These two spatially separated ORF3a-L13-bearing B.1(G) strains could have descended from a common ancestor carrying the G25429T mutation. The mutation at this position is relatively rare, with a variation frequency of less than 0.01 (Table 2).
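The link between a nucleotide position and the affected amino acid position can be checked with simple arithmetic. The sketch below is a minimal illustration: the region start coordinates are assumptions taken from the Wuhan-Hu-1 reference annotation rather than from this paper, and `codon_of` is a hypothetical helper. Regions read across the ORF1ab ribosomal frameshift, such as nsp12, are left out because the plain offset calculation does not hold there.

```python
from typing import Tuple

# 1-based start coordinates on Wuhan-Hu-1 (assumed from the reference
# annotation; not stated in the paper).
REGION_START = {
    "nsp2": 806,
    "nsp5": 10055,
    "S": 21563,
    "ORF3a": 25393,
}


def codon_of(region: str, nt_pos: int) -> Tuple[int, int]:
    """Return (amino acid position, position within codon), both 1-based."""
    offset = nt_pos - REGION_START[region]
    if offset < 0:
        raise ValueError("position lies upstream of the region start")
    return offset // 3 + 1, offset % 3 + 1


# Mutations discussed in the text.
nsp2_hit = codon_of("nsp2", 1758)      # C1758T, nsp2 position 318
nsp5_hit = codon_of("nsp5", 10604)     # C10604T, nsp5 position 184
spike_hit = codon_of("S", 23403)       # A23403G, Spike position 614
orf3a_hit = codon_of("ORF3a", 25429)   # G25429T, ORF3a position 13
```

The computed amino acid positions agree with the substitutions reported in the text (nsp2-318, nsp5-184, S-614, and ORF3a-13). Confirming the amino acid change itself would additionally require the reference codon sequence.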

B.2(V)/B.6(O) subgroups.
The second major group was the B/B.6 lineage (Fig. 1). All strains within this B.2/B.6 lineage shared the genetic variation G11083T, which encodes an amino acid change from leucine to phenylalanine at position 37 of nsp6 in ORF1ab. It was a common mutation detected in strains circulating in China in January 2020; the earliest isolates with this mutation dated back to 17 January 2020 (https://bigd.big.ac.cn/ncov/). These T11083-bearing strains segregated into two groups corresponding to lineage B (clade V) and lineage B.6 (no specific GISAID clade was assigned; denoted as O, referring to others). Two B(V) strains were detected in Malaysia on 31 March and 2 April 2020 (Fig. 1). In addition to T11083, both strains possessed two additional nucleotide variations, C14805T and G26144T. G11083T and G26144T were the genetic markers for the assignment of GISAID clade V, while C14805T was an additional genetic trait present in this group. C14805T was a synonymous mutation that was not originally present in China during the early spread of the virus, suggesting that this mutation could have accumulated in the SARS-CoV-2 gene pool outside of China. These three linked mutations, however, were found in strains detected in England and Korea from the end of Jan and early Feb 2020, indicating the widespread distribution of the ancestor strains (https://covidcg.org/). No B(V) strain was detected in Malaysia after 2 April 2020.

The B.6(O)-associated lineage formed the largest group in Malaysia's SARS-CoV-2 phylogenetic tree, comprising strains detected from 4 March to 4 June 2020. The SARS-CoV-2 strains sequenced in this study (21 April-2 May 2020) clustered within this subgroup. All B.6 strains within this group contained four additional genetic variations, C6312A, C13730T, C23929T, and C28311T, in addition to the G11083T mutation. These five mutations were probably linked, as they were detected simultaneously in strains recovered from different countries in March 2020 (https://bigd.big.ac.cn/ncov/). These A6312-T11083-T13730-T23929-T28311-bearing strains segregated into multiple distinct subgroups marked by several unique genetic traits. Sequential accumulation of these mutations could have reflected the transmission path of SARS-CoV-2 in the local community. For example, the B.6.1(O) subgroup was also characterized by the mutation T7621C. This T7621C mutation was a synonymous mutation detected in 29 strains, mainly from Malaysia and Brunei (https://bigd.big.ac.cn/ncov/variation/annotation/variant/7612), suggesting that it is a unique mutation that occurred in this region and could have originated from a single source. These C7621-bearing strains were further delineated into two groups differentiated by an additional mutation, C25549T. Our samples obtained in mid-Apr 2020 clustered within the C7621-C25549T groups, with additional mutations detected in some of the samples. Most of these samples were detected in individuals with travel histories to Indonesia and India. There was, however, no distinct spatial clustering of the samples. The sample collection times for the C25549-bearing and T25549-bearing groups overlapped, however, indicating at least two independent and simultaneous transmission chains.
Within B.6(O), another group with the additional mutation C19524T was observed; subsequently, an additional C6210A and then C2508A were detected in a subset of this group (B.6.6(O)). B.6.6(O) comprised samples linked to the identified Seri Petaling Gathering Cluster 41. The T19524 and T19524-A6210 strains were also detected in neighboring countries, including Singapore, Thailand, and Australia (https://bigd.big.ac.cn/ncov/), but the T19524-A6210-A2508 strains were not, suggesting that C2508A could be a mutation that accumulated in the SARS-CoV-2 gene pool during the second COVID-19 epidemic wave in Malaysia.
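The reasoning from sequential accumulation can be made concrete with a small set-based sketch. Each genotype is represented by its set of mutations, and a genotype is linked to the largest other genotype whose mutations it fully contains. This is only a consistency check under the assumption that mutations are gained and not lost; it is not the phylogenetic method used in the study, and the function name and genotype labels are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def accumulation_links(
    genotypes: Dict[str, FrozenSet[str]],
) -> List[Tuple[str, str, List[str]]]:
    """Link each genotype to its closest nested predecessor.

    Returns (predecessor, successor, newly gained mutations) tuples.
    Genotypes with no strict subset among the others get no link.
    """
    links = []
    for child, child_muts in genotypes.items():
        # Candidate predecessors carry a strict subset of the child's mutations.
        candidates = [
            (name, muts) for name, muts in genotypes.items() if muts < child_muts
        ]
        if not candidates:
            continue
        # The largest strict subset is the closest predecessor.
        parent, parent_muts = max(candidates, key=lambda item: len(item[1]))
        links.append((parent, child, sorted(child_muts - parent_muts)))
    return links


# Genotypes described for the C19524T-bearing group (labels are illustrative).
toy_genotypes = {
    "T19524": frozenset({"C19524T"}),
    "T19524-A6210": frozenset({"C19524T", "C6210A"}),
    "T19524-A6210-A2508": frozenset({"C19524T", "C6210A", "C2508A"}),
}
toy_links = accumulation_links(toy_genotypes)
```

The same subset test explains why a strain lacking a mutation carried by an earlier group cannot be counted as its direct descendant under this assumption.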

Third COVID-19 epidemic wave in Malaysia (from 8 October).
Our phylogenetic results showed that all samples detected in Oct 2020 (Sabah and Selangor strains) clustered together within lineage B.1.524(G). Although these third-wave strains clustered closely with strains detected in early April (hCoV-19/Malaysia/4Apr20-3-Hu/2020 and hCoV-19/Malaysia/5Apr20-64-Hu/2020), the genetic analysis showed that the Oct 2020 strains were not direct descendants of April's B.1(G) group, because the genetic trait T25429 was not observed in the genomes of the October 2020 strains. Sequence analysis showed a common ancestor [...]

Discussion
Overall, our findings suggest that there was no sustained transmission of a single SARS-CoV-2 lineage in Malaysia. The initial introductions of SARS-CoV-2 during the first epidemic wave and early in the second wave could be regarded as Stage 1 COVID-19 transmission 42, in which only imported cases, with no localized community transmission, were recorded 18. This is evidenced by the absence of similar virus strains or descendant lineages in our analysis. The findings indicate that the targeted border control measures, rapid contact tracing, and isolation implemented during the early phase of the COVID-19 epidemic were effective in containing the spread of SARS-CoV-2 between late Jan and Feb 2020. Substantial local transmission of COVID-19 in Malaysia started in March 2020 as a result of a subsequent new introduction. Our findings suggest that, contrary to an earlier suggestion 43, the S-G614 strains were already present in Malaysia as early as March 2020 rather than July 2020. Although the S-G614 strains could have better epidemic potential, as illustrated in previous studies 37,38, the spread of the S-G614 groups before July 2020 went unnoticed. The lineage B.6(O)-associated groups (S-D614 strains) were the most successful clade documented in Malaysia in 2020, with high genetic diversity during the second wave of the COVID-19 epidemic. The early segregation of distinct B.6(O) sub-lineages on the phylogenetic tree, each with unique subgroup-specific mutations, suggests multiple independent importations and virus dissemination events that contributed to the high genetic diversity of B.6(O).
The third-wave isolates detected from Sept 2020 onwards in Sabah and Selangor, the two most highly affected states, were not direct descendants of any of the previous clusters, including the major contributor to the second wave, lineage B.6(O) 21. The extensive human-to-human transmission of SARS-CoV-2 allows for the accumulation of mutations in the actively circulating virus pool 36. Due to its polymerase proofreading ability 44, SARS-CoV-2 has a relatively slower evolutionary rate than other RNA viruses 45. Hence, with this slower evolutionary rate, as few as one mutation could provide enough information to discern the transmission dynamics, especially in local settings with limited individual movement. The asynchronous detection of B.1.524(G) in Sabah and Selangor suggests that the surges of COVID-19 cases in Oct 2020 could have originated from a common ancestral strain. It is very likely that the new B.1.524(G) strains represent a group of unsampled strains that circulated long enough to accumulate nine "stable" genetic variations. Epidemiological findings suggest that the surge in local transmission in Sabah, which started in mid-Sept 2020 46, preceded the increase in COVID-19 cases in Selangor. This suggests that the spread of B.1.524(G) could have begun in Sabah. Based on the epidemiological data, B.1.524(G) could have originated from Indonesia or the Philippines, through the transborder movement of undocumented immigrants into Sabah, Malaysia. Epidemiological investigation suggests that many reported clusters in Peninsular Malaysia, including those in Selangor, originated from Sabah as a result of massive domestic travel prior to the implementation of travel restrictions. So far, B.1.524(G) strains have also been detected in Singapore and Australia. We do not know whether these strains were imported from Malaysia or from the same source as the Malaysian strains.

Conclusion
The phylogenetic and genetic variation study of the SARS-CoV-2 detected in Malaysia showed the emergence of the B.1.524(G) group in Oct 2020. B.1.524(G) was one of the prevalent circulating lineages in the ongoing localized transmission during the third COVID-19 epidemic wave. It is a new group that did not resemble any of the S-G614 strains previously introduced into Malaysia. The unique genetic variations observed in this new B.1.524(G) suggest that it originated from a group of actively circulating sub-lineages that probably remained unsampled. Sequencing of more isolates from different clusters would reveal whether B.1.524(G) is the major contributor to the third wave of the COVID-19 epidemic. It would also allow for a better understanding of the evolutionary pattern of SARS-CoV-2 in Malaysia. The findings presented also highlight the potential use of sequencing data as a complementary tool to establish an epidemiological link between cases or clusters.