Altered subgenomic RNA abundance provides unique insight into SARS-CoV-2 B.1.1.7/Alpha variant infections

B.1.1.7 lineage SARS-CoV-2 is more transmissible, leads to greater clinical severity, and results in modest reductions in antibody neutralization. Subgenomic RNA (sgRNA) is produced by discontinuous transcription of the SARS-CoV-2 genome. Applying our tool (periscope) to ARTIC Network Oxford Nanopore Technologies genomic sequencing data from 4400 SARS-CoV-2 positive clinical samples, we show that normalised sgRNA is significantly increased in B.1.1.7 (alpha) infections (n = 879). This increase is seen over the previous dominant lineage in the UK, B.1.177 (n = 943), which is independent of genomic reads, E cycle threshold and days since symptom onset at sampling. A noncanonical sgRNA which could represent ORF9b is found in 98.4% of B.1.1.7 SARS-CoV-2 infections compared with only 13.8% of other lineages, with a 16-fold increase in median sgRNA abundance. We demonstrate that ORF9b protein levels are increased 6-fold in B.1.1.7 compared to a B lineage virus in vitro. We hypothesise that increased ORF9b in B.1.1.7 is a direct consequence of a triple nucleotide mutation in nucleocapsid (28280:GAT > CAT, D3L) creating a transcription regulatory-like sequence complementary to a region 3’ of the genomic leader. These findings provide a unique insight into the biology of B.1.1.7 and support monitoring of sgRNA profiles to evaluate emerging potential variants of concern.

T he SARS-CoV-2 lineage B.1.1.7 (Alpha, 20I/501Y.V1) 1 has been classified as a variant of concern by public health agencies. An increasing body of evidence suggests B.1.1.7 is more transmissible 2,3 and rapidly became the dominant circulating virus in the United Kingdom (UK) during October 2020 to February 2021 (Fig. 1a). To date, B.1.1.7 has been reported in 150 countries (https://cov-lineages.org/, 4 August 2021), with increasing prevalence. Preliminary data 4,5 shows that B.1.1.7 positive diagnostic respiratory samples may have lower cycle threshold (Ct) values, therefore higher viral loads, compared to other lineages. These findings suggest a potential reason for enhanced transmissibility, though they did not account for potential confounders such as days since symptom onset at sampling. Many of these studies also use S gene target failure (SGTF) as a surrogate for the presence of B.1.1.7 4 , which might misclassify samples, depending on the prevalence of B.1.1.7 6 . Several analyses from community-tested cases also suggest increased mortality associated with B.1.1.7 6  potential viral load increase and enhanced mortality are currently unclear 7 .
Genomic surveillance has been critical in rapidly identifying these variants and Nanopore sequencing of ARTIC Network 8 prepared SARS-CoV-2 amplicons is used by many laboratories to generate this data. We have reported an approach to quantify subgenomic RNA (sgRNA) abundance in genomic sequence data, which is produced as a result of a critical step in the SARS-CoV-2 replication cycle 9 . sgRNA is produced from the genomically encoded SARS-CoV-2 RNA-dependent RNA polymerase (RdRp) using discontinuous transcription of the positive, single-stranded SARS-CoV-2 genome from the 3′ end. Negatively stranded RNAs are produced, which are shorter than the genome, owing to a template switch from the ORF to the leader sequence at the 5′ end of the genome when RdRp encounters a transcription regulatory sequence in the genome body (TRS-B) to a complementary TRS 3′ of the leader sequence (TRS-L). All sgRNAs, therefore, contain a leader sequence at their 3′ end which can be used computationally for their identification. There are thought to be nine such canonical sgRNAs; Spike:S, E: Envelope, M: Membrane, N: Nucleocapsid, ORF3a, ORF6, ORF7a, ORF8 and ORF10, although multiple studies have found negligible ORF10 expression 10,11 .
As part of COVID-19 Genomics Consortium UK (COG-UK) 12 we have sequenced SARS-CoV-2 positive combined nose and throat swabs from healthcare workers and patients at Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, UK ('Pillar 1' testing, Supplementary Data 1). Additionally, to relieve pressure on centralised sequencing services, we have also sequenced a selection from 'Pillar 2' testing, which represents SARS-CoV-2 positive samples from the community, tested at the UK's lighthouse laboratories (Supplementary Data 2).
We hypothesised that we would see differences in sgRNA abundance in distinct lineages of SARS-CoV-2, in particular, increased sgRNA in B.1.1.7 that may relate to its altered clinical phenotype. We find that normalised sgRNA is significantly increased in B.  (Fig. 1e, f). Accordingly, total mapped reads and total genomic RNA reads (i.e. non-sgRNA) were also significantly higher for B.1.1.7 sequences compared to B.1.177 ( Supplementary Fig. 3). To rule out any confounding effects of our normalisation method, we also normalised subgenomic RNA reads to reads from ORF1a. Since ORF1a RNA is produced entirely from replication there would be no contribution of nested subgenomic RNA to the read count. Normalising to ORF1a reveals the same pattern of significant overrepresentation of S, N, E and M sgRNA in B.1.1.7 ( Supplementary Fig. 4).
We observed a weak but statistically significant negative correlation between raw sgRNA reads and ECT (R = −0. 16 Fig. 1g, h). To ensure the observed increase in sgRNA abundance in B.1.1.7 was not due to the effect of ECT alone (as a surrogate of viral load), we normalised sgRNA to ECT and repeated our comparisons. The differences between B.1.1.7 and B.1.177 were still apparent for S, N and M sgRNA (Fig. 1i), suggesting that the increase in B.1.1.7 sgRNA abundance is seen independently of any difference in ECT between lineages.
To confirm subgenomic RNA abundance increases in B.1.1.7 were translated to an increase in protein levels we performed quantitative mass spectrometry on ACE2, TMPRSS2 expressing A549 cells infected (MOI = 1) with either B or B.1.1.7 SARS-CoV-2 isolates. At 12 and 24 h post infection, cell monolayers were lysed and protein lysates were subsequently analysed by LC-MS/MS analysis. In order to quantify differences in the abundance of SARS-CoV-2 proteins between strains and at different time points, labelfree quantification (LFQ) data were normalised to the abundance of proteins encoded by the ORF1ab gene (Nsp1, Nsp2 and Nsp3). To ensure the changes observed in clinical samples were not due to parental lineage changes in sgRNA abundance, such as those observed with the R203K/G204R mutation 15 , we analysed normalised sgRNA comparing B.1.1.7 and B.1.177 with their respective ancestral lineages (Fig. 2a, b). An increase in sgRNA is still observed for B.1.1.7 compared to B.1.1 in all major ORFs and interestingly, for B.1.177 compared to B.1 for ORFs S, E, M and N ( Fig. 2a, b). This demonstrates the importance of stratifying sequences by lineage when studying sgRNA.
Changes in subgenomic RNA abundance during the course of infection. It is possible that the days since symptom onset at sampling may vary between lineages in our dataset, either due to changes in sampling practice over time or presentation of individuals to healthcare services, which in turn could confound our sgRNA findings. We obtained information on symptom duration at sampling for 2327 samples in our pillar 1 Fig. 9a, p < 0.0001). In both lineages, M sgRNA followed by S and N sgRNA were the most abundant at each timepoint following symptom onset, with the peak at day 1 seen in B.1.1.7 infections for most ORFs (Fig. 2f). The increase in sgRNA for B.1.1.7 early in infection was most evident for S and N sgRNA, with significant differences found between B.1.1.7 and B.1.177 at days 0 to 4 (Fig. 2g). Less marked differences were seen for E and ORF6 sgRNA, with no difference in M sgRNA or other ORFs ( Supplementary Fig. 9b, c). Although ORF6 has a significant increase on day 1 of infection ( Supplementary Fig. 9b) and a clear shift in the peak of expression is evident (Fig. 2f -yellow). Using the same methods for sgRNA estimation, we have previously shown that the presence of ACE2 and TMPRSS2 alter the kinetics of sgRNA expression in vitro, leading to a peak in abundance at an earlier stage of infection 9 . It is possible that the increased affinity for ACE2 conferred by N501Y in the B.  (Fig. 3a). In addition to the noncanonical sgRNA resulting from the nucleocapsid R203K/ G204R mutation (N*, Supplementary Fig. 2) 15 , B.1.1.7 samples were also significantly enriched for reads which support the production of a noncanonical sgRNA from genomic position 28282 (Fig. 3b-d and Supplementary Fig. 9, Wilcoxon unpaired p = <2e-16, Wilcoxon effect size = 0.797, B.1.1.7 n = 717, other n = 430). Like other sgRNAs, the production of this noncanonical sgRNA peaks 1 day following the onset of symptoms (Fig. 3e). The B.1.1.7 genomic sequence contains a triple nucleotide mutation at 28280, 28281, and 28282 GAT > CTA, resulting in an amino acid substitution, D3L, in the nucleocapsid protein. This triplet results in enhanced complementarity between the 28282 genomic regions and the sequence 3′ of the leader at the 5′ end of the genome (Fig. 3g panel iii), which may have resulted in a novel TRS-B-like site that drives higher sgRNA production from this locus. Of note, this site is upstream of the ORF9b ATG and this noncanonical sgRNA retains the full coding region of the ORF9b but lacks the canonical start codon of N. We propose that this represents the ORF9b sgRNA, which was detected at low levels in 13.8% (430/3127) non-B.1.1.7 samples (median normalised abundance=0.32), but is found in almost all pillar 1 B.1.1.7 infections (98.4%, 717/729), with a 16-fold increase in abundance (median normalised abundance = 6.02, Fig. 3a, c).
Interestingly, we noticed non-B.1.1.7 samples in our dataset with low levels of putative ORF9b sgRNA did not contain the g Normalised sgRNA abundance for N and S is significantly increased at 0-4 days of infection. All p values calculated using an unpaired Wilcoxon signedrank test, and adjusted for multiple testing with Holm (**** <0.0001, *** <0.001, ** <0.01, * <0.05, ns or blank = not significant). All boxplots depict the 25th, 50th (median), and 75th percentiles, and whiskers represent the most extreme datapoint which is no more than 1. nucleocapsid D3L mutation. We propose the following hypothesis to explain our findings: in non-B.1.1.7 SARS-CoV-2, the weak complementarity around the ORF9b ATG causes low levels of noncanonical sgRNA to be produced which has high complementarity to the genome, but crucially contains the CTA triplet (Fig. 3g-ii and Supplementary Fig. 10a). This could have led to a transcriptionally driven recombination event, so-called copy choice, where RdRp uses the positive-stranded sgRNA as a template, and because of complementarity between the nascent negative-strand switches templates to the positive-stranded genome after the CTA mutation, resulting in the GAT > CTA mutation in the genomic sequence (Fig. 3g-iv). Copy choice recombination between subgenomic RNA and the genomic RNA has previously been suggested as a method of recombination in viruses that employ discontinuous transcription 18 . This mutation creates a new TRS-B-like site with increased complementarity in B.1.1.7 viruses, driving higher sgRNA production from this locus, akin to sites of canonical sgRNA production (Fig. 3g-iii). A similar event has been described previously more than 3′ in the N ORF 15 , and mutations which result in increased complementarity and noncanonical sgRNA formation have been shown to occur in coronaviruses 19 . This raises the possibility that an ORF9b protein could be produced independently of other upstream sgRNAs with greater efficiency in B.1.1.7. ORF9b has been shown to regulate interferon responses [20][21][22] and has been shown to be present in SARS-CoV 23 (Fig. 3f). These findings are supported by another study showing both increased ORF9b subgenomic RNA, and higher protein levels of ORF9b (among others) in B.1.1.7 infections 25 .
Resistance to Interferon appears to be significantly increased in B.1.1.7 infections when compared to all other lineages, including variants of concern like B.1.351/beta 26 . Our findings could therefore be important for investigating the greater transmissibility of B.1.1.7 and the role of D3L, in particular, requires urgent experimental validation. We also considered the possibility that this new TRS-B-like site could cause reduced transcription from upstream TRS sites due to early detachment of the transcription complex. The increased abundance of upstream subgenomic RNAs in B.1.1.7 SARS-CoV-2 suggests that this is an unlikely net consequence of this new TRS-B-like site. We explored whether the nucleocapsid D3L mutation has occurred in non-B.1.1.7 lineages in the global COG-UK sequence data set (14 February 2021). Seventy-seven sequences with D3L were noted in a variety of SARS-CoV-2 lineages and appear as homoplasies in a global phylogeny, suggesting that this event may have occurred independently on several occasions ( Supplementary Fig. 11).

Conclusions
Taken together, our data suggests that sgRNA abundance measurements from existing ARTIC Nanopore sequencing data can be used in real-time to examine the effect of SARS-CoV-2 variation on its ability to express its genome. These changes in expression are often represented in the amount of protein produced for these open reading frames, likely changing the phenotype of the virus. We cannot say if the increased sgRNA abundance is the cause or consequence of an increase in viral replication or a more efficacious entry, but our study provides further insight to guide exploration with mechanistic studies. A major advantage of this approach is that we can deconvolute the contribution of genomic and subgenomic RNA, which is impossible with current diagnostic PCR assays and we can, additionally, examine all ORFs simultaneously and discover noncanonical subgenomic RNA which could be of biological relevance. Finally, we believe that sgRNA abundance analysis should be carried out on all compatible genomic surveillance platforms to give an instant readout of altered abundance in emerging SARS-CoV-2 variants. This would use existing data to complement epidemiological and phylodynamic methods, and provide an early warning of variants that might be of concern with regard to greater transmissibility and/or disease severity.

Methods
Diagnostic SARS-CoV-2 Testing. SARS-CoV-2 positivity was determined from nose/throat swabs diagnostically by Sheffield Teaching Hospitals NHS Foundation trust either using the Hologic Aptima TM SARS-CoV-2 assay (Panther Fusion System) to generate relative light units (RLU) 27 or an in house dual E/RdRp realtime PCR assay to generate a cycle threshold (ECT or RCT respectively) 14 . The latter used the following primers and probes: RdRp gene F (GTGTGARATGGTC ATGTGTGGCGG), RdRp gene R (CARATGTTAAASACACTATTAGCATA), RdRp gene P (6-FAM-CAGGTGGAACCTCATCAGGAGATGC-BHQ1), E gene F (ACAGGTACGTTAATAGTTAATAGCGT), E gene R (ATATTGCAGCAGT ACGCACAC A) and E gene P (HEX-ACACTAGCCATCCTTACTGCGCTTCG-BHQ1). Nucleic acid was extracted on the MagnaPure96 platform (Roche Diagnostics Ltd, UK) and 6 µL of extract used to amplify E and RdRP genes on an ABI Thermal Cycler (Applied Biosystems, USA).
Sample preparation, ARTIC network PCR and nanopore sequencing. RNA was extracted from viral transport medium (VTM) with Qiagen QIAamp MinElute Virus Spin Kit (50). Resultant RNA was subject to the ARTIC network 8 tiled amplicon protocol (Supplementary Data File 8) and subsequently sequenced on an Oxford Nanopore Technologies GridION X5. Bases were called with the default basecaller in MinKNOW (currently guppy v4) with -require-both-ends set for demultiplexing. Raw sequencing fastq files were subject to no processing after sequencing to ensure sgRNA leader sequences are retained.
Subgenomic read classification and normalisation. Subgenomic RNAs were classified using periscope 9 v0.0.8a. Briefly, reads containing the leader sequence at their start are identified by a local alignment, the quality of this alignment determines which quality bin sgRNA reads are placed (high quality-HQ, low quality-LQ or low low quality-LLQ). These bins are determined as follows: HQ; local alignment score >50 and read at known ORF site (HQ canonical sgRNA) or not within ORF or primer site (HQ noncanonical sgRNA). LQ; local alignment score <50 but >30 and at known ORF (LQ canonical sgRNA) or not within ORF or primer site (LQ noncanonical sgRNA). LLQ; local alignment score <30 but at known ORF For normalisation, the amplicon from which the sgRNA evidence was generated is determined and a count of genomic reads for this amplicon is used to normalise the raw sgRNA read counts.
Samples were excluded from the subsequent analysis if their consensus coverage was <0.9 and/or they had less than 50,000 mapped reads (we have previously shown that fewer reads produce a less robust analysis).
routinely tested and confirmed to be free of mycoplasma (MycoAlert Mycoplasma Detection Kit (Lonza).
The BetaCoV/Australia/VIC01/2020 strain of SARS-CoV-2 (PANGO lineage B) was obtained from the Victorian Infectious Diseases Reference Laboratory, Melbourne through Public Health England (Colindale), in April 2020. The BetaCoV/England/MIG457/2020 (B.1.1.7) strain of SARS-CoV-2 (B.1.1.7) was obtained from Public Health England (Colindale), in January 2021. B was passaged once on Vero cells and B.1.1.7 was passaged once on VeroE6 + ACE2 + TMPRSS2 cells 28 to generate the stocks used in this study. For these virus propagations, cells were infected (MOI = 0.01) for 1 h at room temperature. The entire flask was frozen at 48 h post infection. Following three freeze-thaw cycles, debris was removed by clarification (centrifugation at 2500 rpm, 10 min) and viral stocks were aliquoted and stored at 80°C. Viral titres were calculated in plaque-forming units (PFU/mL) by standard plaque assay 29,30 . Virus sequences were verified 31 , and no additional mutations were identified in the stocks compared to the parental sequences. All lineage-defining mutations were confirmed to be present.
Infections. The day before infection, A549 + ACE2 + TMPRSS2 cells were seeded at 60% confluency in six-well plates. Cells were washed once with PBS before incubation with virus diluted in sera-free media (MOI = 1), for 1 h at room temperature on a rocking platform. After the inoculum was removed, cells were washed with PBS and 2 mL of media containing 2% foetal bovine serum was added. At 12 and 24 h post infection, cell monolayers were harvested in 250 μL of 2x Laemlli buffer (for protein lysate samples).
Quantitative mass spectrometry analysis. About 20 ml of Laemmli buffer cell lysates from infected cells were alkylated with 25 mM iodoacetamide in the dark for 30 min at 37°C. TEAB was added to a final concentration of 50 mM, and protein was trapped and washed on S-trap micro spin columns (ProtiFi, LLC) according to the manufacturer's instructions. Protein was digested using 5 mg trypsin sequence grade (Pierce) at 47°C for 1 h and 37°C for 1 h. Eluted peptides were dried in a vacuum concentrator and resuspended in 40 ml 0.5% formic acid for LC-MS/MS analysis. Peptides were analysed using nanoflow LC-MS/MS using an Orbitrap Elite (Thermo Fisher) hybrid mass spectrometer equipped with a nanospray source, coupled to an Ultimate RSLCnano LC System (Dionex). Peptides were desalted online using a nano trap column, 75 μm I.D.X 20 mm (Thermo Fisher) and then separated using a 130-min gradient from 3 to 35% buffer B (0.5% formic acid in 80% acetonitrile) on an EASY-Spray column, 50 cm × 50 μm ID, PepMap C18, 2 μm particles, 100 Å pore size (Thermo Fisher). The Orbitrap Elite was operated with a cycle of one MS (in the Orbitrap) acquired at a resolution of 120,000 at m/z 400, with the top 20 most abundant multiply charged (2+ and higher) ions in a given chromatographic window subjected to MS/MS fragmentation in the linear ion trap. An FTMS target value of 1e6 and an ion trap MSn target value of 1e4 were used with the lock mass (445.120025) enabled. Maximum FTMS scan accumulation time of 500 ms and maximum ion trap MSn scan accumulation time of 100 ms were used. Dynamic exclusion was enabled with a repeat duration of 45 s with an exclusion list of 500 and an exclusion duration of 30 s. Raw mass spectrometry data were analysed with MaxQuant version 1.6.10.43 32 . Data were searched against a combined sequence database including the Human and SARS-CoV-2 UniProt reference proteomes using the following search parameters: digestion set to Trypsin/P, methionine oxidation and N-terminal protein acetylation as variable modifications, cysteine carbamidomethylation as a fixed modification, match between runs enabled with a match time window of 0.7 min and a 20-min alignment time window, label-free quantification (LFQ) was enabled with a minimum ratio count of 2, the minimum number of neighbours of 3 and an average number of neighbours of 6. A protein FDR of 0.01 and a peptide FDR of 0.01 were used for identification level cut-offs based on a decoy database searching strategy. SARS-CoV-2 protein identification and quantification data were extracted and normalised to the levels of proteins encoded by the ORF1ab gene. Nsp1, Nsp2 and Nsp3 were quantified across all replicate samples at each time point, and strain and, therefore, the summed intensity of these proteins were used as a normalisation factor for other SARS-CoV-2 proteins.
Statistics and reproducibility. Statistical analysis was performed in R 33 . Figures were generated in R using tidyverse 34 (Supplementary Data File 9), apart from those that depict sequencing reads, which were generated in IGV 35 . Tests between groups were performed using an unpaired Wilcoxon test using the rstatix package (https://github.com/kassambara/rstatix), adjusting p values for any multiple comparisons using the Holm method. All boxplots shown have median as the centre line, upper and lower lines as 75th and 25th percentile respectively, outliers are not specifically shown as all points are represented. Whiskers represent 1.5x interquartile range.
Phylogenetic tree generation. The grapevine pipeline (https://github.com/COG-UK/grapevine) was used for generating the phylogeny based on all data available on GISAID and COG-UK up until 16 February 2021. A representative sample of global sequences was obtained in two steps. First by randomly selecting one sequence per country per epi week followed by random sampling of the remaining sequences to generate a sample of 4000 sequences. The global tree was then pruned using code adapted from the tree-manip package (https://github.com/ josephhughes/tree-manip). We then identified samples with D3L mutations and colour coded these tips according to their lineages. The visualisation was produced using R/ape, R/ggplot2, R/ggtree, R/treeio, R/phangorn, R/stringr, R/dplyr, R/aplot.
Ethics approval and consent. Individuals presenting with active COVID-19 disease were sampled for SARS-CoV-2 sequencing at Sheffield Teaching Hospitals NHS Foundation Trust, UK using samples collected for routine clinical diagnostic use. This work was performed under approval by the Public Health England Research Ethics and Governance Group for the COVID-19 Genomics UK consortium (R&D NR0195). Approval was provided to undertake viral sequencing on residual clinical diagnostic samples and analysis of anonymised data without the individual patients' consent.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All SARS-CoV-2 consensus sequences that are of high enough quality are available on GISAID and ENA and from https://www.cogconsortium.uk/data/. All sgRNA abundance data were provided as supplementary data at https://github.com/sheffield-bioinformaticscore/periscope-variants-publication. All raw sequencing data are available on ENA under the accession PRJEB48895. The mass spectrometry proteomics data has been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD029954. All supplementary data files.