Main

RNA structural analysis by chemical probing is powerful, but it has an intrinsic limitation: it provides only an averaged measurement of the base reactivities of all the coexisting conformations that an RNA species samples in a biological sample1,2. The development of mutational profiling has enabled multiple single-stranded residues to be recorded as mutations within the same complementary DNA product3,4. Recently, an expectation–maximization approach for the deconvolution of alternative conformations from mutational profiling experiments (DREEM, detection of RNA folding ensembles using expectation–maximization) has been developed5. Although powerful in principle, this method has two major limitations: (1) the maximum number of RNA conformations that can be searched for is user defined (two by default), to reduce the risk of overestimating the number of conformations (overclustering, a common problem with expectation–maximization approaches); and (2) it can handle only experiments in which the sequencing reads cover the entire length of the target RNA. Although DREEM can, in theory, be applied to longer transcripts by manual window sliding, it cannot automatically merge overlapping RNA segments, a non-trivial computational problem, which makes it poorly suited to transcriptome-scale analyses.

To address these issues, we introduce DRACO (v1.0), a method for the deconvolution of alternative RNA conformations from mutational profiling experiments, based on a combination of spectral clustering and fuzzy clustering (Supplementary Note 1). DRACO analysis, as illustrated in Extended Data Fig. 1, is performed (by default) using sliding windows with a size equal to 90% of the median read length and an offset of 5% of the median read length. Spectral clustering, previously proposed as a suitable approach for the identification of structurally heterogeneous regions from mutational profiling experiments6, is performed for each window, which enables the automatic identification of the optimal number of conformations (clusters). DRACO then merges overlapping windows that have the same number of clusters and reconstructs overall mutational profiles. Consecutive sets of windows that have different numbers of clusters are reported separately by DRACO. To validate the algorithm, we generated in silico dimethyl sulfate (DMS) mutational profiling reads of different lengths (50–150 nucleotides), for 10 sets of 100 RNAs of increasing length (300–1,500 nucleotides), designed to form up to four distinct conformations. DMS-induced mutations in reads were modeled with a binomial distribution, which closely approximated the observed distribution in a previously published dataset6 (Supplementary Figs. 1,2). Analyses of in silico data (Supplementary Figs. 3–14) showed that DRACO accuracy relies on two main factors, read length and depth of coverage, which can be explained by DRACO’s dependency on co-mutation information. Although greater depths of coverage partially compensated for the reduced amount of mutational information in shorter reads, the best results were obtained with a read length of 150 nucleotides and a minimum depth of coverage of 5,000×. Under these conditions, DRACO correctly identified the expected number of conformations in nearly 100% of cases (Fig. 1a and Extended Data Fig. 2), accurately deconvoluted the individual conformation mutational profiles (median Pearson correlation coefficient, PCC > 0.85; Fig. 1b) and precisely estimated relative conformation stoichiometries (PCC ≈ 0.99; Fig. 1c).

Fig. 1: In silico validation of DRACO.

a, Maximum number of conformations detected for 10 sets of 100 simulated 300-nt-long RNAs expected to form 1–4 conformations, at a depth of coverage of 5,000× and a read length of 150 nt. (Extended Data Fig. 2 shows the maximum number of conformations detected for RNAs with length ranging from 600 to 1,500 nt.) The individual data points represent the mean of each set. b, Box plot of median PCC of reconstructed reactivity profiles for the 10 sets. When DRACO detected more than one window with different numbers of clusters, only the largest window, spanning >50% of the RNA length, was used. Boxes span the 25th–75th percentiles. The horizontal center line represents the median. Whiskers span from the 25th percentile − 1.5-fold the interquartile range (IQR) to the 75th percentile + 1.5-fold the IQR. Data points falling outside this range represent outliers and are reported as dots. c, Violin plot of the distribution of expected versus reconstructed conformation abundances for 10 sets of 100 simulated 300-nt-long RNAs, expected to form 2 conformations with varying relative abundances. When multiple windows were detected, only the largest window was used. Whiskers span the 25th–75th percentiles. The central dot represents the median. nt, nucleotide; R, Pearson correlation.

Given that in silico-generated data might not completely capture the complexity of real DMS mutational profiling with sequencing (DMS-MaPseq) experiments, we then tested DRACO using published in vitro data for the Escherichia coli cspA 5′ untranslated region (UTR)7. The cspA 5′ UTR acts as an RNA thermometer, regulating the accessibility of the Shine–Dalgarno sequence in response to temperature changes, switching between a translationally repressed conformation at 37 °C and a translationally competent conformation at 10 °C (ref. 8). After mapping DMS-MaPseq data from in vitro folding experiments at 10 °C and 37 °C, reads from the two experiments were pooled at different percentages and analyzed using DRACO (Extended Data Fig. 3). DRACO successfully reconstructed the expected reactivity profiles with high accuracy, even with a conformation abundance of only 10% (PCC = 0.88). Interestingly, the cspA protein can act as an RNA chaperone on its own 5′ UTR, mediating the switch from the 10 °C translationally competent conformation to the 37 °C translationally repressed conformation. In the same study7, the cspA 5′ UTR was folded and probed at 10 °C, in the presence of increasing concentrations of the cspA protein. In the presence of 0.1 mM cspA the conformation of the 5′ UTR resembled that observed at 37 °C (ref. 7). The use of 50 µM cspA protein resulted in a reactivity profile that only partially correlated with either the 10 °C or 37 °C conformations. Strikingly, analysis with DRACO identified two nearly equimolar conformations (51.4% and 48.6%, respectively; Fig. 2a), the profiles of which were highly correlated with the 37 °C and the 10 °C conformations (PCC = 0.85 and 0.83, respectively; Fig. 2b).
Data-driven RNA structure prediction using these profiles produced secondary structure models nearly identical to the reference structures expected for the 10 °C and 37 °C conformations (positive predictive value, PPV, 1.00 and 0.91; sensitivity, 0.87 and 0.97, respectively; Fig. 2c). We further analyzed a recently published DMS-MaPseq dataset, originally generated to validate the DREEM algorithm5, by probing the structure of the adenosine deaminase riboswitch encoded by the add gene from Vibrio vulnificus, either in the absence or presence of 5 mM adenine. Although DREEM identified three conformations under both conditions5, DRACO showed that a single conformation is present in the absence of adenine, and that the addition of adenine triggers the conformation switch towards the translation-competent conformation on ~65.6% of the RNA molecules (Fig. 2d). The remaining ~34.4% represent instead the translation-incompetent conformation, as demonstrated by the high correlation to the adenine-free sample (PCC = 0.96, Fig. 2e), as well as by the agreement between the predicted and the expected secondary structures of the two conformations (Fig. 2f). These results support the robustness of the DRACO algorithm, as well as its lower propensity to overclustering, as compared with expectation–maximization-based approaches (Supplementary Note 2). Encouraged by these results, we applied DRACO to the analysis of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA genome structure. In a recent report, we defined the secondary structure of the full SARS-CoV-2 genome using SHAPE-MaP (selective 2′-hydroxyl acylation analyzed by primer extension and mutational profiling)9. Although powerful, that approach was limited to the analysis of regions folding into a single well-defined conformation. We therefore queried (in two independent experiments) the full in vitro refolded SARS-CoV-2 genome using DMS-MaPseq analysis. 
Sequencing and assembly of 150-nucleotide-long paired-end reads produced more than 2.2 × 10⁷ mappable fragments, resulting in a median coverage of ~9.9 × 10⁴ per experiment (Supplementary Fig. 15a,b). The data showed exceptional between-experiment correlation (PCC = 0.99, Supplementary Fig. 15c), and were in agreement with conserved SARS-CoV-2 structures9 (Supplementary Fig. 16). In each of the two experiments, analysis with DRACO unambiguously identified 22 windows, which accounted for ~15.5% of the SARS-CoV-2 genome, that folded consistently into two conformations (Supplementary Fig. 17a). We observed an exceptional overall correlation of reactivity profiles for reconstructed conformations across experiments (PCC = 0.86; Supplementary Fig. 17b), as well as highly consistent relative conformation abundance estimates (Supplementary Fig. 17c), with an average variation of only ±1.9%. On inspection of the distribution of these windows, we noticed an enrichment of 50% at open reading frame (ORF) and at peptide boundaries (11 out of 22 windows) as compared with the ~19% expected by chance (P = 1.0 × 10⁻³, one-sided binomial test; see Methods). One of these windows spanned the ORF1a–ORF1b boundary, overlapping with the frameshifting element (FSE, positions 13369–13542; Supplementary Fig. 18). Importantly, the data do not support the existence of a pseudoknotted structure at the level of the FSE. Instead, this region is likely to fold into either a single extended stem-loop structure or two stem-loop structures. This observation is further supported by a recently proposed structure analysis, using DMS-MaPseq, of the SARS-CoV-2 genome in living infected host cells10. It is conceivable that the identified RNA switches might be involved in controlling either the translation of SARS-CoV-2 proteins, or the discontinuous transcription of subgenomic messenger RNAs (or both), but additional experiments will be needed to investigate their functional relevance.
One of the identified windows encompassed the 3′ UTR (positions 29546–29767), and showed consistent abundance estimates and reactivity profiles for the two identified conformations across the two experiments (Fig. 3a). The major conformation (63.4 ± 1.7%) had a reactivity pattern compatible with the known phylogenetically inferred 3′ UTR structure of sarbecoviruses, while the minor conformation (36.6 ± 1.7%) was predicted to form an alternative three-way junction, sequestering both the bulged stem-loop (BSL) and distal stem-loop (designated P2) helices (Fig. 3b). We further evaluated the conservation of this alternative conformation using an approach we recently developed to automatically identify regions of the SARS-CoV-2 genome that show significant covariation9 (Methods). Only those coronavirus sequences supporting both 3′ UTR conformations were retained in order to evaluate the covariation (Methods), and significant covariation was then identified in this alternative 3′ UTR three-way junction conformation (Fig. 3c), hinting at its functional relevance. When the same analysis was performed on the two conformations independently, additional significantly covarying base pairs were detected (Supplementary Fig. 19). Re-analysis of a recently published dataset of RNA–RNA interaction capture in SARS-CoV-2-infected cells11 with COMRADES (crosslinking of matched RNAs and deep sequencing)12 provided support for the presence of both conformations in vivo. Altogether, these data demonstrate the ability of DRACO to capture otherwise hidden structural features, and reveal the presence of a conserved RNA switch at the level of a regulatory region in the SARS-CoV-2 genome.

Fig. 2: In vitro validation of DRACO.

a, Original DMS-MaPseq profile and DRACO-deconvoluted profiles for the cspA 5′ UTR folded at 10 °C in the presence of 50 µM cspA recombinant protein, from Zhang et al., 2018 (ref. 7). Schematic representations of the structures are reported together with the estimated relative abundances. b, Heatmap of PCCs showing the correlation between the conformations deconvoluted by DRACO and the reactivity profiles of the cspA 5′ UTR folded at either 10 °C or 37 °C, in the absence of the recombinant cspA protein. c, Arc plots depicting the secondary structure inferred from the DRACO-deconvoluted profiles, as compared to the reference cspA 5′ UTR structures at 10 °C and 37 °C. d, DRACO-deconvoluted profiles for the V. vulnificus add riboswitch, in the absence (one conformation detected) or presence (two conformations detected) of 5 mM adenine, from Tomezsko et al., 2020 (ref. 5). Schematic representations of the structures are reported together with the estimated relative abundances. e, Heatmap of PCCs showing the correlation between the conformations deconvoluted by DRACO. f, Arc plots depicting the secondary structure inferred from the DRACO-deconvoluted profiles, as compared to the reference add structure in the absence of adenine.

Fig. 3: A conserved structural switch in the 3′ UTR of SARS-CoV-2.

a, Heat scatter plot of base reactivities for DRACO-deconvoluted reactivity profiles, in experiment 1 versus experiment 2, for conformations A and B. b, Secondary structure models with overlaid base reactivities for conformations A and B. Base pairs whose existence is supported by significant enrichment of RNA–RNA chimeras from in vivo COMRADES analysis (Ziv et al., 2020; ref. 11) are boxed in light blue. c, Structure models for conformations A and B, inferred by simultaneous phylogenetic analysis. Structures have been generated using R2R v1.0.6. Base pairs showing significant covariation (as determined by R-scape v1.4.0) are boxed in green (E < 0.05) or violet (E < 0.1). HVR, hypervariable region; s2m, stem-loop II-like motif.

We anticipate that DRACO will enable exploration of the RNA structurome at unprecedented resolution, and the identification of transient and dynamic features of cellular transcriptomes.

Methods

DRACO algorithm

The DRACO algorithm is implemented in C++ and exploits the Armadillo library (http://arma.sourceforge.net), built on top of the BLAS (Basic Linear Algebra Subprograms, http://www.netlib.org/blas/) and LAPACK (Linear Algebra Package, http://www.netlib.org/lapack/) libraries for fast matrix manipulation and eigenvalue decomposition. As input, DRACO takes Mutation Map (MM) format files. These files store the relative coordinates of mutations for each read mapping on a given transcript and can be generated by processing a Sequence Alignment/Map (SAM, or its compressed, binary version, BAM) alignment file with the rf-count tool of RNA Framework v2.7.5 (ref. 13) (parameter: -mm). With default parameters, DRACO takes ~8–10 h, on a single thread, to analyze ~17 million reads mapping to the SARS-CoV-2 genome. This runtime should not be taken as representative of DRACO’s performance, because it is biased by the non-uniform distribution of reads across the SARS-CoV-2 genome: a large fraction of the reads map at the 5′ and 3′ ends, which extends the computation time when analyzing genome boundaries. A complete description of the algorithm, including pseudocode, is provided in Supplementary Note 1.

In silico generation of DMS-MaPseq data

One thousand RNA sequences with an average A or C content of 50% and varying lengths (300, 600, 900 or 1,500 nucleotides) were randomly generated. DMS modification profiles for one to four different conformations were then generated by randomly setting as single-stranded ~30% of the A or C residues. This fraction of single-stranded A or C residues represents an underestimate of what is expected for real RNAs (~51.3% of single-stranded A or C residues for E. coli 16S and 23S ribosomal RNAs). Mutated reads matching these modification profiles were then generated (in MM format) to obtain a median depth of coverage per base of 2,000×, 5,000×, 10,000× or 20,000×, using the generate_mm tool (available from the DRACO repository). The distribution of DMS-induced mutations in reads was empirically learnt from a previously published dataset6 (Supplementary Fig. 1) and was well approximated by a binomial distribution, with P = 0.01927 and n = length of the transcript (Supplementary Fig. 2).
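The generation procedure can be sketched in Python as follows. This is an illustration only and is not the generate_mm tool: the function names are hypothetical, and both the placement of mutations (uniformly on the single-stranded A or C positions covered by a read) and the capping of the mutation count are assumptions made for the example. The number of binomial trials is left to the caller.

```python
import random

P_MUT = 0.01927  # binomial success probability reported in the text


def random_rna(length, rng):
    """Random sequence; equiprobable bases give ~50% A or C on average."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def random_profile(seq, fraction, rng):
    """Mark ~`fraction` of the A/C positions as single-stranded (reactive)."""
    ac_positions = [i for i, base in enumerate(seq) if base in "AC"]
    return set(rng.sample(ac_positions, round(fraction * len(ac_positions))))


def binomial(n, p, rng):
    """Draw from Binomial(n, p) as the sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate_read(seq, unpaired, read_len, n_trials, rng):
    """Return (start, mutated positions) for one read (0-based coordinates)."""
    start = rng.randrange(len(seq) - read_len + 1)
    candidates = [i for i in range(start, start + read_len) if i in unpaired]
    # Assumption: the mutation count is capped by the reactive sites in the read.
    n_mut = min(binomial(n_trials, P_MUT, rng), len(candidates))
    return start, sorted(rng.sample(candidates, n_mut))


rng = random.Random(0)
seq = random_rna(300, rng)
unpaired = random_profile(seq, 0.30, rng)
start, mutations = simulate_read(seq, unpaired, 150, len(seq), rng)
print(start, mutations)
```

Reads for several conformations would be produced by calling simulate_read with a different profile per conformation, in proportion to the desired abundances.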

Analysis of in silico-generated DMS-MaPseq data

In silico-generated MM files were analyzed using DRACO (parameters: --set-all-uninformative-to-one --set-uninformative-clusters-to-surrounding --max-collapsing-windows <variable> --first-eigengap-threshold 0.9). Because A and C residues are non-uniformly distributed along transcripts, certain regions of the RNA can give rise to reads bearing a lower mutational information content, possibly leading to a local under- (or over-) estimate of the number of conformations. To account for this, DRACO can ignore a small set of windows (the number of which is controlled by the --max-collapsing-windows parameter) showing a discordant number of conformations with respect to surrounding windows. Given that the window size is determined by the read length (by default, 90% of the median read length), the number of discordant windows is expected to increase with decreasing read lengths. Therefore, the --max-collapsing-windows parameter was linearly decreased from 5 to 2 with increasing read lengths from 50 to 150 nucleotides. By default, windows slide by 5% of the median read length, hence these --max-collapsing-windows values imply that only 12.5 (for 50 nucleotide reads) to 15 bases (for 150 nucleotide reads) are ignored in such situations.
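The arithmetic behind these figures can be checked with a short Python sketch. The function names are hypothetical, and the sketch works with fractional values; how DRACO rounds window sizes, offsets or intermediate parameter values is not stated here and is not modeled.

```python
def window_geometry(median_read_len, size_frac=0.90, offset_frac=0.05):
    """Window size and sliding offset as fractions of the median read length."""
    return size_frac * median_read_len, offset_frac * median_read_len


def max_collapsing_windows(read_len, low=(50, 5), high=(150, 2)):
    """Linear interpolation between the two end points given in the text."""
    (x0, y0), (x1, y1) = low, high
    return y0 + (y1 - y0) * (read_len - x0) / (x1 - x0)


def ignored_bases(read_len):
    """Bases spanned by the largest run of discordant windows that is ignored."""
    _, offset = window_geometry(read_len)
    return max_collapsing_windows(read_len) * offset


for read_len in (50, 150):
    size, offset = window_geometry(read_len)
    print(read_len, size, offset, ignored_bases(read_len))
```

For 50-nucleotide reads the offset is 2.5 bases and five windows may be collapsed (12.5 bases); for 150-nucleotide reads the offset is 7.5 bases and two windows may be collapsed (15 bases).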

Analysis of DMS-MaPseq data

All of the relevant analysis steps, from reads alignment to data normalization and structure modeling, were performed using RNA Framework13. All tools referenced in the following paragraphs are distributed as part of the RNA Framework suite (https://github.com/dincarnato/RNAFramework). Specific analysis parameters are detailed in the respective paragraphs.

Optimization of folding parameters

For structure predictions, optimal slope (2.4) and intercept (−0.2) values were identified using jackknifing, with a DMS-MaPseq dataset for ex vivo deproteinized E. coli rRNAs that we previously published14 (accession, SRR8172706), the rf-jackknife tool (parameters: -rp ‘-md 600 -nlp’ -x) and ViennaRNA v2.4.14 (ref. 15).
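For orientation, slope and intercept conventionally enter reactivity-guided folding through a per-nucleotide pseudo-free-energy term of the form slope × ln(reactivity + 1) + intercept. The Python sketch below assumes that this standard form is the one being optimized here; the function name is hypothetical.

```python
import math


def pseudo_energy(reactivity, slope=2.4, intercept=-0.2):
    """Pseudo-free-energy change (kcal/mol) applied to one nucleotide.

    Assumes the conventional slope * ln(reactivity + 1) + intercept form.
    """
    return slope * math.log(reactivity + 1.0) + intercept


for r in (0.0, 0.2, 1.0):
    print(r, round(pseudo_energy(r), 3))
```

With these values, unreactive bases receive a small stabilizing bonus (the intercept) and highly reactive bases receive an increasing penalty for pairing.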

Analysis of cspA 5′ UTR DMS-MaPseq data

Reads for DMS-MaPseq data of in vitro folded cspA 5′ UTR at 37 °C and 10 °C were obtained from the Sequence Read Archive (accessions, SRR6123773 and SRR6123774) and mapped to the first 171 bases of the cspA transcript using the rf-map tool (parameters: -cq5 20 -cqo -mp ‘--very-sensitive-local’). Given that a lower fraction of reads aligned to the reference for the experiment conducted at 37 °C, we chose to randomly shuffle the BAM file from the experiment conducted at 10 °C, and a matching number of reads was extracted. Resulting BAM files for both samples were then randomly shuffled, and reads were extracted and combined to achieve final percentage stoichiometries of 90:10, 80:20, 70:30, 60:40 and 50:50 of the 10 °C and 37 °C conformations, respectively. Resulting BAM files were then analyzed with the rf-count tool to produce MM files (parameters: -m -mm -ds 75 -na -ni -md 3). MM files were analyzed with DRACO (parameters: --max-collapsing-windows 3 --set-all-uninformative-to-one --min-cluster-fraction 0.1 --set-uninformative-clusters-to-surrounding) and deconvoluted mutation profiles were extracted from the resulting JSON (JavaScript Object Notation) files and converted into RNA Framework’s RNA Count (RC) format. Starting from RC files, normalized reactivity profiles were obtained by first calculating the raw reactivity scores as the per-base ratio of the mutation count and the read coverage at each position, and then by normalizing values by box-plot normalization, using the rf-norm tool (parameters: -sm 4 -nm 3 -rb AC -mm 1 -n 1000). Data-driven RNA structure inference was performed using the rf-fold tool and the normalized reactivity profiles (parameters: -sl 2.4 -in -0.2 -nlp). DMS-MaPseq data for the cspA 5′ UTR folded in the presence of 50 µM cspA protein (accession, SRR6507969) were analyzed using the same parameters.
Comparison between the deconvoluted conformations and the cspA 5′ UTR folded at either 10 °C or 37 °C was performed using the rf-compare tool.
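The pooling of reads from the two temperatures can be illustrated with the Python sketch below. It is a simplification under stated assumptions: reads are treated as opaque in-memory objects rather than BAM records, and the function name and the toy read labels are hypothetical.

```python
import random


def pool_reads(reads_a, reads_b, fraction_a, total, rng):
    """Draw `total` reads, a `fraction_a` share from A and the rest from B."""
    n_a = round(total * fraction_a)
    n_b = total - n_a
    if n_a > len(reads_a) or n_b > len(reads_b):
        raise ValueError("not enough reads to reach the requested mix")
    # Sampling without replacement mimics shuffling and taking the first reads.
    pooled = rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)
    rng.shuffle(pooled)
    return pooled


# Toy reads: equal numbers from each experiment, as in the matched extraction.
reads_10c = [("10C", i) for i in range(1000)]
reads_37c = [("37C", i) for i in range(1000)]
for pct_10c in (90, 80, 70, 60, 50):
    pooled = pool_reads(reads_10c, reads_37c, pct_10c / 100, 1000,
                        random.Random(pct_10c))
    print(pct_10c, sum(label == "10C" for label, _ in pooled))
```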

Analysis of V. vulnificus add riboswitch DMS-MaPseq data

DMS-MaPseq data for the add riboswitch from V. vulnificus, in vitro folded either in the presence or absence of 5 mM adenine, were obtained from the Sequence Read Archive (accessions, SRR10850890 and SRR10850891). Forward and reverse reads were merged, prior to mapping, using PEAR (Illumina paired-end read merger) v0.9.11 (ref. 16), and then mapped to the add riboswitch using the rf-map tool (parameters: -cq5 20 -cqo -ctn -cmn 0 -mp ‘--very-sensitive-local’). Resulting BAM files were then analyzed with the rf-count tool to produce MM files (parameters: -m -mm -na -ni). MM files were analyzed with DRACO (parameters: --max-collapsing-windows 1 --set-all-uninformative-to-one --set-uninformative-clusters-to-surrounding), and deconvoluted mutation profiles were extracted from the resulting JSON files and converted into RC format. Starting from RC files, normalized reactivity profiles were obtained by first calculating the raw reactivity scores as the per-base ratio of the mutation count and the read coverage at each position, and then by normalizing values by box-plot normalization, using the rf-norm tool (parameters: -sm 4 -nm 3 -rb AC -mm 1 -n 1000). Data-driven RNA structure inference was performed using the rf-fold tool and the normalized reactivity profiles (parameters: -sl 2.4 -in -0.2 -nlp).
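The two-step reactivity calculation can be sketched in Python as follows. This is not the rf-norm implementation: the function names are hypothetical, and the normalization shown is the commonly used box-plot recipe (values more than 1.5 × the interquartile range above the 75th percentile are treated as outliers, and profiles are scaled by the mean of the top 10% of the remaining values), which is assumed here; rf-norm's exact handling of outliers and of its other parameters may differ.

```python
import statistics


def raw_reactivity(seq, mutations, coverage, bases="AC"):
    """Per-base mutation count / coverage; None for non-probed or uncovered bases."""
    return [
        m / c if b in bases and c > 0 else None
        for b, m, c in zip(seq, mutations, coverage)
    ]


def boxplot_normalize(raw):
    """Scale by the mean of the top 10% of values left after outlier removal."""
    values = sorted(v for v in raw if v is not None)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    cutoff = q3 + 1.5 * (q3 - q1)
    kept = [v for v in values if v <= cutoff]
    top = kept[-max(1, len(kept) // 10):]
    factor = statistics.mean(top)
    if factor == 0:
        raise ValueError("cannot normalize an all-zero profile")
    return [None if v is None else v / factor for v in raw]


seq = "ACGA"
raw = raw_reactivity(seq, [10, 20, 5, 40], [1000, 1000, 1000, 1000])
print(boxplot_normalize(raw))
```

Outliers are excluded only from the scaling factor, so normalized values above 1 remain possible.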

Cell culture and SARS-CoV-2 infection

Vero E6 cells were cultured in DMEM (Lonza, 12-604F), supplemented with 8% FCS (Bodinco), 2 mM L-glutamine, 100 U ml⁻¹ penicillin and 100 µg ml⁻¹ streptomycin (Sigma Aldrich, P4333-20ML) at 37 °C, in the presence of 5% CO2. Cells were passaged twice and then infected at a multiplicity of infection of 1.5 with SARS-CoV-2/Leiden-0002 (GenBank accession, MT510999). Infections were performed in EMEM (Lonza, 12-611F) supplemented with 25 mM HEPES, 2% FCS, 2 mM L-glutamine and antibiotics. Sixteen hours after infection, cells were trypsinized, resuspended in EMEM supplemented with 2% FCS, and then pelleted and washed with 50 ml 1X PBS.

Total RNA extraction and in vitro folding

Approximately 5 × 10⁶ SARS-CoV-2-infected cells were resuspended in 1 ml TriPure Isolation Reagent (Sigma Aldrich, 11667157001). After adding 0.2 volumes chloroform and vigorously vortexing for 15 s, the sample was centrifuged for 15 min at 12,500×g (4 °C). The upper aqueous phase was mixed with 1 ml 100% ethanol, and then loaded on an RNA Clean & Concentrator-25 column (Zymo Research, R1017). In vitro folding was carried out as previously described9,14. In brief, ~7.5 μg RNA was subjected to rRNA depletion using the RiboMinus Eukaryote System v2 (ThermoFisher Scientific, A15026). The rRNA-depleted RNA was denatured at 95 °C for 2 min, then transferred immediately to ice for 1 min. Ice-cold 5X RNA folding buffer (500 mM HEPES pH 7.9; 500 mM NaCl) supplemented with 20 U SUPERase•In RNase Inhibitor (ThermoFisher Scientific, AM2696) was then added, and RNA was incubated for 10 min at 37 °C to enable secondary structure formation. MgCl2 was then added to a final concentration of 10 mM and the sample was further incubated for 20 min at 37 °C.

Probing of SARS-CoV-2 RNA

For probing of RNA, DMS was pre-diluted in a ratio of 1:6 in 100% ethanol and added to a final concentration of 150 mM. Samples were then incubated at 37 °C for 2 min. Reactions were quenched by the addition of 1 volume dithiothreitol 1.4 M and then purified on an RNA Clean & Concentrator-5 column (Zymo Research, R1013).

DMS-MaPseq analysis of SARS-CoV-2 RNA

DMS-MaPseq of SARS-CoV-2 was conducted as previously described14, with minor changes. First, probed RNA was fragmented to a median size of 150 nucleotides by incubation at 94 °C for 8 min in RNA fragmentation buffer (65 mM Tris-HCl pH 8.0, 95 mM KCl, 4 mM MgCl2), then purified with NucleoMag NGS Clean-up and Size Select beads (Macherey Nagel, 744970), supplemented with 10 U SUPERase•In RNase Inhibitor and eluted in 8 μl NF H2O. Eluted RNA was supplemented with 1 μl 50 μM random hexamers and 2 μl deoxyribonucleoside triphosphates (10 mM each), then incubated at 70 °C for 5 min and immediately transferred to ice for 1 min. Reverse transcription reactions were conducted in a final volume of 20 μl. Reactions were supplemented with 4 μl 5X RT buffer (250 mM Tris-HCl pH 8.3, 375 mM KCl, 15 mM MgCl2), 1 μl dithiothreitol 0.1 M, 20 U SUPERase•In RNase Inhibitor and 200 U TGIRT-III Enzyme (InGex, TGIRT50). Reactions were incubated at 25 °C for 10 min to enable partial primer extension, followed by 2 h at 57 °C. TGIRT-III was degraded by the addition of 2 μg Proteinase K (Sigma Aldrich, P2308), followed by incubation at 37 °C for 20 min. Proteinase K was inactivated by addition of Protease Inhibitor Cocktail (Sigma Aldrich, P8340). Reverse transcription reactions were then used as input for the NEBNext Ultra II Non-Directional RNA Second Strand Synthesis Module (New England Biolabs, E6111L). Second strand synthesis was performed by incubation for 1 h at 16 °C, as per the manufacturer’s instructions. Double-stranded DNA was purified using NucleoMag NGS Clean-up and Size Select beads and used as input for the NEBNext Ultra II DNA Library Prep Kit for Illumina, following the manufacturer’s instructions.

Analysis of SARS-CoV-2 DMS-MaPseq data

Following sequencing, samples were demultiplexed using the bcl2fastq v2.20.0.422 utility. After clipping adapter sequences using Cutadapt v2.1 (ref. 17) (parameters: -a AGATCGGAAGAGC -A AGATCGGAAGAGC -O 1 -m 100:100), paired-end reads were merged using PEAR v0.9.11 (ref. 16) and then mapped to the SARS-CoV-2 reference using the rf-map tool and the Bowtie2 algorithm (v2.3.5.1), with soft-clipping enabled (parameters: -b2 -cq5 20 -ctn -cmn 0 -cl 150 -mp ‘--very-sensitive-local’). Alignments in SAM format were sorted and converted to BAM format using Samtools v1.9. An MM file was then generated from the resulting BAM alignment using the rf-count tool, by keeping only those reads that covered at least 150 bases. Insertions and deletions were ignored (because they account for <6% of DMS-induced mutations when using TGIRT-III; ref. 4), and only mutations with a Phred quality score > 20 were considered. Furthermore, mutations were considered only when the two surrounding bases also had a Phred quality score > 20. Reads with more than 10% mutated bases were excluded (parameters: -m -ds 150 -es -nd -ni -mm -me 0.1). DRACO was invoked with default parameters. Following DRACO analysis, windows in which the median coverage (calculated on reads passing DRACO’s filtering) was above 10,000× were selected. To select windows consistently folding into multiple conformations in both experiments, we retained windows predicted to have the same number of conformations in the two experiments, and which overlapped by at least 75% of their length, and considered only their intersection. Deconvoluted reactivity profiles for matching conformations from the two experiments were then averaged and used for secondary structure modeling. The correlation between reconstructed conformations from the two experiments was calculated using 90% of the reactivity values in the window, after exclusion of the first and last 5% of the A and C bases, to avoid terminal bias.
RNA secondary structures for SARS-CoV-2 elements (FSE and 3′ UTR) were generated using VARNA v3.93 (ref. 18).
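The window-matching and trimmed-correlation steps described above can be sketched in Python as follows. Function names and the window representation are hypothetical, and the sketch assumes that the 75% overlap criterion must hold for each of the two windows.

```python
import math


def matched_intersection(win_a, win_b, min_overlap=0.75):
    """Intersection of two (start, end, n_clusters) windows, or None.

    Coordinates are half-open [start, end). A match requires the same number
    of clusters and an overlap of at least `min_overlap` of each window.
    """
    (s1, e1, k1), (s2, e2, k2) = win_a, win_b
    start, end = max(s1, s2), min(e1, e2)
    overlap = end - start
    if k1 != k2 or overlap <= 0:
        return None
    if overlap < min_overlap * (e1 - s1) or overlap < min_overlap * (e2 - s2):
        return None
    return start, end, k1


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def trimmed_pcc(react_a, react_b, trim=0.05):
    """PCC over A/C reactivities after dropping the first and last `trim` share."""
    n = len(react_a)
    cut = int(n * trim)
    return pearson(react_a[cut:n - cut], react_b[cut:n - cut])


print(matched_intersection((100, 300, 2), (120, 310, 2)))
```

The inputs to trimmed_pcc are the reactivities of the A and C bases only, in sequence order, for the same conformation in the two experiments.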

Identification of conserved RNA structure elements

To evaluate the conservation of the alternative 3′ UTR structure, we implemented a modified version of an automated pipeline that we have previously introduced9 (cm-builder; https://github.com/dincarnato/labtools), built on top of Infernal v1.1.3 (ref. 19). In brief, we first built two covariance models from Stockholm files containing only the SARS-CoV-2 sequence and the two alternative 3′ UTR structures, using the cmbuild module. After calibrating the covariance models using the cmcalibrate module, we then used them to search for RNA homologs in a database composed of all of the non-redundant coronavirus complete genome sequences from the ViPR database20 (https://www.viprbrc.org/brc/home.spg?decorator=corona), as well as a set of representative coronavirus genomes from the NCBI database, using the cmsearch module. Only matches from the sense strand were kept and a very relaxed E value threshold of 10 was used at this stage to select potential homologs. Three additional filtering criteria were used. First, we took advantage of the extremely conserved architecture of coronavirus genomes21 and restricted the selection to matches falling at the same relative position within their genome, with a tolerance of 3.5% (corresponding approximately to a maximum allowed shift of 1,050 nucleotides in a 30 kilobase genome). Through this more conservative selection, we kept only matches likely to represent true structural homologs, although at the cost of probably losing some true matches. Second, we filtered out matches retaining less than 55% of the canonical base pairs from the original structure elements. Third, truncated hits covering less than 50% of the structure were discarded. A fourth filtering step was also applied when simultaneously analyzing the two structures, by retaining only the set of sequences matched by both structures. 
The resulting set of homologs was then aligned to the original covariance models using the cmalign module and the resulting alignments were used to build new covariance models. The whole process was repeated three times. The alignment was then refactored, removing gap-only positions and including only bases spanning the first to the last base-paired residue. The alignment file was then analyzed using R-scape 1.4.0 (ref. 22) and average product corrected G-test statistics to identify motifs showing significantly covarying base pairs.
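The hit-filtering criteria can be summarized with the Python sketch below. It is not the cm-builder code: the Hit record, its field names and the function name are hypothetical, the hits are assumed to have been parsed from cmsearch output beforehand, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    genome_len: int    # length of the genome in which the hit was found
    start: int         # hit start within that genome
    evalue: float      # E value reported for the hit
    strand: str        # "+" for the sense strand
    kept_pairs: float  # fraction of canonical base pairs retained
    coverage: float    # fraction of the structure covered by the hit


def passes_filters(hit, ref_start, ref_genome_len, tol=0.035):
    """Apply the strand, E value, position, base-pair and coverage filters."""
    rel_shift = abs(hit.start / hit.genome_len - ref_start / ref_genome_len)
    return (
        hit.strand == "+"
        and hit.evalue <= 10
        and rel_shift <= tol        # same relative position within the genome
        and hit.kept_pairs >= 0.55  # at least 55% of canonical pairs retained
        and hit.coverage >= 0.50    # not a truncated hit
    )


# In a 30 kilobase genome a tolerance of 3.5% allows a shift of ~1,050 nt.
near = Hit(30000, 29000, 1e-5, "+", 0.9, 1.0)
far = Hit(30000, 29100, 1e-5, "+", 0.9, 1.0)
print(passes_filters(near, 28000, 30000), passes_filters(far, 28000, 30000))
```

When the two structures are analyzed simultaneously, the additional step would keep only the sequences for which both structures yield a passing hit.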

Testing for significant overlap with ORF boundaries

To test for significant overlap between the windows folding into two mutually exclusive conformations and the ORF boundaries within the SARS-CoV-2 genome, we generated 10,000 random windows of matching size for each window identified by DRACO. For each DRACO-identified window, as well as for each random window, we calculated the number of windows overlapping the start and end positions of the SARS-CoV-2 ORFs, including each of the individual proteins within the polyprotein ORF1ab (positions 266, 806, 2720, 8555, 10055, 10973, 11843, 12092, 12686, 13025, 13442, 13468, 16237, 18040, 19621, 20659, 21563, 25393, 26245, 26523, 27202, 27394, 27756, 27894, 28274, 29558, 29674). Resulting values were used to perform a one-sided binomial test, with parameters k = 11 (number of windows identified by DRACO, overlapping with ORF boundaries), n = 22 (total number of windows identified by DRACO) and p, the ratio of the number of random windows overlapping with ORF boundaries, divided by the total number of random windows (220,000).
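The test can be sketched in Python as follows. The function names are hypothetical; the genome length and the list of DRACO window sizes are not given here and must be supplied by the caller, and the uniform placement of random windows along the genome is an assumption.

```python
import random
from math import comb

# ORF and mature-peptide boundary positions listed in the text.
BOUNDARIES = [
    266, 806, 2720, 8555, 10055, 10973, 11843, 12092, 12686, 13025, 13442,
    13468, 16237, 18040, 19621, 20659, 21563, 25393, 26245, 26523, 27202,
    27394, 27756, 27894, 28274, 29558, 29674,
]


def overlaps_boundary(start, end, boundaries=BOUNDARIES):
    """True if the closed interval [start, end] contains a boundary position."""
    return any(start <= b <= end for b in boundaries)


def null_overlap_fraction(window_sizes, genome_len, n_random, rng):
    """Fraction of size-matched random windows that overlap a boundary."""
    hits = total = 0
    for size in window_sizes:
        for _ in range(n_random):
            start = rng.randint(1, genome_len - size + 1)
            hits += overlaps_boundary(start, start + size - 1)
            total += 1
    return hits / total


def binomial_upper_tail(k, n, p):
    """One-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# p is the rounded ~19% quoted in the main text, not a recomputed null.
print(binomial_upper_tail(11, 22, 0.19))
```

In the full procedure, p would be the value returned by null_overlap_fraction for the 22 DRACO window sizes with 10,000 random windows each.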

Validation of the alternative SARS-CoV-2 3′ UTR conformation by COMRADES

COMRADES data for the SARS-CoV-2 virus in living infected host cells11 were obtained from the Gene Expression Omnibus (GEO, GSE154662). The dataset consisted of two biological replicates, each one composed of a control (C) and the actual COMRADES sample (S). A reference was built to include all human transcripts from refGene, plus the sequence of the SARS-CoV-2 genome, using STAR (Spliced Transcripts Alignment to a Reference) v2.7.1a (ref. 23) (parameters: --runMode genomeGenerate --genomeSAindexNbases 12), and reads were also aligned to the reference using STAR (parameters: --runMode alignReads --outFilterMultimapNmax 100 --outSAMattributes All --alignIntronMin 1 --scoreGapNoncan -4 --scoreGapATAC -4 --chimSegmentMin 15 --chimJunctionOverhangMin 15). Resulting alignments (as well as chimeric alignments from the junctions file) were filtered, discarding ungapped reads, reads with more than one gap, and reads aligning to the human transcriptome, and the total number of reads per experiment was calculated (Ctot and Stot). Each chimeric read was described as a set of two numeric intervals (I1 and I2), corresponding to the two halves of the chimera. To assess whether a base pair i–j was enriched in the COMRADES sample with respect to the control sample, we calculated the number of reads in which base i overlapped interval I1 and base j overlapped interval I2, for both samples (Ci–j and Si–j). Significance of the enrichment was then assessed using a one-tailed binomial test, with parameters k = Si–j, n = Stot and p = Ci–j / Ctot. Only base pairs with P < 0.05 in both replicates were considered to have in vivo support.
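The per-base-pair enrichment test can be sketched in Python as follows. The function names and the representation of chimeras as pairs of closed intervals are hypothetical, and the tail probability is computed by direct summation, which is adequate for an illustration but slow for realistic read totals (a statistics library would normally be used).

```python
from math import exp, lgamma, log


def count_support(chimeras, i, j):
    """Count chimeras whose interval I1 contains base i and I2 contains base j."""
    return sum(a1 <= i <= b1 and a2 <= j <= b2 for (a1, b1), (a2, b2) in chimeras)


def binomial_upper_tail(k, n, p):
    """One-tailed P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for x in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                   + x * log_p + (n - x) * log_q)
        total += exp(log_pmf)
    return min(total, 1.0)


def enrichment_p(s_ij, s_tot, c_ij, c_tot):
    """P value for enrichment of pair i-j in the sample over the control."""
    return binomial_upper_tail(s_ij, s_tot, c_ij / c_tot)


def supported(p_values, alpha=0.05):
    """In vivo support requires P < alpha in every replicate."""
    return all(p < alpha for p in p_values)


chimeras = [((10, 30), (100, 120)), ((15, 25), (105, 115)), ((40, 60), (200, 220))]
print(count_support(chimeras, 20, 110))
```

A base pair would be reported as supported only if supported() returns True for the P values obtained from the two replicates.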

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.