Dual isoform sequencing reveals complex transcriptomic and epitranscriptomic landscapes of a prototype baculovirus

In this study, two long-read sequencing (LRS) techniques, MinION from Oxford Nanopore Technologies and Sequel from the Pacific Biosciences, were used for the transcriptional characterization of a prototype baculovirus, Autographa californica multiple nucleopolyhedrovirus. LRS is able to read full-length RNA molecules, and thereby distinguish between transcript isoforms, mono- and polycistronic RNAs, and overlapping transcripts. Altogether, we detected 875 transcript species, of which 759 were novel and 116 were annotated previously. These RNA molecules include 41 novel putative protein coding transcripts [each containing 5′-truncated in-frame open reading frames (ORFs), 14 monocistronic transcripts, 99 polygenic RNAs, 101 non-coding RNAs, and 504 untranslated region isoforms. This work also identified novel replication origin-associated transcripts, upstream ORFs, cis-regulatory sequences and poly(A) sites. We also detected RNA methylation in 99 viral genes and RNA hyper-editing in the longer 5′-UTR transcript isoform of the canonical ORF 19 transcript.


Analysis of AcMNPV transcriptome using third-generation sequencing. In this study, PacBio
Sequel and ONT MinION LRS platforms were used to characterize the structure of AcMNPV transcriptome and epitranscriptome (Fig. 1). Sequel sequencing yielded a total of 47,880 Circular Consensus Sequences (CCS), 25,371 of which mapped to the viral genome and 23,884 to the insect host (Sf9 cells) genome. The total read count was less than the sum of the two mapped read counts because of the eliminated chimeric reads formed during library preparation mapped to both of the genomes. The Cap-selected samples yielded a total of 1,830,476 reads, 198,516 of which mapped to the AcMNPV genome and 1,631,960 to the host genome, whereas the non-Cap selected samples yielded 1,119,716 reads, 290,039 mapping to the viral and 760,533 to the host genome. Sequel sequencing yielded longer mean mapped read lengths than the ONT, while the Cap-selected and non-Cap selected ONT reads had similar mapped read lengths ( Table 1). The difference in mean read-length between the two platforms is explained by a step during Sequel library preparation used to mitigate the loading bias of PacBio sequencers, resulting in the loss of short cDNAs. Further details on read counts and read lengths are shown in Supplementary Table 1, and the boxplot of Supplementary Fig. 1 presents the distribution of read lengths per sample.
The LoRTIA software suit (developed in our laboratory) was used to annotate viral transcripts and also our in-house script augmented by manual checking. The criterion of acceptance of TSSs and TESs as true transcript www.nature.com/scientificreports/ ends was to identify them by the LoRTIA software suite in two amplified ONT samples and in another technique that was either sequel or Cap-sequencing. A more stringent criterion was applied for the non-coding transcripts and 5′-truncated transcripts: in addition to 2 amplified ONT samples, Cap-selection had to be confirmed. After the screening, a total of 311 TSSs and 261 TESs (Supplementary Table 2A,B) as well as 13 splice junctions were obtained (Supplementary Table 2C). TATA boxes were identified for 60 TSSs. The mean distance of a TATA box from the TSS was 32 nts. Twenty-two GC boxes were identified, and their average distance from the TSSs was 66 nts. The average distance of the identified 15 CAAT boxes from the TSSs was 108 nucleotides (nts). The canonical CAGT initiators were present in only 6% of TSSs, the TAAG initiators in 61%, and the non-TAAG initiators in 33% of the cases (Fig. 2a). A total of 875 transcript species were detected ( Table 2). The full transcript list is available in Supplementary Table 3A, the abundance of transcripts is available in Supplementary Fig. 2 and Supplementary Table 3B, and the transcripts themselves are depicted in Fig. 3.
Approximately 80% of TESs were found to contain a canonical PAS at an average distance of 27.23 nts upstream from the TES. In line with previous findings describing arthropod polyadenylation signals 51 , the ± 50 nts surrounding of the viral TESs was characterized by A/U-rich sequences with an increased adenine content immediately upstream of the cleavage site. Intriguingly, sequences harboring a PAS showed a slight increase of Figure 1. Workflow of transcriptome and epitranscriptome analyses. The types of kits used for each workflow are shown in the dark blue boxes. Spodoptera frugiperda cells were infected with Autographa californica nuclear polyhedrosis virus (AcMNPV). RNA was isolated from the infected cells, and after poly (A) selection, the samples were sequenced using the Oxford Nanopore technologies (ONT) and the Pacific Biosciences (PacBio) Sequel platforms. In the case of ONT, we used PCR-amplified cDNA-seq (SQK-LSK-108, SQK-PCS-109 kit) and dRNA-seq (SQK-RNA001 kit). ONT dRNA-seq was also applied to determine possible methylation sites using the Tombo software. To confirm the methylation sites, bisulfite conversion was also performed using ONT sequencing. The base-called sequences were aligned with the reference genome using the minimap2 long-read mapping software, followed by the determination of TSS and TES positions of each transcript using the LoRTIA program. The illustration was created with Microsoft PowerPoint 2021 software 47 .  www.nature.com/scientificreports/ adenines between − 26 and − 12 nts upstream from the TES, whereas in the ones without a PAS, this phenomenon could not be observed (Fig. 2b).

UTR isoforms.
In this part of the study, 330 transcript isoforms were found to have longer or shorter 5′-UTRs than the previously annotated transcript encoded by the same genes. Less than half (32.06%) of them showed an E/L initiation region shift when compared to the initiator element (Inr) of their previously annotated transcript. 7.06% of TSS transcript isoforms were revealed to be controlled by TAAG-Inr in the previously annotated isoforms being controlled by non-TAAG-Inr and encoded by the same gene. However, we identified non-TAAG-Inr isoforms in 25% of the transcripts that were previously annotated as TAAG-Inr. This phenomenon suggests that several AcMNPV genes are transcribed by both the host and the viral RNP, resulting in altered 5′-UTR lengths. In addition to the genes described in a previous work 12 , we detected 57 genes, which were transcribed by both the host and the viral RNP (Supplementary Table 4A). Polymorphism in the 5′-UTR length probably has any biological relevance, but we could not exclude that it represents mere transcriptional noise. In many cases, longer 5′-UTRs harbor upstream ORFs (uORFs), which have been shown to alter the translation of the protein coding sequence by ribosome reinitiation, ribosome stalling or disassociation and ribosome bypass 51,52 . We identified 75 gene products containing at least one uORF (Supplementary Table 4B).  www.nature.com/scientificreports/ All in all, 340 novel TES isoforms were identified in this work, 76.35% of which contained a canonical PAS upstream of their 5′-ends. The phenomenon of nontemplated adenine-addition by the viral RNP has previously been described 18 , and this in-vitro study has also suggested the presence of a T-rich termination signal for this enzyme, and nontemplated thymine addition preceding adenine incorporation. In concordance with this work, we found that 51.85% of the 3′-UTR isoforms with a LIS terminated in the near vicinity (± 3 nts from the TES) of a T-rich region. This finding is in contrast with the 22.51% of the 3′-UTR isoforms with non-TAAG-Inr. However, we could not confirm the presence of nontemplated thymines upstream of the poly(A) tail.
The mean read length of transcripts was 1423.7 nts (σ = 913.190) (Supplementary Table 1). Intriguingly, RNAs transcribed by the viral RNP were on an average of 500 nts longer than those transcribed by the host RNP. The mean 5′-UTR length was 153.06 nts (σ = 270.438), and the mean 3′-UTR length was 529.09 nts (σ = 729.266) both measured from the first ORF overlapped by the transcript. The difference was significant for the transcript length and 3′-UTR length, suggesting the tendency of viral RNP to produce longer RNA molecules.

Monocistronic transcripts.
Several AcMNPV genes lack a precise transcript annotation because of the challenges facing SRS when assembling a genomic region with a complex transcriptional overlapping pattern. Using LRS, we annotated 14 novel monocistronic transcript species with single-nucleotide precision (Supplementary Table 3A). Canonical TATA boxes were observed upstream of the TSSs of ORF85 and ORF112-113. These transcripts started from a non-TAAG-Inr and harbored a canonical PAS upstream of their TES. The transcripts coding for DNA polymerase were initiated at a canonical arthropod initiator (GCATA), while helicase was initiated at a similar but non-canonical sequence (GCAATA). Both DNAPOL and HEL harbored a canonical PAS upstream of their TESs. Nine of the transcripts (ORF1629, P47, ORF72, ORF84, 38K, ORF108, PP34, P49 and ORF154) originated at a TAAG-Inr, PP34 with a canonical CAAT sequence (CCA ATC ) 87 nts upstream from its TSS, and five of these transcripts had PASs.
Non-coding transcripts. We detected 101 novel transcript isoform species that did not contain any previously annotated ORFs, two-thirds being longer than 200 nts representing long non-coding RNAs (lncRNAs), while one-third of them fell in the size range of short non-coding RNAs (sncRNAs). We identified 41 sense ncRNAs all of which overlapped a canonical transcript but lacked the stop codons or both the stop codons and a certain part of the 5′-UTRs. Only 10.3% of the ncRNAs had a canonical TATA promoter upstream of their TSSs, whereas 70.5% of them started at a TAAG initiator, which may suggest their late transcription. Twenty-three antisense RNAs (asRNAs) were identified being controlled by their own promoters. These asRNAs were encoded by the complementary DNA strands of 11 genes (Supplementary Table 3A). ORF-60-AS-1 was the only asRNA the promoter of which contained TATA box, whereas 86% of them contained TAAG initiator sequence. All of the ncRNAs were also present in the Cap-selected samples.
Putative 5′-truncated mRNAs with in-frame ORFs. We detected 41 putative novel genes, which produced 5′-truncated version of the canonical mRNAs and contained shorter in-frame ORFs. Nineteen of these transcripts were initiate at TAAG sequence, 9 of which had a previously annotated isoform (EGT, DNAPOL, HCF1, PNK/PNL, HEL, HE65, 94K, IE1 and IE2) initiated at a non-TAAG-Inr, which may suggest that early genes were partially transcribed by the viral RNP at late time-points. Intriguingly, eleven of the previously annotated transcripts (AC-BRO, POLH, ORF19, PP31, ORF66, ORF84, ODV-E25, BV/ODV-C42, ORF117, CHIT, ODV-EC27) originating at a TAAG sequence had 5′-truncated isoforms that were initiated at non-TAAG-Inrs. It implies the transcription of 5′-truncated isoforms of some late genes by the cellular RNP. These transcripts were all present in the Cap-selected samples.
Multigenic transcripts. Several long (more than 760 nts) multigenic RNA molecules were detected in the viral transcriptome. We designated polycistronic transcript species to the ones that exclusively contained tandem ORFs, whereas complex transcripts were defined as multigenic transcripts that contained at least one ORF in the opposite orientation as the rest of the ORFs. In this study, we detected 241 polycistronic transcript species containing at least two ORFs. The main initiator motif of these long transcripts was LIS, as 81.74% of the transcripts started at a TAAG sequence. A total of 79 complex transcript species were detected, 21 of which were transcript isoforms, while the rest of them were mapped to unique genomic locations. The longest complex transcript P10-74-ME53-C-1 had only a single sense and two antiparallel ORFs, while ORF51-52-53-LEF-10-ORF54-55-56-C-1 had the highest number of ORFs (6 sense and 1 antiparallel ORFs).
Replication origin-associated transcripts. The homologous repeat (hr) regions are located in multiple genomic positions in AcMNPV. They are believed to contain the replication origins (Oris). Our LRS approach detected overlapping transcriptional activity at all of the 9 h sequences. However, in the case of h5, LoRTIA did not identify transcripts. Despite this fact, we could detect reads without exact TSSs and TESs. Altogether, 55 transcript species were detected at the hr regions, 50 of which contained a TAAG initiator sequence. Fifteen of these RNAs were polygenic, 32 were TSS variants and 3 were TES isoforms, while 8 were monocistronic transcripts. Most of the overlapping transcripts (12) were transcribed at the genomic junction (at hr1) of the circular viral DNA, 7 of which were complex transcripts, 4 were TSS isoforms, and 1 was a monocistronic RNA.
Splice isoforms and transcripts with retained introns. Chen et al. 12 have previously reported twelve introns with an abundance above 1%. We detected five additional introns ( Table 3). Twelve of the introns detected in this study contained the canonical GT/AG splice junctions, while a single one a less common GC/ www.nature.com/scientificreports/ AG. Chen and colleagues have associated a spliced antisense transcript to ORF115. We detected 2 similarly positioned RNAs (ORF117-L-SP-1 and ORF117-L-SP-2), and ORF117-L-SP-1 had matching introns with the previously annotated transcript. ORF117-L-SP-2 had the same acceptor site position, but its donor site was located 85 nts downstream from the previously annotated genomic location. We could not precisely annotate the TSS of these transcripts; however, according to our data, it was located upstream of ORF117-L-1's TSS. Splicing of the ORF117-L-SP-2 led to frame shifting within the previously annotated ORF, and to the generation of a novel 246 nt-long ORF, originated upstream of the original ATG.
Transcriptional overlaps. Theoretically, transcripts can overlap each other in three different ways: convergently, divergently, and parallelly. The AcMNPV genome contains 37 convergent gene pairs. Our LRS approach demonstrated that all of the convergent gene pairs produced transcriptional readthroughs, only 3 pairs of which overlapped exclusively at their 3′-UTRs, while the others overlapped the ORFs (Supplementary Table 5). We detected 32 divergent transcriptional overlap out of the 34 gene pairs, and 84 parallel overlaps out of the 87 gene pairs. We assume that a higher data coverage would detect overlaps in every transcript.

5-mC methylation.
We used dRNA-Seq and bisulfite conversion data for the detection of methylated nucleotides of AcMNPV transcripts.
Tombo analysis. Tombo is a software package used for the identification of modified nucleotides from nanopore sequencing data. In order to decrease false positive results in the dRNA-Seq sample, transcripts were filtered with a coverage lower than 30, and the ones the modified fraction of which was less than 30%. We found no significant correlations between the coverage and the number of methylated nucleotides in the raw fraction (Fig. 4a). Using the Tombo software, a possible methylation consensus sequence (UUAC*CG) (the modified C letter is labelled with asterisk) was identified, which indicated the right distribution of log-likelihood ratios (Fig. 4b). Our bisulfite conversion experiment confirmed the methylation of this consensus sequence. The deviation from the canonical C sites could also be clearly detected (Fig. 4c). After identifying the potential false positive sites, we obtained 325 putative 5-mC methylation positions in 12 viral genes (ac-39k, ac-bro, ac-ctl, ac-odv-e25, ac-orf-58, ac-orf-73, ac-orf-74, ac-orf-75, ac-p40, ac-p6.9, ac-polyhedryn, and ac-vp39). The deviation from the canonical C sites could also be clearly detected (Fig. 4c).
Bisulfite conversion analysis. Besides the Tombo analysis of dRNA-Seq data, we also carried out bisulfite conversion experiments. While low read counts (2710 viral reads) were obtained in the dRNA-Seq, a much higher read count (125,448 reads) was generated by the bisulfite sequencing. Furthermore, in the latter method, positive control (non-converted samples) was also used. We could confirm 234 methylated positions out of the 325 (identified by Tombo analysis) with the bisulfate sequencing. In order to decrease the false positive results, we set a coverage of 25 as the threshold for the bisulfate analysis. Altogether, 7897 putative methylation positions were identified in the transcripts of 99 genes (Supplementary Table 6 and Supplementary Fig. 3). We detected 31 potential cytosine positions in the 3′-UTR of the ac-Orf-12 transcript, which were always unconverted, and therefore methylated. Altogether, 88% of the examined potential methylation positions (positions the coverage of which was at least 25) were located at the coding regions and 21% at the UTRs.
A to I RNA hyper-editing. Reads of ORF19-L showed a high frequency of A to I (read as A to G by the sequencing) substitution, which was not present in overlapping reads. We found that 50% of all substitutions Table 3. TSS, TES, splice junction, ATG and stop codon positions of the spliced AcMNPV transcripts. www.nature.com/scientificreports/ www.nature.com/scientificreports/ were A to G (Fig. 5a,b) for ORF19, which was significantly higher than the 16.9% for overlapping transcripts in the same region (p < 0.0001, sided Fisher's exact test) (Fig. 5c). A substitution threshold of 16.9% was set to distinguish possible edited bases from the noise of sequencing inaccuracy. Our results showed that 18% of all adenines of ORF19-L presented a high level (x̅ = 0.839, σ = 0.153), while 4% of adenines of the overlapping reads presented a low level of A to G editing (x̅ = 0.224, σ = 0.051). To identify the presence of a possible editing motif recognized by ADAR, we calculated the base frequency in the ± 5 nts surrounding the edited A. It has been previously demonstrated that a G-enriched neighborhood and an upstream U stabilize the RNA-ADAR complex in mammalian cells 54 . We detected a significantly higher frequency of Us (χ 2 (1, N = 79,455) = 79,338.023, p < 0.01) right upstream of the edited base, while the frequency of Gs was only slightly higher (χ 2 (1, N = 79,454) = 79,340.021, p < 0.05) at the + 5 position downstream of the edited base.

Discussion
The standard next-generation sequencing techniques are limited by short read length because the fragmented sequences have to be re-assembled computationally, during which a significant amount of valuable information on the transcriptome is lost. LRS is particularly useful for the analysis of nested and alternatively spliced transcripts. In this study, we applied two LRS techniques, SMRT Sequel platform from PacBio and MinION platform from ONT for profiling the AcMNPV transcriptome. We carried out amplified and direct RNA sequencing on ONT platform. Altogether, we identified 876 novel transcript species, including mRNAs, ncRNAs, mono-and polygenic transcript species, transcript isoforms, and novel splice sites. A stepwise truncation on both ends of the transcripts can be observed in several genomic regions. This has been shown to be present in other viruses 56 as well; however, many studies can confuse this phenomenon with RNA degradation or PCR artefacts, especially if it is present on the 5′-end of the reads. AcMNPV offers a unique support of the existence of these variable UTR isoforms by the presence of a LIS located at the TSS. The same TSSs and TESs are used by multiple transcript isoforms of neighboring, or in some cases, distant genes, resulting in polycistronic and complex transcript isoforms. This organization of the transcriptome, especially the intensive usage of the same TSS for multiple transcript isoforms containing varying TESs, is uncommon in herpesviruses 57 ; however, we observed a somewhat similar pattern of TSS/TES usage in African swine fever virus (ASFV) 58 , which is related to insect viruses.
Several multigenic transcripts were detected in AcMNPV. Operons encoding multigenic transcripts represents the basic organization principle of prokaryotic genome, but they are rare in eukaryotes the reason for which is that in bacteria the Shine-Dalgarno sequences allow the translation of every gene in the mRNA 59 , but in eukaryotes only the most upstream gene of a multigenic transcript is translated because of the Cap-dependent initiation system. However, polycistronic RNAs are very common in eukaryotic viruses 60 . The function of these multigenic transcripts are currently unknown because we have no evidence for the translation of downstream genes. It is hypothesized that the transcriptional readthrough in tandem genes (and also on convergent genes) plays a role in a transcription interference-based mechanism 61 .
We observed that many of the longer TSS isoforms contained uORFs in their 5′-UTR, which may play a role in the regulation of translation 52,62 . These transcripts are also involved in the formation transcriptional overlaps. AcMNPV resembles to ASFV and vaccinia virus and differs from herpesviruses in that it exhibits higher heterogeneity in their TESs than TSSs. The alternative use of 3′-UTRs generates long tail-to-tail and tail-to-head transcriptional overlaps. This part of the transcripts may contain cis-regulatory elements, which can bind to regulatory proteins or micro RNAs thereby controlling the translation and the decay of mRNAs 63 .
In this study, we detected novel promoters, Inr sequences and poly(A) sites. Additionally, we identified TAAG-Inr motifs, which bind viral RNP at late and very late phase of viral life cycle, and non-TAAG-Inr motifs recognized by both viral and host RNPs at early time points. Our results clearly demonstrate that viral RNP generates longer transcripts than the host RNP. The 3′-cleavage of the viral RNAs and the formation of a poly(A) tails are carried out by polyadenylation machinery of both the host and the virus, although the latter is not well understood.
AcMNPV contains 9 AT-rich repetitive sequences (hr regions), which are thought to be replication origins 64,65 . However, others have demonstrated that none of them is essential for the viral replication 66 . We detected overlapping transcription from each hr region. They are assumed to play a role in the regulation of replication 67 . Such transcripts have been identified in other viruses as well such as herpesviruses 68 .
We describe several putative 5′-truncated mRNA molecules containing nested in-frame ORFs. This phenomenon has been described in other viruses 58,69 . Further studies are needed to determine whether these transcripts carry the information of N-terminally-truncated polypeptides. If so, this kind of nested transcription significantly increases the coding potential of viruses. In this part of the work, we detected a large number of low-abundance transcript isoforms; nonetheless, their potential functional significance has to be ascertained.
Theoretically, the electric signals (squiggles) generated by the nanopore sequencing of native RNAs might be used to identify 5-mC modifications. Currently, only the Tombo package provides support for this, but our results show that this software tends to produce false positive results. We also obtained false negative results with the Combo software, but it is explained by the low dRNA-Seq data coverage. In order to exclude the false results, we also applied the traditional bisulfate conversion method, which converts non-methylated cytosines with 99.5% efficiency. We could validate 70% of the methylation position obtained by Tombo using bisulfate sequencing. With this approach, we detected 5-mC methylation in transcripts of 99 AcMNPV genes and identified the UUA CCG sequence, which is assumed to be a methylation consensus sequence. We found that the majority of methylated positions were located in GC-rich genomic regions. This phenomenon has already been described in mammalian animals 70 . Yang and co-workers have demonstrated that 5-mC bases enhance the nuclear export via the ALYREF adapter protein in mammalian cells 70  www.nature.com/scientificreports/   71 . ALYREF is also present in Arthropods. We assume that similarly to the mammalian cells 70 and the Kaposi's sarcoma virus 71 , the methylation of AcMNPV mRNAs may play a role in nuclear export.
We detected A to I hyper-editing in the 5′-UTR region of the longer TSS isoform of ORF19 canonical transcript. In cellular organism, this process plays an important role in innate immunity 30 , which is unlikely to be the case in AcMNPV. A-I editing is thought to decrease the affinity of antisense transcripts to the complementary mRNAs through inhibiting the binding of dsRNA nucleases (such as RNase) 72 . Since cDNA-Seq-based editing detection is still in its infancy, our results need further confirmation.

Materials and methods
Cells and viral infection. RNA purification. Total RNA was isolated using the Nucleospin RNA Kit (Macherey-Nagel) according to the manufacturer's instruction. In short, infected cells were collected by centrifugation, and the cell membrane was disrupted by the addition of lysis buffer (provided in the kit). Genomic DNA was digested using the RNasefree rDNase solution (supplied with the kit). Samples were eluted in a total volume of 50 μL nuclease free water. To eliminate residual DNA contamination, samples were treated with the TURBO DNA-free Kit (Thermo Fisher Scientific). The RNA concentration was measured by Qubit 2.0 Fluorometer (Thermo Fisher Scientific) using the Qubit RNA BR Assay Kit (Thermo Fisher Scientific).

Poly(A) selection.
Thirty-five μg of total RNA was pipetted in separate DNA LoBind Eppendorf tubes (Merck) from every time point. The poly(A) + RNA fraction was isolated from the samples using the Oligotex mRNA Mini Kit (Qiagen). The concentrations of the samples were measured by Qubit 2.0 using the Qubit RNA HS Assay Kit (Thermo Fisher Scientific). RNAs were stored at − 80 °C until use.
Cap-selection, library preparation and sequencing. Cap-selection was carried out with the aim to validate the 5′-ends of the transcripts. For this, we used the TelopPrime Full-Length cDNA Amplification Kit of Lexogen. The starting material was a total RNA mixture (containing samples from 1, 2, 4, 6, 16, 24, 48 and 72 h p.i.). The cDNA generation was carried out according to the recommendations from the manual of Lexogen. Detailed protocol can be found in our earlier published data paper 73 . The cDNA sample was used to produce a sequencing library with the ONT Ligation Sequencing Kit 1D (SQK-LSK108) following the last steps (end-repair and 1D adapter ligation) of ONT's 1D Strand switching cDNA by ligation method (Version: SSE_9011_v108_ revS_18Oct2016). Sequencing was performed on ONT R9.4 SpotON Flow Cells.
Barcoding. For the kinetic characterization of the viral transcripts, RNA samples from different time points (1,2,4,6,16,24, 48 and 72 h p.i.) were used individually for the production of cDNA libraries for the Nanopore sequencing. Libraries were labeled with barcodes applying a combined protocol: the ONT's 1D Strand switching cDNA by ligation method was used until the end repair step, which was followed by the 1D PCR barcoding (96) genomic DNA protocol (version: PBGE96_9015_v108_revS_18Oct2016, updated 25/10/2017) starting with the Barcode Adapter ligation step 73 . Adapters (Table 4)   cDNA-PCR Sequencing. A cDNA library from the bisulfite-converted sample was generated for MinION sequencing by using the cDNA-PCR Sequencing Kit (SQK-PCS109, Version: PCS_9085_v109_revM_14Aug2019), as follows: the primer (VNP) and dNTP (both from the Kit) were mixed with 50 ng bisulfite-converted RNA, and then, the mixture was incubated at 65 °C for 5 min. This protocol was followed by the addition of RT buffer, RNaseOUT, nuclease-free water, and strand-switching primer to the sample. They were incubated at 40 °C for 2 min, and then, Maxima H Minus Reverse Transcriptase (Thermo Scientific) was measured into the RT mix. Reverse transcription was carried out at 42 °C for 90 min. The RT enzyme was inactivated by heating the sample to 85 °C for 5 min. For the amplification of the first-strand cDNA sample, LongAmp Taq Master Mix (New England Biolabs), cDNA Primer (cPRM, ONT Kit) and Nuclease-free water (Invitrogen) were added. The following settings was used: initial denaturation (95 °C, 30 s), 16 75 . Annotation of the TSSs, TESs, and introns was performed using the LoRTIA software suite (https:// github. com/ zsolt-balazs/ LoRTIA) with specific settings for each sample type (Table 1).
We used SeqTools, our in-house scripts for the generation of the descriptive quality statistics of reads (Read-Statistics) and for the analysis of promoters (MotifFinder), which are available on GitHub: https:// github. com/ moldo vanno rbert/ seqto ols.
In this study, the LoRTIA (https:// github. com/ zsolt-balazs/ LoRTIA, v.0.9.9) pipeline developed in our laboratory was used for the identification of transcripts and transcript isoforms, as was described earlier 73 . Briefly, sequencing adapters and the homopolymer A sequences were checked by the LoRTIA software for the detection of TSS and TES, respectively. For the elimination of false transcript ends, the putative TSSs and TESs were tested against the Poisson distribution (using Bonferroni correction). Introns were identified by applying the following criteria: they have one of the three most frequent splice consensus sequences (GT/AG, GC/AG, AT/AC), and their frequency exceed 1‰ compared to the local coverage (Table 5).
For transcript isoform annotation, TSSs and TESs were selected that were present in at least two samples, while introns were selected if they were present in at least two samples and if their orientation matched the orientation of reads in which they were present, as the LoRTIA software is blind for the orientation of the reads when looking for introns. Transcript isoforms were annotated for each sample using these features and the Transcript Annotator module of LoRTIA.
A read was considered a transcript isoform if it started in the ± 5 nt vicinity of a TSS and if it ended in the ± 5 nt vicinity of a TES. Transcripts enclosing the same ORFs as a previously annotated transcript but starting upstream of its TSS were denoted longer (L) 5′-UTR isoforms, while those starting downstream, shorter (S) 5′-UTR isoforms. Transcripts with the same ORFs as a previously annotated transcript but ending upstream or downstream of its TES were denoted transcript isoforms with alternative termination (AT). Transcripts with longer 5′ or 3′-UTRs overlapping multiple ORFs in the same orientation were considered polygenic. If a TSS of a novel transcript isoform was positioned downstream of a previously described ORF's AUG, with an alternative in-frame start codon downstream from the TSS, the isoform was considered putative protein coding transcript, while those without a 5′-truncated ORF were considered 5′-truncated (TR) non-coding transcripts. Both of these transcript species are conterminal with their previously annotated isoforms. If a transcript isoform started in the same TSS as a previously described protein coding transcript, but its TES was located upstream of the stop codon of previously described ORF, the novel transcript was denoted as non-coding (NC). Transcripts in the opposite orientation of an annotated transcript were named non-coding antisense (AS) transcripts. Very long transcripts overlapping multiple ORFs in different orientation were denoted as complex (C) transcripts. Any other transcript configuration not containing a previously annotated ORF was denoted as NC.
Detection of RNA modifications. To detect the modifications in the RNA nucleotides, we base-called the raw fast5 files of our previously published direct RNA sequencing dataset deposited in the European Nucleotide Archive under sample accession SAMEA10458962. Modification detection was performed by Tombo software suite 43 (v.1.3.1.).