Time-course profiling of bovine alphaherpesvirus 1.1 transcriptome using multiplatform sequencing

Long-read sequencing (LRS) has become a standard approach for transcriptome analysis in recent years. Bovine alphaherpesvirus 1 (BoHV-1) is an important pathogen of cattle worldwide. This study reports the profiling of the dynamic lytic transcriptome of BoHV-1 using two long-read sequencing (LRS) techniques, the Oxford Nanopore Technologies MinION, and the LoopSeq synthetic LRS methods, using multiple library preparation protocols. In this work, we annotated viral mRNAs and non-coding transcripts, and a large number of transcript isoforms, including transcription start and end sites, as well as splice variants of BoHV-1. Our analysis demonstrated an extremely complex pattern of transcriptional overlaps.

3′-truncated ncRNAs (3tncRNA) are the result of premature transcription termination that lead to the lack of stop codons on the mRNA. We annotated 3′-truncated long non-coding RNAs (lncRNAs) in 13 genes, totaling 44 transcript isoforms. These RNA molecules lack canonical polyadenylation signal (PAS) and the sequence directly upstream of their termination is GCA-rich, which is not the case for the main transcripts (Fig. 2a). The bICP4 gene encodes five 3′-truncated RNAs (Fig. 2b), four of which terminate in GA-rich regions. These loci Scientific Reports | (2020) 10:20496 | https://doi.org/10.1038/s41598-020-77520-1 www.nature.com/scientificreports/ may act as transcriptional pause sites, that promote termination, cleavage and polyadenylation 39 . The UL27 gene produces the highest variation of 3tncRNAs with seven truncated isoforms. Additionally, four of the six 3tncRNAs transcribed from bICP22 are spliced using the same splice sites as the main gene. Non-coding RNAs can be first detected in the second hour of the infection, and they continue to be expressed throughout the viral lifecycle (Supplementary figure S3).    www.nature.com/scientificreports/ Replication-origin-associated transcripts. Replication of BoHV-1 starts between 2 to 4 h p.i. 41 . The viral genome contains a replication origin (OriS) in both repeat regions, but lacks OriL, which is present in its close relative PRV. Replication-associated (ra)RNAs overlap the OriS sequences. We identified six raRNAs: ORIS-RNA1 and ORIS-RNA2, and four 5′-UTR isoforms of bICP22 (bICP22-L1, bICP22-L1-SP10, bICP22-L1-SP12, bICP22-L1-SP8) (Fig. 2c). The raRNAs were present in low abundance and only at late time points of infection (8 h and 12 h).
Transcript ends, length isoforms and promoters. Transcript start sites and promoters. In this work, we detected 823 TSSs using the criterion for sequencing reads to be present in at least three independent samples. Four of the TSSs were present in the first hour of the infection. This number increased ten times at the second hour, while 90.7% of the TSSs were only detected following the start of the viral replication (Fig. 3a). Promoter analysis disclosed 91 TATA consensus sequences at an average distance of 30.6 nt (σ = 2.09) upstream from TSS positions and fifteen CAAT consensus sequences at an average distance of 105.8 nt (σ = 17.53) (Supplementary Table S2). The eukaryotic initiator sequence region is poorly conserved in this virus (Py A N U/A) 42 . Our analysis revealed a similar consensus but with an enrichment in Gs in the + 1 position for TSSs with a TATA-box. TSSs lacking a TATA-box showed an even higher enrichment of Gs in the + 1 position followed by a www.nature.com/scientificreports/ predominance of Gs in the + 2 position and an ambiguous + 3 position (Fig. 3b). This G enrichment has also been described in the initiator element (INR) of HSV-1′s VP5 promoter 38,43 . TSSs with a canonical promoter sequence are more abundant at every time point in the infection, than those lacking it (Fig. 3c). We identified a putative TSS for RNA1.5 overlapping the genomic junction 44 , which was located at 129,272 nts, and it was found to be coterminal with the more abundant CIRC RNA. We also detected splice junctions of this transcript located at positions 134,896 and 473. RNA1.5 appears to be a low-abundant CIRC variant with a longer 5′-UTR and an intron. TSS isoforms were detected starting on the second hour of infection and their expression continued throughout the entire course of infection making up 58% of all transcripts in the late phase (Supplementary figure S3). The TSS distribution alongside the BoHV-1 genome is illustrated in Fig. 4a.
Transcript end sites. RNA termination is more precisely regulated than the initiation. This results in a lower TES diversity. Using the above criterion for a LoRTIA transcript, we identified 135 TESs. Altogether six TESs were detected in the first hour of the infection, 31% in the second hour, whereas 90% of the TESs could be detected in the late phase of the infection. Our analysis revealed that 48 out of the 135 identified TESs have PAS consensus sequences at an average of 24.5 nt (σ = 5.98) upstream of their TESs (Supplementary Table S2). TESs with PASs have canonical sequence surroundings consistent with the eukaryotic RNA cleavage regions with an orthodox A/C cleavage site and a U/G-rich downstream element, whereas TESs without PAS did not show any consensus sequence motifs at any of the key regions ( Fig. 3d) surrounding the cleavage site. TESs with canonical regulatory sequences are more abundant at every time point of the infection, than those of lacking these sequences (Fig. 3e). We identified 31 transcripts that terminated in alternative loci compared to their most abundant isoform. All TES isoforms were detected in the late phase of the infection (Supplementary figure S3). The TES distribution alongside the BoHV-1 genome is illustrated in Fig. 4b. www.nature.com/scientificreports/ Spliced transcripts and splice isoforms. Compared to gammaherpesviruses, alphaherpesviruses have much fewer spliced transcripts. This is indeed true for high-abundance splice isoforms. However, as in HSV-1 38 , we also detected a number of low-abundance spliced RNA molecules in BoHV-1. In this analysis, we identified 25 introns (23 with GT/AG and 2 with GC/AG) by dRNA-Seq. All of these introns were also detected using cDNA-Seq techniques. One-third of all introns were detected in the first, 60.7% in the second hour of the infection, whereas in the late phase 82%. We identified all of the previously annotated six introns. However, one of the bICP22 introns spanning from 112,284 to 112,865 13 was present in only two samples represented by merely 4 reads. We detected two novel abundant introns in this region, the first with the same donor and the second with the same acceptor positions as the previously annotated splice sites, but with a 67 nt-long exon between the two (Fig. 5a). We found a total of 23 novel splice variants with every intron being located in the 5′-UTR region of the transcript. We also located a non-spliced version of the bICP22. Most of the splice isoforms of bICP22 are most abundant at 2 h p.i. with the exceptions of bICP22-SP3, bICP22-SP12, bICP22-SP14 and bICP22-SP10, which peak at 6 h p.i. (Fig. 5b). A non-spliced version of UL15 starting in the same TSS and co-terminal with its spliced variant was also present in our samples, although in a lower abundance than its previously annotated spliced variant. In addition, we also detected an alternatively spliced UL15 transcript with the previously annotated intron overlapping its splice donor position at 78,040. This novel intron having a length of 3579 nts is preceded by a short exon with a start codon. The resulting 128-amino-acid-long putative protein showed no similarity with any of the proteins in the NCBI non-redundant protein database (data not shown). Finally, deleted segments found in cDNA that were absent in dRNA-seq data were only considered as putative introns if they were present in at least 3 independent samples. We found a total of 28 putative introns, 23 with GT/AG and 5 with GC/AG consensus (Supplementary Table S1). We detected a novel splice variant in a UL40 transcript. The intron spanning from 22,945 to 23,500 produces a frame shift mutation. The predicted 104 aa-long product of the UL40-SP1 showed no significant similarity to any other proteins in the NCBI non-redundant database. This intron is also present in several TSS-variants of the UL40. We observed an increased abundance of the non-spliced isoform until 6 h p.i., followed by a decrease, www.nature.com/scientificreports/ whereas the abundance of the spliced isoforms started to increase at 4 h p.i. and continued increasing until 12 h p.i. (Fig. 5c). Spliced isoforms could be detected in every phase of the infection. Two out of three genes detected at the first hour p.i. are spliced, however the fraction of spliced isoforms declines with the progression of the infection. The distribution of splice junctions alongside the BoHV-1 genome is illustrated in Fig. 6.

Multigenic transcripts. Polycistronic transcripts.
Herpesvirus genes are organized into tandem gene clusters sharing common transcription termination sequences. It is a general wisdom that in eukaryotes only the most upstream ORFs are translated from polygenic transcripts 45 . In a few examples however, translation from downstream genes of polygenic transcripts has been described in both eukaryotes 46 and their viruses 47 . For example, it has been reported that an upstream ORF (uORF) helps the translation of the downstream ORF36 of Kaposi's sarcoma-associated herpesvirus (KSHV) by a termination-reinitiation mechanism 48 . We detected a high number of polygenic transcripts using the stringent criteria, including 21 bicistronic, 13 tricistronic, 3 tetracistronic and 4 pentacistronic isoforms (data not shown). To identify the longest transcript isoforms in our samples, we screened the sequencing reads and annotated two more tricistronic and an additional tetracistronic transcript. These were represented by a single read, thus we deem their start position to be uncertain and hypothesize that they start at the closest TSS. The longest polycistronic transcript annotated by this study is the bicistronic UL37-36 RNA molecule spanning 13,041 nts. Complex transcripts contain multiple genes of which at least one of them is encoded in the opposite strand. This study revealed a more intricate network of complex (cx)RNA molecules in BoHV-1 than in other herpesviruses 21,38,49 , for which the probable reason is technical: in this study we used high-coverage dRNA and dcDNA sequencing techniques. We identified a total of 76 complex transcripts. Twenty-one of these are represented by only one read and were detected by our screening of the longest isoforms. A putative transcript, Intergenic peptides mapping to transcript isoforms. We used the proteogenomic coordinates of intergenic peptides detected by a previous tandem mass spectrometry experiment 8 to analyze the isoforms' coding potential. For 95 isoforms the peptides mapped to the 5′-UTR, 16 in-frame with an ORF with canonical start and end sites. Fifty-six of these transcripts were long 5′-UTR isoforms of 28 complex transcripts. Peptides also mapped to the 3′-UTR of 31 isoforms, 10 of them being in-frame with a canonical ATG and stop codon. Detection of these peptides suggests translational activity on the 5′-and 3′-UTR region of some viral genes. Eighty-one putative protein-coding transcripts are also overlapped by a peptide, however these are out-offrame with respect to the main genomic ORF or its truncated version, and no canonical ATG or stop codon is present for these peptides. Jefferson and colleagues suggest, that the apparent lack of a canonical ORFs might originate from frame shifting events caused by splicing, however our results did not confirm this hypothesis. Peptides originating from non-canonical ORFs overlapped 16 of the non-coding isoforms including seven 5′-, three 3′-truncated and 6 antisense transcripts. These however do not support the in-silico coding potential estimation, since CPAT assesses only canonical ORFs. For a comprehensive list of intergenic peptide locations see Supplementary table S3. A complex meshwork of transcriptional overlaps. This study revealed an extreme complexity of transcriptional overlaps, which is generated by frequent transcriptional readthroughs and head-to head overlaps produced by divergent genes. Transcriptional readthrough is already a well-known phenomenon between coterminal tandem gene clusters. Here, we demonstrated that even more distal genes with their own transcription termination are able to carry out occasional transcriptional readthroughs. With a few exceptions (UL30-31, LR-ORF1-bICP0, LR-ORF2-bICP0), canonical transcripts of convergent gene pairs terminate without overlapping each other. Furthermore, we show that every convergent gene produces sporadic readthroughs with a mean overlap size of 153.5 nt (σ = 705.6). The most extensive overlap was detected between two putative transcripts UL18-17-16-15-14-13-12-11-C1 and UL15-L4-SP1 with a size of 9498 nts.

Discussion
The past decade has witnessed tremendous advances in long-read sequencing. Besides the Pacific Biosciences and Oxford Nanopore Technologies platforms, Loop Genomics has also developed an LRS technique based on single molecules synthetic long-read sequencing. High-throughput LRS techniques are able to read full-length RNA molecules, which led to the discovery that the transcriptomes of examined organisms are much more intricate than previously thought. LRS-based studies have demonstrated that herpesviruses exhibit astoundingly complex transcription profiles 21,38,49 . Here we report the assembly and annotation of the BoHV-1 dynamic transcriptome. Our analysis identified a large number of novel transcripts and transcript isoforms. Our results www.nature.com/scientificreports/ demonstrate that practically every BoHV-1 gene produces transcripts with one or more 5′-truncated in-frame ORFs. It is unknown whether these ORFs are translated, and if so, what their functional significance could be. Several novel splice sites were also identified in this work. This result suggests that splicing in alphaherpesviruses may be more common than previously thought. However, in most cases, splice isoforms are expressed in low abundance relative to the common variant. It can be speculated whether the relative abundance of the splice isoforms is dissimilar in different cell types. We detected a huge variety of bICP22 splice isoforms and found that splicing occurred in the 5′-UTR of the transcript that overlaps two asRNAs. It is possible that these asRNAs contribute to the alternative splicing of bICP22, as suggested in other systems 50,51 . Novel introns in the UL40 and UL15 genes give rise to a change in amino acid composition. The increase of UL40-SP1 following viral replication might play a role in the virus' transition into post-replication-phase. Replication-associated RNAs have been shown to regulate replication by altering primer synthesis 52 or mediate the recruitment of the Origin Recognition Complex 53 . We discovered six replication-associated RNAs (ORIS-RNA1, ORIS-RNA2 and four 5′ UTR isoforms of bICP22), which overlaps the OriS of the virus. ORIS-RNA1 and ORIS-RNA2 are non-coding raRNAs and have no homologous counterparts in closely related viruses. This finding is consistent with observations in closely related herpesviruses. In HSV-1, a long TSS isoform of ICP4 transcript overlaps the OriS, whereas in PRV the non-coding PTO 32 and the long TSS variant of the US1 transcript overlap the OriS. It appears to be that raRNAs underwent very rapid evolution, and every species has developed unique solutions for raRNA production. Additionally, two miRNAs have been reported to overlap both raRNAs in an antisense orientation 14 , which may play a role in the regulation of raRNAs. Our study demonstrated the existence of very long polycistronic transcripts, and also that genes thought to be present in only bi-or polycistronic RNAs, are also expressed as monocistronic transcripts. A great diversity of transcription initiation is also described in this report. The same ORFs are expressed in a large number of TSS isoforms, with many starting in adjacent genes, and some in more distal genes. A large fraction of this variation is represented by low-abundance transcripts. 5′-UTRs were shown to modulate translation by forming specific structures 54 , or through upstream AUGs or uORFs 48,55,56 . Two-thirds of the overlaps between previously detected peptides and the transcript isoforms mapped to the 5′-UTR, suggesting the translation of potential uORFs. The function of the great extent of TSS variation in BoHV-1 is currently unknown; however, it may lead to post-transcriptional modifications, or differential translation. The prototypic organization of the herpesvirus transcriptome is characterized by the 3′-co-termination of the transcripts produced by tandem genes. Our findings suggest TES variation being substantially lower than TSS diversity. However, transcriptional readthrough is common not only in the parallel, but also in convergent genes. The 3′-UTRs may contain cisacting elements determining the fate of RNAs after transcription 57 . The 3′-UTR of some isoforms also showed translational activity. A recent paper by Wu and colleagues proposed that downstream ORFs have a translationenhancing function 58 , which needs further confirmation in viruses.
An extremely complex meshwork of transcriptional overlaps was also described. Overlaps are produced by transcriptional read-throughs between tandem and convergent genes or by the mutual utilization of certain genomic loci by divergent genes. Multiple transcriptional read-throughs can result in the generation of complex transcripts. Convergent and divergent transcriptional overlaps produce antisense RNA segments on transcripts. These overlaps may give rise to double-stranded RNAs (dsRNAs), which are targeted by the dsRNA-activated protein kinase R 59 or type I interferon system 60 of the host, thereby effectively reducing viral translation. In a recent publication Dauber and co-workers reported the mechanism of how HSV-1 virion host shut-off (VHS) protein limits the accumulation of viral dsRNAs 61 . In light of these findings, it is possible that the frequency of RNA polymerase II readthroughs are much higher than expected from the basis of ratio of read-thorough transcripts.
The high-coverage of sequencing reads allowed us to carry out kinetic characterization of viral transcripts. We used dcDNA data because it produces longer reads than both amplification-based and native RNA sequencing techniques. At the same time, dcDNA-Seq lacks the potential amplification biases causing improper quantitation 62,63 . LRS allows the distinction between parallel-overlapping RNA molecules and transcript isoforms, therefore we can not only kinetically characterize the genes, but also transcript isoforms.
These analyses and data provide valuable resources for future functional studies. RT with random primers. 50 ng of ribodepleted RNA was reverse transcribed using SuperScript IV Reverse Transcriptase and custom-made primers composed of a random hexamer sequence and one complementary to Ligation Sequencing Kit Primer (supplied in the kit). PCR, end repair and sequencing adapter ligation were identical to the oligo(dT) primed RT library. The resulting four libraries were barcoded using the 1D PCR Barcoding (96) Kit (Oxford Nanopore Technologies) following the manufacturer's instructions.

Cells and viruses. Cultured cells were infected with the Cooper isolate (GenBank
LoopSeq single-molecule synthetic long-read sequencing. LoopSeq  with-qscore_filtering. Reads with a Q-score greater than 7 were mapped to the circularized viral genome (NCBI nucleotide accession: JX898220.1) with the Minimap2 software (Li, 2018). The same reads were also mapped to the host genomes, as follows: the BoHV-1/MS sample was mapped to the genome assembly of Bos www.nature.com/scientificreports/ taurus (GCF_002263795.1) while the BoHV-1/HU sample to the genome of Ovis aries (GCA_002742125.1), both using the Minimap2 aligner. Synthetic long-read data was base called on the MiSeq device using the Real-Time Analysis software with default settings. Base called reads were then processed by the Loop Genomics Pipeline Software with default settings. Synthetic long reads were mapped to the BoHV-1 and Bos taurus genomes using Minimap2.
Structural analysis. For transcript isoform detection and annotation, mapped reads were analyzed with the LoRTIA software suite v.0.9.9, using the following steps: (1) To eliminate reads resulting from partial RT or PCR and artefacts resulting from false priming non-trimmed read ends were searched for the sequencing adapters for the TSS or the presence of a homopolymer A sequence for the TESs. The first nucleotide not aligning with the adapter was denoted as possible TSS while the last nucleotide not aligning with the homopolymer A was detonated as putative TES. Any other read start/end position was discarded. (2) Putative TSSs and TESs were tested against the Poison distribution to eliminate random start and end positions caused by RNA degradation. The significance is corrected with the Bonferroni method. Features failing to pass qualifying as local maxima, or being present in less than 2 reads or in less than 1‰ of the coverage are eliminated from the analysis. (3) Putative introns are annotated if they have one of the three most commune splice consensus sequences (GT/AG, GC/ AG, AT/AC) and they are more abundant than 1‰ compared to the local coverage. Introns with splice junctions flanked by tandem repeats are removed from the analysis as these can be template switching artefacts.
The resulting putative TSSs and TESs were considered as existing if they were detected in at least three independent samples. Deleted segments were accepted as introns if they were present in both dRNA-Seq and one of the cDNA-Seq datasets and if they were shorter than 10 kbps. We set a relatively low abundance for acceptance because it may vary in different cell types. The accepted TSSs, TESs and introns were then assembled into putative transcripts using the Transcript_Annotator software of the LoRTIA toolkit. Very long unique or low-abundance reads which could not be detected using LoRTIA were annotated manually. These reads were also accepted as putative transcript isoforms if they were longer than any other overlapping RNA molecule. In some cases, the exact TSSs were not annotated. Finally, a read was considered as transcript if it was present in at least three separate samples. Transcript annotation was followed by isoform categorization according to the following principles: the most abundant transcript containing a single ORF was termed canonical monocistronic transcript, whereas isoforms with longer or shorter 5′-UTRs or 3′-UTRs regions than the canonical transcripts were termed TSS or TES isoforms (variants), respectively. Similarly, transcripts with alternative splicing were named splice isoforms. Transcripts with 5′-truncated in-frame ORF were termed as putative mRNAs. Transcripts with multiple nonoverlapping ORFs were designated polycistronic, whereas those with ORFs in different orientation were called complex transcripts. Transcripts with no ORFs or ORFs shorter than 30 nts were named non-coding, except if occurred in front of a canonical ORF in a transcript (These small ORFs were termed uORFs).
Analysis of viral gene expression. For the characterization of viral gene expression kinetics, we used dcDNAseq datasets annotated by LoRTIA. First, we filtered out reads that were confirmed by less than 4 other reads in the dcDNA-seq datasets, then we normalized read counts using a modified version of the Median Ration Normalization of the DESeq2 software suite (Wu et al., 2019). The average normalized read counts of the biological replicates was then considered as the read count of a given isoform.
Coding potential analysis. The assessment of the coding potential of the long non-coding RNAs was performed using the Coding Potential Assessment Tool (CPAT) 40 with the standard vertebrate codon table. The non-redundant NCBI protein database was also searched for ORF homologies. To infer translational information for the vast variety of structural isoforms detected we compared our results to a previous proteogenomic analysis of the virus. The coordinates of intergenic peptides with and without a viable ORF detected by Jefferson et al. 8 were overlaid with the resulting isoforms, and their location within the transcript was annotated.
A schematic representation of the workflow can be found in Supplementary Figure S4.
Ethics declaration. Neither human nor animal experiments were applied in this study.