Comprehensive analysis of single molecule sequencing-derived complete genome and whole transcriptome of Hyposidra talaca nuclear polyhedrosis virus

We sequenced the Hyposidra talaca NPV (HytaNPV) double stranded circular DNA genome using PacBio single molecule sequencing technology. We found that the HytaNPV genome is 139,089 bp long with a GC content of 39.6%. It encodes 141 open reading frames (ORFs) including the 37 baculovirus core genes, 25 genes conserved among lepidopteran baculoviruses, 72 genes known in baculovirus, and 7 genes unique to the HytaNPV genome. It is a group II alphabaculovirus that codes for the F protein and lacks the gp64 gene found in group I alphabaculovirus viruses. Using RNA-seq, we confirmed the expression of the ORFs identified in the HytaNPV genome. Phylogenetic analysis showed HytaNPV to be closest to BusuNPV, SujuNPV and EcobNPV that infect other tea pests, Buzura suppressaria, Sucra jujuba, and Ectropis oblique, respectively. We identified repeat elements and a conserved non-coding baculovirus element in the genome. Analysis of the putative promoter sequences identified motif consistent with the temporal expression of the genes observed in the RNA-seq data.

Single molecule sequencing and analysis of HytaNPV Genome. We sequenced HytaNPV DNA using PacBio single molecule sequencing 10 . A total of 124,978 reads with an average length of 8,635 bp was obtained. De novo assembly of the reads yielded a circular genome of ~139 kb with the overall coverage of 2,237×. We also generated sequence data on a short read platform (MiSeq, Illumina) and obtained 13,941,057 paired-end (2 × 300-bp) reads to help polish and derive the consensus HytaNPV genome sequence of 139,089 bp (GenBank accession number MH261376) (see Methods). We developed a pipeline (Supplementary Figure S1) that identified open reading frames (ORFs) coding for proteins 50 amino acids or longer in the HytaNPV genome and then annotated the ORFs based on the homology to proteins they encode (see Methods). We also used the comparative analysis of genes and gene order in closely related NPV genomes to further refine the annotations.
The HytaNPV double stranded circular genome is 139,089 bp long and has a GC content of 39.6%. We identified 141 ORFs in the genome that code for proteins of at least 50 amino acids long (Fig. 2). We confirmed the expression of these genes using RNA-seq data derived from HytaNPV infected larvae (Supplementary Table S1 and Supplementary Figure S2). The HytaNPV genome codes for the polyhedrin gene that is observed in all alphabaculoviruses. Keeping with the convention, we designated the polyhedrin gene as orf1 and used it as the origin to annotate the remaining ORFs. We found that the HytaNPV genome does not code for the envelope fusogenic protein gp64, but contains the coding sequence for the F protein (orf132) as observed in group II alphabaculoviruses.
Phylogeny and genome organization of HytaNPV. We detected all 37 core genes 11 in the HytaNPV genome that are involved in replication, transcription, oral infectivity, viral assembly, packaging and host protein interactions (Supplementary Table S2). In addition, we analyzed the genome for genes conserved among the four baculovirus subgroups 11 and found all the 9 genes conserved in alpha, beta and gamma viruses, the F-protein gene conserved in alpha, beta and delta, and all 16 genes conserved in alpha and beta subgroup of baculoviruses (Supplementary Table S3). Phylogenetic analysis based on amino acid sequences of 37 core genes across 81 complete reference genomes (Supplementary Table S2, S4) confirmed that the HytaNPV is a group II alphabaculovirus member (Figs 3-4, Supplementary Figure S3). HytaNPV was closest to Biston suppressaria (formerly Buzura suppressaria; average amino acid identity between core genes ~76.8%) NPV, though HytaNPV genome is ~19 kb larger. It is interesting to note that Biston suppressaria, a moth similar to Hyposidra talaca from the Geometridae family, is a known tea looper pest found in China, India and other parts of Asia 12 . The related host niche and the high similarity between the two viruses suggest that they both evolved from a common parent or one is the descendent of the other. In addition, among the next three closely related viruses, two NPVs SujuNPV (average amino acid identity ~57.3%), and EcobNPV (average amino acid identity ~52.8%) also affect geometridae moths, Sucra jujuba and Ectropis obliqua, respectively. Also, Ectropis obliqua is a major tea pest and EcobNPV has been successfully used for its control 13 . This suggests that the host specificity of the baculoviruses probably co-evolved with the insects they infect and often closely related viruses infect larvae from the same family. Alignment and analysis of HytaNPV genome to the closely related BusuNPV and SujuNPV genomes, and the prototype AcMNPV genome using Mauve multiple genome alignments software 14 identified 20 conserved segments shared across these genomes (Fig. 5, Supplementary Table S5). Consistent with the phylogenetic relatedness, HytaNPV genome was collinear with the BusuNPV. Compared to BusuNPV, HytaNPV acquired additional sequences in segment B and G and between segments E-F, L-M and R-S. HytaNPV genome, while very similar in gene content to SujuNPV genome, it shows distinct differences. In particular, segments C to H, N, P, Q, and R are inverted and rearranged in the SujuNPV genome. Interestingly, segment S present in both HytaNPV and SujuNPV encodes ribonucleotide reductase large subunit (rr1) and this is absent in the BusuNPV genome. Comparison of the distantly related AcMNPV and HytaNPV genomes identified segments F, G, I, J and D among the most conserved regions and with the exception of segment J, they are rearranged and inverted. Interestingly, a majority of the conserved blocks carried at least one of the core baculovirus genes with the highly conserved J block encoding 17 of the 37 core genes, indicating an evolutionary constraint that has necessitated the inheritance of these genes as a group during the course of the evolution of the virus (Supplementary Table S2).
Repeat regions. Repeated A-T rich sequences called homologous repeats (hrs) made of one or more copies of imperfect or perfect palindromic sequence are present in baculovirus genomes 15,16 . They widely vary in the sequence composition, length and number of repeats between genomes 15 . The hrs are thought to function as transcriptional enhancers and replication origins 15 . Similarly, direct repeats (drs) in the baculovirus genomes have been suggested to function as replication origins. In the HytaNPV genome, we found six repeat regions with lengths ranging from 62-513 bp. The repeat sequences were either 46, 18 or 15 bp long and were arranged in tandem (Fig. 6). They were A-T rich with an AT content of 72-74%. Repeat 1, located at the 5p region of hoar (orf4), contains three 18 bp repeats and a truncated 8 bp region from the 18 bp unit arranged in tandem. Four 15 bp repeats and a truncated 5 bp region form the 15 bp unit arranged in tandem located at the 3p end of orf6 constituted repeat 2. Interestingly, the 15 bp unit in repeat 2 consists of two 6 bp repeats with intervening 3 bases separating them. The first ORF (orf1) is the polyhedrin gene. red -core genes, which are conserved among all baculoviruses; blue -genes conserved among lepidopteran baculoviruses; gray -known genes, which are found in baculovirus; green -unique genes, which are only found in HytaNPV.
Repeats 3, 4, 5 and 6 were made of 46 bp repeats that shared a common core (Fig. 6), though each 46 bp unit that made each of these repeats was distinct. The repeat 3 is the longest (513 bp) and is made of a total of eleven copies of a 46 bp unit and a truncated 7 bp unit and is located at the 3p end of orf16. The 46 bp units in repeat 3 consist of two variants that differ in two positions (Fig. 6). Each of the eleven copies of the 46 bp unit in repeat 3 contains a BglII restriction enzyme site (6 bp) in the middle that is flanked by 20 bp on either side. Repeat 4, located at 3p end of orf47 is 121 bp long and is made of two 46 bp repeat units and a truncated 29 bp repeat sequence. Between endonuclease (orf94) and nrk-1(orf95) is the 171 bp repeat 5 that consists of three 46 bp repeat units and a 39 bp truncated repeat sequence. Interestingly the 46 bp repeat units in repeat 5, like repeat 3 units contain a BglII restriction  Table S4). Bootstrap value resulted from 1000 replications is shown in each node. Red arrow -HytaNPV.
Scientific REPoRTS | (2018) 8:8924 | DOI:10.1038/s41598-018-27084-y located close to the 3p end of the repeat unit. Repeat 6, 297 bp long, is made of six 46 bp units and a 21 bp truncated repeat sequence. One of the 46 bp unit in repeat 6 differs from the rest by one base at the 3p end ( Fig. 6). In the closely related BusuNPV genome a repeat region with two tandem 58 bp repeats is observed 17 . This 58 bp sequence shares a core region observed in the 48 bp units in HytaNPV genome. Previously, 46 bp repeat units in NeleNPV, a gammabaculovirus that infects the hymenopteran Neodiprion lecontei, was reported 18 . However, sequence comparison showed that they share very low similarity (Supplementary Figure S4).

Conserved Noncoding Element (CNE) in HytaNPV.
We analyzed the HytaNPV genome for the presence of a previously reported, conserved noncoding functional element (CNE) responsible for virus replication in transfected insect cell cultures 19 . This revealed the presence of a CNE in HytaNPV that overlaps with orf5. Comparative sequence analysis revealed the presence of CNE in HytaNPV as well as in 52 other alphabaculovirus genomes (Fig. 7). Multiple sequence alignment of CNEs across alphabaculoviruses revealed several highly conserved nucleotide clusters represented by specifically arranged repeat sequences (Fig. 7).
HytaNPV genes. We analyzed the HytaNPV genome for the presence of genes involved in replication, transcription, viral packaging and other functions based on homology to characterized AcMNPV genes and other sequenced NPV genomes.
Baculovirus genes can be broadly classified as early, late and very late based on the timing of its expression during the lifecycle of the virus. While the early genes use the host RNA polymerase for its transcription, those expressed late during infection use a virally encoded RNA polymerase. Proteins encoded by lef-4 (orf75; 457aa), lef-8 (orf57; 879aa), lef-9 (orf39; 505aa) and p47 (orf29; 393aa) form the virally encoded RNA polymerase  subunits. Additionally, we found six more replication-associated genes, including vlf1 (orf68; 394aa) and lef-5 (orf84; 277aa) that function as initiation factors (Supplementary Table S6). We did not find homologs of lef-12 and lef-10, known to have a role in transcription in the prototype AcMNPV.
Genome-wide analysis of HytaNPV promoters. As similar to many DNA viruses, baculovirus genes are transcribed temporally 25 . The early genes mostly use the host transcriptional machinery while the late and very late genes require viral derived protein for their expression 25 .
The core elements of baculovirus early promoters are those recognized by host RNA polymerase II, and sometimes they include the TATA box motif and an initiator sequence (CAGT). Analysis of the prototype baculovirus AcMNPV show that the early gene promoters contain sequence elements recognized by the host RNA polymerase  26,27 . While these elements are typical, they are not always present in all known early gene promoters in AcMNPV [28][29][30] . The late genes are transcribed by a viral RNA polymerase complex and the transcription is initiated in and around TAAG sequence found within the later promoter [31][32][33][34] .
We scanned sequences 200 bp upstream of HytaNPV ORFs for consensus promoter motifs as previously described 32,35 . The significance of these motifs was assessed by comparing their frequencies in sequences downstream of each ORF. Amongst the 141 ORFs in the HytaNPV genome, we identified 12 ORFs that possessed only the early promoter motif (a TATA box linked with a CAG/TT motif ~30 bp downstream), while 61 ORFs had the late promoter motif only (A/T/GTAAG). Thirty two ORFs contained both the early TATA and late TAAG promoter motifs ( Fig. 8a; Supplementary Tables S1, S7-9), while the remaining 34 ORFs did not contain any recognizable consensus promoter motifs. We also observed a strong correlation for sequences such as TATAAGG that contain both an early TATA box element and late promoter sequences (ATAAG) in combination. This motif was about ~4 times as frequent in the upstream location compared to sequences downstream of ORFs. TATA sequences combined with the initiator sequences CAGT or CATT separated by ~30-bp were ~3 times as frequent in upstream sequences. Furthermore, 65% of the TATA sequences present in the genome were found to be clustered within 100-bp upstream of the ATG codon of each ORF (Supplementary Table S7), unlike TAAG-containing sequences whose location is more broadly distributed (Fig. 8b-d; Supplementary Table S9). We compared the promoter motifs present in HytaNPV ORFs to those annotated near homologous AcMNPV ORFs and found that ~38% of promoter motifs found in HytaNPV genes were also found near homologous AcMNPV genes (33/86 orthologous ORFs had an identical promoter motif class (E, L or E/L)). Additionally, we compared upstream sequences of HytaNPV ORFs that had homologs in two closely related genomes (BusuNPV, SujuNPV) to identify conserved promoter motifs, including a well-known late baculovirus gene, polyhedrin. This analysis revealed a strong conservation promoter motifs upstream of homologous ORFs (Supplementary Figure S6 and  Supplementary Table S1).  Table S1, Supplementary Figure S2). We further examined the concordance between the promoter motif prediction and the RNA-seq data. Out of 12 ORFs with an early-promoter motif, 4 ORFs expressed higher at early time point (24 h) (Supplementary Table S1). This includes p43 (orf96) and egt (orf137), which are also known to express early during AcMNPV infection 36,35 . Among 62 ORFs containing a late-promoter motif, 31 were found elevated late (72 h) during the infection cycle (Supplementary Table S1). As expected, Polyhedrin (orf1), a gene known to be expressed late during infection showed a maximum expression at 72 h (Supplementary Table S1). This is consistent with the expression pattern of Polyhedrin (AcOrf-8) 6 observed in AcMNPV 35 . Among 32 ORFs with both the early-and late-promoter motifs, 29 ORFs (~91%) were found to be highly expressed at both 24 h and 72 h (RPKM > 200). These data confirmed the temporal expression of the HytaNPV genes during the infection cycle.

Discussion
We have obtained the complete sequence of the HytaNPV genome using single molecule sequencing. We found that it belongs to the group II alphabaculoviruses and is closely related to BusuNPV and SujuNPV. The elucidation of the complete sequence of this virus will enable the development of specific PCR based tests that will support the development of a HytaNPV based biopesticide. Such a biopesticide will help reduce the use of chemical pesticides in tea plantations and enhance the quality of tea produced in H. talaca infested areas.

Materials and Methods
Samples. We obtained NPV infected H. talaca larvae from tea fields of Dooars region of West Bengal, India.
Infected larvae were macerated using a sterile mortar and pestle in distilled water. Large particulates and insect debris were removed by centrifugation at 150 g at 4 °C for 5 minutes. Polyhedral inclusion bodies (PIBs) in the supernatant were then pelleted by centrifugation at 20,000 g for 10 min at 4 °C. The PIBs were washed in distilled water three times and pelleted to remove any additional particulate materials. PIBs were characterized using a compound microscope and used for further studies.

Scanning Electron Microscopy (SEM). A 2.5ul drop of the aqueous PIB suspension were placed on
Formvar and carbon-coated TEM grid and adsorbed to the grid for 30 min at RT. The grids were then quickly rinsed with water, negatively stained with 1% uranyl acetate for 30 sec and finally air-dried. Negatively stained PIBs were examined in a JEOL JEM-1400 transmission electron microscope (TEM) at 80 kV. Digital images were captured with a GATAN Ultrascan 1000 CCD camera at magnifications from 3000× to 15000×.
Sequencing and assembly. Sequencing was done using RSII Pacific Biosciences single molecule platform.
Read processing and de novo assembly were performed using HGAP Assembly workflow (version 2.0) 37 as implemented in Pacbio SMRT Portal. Illumina MiSeq short read data were used to polish the PacBio assembly. Briefly, we mapped MiSeq reads onto PacBio assembly sequence using BWA-MEM (version 0.7.10) and called variants using GATK Haplotype Caller (version v3.5) 38 . High-confidence variants, which meet criteria recommended by GATK 39 , were used for manual correction to derive the consensus HytaNPV assembly.

Gene annotation.
We developed a pipeline for calling and annotating ORFs as outlined in Supplementary Figure S1 (code available upon request). Briefly, hypothetical open reading frames (ORFs) were predicted using EMBOSS sixpack program 40 , with at least 50 amino acids. For any pair of ORFs that are overlapped (>40%) in any orientation, the longer ORF was kept and the shorter ORF was discarded if it is not known in baculoviruses. ORFs are reviewed and annotated using NCBI Protein-Protein BLAST algorithm (version 2.2.30+).

RNA-seq.
H. talaca larvae were reared in laboratory from the eggs obtained from a single moth. The larvae were reared on tea leaf. Third instar larvae were placed on tea leaf and sprayed with HytaNPV particles (1 × 10 10 PIBs/ml). Three larvae each were dissected at 0 h, 24 h and 72 h to obtain the gut tissue. Total RNA was isolated from pooled gut tissue corresponding to each time point using Trizol (Thermo Fisher Scientific). Libraries were prepared using TruSeq ® stranded total RNA library prep kit (Illumina) using 1 μg of total RNA as starting material. The final libraries obtained were sequenced on Illumina HiSeq 2500 platform to obtain ~20 million 2 × 100 bp paired end reads for each library.
The reads were mapped to HytaNPV genome sequence using Bowtie (version 0.12.9) allowing for one mismatch within the 60 bp high-quality end of the reads. We observed 3 reads (at 0 h), 17,342 reads (at 24 h), and 57,040 reads (at 72 h) mapped uniquely to HytaNPV genome. Uniquely mapped reads at 24 h and 72 h time points were used to quantify expression (reads per kilobase of exon model per million mapped reads (RPKM)) of HytaNPV genes.
Conserved noncoding element (CNE) detection. Blastn 41 was performed to identify the previously described CNE 19 with parameters -word size = 7, match/mismatch = 1,−1, gap existence/extend = 5,2. The top hit from Blastn for all sequenced alphabaculovirus was then collected and supplied as input to the ClustalW 42 program to identify highly conserved clusters within homologous CNEs from each baculovirus species. Finally, Weblogo 43 was used to generate a consensus CNE sequence across all Alphabaculoviruses.
Promoter motif analysis. Genome sequences 200 bp upstream of each HytaNPV ORF were extracted and searched for known consensus core promoter motifs, namely, TATAA, A/T/GTAAG, and CAG/T using multiple expectation maximization for motif elicitation (MEME) 44 Table S4). All the sequences were aligned using the multiple sequence alignment algorithm clustalW (version 2.1) 42 with default parameters. A phylogenetic tree was constructed using Maximum Likelihood method with Jones-Taylor-Thornton (JTT) model 45 as implemented in the MEGA7 software 46 . Phylogeny test was carried out using Bootstrap method with 1000 replicates.
Availability of data and materials. The HytaNPV genome sequence is available from GenBank under accession number MH261376.