Molecular evolution and genetic variations of  V and W proteins derived by RNA editing in Avian Paramyxoviruses

The newly assigned subfamily Avulavirinae in the family Paramyxoviridae includes avian paramyxoviruses (APMVs) isolated from a wide variety of avian species across the globe. Till date, 21 species of APMVs are reported and their complete genome sequences are available in GenBank. The APMV genome comprises of a single stranded, negative sense, non-segmented RNA comprising six transcriptional units (except APMV-6 with seven units) each coding for a structural protein. Additionally, by co-transcriptional RNA editing of phosphoprotein (P) gene, two mRNAs coding for accessory viral proteins, V and W, are generated along with unedited P mRNA. However, in APMV-11, the unedited mRNA codes for V protein while +2 edited mRNA translates to P protein, similar to members of subfamily Rubulavirinae in the same family. Such RNA editing in paramyxoviruses enables maximizing the coding capacity of their smaller genome. The three proteins of P gene: P, V and W, share identical N terminal but varied C terminal sequences that contribute to their unique functions. Here, we analyzed the P gene editing site, V and W sequences of all 21 APMV species known so far (55 viruses) by using bioinformatics and report their genetic variations and molecular evolution. The variations observed in the sequence and hexamer phase positions of the P gene editing sites is likely to influence the levels and relative proportions of P, V and W proteins’ expressions which could explain the differences in the pathogenicity of APMVs. The V protein sequences of APMVs had conserved motifs similar to V proteins of other paramyxoviruses including the seven cysteine residues involved in MDA5 interference, STAT1 degradation and interferon antagonism. Conversely, W protein sequences of APMVs were distinct. High sequence homology was observed in both V and W proteins between strains of the same species than between species except in APMV-3 which was the most divergent APMV species. The estimates of synonymous and non-synonymous substitution rates suggested negative selection pressure on the V and W proteins within species indicating their low evolution rate. The molecular clock analysis revealed higher conservation of  V protein sequence compared to W protein indicating the important role played by V protein in viral replication, pathogenesis and immune evasion. However, we speculate the genetic diversity of W proteins could impact the degree of pathogenesis, variable interferon antagonistic activity and the wide host range exhibited by APMV species. Phylogenetically, V proteins of APMVs clustered into three groups similar to the recent classification of APMVs into three new genera while no such pattern could be deciphered in the analysis of W proteins except that strains of same species grouped together. This is the first comprehensive study describing in detail the genetic variations and the molecular evolution of P gene edited, accessory viral proteins of Avian paramyxoviruses.


S.No APMV species Genbank Id
Year of Isolation Continued www.nature.com/scientificreports www.nature.com/scientificreports/ used in this study that were either directly collected from NCBI or derived by prediction using DNASTAR software suite are provided in Supplementary Files S1, S2 and S3.
Sequence alignment, comparison and prediction of conserved motifs/domains. Multiple sequence alignments of V and W proteins were performed using the TCOFFEE multiple alignment algorithm, mode 'expresso' and the sequence similarities were colored through ESPript [64][65][66] . All residues/amino acid positions mentioned in the results and discussion correspond to APMV-1 strain KJ808820.1, the strain that appears first in the alignment file. Individual sequences were also analyzed in NCBI's interface, conserved domain (CD)-search 67 and aligned sequences were run in DREME version 5.0.5 software 68 to identify conserved motifs/ domains. The intraclade amino acid percentage identity was estimated using Megalign software from DNASTAR. prediction of nuclear Localization Signal (nLS) and nuclear export Signal (neS) in V and W proteins. The nuclear localization signal (NLS) in V and W proteins of APMV species were identified using online tool, cNLS mapper with a cut-off score of 5.0 that predicted NLS specific to the importin αβ pathway 69 . The presence of nuclear export signal (NES) in V and W proteins of APMV species was predicted using online tool, NetNES 1.1 server that predicted leucine-rich NES using a combination of neural networks and hidden Markov models 70 and using LocNES that predicted the classical NESs in CRM1 cargoes 71 . phylogenetic analysis and evolutionary divergence. Phylogenetic analysis was performed using MEGA7 software. For drawing the phylogenetic trees, evolutionary history was inferred by using the Maximum Likelihood method with JTT matrix-based model 72 for V proteins and Dayhoff matrix based model for W proteins 73 . For drawing the phylogenetic tree of V proteins, bootstrap consensus tree inferred from 500 replicates was taken to represent the evolutionary history of the taxa analyzed 74 . Branches corresponding to partitions, reproduced in less than 80% bootstrap replicates, were collapsed. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) are shown next to the branches. Initial trees for heuristic search were obtained automatically by applying Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then topology with superior log likelihood value was selected. A discrete gamma distribution was used for V proteins tree to model evolutionary rate differences among sites (16 categories (+G, parameter = 2.4693)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 5.67% sites). For drawing the phylogenetic tree of W proteins, a discrete gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 1.1488)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 0.70% sites). Analysis of both the trees involved 55 amino acid sequences. All positions containing gaps and missing data were eliminated. There were a total of 141 and 71 positions in the final dataset for drawing V and W proteins' phylogenetic trees, respectively.
The estimates of evolutionary divergence over sequence pairs between groups were analyzed for V and W proteins of all 21 APMV species using MEGA7 75 . Briefly, based on the maximum likelihood fits of 56 different amino acid substitution models, the final analyses were conducted in MEGA7 using JTT matrix-based model 72 44 APMV -12 KC333050. www.nature.com/scientificreports www.nature.com/scientificreports/ for V proteins and Dayhoff matrix based model 73 for W proteins. The rate variation among sites was modeled with a gamma distribution (shape parameter = 1). The analysis included 55 amino acid sequences. All positions containing gaps and missing data were eliminated.
Selection pressure analysis. The number of nonsynonymous substitutions per nonsynonymous site (dN), the number of synonymous substitutions per synonymous site (dS), and the dN/dS ratios for the nucleotide sequences of V and W proteins of all 21 species were analyzed for the entire sequence and also their shared N terminal and unique C terminal regions. The shared portion in the N-terminus of all the three proteins was considered up to the RNA editing site (KKG motif). The C terminal regions of V and W proteins of all 21 species were considered after RNA editing site (KKG motif). The dN/dS ratio of 21 species of APMV nucleotide sequences were estimated by DnaSP v6.12.03 software 76 . The protein was considered under positive selection or diversifying when the dN/dS ratio is >1 and negative or purifying selection when dN/dS ratio <1. evolutionary rate analysis. To estimate evolutionary rates of different APMV species in V and W nucleotide sequences, the substitution rate analysis was performed by BEAST v 1.10.4 software 77 . The substitution model GTR and site heterogeneity model G + I was found to be the best by MEGA7 and was used here to study the substitution rate of V and W sequences. The tree prior coalescent, constant size was used for individual and all the species. The uncorrelated relaxed clock with lognormal was implemented. The MCMC chain 4 × 10 8 cycles was used to reach the ESS value more than 200 to converge the data except for W proteins of APMV-8 strains where MCMC chain length of 2 × 10 8 cycles was used. The final data analysis was performed using tracer v 1.7.1 software.
All APMVs except APMV-11 expressed P protein from unedited mRNA. The editing site of APMV-11 resembled that of other paramyxoviruses that insert 2G for generating P mRNA or the 'genomic V' viruses 12 . The P protein of APMV-11 is derived from 2G nucleotides insertion, W protein from single G nucleotide addition and unedited mRNA expresses the V protein 17 . The V and W protein sequences of the other 20 APMV species were predicted by insertion of single G and two G nucleotides at the P gene RNA editing site, respectively.
Amino acid sequence analysis: percentage identity and conservations. The V and W protein sequences of all 21 APMV species shared common N terminal region with P protein. The variations in the amino acid sequences were minimum within the first 60 amino acids in N terminal region. The N-terminal portions of P, V and W proteins of metaavulaviruses and orthoavulaviruses showed closer identity than paraavulaviruses. In ortho and paraavulaviruses, the N-and C-terminal regions showing high homology up to 93%. Metaavulaviruses showed 100% identity in all their C-and N-terminal regions. The C terminal region of W-protein sequences showed 0.0-100% identity, as both the amino acid composition and length variations at C-terminal portion for all the species were higher ( Table 2). As described previously, the soyuz1 and soyuz2 motifs were observed within N terminal region in all APMVs except in APMV-3 strains 78 . Additionally, conserved domains (CD) were predicted in APMV-1 mesogenic strain Komarov (CD between aa 25 to 167) for large tegument protein UL36 (superfamily member PHA03247), in APMV-14 (CD between aa 31 and 144) for gene regulated by oestrogen in breast cancer-GREB1 (superfamily member, pfam15782) and in APMV-21 (CD between aa 53 and 120) for Tumor necrosis factor receptor superfamily member cd13415. comparison of V protein sequences of ApMV species. The V protein of APV-C was the shortest (221 aa, MW: 23.73 kDa) and that of APMV-21 was the longest (304 aa, MW: 31.98 kDa) among the 21 APMV species. The length of V protein (in aa) conserved within species was as follows: APMV-2 strains (232 aa), APMV-4 strains (224 aa), APMV-5 strains (277 aa), APMV-8 strains (238 aa), APMV-10 strains (246 aa) and APMV-13 strains (241 aa). However, in APMV-1, -3 and -6, variation in the length of V protein was observed between strains within the same species. Between species, the similarity in the V protein length was observed as follows: www.nature.com/scientificreports www.nature.com/scientificreports/ V protein length of 252 aa was observed in APMV-14, -15 and one strain of APMV-3; APMV-11 and -5 showed 277 aa long V protein; APMV -9 and -20 had 263 aa long V protein while the V protein length of APMV-16 and two lentogenic strains of APMV-1 were 245 aa. The lowest amino acid identity (10.4%) was observed between V proteins of APMV-3 strain Wisconsin and APV-A. The lowest amino acid divergence was noticed between APV-B and APV-C (70.4) and both showed identity of 53.7% at amino acid level which was the highest between APMV species (Supplementary File S4).
The multiple sequence alignment of V protein sequences of APMV species revealed higher amino acid conservations in both N (majorly in the first 60 amino acids) and C terminal regions (Fig. 1). All viruses in this study had the following conserved motifs similar to V proteins of other known paramyxoviruses (i) KKG motif in the N terminal region, which is the coding sequences at the P gene mRNA editing site (residues 132-134, corresponding to APMV-1 strain KJ808820.1, the first strain in the alignment file) except in APMV-3, APMV-4, APMV-12, APMV-13, APMV-14 and APMV-20 (ii) HRRE motif (residues 177 -180, corresponding to APMV-1 strain KJ808820.1), (iii) WCNP motif (residues 195-198, corresponding to APMV-1 strain KJ808820.1) and (iv) conserved seven cysteine-rich domain. Interestingly, the following amino acids were also conserved in the C terminal region in majority of APMV species with few exceptions: Proline residues at five positions: APMV-1 strains analyzed in this study, had the longest unique C terminal region when compared to other APMV species (Table 1 and Fig. 2).
The lowest amino acid percentage identity (2.4) was observed between W proteins of APMV-12 and APMV-3 strain Netherland. Incidentally, APMV-3 strain Netherland also showed the lowest amino acid identity with W proteins of other APMV species. The lowest amino acid divergence (83.7) was noticed between APMV-1 isolate HN1007 (KX761866.1) and APMV-16, their W protein amino acid identity was 48.6% which was the highest homology observed between APMV species (Supplementary File S5).
The phylogenetic tree obtained from W protein sequences analysis showed clustering of strains of the same species (Fig. 4). The evolutionary distance analyses of W proteins of APMV species revealed that APMV-3 species is more divergent than other APMV species. The highest evolutionary divergence was noticed between APMV-3 & APV-C species (10.322) followed by APMV-3 & APMV-13 (8.594) and APMV-3 & APMV-12 (8.270). The lowest divergence was observed between APMV-9 & APMV-21 (0.400) followed by APV-A & APV-B (0.407). The distance between the strains of the same species was more in APMV-3 (0.619) followed by APMV-1 (0.256) which was also apparent from their lower percentage of amino acid homology (Table 5).
www.nature.com/scientificreports www.nature.com/scientificreports/ Selection pressure analysis. The dN/dS ratio was used to determine the natural selection pressure acting on the P gene edited products. The dN/dS ratio was estimated by DnaSP v6.12.03 for APMV species that comprised of more than one strain. The dN/dS values were significantly less than 1 for both V and W sequences (complete, N-and C-terminal regions) of most species explaining that they are under negative selection pressure. Only the C terminal region of V proteins of APMV-3 strains showed positive selection with dN/dS> 1 ( Table 6). evolutionary rate analysis. The substitution rate of the V and W nucleotide sequences of APMV species that comprised of more than two strains were estimated by uncorrelated relaxed clock with lognormal using BEAST software. APMV-10 comprised of four strains, which were 98.65% to 100% identical to each other and hence the substitution rate could not be determined. The substitution rate was highest in APMV-13 followed by APMV-2, APMV-6 for both V and W proteins. The overall substitution rate was 7.37 × 10 −5 for V protein and 8.07 × 10 −5 for W protein (Table 7).

Discussion
Avian paramyxoviruses are known to infect a variety of bird species across the globe. Currently 21 species (previously called as serotypes) of APMVs are characterized and more viruses could be identified in future with improved viral surveillance programs. Paramyxoviruses with their small genome have a unique strategy of maximizing their genomic information by expressing viral proteins through co-transcriptional RNA editing. This helps to avoid error catastrophe caused by higher mutation rates often associated with larger genomes. Additionally, these viruses follow the 'rule of six' for efficient replication. Though, detailed studies on APMV structural genes and their complete genomes are available, a comparative knowledge of their accessory proteins expressed through RNA editing is lacking. In this study, using bioinformatics approach, we analyzed the P gene editing site, predicted and studied the protein sequences of edited products-V and W, of all 21 APMV species (55 viruses) known till date.
The hexamer phasing at the P gene editing site within each virus group is conserved 7 . We observed conserved hexamer phasing between certain APMV species and also, within each APMVs except in APMV-3 and for one strain of APMV-6. The hexamer phase is known to regulate the mRNA editing pattern, though subtle, it is important; for example, in human and bovine parainfluenza virus type 3 (PIV-3) in which the hexamer positions at P gene editing site are 2 to 5, higher mRNA editing frequency (~70%) and more number of G insertions (1 to 6 at equal frequencies) are observed while least mRNA editing ~30% with only 1 to 3 G insertions occur in Sendai virus wherein hexamer phase position is 1 at the P gene editing site 7,79,80 . Based on the hexamer phasing position, it is anticipated that, in all APMVs except in APMV -20, the editing frequency could be extensive with possibilities of more number of G insertions. However, the cis-acting sequence of P gene editing site in APMV-20 ( 3' AA), suggests higher mRNA editing frequency and increased number of G insertions as reported in human and bovine PIV-3 81,82 . Thus APMVs seem to follow PIV-3 RNA editing phenotype.
Another interesting observation is the unique editing site sequence of APMV-11 ( 3' A 4 UUCUUC 5 ), in which the unedited mRNA translates to V protein and it has been suggested that 2G insertions in mRNA translates to P protein 17 . In rubulaviruses with P gene editing site sequence of 3' A 3 UUCUC 4 , realignment of the nascent mRNA/ template hybrid during 1G insertion would mean non permissible A:C base pairing hence the minimum insertion expected is 2G 83 . The base pairing between the nascent chain and the template genome to form a hybrid is important to prevent transcriptional slippage by the polymerase 80 . Similarly, in APMV-11, 1G and 2G insertions would lead to unstable A:C base pairing, hence 3G insertion (V protein) could be the minimum number of insertions expected, also, while a 4G insertion would translate to W protein, a 5G insertion would lead to P protein synthesis. It needs to be explored if APMV-11 expresses more V protein (from both unedited mRNA and 3G insertions) than other paramyxoviruses. Furthermore, it will be interesting to study if deletions in addition to G insertions could happen in APMV-11 and other APMVs with longer C runs at the editing site as described previously with recombinant Sendai virus and PIV-3 minigenomes 83 .
Three factors, the editing site (sequence and the length of C runs), the type of sequences immediately upstream of the editing site (cis-acting sequence) and the hexamer phase positions are known to decide the editing phenotype (i.e. number of G insertions, deletions and frequency of mRNA editing) which further can influence the virus pathogenicity 81 Table 2. Intraclade percentage amino acid sequence identity for P-gene products in APMV species. P complete, V complete and W complete refer to the entire protein sequences. N terminal refers to N terminal region of respective proteins shared with P protein and C terminal refers to the unique C terminal region of respective proteins. The amino acid sequences of a particular protein of all virus strains, within the particular genus, were analyzed and the percentage amino acid identity was calculated using Lasergene DNASTAR MegAlign software.
www.nature.com/scientificreports www.nature.com/scientificreports/ at the editing site and also (iii) the sequences immediately upstream of the editing site, all of which will determine the expression levels and relative proportions of P, V and W proteins in the APMVs which in turn could explain their differences in replication, pathogenicity and virus-host interactions.   Table 3. Predicted nuclear localization (NLS) and nuclear export signals (NES) in W (3a) and V (3b) proteins of 21 APMV species.
www.nature.com/scientificreports www.nature.com/scientificreports/ With respect to the length and amino acid composition of V and W proteins, there were huge variations between species than within species, which was also reiterated by their dN/dS estimates. The V proteins were more conserved than W proteins. Higher sequence identity for V proteins was observed between the strains of the same species (exception was APMV-3) more often than between species. Phylogenetically, V protein analysis of APMV species grouped viruses similar to the individual gene-based phylogeny 2 , all the members of genus Orthoavulavirus clustered into one group and all members of genus Metaavulvirus along with one of the avian paraavulaviruses (APMV-4) formed the second group while the avian paraavulavirus, APMV-3, formed an outgroup. This was further affirmed by evolutionary distance analysis. The phylogenetic analysis and the evolutionary distance data of both V and W proteins clearly showed that APMV-3 strains are the most divergent.
The N terminal region of V and W proteins which is shared with P protein, showed highest conservation among APMVs. In their C terminal region, the V proteins of paramyxoviruses carry conserved arginine and isoleucine residues upstream of highly conserved seven cysteine residues (zinc binding domain), known to play important roles in MDA5 interference, STAT1 degradation and blocking interferon signaling to evade host immunity [86][87][88] . The V proteins of all 21 APMV species analyzed in this study had the seven cysteine residues and remarkably, many other amino acids were also conserved in their C terminal region. Though similar observations have been made earlier in other paramyxovirus V proteins, their functional importance is unknown yet 86 .
The V and W genes were found to be under negative selection pressure with dN/dS <1 in all the species. This shows the conserved nature of the non-structural viral proteins within the species and probably indicates their functional importance, which is yet to be completely explored. Furthermore, the substitution rates of APMV species determined by molecular clock was varying across the species and was higher between species than within species. The substitution rates for W proteins was higher than V proteins except for APMV-1, where more strains were Figure 2. Comparison of C terminal region of W proteins of APMV species. The C terminal region is considered beyond the conserved KKG motif (please refer Fig. 1). Number of aminoacids in the unique C terminal is mentioned within parenthesis. The basic amino acids (R, K and H) are bolded and enlarged. Amino acid residues that are conserved between strains of the same species are bolded. www.nature.com/scientificreports www.nature.com/scientificreports/ Figure 3. Phylogenetic tree derived from analysis of V proteins of APMV species by Maximum Likelihood method. The evolutionary history was inferred by using the Maximum Likelihood method based on the JTT matrix-based model. The bootstrap consensus tree inferred from 500 replicates is taken to represent the evolutionary history of the taxa analyzed. Evolutionary analyses were conducted in MEGA7. All the orthoavulaviruses clustered together, all metaavulaviruses grouped along with APMV-4 (Paraavulavirus) while APMV-3 strains (Paraavulavirus) formed a separate branch. Table 4. Estimates of Evolutionary Divergence over Sequence Pairs between Groups, analyzed for V proteins of 21 APMV species. The number of amino acid substitutions per site from averaging over all sequence pairs between groups is shown. Standard error estimate(s) are shown above the diagonal. Analyses were conducted using the JTT matrix-based model. Evolutionary analyses were conducted in MEGA7. www.nature.com/scientificreports www.nature.com/scientificreports/ available for comparison. Though the substitution rate was slightly higher for V nucleotide sequence (5.15 × 10 −4 ) of APMV-1 compared to W nucleotide sequence (1.0 × 10 −4 ), it did not lead to changes in amino acid sequence as evident from dN/dS ratio estimates suggesting negative selection pressure. The higher conservation of V protein sequence implies its significant role in virus biology such as replication, pathogenesis and immune evasion.

APMV species APMV-1 APMV-2 APMV-3 APMV-4 APMV-5 APMV-6 APMV-7 APMV-8 APMV-9 APMV-10 APMV-11 APMV-12 APMV-13 APMV-14 APMV-15 APMV-16 APV-A APV-B APV-C APMV-20 APMV-21 D istance Std. Err
In contrast to V proteins, the APMV W proteins were highly disordered, showed little sequence conservation when compared to V proteins and their divergence values were higher when compared to V proteins. The evolutionary data analysis of W proteins suggested higher sequence identity among strains of same species and higher variability between species. There were no conserved sequences or motifs in the C terminal region of W proteins except that most them carried large number of basic amino acids suggesting W protein to be highly basic as www.nature.com/scientificreports www.nature.com/scientificreports/ described previously 12 . The exceptions were one strain of APMV-1 (KJ736742.1), APMV-4, APMV-7 and APV-C, and all of them had shorter W protein length. This genetic diversity seen in the W proteins may determine the degree of pathogenesis, variable interferon antagonistic activity and the wide host range exhibited by the APMV species.
The likelihood of W mRNA occurrence could be less than that of V mRNA, because of two unstable base pairing created by the two mismatches (2G) during polymerase stuttering. This skepticism becomes more compelling and doubts rise to whether W protein is expressed at all in those APMV species whose predicted W protein sequences have only fewer amino acid residues in their C terminal region-single (in APMV- 7) 89,90 .
The presence of W mRNA of APMV-1 was first accounted in 1993 with a frequency of about 10% 12 , and the W protein expression from APMV-1 lentogenic strain Clone 30 and from APMV-1 lentogenic strain La Sota and velogenic strain SG10 was recently confirmed 16,91] . We had earlier shown that W protein of APMV-1 mesogenic strain Komarov compartmentalized in the nucleus using plasmid system 30 , while the same has also been documented during virus infection in cells in the above two studies. Here, we report NLS and NES of W proteins predicted only in certain APMV species, also, we could identify NLS only in five out of twelve strains of APMV-1 implying that the not all the W proteins of APMV-1 strains localize in the nucleus. The W protein sequence analysis of nearly 1000 strains of APMV-1 in our lab show variations in the W protein length between strains (unpublished data),which was also reported recently in an analysis of 286 strains of NDV 91 , furthermore, W proteins of only about 50% of the strains analyzed by us are predicted to localize into the nucleus (data not shown) leading us to speculate that these differences in W proteins can attribute to the wide spectrum of pathogenicity and virulence observed in Newcastle disease.
Among paramyxoviruses, the W protein of Nipah and Hendra viruses, are the most well characterized. The nuclear localization of W protein of Nipah virus was found to modify p53 expression and activity 92 , sequester inactive STAT1 within nucleus 93 , prevent IRF3 phosphorylation, inhibit IFN signaling mediated both by the virus and TLR-3 29 , modulate host immunity, influence the disease course and viral pathogenesis specifically neurovirulence 27,28,94 . Intriguingly, neither the lack of W protein nor its cytoplasmic localization in APMV-1 strain clone 30 had any effect on viral replication in cell culture 16 . Though no conserved motifs could be identified between the W proteins of APMVs and Nipah virus, it will be interesting to study if similar roles are executed by W proteins of any or all the APMV species. To our knowledge, this is the first comprehensive and comparative evolutionary study of the P gene edited accessory viral proteins of APMVs. The information obtained by this study will enable designing future studies to understand the specific functions of conserved motifs/amino acids of V and W proteins and decipher their evolutionary significance on the virus and as well as on the host. www.nature.com/scientificreports www.nature.com/scientificreports/ www.nature.com/scientificreports www.nature.com/scientificreports/