Virtual 2D mapping of the viral proteome reveals host-specific modality distribution of molecular weight and isoelectric point

A proteome-wide study of the virus kingdom, based on 1.713 million protein sequences from 19,128 virus proteomes, was conducted to construct an overall proteome map of the virus kingdom. Viral proteomes encode an average of 386.214 amino acids per protein, and the variation in the number of protein-coding sequences is host-specific. The proteomes of viruses with fungal hosts encoded the largest proteins on average (882.464 amino acids), while the proteomes of viruses with bacterial hosts encoded the smallest (210.912 amino acids). Viral proteomes were found to have a host-specific amino acid composition. Leu (8.556%) was the most abundant and Trp (1.274%) the least abundant amino acid in the collective proteome of viruses. Viruses were found to exhibit a host-dependent molecular weight and isoelectric point of encoded proteins. The isoelectric point (pI) of viral proteins was found to be in the acidic range, with an average pI of 6.89. However, the pI of viral proteins of algal (pI 7.08) and vertebrate (pI 7.09) hosts was in the basic range. The virtual 2D map of the viral proteome from different hosts exhibited host-dependent modalities. The virus proteomes of algal and archaeal hosts exhibited a bimodal distribution of molecular weight and pI, while the virus proteome of bacterial hosts exhibited a trimodal distribution, and the virus proteomes of fungal, human, land plant, invertebrate, protozoan, and vertebrate hosts exhibited a unimodal distribution.

www.nature.com/scientificreports/
The most challenging aspect of deciphering the origin and evolution of viruses is their high sequence divergence, which results from the high rates of mutation, genetic recombination, gene duplication/loss, and horizontal gene transfer that occur in viruses 21,22. The high level of sequence divergence and the smaller number of genes in viral genomes relative to prokaryotic and eukaryotic genomes make it difficult to identify genes that are conserved across and within families of viruses. The mutation rate of viral genomes ranges from 1.5 × 10⁻³ mutations per nucleotide per genomic replication (RNA phage Qβ) to 1.8 × 10⁻⁸ mutations per nucleotide per genomic replication (HSV-1) 22-24. Notably, the mutation rate corresponds to the type of polymerase used in replication 22. RNA viruses that use an RNA-dependent RNA polymerase were reported to mutate faster than viruses that use an RNA-dependent DNA polymerase (reverse transcriptase), such as retroviruses 22. Viruses that use an RNA-dependent RNA polymerase also mutate faster than viruses that use a DNA polymerase 22. Drake et al. proposed a universal mutation rate in microorganisms of 3.4 × 10⁻³ mutations per genome per genomic replication 25. The mutation rate of the ssDNA phage ϕX174, however, was reported to be considerably higher than the level proposed by Drake et al. 26. The high mutation rate makes it challenging to address problems associated with viral diseases. Although the genomic aspects of viruses are frequently studied, the proteome of viruses is poorly understood. Viral genomes, their role in immune modulation, and their use in vaccine development have gained enormous attention. The viral proteome, in contrast, and its composition and function across hosts remain poorly understood.
Proteomics is a very promising approach that can be used to better understand the molecular details of viruses. Increased knowledge of viral proteomes would enable a better understanding of the disease process and could be used to identify new biomarkers for the diagnosis and early detection of diseases and for drug development. Although proteomics has been used for biomarker development, its use in the development of broad-spectrum vaccines remains limited. The diversity of genomic and proteomic sequences in viruses may explain the basis of this problem. Although viral genomes, proteomes, and even the sequences of individual proteins are not conserved, a portion of a protein sequence can be used to develop a broad-spectrum vaccine against viral diseases. Therefore, characterizing the proteomic details of all known viruses is very important. Viruses are host specific, and all viruses require a host to propagate. The question therefore arises of whether the genomic and proteomic content of a virus is also host specific. Because viruses lack conserved DNA or protein consensus sequences, it becomes essential to analyse viral genomes and proteins as they relate to host specificity. It is plausible that viruses use similar molecular or biochemical mechanisms to propagate in specific hosts or host ranges. Understanding the molecular details of different groups of viruses in relation to their hosts may shed more light on the evolution and phylogeny of viruses. Therefore, we conducted a proteome-wide analysis of the virus kingdom to provide details pertaining to the composition and structure of its proteome.
A principal component analysis was conducted to determine the relationship (clustering) between the amino acids (Supplementary Figure 1). The results indicated that Gln, Trp, Met, and Thr clustered together, and that Asn and Ile also clustered together (Supplementary Figure 1, Table 1). The amino acid compositions of human and invertebrate viruses clustered together, as did the compositions of land plant and vertebrate viruses (Supplementary Figure 1). Bacterial and archaeal viruses also clustered near each other (Supplementary Figure 1).
A correlation analysis was conducted to determine the relationship between amino acid composition in viruses that infect different hosts (Fig. 3). The results revealed that the amino acid compositions of viruses from all host groups were positively correlated with one another. The amino acid composition of viruses in algae exhibited a positive correlation coefficient of > 0.90 with that of viruses in archaeal, bacterial, human, invertebrate, land plant, protozoan, and vertebrate hosts (Fig. 3). Weaker positive (Pearson) correlations were observed between algae and fungi; between bacteria and humans and invertebrates; between fungi and humans and invertebrates; and between protozoa and archaea, bacteria, fungi, land plants, and vertebrates. The lowest correlation was observed between fungi and protozoa (0.718), while the highest correlation was observed between archaea and bacteria (0.978) (Fig. 3). When the correlation analysis was done without grouping the amino acids by host species, both positive and negative correlations were observed. Gly and Trp exhibited a positive correlation with Ala, Phe had a positive correlation with Cys, Gln was positively correlated with Glu, Cys and Ser were positively correlated with Phe, Ala was positively correlated with Gly, and Lys, Asn, and Tyr were positively correlated with Ile. Ile, Lys, and Tyr were positively correlated with Asn, Phe and His were positively correlated with Ser, Met was positively correlated with Val, and Ala was positively correlated with Trp (Fig. 4). The highest positive correlation was found between Ile and Asn (0.970), followed by Tyr and Asn (0.955) (Fig. 4). Notably, Cys, Phe, Ile, Lys, Asn, Ser, and Tyr were negatively correlated with Ala. Ala, Asp, Glu, Gly, Met, Gln, Thr, Val, and Trp were negatively correlated with Cys. Cys, Phe, His, Leu, Met, Pro, Arg, Ser, Thr, and Val were negatively correlated with Asp. Ala, Cys, Asp, Glu, Gly, and Trp were negatively correlated with Phe (Fig. 4).
Other negative correlations can also be observed in Fig. 4. The strongest negative correlation was found between Cys and Gly (− 0.915), followed by Arg and Lys (− 0.907) (Fig. 4). Overall, the correlation analysis indicated that the amino acids in viruses associated with algae (0.   Table 2). A correlation (Pearson's) analysis was conducted to determine the relationship between the molecular weight of viral proteins and different hosts (Fig. 5). The results indicated that the molecular weights of viral proteins in human, invertebrate, land plant, protozoan, and vertebrate hosts were positively correlated (Fig. 5). The highest correlation was found between invertebrate hosts and protozoan and land plant hosts (0.976) (Fig. 5). A strong positive correlation was also found between invertebrate and vertebrate hosts (0.965). In contrast, the molecular weights of viral proteins in fungal, human, invertebrate, land plant, protozoan, and vertebrate hosts were negatively correlated with those of viral proteins found in algal, archaeal, and bacterial hosts (Fig. 5). The strongest negative correlation was between the sizes of viral proteins in bacterial and fungal hosts (Fig. 5).
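As a minimal sketch of the Pearson correlation analyses described above, the following shows how amino acid composition vectors for two host groups could be computed and correlated. The study used dedicated statistical software rather than this code; the function names and the toy sequences are hypothetical and serve only as illustration.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequences):
    """Percentage of each standard amino acid across a list of sequences."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for seq in sequences:
        for aa in seq:
            if aa in counts:  # ambiguous codes such as X or B are skipped
                counts[aa] += 1
    total = sum(counts.values())
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy sequences standing in for two host-grouped proteomes.
host_a = ["MKTLLVAGG", "MSSTTLLKE"]
host_b = ["MKKLLIAGS", "MSTTALLEE"]
r = pearson_r(composition(host_a), composition(host_b))
print(f"Pearson r between host groups: {r:.3f}")
```

The same `pearson_r` function applies unchanged to per-host molecular weight summaries; only the input vectors differ.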
The pI of viral proteomes is in the acidic range. The average pI of the virus proteome was 6.89, which is in the acidic range. When viral proteomes were grouped by host, however, a host-related pI distribution was observed. The bacterial virus proteome exhibited the lowest average pI (6.3), while the land plant virus proteome had the highest average pI (7.5) (Supplementary Table 2). The average pI of the algal virus proteome was 7.08 and that of the vertebrate virus proteome was 7.09, both of which are in the basic range. In contrast, the average pI values of the archaeal (pI 6.42), bacterial (pI 6.3), fungal (pI 6.96), human (pI 6.85), invertebrate (pI 6.96), and protozoan (pI 6.85) virus proteomes were all in the acidic range.
The decreasing order of the average pI of virus proteomes found in different hosts was land plants > vertebrates > algae > fungi > invertebrate > human > protozoa > archaea > bacteria (Supplementary Table 2). The highest-pI protein in the virus kingdom (pI 13.364; BAH72951.1) was found in the bacterial virus Ralstonia phage phiRSL1, while the lowest-pI protein (pI 2.448; ATZ81137.1) was found in the protozoan virus Bodo saltans virus. The average pI of the acidic pI viral proteomes was 5.507, and the acidic pI range was 5.001-5.9 (Supplementary Table 3). A principal component analysis (PCA) of acidic pI proteins in the entire virus proteome was conducted. The PCA revealed that the acidic pI viral proteomes of algae, invertebrates, archaea, vertebrates, and protozoa clustered together. The acidic pI viral proteomes of protozoa and vertebrates also clustered together (Supplementary Figure 2). The acidic pI viral proteomes of fungi and land plants clustered near each other, while the acidic pI viral proteomes of humans and bacteria also clustered in proximity to each other (Supplementary Figure 2). Collectively, the data indicate that acidic pI viral proteins exhibit a positive correlation with different hosts.
The average of basic pI viral proteomes was pI 8.439 and the pI range was pI 8.

Virtual 2D map of the virus proteome exhibits a host-dependent modality. A virtual 2D map of the proteome of the virus kingdom was constructed utilizing molecular weight and isoelectric point data (Fig. 6). The virtual 2D proteome map of the virus kingdom exhibits a bimodal distribution in the molecular weight and isoelectric point of viral proteomes (Fig. 6). We subsequently determined if the proteomes within different hosts also exhibit a bimodal distribution. A bimodal distribution was revealed in algal and archaeal virus proteomes, while the viral proteomes of protozoan and bacterial hosts exhibit a trimodal distribution. Lastly, the viral proteomes of fungal, human, land plant, invertebrate, and vertebrate hosts exhibited a unimodal distribution (Fig. 6).
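A virtual 2D map of this kind places each protein at a point defined by its pI and its molecular weight. The sketch below shows one way such data could be binned into a density grid before plotting. The function name, the bin sizes, and the use of a logarithmic molecular weight axis (chosen here to mimic gel migration) are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter
from math import floor, log10

def virtual_2d_grid(proteins, pi_step=0.5, log_mw_step=0.25):
    """Bin (pI, molecular weight in kDa) pairs into grid cells.

    Returns a Counter keyed by (pI bin index, log10(MW) bin index).
    Bin sizes are illustrative defaults, not values from the study.
    """
    grid = Counter()
    for pi, mw_kda in proteins:
        cell = (floor(pi / pi_step), floor(log10(mw_kda) / log_mw_step))
        grid[cell] += 1
    return grid

# Hypothetical proteins: (pI, molecular weight in kDa).
example = [(5.2, 20.0), (5.4, 22.0), (9.1, 100.0)]
grid = virtual_2d_grid(example)
for (pi_bin, mw_bin), n in sorted(grid.items()):
    print(f"pI bin {pi_bin}, log10(MW) bin {mw_bin}: {n} protein(s)")
```

Dense regions of the grid correspond to the clusters that appear as modes in the 2D map.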

Discussion
The  27,28 . Interestingly, we found that fungal viruses encode larger proteins than the nuclear-encoded proteins of their hosts. This raises the question of why and how this could occur. What are the factors that lead to the production of viral proteins that are larger than those encoded by their fungal hosts? Since viruses do not encode a large number of proteins, perhaps they encode larger proteins that contain a greater number of functional domains, enabling each protein to perform several functions. Viruses were also found to encode small peptides. The smallest peptide of the virus kingdom was the pentapeptide M-S-S-T-T, while the smallest peptide encoded in the plant kingdom is the tetrapeptide M-I-M-F and the smallest in the fungal kingdom is the dipeptide M-V 27,28. It would be interesting to study the functional aspects of small viral peptides and their molecular activities in host cells, as small peptides have been determined to perform several biological functions 29,30. These small peptides can regulate hormone levels and may act as hormones or other bioactive agents, such as biocides and anti-cancer drugs 31-35. Thus, the presence of these small peptides can have an enormous impact on their hosts. The average amino acid composition of the virus proteome indicated that Leu (8.556%) is the most abundant and Trp (1.274%) the least abundant amino acid. Leu is also the most abundant amino acid in the nuclear-encoded plant (9.62%) and fungal (9.115%) proteomes 27,28. While Trp (1.28%) is the least abundant amino acid in the nuclear-encoded plant kingdom proteome, Cys (1.267%) is the least abundant amino acid in the nuclear-encoded fungal kingdom proteome 27,28. Cysteine forms disulphide bonds in proteins and provides conformational stability. Disulphide bonds are typically present in extracellular proteins and rarely in intracellular proteins 36.
The low percentage of Cys in viral proteomes may reflect their need to be targeted to intracellular compartments of the cell. We suggest that viruses encode a low percentage of Cys residues so that their encoded proteins can be readily targeted to the cytoplasm of the host cell. The most and least abundant amino acids in the virus proteome correspond more closely to those of the nuclear-encoded plant proteome than to those of the fungal proteome. Analysis of the molecular weight of viral proteomes revealed that the proteomes of fungal viruses encode the heaviest proteins, with an average molecular weight of 98.911 kDa, while the proteomes of bacterial viruses encode proteins with an average weight of only 22.942 kDa. The molecular weight of a protein has a significant impact on translocation across cellular compartments and also represents an important aspect of the functional role of a protein. Viruses are able to manipulate host cells to transcribe and translate large proteins that are not needed by the host cells. Understanding the molecular components responsible for the translation of such large proteins in host cells can be crucial to eliminating virus-mediated diseases in host organisms. The largest protein encoded in the human genome contains more than 27,000 amino acids and possesses at least 34 functional domains, while the largest protein encoded by viruses and produced in human hosts is the ORF1ab polyprotein (AAT98578.1) of the human coronavirus, which contains 7182 amino acids (Supplementary file 1). The largest protein of the virus kingdom is the PSCNV polyprotein, which contains 17 putative functional domains (Supplementary file 1). It is plausible that the large number of functional domains in this viral protein reflects a strategy of encoding as many functional domains as possible in a single protein so that a stable virus with only a few proteins can exist.
Larger protein molecules provide a structural and functional benefit to an organism 37 , a premise that supports viruses encoding a few large proteins with multiple domains rather than many smaller proteins each with a different function.
The average isoelectric point of the virus proteome is in the acidic pI range (pI 6.89), which is similar to the average pI of the nuclear-encoded plant and fungal proteomes 27,28. The average algal virus proteome had a pI of 7.08 and the vertebrate viral proteome had a pI of 7.09, both of which are in the basic range. Although the collective virus proteome encodes a higher percentage of acidic than basic pI proteins, it encodes only a few proteins with a neutral pI (7.0). The percentage of neutral pI viral proteins in algal, archaeal, bacterial, fungal, human, invertebrate, plant, protozoan, and vertebrate hosts was 0.159%, 0.079%, 0.115%, 0.285%, 0.231%, 0.241%, 0.215%, 0.16%, and 0.241%, respectively. The archaeal virus proteomes had the lowest percentage of neutral pI proteins, while fungal virus proteomes had the highest percentage of neutral pI proteins.
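The acidic, neutral, and basic percentages discussed above follow from a simple classification of each protein's pI around 7.0. The following is a minimal sketch of that tally. The function name, the tolerance used to decide what counts as exactly neutral, and the example values are assumptions made for illustration; the study does not state the tolerance it applied.

```python
def pi_class_percentages(pi_values, neutral=7.0, tol=0.0005):
    """Percentage of proteins with acidic, neutral, and basic pI.

    `tol` is a hypothetical tolerance for treating a pI as exactly neutral
    (here, values that round to 7.000 at three decimals).
    """
    counts = {"acidic": 0, "neutral": 0, "basic": 0}
    for pi in pi_values:
        if abs(pi - neutral) <= tol:
            counts["neutral"] += 1
        elif pi < neutral:
            counts["acidic"] += 1
        else:
            counts["basic"] += 1
    total = len(pi_values)
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical pI values for four proteins.
shares = pi_class_percentages([5.0, 6.5, 7.0, 9.2])
print(shares)
```

Applying the function separately to each host group would give per-host percentages comparable in form to those listed above.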
The virus proteome exhibited a host-dependent modal distribution of molecular weight and isoelectric point. Algal and archaeal virus proteomes exhibited a bimodal distribution, while protozoan and bacterial virus proteomes exhibited a trimodal distribution. Fungal, human, land plant, invertebrate, and vertebrate viral proteomes exhibited a unimodal distribution. Nuclear-encoded proteomes of the plant kingdom have been reported to exhibit a trimodal distribution, while fungal proteomes exhibit a bimodal distribution 27,28. Schwartz et al. previously reported a trimodal distribution for the pI and molecular weight of all eukaryotic proteins 38. The pH of the cytoplasm of prokaryotic and eukaryotic cells is usually in the acidic range. Viruses may therefore have evolved to produce proteins that have an acidic pI. However, the pH of chloroplasts is in the basic range, and as a result, proteins with a basic pI may be targeted to specific sub-cellular locations. Kiraga et al. reported that the pI of a protein is related to its modality and taxonomic association 39. Ecological niche and sub-cellular localization, however, have also been reported to play a critical role in determining the pI of a protein 39. Schwartz et al. reported that the pI of a protein is correlated with its sub-cellular localization and that the pH of the cytosol is below 7 38. Viruses, however, do not have any cellular or sub-cellular components of their own, and hence the pI of their proteins is not determined by these factors. Instead, the selection pressure on the pI of a viral protein is based on the cellular environment of its host. Modification of the host pH environment by external factors could assist in eliminating a virus from its host and decrease the virulence and pathogenicity of a virus. Although the estimated pI of a protein can differ from its value in vivo, estimated and experimentally determined pI values are typically in close agreement when evaluated on a 2-DE gel 40.
The variation in the modal distribution of virus proteomes may be attributed to host-dependent selection pressure. The observed variation in the modal distribution of different viral proteomes based on their hosts is intriguing. Protozoan and bacterial virus proteomes exhibited a trimodal distribution, while algal and archaeal virus proteomes displayed a bimodal distribution. Although protozoa are eukaryotic organisms and bacteria are prokaryotes, their virus proteomes share a trimodal distribution of molecular weight and isoelectric point. Although archaea are also prokaryotes, their virus proteomes exhibit a bimodal distribution of molecular weight and pI. The proteomes of multicellular eukaryotic hosts, including fungi, humans, land plants, invertebrates, and vertebrates, display a unimodal distribution. This perhaps explains why the proteomes of viruses that infect and reside in multicellular eukaryotic hosts also display a unimodal distribution.
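The terms unimodal, bimodal, and trimodal used above refer to the number of distinct peaks in a distribution. The source does not state how modality was scored, so the following is only one hypothetical way to count peaks along a single axis (for example pI): build a histogram, smooth it with a moving average, and count local maxima above a height threshold. The function name and all parameter defaults are assumptions.

```python
def count_modes(values, lo, hi, bins=28, window=3, min_frac=0.2):
    """Count peaks in a smoothed histogram of `values` over [lo, hi).

    A peak is a local maximum (flat tops count once) whose smoothed height
    is at least `min_frac` of the tallest peak. All defaults are illustrative.
    """
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        if lo <= v < hi:
            hist[min(int((v - lo) / width), bins - 1)] += 1

    # Moving-average smoothing; the window is truncated at the edges.
    half = window // 2
    smooth = []
    for i in range(bins):
        chunk = hist[max(0, i - half): i + half + 1]
        smooth.append(sum(chunk) / len(chunk))

    threshold = min_frac * max(smooth)
    modes = 0
    i = 0
    while i < bins:
        j = i  # extend over a plateau of equal heights
        while j + 1 < bins and smooth[j + 1] == smooth[i]:
            j += 1
        left = smooth[i - 1] if i > 0 else float("-inf")
        right = smooth[j + 1] if j + 1 < bins else float("-inf")
        is_peak = smooth[i] > left and smooth[i] > right
        if is_peak and smooth[i] > 0 and smooth[i] >= threshold:
            modes += 1
        i = j + 1
    return modes

# Hypothetical pI values forming two clusters, near pI 5.6 and pI 9.6.
pi_values = [5.1] * 10 + [5.6] * 20 + [6.1] * 10 + [9.1] * 8 + [9.6] * 16 + [10.1] * 8
print(count_modes(pi_values, lo=0.0, hi=14.0))
```

The result depends on the bin count, window, and threshold, so any such count should be checked against the plotted map rather than taken at face value.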

Conclusion
Viruses exhibit highly diverse and heterogeneous genomes and replication mechanisms. They tend to undergo a high rate of mutation that contributes to a high degree of genetic and proteomic variation. Single-stranded viruses mutate more frequently than double-stranded viruses. The reverse-transcribing DNA hepadnavirus can undergo a high degree of genetic mutation that contributes to high genomic and proteomic diversity. In addition, they lack proofreading and DNA repair mechanisms. Although these molecular mechanisms play an important role in contributing to higher genetic and proteomic diversity, ecological and demographic factors are also responsible for the higher genetic diversity. Natural selection and random genetic drift exert great pressure on viruses to evolve towards greater genetic diversity. Although there is great diversity among virus populations, the virus population structure for any particular host can be useful for developing a broad range of antiviral agents and vaccines using artificial intelligence and machine learning approaches. The virus proteome study is just a starting point for functional studies, and the discoveries uncovered through the proteomic study need extensive further study to determine their functional significance. Virus proteome-based application tools need to be developed to obtain valuable information on viral pathogenesis, viral life cycles, and cellular functions. The amino acid composition, isoelectric point, and molecular weight of virus proteins can be very valuable for the development of such tools in the future.

Materials and methods
Sequence retrieval and calculation of molecular weight and isoelectric point. All the protein sequences of the virus proteome were downloaded from the National Center for Biotechnology Information (NCBI). The virus proteomes were downloaded based on their hosts. The viral hosts were algae, archaea, bacteria, fungi, humans, invertebrates, land plants, protozoa, and vertebrates. In total, 19,128 virus proteomes were downloaded, which constituted 1.713 million predicted protein sequences. The downloaded protein sequences of the virus proteome were subjected to analysis of molecular weight and isoelectric point. The molecular weight and isoelectric point of the virus proteins were calculated using the Python-based command-line "protein isoelectric point calculator" (IPC Python) on a Linux-based platform. The source code was used as described by Kozlowski 41. All the analyses were conducted based on the host groups to understand the host-specific relationships and/or differences. The calculated molecular weight and isoelectric point of the virus proteome were further processed using Microsoft Excel 2016.
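To illustrate the kind of calculation described above, the following minimal sketch computes an average molecular weight and an estimated pI from a protein sequence. It is not the IPC implementation used in the study: the pKa values are one illustrative textbook-style set chosen as an assumption (IPC uses its own optimized sets), and the function names are hypothetical.

```python
# Average residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini

# Illustrative pKa values (an assumption, not the set used by IPC).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6

def molecular_weight(seq):
    """Average molecular weight in Da; raises KeyError on non-standard residues."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def net_charge(seq, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Estimate pI by bisection; net charge decreases monotonically with pH."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged, so the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2

peptide = "MSSTT"  # the pentapeptide mentioned in the Discussion
print(f"MW = {molecular_weight(peptide) / 1000:.3f} kDa, "
      f"pI = {isoelectric_point(peptide):.2f}")
```

Because predicted pI depends on the pKa set, values from this sketch will generally differ somewhat from those produced by IPC.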
Statistical analysis. Statistical analysis was conducted to understand the similarities and differences between the virus proteomes originating from different hosts. Principal component analysis (PCA) is used for exploratory data analysis and for building predictive models by projecting each data point onto a small number of principal components. It identifies the directions that maximize the variance of the projected data. We therefore conducted PCA to understand the similarities and differences in the amino acid composition of the virus proteomes. PCA was also used to understand the similarities and variances in the acidic and basic pI proteins of the virus proteomes. The statistical software Unscrambler version 10.4 was used to conduct the principal component analysis. The NIPALS (nonlinear iterative partial least squares) algorithm was used to compute the PCA with 100 iterations. The ratio of calibrated to validated residual variance was 0.5 and the ratio of validated to calibrated residual variance was 0.75. Correlation and regression analysis was conducted using the statistical software JASP version 0.14.1.0. Pearson's correlation (r) was used to run the correlation-regression plot with a 95% confidence interval (p < 0.05) and a 95% prediction interval (p < 0.05). The correlation heat-map plot was constructed in JASP version 0.14.1.0 using Pearson's correlation (r) with 95% confidence and prediction intervals (p < 0.05). The image of the virtual 2D proteome map of the virus proteome was constructed using the scatterplot online platform (https://scatterplot.online/).
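For readers unfamiliar with the iterative PCA algorithm named above, the sketch below extracts the first principal component by alternating between loadings and scores until the scores stop changing. It is a simplified illustration under stated assumptions, not the Unscrambler implementation: the function name, the convergence tolerance, and the toy data are hypothetical, and only the 100-iteration cap is taken from the text.

```python
from math import sqrt

def nipals_first_component(rows, max_iter=100, tol=1e-9):
    """First principal component of `rows` (list of equal-length lists).

    Returns (scores, loadings). Columns are mean-centered first. Assumes at
    least one column is not constant; otherwise a division by zero occurs.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]

    # Start the scores from the column with the largest variance.
    start = max(range(m), key=lambda j: sum(row[j] ** 2 for row in x))
    t = [row[start] for row in x]

    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project each column onto the current scores, then normalize.
        p = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project each row onto the loadings.
        t_new = [sum(x[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p

# Hypothetical data lying exactly on the line y = 2x.
scores, loadings = nipals_first_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(loadings)
```

Further components would be obtained by subtracting the outer product of the scores and loadings from the centered data and repeating the loop on the residual.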

Data availability
All the data used in this study were taken from the publicly available NCBI database, and the accession numbers of the data are provided as a supplementary file.