Icosahedral viruses defined by their positively charged domains: a signature for viral identity and capsid assembly strategy

Capsid proteins often present a positively charged arginine-rich region at the N- and/or C-termini that, for some icosahedral viruses, has a fundamental role in genome packaging and particle stability. These sequences show little to no conservation at the amino-acid level and are structurally dynamic, so they cannot be easily detected by common sequence or structure comparison. As a result, the occurrence and distribution of positively charged protein domains across the viral and the overall protein universe are unknown. We developed a methodology based on the net charge calculation of discrete segments of the protein sequence that allows us to identify proteins containing amino-acid stretches with an extremely high net charge. We observed that among all organisms, icosahedral viruses are especially enriched in extremely positively charged segments (Q ≥ +17), with a distinctive bias towards arginine instead of lysine. We used viral particle structural data to calculate the total electrostatic charge derived from the most positively charged protein segment of capsid proteins and correlated these values with genome charge arising from the phosphates of each nucleotide. We obtained a positive correlation (r = 0.91, p-value < 0.0001) for a group of 17 viral families, corresponding to 40% of all families with icosahedral structures described so far. These data indicated that unrelated viruses with diverse genome types adopt a common underlying mechanism for capsid assembly and genome stabilization based on R-arms. Outliers from a linear fit pointed to families with alternative strategies of capsid assembly and genome packaging.

Significance Statement
Viruses can be characterized by the existence of a capsid, an intricate proteinaceous container that encases the viral genome. Therefore, capsid assembly and function are essential to viral replication. Here we specify virus families with diverse capsid structure and sequence, for which capsid packing capacity depends on a distinctive structural feature: a highly positively charged segment of amino-acid residues, preferentially made of arginine. We also show that segments with the same characteristics are rarely found in cellular proteins. Therefore, we identified a conserved viral functional element that can be used to infer capsid assembly mechanisms and inspire the design of protein nanoparticles and broad-spectrum antiviral treatments.


Introduction
The most common solution that viruses employ to protect their genomes is to assemble a spherical shell composed of multiple copies of only one or a few kinds of proteins. The capsid proteins (CP) interact with the genome and each other, usually following the principles of icosahedral symmetry, where the number of subunits forming the capsid is given by the triangulation number (T) × 60. The second architecture is a helical arrangement of proteins (nucleocapsid proteins, NCP) that interact with the genome (1,2). The mechanisms involved in the assembly of the protein shell and condensation of the viral genome often find direct applications in the fields of drug development and nanotechnology.
Some icosahedral viruses have a high concentration of positively charged amino-acid residues at the extremities of their CPs, known as arginine-rich motifs, poly-arginine or arginine-arms. These R-arms are directed towards the interior of the viral particle, where they make contact with the encapsulated nucleic acid (3). Studies with hepatitis B virus (4), circovirus (5), nodavirus (6), and other models (7,8) have demonstrated that these positively charged domains are essential for interaction with the viral genome and for particle stability. Part of the functional explanation may rely on the counteraction of repulsive forces that result from the negatively charged nucleic acids condensed inside the capsid (9,10). Different groups working with single-stranded positive sense (+)RNA viruses observed that the sum of net charges of all R-arm-containing proteins in a virus capsid correlates with its genome packing capacity, e.g., (10-14). However, for some specific viruses, R-arms have also been implicated in the interaction with cellular membranes promoting particle penetration into the cell (15) or intracellular localization (16,17). In these cases, R-arms can act as localization signals or cell-penetrating peptides (18,19), suggesting that these domains are multi-functional. Despite the notion that R-arms are present in different viruses and are critical components for viral replication and assembly, they have never been formally annotated as a protein domain by widely known resources and databases such as the Pfam protein family database (20) or InterPro (21). Consequently, there is no information on the distribution of R-arms across different organisms or viral families or on their overall amino-acid composition. This broad perspective is necessary to determine if R-arms can be considered a typical functional module of icosahedral viral capsids and if they can be used to infer capsid assembly mechanisms.
R-arms often present low sequence conservation and extensive variation in length, which hampers domain identification by profile hidden Markov model (HMM) protein classification, the method employed by relevant databases such as Pfam (20).
Moreover, R-arms often lie within an intrinsically disordered region, too dynamic or flexible to be resolved in viral capsid structural models generated by X-ray crystallography or cryo-electron microscopy. These attributes complicate the use of traditional approaches for the identification of R-arms in unrelated viruses and sometimes even within a viral family.
Here, to determine the occurrence of positively charged domains among proteins from diverse organisms, including viruses, we analyzed the net charge distribution across the primary structure of proteins deposited in the reviewed Swiss-Prot database (559,052 proteins). Using a program that calculates the net charge in consecutive amino-acid stretches, we observed that icosahedral viruses are enriched with positively charged stretches when compared to Homo sapiens and other proteomes, especially at extreme charge values (≥ +17). The viral capsid segments also present at least 4 times more arginine than lysine, a feature that is not common in cellular proteins. We also made a focused effort to calculate the correlation between the total net charge derived from the positively charged domain and the genome charge for a comprehensive group of viruses with different genome types. We propose that this analysis can be used to predict if the electrostatic interaction between the positively charged domain and the genome is a dominant driving force for capsid assembly and stability.

Results
The viral proteome contains more super-positive stretches than the cellular proteome.
The first step to characterize the charge distribution along different protein sequences was to define the length of the search frame, that is, the number of residues that would be used for net-charge calculation in every consecutive stretch. Positively charged motifs of viruses can lie in rigid patches on the inner capsid surface (e.g., bacteriophage MS2, Leviviridae family (22)), but usually they are within flexible arms at the N-terminus. For this reason, a commonly used criterion for R-arm size determination is the length of the disordered region of the N-terminus as determined by X-ray crystal models (11) or secondary structure prediction software (23). We listed (+)RNA viruses that have been previously analyzed (10,14,23), and noticed that the average unstructured N-terminus is around 30 amino-acid residues (n = 14 families, SD ± 23.71). These observations indicated that this frame was a good starting point for our analysis.
In order to characterize the distribution of positively charged protein stretches in several organisms, we used a program that can screen a protein sequence and calculate the net charge in every consecutive frame of 30 amino-acid residues (Q30res) (24). We analyzed the total Swiss-Prot reviewed proteome (559,052 sequences) and compared it with viral, Drosophila melanogaster, Arabidopsis thaliana, and H. sapiens data sets. In Fig. 1A, we show the frequency distribution of the stretches according to their Q30res.
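The frame-by-frame calculation described above can be sketched as follows. This is a minimal illustration, not the program cited as (24): the function names are hypothetical, the scoring (K and R as +1, D and E as -1, everything else 0) follows the rule given in the methods, and the sketch assumes that "consecutive frames" overlap and advance one residue at a time.

```python
# Illustrative sketch only; names are hypothetical, not the authors' program (24).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # every other residue counts as 0


def window_charges(sequence, frame=30):
    """Net charge of every consecutive frame of `frame` residues.

    Assumption: frames overlap and advance one residue at a time.
    """
    seq = sequence.upper()
    if len(seq) < frame:
        return []  # assumption: proteins shorter than the frame are skipped
    # Charge of the first frame, then update it as the frame slides.
    current = sum(CHARGE.get(aa, 0) for aa in seq[:frame])
    charges = [current]
    for i in range(frame, len(seq)):
        current += CHARGE.get(seq[i], 0) - CHARGE.get(seq[i - frame], 0)
        charges.append(current)
    return charges


def max_window_charge(sequence, frame=30):
    """Highest net charge found in any frame, or None for short sequences."""
    charges = window_charges(sequence, frame)
    return max(charges) if charges else None
```

On a toy peptide with a 3-residue frame, `window_charges("RRRDE", frame=3)` gives `[3, 1, -1]`, so `max_window_charge("RRRDE", frame=3)` is 3.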
Unlike the insect, plant, and human data sets, viruses had more positively charged than negatively charged segments. In Fig. 1B we show the log (base 10) net-charge frequency value for a selected group of proteins (e.g., viral proteins) minus the log of the expected frequency value calculated from the total proteome distribution (i.e., log10 fold-change). Viral proteins were the only class enriched in extremely positively charged segments (charge ≥ +17) when compared to the overall proteome (Fig. 1B, inset with p-values). Positively charged protein stretches can be involved in diverse roles, such as membrane interaction, DNA or RNA binding, and cellular localization signaling (25). All these functions are important for virus replication and must contribute to the charge distribution profile of the viral protein dataset. In order to characterize the charge distribution according to protein function, we grouped the viral proteins following their functional annotation available in the Swiss-Prot database (Fig. 2). As expected, proteins classified in the DNA/RNA binding functional class (e.g., viral transcriptional factors, RNAi suppressors) had more positively charged segments than the total Swiss-Prot proteome (Fig. 2A). However, the "Viral Particle" subset had even higher frequencies of positively charged fragments (Fig. 2A). In Fig. 2B we dissected the viral particle components and observed that the class containing the highest frequencies and broadest distribution of positively charged segments was "Viral icosahedral capsid".
Even when compared to human DNA/RNA binding proteins, viral icosahedral capsid proteins concentrated more positively charged segments than any other analyzed class (Fig. 2C).
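The log10 fold-change described for Fig. 1B can be illustrated with a short sketch. It assumes that the inputs are lists of per-frame net charges for a protein subset and for the reference proteome; the function names are hypothetical and the handling of charge values absent from one data set (they are skipped) is an assumption, not a detail taken from the paper.

```python
import math
from collections import Counter


def charge_frequencies(charges):
    """Relative frequency of each net-charge value in a list of frame charges."""
    counts = Counter(charges)
    total = sum(counts.values())
    return {q: n / total for q, n in counts.items()}


def log10_fold_change(subset_charges, reference_charges):
    """log10(subset frequency) minus log10(reference frequency), per charge value."""
    subset = charge_frequencies(subset_charges)
    reference = charge_frequencies(reference_charges)
    # Assumption: only charge values seen in both data sets are reported,
    # because the log ratio is undefined when either frequency is zero.
    return {
        q: math.log10(subset[q]) - math.log10(reference[q])
        for q in sorted(subset)
        if q in reference
    }
```

A positive value at a given charge means that frames with that charge are more frequent in the subset than expected from the reference distribution; a value of 1 corresponds to a tenfold enrichment.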
Positively charged domains of the icosahedral capsids are mainly involved in capsid assembly and stability.
We hypothesized that by searching for the most positively charged segment in a capsid protein, we could efficiently identify the viral R-arm domains. Because the correlation between total R-arm charge and genome charge has already been demonstrated for a selected group of icosahedral RNA viruses (10,14,23), we decided to generalize this idea for all the icosahedral viruses in our dataset. This would not only validate our R-arm identification method for the previously analyzed (+)RNA viruses but would also reveal how the positively charged domain of viruses with different genome types relates to the capsid packaging capacity. While the theoretical determination of the genome charge (Qgenome) is straightforward (each phosphodiester bond produces one negatively charged phosphate group), the calculation of R-arm total charge is more complicated.
We carefully curated our protein dataset in order to select entries that corresponded to viruses with known capsid structure and complete genome sequence. The total R-arm net charge was calculated by multiplying the maximum net charge value in 30 amino-acid residues found in a capsid protein by the number of subunits forming the capsid (Total Qmax30res). We accounted for deviations in icosahedral symmetry by using the actual subunit copy number (e.g., Papillomaviridae: pseudo T=7, with 72 pentamers of L1 and 72 copies of L2; Geminiviridae: formed by two fused T=1 capsids totaling 110 subunits; Picornaviridae: pseudo T=3, with 60 copies of VP1, VP2, VP3, and VP4). We excluded viruses with complex multicomponent capsids, viruses with complex maturation pathways that involve scaffold proteins, and viruses with uncertain protein copy number per capsid. With these criteria, we eliminated most bacteriophages (except Leviviridae) and complex icosahedral viruses, such as Adenoviridae, Reoviridae, Herpesviridae, etc. The final list (Support Table S1) contained 129 icosahedral viruses from 25 different families and all genome types, except for single-stranded negative sense (-)RNA (all helical viruses) and ssRNA-RT, comprising 57% of icosahedral virus families with known capsid structure (Viperdb). A linear fit with outlier identification indicated that 20 viruses, members of 8 virus families (marked in grey), deviated from the linear fit (Fig. 3A and Support Fig S2). Treating these families as outliers, we analyzed the remaining 103 inliers from 17 families in a correlation analysis. We obtained a Pearson correlation coefficient (r) of 0.91 and a p-value < 0.0001. In order to test the effect of different search frames, we repeated the analysis with segments of 10 and 60 amino-acid residues (Support Fig S2). The best linear fit was obtained with the 30 amino-acid residue frame (Total Qmax30res R² = 0.74; Total Qmax10res R² = 0.42; Total Qmax60res R² = 0.69).
The fits from the 30 and 60 amino-acid residue frames presented similar slopes (Total Qmax30res and Total Qmax60res slopes = 0.66 compared to Qmax10res slope = 0.42), and members from the same families were identified as outliers (Support Fig S2). Therefore, we concluded that the 30 amino-acid residue frame was effectively reporting positively charged domains involved in genome stabilization, while the 10 amino-acid residue frame was too short to capture the entire positively charged domain. Figures 3B and C show the Qmax30res and the Qgenome vs. T, respectively. Interestingly, T=1 ssDNA viruses carried segments with the highest net charge (Fig. 3B), probably to maximize the packaging capacity of the smallest capsids of the viral world. Members from the Anelloviridae and Circoviridae families have genome sizes equivalent to the geminated (T=1*) capsids of Geminiviridae (110 subunits), and even to some T=3 viruses, such as Bromoviridae (see Fig. 3A and 3B and Support Table S1). We conclude that R-arms are a general strategy for capsid assembly and stabilization. Moreover, the identification of outliers indicates alternative functions for these domains and points to assembly strategies less dependent on electrostatic interactions between the genome and the capsid protein (see Discussion).
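The arithmetic behind the capsid and genome charges, and the correlation between them, can be sketched as below. All names are hypothetical. The sketch assumes one negative charge per nucleotide and leaves the nucleotide count (for example, for double-stranded or multipartite genomes) to the caller. It does not reproduce the outlier identification step, and it computes only the Pearson coefficient, not its p-value.

```python
import math


def subunit_count(t_number):
    """Subunits in a strictly icosahedral capsid: T x 60."""
    return 60 * t_number


def total_capsid_charge(qmax, copies):
    """Total Qmax: highest frame charge of the capsid protein times its copy number."""
    return qmax * copies


def genome_charge(n_nucleotides):
    """Qgenome, assuming one negatively charged phosphate per nucleotide."""
    return -n_nucleotides


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

As an invented example (not a virus from the dataset), a T=3 capsid with a Qmax30res of +10 would carry `total_capsid_charge(10, subunit_count(3))`, that is +1800, which could then be compared with the magnitude of `genome_charge(n)` for its genome. For capsids that deviate from strict icosahedral symmetry, the actual copy number would be passed instead of `subunit_count(T)`.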
Next, we examined the composition and location of these positively charged protein segments in viral capsid proteins (Fig 4). We complemented the protein dataset analyzed in Fig 3 (Fig 4, groups 1 and 2) with icosahedral viruses with complex capsids (group 3) and the NCPs of helical viruses, totaling 1,100 entries from 49 virus families. Viruses had more arginine, proline, and tryptophan than the human dataset (Fig 4C).
We looked for recurring patterns or known motifs in these sequences using MEME (data not shown) (26). The program retrieved expected motifs for the human data sets, such as RGG and RGR motifs for the human RNA-binding proteins (27), and zinc fingers and homeobox motifs for the DNA-binding proteins (28). However, for the viral data set, no known nucleic-acid motifs were identified, and the few patterns retrieved by the program matched entries from the same family (not shown). This result confirms the unique structural makeup of viral capsid positively charged domains in relation to other DNA- and RNA-binding proteins.

Discussion
We found that the high frequency of positively charged domains observed in viruses (Fig 1) is due to the existence of icosahedral viral capsids (Fig 2), an extremely specialized quaternary arrangement of proteins and nucleic acids, whose function and structure have no counterpart in cellular organisms. Only 0.1% of all proteins of the Swiss-Prot database have at least one stretch with Q30res ≥ +14 and R/K ≥ 4. About 25% of these are viral capsid proteins, a striking feature of viruses, considering that they represent only 3% of the Swiss-Prot proteome (Support Fig S3). Vertebrates are the second group having a considerable number of proteins with a similar constitution.
Nevertheless, these proteins represent a tiny fraction of the individual organism's proteome (e.g., just 55 proteins with Q30res ≥ +14 and R/K ≥ 4 in 20,415 human proteins). Among them are nucleic acid binding proteins and, more notably, protamines, which are small proteins expressed exclusively during spermatogenesis and involved in DNA hyper-condensation (29) (Support Fig S3). The arginine side chain possesses a guanidinium group, able to form bidentate bonds that are advantageous to maximize nucleic acid folding and packing when compared to Lys (30,31). Moreover, arginine-rich cell-penetrating peptides are more efficient than lysine-rich peptides, probably because the bidentate interaction forces membrane curvature and destabilization (32). Hence, arginine seems to be the optimal amino acid to condense and stabilize the viral genome and to facilitate membrane interaction. Nevertheless, unlike the negatively charged amino acids, the concentration of R and K in a short protein segment is limited (Fig. 1). The adverse effect of exceptionally positively charged protein segments on ribosomal synthesis efficiency may be among the selective pressures acting against repetitions of R or K in all organisms (24). Additionally, the size and composition of viral positively charged domains might be controlled by other factors. Viral nucleic-acid structural features that are rare in host cells usually serve as molecular targets for the innate immune response (33), and it is possible that R-rich domains function as a viral protein-specific pattern.
The calculation of capsid internal net charge shown in Figure 3 follows the most straightforward methodology published so far (11,14,23), since the only criterion for R-arm identification is the assumption that it is the most positively charged segment of the capsid protein. Even so, we observed a positive correlation between total Qmax30res and Qgenome for all the (+)RNA viruses for which the involvement of positively charged domains in capsid assembly was experimentally demonstrated (e.g., Geminiviridae, ratio was 1.4, which is generally in line with previous data indicating that (+)RNA viral capsids are overcharged, meaning that the Qgenome is not completely neutralized by protein-derived positive charges (3,9,10). Despite the simplicity of the calculation, we reproduced the general findings obtained with (+)RNA viruses (3,23).
The dsRNA families Totiviridae and Partitiviridae share a similar simple capsid architecture, with 60 CP dimers forming a T=1 capsid. All dsRNA viruses, including the more complex reoviruses and birnaviruses, replicate their genome and transcribe their mRNA inside an assembled capsid that also encloses the RNA-dependent RNA polymerase. More than transporting the genome, these particles are part of the viral factory, preventing the detection of viral dsRNA species by cellular proteins (35).
Because these capsids must sustain variable levels of RNA content during viral replication, it is reasonable that these families diverged from the group belonging to the linear fit. Among the ssRNA outliers, we found Caliciviridae; the 3 families of Picornavirales present in the dataset (Dicistroviridae, Secoviridae, and Picornaviridae); and Tymoviridae. A recent sequence-similarity network analysis of single jelly-roll capsid proteins from RNA viruses revealed two large clusters, one containing most of the ssRNA viruses present in our dataset and another formed by Picornavirales and Caliciviridae (36). Even though the capsid architecture is not the same, both groups pack VPg, a small protein bound to the genome 5'-end (37). Picornaviruses form pseudo-T=3 capsids containing 4 different proteins. Segments with charge > +7 were found in few entries and were restricted to one or two CPs. The primary role of these domains is unknown, but they may participate in membrane interaction, as already demonstrated for dicistrovirus CP4 (38). Most viruses from Caliciviridae assemble their capsid with one type of CP arranged in 90 dimers in a T=3 lattice. No segments with Q ≥ +7 were found in Caliciviridae CPs. Our data reinforce the structural similarities between these two groups and suggest a common, yet unknown, mechanism for genome stabilization and assembly. The Tymoviridae capsid proteins are also devoid of segments with Q ≥ +7. An X-ray structural model of DYMV includes densities corresponding to ordered RNA inside the capsid, but no positively charged residues are present in the interaction interface (39). The Parvoviridae were the only T=1 ssDNA viruses identified as an outlier family. These viruses enclose the largest genomes among the ssDNA viruses (~5 kb) but have charge values similar to the tiny Nanoviridae (~1 kb). Parvoviruses present 3 variations of the cap gene product, all having an overlapping amino-acid sequence with similar C-termini.
The most charged segment is a short Lys-enriched region unique to VP1. Because this CP variant is the least abundant, our charge calculation is probably overestimated. The capsid is mainly formed by VP2 proteins that have a very conserved ssDNA binding pocket (40). The binding site shows an ordered loop of 9 nucleotides that coordinates two Mg2+ ions. This stable and structured contact between the genome and the protein shell may represent an alternative strategy to the long super-charged R-arms that are observed in circovirus and anellovirus (5,40). ssDNA viruses are known to be a polyphyletic, diversified group (41). The finding that Parvoviridae has a genome stabilization strategy that differs from other small ssDNA viruses contributes to the hypothesis of independent events of capsid acquisition in this group.
In summary, positively charged domains that are implicated in viral capsid stabilization have general features such as possessing a high positive charge in a 30 amino-acid residue stretch (Q30res ≥ +7); being enriched in arginine over lysine; and being located at the C- or N-terminus of the capsid protein. However, these characteristics are neither essential nor exclusive to the genome stabilization function, which complicates a sequence-only approach to R-arm identification. On the other hand, when associated with virus genome charge and capsid structure, positively charged domains can suggest the general basis of capsid assembly and genome packaging mechanisms.

Data sources
The Swiss-Prot protein database at Uniprot.org was used as our source of primary protein sequences. Protein function, taxonomic, and structural information was retrieved from Uniprot.org, Viralzone, and Viperdb. Genome sizes for all viruses were obtained from the National Center for Biotechnology Information (NCBI) database. Reference sequences were used when available.
Net charge calculation, R/K ratios determination, amino-acid composition, and statistics.
We developed a program that screens the primary sequence of a given protein and calculates the net charge in consecutive frames of a predetermined number of amino acids (10, 30 or 60 were used). For the net charge determination, K and R were considered +1; D and E were considered -1; every other residue was considered 0. The N- and C-termini charges were disregarded. In a previous publication, we showed that these simplified parameters are equivalent to a calculation using partial charges of individual amino acids at pH 7.4, according to their pKa values and the Henderson-Hasselbalch equation (24). After net charge calculation in 30 amino-acid residue stretches, another program was used to determine the arginine (R) to lysine (K) ratio in a group of amino-acid stretches using the equation available in Support Information S4. The general amino-acid composition of 30-residue stretches with Q ≥ +7 was generated by MEME (http://meme-suite.org) (26). Statistical analyses and graphical
Table S1). The total nucleic-acid net charge was calculated from the number of nucleotide residues in the genome (Qgenome). For multipartite viruses, the longest genome segment was considered for the plot. A straight-line fit for the entire dataset was calculated, and a shaded area indicates the outliers (ROUT 2%). Pearson correlation results obtained from the inliers are shown in the inset.
The panel shows the amino-acid enrichment in relation to the total Swiss-Prot proteome amino-acid composition.
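The simplified scoring rule in the net-charge methods above (K and R as +1, D and E as -1) can be compared with partial charges from the Henderson-Hasselbalch equation in a short sketch. The side-chain pKa values below are assumed textbook approximations, not the values used in (24), and the function name is hypothetical. Histidine is included only to show how small its contribution is at pH 7.4.

```python
# Assumed textbook side-chain pKa values; the values used in (24) may differ.
PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"D": 3.65, "E": 4.25}


def partial_charge(residue, ph=7.4):
    """Average side-chain charge from the Henderson-Hasselbalch equation."""
    if residue in PKA_BASIC:
        # Fraction of the basic group that is protonated (charge +1).
        return 1.0 / (1.0 + 10 ** (ph - PKA_BASIC[residue]))
    if residue in PKA_ACIDIC:
        # Fraction of the acidic group that is deprotonated (charge -1).
        return -1.0 / (1.0 + 10 ** (PKA_ACIDIC[residue] - ph))
    return 0.0  # residues without a listed ionizable side chain
```

With these assumed pKa values, K, R, D, and E come out within about 0.1% of ±1 at pH 7.4, and histidine contributes roughly +0.04, which is consistent with treating the integer scoring as a close approximation.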