Introduction

Plastids represent biochemically and morphologically complex organelles and can change both form and function drastically in response to developmental and environmental cues. A vestigial but functional genome of 120–160 kb harboring ~90 protein-coding genes is present in the plastids of photosynthetic higher plants1. However, the total chloroplast proteome conservatively contains 2,000–3,500 proteins as reported in Arabidopsis2,3,4, but as many as 4,875 plastid-targeted proteins are estimated in eSLDB5, and 5,136 by the Chloroplast 2010 project6,7,8. Less than 900 of 4,500 genes horizontally transferred from the ancestral cyanobacterium are predicted to be retargeted to the plastid in vivo9.

There seems to be a difference between the composition of plastid-targeted proteomes in dicots and monocots. Only 21% of plastid-targeted rice proteins have a predicted homolog in the predicted Arabidopsis plastid proteome, and in reciprocal comparison the number is 38%2. A similar result was obtained in a comparison of six crop plants against Arabidopsis, in which an average of 51.0% of the predicted plastid proteome of each species matched to the Arabidopsis predicted plastid proteome, while 67.5% matched against the full Arabidopsis proteome10. Thus, the plastid pan-proteome is extremely diverse and is composed of unique proteins at the species-level. Furthermore, as the number of conserved sequences across all the genomes analyzed closely mirrors the number of genes of cyanobacterial origin, the non-conserved plastid-targeted protein-coding genes most likely evolved from eukaryotic sequences. The variability in the predicted plastid proteome mirrors the observable diversity in plastid function and ultrastructure in different species or under different environmental and developmental conditions2,10,11,12,13. The diversity of plastid proteomes is evident even within the same plastid morphotype: the pigment-storing chromoplast alone has at least four described ultrastructural phenotypes across various species with unique sub-organellar membrane structures that can occur either singly or mixed within individual plastids14. Morphological differences in plastid shape and ultrastructure are noted even in genetically similar cultivars of the same species. Both chloroplasts and chromoplasts of developing apple peel differ significantly from tomato, which is used as a model reference for chromoplast differentiation in fruits15,16. Variation has also been documented between the apple cultivars and the epidermal and collenchymal plastids11.

The observed phenotypic diversity of plastids could be explained by three potential molecular factors: (1) Differences in the expression of genes controlling the rate and total amount of protein accumulation or import. This aspect could lead to unique phenotypes without necessarily changing the subset of plastid-targeted proteins. (2) Mutations within a shared group of plastid-targeted proteins could lead to neofunctionalization. (3) Finally, gain or loss of transit peptides causing subcellular mistargeting could alter the total pool of plastid-targeted proteins.

These factors are not mutually exclusive, and examples of each mechanism are known. Gene expression differences, possibly caused by epigenetic DNA methylation patterns, are responsible for differential protein accumulation in mesophyll and bundle sheath cells of C4 plants, illustrating the first point17,18,19,20. In support of the second mechanism, point mutations in the active site of plastid-targeted limonene synthase change the abundance and distribution of different monoterpenoid end products in bacterial expression systems21, and transplastomic expression of a delta-9 desaturase gene causes changes in fatty acid concentrations and levels of unsaturation, cold tolerance, leaf senescence, and seed yield22 are additional examples. While it is challenging to address the neofunctionalization of plastid-targeted proteins via mutation without detailed reverse genetics experiments, the other mechanisms can be evaluated with high-throughput sequencing and bioinformatics.

High-throughput proteomics using mass spectrometry (MS) has been an important means of surveying organellar proteomes and comprises the majority of current plastid proteome evidence. However, these techniques have historically been limited to the chloroplast morphotype and a restricted number of plant species. Excellent databases for high-throughput plastid proteomes based largely on mass spectrometry are accessible at AT_CHLORO23, PPDB24, SUBA425, and CROPPAL26. However, caution should be exercised in interpreting these datasets because MS is susceptible to high false positive errors due to contamination during plastid isolation, liberal mass tolerance, and errors in peptide mapping, among other problems27,28,29. While the use of reference genomes and transcriptomes can help overcome peptide mapping issues, other technical issues are more difficult to resolve. Use of fluorescent protein chimeras (e.g., GFP – green fluorescent protein), though lower-throughput, typically have higher biological accuracy. Using these, localization of low-abundance, as well as proteins from species lacking robust plastid isolation methods, can be evaluated with higher efficiency. However, GFP techniques are not immune to experimental error either. Since the sequence of the mature protein partially influences localization (e.g.,30,31,32), GFP fused to the native protein may alter localization in some cases. Furthermore, dual-targeted mitochondrial/chloroplast proteins can be mislocalized in GFP assays33. Alternative transcripts or alternative protein products may also produce differential subcellular localization that are either not captured in GFP assays or give ambiguous results. Given these experimental limitations, a robust bioinformatics workflow could enable rapid and cost-effective assessment of plastid proteomes with somewhat comparable accuracy. Though wet lab validation is still necessary, these datasets could narrow the focus to smaller subsets of proteins of interest which could be more manageably targeted for wet lab validation depending on the biological question being asked.

The semi-conserved and sometimes ambiguous nature of chloroplast transit peptides makes in silico predictions challenging. Plastid transit peptides, as with other signal peptides, are well-known to be more variable than downstream protein sequence but more conserved than noncoding sequence. Yet, patterns of loose conservation at the amino acid level if not at the sequence level reveal multiple subgroups of transit peptides34,35,36,37. However, sequence- and annotation-based approaches have yielded results with significant accuracy. Protein sequence-based prediction uses the amino acid content or the presence of conserved motifs in the peptide to make predictions. Use of the amino acid content alone, such as in the tool PCLR, is enough to predict many plastid-targeted proteins38. More complex sequence-based identify conserved motifs, such as in iPSORT39 and WoLF-PSORT40, or sliding-window searching algorithm such as Localizer41, make predictions based on the sum of prediction vectors to determine transit peptide similarity. Finally, tools that use neural networks such as ChloroP42, TargetP43,44, Predotar45, PredSL46, and Protein Prowler47 use multiple layers of nodes to identify the best-scoring localization. In contrast, annotation-based methods such as CLPFD48 and EpiLoc49, or simple text-based methods based on GO annotations50, use homology to proteins with known localization to designate subcellular predictions. While these methods offer advantages over sequence-based methods for proteins with annotated homologs, they perform poorly for novel proteins51. Hybrid approaches including MultiLoc252, Sherloc253, Y-Loc54, and Plant-mPLoc55 combine sequence- and annotation-based methods in an attempt to overcome this limitation. Unfortunately, the homology component of hybrid approaches is weighted more heavily, which can lead to the false prediction of proteins with transit peptide variation or for proteins with shared domains. Both high-throughput proteomics and bioinformatics approaches consistently indicate that the plastid proteome content is highly dynamic and likely has significant variability across the plant kingdom. With newer methods, ever-growing genomic resources, and availability of better gene annotation methods, previously reported estimates of conserved and non-conserved sets of the plastid proteome warrant an update.

This study evaluated the hypothesis that bioinformatics methods could achieve similar accuracy to experimental methods by comprehensively testing previously published subcellular prediction algorithms both alone and in combination. A specific combination of methods was found to be most efficient, which was then used to globally predict nuclear-encoded plastid-targeted proteins for fifteen higher plant species including eight eudicots, six monocots, and Amborella trichopoda, an early diverging species of the angiosperm clade. Two parallel approaches, Reciprocal-Best Blast Hit (RBH) and UCLUST56 were used to perform clustering, and the sub-cellular localization prediction for each cluster was analyzed to identify conserved, semi-conserved, and non-conserved plastid-targeted proteins. This approach also evaluated the hypothesis that a relative minority of plastid-targeted protein-coding genes are conserved among all species. It was found that natural selection and environmental influence has shaped the development of species-specific plastid proteomes.

Results and Discussion

Identification of optimal subcellular prediction workflows

To test the hypothesis that a bioinformatics workflow could reach parity with experimental methodology, the accuracy of six subcellular prediction algorithms including TargetP43, WoLF PSORT40, PredSL46, Localizer41, Multiloc252, and PCLR38 was first evaluated using data from the original publications. Sensitivity, specificity, accuracy, and Matthew’s Correlation Coefficient (MCC) were evaluated for each program as it related to the prediction of plastid-targeted proteins (Table 1). Sensitivity, specificity, and MCC in TargetP were found to exactly match the values reported by Emanuelsson et al.43,44 and while minor differences were found for MultiLoc and PredSL, these discrepancies likely represent rounding errors. Unexpectedly, significant differences were found for PCLR and Localizer: in PCLR, sensitivity was found to be 52.1%, which was about 5% lower than what was reported38. In Localizer, calculated specificity was 78.9%, nearly 16% lower than the 95.7% reported41. In both cases, all other performance statistics were identical or nearly identical, so it is likely that the discrepancies in Localizer and PCLR represent either miscalculations or quality of the transcriptional data used for analysis in the original publications.

Table 1 Self-Reported Performance of Six Algorithms on Prediction of Plastid-Targeted Proteins.

Next, cross-validation of subcellular prediction programs was performed against proteins with experimentally-determined subcellular localization retrieved from AT_CHLORO23, PPDB24,57, CropPAL and CropPAL226 and Suba425,58,59,60, resulting in 42,761 nonredundant sequences including 32,450 proteins validated by mass spectrometry (MS) and 3,722 validated by GFP. Most prediction algorithms were found to have lower performance against biological data reported in the original reports, as shown in Table 2 and Fig. 1. However, substantial differences were observed based on the method of experimental validation. On average among the six algorithms, sensitivity was 15.7% higher in the GFP-validated dataset while no significant change in specificity was found; this difference resulted in 10% higher overall accuracy and an increase of 0.159 in MCC for GFP-validated proteins. By further narrowing focus to a dataset of proteins validated by both methods, sensitivity increased by an additional 7.6%, and specificity increased 2.5%, on average. Due to the previously reported high false positive rates associated with shotgun proteomics of organellar proteomes27,28, program performance was expected to be much higher for GFP-validated proteins. While the dataset containing proteins experimentally validated by both GFP and mass spectrometry showed the highest apparent performance for the six subcellular prediction algorithms - and is likely closer to the biological accuracy of these programs - it contains roughly a third as many proteins as the GFP-validated dataset and is heavily biased by Arabidopsis sequences. Therefore, remaining comparisons focused on the GFP-validated dataset. Similarly, MCC was used as the primary measure of biological accuracy of in silico approaches to avoid problems due to drastically different dataset sizes.

Table 2 Review of Algorithms using modern curated datasets (combined).
Figure 1
figure 1

Venn-Diagram of Combinatorial and Standalone Subcellular Prediction Algorithms. Performance measured by MCC on proteins with subcellular localization validated by GFP is represented as a heatmap with high values in green and low values in red. For each intersection, only the best accept threshold is represented. Numbers indicate workflow number followed by the calculated MCC.

Overall, the highest-performing program in terms of MCC was Localizer, followed by MultiLoc2-HR, TargetP, PCLR, PredSL, WoLF PSORT, and MultiLoc2-LR. Of these, PredSL and MultiLoc2-LR performed poorly with GFP-validated proteins compared to the original reports, while other programs decreased marginally or performed similarly to the published MCC. Among the six programs that were evaluated, Localizer had the highest performance regardless of the experimental method used for validation, which is surprising since it is a simpler tool than annotation-based methods which have been at the forefront of subcellular prediction methods recently. Part of Localizer’s increased accuracy may be due to its unique capacity to predict dual-targeted mitochondrial/chloroplast proteins. Over 200 dual-localized proteins have been described in Arabidopsis61 and over 500 are predicted to have ambiguous transit peptides62. Increased accuracy in the prediction of these sequences in Localizer could alone account for a portion of its higher performance. After Localizer, MultiLoc2 had the next-highest MCC and also had the highest specificity of any program, at 83% in GFP-validated proteins. MultiLoc is a hybrid method combining annotation and sequence analysis, so these findings support that the use of hybrid methods yields robust biological specificity. However, MultiLoc also had the worst sensitivity of any program, correctly predicting only 50% of bonafide plastid-targeted proteins validated by GFP or 31% of sequences validated by either GFP or mass spectrometry. TargetP, which has historically been the most popular subcellular prediction program for plants since its introduction, was found to perform at lower accuracy than earlier estimates: even when using the more conservative GFP-validated data, specificity was only 59% and sensitivity was 67%. Previous experiments using high-throughput shotgun proteomics have reported that the sensitivity of TargetP is as low as 62%3,63,64,65. Use of strictly-curated data improves the apparent sensitivity up to 86%, but false positive rates are still problematic as a specificity of about 65% is observed66. The results presented here suggest that the biological accuracy of TargetP is somewhat closer to the initial estimates on non-curated data. PredSL, PCLR, and WoLF-PSORT were the lowest-ranked programs by MCC for prediction of plastid-targeted proteins, in that order, but typically had higher sensitivity than Localizer or MultiLoc2.

Differences in the amino acid composition of transit peptides are observable between rice and Arabidopsis, which have an overrepresentation of alanine and serine, respectively66. Therefore, differences in the prediction of monocot or eudicot sequences were assessed, and different programs displayed significant bias (Table 3). PCLR was the most drastically affected, with an MCC bias of +0.091 in monocots, representing a roughly 20% increase compared with eudicots. This finding is somewhat unsurprising because PCLR is the only program which uses sequence composition alone to make predictions and is, therefore, more susceptible to bias than motif- or annotation-based methods. TargetP was the only other tool that favored monocots, with an increase of 0.055 (+10.2%) in MCC. A marginal difference between monocot and eudicot prediction was observed when Localizer was used, which differed by only 0.008 in MCC, slightly favoring eudicots. Eudicot sequences were favored in the other prediction programs, with between 0.043 (+10%) higher MCC in WoLF POSRT and 0.066 in PredSL (+14.9%). To the best of our knowledge, this is the first study to report this type of error or bias for in silico prediction methods. Some differences have also been described for the proposed subunits of the TIC translocon in grasses, which could result in coevolution of the transit peptide sequence composition67,68,69. Choice of training and cross-validated datasets could significantly sway the predictions of sequence-based methods, while overrepresentation or prioritization of sequences for Arabidopsis and thereby eudicots could introduce bias to annotation-based methods. Although these species-specific differences are smaller than differences observed for sequences validated by mass spectrometry compared with GFP, they are still noteworthy and have consequences for whole-genome prediction. In contrast, WoLF-PSORT and Localizer were found to have insignificant if any bias, making them attractive both as standalone programs or in combinatorial approaches where they could mask biases of other programs.

Table 3 Performance of prediction algorithms against GFP-validated proteins from monocots and eudicots.

Combinatorial workflow outperforms single programs

Use of multiple prediction algorithms in combination is a powerful strategy to combine the strengths and overcome the limitations of single programs. Combinatorial approaches have been used to improve the accuracy of predictions in whole-genome analyses (e.g.,2) or to curate mass spectrometry data (e.g.,70,71,72,73). Additionally, a combinatorial workflow using 22 prediction algorithms and four experimental techniques is used in the SUBAcon algorithm implemented for the SUBA4 database of Arabidopsis proteins which reportedly yields up to 97.5% accuracy for chloroplast localization and 90% for other compartments25,60. While SUBAcon does not strictly require experimental data to perform predictions, available evidence weighs heavily on the final prediction and contributes to the reported accuracy. Even if experimental evidence were to be ignored, the use of 22 separate subcellular prediction algorithms is not feasible for individual researchers or application to enormous datasets. Therefore, a bioinformatics-based workflow that can work efficiently would be desirable.

Calculations were performed for each possible permutation of subcellular prediction algorithms and for all possible acceptable thresholds for each combination as applied to GFP-validated proteins. For example, for the combination of TargetP, PredSL, and Localizer, three thresholds were tested in which one, two, or all three programs needed to predict plastid localization to consider that protein as having a plastid transit peptide. To simplify analyses, the poorly-performing WoLF PSORT was removed from consideration (results including WoLF PSORT and datasets including MS-validated proteins are available in Supplementary File 1). In total, 80 unique workflows including the five remaining standalone program workflows were evaluated against GFP-validated proteins, the results of which are graphically summarized in Fig. 1, and numerically ranked by MCC in Table 4. Unequivocally, the results demonstrate that combinations of programs tend to outperform single programs for GFP-validated data: among the 25 workflows with the highest MCC, 23 were combinatorial approaches, while the standalone Localizer ranked tenth and Multiloc2-HR 22nd. Localizer was not only the best-performing standalone program but was also overrepresented in combinatorial workflows: except the standalone Multiloc2-HR workflow, Localizer appeared in all 25 top-performing workflows. It is interesting to note that combinations that rank higher tend to combine programs with high sensitivity with counterparts that have lower sensitivity but higher specificity, thus correcting for each other’s deficiencies. Specifically, most of the combinations with the highest MCC and ACC tend to include Localizer most often, followed by MultiLoc2, TargetP, PCLR, and lastly PredSL. The ranking of Localizer is unsurprising given that its relatively balanced and high sensitivity and specificity are unparalleled by any of the other programs. However, MultiLoc2’s extremely high specificity makes it a valuable component of many workflows despite its low sensitivity. The best performing workflow used TargetP, Localizer, and Multiloc2 and required 2 of the three programs to predict plastid targeting to define a sequence as containing a plastid transit peptide; specificity of 78.5%, the sensitivity of 64.6%, and MCC of 0.659 was achieved with this approach. In comparison to TargetP alone, a nearly 20% increase in specificity was observed with no loss in sensitivity. However, as the annotation-based functions of MultiLoc2 make it difficult to run on extensive datasets, an alternative workflow using a “2 of 2” consensus approach for TargetP and Localizer was found which ranked 2nd and achieved a marginally higher specificity of 80.7%. Furthermore, comparing the accuracy of the best workflows to Table 2 and to prior evaluations of experimental methodology (e.g.,66) supported the hypothesis that bioinformatics methods could reach parity with mass spectrometry in characterizing the plastid proteome. Due to the increased simplicity and comparable performance of the TargetP/Localizer consensus approach, this workflow was selected for subsequent genome-scale prediction of plastid-targeted proteins.

Table 4 Best combinatorial prediction approaches ranked by Matthew’s Correlation Coefficient (MCC).

Predicted plastid proteome correlates with genome size

As a demonstration of the utility of the Localizer and TargetP workflow, subcellular prediction was performed for the whole proteomes of fifteen phylogenetically diverse species. Six monocot species, including Anthurium amnicola, Brachypodium distachyon, Oryza sativa, Panicum virgatum, Setaria italica, and Sorghum bicolor and eight eudicots, including Arabidopsis thaliana, Fragaria vesca, Glycine max, Malus × domestica, Populus trichocarpa, Prunus persica, Solanum lycopersicum, and Vitis vinifera were chosen. Additionally, Amborella trichopoda, a species which diverged from the rest of the angiosperms prior to the divergernce of monocots and eudicots, was also incorporated into the comparative analysis. Complete information including data version numbers, proteome sizes, and prediction of plastid-targeted proteins by Localizer and TargetP is summarized in Table 5. In Arabidopsis, 2,826 proteins were predicted to be plastid-targeted, representing 8.8% of all protein isoforms. This finding is in agreement with the conservative estimates of the Arabidopsis plastid proteome2,4,74. Similar percentages were calculated in other species but varied from a low of 6.4% in tomato to a high of 9.3% in A. amnicola. As expected, the absolute number of predicted plastid-targeted protein-coding genes showed a high correlation with the genome size (R2 = 0.965) (Fig. 2). This result suggests that an increase in genome size and gene content yield a similar increase in the total number of plastid-targeted proteins. Over 10,000 of the Arabidopsis sequences have experimentally-determined localization, and comparing predictions for these sequences revealed an apparent sensitivity of 55.6%, specificity of 89.8%, accuracy of 83.6%, and MCC of 0.614. Sensitivity is somewhat low in this estimation due to the use of MS data, which includes many false positives, but the high specificity suggests good prediction accuracy. With the combination of the high correlation with experimentally-validated proteins and the lack of monocot/eudicot bias imparted by Localizer, it is expected that similar levels of accuracy were achieved for the entire set of species analyzed in this study.

Table 5 Targeting Prediction for Selected Species.
Figure 2
figure 2

Illustration of RBH and UCLUST Sequence Clustering Methods. Initial (A) and expanded (B) RBH figures indicate clustering between species 1 (blue circles), 2 (green circles), and 3 (orange circles). Bidirectional best BLAST hits between sequences from different species are indicated with black lines; bidirectional better BLAST hits between sequences within the same species with red lines and fragments with dotted red lines. For UCLUST, the initial length-sorted (C) run is illustrated with yellow stars indicating centroids, small gray patterned circles indicating non-centroid sequences, large black circles indicating the match range for initial centroids, and black lines indicating sequence clustering for the initial run. For clarity, sequences are patterned to indicate belonging to each initial cluster, and red dotted lines indicate cluster fragmentation. Randomization of centroids (D) mitigates this artificially-induced problem; gray patterned stars indicate randomly-selected centroids, light blue circles indicate the match range for randomly-seeded centroids, and red lines indicate new matches found with red lines. Distances not drawn to scale.

Clustering of gene families

Although the plastid is highly dependent on proteins imported from the nucleus for normal viability and function, the size and diversity of the plastid proteome across the plant kingdom remain poorly understood. The hypothesis that the plastid proteome is diverse and each species has a unique set of plastid-targeted proteins was examined by grouping sequences into homologous protein groups using two parallel clustering methods (Fig. 3). Clustering method has a significant impact on the size and accuracy of the resulting clusters, and therefore on the number and relevance of predictions. Reciprocal best BLAST Hits (RBH) using ALL-vs.-All BLAST comparisons of whole proteomes are a standard proxy for orthology in comparative genomics, although they are susceptible to inclusion of weakly homologous paralogs. BLAST-based approaches combined with Markov clustering or similar methods to remove paralogs are used in commonly-cited methods such as InParanoid95, OrthoMCL96, and COG97,98. However, these methods can bias single-copy genes or highly conserved families which can be problematic for polyploid genomes where many-to-many gene relationships are common99,100. For instance, the popular OrthoMCL fails to detect many homologous proteins with conserved expression patterns, and therefore with likely conserved functions, between rice and Arabidopsis101,102. In contrast, more straightforward RBH methods often outperform more complicated algorithms on eukaryotic genomes103.

Figure 3
figure 3

Overall Performance of RBH and UCLUST methods. (A) Cluster distribution in RBH and UCLUST. Both methods resulted in similar distributions of clusters, although RBH resulted in slightly more clusters with 13–15 species and UCLUST resulted in more clusters from 2–12 species. The slight increase in clusters with five species is interesting, and may result from sequences with homology within the Poaceae family or within Rosids but with no significant homologs outside those groups. (B) GO annotation similarity in RBH and UCLUST clusters. Lower similarity scores in higher-order clusters are partially due to different annotation methods and thresholds used for different species. Annotation similarity was generally higher in RBH at smaller cluster sizes and higher in UCLUST for larger clusters. Similarity decreased with the increasing representation of species, which may be partially caused by different annotation methods used for different genome sequencing projects, or may alternatively be caused by decreased homology within large clusters.

A simplified RBH approach, allowing many-to-many relationships, was determined to be most appropriate for this analysis to avoid fracture of gene families with paralogs or co-orthologs. Initial homologous relationships were identified using pairwise BLAST-P comparisons of two species; only sequences which are mutually the best BLAST hits for each other were utilized. Similar methods have used 40% as an appropriate identity and coverage threshold for orthologous relationships10,104,105,106. Therefore 40% was used as the initial threshold of homology. Initial clustering generated many small clusters, so a supplemental method for expansion of clusters, using reciprocal better BLAST hits of each species = ’ proteome BLAST’ed against itself, was tested (Supplementary Figure 2–1). A 90% threshold was determined to be optimal for clusters with fewer species decreasing significantly in number, while clusters containing a majority of species remained stable or increased. In contrast, application of between 60 and 80% expansion thresholds caused the liberal merging of clusters into extremely large clusters representing thousands of individual sequences. Additionally, GO term similarity was assessed within clusters at each population size based on the number of species in the cluster and was found to increase slightly for clusters containing few species when using a 90% expansion threshold, while more massive clusters experienced no change or slight decreases.

An alternative approach called UCLUST was implemented to complement the RBH method with a faster and more efficient technique because its semi-global algorithm detects homology in a fraction of the time required for BLAST and becomes much more efficient on enormous datasets. Initial clusters were constructed at a 40% identity and 40% coverage threshold similar to the RBH approach. However, initial clustering produced smaller clusters and resulted in cluster fragmentation. Therefore, modifications were implemented to expand initial clusters by randomly selecting sequences out of each initial cluster and iterating the UCLUST search at more stringent conditions using the selected sequences as new centroids (Supplementary Figure 2–2). Cluster expansion significantly increased the number of clusters with many species, which largely came from the drastic reduction of the number of single-species clusters. As with RBH, a 90% expansion threshold was found to be optimal and increased the number of clusters sharing 14–15 species roughly 4-fold, while lower thresholds resulted in the frequent grouping of nonhomologous sequences. Comparison of GO similarity for clusters containing multiple species showed that similarity increased slightly or remained stable for nearly all cluster sizes in the 90% expansion threshold compared to the initial, non-expanded UCLUST analysis. The number of iterations required to fully expand cluster space in UCLUST was also examined, and it was found that most clusters were completely expanded by ten iterations, while further iterations yielded diminishing returns (Supplementary Figure 2–3). A total of 100 iterations were performed to avoid problems with the randomization of centroid sequences.

Application of the optimal clustering methods to the proteomes of the species chosen generated 170,877 clusters using RBH (Table 6) and 103,501 clusters using UCLUST (Table 7). Nearly all the additional clusters in RBH were from single-species clusters or singleton sequences (data not shown): 150,067 of the RBH clusters (87.82%) were single-species clusters of which 134,319 were singleton sequences, while UCLUST detected 74,059 single-species clusters (71.55%) including 45,033 singletons. Some of these may be orphan genes, but they are more likely to be prediction and annotation errors or pseudogenes because the lack of homology implies lack of conserved function or extreme mutation rates that are more likely to occur in non-coding sequences. A total of 20,810 and 29,442 clusters in RBH and UCLUST approach, respectively, contained sequences from multiple species; although they represented a minority of clusters, they contained the majority of initial sequences. A bimodal distribution was observed in both methods in which two clusters, the first containing 14–15 of the species and the second containing just 2–3 species, represented the majority of the clusters (Fig. 3A). Comparatively fewer clusters contained between 4–13 species. Of the conserved clusters containing all 15 species, RBH detected 4,090 clusters, while UCLUST yielded 3,295. GO similarity between UCLUST and RBH was remarkably consistent, but UCLUST had somewhat better scores for conserved clusters containing plastid-targeted sequences from all species and lower scores for semi-conserved or non-conserved clusters containing few species (Fig. 4B). Across both methods, GO similarity decreased with increasing cluster size. While the merging of nonhomologous sequences may be partially responsible for this decrease, the annotation methods and parameters are not identical for the species used in this study, which artificially decreases the apparent similarity score regardless of clustering specificity.

Table 6 RBH Clustering Results by Species.
Table 7 UCLUST Clustering Results by Species.
Figure 4
figure 4

Workflow Diagram of Sequence Clustering Methods. For RBH (left panel), 1. initial cluster edges were generated by finding all reciprocal best-BLAST hits in all-vs.-all comparisons of proteomes from two separate species at a 40% identity, 40% coverage threshold, and 2. Secondary cluster edges were generated by finding all reciprocal better-BLAST hits in all-v-all comparisons of each proteome against itself at a 90% identity, 90% coverage threshold. For UCLUST (right panel), 1. An initial run was performed at 40% identity and 40% coverage threshold on a FASTA file containing sequences from every species in length-sorted order, and 2. Random sequences of at least 90% identity and 90% coverage were extracted from each cluster, this subset was length-sorted, and then the original length-sorted FASTA file was concatenated to the new seed sequences. This process was iterated 100 times, and a separate UCLUST run was performed for each iteration. Downstream processes for RBH and UCLUST were identical: 3. All clusters/pairs with a shared sequence were condensed into single clusters, 4. All sequences that failed to have at least 40% identity and 40% coverage based on BLAST-P analysis to any of the predicted plastid-targeted sequences in the cluster were trimmed out, 5A. all clusters with at least three species were extracted, and 5B. Clusters containing plastid-targeted sequences were sorted into “conserved,” “semi-conserved,” and “non-conserved” groups according to the number of species with predicted plastid targeting and the taxonomic grouping of those species. cTP – chloroplast transit peptide.

Identification of gene families with conserved plastid targeting

Genomes of endosymbiotic bacteria contain 1,500 proteins on average, and plastids are likely to contain similar numbers when accounting for both the plastid genome and core nuclear-encoded plastid-targeted protein-coding genes107. To determine the number of gene families with conserved plastid localization, clusters containing at least 13 species, of which all species contained at least one predicted plastid-targeted sequence or at least four non-plastid-targeted sequences were selected. These parameters were chosen to account for assembly and annotation errors and to correct for the 39% false negative prediction rate for bonafide plastid-targeted proteins which could eliminate many truly conserved clusters. There is a nearly 20% chance that at least one of four random sequences with non-plastid localization prediction is a false negative, but sequences that already share homology to predicted plastid-targeted sequences have a significantly higher likelihood of being false negatives. A workflow diagram representing cluster detection, filtering, processing, and categorization is represented in Fig. 4. Applying this workflow, 628 conserved protein clusters were found in RBH (Table 6, Fig. 5), while UCLUST detected 828 (Table 7, Fig. 6). Of these, 621 clusters in RBH and 817 in UCLUST also contain sequences from A. trichopoda, and all have several monocot and eudicot sequences, strongly indicating that these clusters represent the fundamental core plastid-targeted protein-coding gene families. Previous estimates predicted that 857–1020 sequences were shared between rice and Arabidopsis, another report projected that between 289–737 proteins were shared among the chloroplast proteomes of seven plant species2,10. Identification of gene families with conserved chloroplast transit peptides is an essential output of this work, as in silico methods can quickly identify conserved plastid-targeted proteins that have failed to be detected by genetic screens due to embryo lethality, gene redundancy, or random chance. Several methods have validated these sequences as truly plastid-targeted and representative of conserved plastid-targeted protein-coding genes. First, Arabidopsis proteins with experimentally-validated localization were examined within the conserved clusters. A total of 84.2% (183 proteins) of predicted plastid-targeted Arabidopsis sequences in conserved RBH clusters were validated by GFP and 94.5% (1,054) were validated by MS. The same was true for 80.5% (154 proteins) and 92% (855 proteins) in conserved RBH and UCLUST clusters, respectively (Supplementary Files 3 and 4). While these methods have yielded good overall sensitivity, small errors at initial stages of clustering can compound in larger clusters and result in unrealistically high numbers of sequences. For RBH, an average of 113.9 sequences and median of 61 were present in conserved clusters while UCLUST produced an average of 125.9 sequences and median of 84. Most sequences in these clusters come from a small set of species: G. max, P. virgatum, P. trichocarpa, and V. vinifera each contributed an average of over 10 sequences each to clusters with shared plastid localization prediction, while M. × domestica contributed over 10 sequences on average in UCLUST (summarized in Supplementary Files 3 and 4). Significant gene duplication or inclusion of multiple gene isoforms especially in those species likely accounts for a portion of the larger cluster sizes, but more distant paralogous sequences which are less likely to share biological function are also likely to be common. Thus, the list of conserved clusters reported here is not meant to be definitive and final, but rather a general guide which will require phylogenetic and experimental validation. In cases where larger clusters contain multiple paralogs or non-homologous, phylogenetic methods could resolve homology relationship with higher efficiency than the currently used RBH and UCLUST methods. However, the biological accuracy of the predicted plastid-targeted sequences within these clusters is still high.

Figure 5
figure 5

RBH Visual Representation. For “unique” clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 628 protein clusters were shared between all 15 species, 1,002 had plastid-targeting specific to either monocots or eudicots, and 2,943 had plastid-targeting specific to only a single species.

Figure 6
figure 6

UCLUST Visual Representation. For “unique” clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 828 protein clusters included plastid-targeted sequences from all 15 species, 1,983 had plastid-targeting specific to monocots or eudicots, and 5,632 had plastid-targeting specific to a single species.

Next, enrichment of gene ontology (GO) annotations was performed in conserved clusters by finding GO terms shared in at least three individual sequences and for over 10% of sequences. Terms were compared to annotations extracted using the same criteria for all the clusters of the respective clustering method and GO term enrichment was performed using BLAST2GO108. Overall, 53 terms including 29 terms associated with biological function, 23 associated with the cellular component, and one associated with the molecular process were found for RBH (Table 8). In UCLUST, a total of 33 terms were found, including 15 associated with the biological process, 17 with the cellular component, and one with the molecular process (Table 9). The most significantly enriched GO terms under the biological process ontology for both RBH and UCLUST methods were GO:0015979 (photosynthesis) and GO:0008152 (metabolic process), while a majority of the remaining highly enriched terms were associated with homeostatic processes (GO:0042592), cellular component organization (GO:0016043), single-organism biosynthetic processes (GO:0016043), generation of precursor metabolites (GO:0006091), and lipid metabolism (GO:0006629). In the RBH method, additional terms associated with amide, peptide, and organonitrogen compound biosynthesis and metabolism (GO:0043604, GO:0043603, GO:0043043, GO:0006518, GO:1901566, GO:1901564, GO:0044271, GO:0034641, GO:0006807), were enriched. UCLUST additionally had enriched GO terms associated with transport (GO:0006810), localization (GO:0051234, GO:0051179) and metabolism of carbohydrates (GO:0005975). Among cellular component ontologies, plastid (GO:0009536) was the most overrepresented term in both methods. Other highly overrepresented cellular component terms included organelle (GO:0043226), thylakoid (GO:0009579), chloroplast (GO:0009507), and associated terms. In RBH methods, significant enrichment of ribonucleoprotein complexes (GO:1990904, GO:0030529) was found. For the molecular process ontology, structural molecule activity (GO:0005198) was enriched in RBH and catalytic activity (GO:0003824) in UCLUST. These GO terms were further compared to the results of a previous study involving intergeneric analysis that described 737 conserved plastid-targeted proteins10. In this study, 42% of enriched terms found using UCLUST overlapped with the methods reported previously10. RBH methods were somewhat lower because more enriched terms were found, but still overlapped with the previously published dataset by 24%. These results are remarkably similar given that only GO terms from Arabidopsis had been examined previously and also different methods of GO enrichment had been used in those studies. The final and perhaps the most important test of the biological significance of conserved plastid-targeted clusters is whether they contain proteins expected to be present in plastids of all higher plants. Gene names were retrieved from TAIR10 for all Arabidopsis sequences in conserved clusters, and many of the most prominent plastid proteins were confirmed to be present in clusters for both RBH and UCLUST methods. The following is not intended to be an exhaustive list but merely a representative of the types of proteins detected in conserved plastid-targeted clusters; a complete list of annotations and gene names in RBH and UCLUST clusters are available in Supplementary Files 3 and 4. Among genes involved in primary photosynthesis, HCEF, LhcA1, LhcA2, LhcB1, LhcB2, LhcB3, Lhcb4, LPA1, LPA3, PPDK, and RbcS were detected in both methods, while LPA66 was found in RBH only. Photosystem subunits Psa-E, Psa-F, Psa-G, Psa-H, Psa-K, Psa-N, PsbP, Psa-O, PsbQ, PsbR, PsbS, PsbW, and PsbY were also found in both methods, while PsbT-N and PsbX were found only in RBH and PsbO was found only in UCLUST. Among ribosomal proteins, Rps1, Rps9, Rpl4, Rpl11, and Rpl12 were detected by both techniques, while Rpl9 and Rpl15 were only found using RBH and Rpl10 was found only with UCLUST. Proteins involved in translocation and chaperone functions found by both methods included ClpB, ClpC, ClpD, ClpP, ClpR, FtsH, Hsp60, Hsp70, Hsp88, Hsp90, Hsp98, Cpn10, Cpn20, Cpn60, Vipp1, Alb3, Alb4, TatC, Tic20, Tic21, Tic40, Tic55, Tic110, Toc75, and Plsp1. The Sec translocase subunits SecA, Scy1, and Scy2 were uniquely found in RBH, while organellar oligopeptidase OOP was also found in UCLUST. Finally, genes associated with primary plastid metabolism (SBPase, TPT, FRUCT5, G6PD2, and G6PD), heme biosynthesis (GUN2, GUN5, HEMA, HEMB, HEMC, HY2, PORA, PORB, and PORC), and fatty acid synthesis (ACC2, FAB2, FAD7, FAD8, FATA, FATB, lipoxygenase) were found in core clusters.

Table 8 Enriched GO terms for Conserved Plastid-Targeted RBH Clusters.
Table 9 Enriched GO terms for Conserved Plastid-Targeted UCLUST Clusters.

Taken together, the good correlation of protein clusters with experimentally-validated sequences, the enrichment of expected annotation terms, and the presence of expected highly-abundant proteins or proteins critical to chloroplast biology suggest that both the RBH and UCLUST methods achieved good accuracy and sensitivity for genes with conserved chloroplast targeting which are likely critical in all photosynthetic plants for minimal chloroplast function. It is noteworthy that 194 clusters in RBH and 333 core clusters in UCLUST contain at least one Arabidopsis sequence but have no associated gene synonyms available (Supplementary Files 3 and 4). As the sensitivity for conserved plastid-targeted proteins was found to be very high overall, many of these 194–333 clusters with missing annotation information are likely biologically accurate, in which case they are excellent candidates for understanding hitherto uncharacterized aspects of chloroplast biology.

Analysis of semi-conserved and non-conserved plastid-targeted proteins

Semi-conserved plastid-targeted protein-coding gene families in which predicted plastid-targeting was found for two or more sequences only in monocots, only in eudicots, or uniquely in A. trichopoda were identified beginning with the most diverse clades. In each case, all clusters with predicted plastid-targeted sequences or at least four predicted non-plastid-targeted sequences from the outgroup species were removed. A total of 572 gene families with plastid-targeted sequence specific to monocots and 430 to eudicots were found using RBH methods (Table 6, Fig. 5), while UCLUST detected 1,054 and 885, respectively (Table 7, Fig. 6). Additionally, 82 clusters with Amborella-specific plastid targeting were found using RBH, and 195 were found with UCLUST. These findings indicate that gene families with semi-conserved plastid-targeting outnumber core clusters by 73% in RBH and more than 150% in UCLUST. Narrowing focus to the subclade and family level revealed that semi-conserved clusters are still abundant, indicating that significant plastid proteome variation is present across all taxonomic levels. It is plausible that some of the clusters with plastid-targeting specific to either monocots or eudicots have functionally related clusters in the reciprocal group but lack sufficient homology to cluster together. Such an occurrence seems unlikely in most cases because the clustering methods used here were relatively liberal, but isolated cases may still occur. In some cases, non-orthologous or chimeric genes could also functionally replace an otherwise conserved gene and lead to loss of orthologous sequences in particular species or taxonomic groups109,110.

Finally, clusters with predicted plastid targeting only present in a single species were identified in RBH (Table 6, Fig. 5) and UCLUST (Table 7, Fig. 6). Singletons and clusters containing only a single species were discarded as these likely represent gene prediction errors. For example, predicted proteins in Malus which do not share homology with proteins in other species are typically poorly-supported by transcriptomics evidence: examination of over 300 such sequences revealed only one that had full coverage and was not a smaller fragment of a larger protein (data not shown). Since the chloroplast transit peptide is presumed to have arisen recently in each cluster, the term “nascent plastid-targeted proteins” (NPTPs) was coined to represent such proteins. Unsurprisingly, species with large and complex genomes possessed a more significant number of NPTPs: A. amnicola had the least, at just 52 in RBH and 97 in UCLUST, while P. virgatum had the most, with 682 NPTPs found in RBH and 1,458 in UCLUST. The predicted proteome of A. amnicola is based on transcriptomics data rather than genome-wide prediction, while P. virgatum has the largest genome and most extensive predicted proteome of the species in this analysis, so these trends are consistent with expectations.

Additionally, up to 728 proteins were uniquely targeted to the plastid in M. × domestica, and between 300–400 proteins had species-specific plastid transit peptides in B. distachyon, F. vesca, G. max, S. italica, and S. bicolor. Arabidopsis had some of the lowest estimates of NPTPs, with only 74 found in RBH and 166 in UCLUST. Species-unique plastid-targeted proteins had a moderately linear correlation with the total number of sequences in each species R2 = 0.73 in RBH and 0.72 in UCLUST, Fig. 7A), but the removal of the outlier P. virgatum resulted in nonlinear correlation (Fig. 7B). Consequently, extreme increases in genome size and complexity are hypothesized to create more opportunities for the evolution of novel transit peptides and diversification of the plastid proteome, but differences are subtler when the genomes being compared are closer in size. Previous literature (e.g.111,112,113) has suggested that gene duplication is a prerequisite or at least greatly encourages neofunctionalization via novel subcellular targeting, and the generally linear correlation with proteome size suggests that this may indeed be the case. However, based on the data, the evolution of the plastid proteome is more likely to be driven by environmental adaptation and selection pressure114.

Figure 7
figure 7

Correlation of Total Proteome Size with Nascent Plastid-Targeted Proteins (NPTPs). (A) Clusters containing at least three species and with predicted plastid-targeted proteins in only one species were compared to the total proteome size for both RBH and UCLUST clustering methods. Although the correlation was moderately linear when P. virgatum was included, its extremely large proteome skewed results. (B) Correlation after removal of P. virgatum. Weakly linear correlation indicates that the evolution of novel transit peptides is a random process.

While transit peptide structure and sequence were expected to be conserved within each thus-identified cluster, searching for shared homology between transit peptides of different clusters was not performed. Without experimental data to support such identification, the motifs thus identified would be unreliable predictions, and it would be hard to state if the observed convergent evolution detected in novel transit peptides has any cause-effect relationship.

As with the conserved plastid-targeted clusters, the accuracy of targeting prediction in NPTPs was cross-validated against experimentally-validated proteins from Arabidopsis. For the RBH clusters, 75% (4 proteins) were validated to be true plastid proteins via GFP, and 53.8% (17 proteins) validated by MS. For UCLUST, 29.4% (17 proteins) were validated by GFP, and 41.4% validated by MS. Specificity was also very high: only 6.3% of 300 predicted non-plastid-targeted proteins in RBH-generated NPTP clusters were found to actually be plastid-targeted by GFP, while the rate in MS-validated proteins was 13.4% (967 proteins). UCLUST generated similar results, with false negative error rates of 3% (493 proteins) in GFP-validated data and 12.5% (1,369 proteins) for MS-validated data. The few false negatives in predicted NPTPs may be representative of ambiguous/intermediate sequences in clusters which are already predicted to be uniquely chloroplast-targeted in Arabidopsis and therefore represent missing links. More pertinently, the GFP estimates are likely more accurate due to the experimental specificity errors inherent in mass spectrometry, and the 3–6% error rates are within an acceptable range.

Overall, these data affirm that evolution of the plastid proteome is highly dynamic at the species-level. Compared to previous reports, somewhat reduced species-unique plastid-targeted proteins are reported here (e.g.,2,10) due in part to the removal of singletons and single-species clusters. Homology to sequences in other species dramatically decreases the probability of pseudogenes and gene prediction errors. Remarkably, the monocot species had an average of 50–60% more species-unique plastid-targeted protein clusters than eudicot or Amborella counterparts. Even after removal of the outliers P. virgatum and A. amnicola, monocots still had 40% more plastid-targeted clusters than eudicots according to RBH methods, and over 80% more clusters using UCLUST. The reasons for this could be two-fold. First, the monocot species in this analysis have larger proteomes on average, increasing the overall likelihood for both de novo evolution of NPTPs and for retention of orphaned singleton/species-specific proteins. Secondly, monocots, and especially grasses, have been described to have many presence/absence variants (PAV’s) and copy number variants (CV’s) in their genomes. Pan-genome sequencing of B. distachyon revealed over 7,000 pan-genes that are not present in the reference genome, and an average of 9 Mb of sequence in each accession does not align to the reference genome115. Similar rates of PAV’s have been reported for cereal crops: only half of the pan-genome diversity of maize is present in the reference genome116, over 21,000 predicted wheat genes are not represented in the reference genome117, and 8,000 predicted rice genes are not represented in the Nipponbare reference genome118. In contrast, pan-genomes of Arabidopsis119 and tomato120,121 describe variation primarily at the SNP and small insertion/deletion levels, although one report described that 14.9 Mb of the Columbia-0 genome was absent in one or more other accessions122. In Brassica oleracea, less than 20% of genes were affected by presence/absence variation123. Somewhat higher variation is observed in legumes: 302 soybean lines including varieties, landraces, and wild accessions revealed 1,614 copy number variants and 6,388 segmental deletions, and 51.4% of gene families were dispensable124 while in Medicago truncatula, 67% of annotated genes may be dispensable125. It bears consideration that the pangenomes of the grasses are primarily within cultivated accessions and have already passed through a domestication filter which already significantly reduces genomic diversity, whereas the pangenomes of most of the eudicots include wild and landrace accessions. These trends suggest that PAV’s and CV’s are significant drivers of plastid proteome evolution, either by retention of orphaned genes or by de novo evolution of transit peptides in duplicated genes. Despite the smaller number of species-unique clusters, conserved plastid-targeted proteins are still outnumbered up to 25-fold by species-unique or semi-conserved proteins. If even a fraction of these sequences is accurate and expressed in vivo, each could impart novel biological functions because escape from the evolutionarily established biochemical and regulatory environment could impart a different function in a new subcellular environment without changing the functional sequence of the protein. Thus, each of these is an excellent candidate for further characterization to determine if unique phenotypes are created by relocalization to the plastid. Conversely, species-specific plastid-targeted protein-coding genes in model systems could yield misleading interpretations because the same phenotypes for those genes would not be observed in species where homologs do not have plastid-specific localization. Such a situation is potentially problematic for the unique plastid-targeted proteins detected for Arabidopsis, B. distachyon, and rice because it is likely that some of these genes already have a described gene function that is being inaccurately ascribed to plants as a whole. Indeed, out of 113 Arabidopsis proteins with predicted species-specific plastid-targeting, 18 have a described phenotype, and 100 are cited in previous research reports (summarized in Supplementary File 5). In cases where the predicted localization divergence is validated, the mutant phenotypes for those sequences will have to be revised.

Conclusions

The evaluations conducted in this study support the hypothesis that a combination of subcellular localization prediction programs can accurately predict chloroplast transit peptides at a whole-genome scale in higher plants and can perform equally well for both monocots and eudicots. The best-performing method was then applied to predict chloroplast proteins globally for a diverse range of angiosperm species and developed both a slow and accurate reciprocal best-BLAST hit method and a fast-liberal UCLUST method to cluster gene families. Though results were not identical, UCLUST yielded comparable results while performing more efficiently. With the addition of more species, UCLUST could be a useful tool to overcome the inefficiency of BLAST-based methods. The consensus of both methods determined that the hypothesis of extreme plastid proteome variability was supported across the taxonomic space. Roughly 700 genes were shared between the chloroplast proteomes in all plant species, but these were vastly outnumbered by proteins with variable plastid targeting prediction. Most of these species- or clade-specific proteins have no known function for the plastid and are excellent candidates for further studies. Additionally, roughly a third of conserved plastid-targeted proteins have no known function and could be targeted for reverse genetics experiments in the future. Biological verification of these sequences remains a significant challenge. Even if good prediction accuracy was achieved, these sequences may be poorly expressed, expressed only in particular conditions, or are nonfunctional. Incorporation of transcriptomics would provide significant evidence that these genes are at least expressed, and patterns of gene expression along with co-expression information may also reveal additional information about their function. Experimental validation using mass spectrometry could also be used, but many proteins may have abundances below detection limits, and technical challenges also remain for the isolation of non-green plastids where they may be more abundant. The decreasing costs of gene synthesis make high-throughput fluorescence protein assays an attractive alternative. In addition to increased sensitivity and specificity compared with mass spectrometry, fluorescent protein assays could also be used to simultaneously validate whether the localization of species-unique proteins are truly different from their nearest predicted non-plastid-targeted homologs, and likewise may be able to provide better spatial resolution. Outer membrane proteins, lacking a classical transit peptide, are only currently predictable based on homology to the mature protein, and thus cannot be predicted de novo. Furthermore, prediction of localization within sub-compartments of the chloroplast remains a challenge. TargetP and other programs offer sub-compartment predictions, but their accuracy remains questionable, making improvement of experimental methods a necessity. The methods and results reported in this study will enable rapid, accurate and cost-effective identification of plastid-targeted proteomes in new plant species as their genomic information becomes available. These research findings are expected to provide a foundation for further research into unique plastid biology and to understand better how diversification of the organellar proteomes contributes to important agronomic, biochemical, culinary, or even aesthetic traits.

Methods

Cross-validation of in silico techniques

Test datasets for cross-comparison of subcellular prediction algorithms were retrieved from PPDB (2012 update; current as of this writing), AT_CHLORO (January 2015 update; current as of this writing)23, Suba4 (30 June 2017 update; current as of this writing)24, CropPAL version 58839ba26, and CropPAL2 version 7486696726. Headers which could not be referenced to the most up-to-date reference proteomes were discarded. For AT_CHLORO, Suba4, and PPDB databases, all genes located on the chloroplast and mitochondrial genomes were removed, and redundant headers were merged. Subsets of data including sequences confirmed by mass spectrometry, GFP fusion, either GFP or mass spectrometry, or both were extracted from each database by filtering for the keywords “Chloroplast” or “Plastid.” All ambiguous results containing experimental evidence for both plastids and at least one other subcellular fraction were removed.

Experimentally validated protein sequences were analyzed with TargetP v.1.143,44, WoLF PSORT Command Line Version 0.240, PredSL Web Server46, Localizer v.1.0.241, MultiLoc2 version 2-26-10-200952, and PCLR update 2011-11-24 release 0.938. Additionally, NLStradamus v.1.8126 was used as part of the Localizer algorithm, while Python v.2.7.5, LIBSVM v.2.8, BLAST v.2.2.30, and Interproscan v.5.25-64.0 were used as part of MultiLoc2. Results for each workflow were converted into binary classification and evaluated for Sensitivity (SE), Specificity (SP), Matthew’s Correlation Coefficient (MCC), and accuracy (ACC) as related to plastid localization prediction based on the number of true positives, false positives, true negatives, and false negatives compared to the annotations in the corresponding experimental dataset (see equations below). Combinatorial approaches were performed for each possible combination of programs from two up to all six programs, and different thresholds were evaluated based on the number of programs in agreement for plastid localization. Complete records of individual and combinatorial workflows for each experimental dataset are available in Supplementary File 1. The heatmap in Fig. 1 was generated using conditional formatting in Microsoft Excel.

$$\begin{array}{ccc}{\rm{Sensitivity}}({\rm{i}}) & = & \frac{tp}{tp+fn}\\ {\rm{Specificity}}({\rm{i}}) & = & \frac{tp}{tp+fp}\\ {\rm{MCC}}({\rm{i}}) & = & \frac{tp\times tn-fp\times fn}{\sqrt{(tp+fn)(tp+fp)(tn+fn)(tn+fp)}}\\ {\rm{Overall}}\,{\rm{Accuracy}}\,({\rm{ACC}}) & : & \frac{tp+tn\,}{tp+fp+tn+fn}\end{array}$$

where tp is the number of sequences correctly identified as plastid-targeted, tn is the number of sequences correctly predicted to be non-plastid-targeted, fp is the number of non-plastid-targeted sequences incorrectly predicted as plastid-targeted, and fn is the number of plastid-targeted sequences that were predicted as non-plastid-targeted. Note that these categorizations are based on the accuracy of the database annotation and any filtering that was applied to data subsets, and they may not reflect biological accuracy.

Whole proteome analysis

Predicted proteomes for Amborella trichopoda, Arabidopsis thaliana, Brachypodium distachyon, Fragaria vesca, Glycine max, Malus × domestica, Oryza sativa, Panicum virgatum, Populus trichocarpa, Prunus persica, Setaria italica, Solanum lycopersicum, and Sorghum bicolor were downloaded from Phytozome87. The proteome of Anthurium amnicola was obtained by personal correspondence with Dr. Jon Suzuki, USDA-ARS, Hilo, Hawaii, in advance of the publication76. For Vitis vinifera, an expanded proteome version was obtained from94. For Malus × domestica, modifications to the predicted proteome were made because over 15,000 sequences, representing over 20% of the predicted proteome, were determined to have no significant matches to proteins from other species (See Supplementary File 5). The predicted proteome was expanded using apple transcriptome data that were downloaded from the NCBI SRA database under the project numbers PRJEB2506, PRJEB4314, PRJEB6212, and PRJNA231737, representing a mixture of leaf, apical meristem, fruit, and root tissues at different time points and under varying conditions82,83,84,85,127. These sources are described further in Supplementary File 2. Sequence files were processed in CLC Genomics Workbench version 8 (Qiagen Bioinformatics, Hilden, Germany); paired Illumina read files and 454 sequencing files were indicated during import. Graphical QC reports were generated to obtain nucleotide contribution (GC content) and quality distribution (quality scores) by base position. Reads were processed to remove ambiguous nucleotides and base quality scores lower than 0.001. Illumina reads were additionally trimmed at the 5’ end until the GC content stabilized within 0.5% of the average, and reads with fewer than 34 bases remaining were discarded. All paired read files were subsequently merged using default settings. All processed read files were assembled de novo with default settings. Assembled contigs of >300 bp were kept and used to predict open reading frames (ORF’s). Non-overlapping ORF’s with at least 5x average base coverage and >300 bp were extracted and translated into protein sequences. Finally, extracted protein sequences were compared against the existing Malus × domestica v.1.0 predicted gene set81 downloaded from Rosaceae.org. All hits with greater than 98% ID and coverage (as per85) were tagged as potential duplications or alleles of the original headers but were kept in the peptide dataset in case minor mutations caused differential localization prediction. All sequences generated from this transcriptome assembly are available in Supplementary File 6. In total, 36,477 sequences were obtained, of which 26,881 sequences were determined to be unique in comparison with the apple genome 81. Addition of the unique genes from the de novo transcriptome created a final dataset of 64,680 unique proteins. Redundant sequences from the resulting transcriptome were retained in case minor differences resulted in differential targeting.

The predicted proteomes of all species were filtered to remove any sequences less than 100 residues and which did not begin with methionine. Post-analysis filtering was accomplished by removing singleton sequences that failed to find matches with both the USEARCH method and BLAST (indicated for each sequence in Supplementary File 5). Remaining sequences were analyzed with TargetP v.1.143,44 and Localizer v.1.0.241. All sequences predicted by both methods to have a chloroplast transit peptide were classified as plastid-targeted, and all sequences with either “1 or 2” or “0 of 2” chloroplast transit peptide predictions were classified as non-plastid-targeted.

Clustering of gene families

Reciprocal Best-BLAST hit clustering was performed as follows: Pairwise BLAST-P (v.2.3.0+ command line executable;128,129) was performed for each species’ predicted proteome set against that of every other species in both forward and reverse directions. These results were filtered for hits in which identity and coverage parameters exceeded 40%. Of these, only hits in which two sequences from different genomes were the respective best hit were kept. Next, better-BLAST hits within each species were performed by conducting pairwise BLAST-P of the predicted proteome against itself. Hits exceeding 90% coverage and identity and which was reciprocal within the first 10 hits were collected. Cluster merging was performed by iterating through each possible header and collapsing all pairwise hits containing that header.

Clustering using the UCLUST algorithm proceeded as follows: An initial run on a length-sorted FASTA file containing all sequences was performed using ‘Cluster_Fast’ function of UCLUST (v.9.2.64_win32;56) with 40% identity and 40% query coverage. Next, random seeds were constructed by extracting a single random sequence from each cluster, sorting the resulting sequences by length, and appending them to a length-sorted FASTA of the full sequence list used in the initial “Cluster_Fast” analysis. 100 randomly-seeded FASTA files were then analyzed with “Cluster_Fast” set to 90% sequence identity. Target and query coverage were additionally set to 0.4 to avoid problems with small query sequences acting as centroids for much larger sequences as a result of USEARCH being performed in sequential rather than length-sorted order. Cluster merging was performed by iteratively searching through each possible sequence header and collapsing all clusters containing that header. Custom scripts were developed for automating program workflows, referencing and translating sequences or headers, performing seed randomization for the modified UCLUST technique, performing cluster expansion, calculating statistics on clustering outputs, and referencing headers to respective clusters for both workflows. Sequence members within merged clusters from RBH and UCLUST methods were referenced to the predicted plastid targeting phenotype, and all clusters containing plastid-targeted members were extracted. Conserved plastid-targeted protein-coding gene families were defined as clusters containing at least 13 species and in which all had either predicted plastid transit peptides or at least three additional sequences. Semi-conserved plastid-targeted gene families were defined as clusters containing plastid-targeted sequences from at least 2 species within each family or clade and no predicted plastid-targeted sequences from species outside that clade. Non-conserved plastid-targeted protein-coding gene families were defined as all clusters containing a minimum of three species in which only one species had a plastid-targeted sequence.

Gene ontology enrichment

Annotations for NPTPs were retrieved from Phytozome87 for each of the species used in the analysis except Anthurium amnicola and Vitis vinifera, which were retrieved from76 and94, respectively. Non-redundant predicted proteins produced by the de novo transcriptome assembly of Malus × domestica were annotated using BLASTP against the NR Protein database at NCBI with BLAST2GO v.4.1.9 default parameters108 (BioBam Bioinformatics, Valencia, Spain). GO terms were converted into GOslim annotations using BLAST2GO, and for each cluster, all terms shared by at least three species and present in over 10% of a cluster’s sequences were extracted to develop query datasets. In parallel, the same methods were used to extract GO terms from the total list of clusters to serve as reference datasets. Enrichment of GO terms in the shared plastid-targeted clusters was performed using BLAST2GO, with Fisher’s Exact Test was used to calculate significance using a false discovery rate (FDR) of less than 0.05 as a minimum significance threshold108. Graphical analyses of enriched GO terms were produced in BLAST2GO.

Gene and phenotype identification

Full gene annotations include described gene names were downloaded for the TAIR10 Arabidopsis genome from Phytozome87. Gene names were referenced from the annotation file for Arabidopsis sequences present in conserved plastid-targeted protein clusters. Phenotype information for species-unique plastid-targeted proteins was referenced on NCBI130.