Genome-Scale Characterization of Predicted Plastid-Targeted Proteomes in Higher Plants

Plastids are morphologically and functionally diverse organelles that are dependent on nuclear-encoded, plastid-targeted proteins for all biochemical and regulatory functions. However, how plastid proteomes vary temporally, spatially, and taxonomically has been historically difficult to analyze at a genome-wide scale using experimental methods. A bioinformatics workflow was developed and evaluated using a combination of fast and user-friendly subcellular prediction programs to maximize performance and accuracy for chloroplast transit peptides and demonstrate this technique on the predicted proteomes of 15 sequenced plant genomes. Gene family grouping was then performed in parallel using modified approaches of reciprocal best BLAST hits (RBH) and UCLUST. A total of 628 protein families were found to have conserved plastid targeting across angiosperm species using RBH, and 828 using UCLUST. However, thousands of clusters were also detected where only one species had predicted plastid targeting, most notably in Panicum virgatum which had 1,458 proteins with species-unique targeting. An average of 45% overlap was found in plastid-targeted protein-coding gene families compared with Arabidopsis, but an additional 20% of proteins matched against the full Arabidopsis proteome, indicating a unique evolution of plastid targeting. Neofunctionalization through subcellular relocalization is known to impart novel biological functions but has not been described before on a genome-wide scale for the plastid proteome. Further work to correlate these predicted novel plastid-targeted proteins to transcript abundance and high-throughput proteomics will uncover unique aspects of plastid biology and shed light on how the plastid proteome has evolved to influence plastid morphology and biochemistry.


Identification of optimal subcellular prediction workflows.
To test the hypothesis that a bioinformatics workflow could reach parity with experimental methodology, the accuracy of six subcellular prediction algorithms including TargetP 43 , WoLF PSORT 40 , PredSL 46 , Localizer 41 , Multiloc2 52 , and PCLR 38 was first evaluated using data from the original publications. Sensitivity, specificity, accuracy, and Matthew's Correlation Coefficient (MCC) were evaluated for each program as it related to the prediction of plastid-targeted proteins (Table 1). Sensitivity, specificity, and MCC in TargetP were found to exactly match the values reported by Emanuelsson et al. 43,44 and while minor differences were found for MultiLoc and PredSL, these discrepancies likely represent rounding errors. Unexpectedly, significant differences were found for PCLR and Localizer: in PCLR, sensitivity was found to be 52.1%, which was about 5% lower than what was reported 38 . In Localizer, calculated specificity was 78.9%, nearly 16% lower than the 95.7% reported 41 . In both cases, all other performance statistics were identical or nearly identical, so it is likely that the discrepancies in Localizer and PCLR represent either miscalculations or quality of the transcriptional data used for analysis in the original publications.
Next, cross-validation of subcellular prediction programs was performed against proteins with experimentally-determined subcellular localization retrieved from AT_CHLORO 23 , PPDB 24,57 , CropPAL and CropPAL2 26 and Suba4 25,58-60 , resulting in 42,761 nonredundant sequences including 32,450 proteins validated by mass spectrometry (MS) and 3,722 validated by GFP. Most prediction algorithms were found to have lower performance against biological data reported in the original reports, as shown in Table 2 and Fig. 1. However, substantial differences were observed based on the method of experimental validation. On average among the six algorithms, sensitivity was 15.7% higher in the GFP-validated dataset while no significant change in specificity was found; this difference resulted in 10% higher overall accuracy and an increase of 0.159 in MCC for GFP-validated proteins. By further narrowing focus to a dataset of proteins validated by both methods, sensitivity increased by an additional 7.6%, and specificity increased 2.5%, on average. Due to the previously reported high false positive rates associated with shotgun proteomics of organellar proteomes 27,28 , program performance was expected to be much higher for GFP-validated proteins. While the dataset containing proteins experimentally validated by both GFP and mass spectrometry showed the highest apparent performance for the six subcellular prediction algorithms -and is likely closer to the biological accuracy of these programs -it contains roughly a third as many proteins as the GFP-validated dataset and is heavily biased by Arabidopsis sequences. Therefore, remaining comparisons focused on the GFP-validated dataset. Similarly, MCC was used as the primary measure of biological accuracy of in silico approaches to avoid problems due to drastically different dataset sizes.
Overall, the highest-performing program in terms of MCC was Localizer, followed by MultiLoc2-HR, TargetP, PCLR, PredSL, WoLF PSORT, and MultiLoc2-LR. Of these, PredSL and MultiLoc2-LR performed poorly with GFP-validated proteins compared to the original reports, while other programs decreased marginally or performed similarly to the published MCC. Among the six programs that were evaluated, Localizer had the highest performance regardless of the experimental method used for validation, which is surprising since it is a simpler tool than annotation-based methods which have been at the forefront of subcellular prediction methods recently. Part of Localizer's increased accuracy may be due to its unique capacity to predict dual-targeted mitochondrial/ chloroplast proteins. Over 200 dual-localized proteins have been described in Arabidopsis 61 and over 500 are predicted to have ambiguous transit peptides 62 . Increased accuracy in the prediction of these sequences in Localizer could alone account for a portion of its higher performance. After Localizer, MultiLoc2 had the next-highest MCC and also had the highest specificity of any program, at 83% in GFP-validated proteins. MultiLoc is a hybrid method combining annotation and sequence analysis, so these findings support that the use of hybrid methods yields robust biological specificity. However, MultiLoc also had the worst sensitivity of any program, correctly predicting only 50% of bonafide plastid-targeted proteins validated by GFP or 31% of sequences validated by  Table 1. Self-Reported Performance of Six Algorithms on Prediction of Plastid-Targeted Proteins. Self-reported values for overall and plastidial sensitivity (SE), specificity (SP), Matthew's Correlation Coefficient (MCC), and accuracy (ACC). Parentheses indicate values that were calculated to be different from the original paper using the same data. Programs marked with an asterisk (*) had a confusion matrix available, while those marked with a cross ( †) did not, but confusion matrices were inferred by the available data; estimations were left as non-integer values, and therefore suffer from rounding errors in MCC and ACC calculations. Localizer, marked with a double cross ( ‡), was re-run with the original dataset provided in the publication's supplementary information 41 .
either GFP or mass spectrometry. TargetP, which has historically been the most popular subcellular prediction program for plants since its introduction, was found to perform at lower accuracy than earlier estimates: even when using the more conservative GFP-validated data, specificity was only 59% and sensitivity was 67%. Previous experiments using high-throughput shotgun proteomics have reported that the sensitivity of TargetP is as low as 62% 3,[63][64][65] . Use of strictly-curated data improves the apparent sensitivity up to 86%, but false positive rates are still problematic as a specificity of about 65% is observed 66 . The results presented here suggest that the biological accuracy of TargetP is somewhat closer to the initial estimates on non-curated data. PredSL, PCLR, and WoLF-PSORT were the lowest-ranked programs by MCC for prediction of plastid-targeted proteins, in that order, but typically had higher sensitivity than Localizer or MultiLoc2. Differences in the amino acid composition of transit peptides are observable between rice and Arabidopsis, which have an overrepresentation of alanine and serine, respectively 66 . Therefore, differences in the prediction of monocot or eudicot sequences were assessed, and different programs displayed significant bias (Table 3).  Table 2. Review of Algorithms using modern curated datasets (combined). For each program, SE, SP, MCC, and ACC are reported compared to in vivo experimental data using a conservative dataset of GFP-validated proteins, or a larger but more liberal dataset comprised of both GFP and MS data. Difference between observed performance statistics of different datasets is presented as GFP minus MS/GFP. MS data was found to have increased error especially for observed sensitivity, indicating that a large number of MS-validated proteins are likely artefactual. Furthermore, this suggests that the overall performance of subcellular prediction methods is likely more accurate than high-throughput proteomics reports suggest. Sensitivity can be inverted (1-SE) to yield the false negative rate, i.e. the fraction of proteins that were experimentally found to be plastid-targeted by the given experimental method but predicted to be non-plastidial. Likewise, specificity can be inverted (1-SP) to yield the false positive rate, i.e. the fraction of predicted experimentally determined to be non-plastidial that were found by the prediction algorithm to be plastidial. PCLR was the most drastically affected, with an MCC bias of +0.091 in monocots, representing a roughly 20% increase compared with eudicots. This finding is somewhat unsurprising because PCLR is the only program which uses sequence composition alone to make predictions and is, therefore, more susceptible to bias than motif-or annotation-based methods. TargetP was the only other tool that favored monocots, with an increase of 0.055 (+10.2%) in MCC. A marginal difference between monocot and eudicot prediction was observed when Localizer was used, which differed by only 0.008 in MCC, slightly favoring eudicots. Eudicot sequences were favored in the other prediction programs, with between 0.043 (+10%) higher MCC in WoLF POSRT and 0.066 in PredSL (+14.9%). To the best of our knowledge, this is the first study to report this type of error or bias for in silico prediction methods. Some differences have also been described for the proposed subunits of the TIC translocon in grasses, which could result in coevolution of the transit peptide sequence composition [67][68][69] . Choice of training and cross-validated datasets could significantly sway the predictions of sequence-based methods, while overrepresentation or prioritization of sequences for Arabidopsis and thereby eudicots could introduce bias to annotation-based methods. Although these species-specific differences are smaller than differences observed for sequences validated by mass spectrometry compared with GFP, they are still noteworthy and have consequences for whole-genome prediction. In contrast, WoLF-PSORT and Localizer were found to have insignificant if any bias, making them attractive both as standalone programs or in combinatorial approaches where they could mask biases of other programs.
Combinatorial workflow outperforms single programs. Use of multiple prediction algorithms in combination is a powerful strategy to combine the strengths and overcome the limitations of single programs. Combinatorial approaches have been used to improve the accuracy of predictions in whole-genome analyses (e.g., 2 ) or to curate mass spectrometry data (e.g., [70][71][72][73] ). Additionally, a combinatorial workflow using 22 prediction algorithms and four experimental techniques is used in the SUBAcon algorithm implemented for the SUBA4 database of Arabidopsis proteins which reportedly yields up to 97.5% accuracy for chloroplast localization and 90% for other compartments 25,60 . While SUBAcon does not strictly require experimental data to perform predictions, available evidence weighs heavily on the final prediction and contributes to the reported accuracy. Even if experimental evidence were to be ignored, the use of 22 separate subcellular prediction algorithms is not feasible for individual researchers or application to enormous datasets. Therefore, a bioinformatics-based workflow that can work efficiently would be desirable. Calculations were performed for each possible permutation of subcellular prediction algorithms and for all possible acceptable thresholds for each combination as applied to GFP-validated proteins. For example, for the combination of TargetP, PredSL, and Localizer, three thresholds were tested in which one, two, or all three programs needed to predict plastid localization to consider that protein as having a plastid transit peptide. To simplify analyses, the poorly-performing WoLF PSORT was removed from consideration (results including WoLF PSORT and datasets including MS-validated proteins are available in Supplementary File 1). In total, 80 unique workflows including the five remaining standalone program workflows were evaluated against GFP-validated proteins, the results of which are graphically summarized in Fig. 1, and numerically ranked by MCC in Table 4. Unequivocally, the results demonstrate that combinations of programs tend to outperform single programs for GFP-validated data: among the 25 workflows with the highest MCC, 23 were combinatorial approaches, while the standalone Localizer ranked tenth and Multiloc2-HR 22 nd . Localizer was not only the best-performing standalone program but was also overrepresented in combinatorial workflows: except the standalone Multiloc2-HR workflow, Localizer appeared in all 25 top-performing workflows. It is interesting to note that combinations that rank higher tend to combine programs with high sensitivity with counterparts that have lower sensitivity but higher specificity, thus correcting for each other's deficiencies. Specifically, most of the combinations with the highest MCC and ACC tend to include Localizer most often, followed by MultiLoc2, TargetP, PCLR, and lastly PredSL. The ranking of Localizer is unsurprising given that its relatively balanced and high sensitivity and specificity are unparalleled by any of the other programs. However, MultiLoc2's extremely high specificity makes it a  Table 3. Performance of prediction algorithms against GFP-validated proteins from monocots and eudicots. Performance of each prediction algorithm in monocots and eudicots and the difference between these datasets is presented; dataset sizes are roughly similar for monocot and eudicot sequences, but MCC is still preferable for comparison. 161 plastid-localized proteins and 640 non-plastid-targeted proteins are included for monocots, while eudicots include 489 plastid-targeted and 2,432 non-plastid-targeted proteins. Sensitivity can be inverted (1-SE) to yield the false negative rate, i.e. the fraction of proteins that were experimentally found to be plastid targeted by the given experimental method but predicted to be non-plastidial. Likewise, specificity can be inverted (1-SP) to yield the false positive rate, i.e. the fraction of predicted experimentally determined to be non-plastidial that were found by the prediction algorithm to be plastidial.
valuable component of many workflows despite its low sensitivity. The best performing workflow used TargetP, Localizer, and Multiloc2 and required 2 of the three programs to predict plastid targeting to define a sequence as containing a plastid transit peptide; specificity of 78.5%, the sensitivity of 64.6%, and MCC of 0.659 was achieved with this approach. In comparison to TargetP alone, a nearly 20% increase in specificity was observed with no loss in sensitivity. However, as the annotation-based functions of MultiLoc2 make it difficult to run on extensive datasets, an alternative workflow using a "2 of 2" consensus approach for TargetP and Localizer was found which ranked 2 nd and achieved a marginally higher specificity of 80.7%. Furthermore, comparing the accuracy of the best workflows to Table 2 and to prior evaluations of experimental methodology (e.g., 66 ) supported the hypothesis that bioinformatics methods could reach parity with mass spectrometry in characterizing the plastid proteome. Due to the increased simplicity and comparable performance of the TargetP/Localizer consensus approach, this workflow was selected for subsequent genome-scale prediction of plastid-targeted proteins.
predicted plastid proteome correlates with genome size. As a demonstration of the utility of the Localizer and TargetP workflow, subcellular prediction was performed for the whole proteomes of fifteen phylogenetically diverse species. Six monocot species, including Anthurium amnicola, Brachypodium distachyon, Oryza sativa, Panicum virgatum, Setaria italica, and Sorghum bicolor and eight eudicots, including Arabidopsis thaliana, Fragaria vesca, Glycine max, Malus × domestica, Populus trichocarpa, Prunus persica, Solanum lycopersicum, and Vitis vinifera were chosen. Additionally, Amborella trichopoda, a species which diverged from the rest of the angiosperms prior to the divergernce of monocots and eudicots, was also incorporated into the comparative analysis. Complete information including data version numbers, proteome sizes, and prediction of plastid-targeted proteins by Localizer and TargetP is summarized in Table 5. In Arabidopsis, 2,826 proteins were predicted to be plastid-targeted, representing 8.8% of all protein isoforms. This finding is in agreement with the conservative estimates of the Arabidopsis plastid proteome 2,4,74 . Similar percentages were calculated in other species but varied from a low of 6.4% in tomato to a high of 9.3% in A. amnicola. As expected, the absolute number of predicted plastid-targeted protein-coding genes showed a high correlation with the genome size (R 2 = 0.965) (Fig. 2). This result suggests that an increase in genome size and gene content yield a similar increase in the total number of plastid-targeted proteins. Over 10,000 of the Arabidopsis sequences have experimentally-determined localization, and comparing predictions for these sequences revealed an apparent sensitivity of 55.6%, specificity of 89.8%, accuracy of 83.6%, and MCC of 0.614. Sensitivity is somewhat low in this estimation due to the use of MS data, which includes many false positives, but the high specificity suggests good prediction accuracy. With the combination of the high correlation with experimentally-validated proteins and the lack of monocot/eudicot  Table 4. Best combinatorial prediction approaches ranked by Matthew's Correlation Coefficient (MCC). The sensitivity (SE), specificity (SP), Matthew's Correlation Coefficient (MCC), and accuracy (ACC) are presented for each program. Almost all of the highest-performing programs utilized Localizer in their approach, followed by Multiloc2 and TargetP. Localizer and MultiLoc2 were also the only two programs which ranked highly as standalone algorithms, whereas the remaining workflows used two or more individual programs.
clustering of gene families. Although the plastid is highly dependent on proteins imported from the nucleus for normal viability and function, the size and diversity of the plastid proteome across the plant kingdom remain poorly understood. The hypothesis that the plastid proteome is diverse and each species has a unique set of plastid-targeted proteins was examined by grouping sequences into homologous protein groups using two parallel clustering methods (Fig. 3). Clustering method has a significant impact on the size and accuracy of the resulting clusters, and therefore on the number and relevance of predictions. Reciprocal best BLAST Hits (RBH) using ALL-vs.-All BLAST comparisons of whole proteomes are a standard proxy for orthology in comparative genomics, although they are susceptible to inclusion of weakly homologous paralogs. BLAST-based approaches combined with Markov clustering or similar methods to remove paralogs are used in commonly-cited methods such as InParanoid 95 , OrthoMCL 96 , and COG 97,98 . However, these methods can bias single-copy genes or highly conserved families which can be problematic for polyploid genomes where many-to-many gene relationships are common 99,100 . For instance, the popular OrthoMCL fails to detect many homologous proteins with conserved expression patterns, and therefore with likely conserved functions, between rice and Arabidopsis 101,102 . In contrast, more straightforward RBH methods often outperform more complicated algorithms on eukaryotic genomes 103 .
A simplified RBH approach, allowing many-to-many relationships, was determined to be most appropriate for this analysis to avoid fracture of gene families with paralogs or co-orthologs. Initial homologous relationships were identified using pairwise BLAST-P comparisons of two species; only sequences which are mutually the best BLAST hits for each other were utilized. Similar methods have used 40% as an appropriate identity and coverage threshold for orthologous relationships 10,104-106 . Therefore 40% was used as the initial threshold of homology. Initial clustering generated many small clusters, so a supplemental method for expansion of clusters, using reciprocal better BLAST hits of each species = ' proteome BLAST' ed against itself, was tested (Supplementary Figure 2-1). A 90% threshold was determined to be optimal for clusters with fewer species decreasing significantly in number, while clusters containing a majority of species remained stable or increased. In contrast, application of between 60 and 80% expansion thresholds caused the liberal merging of clusters into extremely large clusters representing thousands of individual sequences. Additionally, GO term similarity was assessed within clusters at each population size based on the number of species in the cluster and was found to increase slightly for clusters containing few species when using a 90% expansion threshold, while more massive clusters experienced no change or slight decreases.
An alternative approach called UCLUST was implemented to complement the RBH method with a faster and more efficient technique because its semi-global algorithm detects homology in a fraction of the time required for BLAST and becomes much more efficient on enormous datasets. Initial clusters were constructed at a 40% identity and 40% coverage threshold similar to the RBH approach. However, initial clustering produced smaller clusters and resulted in cluster fragmentation. Therefore, modifications were implemented to expand initial clusters by randomly selecting sequences out of each initial cluster and iterating the UCLUST search at more stringent conditions using the selected sequences as new centroids (Supplementary Figure 2-  significantly increased the number of clusters with many species, which largely came from the drastic reduction of the number of single-species clusters. As with RBH, a 90% expansion threshold was found to be optimal and increased the number of clusters sharing 14-15 species roughly 4-fold, while lower thresholds resulted in the frequent grouping of nonhomologous sequences. Comparison of GO similarity for clusters containing multiple species showed that similarity increased slightly or remained stable for nearly all cluster sizes in the 90% expansion threshold compared to the initial, non-expanded UCLUST analysis. The number of iterations required to fully expand cluster space in UCLUST was also examined, and it was found that most clusters were completely expanded by ten iterations, while further iterations yielded diminishing returns (Supplementary Figure 2-3). A total of 100 iterations were performed to avoid problems with the randomization of centroid sequences. Application of the optimal clustering methods to the proteomes of the species chosen generated 170,877 clusters using RBH (Table 6) and 103,501 clusters using UCLUST (Table 7). Nearly all the additional clusters in RBH were from single-species clusters or singleton sequences (data not shown): 150,067 of the RBH clusters (87.82%) were single-species clusters of which 134,319 were singleton sequences, while UCLUST detected 74,059 single-species clusters (71.55%) including 45,033 singletons. Some of these may be orphan genes, but they are more likely to be prediction and annotation errors or pseudogenes because the lack of homology implies lack of conserved function or extreme mutation rates that are more likely to occur in non-coding sequences. A total of 20,810 and 29,442 clusters in RBH and UCLUST approach, respectively, contained sequences from multiple species; although they represented a minority of clusters, they contained the majority of initial sequences. A bimodal distribution was observed in both methods in which two clusters, the first containing 14-15 of the species and the second containing just 2-3 species, represented the majority of the clusters (Fig. 3A). Comparatively fewer clusters contained between 4-13 species. Of the conserved clusters containing all 15 species, RBH detected 4,090 clusters, while UCLUST yielded 3,295. GO similarity between UCLUST and RBH was remarkably consistent, but UCLUST had somewhat better scores for conserved clusters containing plastid-targeted sequences from all species and lower scores for semi-conserved or non-conserved clusters containing few species (Fig. 4B). Across both methods, GO similarity decreased with increasing cluster size. While the merging of nonhomologous sequences may be partially responsible for this decrease, the annotation methods and parameters are not identical for the species used in this study, which artificially decreases the apparent similarity score regardless of clustering specificity.

Figure 2.
Illustration of RBH and UCLUST Sequence Clustering Methods. Initial (A) and expanded (B) RBH figures indicate clustering between species 1 (blue circles), 2 (green circles), and 3 (orange circles). Bidirectional best BLAST hits between sequences from different species are indicated with black lines; bidirectional better BLAST hits between sequences within the same species with red lines and fragments with dotted red lines. For UCLUST, the initial length-sorted (C) run is illustrated with yellow stars indicating centroids, small gray patterned circles indicating non-centroid sequences, large black circles indicating the match range for initial centroids, and black lines indicating sequence clustering for the initial run. For clarity, sequences are patterned to indicate belonging to each initial cluster, and red dotted lines indicate cluster fragmentation. Randomization of centroids (D) mitigates this artificially-induced problem; gray patterned stars indicate randomly-selected centroids, light blue circles indicate the match range for randomly-seeded centroids, and red lines indicate new matches found with red lines. Distances not drawn to scale. www.nature.com/scientificreports www.nature.com/scientificreports/ Identification of gene families with conserved plastid targeting. Genomes of endosymbiotic bacteria contain 1,500 proteins on average, and plastids are likely to contain similar numbers when accounting for both the plastid genome and core nuclear-encoded plastid-targeted protein-coding genes 107 . To determine the number of gene families with conserved plastid localization, clusters containing at least 13 species, of which all species contained at least one predicted plastid-targeted sequence or at least four non-plastid-targeted sequences were selected. These parameters were chosen to account for assembly and annotation errors and to correct for the 39% false negative prediction rate for bonafide plastid-targeted proteins which could eliminate many truly conserved clusters. There is a nearly 20% chance that at least one of four random sequences with non-plastid localization prediction is a false negative, but sequences that already share homology to predicted plastid-targeted sequences have a significantly higher likelihood of being false negatives. A workflow diagram representing cluster detection, filtering, processing, and categorization is represented in Fig. 4. Applying this workflow, 628 conserved protein clusters were found in RBH (Table 6, Fig. 5), while UCLUST detected 828 (Table 7, Fig. 6). Of these, 621 clusters in RBH and 817 in UCLUST also contain sequences from A. trichopoda, and all have several monocot and eudicot sequences, strongly indicating that these clusters represent the fundamental core plastid-targeted protein-coding gene families. Previous estimates predicted that 857-1020 sequences were shared between rice and Arabidopsis, another report projected that between 289-737 proteins were shared among the chloroplast proteomes of seven plant species 2,10 . Identification of gene families with conserved chloroplast transit peptides is an essential output of this work, as in silico methods can quickly identify conserved plastid-targeted proteins that have failed to be detected by genetic screens due to embryo lethality, gene redundancy, or random chance. Several methods have validated these sequences as truly plastid-targeted and representative of conserved plastid-targeted protein-coding genes. First, Arabidopsis proteins with experimentally-validated localization were examined within the conserved clusters. A total of 84.2% (183 proteins) of predicted plastid-targeted Arabidopsis sequences in conserved RBH clusters were validated by GFP and 94.5% (1,054) were validated by MS. The same was true for 80.5% (154 proteins) and 92% (855 proteins) in conserved RBH and UCLUST clusters, respectively (Supplementary Files 3 and 4). While these methods have yielded good overall sensitivity, small errors at initial stages of clustering can compound in larger clusters and result in unrealistically high numbers of sequences. For RBH, an average of 113.9 sequences and median of 61 were present in conserved clusters while UCLUST produced an average of 125.9 sequences and median of 84. Most sequences in these clusters come from a small set of species: G. max, P. virgatum, P. trichocarpa, and V. vinifera each contributed an average of over 10 sequences each to clusters with shared plastid localization prediction, while M. × domestica contributed over 10 sequences on average in UCLUST (summarized in Supplementary Files 3 and 4). Significant gene duplication or inclusion of Both methods resulted in similar distributions of clusters, although RBH resulted in slightly more clusters with 13-15 species and UCLUST resulted in more clusters from 2-12 species. The slight increase in clusters with five species is interesting, and may result from sequences with homology within the Poaceae family or within Rosids but with no significant homologs outside those groups. (B) GO annotation similarity in RBH and UCLUST clusters. Lower similarity scores in higher-order clusters are partially due to different annotation methods and thresholds used for different species. Annotation similarity was generally higher in RBH at smaller cluster sizes and higher in UCLUST for larger clusters. Similarity decreased with the increasing representation of species, which may be partially caused by different annotation methods used for different genome sequencing projects, or may alternatively be caused by decreased homology within large clusters. www.nature.com/scientificreports www.nature.com/scientificreports/ multiple gene isoforms especially in those species likely accounts for a portion of the larger cluster sizes, but more distant paralogous sequences which are less likely to share biological function are also likely to be common. Thus, the list of conserved clusters reported here is not meant to be definitive and final, but rather a general guide which will require phylogenetic and experimental validation. In cases where larger clusters contain multiple paralogs or non-homologous, phylogenetic methods could resolve homology relationship with higher efficiency than the   Table 7. UCLUST Clustering Results by Species. Clustering of gene families was performed using an initial UCLUST iteration with 40% coverage and 40% identity followed by extraction of random sequences from each cluster to seed additional iterations performed at 90% coverage and identity. Clusters containing shared sequences were merged, followed by identification of clusters containing plastid-targeted sequences in each species. The number of clusters overlapping with Arabidopsis for all clusters and plastid-targeted clusters was identified, as well as the number of clusters containing a plastid-targeted sequence from only the selected species.
Next, enrichment of gene ontology (GO) annotations was performed in conserved clusters by finding GO terms shared in at least three individual sequences and for over 10% of sequences. Terms were compared to annotations extracted using the same criteria for all the clusters of the respective clustering method and GO term enrichment was performed using BLAST2GO 108 . Overall, 53 terms including 29 terms associated with biological function, 23 associated with the cellular component, and one associated with the molecular process were found for RBH (Table 8). In UCLUST, a total of 33 terms were found, including 15 associated with the biological process, 17 with the cellular component, and one with the molecular process ( Table 9). The most significantly enriched GO terms under the biological process ontology for both RBH and UCLUST methods were GO:0015979 (photosynthesis) and GO:0008152 (metabolic process), while a majority of the remaining highly enriched terms were associated with homeostatic processes (GO:0042592), cellular component organization (GO:0016043), single-organism biosynthetic processes (GO:0016043), generation of precursor metabolites (GO:0006091), and lipid metabolism (GO:0006629). In the RBH method, additional terms associated with amide, peptide, and organonitrogen compound biosynthesis and metabolism (GO:0043604, GO:0043603, GO:0043043, GO:0006518, -all comparisons of proteomes from two separate species at a 40% identity, 40% coverage threshold, and 2. Secondary cluster edges were generated by finding all reciprocal better-BLAST hits in all-v-all comparisons of each proteome against itself at a 90% identity, 90% coverage threshold. For UCLUST (right panel), 1. An initial run was performed at 40% identity and 40% coverage threshold on a FASTA file containing sequences from every species in length-sorted order, and 2. Random sequences of at least 90% identity and 90% coverage were extracted from each cluster, this subset was length-sorted, and then the original length-sorted FASTA file was concatenated to the new seed sequences. This process was iterated 100 times, and a separate UCLUST run was performed for each iteration. Downstream processes for RBH and UCLUST were identical: 3. All clusters/pairs with a shared sequence were condensed into single clusters, 4. All sequences that failed to have at least 40% identity and 40% coverage based on BLAST-P analysis to any of the predicted plastid-targeted sequences in the cluster were trimmed out, 5A. all clusters with at least three species were extracted, and 5B. Clusters containing plastid-targeted sequences were sorted into "conserved, " "semi-conserved, " and "non-conserved" groups according to the number of species with predicted plastid targeting and the taxonomic grouping of those species. cTP -chloroplast transit peptide. For the molecular process ontology, structural molecule activity (GO:0005198) was enriched in RBH and catalytic activity (GO:0003824) in UCLUST. These GO terms were further compared to the results of a previous study involving intergeneric analysis that described 737 conserved plastid-targeted proteins 10 . In this study, 42% of enriched terms found using UCLUST overlapped with the methods reported previously 10 . RBH methods were somewhat lower because more enriched terms were found, but still overlapped with the previously published dataset by 24%. These results are remarkably similar given that only GO terms from Arabidopsis had been examined previously and also different methods of GO enrichment had been used in those studies. The final and perhaps the most important test of the biological significance of conserved plastid-targeted clusters is whether they contain proteins expected to be present in plastids of all higher plants. Gene names were retrieved from TAIR10 Figure 5. RBH Visual Representation. For "unique" clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 628 protein clusters were shared between all 15 species, 1,002 had plastid-targeting specific to either monocots or eudicots, and 2,943 had plastid-targeting specific to only a single species. Figure 6. UCLUST Visual Representation. For "unique" clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 828 protein clusters included plastid-targeted sequences from all 15 species, 1,983 had plastid-targeting specific to monocots or eudicots, and 5,632 had plastid-targeting specific to a single species. (2020) 10  www.nature.com/scientificreports www.nature.com/scientificreports/ for all Arabidopsis sequences in conserved clusters, and many of the most prominent plastid proteins were confirmed to be present in clusters for both RBH and UCLUST methods. The following is not intended to be an exhaustive list but merely a representative of the types of proteins detected in conserved plastid-targeted clusters; a complete list of annotations and gene names in RBH and UCLUST clusters are available in Supplementary Files 3 and 4. Among genes involved in primary photosynthesis, HCEF, LhcA1, LhcA2, LhcB1, LhcB2, LhcB3, Lhcb4, LPA1, LPA3, PPDK, and RbcS were detected in both methods, while LPA66 was found in RBH only. Photosystem subunits Psa-E, Psa-F, Psa-G, Psa-H, Psa-K, Psa-N, PsbP, Psa-O, PsbQ, PsbR, PsbS, PsbW, and PsbY were also found in both methods, while PsbT-N and PsbX were found only in RBH and PsbO was found only in UCLUST. Among ribosomal proteins, Rps1, Rps9, Rpl4, Rpl11, and Rpl12 were detected by both techniques, while Rpl9 and Rpl15 were only found using RBH and Rpl10 was found only with UCLUST. Proteins involved in translocation and chaperone functions found by both methods included ClpB, ClpC, ClpD, ClpP, ClpR, FtsH, Hsp60, Hsp70, Hsp88, Hsp90, Hsp98, Cpn10, Cpn20, Cpn60, Vipp1, Alb3, Alb4, TatC, Tic20, Tic21, Tic40, Tic55, Tic110, Toc75, and Plsp1. The Sec translocase subunits SecA, Scy1, and Scy2 were uniquely found in RBH, while organellar oligopeptidase OOP was also found in UCLUST. Finally, genes associated with primary plastid metabolism (SBPase, TPT, FRUCT5, G6PD2, and G6PD), heme biosynthesis (GUN2, GUN5, HEMA, HEMB, HEMC, HY2, PORA, PORB, and PORC), and fatty acid synthesis (ACC2, FAB2, FAD7, FAD8, FATA, FATB, lipoxygenase) were found in core clusters.
Taken together, the good correlation of protein clusters with experimentally-validated sequences, the enrichment of expected annotation terms, and the presence of expected highly-abundant proteins or proteins critical  Table 9. Enriched GO terms for Conserved Plastid-Targeted UCLUST Clusters. Clusters containing at least 13 species with predicted or likely plastid-targeted sequences were mined for common GO terms and compared against terms extracted for the total set of UCLUST -derived clusters using BLAST2GO. All terms enriched above p = 1.0E −5 in core plastid-targeted clusters are represented. (2020) 10:8281 | https://doi.org/10.1038/s41598-020-64670-5 www.nature.com/scientificreports www.nature.com/scientificreports/ to chloroplast biology suggest that both the RBH and UCLUST methods achieved good accuracy and sensitivity for genes with conserved chloroplast targeting which are likely critical in all photosynthetic plants for minimal chloroplast function. It is noteworthy that 194 clusters in RBH and 333 core clusters in UCLUST contain at least one Arabidopsis sequence but have no associated gene synonyms available ( Supplementary Files 3 and 4). As the sensitivity for conserved plastid-targeted proteins was found to be very high overall, many of these 194-333 clusters with missing annotation information are likely biologically accurate, in which case they are excellent candidates for understanding hitherto uncharacterized aspects of chloroplast biology.

Analysis of semi-conserved and non-conserved plastid-targeted proteins. Semi-conserved
plastid-targeted protein-coding gene families in which predicted plastid-targeting was found for two or more sequences only in monocots, only in eudicots, or uniquely in A. trichopoda were identified beginning with the most diverse clades. In each case, all clusters with predicted plastid-targeted sequences or at least four predicted non-plastid-targeted sequences from the outgroup species were removed. A total of 572 gene families with plastid-targeted sequence specific to monocots and 430 to eudicots were found using RBH methods (Table 6, Fig. 5), while UCLUST detected 1,054 and 885, respectively (Table 7, Fig. 6). Additionally, 82 clusters with Amborella-specific plastid targeting were found using RBH, and 195 were found with UCLUST. These findings indicate that gene families with semi-conserved plastid-targeting outnumber core clusters by 73% in RBH and more than 150% in UCLUST. Narrowing focus to the subclade and family level revealed that semi-conserved clusters are still abundant, indicating that significant plastid proteome variation is present across all taxonomic levels. It is plausible that some of the clusters with plastid-targeting specific to either monocots or eudicots have functionally related clusters in the reciprocal group but lack sufficient homology to cluster together. Such an occurrence seems unlikely in most cases because the clustering methods used here were relatively liberal, but isolated cases may still occur. In some cases, non-orthologous or chimeric genes could also functionally replace an otherwise conserved gene and lead to loss of orthologous sequences in particular species or taxonomic groups 109,110 .
Finally, clusters with predicted plastid targeting only present in a single species were identified in RBH ( Table 6, Fig. 5) and UCLUST (Table 7, Fig. 6). Singletons and clusters containing only a single species were discarded as these likely represent gene prediction errors. For example, predicted proteins in Malus which do not share homology with proteins in other species are typically poorly-supported by transcriptomics evidence: examination of over 300 such sequences revealed only one that had full coverage and was not a smaller fragment of a larger protein (data not shown). Since the chloroplast transit peptide is presumed to have arisen recently in each cluster, the term "nascent plastid-targeted proteins" (NPTPs) was coined to represent such proteins. Unsurprisingly, species with large and complex genomes possessed a more significant number of NPTPs: A. amnicola had the least, at just 52 in RBH and 97 in UCLUST, while P. virgatum had the most, with 682 NPTPs found in RBH and 1,458 in UCLUST. The predicted proteome of A. amnicola is based on transcriptomics data rather than genome-wide prediction, while P. virgatum has the largest genome and most extensive predicted proteome of the species in this analysis, so these trends are consistent with expectations.
Additionally, up to 728 proteins were uniquely targeted to the plastid in M. × domestica, and between 300-400 proteins had species-specific plastid transit peptides in B. distachyon, F. vesca, G. max, S. italica, and S. bicolor. Arabidopsis had some of the lowest estimates of NPTPs, with only 74 found in RBH and 166 in UCLUST. Species-unique plastid-targeted proteins had a moderately linear correlation with the total number of sequences in each species R 2 = 0.73 in RBH and 0.72 in UCLUST, Fig. 7A), but the removal of the outlier P. virgatum resulted in nonlinear correlation (Fig. 7B). Consequently, extreme increases in genome size and complexity are hypothesized to create more opportunities for the evolution of novel transit peptides and diversification of the plastid proteome, but differences are subtler when the genomes being compared are closer in size. Previous literature (e.g. [111][112][113] ) has suggested that gene duplication is a prerequisite or at least greatly encourages neofunctionalization via novel subcellular targeting, and the generally linear correlation with proteome size suggests that this may indeed be the case. However, based on the data, the evolution of the plastid proteome is more likely to be driven by environmental adaptation and selection pressure 114 .
While transit peptide structure and sequence were expected to be conserved within each thus-identified cluster, searching for shared homology between transit peptides of different clusters was not performed. Without experimental data to support such identification, the motifs thus identified would be unreliable predictions, and it would be hard to state if the observed convergent evolution detected in novel transit peptides has any cause-effect relationship.
As with the conserved plastid-targeted clusters, the accuracy of targeting prediction in NPTPs was cross-validated against experimentally-validated proteins from Arabidopsis. For the RBH clusters, 75% (4 proteins) were validated to be true plastid proteins via GFP, and 53.8% (17 proteins) validated by MS. For UCLUST, 29.4% (17 proteins) were validated by GFP, and 41.4% validated by MS. Specificity was also very high: only 6.3% of 300 predicted non-plastid-targeted proteins in RBH-generated NPTP clusters were found to actually be plastid-targeted by GFP, while the rate in MS-validated proteins was 13.4% (967 proteins). UCLUST generated similar results, with false negative error rates of 3% (493 proteins) in GFP-validated data and 12.5% (1,369 proteins) for MS-validated data. The few false negatives in predicted NPTPs may be representative of ambiguous/ intermediate sequences in clusters which are already predicted to be uniquely chloroplast-targeted in Arabidopsis and therefore represent missing links. More pertinently, the GFP estimates are likely more accurate due to the experimental specificity errors inherent in mass spectrometry, and the 3-6% error rates are within an acceptable range.
Overall, these data affirm that evolution of the plastid proteome is highly dynamic at the species-level. Compared to previous reports, somewhat reduced species-unique plastid-targeted proteins are reported here (e.g., 2,10 ) due in part to the removal of singletons and single-species clusters. Homology to sequences in other species dramatically decreases the probability of pseudogenes and gene prediction errors. Remarkably, the (2020) 10:8281 | https://doi.org/10.1038/s41598-020-64670-5 www.nature.com/scientificreports www.nature.com/scientificreports/ monocot species had an average of 50-60% more species-unique plastid-targeted protein clusters than eudicot or Amborella counterparts. Even after removal of the outliers P. virgatum and A. amnicola, monocots still had 40% more plastid-targeted clusters than eudicots according to RBH methods, and over 80% more clusters using UCLUST. The reasons for this could be two-fold. First, the monocot species in this analysis have larger proteomes on average, increasing the overall likelihood for both de novo evolution of NPTPs and for retention of orphaned singleton/species-specific proteins. Secondly, monocots, and especially grasses, have been described to have many presence/absence variants (PAV's) and copy number variants (CV's) in their genomes. Pan-genome sequencing of B. distachyon revealed over 7,000 pan-genes that are not present in the reference genome, and an average of 9 Mb of sequence in each accession does not align to the reference genome 115 . Similar rates of PAV's have been reported for cereal crops: only half of the pan-genome diversity of maize is present in the reference genome 116 , over 21,000 predicted wheat genes are not represented in the reference genome 117 , and 8,000 predicted rice genes are not represented in the Nipponbare reference genome 118 . In contrast, pan-genomes of Arabidopsis 119 and tomato 120,121 describe variation primarily at the SNP and small insertion/deletion levels, although one report described that 14.9 Mb of the Columbia-0 genome was absent in one or more other accessions 122 . In Brassica oleracea, less than 20% of genes were affected by presence/absence variation 123 . Somewhat higher variation is observed in legumes: 302 soybean lines including varieties, landraces, and wild accessions revealed 1,614 copy number variants and 6,388 segmental deletions, and 51.4% of gene families were dispensable 124 while in Medicago truncatula, 67% of annotated genes may be dispensable 125 . It bears consideration that the pangenomes of the grasses are primarily within cultivated accessions and have already passed through a domestication filter which already significantly reduces genomic diversity, whereas the pangenomes of most of the eudicots include wild and landrace accessions. These trends suggest that PAV's and CV's are significant drivers of plastid proteome evolution, either by retention of orphaned genes or by de novo evolution of transit peptides in duplicated genes. Despite the smaller number of species-unique clusters, conserved plastid-targeted proteins are still outnumbered up to 25-fold by species-unique or semi-conserved proteins. If even a fraction of these sequences is accurate and expressed in vivo, each could impart novel biological functions because escape from the evolutionarily established biochemical and regulatory environment could impart a different function in a new subcellular environment without changing the functional sequence of the protein. Thus, each of these is an excellent candidate for further characterization to determine www.nature.com/scientificreports www.nature.com/scientificreports/ if unique phenotypes are created by relocalization to the plastid. Conversely, species-specific plastid-targeted protein-coding genes in model systems could yield misleading interpretations because the same phenotypes for those genes would not be observed in species where homologs do not have plastid-specific localization. Such a situation is potentially problematic for the unique plastid-targeted proteins detected for Arabidopsis, B. distachyon, and rice because it is likely that some of these genes already have a described gene function that is being inaccurately ascribed to plants as a whole. Indeed, out of 113 Arabidopsis proteins with predicted species-specific plastid-targeting, 18 have a described phenotype, and 100 are cited in previous research reports (summarized in Supplementary File 5). In cases where the predicted localization divergence is validated, the mutant phenotypes for those sequences will have to be revised.

conclusions
The evaluations conducted in this study support the hypothesis that a combination of subcellular localization prediction programs can accurately predict chloroplast transit peptides at a whole-genome scale in higher plants and can perform equally well for both monocots and eudicots. The best-performing method was then applied to predict chloroplast proteins globally for a diverse range of angiosperm species and developed both a slow and accurate reciprocal best-BLAST hit method and a fast-liberal UCLUST method to cluster gene families. Though results were not identical, UCLUST yielded comparable results while performing more efficiently. With the addition of more species, UCLUST could be a useful tool to overcome the inefficiency of BLAST-based methods. The consensus of both methods determined that the hypothesis of extreme plastid proteome variability was supported across the taxonomic space. Roughly 700 genes were shared between the chloroplast proteomes in all plant species, but these were vastly outnumbered by proteins with variable plastid targeting prediction. Most of these species-or clade-specific proteins have no known function for the plastid and are excellent candidates for further studies. Additionally, roughly a third of conserved plastid-targeted proteins have no known function and could be targeted for reverse genetics experiments in the future. Biological verification of these sequences remains a significant challenge. Even if good prediction accuracy was achieved, these sequences may be poorly expressed, expressed only in particular conditions, or are nonfunctional. Incorporation of transcriptomics would provide significant evidence that these genes are at least expressed, and patterns of gene expression along with co-expression information may also reveal additional information about their function. Experimental validation using mass spectrometry could also be used, but many proteins may have abundances below detection limits, and technical challenges also remain for the isolation of non-green plastids where they may be more abundant. The decreasing costs of gene synthesis make high-throughput fluorescence protein assays an attractive alternative. In addition to increased sensitivity and specificity compared with mass spectrometry, fluorescent protein assays could also be used to simultaneously validate whether the localization of species-unique proteins are truly different from their nearest predicted non-plastid-targeted homologs, and likewise may be able to provide better spatial resolution. Outer membrane proteins, lacking a classical transit peptide, are only currently predictable based on homology to the mature protein, and thus cannot be predicted de novo. Furthermore, prediction of localization within sub-compartments of the chloroplast remains a challenge. TargetP and other programs offer sub-compartment predictions, but their accuracy remains questionable, making improvement of experimental methods a necessity. The methods and results reported in this study will enable rapid, accurate and cost-effective identification of plastid-targeted proteomes in new plant species as their genomic information becomes available. These research findings are expected to provide a foundation for further research into unique plastid biology and to understand better how diversification of the organellar proteomes contributes to important agronomic, biochemical, culinary, or even aesthetic traits.
Methods cross-validation of in silico techniques. Test datasets for cross-comparison of subcellular prediction algorithms were retrieved from PPDB (2012 update; current as of this writing), AT_CHLORO (January 2015 update; current as of this writing) 23 , Suba4 (30 June 2017 update; current as of this writing) 24 , CropPAL version 58839ba 26 , and CropPAL2 version 74866967 26 . Headers which could not be referenced to the most up-to-date reference proteomes were discarded. For AT_CHLORO, Suba4, and PPDB databases, all genes located on the chloroplast and mitochondrial genomes were removed, and redundant headers were merged. Subsets of data including sequences confirmed by mass spectrometry, GFP fusion, either GFP or mass spectrometry, or both were extracted from each database by filtering for the keywords "Chloroplast" or "Plastid. " All ambiguous results containing experimental evidence for both plastids and at least one other subcellular fraction were removed.
Experimentally validated protein sequences were analyzed with TargetP v. Results for each workflow were converted into binary classification and evaluated for Sensitivity (SE), Specificity (SP), Matthew's Correlation Coefficient (MCC), and accuracy (ACC) as related to plastid localization prediction based on the number of true positives, false positives, true negatives, and false negatives compared to the annotations in the corresponding experimental dataset (see equations below). Combinatorial approaches were performed for each possible combination of programs from two up to all six programs, and different thresholds were evaluated based on the number of programs in agreement for plastid localization. Complete records of individual and combinatorial workflows for each experimental dataset are available in Supplementary File 1. The heatmap in Fig. 1  where tp is the number of sequences correctly identified as plastid-targeted, tn is the number of sequences correctly predicted to be non-plastid-targeted, fp is the number of non-plastid-targeted sequences incorrectly predicted as plastid-targeted, and fn is the number of plastid-targeted sequences that were predicted as non-plastid-targeted. Note that these categorizations are based on the accuracy of the database annotation and any filtering that was applied to data subsets, and they may not reflect biological accuracy.  [82][83][84][85]127 . These sources are described further in Supplementary File 2. Sequence files were processed in CLC Genomics Workbench version 8 (Qiagen Bioinformatics, Hilden, Germany); paired Illumina read files and 454 sequencing files were indicated during import. Graphical QC reports were generated to obtain nucleotide contribution (GC content) and quality distribution (quality scores) by base position. Reads were processed to remove ambiguous nucleotides and base quality scores lower than 0.001. Illumina reads were additionally trimmed at the 5' end until the GC content stabilized within 0.5% of the average, and reads with fewer than 34 bases remaining were discarded. All paired read files were subsequently merged using default settings. All processed read files were assembled de novo with default settings. Assembled contigs of >300 bp were kept and used to predict open reading frames (ORF's). Non-overlapping ORF's with at least 5x average base coverage and >300 bp were extracted and translated into protein sequences. Finally, extracted protein sequences were compared against the existing Malus × domestica v.1.0 predicted gene set 81 downloaded from Rosaceae.org. All hits with greater than 98% ID and coverage (as per 85 ) were tagged as potential duplications or alleles of the original headers but were kept in the peptide dataset in case minor mutations caused differential localization prediction. All sequences generated from this transcriptome assembly are available in Supplementary File 6. In total, 36,477 sequences were obtained, of which 26,881 sequences were determined to be unique in comparison with the apple genome 81 . Addition of the unique genes from the de novo transcriptome created a final dataset of 64,680 unique proteins. Redundant sequences from the resulting transcriptome were retained in case minor differences resulted in differential targeting.

Whole proteome analysis. Predicted proteomes for
The predicted proteomes of all species were filtered to remove any sequences less than 100 residues and which did not begin with methionine. Post-analysis filtering was accomplished by removing singleton sequences that failed to find matches with both the USEARCH method and BLAST (indicated for each sequence in Supplementary File 5). Remaining sequences were analyzed with TargetP v.1.1 43,44 and Localizer v.1.0.2 41 . All sequences predicted by both methods to have a chloroplast transit peptide were classified as plastid-targeted, and all sequences with either "1 or 2" or "0 of 2" chloroplast transit peptide predictions were classified as non-plastid-targeted.
clustering of gene families. Reciprocal Best-BLAST hit clustering was performed as follows: Pairwise BLAST-P (v.2.3.0+ command line executable; 128,129 ) was performed for each species' predicted proteome set against that of every other species in both forward and reverse directions. These results were filtered for hits in which identity and coverage parameters exceeded 40%. Of these, only hits in which two sequences from different genomes were the respective best hit were kept. Next, better-BLAST hits within each species were performed by conducting pairwise BLAST-P of the predicted proteome against itself. Hits exceeding 90% coverage and identity and which was reciprocal within the first 10 hits were collected. Cluster merging was performed by iterating through each possible header and collapsing all pairwise hits containing that header.
Clustering using the UCLUST algorithm proceeded as follows: An initial run on a length-sorted FASTA file containing all sequences was performed using 'Cluster_Fast' function of UCLUST (v.9.2.64_win32; 56 ) with 40% identity and 40% query coverage. Next, random seeds were constructed by extracting a single random sequence from each cluster, sorting the resulting sequences by length, and appending them to a length-sorted FASTA of the full sequence list used in the initial "Cluster_Fast" analysis. 100 randomly-seeded FASTA files were then analyzed with "Cluster_Fast" set to 90% sequence identity. Target and query coverage were additionally set to 0.4 to avoid problems with small query sequences acting as centroids for much larger sequences as a result of USEARCH being performed in sequential rather than length-sorted order. Cluster merging was performed by iteratively searching through each possible sequence header and collapsing all clusters containing that header. Custom scripts were developed for automating program workflows, referencing and translating sequences or headers, performing seed randomization for the modified UCLUST technique, performing cluster expansion, calculating statistics on clustering outputs, and referencing headers to respective clusters for both workflows. Sequence members within merged clusters from RBH and UCLUST methods were referenced to the predicted plastid targeting phenotype, and all clusters containing plastid-targeted members were extracted. Conserved plastid-targeted protein-coding gene families were defined as clusters containing at least 13 species and in which all had either predicted plastid transit peptides or at least three additional sequences. Semi-conserved plastid-targeted gene families were defined as clusters containing plastid-targeted sequences from at least 2 species within each family or clade and no predicted plastid-targeted sequences from species outside that clade. Non-conserved plastid-targeted protein-coding gene families were defined as all clusters containing a minimum of three species in which only one species had a plastid-targeted sequence.
Gene ontology enrichment. Annotations for NPTPs were retrieved from Phytozome 87 for each of the species used in the analysis except Anthurium amnicola and Vitis vinifera, which were retrieved from 76 and 94 , respectively. Non-redundant predicted proteins produced by the de novo transcriptome assembly of Malus × domestica were annotated using BLASTP against the NR Protein database at NCBI with BLAST2GO v.4.1.9 default parameters 108 (BioBam Bioinformatics, Valencia, Spain). GO terms were converted into GOslim annotations using BLAST2GO, and for each cluster, all terms shared by at least three species and present in over 10% of a cluster's sequences were extracted to develop query datasets. In parallel, the same methods were used to extract GO terms from the total list of clusters to serve as reference datasets. Enrichment of GO terms in the shared plastid-targeted clusters was performed using BLAST2GO, with Fisher's Exact Test was used to calculate significance using a false discovery rate (FDR) of less than 0.05 as a minimum significance threshold 108 . Graphical analyses of enriched GO terms were produced in BLAST2GO.
Gene and phenotype identification. Full gene annotations include described gene names were downloaded for the TAIR10 Arabidopsis genome from Phytozome 87 . Gene names were referenced from the annotation file for Arabidopsis sequences present in conserved plastid-targeted protein clusters. Phenotype information for species-unique plastid-targeted proteins was referenced on NCBI 130 .

Data availability
The datasets supporting the conclusions of this article are included within the article and its Supplementary files. Perl scripts used in the organization of data and execution of protein clustering are available at Sourceforge under the Project Name "Plastid Variation" and the homepage https://sourceforge.net/p/plastid-variation. Operating System(s): Platform Independent.