Genome-Scale Characterization of Predicted Plastid-Targeted Proteomes in Higher Plants

Christian, Ryan W.; Hewitt, Seanna L.; Roalson, Eric H.; Dhingra, Amit

doi:10.1038/s41598-020-64670-5

Article
Open access
Published: 19 May 2020

Ryan W. Christian^1,2,
Seanna L. Hewitt^1,2,
Eric H. Roalson^2,3 &
Amit Dhingra ORCID: orcid.org/0000-0002-4464-2502^1,2

Scientific Reports volume 10, Article number: 8281 (2020)

Abstract

Plastids are morphologically and functionally diverse organelles that are dependent on nuclear-encoded, plastid-targeted proteins for all biochemical and regulatory functions. However, how plastid proteomes vary temporally, spatially, and taxonomically has been historically difficult to analyze at a genome-wide scale using experimental methods. A bioinformatics workflow was developed and evaluated using a combination of fast and user-friendly subcellular prediction programs to maximize performance and accuracy for chloroplast transit peptides, and the technique was demonstrated on the predicted proteomes of 15 sequenced plant genomes. Gene family grouping was then performed in parallel using modified approaches of reciprocal best BLAST hits (RBH) and UCLUST. A total of 628 protein families were found to have conserved plastid targeting across angiosperm species using RBH, and 828 using UCLUST. However, thousands of clusters were also detected where only one species had predicted plastid targeting, most notably in Panicum virgatum, which had 1,458 proteins with species-unique targeting. An average of 45% overlap was found in plastid-targeted protein-coding gene families compared with Arabidopsis, but an additional 20% of proteins matched against the full Arabidopsis proteome, indicating a unique evolution of plastid targeting. Neofunctionalization through subcellular relocalization is known to impart novel biological functions but has not been described before on a genome-wide scale for the plastid proteome. Further work to correlate these predicted novel plastid-targeted proteins to transcript abundance and high-throughput proteomics will uncover unique aspects of plastid biology and shed light on how the plastid proteome has evolved to influence plastid morphology and biochemistry.

Introduction

Plastids represent biochemically and morphologically complex organelles and can change both form and function drastically in response to developmental and environmental cues. A vestigial but functional genome of 120–160 kb harboring ~90 protein-coding genes is present in the plastids of photosynthetic higher plants¹. However, the total chloroplast proteome conservatively contains 2,000–3,500 proteins as reported in Arabidopsis^2,3,4, while as many as 4,875 plastid-targeted proteins are estimated in eSLDB⁵, and 5,136 by the Chloroplast 2010 project^6,7,8. Fewer than 900 of 4,500 genes horizontally transferred from the ancestral cyanobacterium are predicted to be retargeted to the plastid in vivo⁹.

There seems to be a difference between the composition of plastid-targeted proteomes in dicots and monocots. Only 21% of plastid-targeted rice proteins have a predicted homolog in the predicted Arabidopsis plastid proteome, and in reciprocal comparison the number is 38%². A similar result was obtained in a comparison of six crop plants against Arabidopsis, in which an average of 51.0% of the predicted plastid proteome of each species matched to the Arabidopsis predicted plastid proteome, while 67.5% matched against the full Arabidopsis proteome¹⁰. Thus, the plastid pan-proteome is extremely diverse and is composed of unique proteins at the species level. Furthermore, as the number of conserved sequences across all the genomes analyzed closely mirrors the number of genes of cyanobacterial origin, the non-conserved plastid-targeted protein-coding genes most likely evolved from eukaryotic sequences. The variability in the predicted plastid proteome mirrors the observable diversity in plastid function and ultrastructure in different species or under different environmental and developmental conditions^2,10,11,12,13. The diversity of plastid proteomes is evident even within the same plastid morphotype: the pigment-storing chromoplast alone has at least four described ultrastructural phenotypes across various species with unique sub-organellar membrane structures that can occur either singly or mixed within individual plastids¹⁴. Morphological differences in plastid shape and ultrastructure are noted even in genetically similar cultivars of the same species. Both chloroplasts and chromoplasts of developing apple peel differ significantly from tomato, which is used as a model reference for chromoplast differentiation in fruits^15,16. Variation has also been documented between the apple cultivars and the epidermal and collenchymal plastids¹¹.

The observed phenotypic diversity of plastids could be explained by three potential molecular factors: (1) Differences in the expression of genes controlling the rate and total amount of protein accumulation or import. This aspect could lead to unique phenotypes without necessarily changing the subset of plastid-targeted proteins. (2) Mutations within a shared group of plastid-targeted proteins could lead to neofunctionalization. (3) Finally, gain or loss of transit peptides causing subcellular mistargeting could alter the total pool of plastid-targeted proteins.

These factors are not mutually exclusive, and examples of each mechanism are known. Gene expression differences, possibly caused by epigenetic DNA methylation patterns, are responsible for differential protein accumulation in mesophyll and bundle sheath cells of C4 plants, illustrating the first point^17,18,19,20. In support of the second mechanism, point mutations in the active site of plastid-targeted limonene synthase change the abundance and distribution of different monoterpenoid end products in bacterial expression systems²¹. As an additional example, transplastomic expression of a delta-9 desaturase gene causes changes in fatty acid concentrations and levels of unsaturation, cold tolerance, leaf senescence, and seed yield²². While it is challenging to address the neofunctionalization of plastid-targeted proteins via mutation without detailed reverse genetics experiments, the other mechanisms can be evaluated with high-throughput sequencing and bioinformatics.

High-throughput proteomics using mass spectrometry (MS) has been an important means of surveying organellar proteomes and comprises the majority of current plastid proteome evidence. However, these techniques have historically been limited to the chloroplast morphotype and a restricted number of plant species. Excellent databases for high-throughput plastid proteomes based largely on mass spectrometry are accessible at AT_CHLORO²³, PPDB²⁴, SUBA4²⁵, and CROPPAL²⁶. However, caution should be exercised in interpreting these datasets because MS is susceptible to high false positive errors due to contamination during plastid isolation, liberal mass tolerance, and errors in peptide mapping, among other problems^27,28,29. While the use of reference genomes and transcriptomes can help overcome peptide mapping issues, other technical issues are more difficult to resolve. Use of fluorescent protein chimeras (e.g., green fluorescent protein, GFP), though lower-throughput, typically has higher biological accuracy. Using these, localization of low-abundance proteins, as well as proteins from species lacking robust plastid isolation methods, can be evaluated with higher efficiency. However, GFP techniques are not immune to experimental error either. Since the sequence of the mature protein partially influences localization (e.g.,^30,31,32), GFP fused to the native protein may alter localization in some cases. Furthermore, dual-targeted mitochondrial/chloroplast proteins can be mislocalized in GFP assays³³. Alternative transcripts or alternative protein products may also produce differential subcellular localization that is either not captured in GFP assays or gives ambiguous results. Given these experimental limitations, a robust bioinformatics workflow could enable rapid and cost-effective assessment of plastid proteomes with somewhat comparable accuracy.
Though wet lab validation is still necessary, these datasets could narrow the focus to smaller subsets of proteins of interest which could be more manageably targeted for wet lab validation depending on the biological question being asked.

The semi-conserved and sometimes ambiguous nature of chloroplast transit peptides makes in silico predictions challenging. Plastid transit peptides, as with other signal peptides, are well-known to be more variable than downstream protein sequence but more conserved than noncoding sequence. Yet, patterns of loose conservation at the amino acid level if not at the sequence level reveal multiple subgroups of transit peptides^34,35,36,37. However, sequence- and annotation-based approaches have yielded results with significant accuracy. Protein sequence-based prediction uses the amino acid content or the presence of conserved motifs in the peptide to make predictions. Use of the amino acid content alone, such as in the tool PCLR, is enough to predict many plastid-targeted proteins³⁸. More complex sequence-based tools either identify conserved motifs, as in iPSORT³⁹ and WoLF PSORT⁴⁰, or use a sliding-window search algorithm, as in Localizer⁴¹, and make predictions based on the sum of prediction vectors to determine transit peptide similarity. Finally, tools that use neural networks such as ChloroP⁴², TargetP^43,44, Predotar⁴⁵, PredSL⁴⁶, and Protein Prowler⁴⁷ use multiple layers of nodes to identify the best-scoring localization. In contrast, annotation-based methods such as CLPFD⁴⁸ and EpiLoc⁴⁹, or simple text-based methods based on GO annotations⁵⁰, use homology to proteins with known localization to designate subcellular predictions. While these methods offer advantages over sequence-based methods for proteins with annotated homologs, they perform poorly for novel proteins⁵¹. Hybrid approaches including MultiLoc2⁵², Sherloc2⁵³, Y-Loc⁵⁴, and Plant-mPLoc⁵⁵ combine sequence- and annotation-based methods in an attempt to overcome this limitation. Unfortunately, the homology component of hybrid approaches is weighted more heavily, which can lead to the false prediction of proteins with transit peptide variation or for proteins with shared domains.
Both high-throughput proteomics and bioinformatics approaches consistently indicate that the plastid proteome content is highly dynamic and likely has significant variability across the plant kingdom. With newer methods, ever-growing genomic resources, and availability of better gene annotation methods, previously reported estimates of conserved and non-conserved sets of the plastid proteome warrant an update.

This study evaluated the hypothesis that bioinformatics methods could achieve similar accuracy to experimental methods by comprehensively testing previously published subcellular prediction algorithms both alone and in combination. A specific combination of methods was found to be most efficient, which was then used to globally predict nuclear-encoded plastid-targeted proteins for fifteen higher plant species including eight eudicots, six monocots, and Amborella trichopoda, an early diverging species of the angiosperm clade. Two parallel approaches, Reciprocal-Best BLAST Hit (RBH) and UCLUST⁵⁶, were used to perform clustering, and the sub-cellular localization prediction for each cluster was analyzed to identify conserved, semi-conserved, and non-conserved plastid-targeted proteins. This approach also evaluated the hypothesis that a relative minority of plastid-targeted protein-coding genes are conserved among all species. It was found that natural selection and environmental influence have shaped the development of species-specific plastid proteomes.

Results and Discussion

Identification of optimal subcellular prediction workflows

To test the hypothesis that a bioinformatics workflow could reach parity with experimental methodology, the accuracy of six subcellular prediction algorithms, TargetP⁴³, WoLF PSORT⁴⁰, PredSL⁴⁶, Localizer⁴¹, MultiLoc2⁵², and PCLR³⁸, was first evaluated using data from the original publications. Sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) were evaluated for each program as it related to the prediction of plastid-targeted proteins (Table 1). Sensitivity, specificity, and MCC in TargetP were found to exactly match the values reported by Emanuelsson et al.^43,44, and while minor differences were found for MultiLoc and PredSL, these discrepancies likely represent rounding errors. Unexpectedly, significant differences were found for PCLR and Localizer: in PCLR, sensitivity was found to be 52.1%, which was about 5% lower than what was reported³⁸. In Localizer, calculated specificity was 78.9%, nearly 17 percentage points lower than the 95.7% reported⁴¹. In both cases, all other performance statistics were identical or nearly identical, so it is likely that the discrepancies in Localizer and PCLR represent either miscalculations or quality of the transcriptional data used for analysis in the original publications.

Table 1 Self-Reported Performance of Six Algorithms on Prediction of Plastid-Targeted Proteins.

Next, cross-validation of subcellular prediction programs was performed against proteins with experimentally-determined subcellular localization retrieved from AT_CHLORO²³, PPDB^24,57, CropPAL and CropPAL2²⁶ and Suba4^25,58,59,60, resulting in 42,761 nonredundant sequences including 32,450 proteins validated by mass spectrometry (MS) and 3,722 validated by GFP. Most prediction algorithms were found to have lower performance against biological data than was reported in the original publications, as shown in Table 2 and Fig. 1. However, substantial differences were observed based on the method of experimental validation. On average among the six algorithms, sensitivity was 15.7% higher in the GFP-validated dataset while no significant change in specificity was found; this difference resulted in 10% higher overall accuracy and an increase of 0.159 in MCC for GFP-validated proteins. By further narrowing focus to a dataset of proteins validated by both methods, sensitivity increased by an additional 7.6%, and specificity increased 2.5%, on average. Due to the previously reported high false positive rates associated with shotgun proteomics of organellar proteomes^27,28, program performance was expected to be much higher for GFP-validated proteins. While the dataset containing proteins experimentally validated by both GFP and mass spectrometry showed the highest apparent performance for the six subcellular prediction algorithms, and is likely closer to the biological accuracy of these programs, it contains roughly a third as many proteins as the GFP-validated dataset and is heavily biased by Arabidopsis sequences. Therefore, remaining comparisons focused on the GFP-validated dataset. Similarly, MCC was used as the primary measure of biological accuracy of in silico approaches to avoid problems due to drastically different dataset sizes.

Table 2 Review of Algorithms using modern curated datasets (combined).

Overall, the highest-performing program in terms of MCC was Localizer, followed by MultiLoc2-HR, TargetP, PCLR, PredSL, WoLF PSORT, and MultiLoc2-LR. Of these, PredSL and MultiLoc2-LR performed poorly with GFP-validated proteins compared to the original reports, while other programs decreased marginally or performed similarly to the published MCC. Among the six programs that were evaluated, Localizer had the highest performance regardless of the experimental method used for validation, which is surprising since it is a simpler tool than annotation-based methods which have been at the forefront of subcellular prediction methods recently. Part of Localizer’s increased accuracy may be due to its unique capacity to predict dual-targeted mitochondrial/chloroplast proteins. Over 200 dual-localized proteins have been described in Arabidopsis⁶¹ and over 500 are predicted to have ambiguous transit peptides⁶². Increased accuracy in the prediction of these sequences in Localizer could alone account for a portion of its higher performance. After Localizer, MultiLoc2 had the next-highest MCC and also had the highest specificity of any program, at 83% in GFP-validated proteins. MultiLoc is a hybrid method combining annotation and sequence analysis, so these findings support that the use of hybrid methods yields robust biological specificity. However, MultiLoc also had the worst sensitivity of any program, correctly predicting only 50% of bona fide plastid-targeted proteins validated by GFP or 31% of sequences validated by either GFP or mass spectrometry. TargetP, which has historically been the most popular subcellular prediction program for plants since its introduction, was found to perform at lower accuracy than earlier estimates: even when using the more conservative GFP-validated data, specificity was only 59% and sensitivity was 67%. Previous experiments using high-throughput shotgun proteomics have reported that the sensitivity of TargetP is as low as 62%^3,63,64,65.
Use of strictly-curated data improves the apparent sensitivity up to 86%, but false positive rates are still problematic as a specificity of about 65% is observed⁶⁶. The results presented here suggest that the biological accuracy of TargetP is somewhat closer to the initial estimates on non-curated data. PredSL, PCLR, and WoLF-PSORT were the lowest-ranked programs by MCC for prediction of plastid-targeted proteins, in that order, but typically had higher sensitivity than Localizer or MultiLoc2.

Differences in the amino acid composition of transit peptides are observable between rice and Arabidopsis, which have an overrepresentation of alanine and serine, respectively⁶⁶. Therefore, differences in the prediction of monocot or eudicot sequences were assessed, and different programs displayed significant bias (Table 3). PCLR was the most drastically affected, with an MCC bias of +0.091 in monocots, representing a roughly 20% increase compared with eudicots. This finding is somewhat unsurprising because PCLR is the only program which uses sequence composition alone to make predictions and is, therefore, more susceptible to bias than motif- or annotation-based methods. TargetP was the only other tool that favored monocots, with an increase of 0.055 (+10.2%) in MCC. A marginal difference between monocot and eudicot prediction was observed when Localizer was used, which differed by only 0.008 in MCC, slightly favoring eudicots. Eudicot sequences were favored in the other prediction programs, with MCC higher by between 0.043 (+10%) in WoLF PSORT and 0.066 (+14.9%) in PredSL. To the best of our knowledge, this is the first study to report this type of error or bias for in silico prediction methods. Some differences have also been described for the proposed subunits of the TIC translocon in grasses, which could result in coevolution of the transit peptide sequence composition^67,68,69. Choice of training and cross-validated datasets could significantly sway the predictions of sequence-based methods, while overrepresentation or prioritization of sequences for Arabidopsis and thereby eudicots could introduce bias to annotation-based methods. Although these species-specific differences are smaller than differences observed for sequences validated by mass spectrometry compared with GFP, they are still noteworthy and have consequences for whole-genome prediction.
In contrast, WoLF-PSORT and Localizer were found to have insignificant if any bias, making them attractive both as standalone programs or in combinatorial approaches where they could mask biases of other programs.

Table 3 Performance of prediction algorithms against GFP-validated proteins from monocots and eudicots.

Combinatorial workflow outperforms single programs

Use of multiple prediction algorithms in combination is a powerful strategy to combine the strengths and overcome the limitations of single programs. Combinatorial approaches have been used to improve the accuracy of predictions in whole-genome analyses (e.g.,²) or to curate mass spectrometry data (e.g.,^70,71,72,73). Additionally, a combinatorial workflow using 22 prediction algorithms and four experimental techniques is used in the SUBAcon algorithm implemented for the SUBA4 database of Arabidopsis proteins which reportedly yields up to 97.5% accuracy for chloroplast localization and 90% for other compartments^25,60. While SUBAcon does not strictly require experimental data to perform predictions, available evidence weighs heavily on the final prediction and contributes to the reported accuracy. Even if experimental evidence were to be ignored, the use of 22 separate subcellular prediction algorithms is not feasible for individual researchers or application to enormous datasets. Therefore, a bioinformatics-based workflow that can work efficiently would be desirable.

Calculations were performed for each possible combination of subcellular prediction algorithms and for all possible acceptable thresholds for each combination as applied to GFP-validated proteins. For example, for the combination of TargetP, PredSL, and Localizer, three thresholds were tested in which one, two, or all three programs needed to predict plastid localization to consider that protein as having a plastid transit peptide. To simplify analyses, the poorly-performing WoLF PSORT was removed from consideration (results including WoLF PSORT and datasets including MS-validated proteins are available in Supplementary File 1). In total, 80 unique workflows including the five remaining standalone program workflows were evaluated against GFP-validated proteins, the results of which are graphically summarized in Fig. 1, and numerically ranked by MCC in Table 4. Unequivocally, the results demonstrate that combinations of programs tend to outperform single programs for GFP-validated data: among the 25 workflows with the highest MCC, 23 were combinatorial approaches, while the standalone Localizer ranked tenth and MultiLoc2-HR 22nd. Localizer was not only the best-performing standalone program but was also overrepresented in combinatorial workflows: except the standalone MultiLoc2-HR workflow, Localizer appeared in all 25 top-performing workflows. It is interesting to note that combinations that rank higher tend to combine programs with high sensitivity with counterparts that have lower sensitivity but higher specificity, thus correcting for each other’s deficiencies. Specifically, most of the combinations with the highest MCC and ACC tend to include Localizer most often, followed by MultiLoc2, TargetP, PCLR, and lastly PredSL. The ranking of Localizer is unsurprising given that its relatively balanced and high sensitivity and specificity are unparalleled by any of the other programs.
However, MultiLoc2’s extremely high specificity makes it a valuable component of many workflows despite its low sensitivity. The best performing workflow used TargetP, Localizer, and MultiLoc2 and required two of the three programs to predict plastid targeting to define a sequence as containing a plastid transit peptide; a specificity of 78.5%, a sensitivity of 64.6%, and an MCC of 0.659 were achieved with this approach. In comparison to TargetP alone, a nearly 20% increase in specificity was observed with no loss in sensitivity. However, as the annotation-based functions of MultiLoc2 make it difficult to run on extensive datasets, an alternative workflow using a “2 of 2” consensus approach for TargetP and Localizer was found which ranked 2nd and achieved a marginally higher specificity of 80.7%. Furthermore, comparing the accuracy of the best workflows to Table 2 and to prior evaluations of experimental methodology (e.g.,⁶⁶) supported the hypothesis that bioinformatics methods could reach parity with mass spectrometry in characterizing the plastid proteome. Due to the increased simplicity and comparable performance of the TargetP/Localizer consensus approach, this workflow was selected for subsequent genome-scale prediction of plastid-targeted proteins.
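
The workflow search described above can be sketched in a few lines of Python. The sketch assumes that a threshold of k means "at least k of the selected programs predict plastid targeting" and that MultiLoc2 is counted as a single program; the function names and the dictionary layout of per-program calls are hypothetical. Under these assumptions, enumerating every subset of the five remaining programs with every threshold gives 80 workflows, which agrees with the number reported above.

```python
from itertools import combinations

# The five programs remaining after WoLF PSORT was dropped.
PROGRAMS = ["TargetP", "PredSL", "Localizer", "MultiLoc2", "PCLR"]


def enumerate_workflows(programs):
    """Yield (subset, k) pairs: every non-empty subset with every threshold k."""
    for size in range(1, len(programs) + 1):
        for subset in combinations(programs, size):
            for k in range(1, size + 1):
                yield subset, k


def consensus_call(predictions, subset, k):
    """Return True when at least k programs in `subset` predict plastid targeting.

    `predictions` maps program name -> bool for a single protein (assumed layout).
    """
    votes = sum(1 for name in subset if predictions.get(name, False))
    return votes >= k


workflows = list(enumerate_workflows(PROGRAMS))
print(len(workflows))  # 80

# Hypothetical per-program calls for one protein.
calls = {"TargetP": True, "Localizer": False, "MultiLoc2": True}
print(consensus_call(calls, ("TargetP", "Localizer", "MultiLoc2"), 2))  # True
print(consensus_call(calls, ("TargetP", "Localizer"), 2))  # False
```

Each workflow can then be scored against a validated dataset by computing MCC over its consensus calls and ranking the results.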

Table 4 Best combinatorial prediction approaches ranked by Matthews Correlation Coefficient (MCC).

Predicted plastid proteome correlates with genome size

As a demonstration of the utility of the Localizer and TargetP workflow, subcellular prediction was performed for the whole proteomes of fifteen phylogenetically diverse species. Six monocot species, including Anthurium amnicola, Brachypodium distachyon, Oryza sativa, Panicum virgatum, Setaria italica, and Sorghum bicolor and eight eudicots, including Arabidopsis thaliana, Fragaria vesca, Glycine max, Malus × domestica, Populus trichocarpa, Prunus persica, Solanum lycopersicum, and Vitis vinifera were chosen. Additionally, Amborella trichopoda, a species which diverged from the rest of the angiosperms prior to the divergence of monocots and eudicots, was also incorporated into the comparative analysis. Complete information including data version numbers, proteome sizes, and prediction of plastid-targeted proteins by Localizer and TargetP is summarized in Table 5. In Arabidopsis, 2,826 proteins were predicted to be plastid-targeted, representing 8.8% of all protein isoforms. This finding is in agreement with the conservative estimates of the Arabidopsis plastid proteome^2,4,74. Similar percentages were calculated in other species but varied from a low of 6.4% in tomato to a high of 9.3% in A. amnicola. As expected, the absolute number of predicted plastid-targeted protein-coding genes showed a high correlation with the genome size (R² = 0.965) (Fig. 2). This result suggests that an increase in genome size and gene content yields a similar increase in the total number of plastid-targeted proteins. Over 10,000 of the Arabidopsis sequences have experimentally-determined localization, and comparing predictions for these sequences revealed an apparent sensitivity of 55.6%, specificity of 89.8%, accuracy of 83.6%, and MCC of 0.614. Sensitivity is somewhat low in this estimation due to the use of MS data, which includes many false positives, but the high specificity suggests good prediction accuracy.
With the combination of the high correlation with experimentally-validated proteins and the lack of monocot/eudicot bias imparted by Localizer, it is expected that similar levels of accuracy were achieved for the entire set of species analyzed in this study.
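
For readers who want to reproduce a correlation of this kind, the following Python sketch computes the squared Pearson correlation, which equals R² for a simple linear fit. It assumes that the reported R² is of this type; the input values are placeholders chosen for easy arithmetic and are not the values from Table 5 or Fig. 2.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Placeholder values only: x could stand for genome or proteome size per
# species, y for the number of predicted plastid-targeted proteins.
sizes = [100, 200, 300, 400, 500]
plastid_counts = [2000, 4000, 5000, 4000, 5000]
print(round(r_squared(sizes, plastid_counts), 3))  # 0.6
```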

Table 5 Targeting Prediction for Selected Species.

Clustering of gene families

Although the plastid is highly dependent on proteins imported from the nucleus for normal viability and function, the size and diversity of the plastid proteome across the plant kingdom remain poorly understood. The hypothesis that the plastid proteome is diverse and each species has a unique set of plastid-targeted proteins was examined by grouping sequences into homologous protein groups using two parallel clustering methods (Fig. 3). Clustering method has a significant impact on the size and accuracy of the resulting clusters, and therefore on the number and relevance of predictions. Reciprocal best BLAST hits (RBH) using all-vs.-all BLAST comparisons of whole proteomes are a standard proxy for orthology in comparative genomics, although they are susceptible to inclusion of weakly homologous paralogs. BLAST-based approaches combined with Markov clustering or similar methods to remove paralogs are used in commonly-cited methods such as InParanoid⁹⁵, OrthoMCL⁹⁶, and COG^97,98. However, these methods can be biased toward single-copy genes or highly conserved families, which can be problematic for polyploid genomes where many-to-many gene relationships are common^99,100. For instance, the popular OrthoMCL fails to detect many homologous proteins with conserved expression patterns, and therefore with likely conserved functions, between rice and Arabidopsis^101,102. In contrast, more straightforward RBH methods often outperform more complicated algorithms on eukaryotic genomes¹⁰³.

A simplified RBH approach, allowing many-to-many relationships, was determined to be most appropriate for this analysis to avoid fracture of gene families with paralogs or co-orthologs. Initial homologous relationships were identified using pairwise BLAST-P comparisons of two species; only sequences which are mutually the best BLAST hits for each other were utilized. Similar methods have used 40% as an appropriate identity and coverage threshold for orthologous relationships^10,104,105,106. Therefore, 40% was used as the initial threshold of homology. Initial clustering generated many small clusters, so a supplemental method for expansion of clusters, using reciprocal better BLAST hits from each species’ proteome BLASTed against itself, was tested (Supplementary Figure 2–1). A 90% threshold was determined to be optimal: clusters with fewer species decreased significantly in number, while clusters containing a majority of species remained stable or increased. In contrast, application of between 60 and 80% expansion thresholds caused the liberal merging of clusters into extremely large clusters representing thousands of individual sequences. Additionally, GO term similarity was assessed within clusters at each population size based on the number of species in the cluster and was found to increase slightly for clusters containing few species when using a 90% expansion threshold, while more massive clusters experienced no change or slight decreases.
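
The initial pairwise step can be illustrated with a short Python sketch. It covers only strict mutual best hits between two species at the 40% identity and coverage thresholds; the many-to-many handling and the 90% within-species expansion described above are not shown. The tuple layout (query, subject, percent identity, percent coverage, bit score), the choice of bit score to rank hits, and the protein names are assumptions made for illustration.

```python
def best_hits(hits, min_identity=40.0, min_coverage=40.0):
    """Map each query to its highest-scoring subject among hits passing both thresholds."""
    best = {}
    for query, subject, identity, coverage, bitscore in hits:
        if identity < min_identity or coverage < min_coverage:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_identity=40.0, min_coverage=40.0):
    """Return sorted (a, b) pairs that are each other's best hit in both directions."""
    best_ab = best_hits(a_vs_b, min_identity, min_coverage)
    best_ba = best_hits(b_vs_a, min_identity, min_coverage)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical BLAST-P results between species A and species B.
a_vs_b = [
    ("A1", "B1", 85.0, 95.0, 300.0),
    ("A1", "B2", 60.0, 80.0, 150.0),
    ("A2", "B2", 55.0, 70.0, 120.0),
    ("A3", "B1", 35.0, 90.0, 90.0),  # fails the 40% identity threshold
]
b_vs_a = [
    ("B1", "A1", 85.0, 95.0, 300.0),
    ("B2", "A1", 60.0, 80.0, 150.0),
    ("B2", "A2", 55.0, 70.0, 120.0),
]
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1')]
```

In this toy case A2 is left unpaired because the best hit of B2 is A1, which shows how strict one-to-one RBH can fragment families that contain paralogs.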

An alternative approach called UCLUST was implemented to complement the RBH method with a faster and more efficient technique because its semi-global algorithm detects homology in a fraction of the time required for BLAST and becomes much more efficient on enormous datasets. Initial clusters were constructed at a 40% identity and 40% coverage threshold similar to the RBH approach. However, initial clustering produced smaller clusters and resulted in cluster fragmentation. Therefore, modifications were implemented to expand initial clusters by randomly selecting sequences out of each initial cluster and iterating the UCLUST search at more stringent conditions using the selected sequences as new centroids (Supplementary Figure 2–2). Cluster expansion significantly increased the number of clusters with many species, which largely came from the drastic reduction of the number of single-species clusters. As with RBH, a 90% expansion threshold was found to be optimal and increased the number of clusters sharing 14–15 species roughly 4-fold, while lower thresholds resulted in the frequent grouping of nonhomologous sequences. Comparison of GO similarity for clusters containing multiple species showed that similarity increased slightly or remained stable for nearly all cluster sizes in the 90% expansion threshold compared to the initial, non-expanded UCLUST analysis. The number of iterations required to fully expand cluster space in UCLUST was also examined, and it was found that most clusters were completely expanded by ten iterations, while further iterations yielded diminishing returns (Supplementary Figure 2–3). A total of 100 iterations were performed to avoid problems with the randomization of centroid sequences.
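
The general idea behind greedy centroid-based clustering can be shown with a toy Python sketch. This is not the UCLUST implementation and does not call USEARCH: it uses the similarity ratio from Python's difflib as a stand-in for alignment identity, checks centroids in the order they were created, and omits both the coverage criterion and the iterative expansion step described above. The sequences and names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Stand-in for alignment identity; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def greedy_centroid_clusters(records, threshold=0.4):
    """Assign each sequence to the first centroid it matches, else make it a centroid.

    records: iterable of (sequence_id, sequence) pairs, processed in the given order.
    Returns a dict mapping centroid id -> list of member ids (centroid first).
    """
    centroids = []  # (centroid_id, centroid_sequence)
    clusters = {}
    for seq_id, seq in records:
        for centroid_id, centroid_seq in centroids:
            if similarity(seq, centroid_seq) >= threshold:
                clusters[centroid_id].append(seq_id)
                break
        else:
            # No existing centroid was similar enough: start a new cluster.
            centroids.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical toy sequences.
records = [
    ("p1", "MKTAYIAKQR"),
    ("p2", "MKTAYIAKQL"),  # differs from p1 at the last residue
    ("p3", "GGGGSSSSPP"),  # shares no residues with p1
]
print(greedy_centroid_clusters(records))  # {'p1': ['p1', 'p2'], 'p3': ['p3']}
```

Because membership depends on which sequences happen to become centroids, results from this kind of method are order-dependent, which is consistent with the need for repeated iterations with different centroid choices noted above.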

Application of the optimal clustering methods to the proteomes of the species chosen generated 170,877 clusters using RBH (Table 6) and 103,501 clusters using UCLUST (Table 7). Nearly all the additional clusters in RBH were from single-species clusters or singleton sequences (data not shown): 150,067 of the RBH clusters (87.82%) were single-species clusters of which 134,319 were singleton sequences, while UCLUST detected 74,059 single-species clusters (71.55%) including 45,033 singletons. Some of these may be orphan genes, but they are more likely to be prediction and annotation errors or pseudogenes because the lack of homology implies lack of conserved function or extreme mutation rates that are more likely to occur in non-coding sequences. A total of 20,810 and 29,442 clusters in the RBH and UCLUST approaches, respectively, contained sequences from multiple species; although they represented a minority of clusters, they contained the majority of initial sequences. A bimodal distribution was observed in both methods in which two groups of clusters, the first containing 14–15 of the species and the second containing just 2–3 species, represented the majority of the clusters (Fig. 3A). Comparatively fewer clusters contained between 4 and 13 species. Of the conserved clusters containing all 15 species, RBH detected 4,090 clusters, while UCLUST yielded 3,295. GO similarity between UCLUST and RBH was remarkably consistent, but UCLUST had somewhat better scores for conserved clusters containing plastid-targeted sequences from all species and lower scores for semi-conserved or non-conserved clusters containing few species (Fig. 4B). Across both methods, GO similarity decreased with increasing cluster size. While the merging of nonhomologous sequences may be partially responsible for this decrease, the annotation methods and parameters are not identical for the species used in this study, which artificially decreases the apparent similarity score regardless of clustering specificity.

Table 6 RBH Clustering Results by Species.


Table 7 UCLUST Clustering Results by Species.


Identification of gene families with conserved plastid targeting

Genomes of endosymbiotic bacteria contain 1,500 proteins on average, and plastids are likely to contain similar numbers when accounting for both the plastid genome and core nuclear-encoded plastid-targeted protein-coding genes¹⁰⁷. To determine the number of gene families with conserved plastid localization, clusters were selected that contained at least 13 species and in which every species contributed at least one predicted plastid-targeted sequence or at least four non-plastid-targeted sequences. These parameters were chosen to account for assembly and annotation errors and to correct for the 39% false negative prediction rate for bona fide plastid-targeted proteins, which could eliminate many truly conserved clusters. There is a nearly 20% chance that at least one of four random sequences with non-plastid localization prediction is a false negative, but sequences that already share homology to predicted plastid-targeted sequences have a significantly higher likelihood of being false negatives. A workflow diagram representing cluster detection, filtering, processing, and categorization is shown in Fig. 4. Applying this workflow, 628 conserved protein clusters were found in RBH (Table 6, Fig. 5), while UCLUST detected 828 (Table 7, Fig. 6). Of these, 621 clusters in RBH and 817 in UCLUST also contain sequences from A. trichopoda, and all have several monocot and eudicot sequences, strongly indicating that these clusters represent the fundamental core plastid-targeted protein-coding gene families. Previous estimates predicted that 857–1020 sequences were shared between rice and Arabidopsis, while another report projected that between 289 and 737 proteins were shared among the chloroplast proteomes of seven plant species^2,10.
Identification of gene families with conserved chloroplast transit peptides is an essential output of this work, as in silico methods can quickly identify conserved plastid-targeted proteins that have failed to be detected by genetic screens due to embryo lethality, gene redundancy, or random chance. Several methods have validated these sequences as truly plastid-targeted and representative of conserved plastid-targeted protein-coding genes. First, Arabidopsis proteins with experimentally-validated localization were examined within the conserved clusters. A total of 84.2% (183 proteins) of predicted plastid-targeted Arabidopsis sequences in conserved RBH clusters were validated by GFP and 94.5% (1,054) were validated by MS. The same was true for 80.5% (154 proteins) and 92% (855 proteins) in conserved RBH and UCLUST clusters, respectively (Supplementary Files 3 and 4). While these methods have yielded good overall sensitivity, small errors at initial stages of clustering can compound in larger clusters and result in unrealistically high numbers of sequences. For RBH, an average of 113.9 sequences and median of 61 were present in conserved clusters while UCLUST produced an average of 125.9 sequences and median of 84. Most sequences in these clusters come from a small set of species: G. max, P. virgatum, P. trichocarpa, and V. vinifera each contributed an average of over 10 sequences each to clusters with shared plastid localization prediction, while M. × domestica contributed over 10 sequences on average in UCLUST (summarized in Supplementary Files 3 and 4). Significant gene duplication or inclusion of multiple gene isoforms especially in those species likely accounts for a portion of the larger cluster sizes, but more distant paralogous sequences which are less likely to share biological function are also likely to be common. 
Thus, the list of conserved clusters reported here is not meant to be definitive and final, but rather a general guide which will require phylogenetic and experimental validation. In cases where larger clusters contain multiple paralogs or non-homologous sequences, phylogenetic methods could resolve homology relationships with higher efficiency than the currently used RBH and UCLUST methods. However, the biological accuracy of the predicted plastid-targeted sequences within these clusters is still high.

Next, enrichment of gene ontology (GO) annotations was performed in conserved clusters by finding GO terms shared in at least three individual sequences and for over 10% of sequences. Terms were compared to annotations extracted using the same criteria for all the clusters of the respective clustering method and GO term enrichment was performed using BLAST2GO¹⁰⁸. Overall, 53 terms including 29 terms associated with the biological process ontology, 23 associated with the cellular component ontology, and one associated with the molecular function ontology were found for RBH (Table 8). In UCLUST, a total of 33 terms were found, including 15 associated with the biological process, 17 with the cellular component, and one with the molecular function ontology (Table 9). The most significantly enriched GO terms under the biological process ontology for both RBH and UCLUST methods were GO:0015979 (photosynthesis) and GO:0008152 (metabolic process), while a majority of the remaining highly enriched terms were associated with homeostatic processes (GO:0042592), cellular component organization (GO:0016043), single-organism biosynthetic processes (GO:0016043), generation of precursor metabolites (GO:0006091), and lipid metabolism (GO:0006629). In the RBH method, additional terms associated with amide, peptide, and organonitrogen compound biosynthesis and metabolism (GO:0043604, GO:0043603, GO:0043043, GO:0006518, GO:1901566, GO:1901564, GO:0044271, GO:0034641, GO:0006807) were enriched. UCLUST additionally had enriched GO terms associated with transport (GO:0006810), localization (GO:0051234, GO:0051179), and metabolism of carbohydrates (GO:0005975). Among cellular component ontologies, plastid (GO:0009536) was the most overrepresented term in both methods. Other highly overrepresented cellular component terms included organelle (GO:0043226), thylakoid (GO:0009579), chloroplast (GO:0009507), and associated terms. In RBH methods, significant enrichment of ribonucleoprotein complexes (GO:1990904, GO:0030529) was found.
For the molecular function ontology, structural molecule activity (GO:0005198) was enriched in RBH and catalytic activity (GO:0003824) in UCLUST. These GO terms were further compared to the results of a previous study involving intergeneric analysis that described 737 conserved plastid-targeted proteins¹⁰. In this study, 42% of enriched terms found using UCLUST overlapped with the methods reported previously¹⁰. RBH methods were somewhat lower because more enriched terms were found, but still overlapped with the previously published dataset by 24%. These results are remarkably similar given that only GO terms from Arabidopsis had been examined previously and that different methods of GO enrichment had been used in those studies. The final and perhaps the most important test of the biological significance of conserved plastid-targeted clusters is whether they contain proteins expected to be present in plastids of all higher plants. Gene names were retrieved from TAIR10 for all Arabidopsis sequences in conserved clusters, and many of the most prominent plastid proteins were confirmed to be present in clusters for both RBH and UCLUST methods. The following is not intended to be an exhaustive list but is merely representative of the types of proteins detected in conserved plastid-targeted clusters; a complete list of annotations and gene names in RBH and UCLUST clusters is available in Supplementary Files 3 and 4. Among genes involved in primary photosynthesis, HCEF, LhcA1, LhcA2, LhcB1, LhcB2, LhcB3, Lhcb4, LPA1, LPA3, PPDK, and RbcS were detected in both methods, while LPA66 was found in RBH only. Photosystem subunits Psa-E, Psa-F, Psa-G, Psa-H, Psa-K, Psa-N, PsbP, Psa-O, PsbQ, PsbR, PsbS, PsbW, and PsbY were also found in both methods, while PsbT-N and PsbX were found only in RBH and PsbO was found only in UCLUST.
Among ribosomal proteins, Rps1, Rps9, Rpl4, Rpl11, and Rpl12 were detected by both techniques, while Rpl9 and Rpl15 were only found using RBH and Rpl10 was found only with UCLUST. Proteins involved in translocation and chaperone functions found by both methods included ClpB, ClpC, ClpD, ClpP, ClpR, FtsH, Hsp60, Hsp70, Hsp88, Hsp90, Hsp98, Cpn10, Cpn20, Cpn60, Vipp1, Alb3, Alb4, TatC, Tic20, Tic21, Tic40, Tic55, Tic110, Toc75, and Plsp1. The Sec translocase subunits SecA, Scy1, and Scy2 were uniquely found in RBH, while organellar oligopeptidase OOP was also found in UCLUST. Finally, genes associated with primary plastid metabolism (SBPase, TPT, FRUCT5, G6PD2, and G6PD), heme biosynthesis (GUN2, GUN5, HEMA, HEMB, HEMC, HY2, PORA, PORB, and PORC), and fatty acid synthesis (ACC2, FAB2, FAD7, FAD8, FATA, FATB, lipoxygenase) were found in core clusters.

Table 8 Enriched GO terms for Conserved Plastid-Targeted RBH Clusters.


Table 9 Enriched GO terms for Conserved Plastid-Targeted UCLUST Clusters.


Taken together, the good correlation of protein clusters with experimentally-validated sequences, the enrichment of expected annotation terms, and the presence of expected highly-abundant proteins or proteins critical to chloroplast biology suggest that both the RBH and UCLUST methods achieved good accuracy and sensitivity for genes with conserved chloroplast targeting which are likely critical in all photosynthetic plants for minimal chloroplast function. It is noteworthy that 194 clusters in RBH and 333 core clusters in UCLUST contain at least one Arabidopsis sequence but have no associated gene synonyms available (Supplementary Files 3 and 4). As the sensitivity for conserved plastid-targeted proteins was found to be very high overall, many of these 194–333 clusters with missing annotation information are likely biologically accurate, in which case they are excellent candidates for understanding hitherto uncharacterized aspects of chloroplast biology.

Analysis of semi-conserved and non-conserved plastid-targeted proteins

Semi-conserved plastid-targeted protein-coding gene families in which predicted plastid-targeting was found for two or more sequences only in monocots, only in eudicots, or uniquely in A. trichopoda were identified beginning with the most diverse clades. In each case, all clusters with predicted plastid-targeted sequences or at least four predicted non-plastid-targeted sequences from the outgroup species were removed. A total of 572 gene families with plastid-targeted sequence specific to monocots and 430 to eudicots were found using RBH methods (Table 6, Fig. 5), while UCLUST detected 1,054 and 885, respectively (Table 7, Fig. 6). Additionally, 82 clusters with Amborella-specific plastid targeting were found using RBH, and 195 were found with UCLUST. These findings indicate that gene families with semi-conserved plastid-targeting outnumber core clusters by 73% in RBH and more than 150% in UCLUST. Narrowing focus to the subclade and family level revealed that semi-conserved clusters are still abundant, indicating that significant plastid proteome variation is present across all taxonomic levels. It is plausible that some of the clusters with plastid-targeting specific to either monocots or eudicots have functionally related clusters in the reciprocal group but lack sufficient homology to cluster together. Such an occurrence seems unlikely in most cases because the clustering methods used here were relatively liberal, but isolated cases may still occur. In some cases, non-orthologous or chimeric genes could also functionally replace an otherwise conserved gene and lead to loss of orthologous sequences in particular species or taxonomic groups^109,110.
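The two percentages quoted above appear to follow from summing the monocot-specific, eudicot-specific, and Amborella-specific cluster counts and dividing by the number of conserved clusters for each method. That reading is an assumption rather than a stated procedure, but it reproduces the reported figures:

$$\begin{array}{lcl}{\rm{RBH:}} & & \frac{572+430+82}{628}=\frac{1084}{628}\approx 1.73\quad(\approx 73\%\ {\rm{more\ than\ the\ 628\ core\ clusters}})\\ {\rm{UCLUST:}} & & \frac{1054+885+195}{828}=\frac{2134}{828}\approx 2.58\quad(\approx 158\%\ {\rm{more\ than\ the\ 828\ core\ clusters}})\end{array}$$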

Finally, clusters with predicted plastid targeting only present in a single species were identified in RBH (Table 6, Fig. 5) and UCLUST (Table 7, Fig. 6). Singletons and clusters containing only a single species were discarded as these likely represent gene prediction errors. For example, predicted proteins in Malus which do not share homology with proteins in other species are typically poorly-supported by transcriptomics evidence: examination of over 300 such sequences revealed only one that had full coverage and was not a smaller fragment of a larger protein (data not shown). Since the chloroplast transit peptide is presumed to have arisen recently in each cluster, the term “nascent plastid-targeted proteins” (NPTPs) was coined to represent such proteins. Unsurprisingly, species with large and complex genomes possessed a more significant number of NPTPs: A. amnicola had the least, at just 52 in RBH and 97 in UCLUST, while P. virgatum had the most, with 682 NPTPs found in RBH and 1,458 in UCLUST. The predicted proteome of A. amnicola is based on transcriptomics data rather than genome-wide prediction, while P. virgatum has the largest genome and most extensive predicted proteome of the species in this analysis, so these trends are consistent with expectations.

Additionally, up to 728 proteins were uniquely targeted to the plastid in M. × domestica, and between 300 and 400 proteins had species-specific plastid transit peptides in B. distachyon, F. vesca, G. max, S. italica, and S. bicolor. Arabidopsis had some of the lowest estimates of NPTPs, with only 74 found in RBH and 166 in UCLUST. Species-unique plastid-targeted proteins had a moderately linear correlation with the total number of sequences in each species (R² = 0.73 in RBH and 0.72 in UCLUST, Fig. 7A), but the removal of the outlier P. virgatum resulted in nonlinear correlation (Fig. 7B). Consequently, extreme increases in genome size and complexity are hypothesized to create more opportunities for the evolution of novel transit peptides and diversification of the plastid proteome, but differences are subtler when the genomes being compared are closer in size. Previous literature (e.g.^111,112,113) has suggested that gene duplication is a prerequisite or at least greatly encourages neofunctionalization via novel subcellular targeting, and the generally linear correlation with proteome size suggests that this may indeed be the case. However, based on the data, the evolution of the plastid proteome is more likely to be driven by environmental adaptation and selection pressure¹¹⁴.

While transit peptide structure and sequence were expected to be conserved within each thus-identified cluster, searching for shared homology between transit peptides of different clusters was not performed. Without experimental data to support such identification, the motifs thus identified would be unreliable predictions, and it would be hard to state if the observed convergent evolution detected in novel transit peptides has any cause-effect relationship.

As with the conserved plastid-targeted clusters, the accuracy of targeting prediction in NPTPs was cross-validated against experimentally-validated proteins from Arabidopsis. For the RBH clusters, 75% (4 proteins) were validated to be true plastid proteins via GFP, and 53.8% (17 proteins) validated by MS. For UCLUST, 29.4% (17 proteins) were validated by GFP, and 41.4% validated by MS. Specificity was also very high: only 6.3% of 300 predicted non-plastid-targeted proteins in RBH-generated NPTP clusters were found to actually be plastid-targeted by GFP, while the rate in MS-validated proteins was 13.4% (967 proteins). UCLUST generated similar results, with false negative error rates of 3% (493 proteins) in GFP-validated data and 12.5% (1,369 proteins) for MS-validated data. The few false negatives in predicted NPTPs may be representative of ambiguous/intermediate sequences in clusters which are already predicted to be uniquely chloroplast-targeted in Arabidopsis and therefore represent missing links. More pertinently, the GFP estimates are likely more accurate due to the experimental specificity errors inherent in mass spectrometry, and the 3–6% error rates are within an acceptable range.

Overall, these data affirm that evolution of the plastid proteome is highly dynamic at the species-level. Compared to previous reports, somewhat reduced species-unique plastid-targeted proteins are reported here (e.g.,^2,10) due in part to the removal of singletons and single-species clusters. Homology to sequences in other species dramatically decreases the probability of pseudogenes and gene prediction errors. Remarkably, the monocot species had an average of 50–60% more species-unique plastid-targeted protein clusters than eudicot or Amborella counterparts. Even after removal of the outliers P. virgatum and A. amnicola, monocots still had 40% more plastid-targeted clusters than eudicots according to RBH methods, and over 80% more clusters using UCLUST. The reasons for this could be two-fold. First, the monocot species in this analysis have larger proteomes on average, increasing the overall likelihood for both de novo evolution of NPTPs and for retention of orphaned singleton/species-specific proteins. Secondly, monocots, and especially grasses, have been described to have many presence/absence variants (PAV’s) and copy number variants (CV’s) in their genomes. Pan-genome sequencing of B. distachyon revealed over 7,000 pan-genes that are not present in the reference genome, and an average of 9 Mb of sequence in each accession does not align to the reference genome¹¹⁵. Similar rates of PAV’s have been reported for cereal crops: only half of the pan-genome diversity of maize is present in the reference genome¹¹⁶, over 21,000 predicted wheat genes are not represented in the reference genome¹¹⁷, and 8,000 predicted rice genes are not represented in the Nipponbare reference genome¹¹⁸. In contrast, pan-genomes of Arabidopsis¹¹⁹ and tomato^120,121 describe variation primarily at the SNP and small insertion/deletion levels, although one report described that 14.9 Mb of the Columbia-0 genome was absent in one or more other accessions¹²². 
In Brassica oleracea, less than 20% of genes were affected by presence/absence variation¹²³. Somewhat higher variation is observed in legumes: 302 soybean lines including varieties, landraces, and wild accessions revealed 1,614 copy number variants and 6,388 segmental deletions, and 51.4% of gene families were dispensable¹²⁴ while in Medicago truncatula, 67% of annotated genes may be dispensable¹²⁵. It bears consideration that the pangenomes of the grasses are primarily within cultivated accessions and have already passed through a domestication filter which already significantly reduces genomic diversity, whereas the pangenomes of most of the eudicots include wild and landrace accessions. These trends suggest that PAV’s and CV’s are significant drivers of plastid proteome evolution, either by retention of orphaned genes or by de novo evolution of transit peptides in duplicated genes. Despite the smaller number of species-unique clusters, conserved plastid-targeted proteins are still outnumbered up to 25-fold by species-unique or semi-conserved proteins. If even a fraction of these sequences is accurate and expressed in vivo, each could impart novel biological functions because escape from the evolutionarily established biochemical and regulatory environment could impart a different function in a new subcellular environment without changing the functional sequence of the protein. Thus, each of these is an excellent candidate for further characterization to determine if unique phenotypes are created by relocalization to the plastid. Conversely, species-specific plastid-targeted protein-coding genes in model systems could yield misleading interpretations because the same phenotypes for those genes would not be observed in species where homologs do not have plastid-specific localization.
Such a situation is potentially problematic for the unique plastid-targeted proteins detected for Arabidopsis, B. distachyon, and rice because it is likely that some of these genes already have a described gene function that is being inaccurately ascribed to plants as a whole. Indeed, out of 113 Arabidopsis proteins with predicted species-specific plastid-targeting, 18 have a described phenotype, and 100 are cited in previous research reports (summarized in Supplementary File 5). In cases where the predicted localization divergence is validated, the mutant phenotypes for those sequences will have to be revised.

Conclusions

The evaluations conducted in this study support the hypothesis that a combination of subcellular localization prediction programs can accurately predict chloroplast transit peptides at a whole-genome scale in higher plants and can perform equally well for both monocots and eudicots. The best-performing method was then applied to predict chloroplast proteins globally for a diverse range of angiosperm species, and two methods were developed to cluster gene families: a slow but accurate reciprocal best-BLAST hit method and a fast, more liberal UCLUST method. Though results were not identical, UCLUST yielded comparable results while performing more efficiently. With the addition of more species, UCLUST could be a useful tool to overcome the inefficiency of BLAST-based methods. The consensus of both methods supported the hypothesis of extreme plastid proteome variability across the taxonomic space. Roughly 700 genes were shared between the chloroplast proteomes in all plant species, but these were vastly outnumbered by proteins with variable plastid targeting prediction. Most of these species- or clade-specific proteins have no known function for the plastid and are excellent candidates for further studies. Additionally, roughly a third of conserved plastid-targeted proteins have no known function and could be targeted for reverse genetics experiments in the future. Biological verification of these sequences remains a significant challenge. Even if good prediction accuracy was achieved, these sequences may be poorly expressed, expressed only in particular conditions, or nonfunctional. Incorporation of transcriptomics would provide significant evidence that these genes are at least expressed, and patterns of gene expression along with co-expression information may also reveal additional information about their function.
Experimental validation using mass spectrometry could also be used, but many proteins may have abundances below detection limits, and technical challenges also remain for the isolation of non-green plastids where they may be more abundant. The decreasing costs of gene synthesis make high-throughput fluorescence protein assays an attractive alternative. In addition to increased sensitivity and specificity compared with mass spectrometry, fluorescent protein assays could also be used to simultaneously validate whether the localization of species-unique proteins are truly different from their nearest predicted non-plastid-targeted homologs, and likewise may be able to provide better spatial resolution. Outer membrane proteins, lacking a classical transit peptide, are only currently predictable based on homology to the mature protein, and thus cannot be predicted de novo. Furthermore, prediction of localization within sub-compartments of the chloroplast remains a challenge. TargetP and other programs offer sub-compartment predictions, but their accuracy remains questionable, making improvement of experimental methods a necessity. The methods and results reported in this study will enable rapid, accurate and cost-effective identification of plastid-targeted proteomes in new plant species as their genomic information becomes available. These research findings are expected to provide a foundation for further research into unique plastid biology and to understand better how diversification of the organellar proteomes contributes to important agronomic, biochemical, culinary, or even aesthetic traits.

Methods

Cross-validation of in silico techniques

Test datasets for cross-comparison of subcellular prediction algorithms were retrieved from PPDB (2012 update; current as of this writing), AT_CHLORO (January 2015 update; current as of this writing)²³, Suba4 (30 June 2017 update; current as of this writing)²⁴, CropPAL version 58839ba²⁶, and CropPAL2 version 74866967²⁶. Headers which could not be referenced to the most up-to-date reference proteomes were discarded. For AT_CHLORO, Suba4, and PPDB databases, all genes located on the chloroplast and mitochondrial genomes were removed, and redundant headers were merged. Subsets of data including sequences confirmed by mass spectrometry, GFP fusion, either GFP or mass spectrometry, or both were extracted from each database by filtering for the keywords “Chloroplast” or “Plastid.” All ambiguous results containing experimental evidence for both plastids and at least one other subcellular fraction were removed.

Experimentally validated protein sequences were analyzed with TargetP v.1.1^43,44, WoLF PSORT Command Line Version 0.2⁴⁰, PredSL Web Server⁴⁶, Localizer v.1.0.2⁴¹, MultiLoc2 version 2-26-10-2009⁵², and PCLR update 2011-11-24 release 0.9³⁸. Additionally, NLStradamus v.1.8¹²⁶ was used as part of the Localizer algorithm, while Python v.2.7.5, LIBSVM v.2.8, BLAST v.2.2.30, and Interproscan v.5.25-64.0 were used as part of MultiLoc2. Results for each workflow were converted into binary classification and evaluated for Sensitivity (SE), Specificity (SP), Matthews Correlation Coefficient (MCC), and accuracy (ACC) as related to plastid localization prediction based on the number of true positives, false positives, true negatives, and false negatives compared to the annotations in the corresponding experimental dataset (see equations below). Combinatorial approaches were performed for each possible combination of programs from two up to all six programs, and different thresholds were evaluated based on the number of programs in agreement for plastid localization. Complete records of individual and combinatorial workflows for each experimental dataset are available in Supplementary File 1. The heatmap in Fig. 1 was generated using conditional formatting in Microsoft Excel.

$$\begin{array}{ccc}{\rm{Sensitivity}}({\rm{i}}) & = & \frac{tp}{tp+fn}\\ {\rm{Specificity}}({\rm{i}}) & = & \frac{tp}{tp+fp}\\ {\rm{MCC}}({\rm{i}}) & = & \frac{tp\times tn-fp\times fn}{\sqrt{(tp+fn)(tp+fp)(tn+fn)(tn+fp)}}\\ {\rm{Overall}}\,{\rm{Accuracy}}\,({\rm{ACC}}) & = & \frac{tp+tn}{tp+fp+tn+fn}\end{array}$$

where tp is the number of sequences correctly identified as plastid-targeted, tn is the number of sequences correctly predicted to be non-plastid-targeted, fp is the number of non-plastid-targeted sequences incorrectly predicted as plastid-targeted, and fn is the number of plastid-targeted sequences that were predicted as non-plastid-targeted. Note that these categorizations are based on the accuracy of the database annotation and any filtering that was applied to data subsets, and they may not reflect biological accuracy.
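As a minimal sketch of how the four measures above could be computed from confusion-matrix counts, the following Python function implements the equations as written. Note that Specificity is defined here as tp/(tp+fp), which some fields call precision; the sketch follows the equations above rather than the alternative tn/(tn+fp) definition. The function name and the example counts are hypothetical and are not taken from the study.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute SE, SP, MCC and ACC from confusion-matrix counts.

    SP follows the definition used in the equations above, tp / (tp + fp).
    """
    mcc_denominator = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "SE": _ratio(tp, tp + fn),
        "SP": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=60, fp=10, tn=120, fn=40)
print({name: round(value, 3) for name, value in example.items()})
```

The zero-denominator guard is a convenience added for the sketch; how degenerate cases were handled in the original evaluation is not stated.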

Whole proteome analysis

Predicted proteomes for Amborella trichopoda, Arabidopsis thaliana, Brachypodium distachyon, Fragaria vesca, Glycine max, Malus × domestica, Oryza sativa, Panicum virgatum, Populus trichocarpa, Prunus persica, Setaria italica, Solanum lycopersicum, and Sorghum bicolor were downloaded from Phytozome⁸⁷. The proteome of Anthurium amnicola was obtained by personal correspondence with Dr. Jon Suzuki, USDA-ARS, Hilo, Hawaii, in advance of the publication⁷⁶. For Vitis vinifera, an expanded proteome version was obtained from⁹⁴. For Malus × domestica, modifications to the predicted proteome were made because over 15,000 sequences, representing over 20% of the predicted proteome, were determined to have no significant matches to proteins from other species (See Supplementary File 5). The predicted proteome was expanded using apple transcriptome data that were downloaded from the NCBI SRA database under the project numbers PRJEB2506, PRJEB4314, PRJEB6212, and PRJNA231737, representing a mixture of leaf, apical meristem, fruit, and root tissues at different time points and under varying conditions^{82,83,84,85,127}. These sources are described further in Supplementary File 2. Sequence files were processed in CLC Genomics Workbench version 8 (Qiagen Bioinformatics, Hilden, Germany); paired Illumina read files and 454 sequencing files were indicated during import. Graphical QC reports were generated to obtain nucleotide contribution (GC content) and quality distribution (quality scores) by base position. Reads were processed to remove ambiguous nucleotides and base quality scores lower than 0.001. Illumina reads were additionally trimmed at the 5’ end until the GC content stabilized within 0.5% of the average, and reads with fewer than 34 bases remaining were discarded. All paired read files were subsequently merged using default settings. All processed read files were assembled de novo with default settings. 
Assembled contigs of >300 bp were kept and used to predict open reading frames (ORFs). Non-overlapping ORFs with at least 5x average base coverage and >300 bp were extracted and translated into protein sequences. Finally, extracted protein sequences were compared against the existing Malus × domestica v.1.0 predicted gene set⁸¹ downloaded from Rosaceae.org. All hits with greater than 98% ID and coverage (as per⁸⁵) were tagged as potential duplications or alleles of the original headers but were kept in the peptide dataset in case minor mutations caused differential localization prediction. All sequences generated from this transcriptome assembly are available in Supplementary File 6. In total, 36,477 sequences were obtained, of which 26,881 sequences were determined to be unique in comparison with the apple genome⁸¹. Addition of the unique genes from the de novo transcriptome created a final dataset of 64,680 unique proteins. Redundant sequences from the resulting transcriptome were retained in case minor differences resulted in differential targeting.

The predicted proteomes of all species were filtered to remove any sequences shorter than 100 residues and any that did not begin with methionine. Post-analysis filtering was accomplished by removing singleton sequences that failed to find matches with both the USEARCH method and BLAST (indicated for each sequence in Supplementary File 5). Remaining sequences were analyzed with TargetP v.1.1^{43,44} and Localizer v.1.0.2⁴¹. All sequences predicted by both methods to have a chloroplast transit peptide were classified as plastid-targeted, and all sequences with either “1 of 2” or “0 of 2” chloroplast transit peptide predictions were classified as non-plastid-targeted.
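The pre-filter and the two-predictor consensus rule described above can be illustrated with a minimal Python sketch. It is not the authors' Perl code. It assumes the TargetP and Localizer outputs have already been parsed into per-header booleans; the function names and the dictionary layout are hypothetical.

```python
from typing import Dict

MIN_LENGTH = 100  # residues; sequences shorter than this are removed


def passes_prefilter(seq: str) -> bool:
    """Keep sequences with at least MIN_LENGTH residues that begin with methionine."""
    return len(seq) >= MIN_LENGTH and seq.startswith("M")


def classify(targetp_ctp: bool, localizer_ctp: bool) -> str:
    """Call a protein plastid-targeted only when both predictors report a transit peptide."""
    votes = int(targetp_ctp) + int(localizer_ctp)
    # "2 of 2" is plastid-targeted; "1 of 2" and "0 of 2" are not
    return "plastid-targeted" if votes == 2 else "non-plastid-targeted"


def consensus_calls(
    proteome: Dict[str, str],      # header -> protein sequence (hypothetical layout)
    targetp: Dict[str, bool],      # header -> TargetP predicted a chloroplast transit peptide
    localizer: Dict[str, bool],    # header -> Localizer predicted a chloroplast transit peptide
) -> Dict[str, str]:
    """Apply the pre-filter, then the consensus rule, to every sequence."""
    calls: Dict[str, str] = {}
    for header, seq in proteome.items():
        if not passes_prefilter(seq):
            continue  # filtered sequences receive no call at all
        calls[header] = classify(
            targetp.get(header, False), localizer.get(header, False)
        )
    return calls
```

Requiring agreement between both programs trades sensitivity for specificity: a protein flagged by only one predictor is treated the same as one flagged by neither.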

Clustering of gene families

Reciprocal best BLAST hit (RBH) clustering was performed as follows. Pairwise BLAST-P (v.2.3.0+ command line executable^{128,129}) was performed for each species’ predicted proteome against that of every other species in both forward and reverse directions. These results were filtered for hits in which both identity and coverage exceeded 40%. Of these, only hits in which two sequences from different genomes were each other’s best hit were kept. Next, better-BLAST hits within each species were identified by conducting pairwise BLAST-P of each predicted proteome against itself. Hits that exceeded 90% coverage and identity and were reciprocal within the first 10 hits were collected. Cluster merging was performed by iterating through each possible header and collapsing all pairwise hits containing that header.
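The between-species RBH step and the cluster merging can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the published Perl scripts: the `Hit` record is hypothetical, "best hit" is assumed to mean the highest bit score among hits passing the 40% thresholds, and merging is done with a union-find structure, which yields the same result as repeatedly collapsing pairs that share a header. The within-species better-BLAST step is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    query: str       # query sequence header
    subject: str     # subject sequence header
    identity: float  # percent identity
    coverage: float  # percent coverage
    bitscore: float  # used here to rank hits (an assumption)


def best_hits(hits: Iterable[Hit], min_pct: float = 40.0) -> Dict[str, str]:
    """Map each query to its top-scoring subject among hits exceeding both thresholds."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.identity <= min_pct or h.coverage <= min_pct:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_pairs(forward: Iterable[Hit], reverse: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Keep (a, b) only when b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {(q, s) for q, s in fwd.items() if rev.get(s) == q}


def merge_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Collapse all pairwise links that share a header into clusters."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for header in parent:
        clusters[find(header)].add(header)
    return list(clusters.values())
```

Because merging is transitive, two sequences can end up in the same cluster without being direct reciprocal best hits of each other, provided a chain of pairwise links connects them.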

Clustering using the UCLUST algorithm proceeded as follows. An initial run on a length-sorted FASTA file containing all sequences was performed using the “Cluster_Fast” function of UCLUST (v.9.2.64_win32;⁵⁶) with 40% identity and 40% query coverage. Next, random seeds were constructed by extracting a single random sequence from each cluster, sorting the resulting sequences by length, and appending them to a length-sorted FASTA of the full sequence list used in the initial “Cluster_Fast” analysis. 100 randomly-seeded FASTA files were then analyzed with “Cluster_Fast” set to 90% sequence identity. Target and query coverage were additionally set to 0.4 to avoid problems with small query sequences acting as centroids for much larger sequences as a result of USEARCH being performed in sequential rather than length-sorted order. Cluster merging was performed by iteratively searching through each possible sequence header and collapsing all clusters containing that header.

Custom scripts were developed for automating program workflows, referencing and translating sequences or headers, performing seed randomization for the modified UCLUST technique, performing cluster expansion, calculating statistics on clustering outputs, and referencing headers to respective clusters for both workflows. Sequence members within merged clusters from RBH and UCLUST methods were referenced to the predicted plastid targeting phenotype, and all clusters containing plastid-targeted members were extracted.

Conserved plastid-targeted protein-coding gene families were defined as clusters containing at least 13 species and in which all had either predicted plastid transit peptides or at least three additional sequences. Semi-conserved plastid-targeted gene families were defined as clusters containing plastid-targeted sequences from at least 2 species within each family or clade and no predicted plastid-targeted sequences from species outside that clade.
Non-conserved plastid-targeted protein-coding gene families were defined as all clusters containing a minimum of three species in which only one species had a plastid-targeted sequence.
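The family definitions above can be expressed as a small classifier. The Python sketch below is a simplification under explicit assumptions: a cluster is represented as a list of hypothetical `(species, is_plastid_targeted)` pairs; "conserved" is reduced to "at least 13 species, each with a plastid-targeted member", omitting the "at least three additional sequences" clause; and semi-conserved families are not handled because they require a species-to-clade mapping that is not shown here.

```python
from typing import List, Set, Tuple

Member = Tuple[str, bool]  # (species name, predicted plastid-targeted?)


def species_sets(cluster: List[Member]) -> Tuple[Set[str], Set[str]]:
    """Return (all species in the cluster, species with a plastid-targeted member)."""
    all_species = {sp for sp, _ in cluster}
    targeted = {sp for sp, is_targeted in cluster if is_targeted}
    return all_species, targeted


def classify_cluster(
    cluster: List[Member],
    min_conserved_species: int = 13,
    min_nonconserved_species: int = 3,
) -> str:
    """Assign a cluster to a simplified conservation category."""
    all_species, targeted = species_sets(cluster)
    if not targeted:
        return "no plastid-targeted members"
    # Non-conserved: at least three species present, only one with predicted targeting
    if len(all_species) >= min_nonconserved_species and len(targeted) == 1:
        return "non-conserved"
    # Conserved (simplified assumption): every species has a plastid-targeted member
    if len(all_species) >= min_conserved_species and targeted == all_species:
        return "conserved"
    return "other"  # includes semi-conserved families, not resolved in this sketch
```

Counting species rather than sequences keeps lineage-specific gene duplications from inflating a cluster's apparent conservation.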

Gene ontology enrichment

Annotations for NPTPs were retrieved from Phytozome⁸⁷ for each of the species used in the analysis except Anthurium amnicola and Vitis vinifera, which were retrieved from⁷⁶ and⁹⁴, respectively. Non-redundant predicted proteins produced by the de novo transcriptome assembly of Malus × domestica were annotated using BLASTP against the NR Protein database at NCBI with BLAST2GO v.4.1.9 default parameters¹⁰⁸ (BioBam Bioinformatics, Valencia, Spain). GO terms were converted into GOslim annotations using BLAST2GO, and for each cluster, all terms shared by at least three species and present in over 10% of a cluster’s sequences were extracted to develop query datasets. In parallel, the same methods were used to extract GO terms from the total list of clusters to serve as reference datasets. Enrichment of GO terms in the shared plastid-targeted clusters was performed using BLAST2GO; Fisher’s Exact Test was used to calculate significance, with a false discovery rate (FDR) of less than 0.05 as the minimum significance threshold¹⁰⁸. Graphical analyses of enriched GO terms were produced in BLAST2GO.
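To make the enrichment test concrete, the following Python sketch computes a one-sided Fisher's exact test for over-representation and applies an FDR correction. It is an illustration, not a reimplementation of BLAST2GO: it assumes the query set is a subset of the reference set, so the test reduces to a hypergeometric tail, and it assumes the Benjamini-Hochberg procedure for the FDR. The function names and term labels are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    k: query clusters annotated with the term
    n: total query clusters
    K: reference clusters annotated with the term
    N: total reference clusters (query assumed to be a subset)
    """
    total = comb(N, n)
    upper = min(n, K)
    # Probability of drawing k or more annotated clusters in a sample of size n
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_terms(
    counts: Dict[str, Tuple[int, int]],  # term -> (k in query, K in reference)
    n: int,
    N: int,
    fdr: float = 0.05,
) -> List[str]:
    """List terms whose adjusted p-value falls below the FDR threshold."""
    terms = list(counts)
    pvals = [fisher_enrichment_p(counts[t][0], n, counts[t][1], N) for t in terms]
    qvals = benjamini_hochberg(pvals)
    return [t for t, q in zip(terms, qvals) if q < fdr]
```

Correcting across all tested terms matters here because hundreds of GOslim categories are evaluated at once, so uncorrected p-values would produce many false positives.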

Gene and phenotype identification

Full gene annotations, including described gene names, were downloaded for the TAIR10 Arabidopsis genome from Phytozome⁸⁷. Gene names were referenced from the annotation file for Arabidopsis sequences present in conserved plastid-targeted protein clusters. Phenotype information for species-unique plastid-targeted proteins was referenced on NCBI¹³⁰.

Data availability

The datasets supporting the conclusions of this article are included within the article and its Supplementary files. Perl scripts used in the organization of data and execution of protein clustering are available at Sourceforge under the Project Name “Plastid Variation” and the homepage https://sourceforge.net/p/plastid-variation. Operating System(s): Platform Independent. Programming Language: Perl. Other Requirements: TargetP v.1.1, Localizer v.1.0.2, BLAST v.2.3.0+ command line executable, UCLUST v.9.2.64_win32, RAxML v.8.2.31, MUSCLE v.3.8.31, MAFFT v. 7.407, Phyutility v2.2.6, FastTree 2.1.10. License: open source. Restrictions for use by non-academics: no restrictions.

References

Sugiura, M. The chloroplast genome. Plant Mol. Biol. 19, 149–168 (1992).
Richly, E. & Leister, D. An improved prediction of chloroplast proteins reveals diversities and commonalities in the chloroplast proteomes of Arabidopsis and rice. Gene 329, 11–16 (2004).
Armbruster, U., Pesaresi, P., Pribil, M., Hertle, A. & Leister, D. Update on chloroplast research: New tools, new topics, and new trends. Mol. Plant 4, 1–16 (2011).
Millar, A. H., Whelan, J. & Small, I. Recent surprises in protein targeting to mitochondria and plastids. Curr. Opin. Plant Biol. 9, 610–615 (2006).
Pierleoni, A., Martelli, P. L., Fariselli, P. & Casadio, R. eSLDB: Eukaryotic subcellular localization database. Nucleic Acids Res 35, 208–212 (2007).
Ajjawi, I., Lu, Y., Savage, L. J., Bell, S. M. & Last, R. L. Large-Scale Reverse Genetics in Arabidopsis: Case Studies from the Chloroplast 2010 Project. Plant Physiol. 152, 529–540 (2010).
Lu, Y., Savage, L. J., Larson, M. D., Wilkerson, C. G. & Last, R. L. Chloroplast 2010: A Database for Large-Scale Phenotypic Screening of Arabidopsis Mutants. Plant Physiol. 155, 1589–1600 (2011).
The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Martin, W. et al. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl. Acad. Sci. 99, 12246–12251 (2002).
Schaeffer, S., Harper, A., Raja, R., Jaiswal, P. & Dhingra, A. Comparative analysis of predicted plastid-targeted proteomes of sequenced higher plant genomes. PLoS One 9, e112870 (2014).
Schaeffer, S. M. et al. Comparative ultrastructure of fruit plastids in three genetically diverse genotypes of apple (Malus × domestica Borkh.) during development. Plant Cell Rep 36, 1627–1640 (2017).
Solymosi, K. & Keresztes, A. Plastid Structure, Diversification and Interconversions II. Land Plants. Curr. Chem. Biol. 6, 187–204 (2013).
Wang, Y. Q. et al. Proteomic analysis of chromoplasts from six crop species reveals insights into chromoplast function and development. J. Exp. Bot 64, 949–961 (2013).
Li, L. & Yuan, H. Chromoplast biogenesis and carotenoid accumulation. Arch. Biochem. Biophys. 539, 102–109 (2013).
Egea, I. et al. Chromoplast differentiation: Current status and perspectives. Plant Cell Physiol 51, 1601–1611 (2010).
Barsan, C. et al. Proteomic Analysis of Chloroplast-to-Chromoplast Transition in Tomato Reveals Metabolic Shifts Coupled with Disrupted Thylakoid Biogenesis Machinery and Elevated Energy-Production Components. Plant Physiol. 160, 708–725 (2012).
Stockhaus, J. et al. The promoter of the gene encoding the C-4 form of phosphoenolpyruvate carboxylase directs mesophyll-specific expression in transgenic C-4 Flaveria spp. Plant Cell 9, 479–489 (1997).
Majeran, W. Functional Differentiation of Bundle Sheath and Mesophyll Maize Chloroplasts Determined by Comparative Proteomics. Plant Cell 17, 3111–3140 (2005).
Ngernprasirtsiri, J., Chollet, R., Kobayashi, H., Sugiyama, T. & Akazawa, T. DNA methylation and the differential expression of C4 photosynthesis genes in mesophyll and bundle sheath cells of greening maize leaves. J. Biol. Chem. 264, 8241–8248 (1989).
Majeran, W. et al. Consequences of C 4 Differentiation for Chloroplast Membrane Proteomes in Maize Mesophyll and Bundle Sheath Cells. Mol. Cell. Proteomics 7, 1609–1638 (2008).
Srividya, N., Davis, E. M., Croteau, R. B. & Lange, B. M. Functional analysis of (4S)-limonene synthase mutants reveals determinants of catalytic outcome in a model monoterpene synthase. Proc. Natl. Acad. Sci. 112, 3332–3337 (2015).
Craig, W. et al. Transplastomic tobacco plants expressing a fatty acid desaturase gene exhibit altered fatty acid profiles and improved cold tolerance. Transgenic Res. 17, 769–782 (2008).
Ferro, M. et al. AT_CHLORO, a Comprehensive Chloroplast Proteome Database with Subplastidial Localization and Curated Information on Envelope Proteins. Mol. Cell. Proteomics 9, 1063–1084 (2010).
Sun, Q. et al. PPDB, the Plant Proteomics Database at Cornell. Nucleic Acids Res 37, 969–974 (2009).
Hooper, C. M., Castleden, I. R., Tanz, S. K., Aryamanesh, N. & Millar, A. H. SUBA4: The interactive data analysis centre for Arabidopsis subcellular protein locations. Nucleic Acids Res 45, D1064–D1074 (2017).
Hooper, C. M., Castleden, I. R., Aryamanesh, N., Jacoby, R. P. & Millar, A. H. Finding the subcellular location of barley, wheat, rice and maize proteins: The compendium of crop proteins with annotated locations (cropPAL). Plant Cell Physiol 57, e9 (2015).
van Wijk, K. J. & Baginsky, S. Plastid Proteomics in Higher Plants: Current State and Future Goals. Plant Physiol. 155, 1578–1588 (2011).
Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 73, 2092–2123 (2010).
Jeong, K., Kim, S. & Bandeira, N. False discovery rates in spectral identification. BMC Bioinformatics 13, S2 (2012).
Doyle, S. R., Kasinadhuni, N. R. P., Chan, C. K. & Grant, W. N. Evidence of Evolutionary Constraints That Influences the Sequence Composition and Diversity of Mitochondrial Matrix Targeting Signals. PLoS One 8, 1–8 (2013).
Lisenbee, C. S., Karnik, S. K. & Trelease, R. N. Overexpression and mislocalization of a tail-anchored GFP redefines the identity of peroxisomal ER. Traffic 4, 491–501 (2003).
Small, I., Wintz, H., Akashi, K. & Mireau, H. Two birds with one stone: genes that encode products targeted to two or more compartments. Plant Mol. Biol 38, 265–277 (1998).
Carrie, C., Giraud, E. & Whelan, J. Protein transport in organelles: Dual targeting of proteins to mitochondria and chloroplasts. FEBS J 276, 1187–1195 (2009).
Li, H. min & Teng, Y. S. Transit peptide design and plastid import regulation. Trends Plant Sci 18, 360–366 (2013).
Lee, D. W. & Hwang, I. Evolution and Design Principles of the Diverse Chloroplast Transit Peptides. Mol. Cells 41, 161–167 (2018).
Lee, D. W. et al. Arabidopsis Nuclear-Encoded Plastid Transit Peptides Contain Multiple Sequence Subgroups with Distinctive Chloroplast-Targeting Sequence Motifs. Plant Cell 20, 1603–1622 (2008).
von Heijne, G., Steppuhn, J. & Herrmann, R. G. Domain structure of mitochondrial and chloroplast targeting peptides. Eur. J. Biochem. 180, 535–545 (1989).
Schein, A. I., Kissinger, J. C. & Ungar, L. H. Chloroplast transit peptide prediction: a peek inside the black box. Nucleic Acids Res 29, E82 (2001).
Bannai, H., Tamada, Y., Maruyama, O., Nakai, K. & Miyano, S. Extensive feature detection of N-terminal protein sorting signals. Bioinformatics 18, 298–305 (2002).
Horton, P. et al. WoLF PSORT: Protein localization predictor. Nucleic Acids Res 35, 585–587 (2007).
Sperschneider, J. et al. LOCALIZER: subcellular localization prediction of both plant and effector proteins in the plant cell. Sci. Rep. 7, 44598 (2017).
Emanuelsson, O., Nielsen, H. & von Heijne, G. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8, 978–984 (1999).
Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence. J. Mol. Biol. 300, 1005–1016 (2000).
Emanuelsson, O., Brunak, S., von Heijne, G. & Nielsen, H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2, 953–71 (2007).
Small, I., Peeters, N., Legeai, F. & Lurin, C. Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 4, 1581–1590 (2004).
Petsalaki, E. I., Bagos, P. G., Litou, Z. I. & Hamodrakas, S. J. PredSL: A Tool for the N-terminal Sequence-based Prediction of Protein Subcellular Localization. Genomics. Proteomics Bioinformatics 4, 48–55 (2006).
Bodén, M. The prediction of targeting peptides is enhanced by sequentially biased recurrent networks. (2014).
Chou, K. C. & Cai, Y. D. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277, 45765–45769 (2002).
Brady, S. & Shatkay, H. EpiLoc: a (working) text-based system for predicting protein subcellular location. Pacific Symp. Biocomput. 615, 604–615 (2008).
Fyshe, A., Liu, Y., Szafron, D., Greiner, R. & Lu, P. Improving subcellular localization prediction using text classification and the gene ontology. Bioinformatics 24, 2512–2517 (2008).
Xiong, E., Zheng, C., Wu, X. & Wang, W. Protein Subcellular Location: The Gap Between Prediction and Experimentation. Plant Mol. Biol. Report 34, 52–61 (2016).
Blum, T., Briesemeister, S. & Kohlbacher, O. MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics 10, 274 (2009).
Briesemeister, S. et al. SherLoc2: A High-Accuracy Hybrid Method for Predicting Subcellular Localization of Proteins. J. Proteome Res. 8, 5363–5366 (2009).
Briesemeister, S., Rahnenführer, J. & Kohlbacher, O. YLoc-an interpretable web server for predicting subcellular localization. Nucleic Acids Res 38, 497–502 (2010).
Chou, K. C. & Shen, H. Bin. Plant-mPLoc: A top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS One 5 (2010).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
Van Wijk, K. J. Plastid proteomics. Plant Physiol. Biochem. 42, 963–977 (2004).
Heazlewood, J. L. Combining Experimental and Predicted Datasets for Determination of the Subcellular Location of Proteins in Arabidopsis. Plant Physiol. 139, 598–609 (2005).
Heazlewood, J. L., Verboom, R. E., Tonti-Filippini, J., Small, I. & Millar, A. H. SUBA: The Arabidopsis subcellular database. Nucleic Acids Res 35, 213–218 (2007).
Hooper, C. M. et al. SUBAcon: A consensus algorithm for unifying the subcellular localization data of the Arabidopsis proteome. Bioinformatics 30, 3356–3364 (2014).
Carrie, C. & Small, I. A reevaluation of dual-targeting of proteins to mitochondria and chloroplasts. Biochim. Biophys. Acta - Mol. Cell Res 1833, 253–259 (2013).
Mitschke, J. et al. Prediction of dual protein targeting to plant organelles: Methods. New Phytol 183, 224–236 (2009).
Bhattacharya, D., Archibald, J. M., Weber, A. P. M. & Reyes-Prieto, A. How do endosymbionts become organelles? Understanding early events in plastid evolution. BioEssays 29, 1239–1246 (2007).
Kleffmann, T. et al. The Arabidopsis thaliana chloroplast proteome reveals pathway abundance and novel protein functions. Curr. Biol. 14, 354–362 (2004).
von Zychlinski, A. et al. Proteome analysis of the rice etioplast: metabolic and regulatory networks and novel protein functions. Mol. Cell. Proteomics 4, 1072–1084 (2005).
Zybailov, B. et al. Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PLoS One 3, e1994 (2008).
de Vries, J., Sousa, F. L., Bölter, B., Soll, J. & Gould, S. B. YCF1: A Green TIC? Plant Cell 27, 1827–1833 (2015).
Nakai, M. YCF1: A Green TIC: Response to the de Vries et al. Commentary. Plant Cell 27, 1834–1838 (2015).
Nakai, M. New Perspectives on Chloroplast Protein Import. Plant Cell Physiol 59, 1111–1119 (2018).
Barsan, C. et al. Characteristics of the tomato chromoplast revealed by proteomic analysis. J. Exp. Bot 61, 2413–2431 (2010).
Zeng, Y. et al. Phosphoproteomic analysis of chromoplasts from sweet orange during fruit ripening. Physiol. Plant. 150, 252–270 (2014).
Zeng, Y. et al. A proteomic analysis of the chromoplasts isolated from sweet orange fruits [Citrus sinensis (L.) Osbeck]. J. Exp. Bot 62, 5297–5309 (2011).
Zhu, M. et al. A comprehensive proteomic analysis of elaioplasts from citrus fruits reveals insights into elaioplast biogenesis and function. Hortic. Res. 5, 0–10 (2018).
Li, H. & Chiu, C.-C. Protein Transport into Chloroplasts. Annu. Rev. Plant Biol. 61, 157–180 (2010).
Albert, V. A. et al. The Amborella Genome and the Evolution of Flowering Plants. Science 342, 1241089 (2013).
Suzuki, J. Y. et al. Organ-specific transcriptome profiling of metabolic and pigment biosynthesis pathways in the floral ornamental progenitor species Anthurium amnicola Dressler. Sci. Rep. 7, 1–15 (2017).
Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools. Nucleic Acids Res 40, 1202–1210 (2012).
The International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463, 763–768 (2010).
Shulaev, V. et al. The genome of woodland strawberry (Fragaria vesca). Nat. Genet. 43, 109–116 (2011).
Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183 (2010).
Velasco, R. et al. The genome of the domesticated apple (Malus × domestica Borkh.). Nat. Genet. 42, 833–839 (2010).
Krost, C., Petersen, R. & Schmidt, E. R. The transcriptomes of columnar and standard type apple trees (Malus x domestica) - A comparative study. Gene 498, 223–230 (2012).
Krost, C. et al. Evaluation of the hormonal state of columnar apple trees (Malus x domestica) based on high throughput gene expression studies. Plant Mol. Biol. 81, 211–220 (2013).
Gusberti, M., Gessler, C. & Broggini, G. A. L. RNA-seq analysis reveals candidate genes for ontogenic resistance in Malus-Venturia pathosystem. PLoS One 8 (2013).
Bai, Y., Dougherty, L. & Xu, K. Towards an improved apple reference transcriptome using RNA-seq. Mol. Genet. Genomics 289, 427–438 (2014).
Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: Improvements and new features. Nucleic Acids Res 35, D883–D887 (2007).
Phytozome V.12.1. (2019). Available at, https://phytozome.jgi.doe.gov/pz/portal.html. (Accessed: 2nd May 2018).
Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
Verde, I. et al. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat. Genet. 45, 487–494 (2013).
Bennetzen, J. L. et al. Reference genome sequence of the model plant Setaria. Nat. Biotechnol. 30, 555–561 (2012).
The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).
McCormick, R. F. et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. Plant J. 93, 338–354 (2018).
Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
Vitulo, N. et al. A deep survey of alternative splicing in grape reveals changes in the splicing machinery related to tissue, stress condition and genotype. BMC Plant Biol 14, 20–30 (2014).
O’Brien, K. P., Remm, M. & Sonnhammer, E. L. L. Inparanoid: A comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33, 476–480 (2005).
Li, L., Stoeckert, C. J. J. & Roos, D. S. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 13, 2178–2189 (2003).
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A Genomic Perspective on Protein Families. Science 278, 631–637 (1997).
Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28, 33–36 (2000).
Das, M. et al. Expression pattern similarities support the prediction of orthologs retaining common functions after gene duplication events. Plant Physiol. 171, 01207.2015 (2016).
Trachana, K. et al. Orthology prediction methods: A quality assessment using curated protein families. BioEssays 33, 769–780 (2011).
Van Bel, M. et al. Dissecting Plant Genomes with the PLAZA Comparative Genomics Platform. Plant Physiol. 158, 590–600 (2012).
Kim, K., Kim, W. & Kim, S. ReMark: An automatic program for clustering orthologs flexibly combining a Recursive and a Markov clustering algorithms. Bioinformatics 27, 1731–1733 (2011).
Altenhoff, A. M. & Dessimoz, C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 5 (2009).
Yang, Y. & Smith, S. A. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: Improving accuracy and matrix occupancy for phylogenomics. Mol. Biol. Evol 31, 3081–3092 (2014).
Chiu, J. C. et al. OrthologID: Automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics 22, 699–707 (2006).
Sanderson, M. J. & McMahon, M. M. Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evol. Biol. 7, S3 (2007).
Hönigschmid, P., Bykova, N., Schneider, R., Ivankov, D. & Frishman, D. Evolutionary Interplay between Symbiotic Relationships and Patterns of Signal Peptide Gain and Loss. Genome Biol. Evol 10, 928–938 (2018).
Conesa, A. & Götz, S. Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int. J. Plant Genomics 2008 (2008).
Koonin, E. V., Aravind, L. & Kondrashov, A. S. The Impact of Comparative Genomics on Our Understanding of Evolution. Cell 101, 573–576 (2000).
Osterman, A. & Overbeek, R. Missing genes in metabolic pathways: A comparative genomics approach. Curr. Opin. Chem. Biol. 7, 238–251 (2003).
Byun, S. A. & Singh, S. Protein subcellular relocalization increases the retention of eukaryotic duplicate genes. Genome Biol. Evol 5, 2402–2409 (2013).
Bennetzen, J. L. & Wang, H. The Contributions of Transposable Elements to the Structure, Function, and Evolution of Plant Genomes. Annu. Rev. Plant Biol. 65, 505–530 (2014).
Kleine, T., Maier, U. G. & Leister, D. DNA Transfer from Organelles to the Nucleus: The Idiosyncratic Genetics of Endosymbiosis. Annu. Rev. Plant Biol. 60, 115–138 (2009).
Christian, R., Hewitt, S., Nelson, G., Roalson, E. & Dhingra, A. Plastid Transit Peptides - Where Do They Come From and Where Do They All Belong? Assessment of Chloroplast Transit Peptide Evolution in Multi-Species and Pan-Genomic Comparisons. (2019).
Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8 (2017).
Hirsch, C. N. et al. Insights into the Maize Pan-Genome and Pan-Transcriptome. Plant Cell 26, 121–135 (2014).
Montenegro, J. D. et al. The pangenome of hexaploid bread wheat. Plant J. 90, 1007–1013 (2017).
Yao, W. et al. Exploring the rice dispensable genome using a metagenome-like assembly strategy. Genome Biol. 16, 187 (2015).
Alonso-Blanco, C. et al. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
Aflitos, S. et al. Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. Plant J. 80, 136–148 (2014).
Lin, T. et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 46, 1220–1226 (2014).
Gan, X. et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419 (2011).
Golicz, A. A. et al. The pangenome of an agronomically important crop plant Brassica oleracea. Nat. Commun. 7, 13390 (2016).
Zhou, Z. et al. Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat. Biotechnol. 33, 401–414 (2015).
Zhou, P. et al. Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes. BMC Genomics 18, 261 (2017).
Ba, A. N. N., Pogoutse, A., Provart, N. & Moses, A. M. NLStradamus: A simple Hidden Markov Model for nuclear localization signal prediction. BMC Bioinformatics 10, 1–11 (2009).
Petersen, R., Djozgic, H., Rieger, B., Rapp, S. & Schmidt, E. R. Columnar apple primary roots share some features of the columnar-specific gene expression profile of aerial plant parts as evidenced by RNA-Seq analysis. BMC Plant Biol 15, 1–16 (2015).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic Local Alignment Search Tool. J. Mol. Biol. 215, 403–410 (1990).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402 (1997).
National Center for Biotechnology Information (NCBI). Available at, https://www.ncbi.nlm.nih.gov. (Accessed: 2nd December 2019).


Acknowledgements

Work in the Dhingra lab was supported by Washington State University Agriculture Center Research Hatch Grant WNP00011 to A.D. R.C. and S.H. acknowledge the support received from the National Institutes of Health/National Institute of General Medical Sciences through an institutional training grant award T32-GM008336. The contents of this work are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH.

Author information

Authors and Affiliations

Department of Horticulture, Washington State University, Pullman, WA, USA
Ryan W. Christian, Seanna L. Hewitt & Amit Dhingra
Molecular Plant Sciences Program, Washington State University, Pullman, WA, USA
Ryan W. Christian, Seanna L. Hewitt, Eric H. Roalson & Amit Dhingra
School of Biological Sciences, Washington State University, Pullman, WA, USA
Eric H. Roalson

Authors

Ryan W. Christian
Seanna L. Hewitt
Eric H. Roalson
Amit Dhingra

Contributions

R.W.C. and A.D. designed the study. R.W.C. performed localization prediction, gene clustering, and data analysis. E.H.R. assisted in methods development. S.L.H. performed gene annotation analyses. A.D. and E.H.R. supervised the study. R.W.C. and A.D. prepared the manuscript. All authors read and approved the manuscript. The authors declare no conflict of interest.

Corresponding author

Correspondence to Amit Dhingra.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary File 1

Supplementary File 2

Supplementary File 3

Supplementary File 4

Supplementary File 5

Supplementary File 6

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Christian, R.W., Hewitt, S.L., Roalson, E.H. et al. Genome-Scale Characterization of Predicted Plastid-Targeted Proteomes in Higher Plants. Sci Rep 10, 8281 (2020). https://doi.org/10.1038/s41598-020-64670-5

Received: 21 June 2019
Accepted: 20 April 2020
Published: 19 May 2020
DOI: https://doi.org/10.1038/s41598-020-64670-5
