A draft map of the mouse pluripotent stem cell spatial proteome

Knowledge of the subcellular distribution of proteins is vital for understanding cellular mechanisms. Capturing the subcellular proteome in a single experiment has proven challenging, with studies focusing on specific compartments or assigning proteins to subcellular niches with low resolution and/or accuracy. Here we introduce hyperLOPIT, a method that couples extensive fractionation, quantitative high-resolution accurate mass spectrometry with multivariate data analysis. We apply hyperLOPIT to a pluripotent stem cell population whose subcellular proteome has not been extensively studied. We provide localization data on over 5,000 proteins with unprecedented spatial resolution to reveal the organization of organelles, sub-organellar compartments, protein complexes, functional networks and steady-state dynamics of proteins and unexpected subcellular locations. The method paves the way for characterizing the impact of post-transcriptional and post-translational modification on protein location and studies involving proteome-level locational changes on cellular perturbation. An interactive open-source resource is presented that enables exploration of these data.

Resolution of some compartments is more evident in the lower components, for example separation of plasma membrane and lysosome in PC3, and mitochondrion and peroxisome in PC7. HyperLOPIT verifies localization of high-confidence cell surface proteins assigned by Bausch-Fluck and co-workers, and provides experimental evidence to support putative cell surface proteins. Most such proteins are observed in the plasma membrane or trans-Golgi network in the hyperLOPIT data. Proteins described as non-specific interactors by Bausch-Fluck and co-workers display hyperLOPIT distributions that are inconsistent with cell surface proteins, such as mitochondrial, nuclear, ribosomal and cytosolic localization.

Supplementary Figure 9 | Examples of proteins displaying mixed localization. Proteins with mixed localization do not co-distribute with classifiable organelle phenotypes (muted colors), and therefore display characteristic distribution patterns. The nuclear import/export machinery demonstrates an intermediate position between the cytosol and the nucleus, while the Rab G-proteins are distributed throughout the secretory pathway. Similarly, the MCM (minichromosome maintenance) complex has a distinct location between the nucleus and cytoplasm, in accordance with its role in DNA replication initiation 2 . Tfe3 is a transcription factor whose nuclear/cytoplasmic ratio is indicative of differentiation status 3 , while nucleocytoplasmic re-localization of Pcna has previously been demonstrated to modulate differentiation of neutrophils 4 . The Bcl-2 family member Mcl-1 displays an intermediate position between the mitochondria, endoplasmic reticulum and the nucleus 5 .
Tom34 is a cytosolic co-chaperone involved in mitochondrial protein import 6 . Also shown are two examples of complexes where a single member of the complex has a distinct localization from the core group (TFIID complex and the exocyst complex). Taf7 is thought to dissociate from the TFIID complex following initiation of transcription, and also has a role in the assembly of several other transcription pre-initiation complexes, which may explain its separate steady-state location 7 . Exoc8 is localized away from the core exocyst complex, and co-distributes with its known binding partners Par6 and RalA.
HyperLOPIT adds a spatial context to interactomics studies. The aminoacyl-tRNA synthetase complex is distributed between the cytosol and ribosome, consistent with its expected function. Additional assignments to the human orthologue of this complex by Havugimana and co-workers were localized to the cytosol, suggesting that their interaction with the complex is transient or unstable relative to the 'core' curated complex. (C) The spatial context can also be used to add confidence to novel assignments. Two of the eight novel protein assignments to the mitochondrial ribosome were found to localize to the mitochondrion and were therefore plausible interactors. The remaining six novel interactors were distributed across a range of other subcellular compartments, suggesting that these interactions are improbable. (D) Putative protein complexes can also be evaluated with this approach. Components of the putative complex shown here are distributed across many different subcellular compartments, suggesting that the probability of all components co-localizing to form a single complex is low. The putative complex is therefore likely a false positive in this case.
(A) The long isoform of leucine aminopeptidase 3 (Lap3) was identified with mitochondrial localization, whereas the short isoform, which lacks the N-terminal import sequence, is localized between the cytosol and plasma membrane. Predicted interaction partners of Lap3 are found to localize across these three distributions, suggesting that the interactions are isoform-specific due to the differential compartmentalization of Lap3. (B) TMT 10-plex reporter ion profiles for the two isoforms of chromatin modifier Dnmt1 display differential localization. The long isoform enriches in the TMT 130C channel, consistent with chromatin localization, whereas the short isoform is most enriched in the TMT 129C channel, suggesting non-chromatin nuclear localization.

Supplementary Note 1: Machine Learning Results
The first step of the classification process is to obtain a set of well-characterised organelle residents, termed protein 'markers'. These markers, once defined, can be used as input labelled data to train a machine learning classifier to assign proteins of unknown localisation to one of the localisations covered in the protein marker set. It is however laborious and extremely difficult to manually define reliable markers that cover the full sub-cellular diversity in the data, and furthermore to obtain markers that represent the true structure of any sub-cellular clusters determined, which is essential for sound analysis. As such, an initial round of phenotype discovery was conducted using the phenoDisco algorithm 13 , in the pRoloc package 14 .
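The idea of training on markers and then assigning unlabelled proteins can be illustrated with a minimal Python sketch. It rests on assumptions: a nearest-centroid rule stands in for whichever classifier is actually applied to these data, the profiles are invented three-channel toy values, and the names `train_centroids` and `classify` are hypothetical.

```python
import math

def centroid(profiles):
    """Column-wise mean of a list of equal-length quantitative profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def train_centroids(markers):
    """markers maps a localisation label to the profiles of its marker proteins."""
    return {label: centroid(profiles) for label, profiles in markers.items()}

def classify(profile, centroids):
    """Assign a profile to the localisation with the closest centroid (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

# Toy marker set (hypothetical values, not taken from the dataset).
markers = {
    "mitochondrion": [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]],
    "cytosol": [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
}
centroids = train_centroids(markers)
unknown_label = classify([0.15, 0.15, 0.7], centroids)
```

A rule of this kind can only return labels present in the marker set, which is why the coverage of the markers matters so much for the analysis that follows.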

Phenotype discovery
The phenoDisco algorithm combines iterative cluster merging with Gaussian Mixture Modelling and outlier detection. Given a minimal initial set of markers and unlabelled data, it can effectively detect new putative clusters beyond those initially described manually.
Ten new phenotype clusters were detected in the dataset (Supplementary Figure 15 and Supplementary Table 5). Each cluster was carefully validated by querying the UniProt database 15 , the Gene Ontology 16 and the literature, as per the original pre-defined input markers, to assess biological relevance (Supplementary Table 6). Clusters that contained residents of small organellar structures such as the lysosome (phenotype 3), endosome (phenotype 4) and peroxisome (phenotype 7) were detected, thus confirming that each forms an independent structure in the data. Similarly, two very distinct nuclear clusters were confirmed, enriched in chromatin (phenotype 1) and in nucleolus and other non-chromatin (phenotype 2) localised proteins. Further clusters contained actin cytoskeletal localised proteins (phenotype 9), ER localised proteins (phenotype 8) and a large cluster of cytosolic proteins (phenotype 5). We also see an interesting cluster that contains a small number of p-body proteins (phenotype 10) and a cluster of proteins that have mixed nuclear/cytoplasmic distributions (phenotype 6), many of which are known to shuttle between the nucleus and cytoplasm (see Supplementary Data 1 for phenoDisco output).
Following examination of the phenotype clusters, further mining was conducted, and well-known residents of the validated organelles, as defined by UniProt and the literature, were extracted and added to the list of protein markers to be used in a round of supervised machine learning classification. Markers for the lysosome, endosome, peroxisome, actin cytoskeleton, chromatin, nucleolus (non-chromatin) and cytosol were extracted from the discovery analysis and added to the list of marker proteins. Proteins from phenotype 8, which are ER localised, were added to the existing set of ER markers, thus extending the number of markers for this organelle.
Markers from phenotype 10 and phenotype 6 were left out of the final set of markers, as they were not highly enriched for one specific phenotype, and additionally the number of markers in these clusters was too small for use in classification (a minimum of 6 markers per subcellular class is required in supervised machine learning analysis for parameter optimisation as discussed in the proceeding section).
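The size criterion described above can be sketched in a few lines of Python. This is an illustration only: it covers the minimum of 6 markers per class and not the enrichment criterion, the marker names are invented, and `drop_small_classes` is a hypothetical helper rather than a function from any package.

```python
from collections import Counter

MIN_MARKERS_PER_CLASS = 6  # minimum stated in the text above

def drop_small_classes(markers, min_markers=MIN_MARKERS_PER_CLASS):
    """Keep only markers whose class has at least min_markers members.

    markers maps a protein identifier to its subcellular class label.
    Returns the retained markers and the sorted list of dropped classes.
    """
    counts = Counter(markers.values())
    kept = {protein: label for protein, label in markers.items()
            if counts[label] >= min_markers}
    dropped = sorted(label for label, n in counts.items() if n < min_markers)
    return kept, dropped

# Toy marker list (hypothetical identifiers): 7 lysosome and 3 p-body markers.
markers = {f"lyso_{i}": "lysosome" for i in range(7)}
markers.update({f"pbody_{i}": "p-body" for i in range(3)})
kept, dropped = drop_small_classes(markers)
```

In this toy case the p-body class falls below the minimum and is removed, while the lysosome class is retained.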

Increasing organellar resolution
Prior to novelty detection and supervised machine learning classification, to increase the organelle resolution, replicates 1 and 2 were combined using simple data fusion 17 in which quantitative TMT reporter ion ratios (10 per protein per experiment) were concatenated across the rows of proteins common in the two datasets. This combined dataset results in 20 quantitative data columns per protein and a total of 5032 proteins. Experiment 3 was not included as little additional resolution was obtained by further data fusion.
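The concatenation step can be sketched in Python as follows. The example assumes each replicate is held as a mapping from protein identifier to its reporter ion profile. The function name `fuse_replicates` and the toy values are hypothetical, and three channels are used instead of ten for brevity.

```python
def fuse_replicates(rep1, rep2):
    """Concatenate the profiles of proteins quantified in both replicates.

    Proteins missing from either replicate are discarded, so every fused
    profile has len(rep1 profile) + len(rep2 profile) columns.
    """
    common = sorted(set(rep1) & set(rep2))
    return {protein: list(rep1[protein]) + list(rep2[protein])
            for protein in common}

# Toy replicates (invented values); P2 and P4 are each seen in one replicate only.
rep1 = {"P1": [0.1, 0.2, 0.7], "P2": [0.5, 0.3, 0.2], "P3": [0.3, 0.3, 0.4]}
rep2 = {"P1": [0.2, 0.2, 0.6], "P3": [0.4, 0.2, 0.4], "P4": [0.9, 0.05, 0.05]}
fused = fuse_replicates(rep1, rep2)
```

With two 10-plex experiments, the same operation yields 20 columns per protein, restricted to the proteins common to both datasets.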

Comparison of MS2 and SPS-MS3 cluster resolution
Protein-level cluster resolution in MS2 and SPS-MS3 data, and its repercussions for organelle proteomics, was compared graphically, as illustrated in Supplementary Figures 17 and 18. The MS2 and SPS-MS3 (first replicate only) experiments contained 7116 and 5491 proteins respectively. Despite the higher number of proteins and of peptide spectrum matches (PSMs) per protein in MS2, we demonstrate the negative impact of inaccurate quantification on sub-cellular resolution for proteins quantified by a limited number of PSMs. The histograms and density plots in Supplementary Figure 17 illustrate the higher number of proteins and PSMs per protein in MS2. Supplementary Figure 18 shows the SPS-MS3 (top) and MS2 (bottom) densities on the PCA plot for a set of PSM thresholds: from proteins with at least 20 PSMs per protein (left) to only a single PSM (right). Dense regions on the PCA plot are represented by darker shades in the figures. When considering proteins with a high number of PSMs (left), organelle clusters are clearly visible as darker groups. As proteins quantified by many PSMs are filtered out, down to single-PSM hits (right), the resolution of the sub-cellular clusters in the MS2 data is already lost at a threshold of 5 PSMs; the density of points concentrates in the middle of the PCA plot, a pattern that is characteristic of noisy, non-specific protein profiles. For SPS-MS3 data, cluster resolution (organellar cluster densities and their separation) remains visible even for single-PSM features.
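The PSM-based stratification behind such a comparison can be sketched in Python. This is a sketch under assumptions: only the 20, 5 and 1 PSM values appear in the text, so the bin edges used for each panel are a guess, the PSM counts are invented, and `select_by_psm_count` is a hypothetical helper.

```python
def select_by_psm_count(psm_counts, low, high=None):
    """Return proteins quantified by between low and high PSMs (inclusive).

    psm_counts maps a protein identifier to its number of PSMs;
    high=None means no upper bound.
    """
    return sorted(protein for protein, n in psm_counts.items()
                  if n >= low and (high is None or n <= high))

# Toy PSM counts (invented values).
psm_counts = {"P1": 25, "P2": 7, "P3": 1, "P4": 20, "P5": 3}

# One subset per hypothetical panel, from well-sampled to single-PSM proteins.
panels = {
    ">=20 PSMs": select_by_psm_count(psm_counts, 20),
    "5-19 PSMs": select_by_psm_count(psm_counts, 5, 19),
    "1 PSM": select_by_psm_count(psm_counts, 1, 1),
}
```

Each subset would then be projected onto the same principal components, so that differences in density between panels reflect quantification quality rather than a change of coordinates.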