Defining the consequences of genetic variation on a proteome-wide scale

Journal name:
Nature
Volume:
534
Pages:
500–505
DOI:
doi:10.1038/nature18270

Abstract

Genetic variation modulates protein expression through both transcriptional and post-transcriptional mechanisms. To characterize the consequences of natural genetic diversity on the proteome, here we combine a multiplexed, mass spectrometry-based method for protein quantification with an emerging outbred mouse model containing extensive genetic variation from eight inbred founder strains. By measuring genome-wide transcript and protein expression in livers from 192 Diversity Outbred (DO) mice, we identify 2,866 protein quantitative trait loci (pQTL) with twice as many local as distant genetic variants. These data support distinct transcriptional and post-transcriptional models underlying the observed pQTL effects. Using a sensitive approach to mediation analysis, we often identified a second protein or transcript as the causal mediator of distant pQTL. Our analysis reveals an extensive network of direct protein–protein interactions. Finally, we show that local genotype can provide accurate predictions of protein abundance in an independent cohort of Collaborative Cross (CC) mice.

Figures

    Figure 1: Tandem mass tag (TMT)-based liver proteomics in 192 DO mice.

    a, Overview of the breeding scheme to create the DO and CC mouse strains. b, Experimental overview of the genotyping, transcriptomics and proteomic analysis on 192 DO mouse livers from both sexes on a high-fat or chow diet.

    Figure 2: Global view of the liver proteome reveals distinct genetic models of protein regulation.

    a, Venn diagram showing the distribution of transcripts and proteins broken down into local or distant QTL. b, Histograms of Pearson correlations for each gene’s protein and transcript measurements after segregating into four groups (eQTL–pQTL (purple), pQTL–no eQTL (blue), eQTL–no pQTL (green) and no QTL (grey)). c, Local and distant pQTL LOD scores after transcript measurements were used as a covariate in the regression model, showing that local pQTL were mediated through their cognate transcripts, unlike distant pQTL. d, Model selection by Bayesian information criterion (BIC). Local pQTL (QTL_L) were mostly transcriptionally controlled, whereas distant pQTL (QTL_D) were regulated generally by post-transcriptional mechanisms.

    Figure 3: Examples of local pQTL that illustrate different models of regulation.

    a, DHTKD1 abundance is regulated by a local pQTL that probably acts proximally on transcript abundance. b, Dhtkd1 has a strong local eQTL (green) and local pQTL (blue), which corresponds to high correlation between transcript and protein abundance (inset; abundance data transformed to rank normal scores for comparison). c, The predicted founder strain abundance of DHTKD1 in the DO population mirrors the measured abundance of DHTKD1 in the founder strains (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). d, OMA1 follows a mode of regulation in which the pQTL acts directly on protein abundance without affecting transcript levels. e, OMA1 protein abundance is controlled by a strong local pQTL without a corresponding local eQTL, leading to low correlation (inset) observed between protein and transcript abundance. f, The predicted founder strain expression in the DO population is highly correlated to measured OMA1 abundance in the founder strains (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values).

    Figure 4: Mediation of distant pQTL reveals network interactions in the liver proteome.

    a, The genetic variant underlying the distant TMEM68 pQTL acts proximally in cis on Nnt transcript and protein abundance. b, TMEM68 protein abundance is buffered against local genetic variation affecting transcript levels by a distant regulator on chromosome 13. c, Mediation analysis identified NNT protein and Nnt transcript as the likely mediator. d, TMEM68 protein is poorly correlated to its corresponding transcript, but highly correlated with both NNT protein and Nnt transcript abundance. e, TMEM68 strain abundance predicted at the chromosome 13 distant pQTL in the DO population is highly correlated to TMEM68 and NNT abundance measured in the founder strains, and matches the predicted NNT strain abundance in the DO population (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). In all cases the C57BL/6J allele is observed to be the low expressor. f, The chromosome 5 variant responsible for the distant effect on CCT2 abundance acts proximally in cis on Cct6a transcript and protein abundance. g, All members of the chaperonin containing Tcp1 (CCT) complex including CCT2 exhibit a distant pQTL that maps to distal chromosome 5. h, Mediation analysis identified Cct6a/CCT6A as the probable mediator of this effect. Protein mediation shows that the protein abundance of all CCT complex members is highly correlated as all members are pulled down in the background of the mediation plot. i, CCT2 protein abundance is highly correlated to CCT6A protein and Cct6a transcript abundance. All other CCT complex members show this same pattern. j, CCT2 abundance predicted at the chromosome 5 distant pQTL is highly correlated with CCT2 and CCT6A abundance measured in the founder strains, and tracks with CCT6A abundance predicted at the pQTL in the DO population (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). DO animals that derive the chromosome 5 region from NOD/ShiLtJ have lower abundance of all CCT proteins.
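
The mediation logic in panels c and h can be sketched as a conditional QTL scan: the LOD score of the target protein at the distant locus is recomputed with each candidate mediator added as a covariate, and a true mediator should absorb most of the signal. The Python sketch below is a minimal illustration, not the authors' `intermediate` R package. It assumes a single additive genotype dosage (the DO mapping model uses eight founder haplotypes), simulated data, and hypothetical helper names (`rss`, `lod`).

```python
import math
import random

def rss(y, columns):
    """Residual sum of squares of y on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y]; assumes predictors are not collinear.
    M = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (M[i][k] - sum(M[i][j] * beta[j] for j in range(i + 1, k))) / M[i][i]
    return sum((y[r] - sum(X[r][j] * beta[j] for j in range(k))) ** 2 for r in range(n))

def lod(y, genotype, covariates=()):
    """LOD score for adding genotype to a model that already holds the covariates."""
    n = len(y)
    rss_null = rss(y, list(covariates))
    rss_full = rss(y, list(covariates) + [genotype])
    return (n / 2) * math.log10(rss_null / rss_full)

# Simulated causal chain: genotype -> mediator -> target (all values hypothetical).
rng = random.Random(1)
n = 200
genotype = [float(rng.choice([0, 1, 2])) for _ in range(n)]
mediator = [g + rng.gauss(0, 0.5) for g in genotype]
target = [m + rng.gauss(0, 0.5) for m in mediator]
unrelated = [rng.gauss(0, 1) for _ in range(n)]

lod_unconditioned = lod(target, genotype)
lod_given_mediator = lod(target, genotype, [mediator])
lod_given_unrelated = lod(target, genotype, [unrelated])
print(lod_unconditioned, lod_given_mediator, lod_given_unrelated)
```

In this toy setting, conditioning on the true mediator should collapse the LOD score, while conditioning on an unrelated variable should leave it largely unchanged. That contrast is what the background of a mediation plot displays across all candidate mediators.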

    Figure 5: Genotype can be an accurate predictor of protein abundance.

    a, Founder strain protein abundance values inferred at significant pQTL in the DO population closely match measured abundance values from the founder strains themselves. The distributions of Pearson correlations are plotted for local pQTL and distant pQTL. Local pQTL are generally more predictive of abundance values in the founder strains (local median r = 0.72, distant median r = 0.11). b, Founder strain allele predictions from the DO were also assessed against protein abundance data collected from four CC strains (n = 6 mice per strain). We observe that local pQTL are more predictive of protein abundance in the CC strains (local median r = 0.63; distant median r = 0.22). c, Predictive power depends largely on the significance of the pQTL. Local pQTL generally had higher LOD scores, and as such we had higher power to predict these proteins (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). An example is shown for LYPLAL1. d, Protein abundance could also be predicted for genes with significant distant pQTL in the DO population; however, as a group these predictions were modest compared to local pQTL. As an example, NAGS abundance in the CC strains could be predicted based on the local genotype at its mediator protein, GLYCTK (n = 6 mice for each CC strain, 3 male and 3 female, black bars represent median values).
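
The comparison in panels a and b reduces to a Pearson correlation, for each pQTL, between the eight founder allele effects inferred in the DO and the abundance measured in the corresponding strains. The sketch below is illustrative only. It assumes that measured abundance is summarized as the per-strain median (the captions plot medians, but the exact summary used for the correlation is not stated here), and the function names and numbers are hypothetical.

```python
import math
from statistics import median

# The eight founder strains of the DO and CC populations.
FOUNDERS = ["A/J", "C57BL/6J", "129S1/SvImJ", "NOD/ShiLtJ",
            "NZO/HlLtJ", "CAST/EiJ", "PWK/PhJ", "WSB/EiJ"]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prediction_accuracy(allele_effects, founder_measurements):
    """Correlate DO-inferred allele effects with per-strain median measured abundance."""
    predicted = [allele_effects[s] for s in FOUNDERS]
    measured = [median(founder_measurements[s]) for s in FOUNDERS]
    return pearson(predicted, measured)

# Hypothetical values: measured abundance is a linear function of the allele effect,
# with four mice per strain scattered symmetrically around the strain value.
effects = {s: 0.1 * i for i, s in enumerate(FOUNDERS)}
measurements = {s: [2 * e + 1 + d for d in (-0.2, -0.1, 0.1, 0.2)]
                for s, e in effects.items()}
r = prediction_accuracy(effects, measurements)
print(round(r, 6))
```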

    Extended Data Fig. 1: Proteomic profiling of the eight founder strains used to create the DO mouse population.

    a, A multiplexed TMT proteomics method was used to characterize protein expression for the eight founder strains with two biological replicates for each strain using both sexes. In total, just over 400,000 peptides were quantified corresponding to 7,699 proteins. b, Hierarchical clustering and principal component analysis determined that the major source of variation in protein expression is due to genetic variation among the eight strains and the sex within strains. c, K-means clustering and gene set enrichment determined that each of the clusters was specifically enriched for metabolic pathways, biological process or cellular components. d, Proteins representing each of the displayed clusters from c. These proteins have specific patterns of expression as exemplified by PCK1, which was highly expressed in the NOD strain. Other examples include SCD1, which was highly expressed in C57BL/6J and NZO strains (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). Protein abundance is shown as the percentage contribution of that mouse’s protein levels to its respective 10-plex.
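
The abundance scale described at the end of this caption can be illustrated with a short calculation. The sketch below assumes that "percentage contribution" means a channel's share of the summed signal for that protein across the ten TMT channels of one 10-plex. The function name and intensities are hypothetical.

```python
def percent_of_plex(intensities):
    """Express each channel's intensity as a percentage of the 10-plex total."""
    total = sum(intensities)
    return [100.0 * x / total for x in intensities]

# Hypothetical reporter-ion intensities for one protein in one 10-plex.
channels = [120.0, 80.0, 100.0, 100.0, 90.0, 110.0, 100.0, 100.0, 95.0, 105.0]
shares = percent_of_plex(channels)
print([round(s, 1) for s in shares])
```

On this scale a protein with equal abundance in all ten samples sits at 10% in every channel, so deviations from 10% indicate relative enrichment or depletion in a given mouse.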

    Extended Data Fig. 2: The influence of sex and diet on protein and transcript abundance.

    a, Principal component analysis aligns well with sex and diet as major experimental contributors of variation in protein abundance. b, Female-specific protein abundance profiles for SULT2A1 and FMO3. c, Male-specific protein abundance profiles for CYP4A12A and MUP3. d, e, Diet also resulted in the regulation of many proteins, which are represented by proteins such as SCD1 and ACACA that increased in abundance and proteins such as HMGCR and SQLE that decreased in abundance. f, Principal component analysis aligns well with sex and diet as major experimental contributors of variation in transcript abundance. g–j, Transcript scatter plots for the proteins in b–e. Transcript abundance data were transformed to rank normal scores for plotting.
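
The rank normal transformation mentioned above replaces each value with the standard normal quantile of its rank, which puts transcript and protein measurements on a common scale. The sketch below shows one common variant, assuming the quantile is taken at (rank − 0.5)/n and that ties receive their average rank. The source does not state which offset was used, so `offset` is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist

def rank_normal_scores(values, offset=0.5):
    """Map values to standard normal quantiles of their ranks (ties share the mean rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the block of values tied with the value at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]

scores = rank_normal_scores([3.0, 1.0, 2.0])
print([round(s, 3) for s in scores])
```

Because only ranks enter the calculation, the transformed values are insensitive to outliers and to the original measurement units.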

    Extended Data Fig. 3: Genetic effects drive much of the observed expression variance in the RNA-seq and proteomics data.

    Liver transcript and protein abundance are highly variable in the DO population. Among the discovery set (n = 6,707 proteins, 6,647 genes), much of this variance can be attributed to one or more experimental variables and/or genetic effects. a–c, The experimental covariates sex and diet influence many transcripts and proteins in an additive manner; however, the interaction of sex and diet does not seem to affect many genes. The effects from sex and diet are not biased towards one molecular species; that is, similar numbers of transcripts and proteins are similarly affected by these experimental variables. Genetic variation underlies many of the most variable transcripts and proteins. d, e, Local genetic variation in particular is a strong driver of expression variation for many genes, while distant genetic effects are observed but more subtle. Among the discovery set, we observe more and larger genetic effects (both local and distant) on transcript abundance than protein abundance. f, For most transcripts and proteins detected in this study, expression variation is minimal, cannot be attributed to a known experimental or genetic variable, and is plotted as noise. g, pQTL map for all 6,707 proteins tested from genetic linkage analysis. h, i, QTL mapping identified the genetic loci that underlie variability in transcript abundance (eQTL). For the discovery set of transcripts with detected proteins and the larger set of all expressed genes, the location of the eQTL is plotted on the x axis and the location of the controlled gene is plotted on the y axis. Most genetic effects are local and map to the same location as the gene, as evidenced by the prominent diagonal line in both maps.

    Extended Data Fig. 4: Replication rates for eQTL are highly correlated with effect size, and local eQTL replicate at higher rates than distant eQTL.

    a, To assess replication of eQTL, an independent set of 192 DO liver RNA-seq samples was analysed (‘replication set’) and compared to the discovery set. A total of 16,839 genes were expressed in half or more samples in both data sets. For each gene, the most significant proximal locus (within ± 10 Mb of gene) and distant locus (located on a different chromosome from the gene) were identified from the discovery set; LOD scores at these loci are plotted on the x axis (local in red; distant in blue). Next, the most significant loci within a 10-Mb window flanking the local and distant loci from the discovery set were identified in the replication set and plotted on the y axis. LOD scores are highly correlated at these peak loci (local Pearson r = 0.91; distant r = 0.84). b, For the core set of 6,707 proteins (6,647 gene ids), pQTL and eQTL overlap were compared at multiple genome-wide P value thresholds from 0.01 to 0.2. Again, one maximum proximal locus and one maximum distant locus were identified for each gene/protein, and recorded if it met the P value cut off. Local pQTL exhibit high overlap with both the discovery eQTL set and replication eQTL set, regardless of P value threshold (67–80%). Distant pQTL exhibit slightly higher overlap with eQTL at the most stringent P value cut off; however, overlap is consistently low for distant pQTL (<1–2%). Local eQTL overlap well with the replication eQTL set regardless of P value threshold (75–77%). Distant eQTL replicate poorly overall (3–31%), but overlap rate is highest (31%) at the most stringent P value threshold, suggesting that larger sample sizes will be required to fully and accurately characterize distant effects on gene expression. c, The maximum proximal locus and distant locus were identified for each of the 6,707 proteins and transcripts, and the cumulative distribution of their LOD scores is plotted (blue = proteins, green = transcripts). LOD score is plotted on the x axis, and the proportion of total QTL is plotted on the y axis. Local eQTL as a group exhibit higher LOD scores (consistent with higher effect sizes) than local pQTL (ninetieth percentile LOD = 23.9 for local eQTL, 13.6 for pQTL), while distant eQTL and pQTL are of similar scale (ninetieth percentile LOD = 7.9 for distant eQTL, 8.2 for distant pQTL). d, Comparison of pQTL from the discovery set to eQTL from the discovery set (left set of Venn diagrams) and eQTL from the replication set (right). As expected given that they derive from the same samples, local pQTL and eQTL overlap is observed to be higher in the discovery set (1,392 out of 1,736 = 80%); however, local pQTL still overlap well with eQTL from the replication set (1,273 out of 1,736 = 73%). Distant pQTL overlap poorly with both eQTL sets (9 out of 1,048 in discovery set; 8 out of 1,048 in replication set); however, 6 of 9 distant pQTL that do overlap with eQTL in the discovery set are also identified as overlapping in the replication set.
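
The local and distant definitions used in panel a can be written as a small classifier. The sketch below follows the caption's thresholds (local: within ± 10 Mb of the gene; distant: on a different chromosome). The function name, the coordinates in the example, and the label for same-chromosome loci outside the window are hypothetical.

```python
def classify_qtl(gene_chr, gene_pos_mb, qtl_chr, qtl_pos_mb, window_mb=10.0):
    """Classify a QTL relative to the gene it affects, using positions in megabases."""
    if qtl_chr != gene_chr:
        return "distant"
    if abs(qtl_pos_mb - gene_pos_mb) <= window_mb:
        return "local"
    # Same chromosome but outside the window: neither category in this scheme.
    return "same chromosome, outside window"

# Hypothetical coordinates for illustration only.
print(classify_qtl("5", 130.0, "5", 134.5))   # within 10 Mb on the same chromosome
print(classify_qtl("4", 3.6, "13", 119.0))    # different chromosome
print(classify_qtl("2", 20.0, "2", 150.0))    # same chromosome, far away
```

Under these definitions a locus on the same chromosome but more than 10 Mb from the gene falls into neither group, which keeps the distant category free of long-range linkage to a local effect.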

    Extended Data Fig. 5: BIC model selection reveals transcriptional mechanisms driving most local pQTL and post-transcriptional mechanisms underlying most distant pQTL.

    We identified the local and distant QTL with the maximum LOD score (regardless of significance) for each of the 6,707 proteins, and used BIC to assess eight models linking QTL genotype to transcript and protein abundance. Most proteins are not affected by the local or distant QTL, and fall in one of the three groups below outlined by the dotted line. Among the five models where a QTL effect on protein abundance is detected, two are transcriptional in nature (L1, L2; D1, D2); the QTL effect on protein abundance is conferred at least partially through the transcript. The remaining three genetic models are post-transcriptional (L3–5; D3–5); the QTL effect on protein abundance is not mediated through the transcript. The transcriptional L1 and L2 models are identified as the best models for most local pQTL, while the post-transcriptional D3 and D4 models are optimal for most distant pQTL.
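
The idea of BIC-based model selection described above can be sketched with a reduced set of candidate models. The example below is a simplified illustration, not the eight models (L1–5, D1–5 and the no-effect groups) used in the paper. It assumes a single additive genotype dosage Q, Gaussian linear models for protein abundance P with transcript abundance T as an optional predictor, and BIC computed as n·ln(RSS/n) + k·ln(n). All names and simulated values are hypothetical.

```python
import math
import random

def rss(y, columns):
    """Residual sum of squares of y on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y]; assumes predictors are not collinear.
    M = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (M[i][k] - sum(M[i][j] * beta[j] for j in range(i + 1, k))) / M[i][i]
    return sum((y[r] - sum(X[r][j] * beta[j] for j in range(k))) ** 2 for r in range(n))

def bic(y, columns):
    """BIC of a Gaussian linear model; k counts the intercept and slopes."""
    n = len(y)
    k = len(columns) + 1
    return n * math.log(rss(y, columns) / n) + k * math.log(n)

def score_models(q, t, p):
    """BIC for four candidate explanations of protein abundance (lower is better)."""
    candidates = {"P ~ 1": [], "P ~ Q": [q], "P ~ T": [t], "P ~ Q + T": [q, t]}
    return {name: bic(p, cols) for name, cols in candidates.items()}

rng = random.Random(7)
n = 200
q = [float(rng.choice([0, 1, 2])) for _ in range(n)]

# Transcriptional case: Q -> T -> P.
t1 = [g + rng.gauss(0, 0.5) for g in q]
p1 = [x + rng.gauss(0, 0.5) for x in t1]
transcriptional = score_models(q, t1, p1)

# Post-transcriptional case: Q -> P directly; T varies independently of Q.
t2 = [rng.gauss(0, 1) for _ in range(n)]
p2 = [g + rng.gauss(0, 0.5) for g in q]
post_transcriptional = score_models(q, t2, p2)

print(min(transcriptional, key=transcriptional.get))
print(min(post_transcriptional, key=post_transcriptional.get))
```

The penalty term k·ln(n) is what lets BIC prefer the transcript-only model when genotype adds no information beyond the transcript, which mirrors the distinction between the transcriptional and post-transcriptional model groups.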

    Extended Data Fig. 6: Examples of local pQTL that are due to an underlying eQTL and those that are due to post-transcriptional mechanisms.

    a, The protein DHTKD1 contained a local acting eQTL and pQTL, which was associated with increased transcript and protein abundance derived from 129S1/SvImJ, CAST/EiJ, PWK/PhJ and WSB/EiJ strains. Mice were divided into three groups depending on whether their genomes contained 0, 1 or 2 of the alleles found to be associated with the pQTL. These increases in protein abundance were further validated using the proteomic analysis of the founder strains. b, c, Similarly, Ces2h and Pipox had both a local acting eQTL and pQTL that could be associated with specific strains (CAST/EiJ, PWK/PhJ and WSB/EiJ). These protein abundance measurements were further validated using the founder strains data set. d, e, Alternatively, 10% of the genes had local pQTL but lacked local eQTLs, which is evident in proteins such as ENTPD5 and OMA1. The founder allele expression patterns inferred at the pQTL were validated by protein abundance measurements in the founder strains, which could be explained by CAST/EiJ-specific missense mutations in both genes. f, Likewise, Lars2 also contained a pQTL that had no observable eQTL that showed a decrease in protein abundance in the 129S1/SvImJ, CAST/EiJ, PWK/PhJ and WSB/EiJ strains. Genome sequencing determined that these strains share four missense mutations (*P < 0.01 using a Student’s t-test; for founder strains, n = 4 mice for each founder, 2 male and 2 female, error bars represent s.d.).

    Extended Data Fig. 7: The causal relationship between genetic variation and protein expression was determined for over 700 proteins as inferred by mediation analysis.

    a–d, Many of the causal relationships between proteins have been previously documented, such as the associations between SNX7–SNX4, PGAM1–PGAM2, LRRFIP1–FLII and PPIF–PPIE. e–h, In addition, many of the protein associations had not been previously documented, such as UPB1–MTR, FOCAD–AVEN, AGPAT9–CHP1 and ANXA1–ARAD1A. i–l, Protein associations were also identified for multimeric complexes such as ECSIT–NDUFAF1–TMEM126B, DMXL2–ROGDI–WDR7, PIGU–PIGT–PIGS and IKBKAP–ELP2–ELP3.

    Extended Data Fig. 8: Mediation analysis for CCT complex members details the effects of a QTL in Cct6a on protein abundance through post-transcriptional protein buffering.

    a–f, Mediation analysis for each member of the CCT complex identifies Cct6a as the causal intermediate. A local QTL for Cct6a affects transcript and protein abundance, and CCT6A abundance sets the abundance of other CCT proteins regardless of variation in their transcripts. For each of the complex members tested, all other complex members are confirmed to be co-regulated, providing additional supporting evidence for stoichiometric buffering.

    Extended Data Fig. 9: Distant pQTL and co-regulated proteins frequently correspond to complexes of physically interacting proteins.

    a, Distant pQTL and co-regulated proteins assemble to form a regulatory network, which is defined by protein clusters with distinct topologies. A total of 3,938 proteins/QTL are linked by 5,794 associations. Distant pQTL are depicted as purple arrows pointing from the inferred causal protein to its regulated pair. Co-regulated proteins are connected with green arrows emanating from the primary target protein. b, MCL clustering decomposes the distant pQTL network into 671 clusters. Cluster size varies considerably, although most clusters contain fewer than 20 proteins. c, Clusters extracted from the distant pQTL network frequently associate proteins with shared biological functions. More than half of clusters are enriched for at least one GO category, as depicted in the bar chart above. d–f, Three selected clusters of distant pQTL and co-regulated proteins. g, To understand the relationship between the distant pQTL associations and protein interactions, each distant pQTL and its co-regulated proteins were mapped to their human homologues in the BioPlex network of human protein interactions. To assess the tendency for these co-regulated proteins to cluster together, the median graph distance separating all pairs of co-regulated proteins was determined. The distribution of median distances observed for equal numbers of randomly selected proteins was also determined and used to assign a Z-score to each distant pQTL and its co-regulated proteins. h, Histogram depicting the Z-score distribution for distant pQTL and co-regulated proteins. Z-scores below −2.5 (highlighted in red) indicated that co-regulated proteins were unusually close within the BioPlex network. i–l, Selected distant pQTL and co-regulated proteins, mapped onto the BioPlex network of protein interactions. All shortest paths connecting distant pQTL and their regulated proteins have been extracted from the BioPlex network and displayed. Proteins inferred to be responsible for each QTL are purple, while primary regulated proteins are red and secondary co-regulated proteins are green. Grey circles represent neighbouring proteins in the BioPlex network that were not found to be co-regulated. Grey edges indicate BioPlex interactions, while blue edges denote co-regulation uncovered from trans-QTL analysis.
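
The Z-score procedure in panel g can be sketched as follows: compute the median shortest-path distance among a set of co-regulated proteins, repeat for random protein sets of equal size, and standardize the observed value against that null distribution. The example below is a minimal illustration on a toy ring network rather than BioPlex. It assumes a connected, unweighted graph, and the function names, number of random draws, and seed are hypothetical choices.

```python
import random
from collections import deque
from itertools import combinations
from statistics import mean, median, stdev

def distances_from(graph, source):
    """Breadth-first search: shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def median_pairwise_distance(graph, nodes):
    """Median shortest-path length over all pairs in nodes (assumes a connected graph)."""
    dist = {u: distances_from(graph, u) for u in nodes}
    return median(dist[u][v] for u, v in combinations(nodes, 2))

def clustering_z_score(graph, nodes, n_random=1000, seed=0):
    """Standardize the observed median distance against equal-sized random node sets."""
    rng = random.Random(seed)
    observed = median_pairwise_distance(graph, nodes)
    population = sorted(graph)
    null = [median_pairwise_distance(graph, rng.sample(population, len(nodes)))
            for _ in range(n_random)]
    return (observed - mean(null)) / stdev(null)

# Toy network: 40 nodes in a ring, each linked to its two neighbours (hypothetical data).
ring = {i: [(i - 1) % 40, (i + 1) % 40] for i in range(40)}
z = clustering_z_score(ring, [0, 1, 2, 3])
print(z)
```

A strongly negative Z-score indicates that the selected proteins sit closer together in the network than random sets of the same size, which is the sense in which the caption describes co-regulated proteins as unusually close.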

    Extended Data Fig. 10: Comparison of protein abundance in the DO and founder strains reveals a positive correlation between pQTL significance and predictive power.

    a, b, For all detected liver pQTL in the DO population, founder strain allelic contributions were derived from the mapping model and compared to protein abundance measured directly from the eight founder strains. Pearson correlations are plotted against the LOD score of the pQTL for both local and distant pQTL. Predictive power tracks well with pQTL significance. Local pQTL tend to be more significant and yield higher predictive power than distant pQTL; however, highly significant distant pQTL (>10 LOD) have comparable predictive power to local pQTL of similar significance.

Accession codes

Primary accessions

Gene Expression Omnibus

Author information

  1. These authors contributed equally to this work.

    • Joel M. Chick &
    • Steven C. Munger
  2. These authors jointly supervised this work.

    • Gary A. Churchill &
    • Steven P. Gygi

Affiliations

  1. Harvard Medical School, Boston, Massachusetts 02115, USA

    • Joel M. Chick,
    • Edward L. Huttlin &
    • Steven P. Gygi
  2. The Jackson Laboratory, Bar Harbor, Maine 04609, USA

    • Steven C. Munger,
    • Petr Simecek,
    • Kwangbom Choi,
    • Daniel M. Gatti,
    • Narayanan Raghupathy,
    • Karen L. Svenson &
    • Gary A. Churchill

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD002801 (http://www.proteomexchange.org/). Raw RNA-seq fastq files and processed gene-level data are archived at Gene Expression Omnibus (GEO) under accession number GSE72759. We implemented our mediation method as the R package, intermediate, which can be freely downloaded from http://github.com/churchill-lab/intermediate. The Genotyping by RNA-seq (GBRS) software is available for download from https://github.com/churchill-lab/gbrs.

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Proteomic profiling of the eight founder strains used to create the DO mouse population. (422 KB)

    a, A multiplexed TMT proteomics method was used to characterize protein expression for the eight founder strains with two biological replicates for each strain using both sexes. In total, just over 400,000 peptides were quantified corresponding to 7,699 proteins. b, Hierarchical clustering and principal component analysis determined that the major source of variation in protein expression is due to genetic variation among the eight strains and the sex within strains. c, K-means clustering and gene set enrichment determined that each of the clusters was specifically enriched for metabolic pathways, biological process or cellular components. d, Proteins representing each of the displayed clusters from c. These proteins have specific patterns of expression as exemplified by PCK1, which was highly expressed in the NOD strain. Other examples include SCD1, which was highly expressed in C57BL/6J and NZO strains (n = 4 mice for each founder, 2 male and 2 female, black bars represent median values). Protein abundance is shown as the percentage contribution of that mouse’s protein levels to its respective 10-plex.

  2. Extended Data Figure 2: The influence of sex and diet on protein and transcript abundance. (334 KB)

    a, Principal component analysis aligns well with sex and diet as major experimental contributors of variation in protein abundance. b, Female-specific protein abundance profiles for SULT2A1 and FMO3. c, Male-specific protein abundance profiles for CYP4A12A and MUP3. d, e, Diet also resulted in the regulation of many proteins, which are represented by proteins such as SCD1 and ACACA that increased in abundance and proteins such as HMGCR and SQLE that decreased in abundance. f, Principal component analysis aligns well with sex and diet as major experimental contributors of variation in transcript abundance. g–j, Transcript scatter plots for the proteins in b–e. Transcript abundance data were transformed to rank normal scores for plotting.

  3. Extended Data Figure 3: Genetic effects drive much of the observed expression variance in the RNA-seq and proteomics data. (532 KB)

    Liver transcript and protein abundance are highly variable in the DO population. Among the discovery set (n = 6,707 proteins, 6,647 genes), much of this variance can be attributed to one or more experimental variables and/or genetic effects. a–c, The experimental covariates sex and diet influence many transcripts and proteins in an additive manner; however, the interaction of sex and diet does not seem to affect many genes. The effects from sex and diet are not biased towards one molecular species; that is, similar numbers of transcripts and proteins are similarly affected by these experimental variables. Genetic variation underlies many of the most variable transcripts and proteins. d, e, Local genetic variation in particular is a strong driver of expression variation for many genes, while distant genetic effects are observed but more subtle. Among the discovery set, we observe more and larger genetic effects (both local and distant) on transcript abundance than protein abundance. f, For most transcripts and proteins detected in this study, expression variation is minimal, cannot be attributed to a known experimental or genetic variable, and is plotted as noise. g, pQTL map for all 6,707 proteins tested from genetic linkage analysis. h, i, QTL mapping identified the genetic loci that underlie variability in transcript abundance (eQTL). For the discovery set of transcripts with detected proteins and the larger set of all expressed genes, the location of the eQTL is plotted on the x axis and the location of the controlled gene is plotted on the y axis. Most genetic effects are local and map to the same location as the gene, as evidenced by the prominent diagonal line in both maps.

  4. Extended Data Figure 4: Replication rates for eQTL are highly correlated with effect size, and local eQTL replicate at higher rates than distant eQTL. (247 KB)

    a, To assess replication of eQTL, an independent set of 192 DO liver RNA-seq samples was analysed (‘replication set’) and compared to the discovery set. A total of 16,839 genes were expressed in half or more samples in both data sets. For each gene, the most significant proximal locus (within ± 10 Mb of gene) and distant locus (located on a different chromosome from the gene) were identified from the discovery set; LOD scores at these loci are plotted on the x axis (local in red; distant in blue). Next, the most significant loci within a 10-Mb window flanking the local and distant loci from the discovery set were identified in the replication set and plotted on the y axis. LOD scores are highly correlated at these peak loci (local Pearson r = 0.91; distant r = 0.84). b, For the core set of 6,707 proteins (6,647 gene ids), pQTL and eQTL overlap were compared at multiple genome-wide P value thresholds from 0.01 to 0.2. Again, one maximum proximal locus and one maximum distant locus were identified for each gene/protein, and recorded if it met the P value cut off. Local pQTL exhibit high overlap with both the discovery eQTL set and replication eQTL set, regardless of P value threshold (67–80%). Distant pQTL exhibit slightly higher overlap with eQTL at the most stringent P value cut off; however, overlap is consistently low for distant pQTL (<1–2%). Local eQTL overlap well with the replication eQTL set regardless of P value threshold (75–77%). Distant eQTL replicate poorly overall (3–31%), but overlap rate is highest (31%) at the most stringent P value threshold, suggesting that larger sample sizes will be required to fully and accurately characterize distant effects on gene expression. c, The maximum proximal locus and distant locus were identified for each of the 6,707 proteins and transcripts, and the cumulative distribution of their LOD scores is plotted (blue = proteins, green = transcripts). LOD score is plotted on the x axis, and the proportion of total QTL is plotted on the y axis. Local eQTL as a group exhibit higher LOD scores (consistent with higher effect sizes) than local pQTL (ninetieth percentile LOD = 23.9 for local eQTL, 13.6 for pQTL), while distant eQTL and pQTL are of similar scale (ninetieth percentile LOD = 7.9 for distant eQTL, 8.2 for distant pQTL). d, Comparison of pQTL from the discovery set to eQTL from the discovery set (left set of Venn diagrams) and eQTL from the replication set (right). As expected given that they derive from the same samples, local pQTL and eQTL overlap is observed to be higher in the discovery set (1,392 out of 1,736 = 80%); however, local pQTL still overlap well with eQTL from the replication set (1,273 out of 1,736 = 73%). Distant pQTL overlap poorly with both eQTL sets (9 out of 1,048 in discovery set; 8 out of 1,048 in replication set); however, 6 of 9 distant pQTL that do overlap with eQTL in the discovery set are also identified as overlapping in the replication set.

  5. Extended Data Figure 5: BIC model selection reveals transcriptional mechanisms driving most local pQTL and post-transcriptional mechanisms underlying most distant pQTL. (414 KB)

    We identified the local and distant QTL with the maximum LOD score (regardless of significance) for each of the 6,707 proteins, and used BIC to assess eight models linking QTL genotype to transcript and protein abundance. Most proteins are not affected by the local or distant QTL, and fall in one of the three groups below outlined by the dotted line. Among the five models where a QTL effect on protein abundance is detected, two are transcriptional in nature (L1, L2; D1, D2); the QTL effect on protein abundance is conferred at least partially through the transcript. The remaining three genetic models are post-transcriptional (L3–5; D3–5); the QTL effect on protein abundance is not mediated through the transcript. The transcriptional L1 and L2 models are identified as the best models for most local pQTL, while the post-transcriptional D3 and D4 models are optimal for most distant pQTL.

  6. Extended Data Figure 6: Examples of local pQTL that are due to an underlying eQTL and those that are due to post-transcriptional mechanisms. (484 KB)

    a, The protein DHTKD1 contained a local-acting eQTL and pQTL, which was associated with increased transcript and protein abundance derived from 129S1/SvImJ, CAST/EiJ, PWK/PhJ and WSB/EiJ strains. Mice were divided into three groups depending on whether their genomes contained 0, 1 or 2 of the alleles found to be associated with the pQTL. These increases in protein abundance were further validated using the proteomic analysis of the founder strains. b, c, Similarly, Ces2h and Pipox had both a local-acting eQTL and pQTL that could be associated with specific strains (CAST/EiJ, PWK/PhJ and WSB/EiJ). These protein abundance measurements were further validated using the founder strains data set. d, e, Alternatively, 10% of the genes had local pQTL but lacked local eQTL, which is evident in proteins such as ENTPD5 and OMA1. The founder allele expression patterns inferred at the pQTL were validated by protein abundance measurements in the founder strains, which could be explained by CAST/EiJ-specific missense mutations in both genes. f, Likewise, Lars2 also contained a pQTL with no observable eQTL, which showed a decrease in protein abundance in the 129S1/SvImJ, CAST/EiJ, PWK/PhJ and WSB/EiJ strains. Genome sequencing determined that these strains share four missense mutations (*P < 0.01 using a Student’s t-test; for founder strains, n = 4 mice for each founder, 2 male and 2 female, error bars represent s.d.).

  7. Extended Data Figure 7: The causal relationship between genetic variation and protein expression was determined for over 700 proteins as inferred by mediation analysis. (203 KB)

    a–d, Many of the causal relationships between proteins have been previously documented, such as the associations between SNX7–SNX4, PGAM1–PGAM2, LRRFIP1–FLII and PPIF–PPIE. e–h, In addition, many of the protein associations had not been previously documented, such as UPB1–MTR, FOCAD–AVEN, AGPAT9–CHP1 and ANXA1–ARAD1A. i–l, Protein associations were also identified for multimeric complexes such as ECSIT–NDUFAF1–TMEM126B, DMXL2–ROGDI–WDR7, PIGU–PIGT–PIGS and IKBKAP–ELP2–ELP3.

  8. Extended Data Figure 8: Mediation analysis for CCT complex members details the effects of a QTL in Cct6a on protein abundance through post-transcriptional protein buffering. (389 KB)

    a–f, Mediation analysis for each member of the CCT complex identifies Cct6a as the causal intermediate. A local QTL for Cct6a affects transcript and protein abundance, and CCT6A abundance sets the abundance of other CCT proteins regardless of variation in their transcripts. For each of the complex members tested, all other complex members are confirmed to be co-regulated, providing additional supporting evidence for stoichiometric buffering.

  9. Extended Data Figure 9: Distant pQTL and co-regulated proteins frequently correspond to complexes of physically interacting proteins. (650 KB)

    a, Distant pQTL and co-regulated proteins assemble to form a regulatory network, which is defined by protein clusters with distinct topologies. A total of 3,938 proteins/QTL are linked by 5,794 associations. Distant pQTL are depicted as purple arrows pointing from the inferred causal protein to its regulated pair. Co-regulated proteins are connected with green arrows emanating from the primary target protein. b, MCL clustering decomposes the distant pQTL network into 671 clusters. Cluster size varies considerably, although most clusters contain fewer than 20 proteins. c, Clusters extracted from the distant pQTL network frequently associate proteins with shared biological functions. More than half of clusters are enriched for at least one GO category, as depicted in the bar chart above. d–f, Three selected clusters of distant pQTL and co-regulated proteins. g, To understand the relationship between the distant pQTL associations and protein interactions, each distant pQTL and its co-regulated proteins were mapped to their human homologues in the BioPlex network of human protein interactions. To assess the tendency for these co-regulated proteins to cluster together, the median graph distance separating all pairs of co-regulated proteins was determined. The distribution of median distances observed for equal numbers of randomly selected proteins was also determined and used to assign a Z-score to each distant pQTL and its co-regulated proteins. h, Histogram depicting the Z-score distribution for distant pQTL and co-regulated proteins. Z-scores below −2.5 (highlighted in red) indicated that co-regulated proteins were unusually close within the BioPlex network. i–l, Selected distant pQTL and co-regulated proteins, mapped onto the BioPlex network of protein interactions. All shortest paths connecting distant pQTL and their regulated proteins have been extracted from the BioPlex network and displayed. Proteins inferred to be responsible for each QTL are purple, while primary regulated proteins are red and secondary co-regulated proteins are green. Grey circles represent neighbouring proteins in the BioPlex network that were not found to be co-regulated. Grey edges indicate BioPlex interactions, while blue edges denote co-regulation uncovered from trans-QTL analysis.

  10. Extended Data Figure 10: Comparison of protein abundance in the DO and founder strains reveals a positive correlation between pQTL significance and predictive power. (175 KB)

    a, b, For all detected liver pQTL in the DO population, founder strain allelic contributions were derived from the mapping model and compared to protein abundance measured directly from the eight founder strains. Pearson correlations are plotted against the LOD score of the pQTL for both local and distant pQTL. Predictive power tracks well with pQTL significance. Local pQTL tend to be more significant and yield higher predictive power than distant pQTL; however, highly significant distant pQTL (>10 LOD) have comparable predictive power to local pQTL of similar significance.
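
The comparison described in the legend of Extended Data Figure 10 can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes each pQTL is summarised by a LOD score, eight founder allele effects estimated from the DO mapping model, and eight protein abundance values measured in the founder strains. The record layout, the function names `pearson_r` and `predictive_power`, and all numbers are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def predictive_power(pqtl_records):
    """Return (LOD, Pearson r) pairs, one per pQTL record.

    Each record is a dict with hypothetical keys:
      'lod'               - LOD score of the pQTL
      'allele_effects'    - eight founder allele effects from the mapping model
      'founder_abundance' - eight measured founder-strain protein abundances
    """
    return [
        (rec["lod"], pearson_r(rec["allele_effects"], rec["founder_abundance"]))
        for rec in pqtl_records
    ]


# Made-up example records: one strong and one weak pQTL.
example = [
    {
        "lod": 25.0,
        "allele_effects": [0.9, -0.2, 0.1, -0.1, 0.0, -1.1, -0.9, 0.8],
        "founder_abundance": [11.8, 9.7, 10.1, 9.9, 10.2, 7.9, 8.3, 11.5],
    },
    {
        "lod": 7.5,
        "allele_effects": [0.1, -0.1, 0.2, 0.0, -0.2, 0.3, -0.3, 0.1],
        "founder_abundance": [10.2, 10.1, 9.8, 10.4, 10.0, 10.1, 9.9, 9.7],
    },
]

if __name__ == "__main__":
    for lod, r in predictive_power(example):
        print(f"LOD = {lod:5.1f}  Pearson r = {r:+.2f}")
```

Plotting the resulting pairs with LOD on the x axis and r on the y axis, separately for local and distant pQTL, would give a figure of the kind the legend describes.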

Supplementary information

Zip files

  1. Supplementary Tables (55.8 MB)

    This zipped file contains Supplementary Tables 1-9.

Additional data