  • Brief Communication

Pathway-level information extractor (PLIER) for gene expression data

Abstract

A major challenge in gene expression analysis is to accurately infer relevant biological insights, such as variation in cell-type proportion or pathway activity, from global gene expression studies. We present pathway-level information extractor (PLIER) (https://github.com/wgmao/PLIER and http://gobie.csb.pitt.edu/PLIER), a broadly applicable solution for this problem that outperforms available cell proportion inference algorithms and can automatically identify specific pathways that regulate gene expression. Our method improves interstudy replicability and reveals biological insights when applied to trans-eQTL (expression quantitative trait loci) identification.

Fig. 1: PLIER overview.
Fig. 2

Data availability

Processed gene expression and cell proportion measurements generated for this study are available through the PLIER package. The raw data can be accessed through the Gene Expression Omnibus (GSE130824). The DGN dataset can be obtained from NIMH following the instructions provided in ref. 26. The NESDA dataset can be obtained from dbGaP (identifier: phs000486.v1).

References

  1. Leek, J. T. et al. Nat. Rev. Genet. 11, 733–739 (2010).
  2. Subramanian, A. et al. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
  3. Newman, A. M. et al. Nat. Methods 12, 453–457 (2015).
  4. Abbas, A. R. et al. PLoS One 4, e6098 (2009).
  5. Battle, A. et al. Genome Res. 24, 14–24 (2014).
  6. Westra, H.-J. et al. Nat. Genet. 45, 1238 (2013).
  7. Novershtern, N. et al. Cell 144, 296–309 (2011).
  8. Filiano, A. J. et al. Nature 535, 425–429 (2016).
  9. Wright, F. A. et al. Nat. Genet. 46, 430–437 (2014).
  10. Gieger, C. et al. Nature 480, 201–208 (2011).
  11. Olsson, A. et al. Nature 537, 698–702 (2016).
  12. The Cancer Genome Atlas Research Network. Nature 497, 67–73 (2013).
  13. Ross, D. T. et al. Nat. Genet. 24, 227–235 (2000).
  14. Alter, O., Brown, P. O. & Botstein, D. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
  15. Hore, V. et al. Nat. Genet. 48, 1094 (2016).
  16. Brunet, J.-P. et al. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
  17. Zou, H. & Hastie, T. J. R. Stat. Soc. Ser. B 67, 301–320 (2005).
  18. Witten, D. M. et al. Biostatistics 10, 515–534 (2009).
  19. Leek, J. T. et al. PLoS Genet. 3, e161 (2007).
  20. Levine, J. H. et al. Cell 162, 184–197 (2015).
  21. Wingett, S. W. & Andrews, S. F1000Res. 7, 1338 (2018).
  22. Dobin, A. et al. Bioinformatics 29, 15–21 (2013).
  23. Liao, Y., Smyth, G. K. & Shi, W. Bioinformatics 30, 923–930 (2013).
  24. Gaujoux, R. & Seoighe, C. Bioinformatics 29, 2211–2212 (2013).
  25. Zheng, S. C., Breeze, C. E., Beck, S. & Teschendorff, A. E. Nat. Methods 15, 1059–1066 (2018).
  26. Mostafavi, S. et al. Mol. Psychiatry 19, 1267–1274 (2014).
  27. Zerbino, D. R. et al. Nucleic Acids Res. 46, D754–D761 (2017).
  28. Machiela, M. J. & Chanock, S. J. Bioinformatics 31, 3555–3557 (2015).
  29. Storey, J. D. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).

Acknowledgements

This work was supported by the US National Institutes of Health (NIH) grants U54HG008540 and 5R03MH109008 to M.C., 1R01HG009299 to M.C. and W.M., and 5U19AI117873 and 5U24DK112331 to Z.E., S.S.C., and H.B.M. The authors acknowledge G. Nudelman for help with RNA-seq processing. This study uses data from dbGaP (phs000486.v1). Funding support for the Genetic Association Information Network (GAIN) Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (parent studies: NESDA and the Netherlands Twin Register (NTR)) was provided by the Netherlands Scientific Organization (904-61-090, 904-61-193, 480-04-004, 400-05-717, Netherlands Organisation for Scientific Research (NWO) Genomics, SPI 56-464-1419), the Centre for Neurogenomics and Cognitive Research (CNCR-VU), the European Union (EU/WLRT-2001-01254), ZonMW (geestkracht program, 10-000-1002), NIMH (RO1 MH059160) and matching funds from participating institutes in NESDA and NTR, and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The datasets used for the analyses described in this manuscript were obtained from dbGaP (http://www.ncbi.nlm.nih.gov/gap) through dbGaP accession number phs000486.v1.p1. Samples and associated phenotype data for the GAIN Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (PI, P. F. Sullivan, University of North Carolina) were provided by D. I. Boomsma and E. de Geus, VU University Amsterdam (PIs NTR), B. W. Penninx, VU University Medical Center Amsterdam, F. Zitman, Leiden University Medical Center, and W. Nolen, University Medical Center Groningen (PIs and site-PIs, NESDA).

Author information

Contributions

M.C. conceived and led this work. W.M. and M.C. developed the analytical framework, analyzed data, and produced figures. M.C., W.M., Z.E., and S.S.C. drafted the manuscript. W.M. implemented the web interface. B.H.M. collected the RNA-seq and CyTOF data.

Corresponding author

Correspondence to Maria Chikina.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lei Tang and Tal Nawy were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 PLIER decompositions are robust to normalization procedure.

We compare the PLIER decompositions obtained from two versions of the DGN dataset: one normalized with all known technical factors and the other normalized only by quantile normalization. (Main panel) Heatmap of the Spearman rank correlations (computed across 922 samples) between LVs from the two decompositions. All pairwise correlations for the best-matched LVs (correlation >0.9) are shown. LVs are named after their top prior-information gene set (if any). Note that the prior information used is almost identical across the two decompositions. (Inset) The distribution of Spearman rank correlation values across all best reciprocal match pairs of LVs. LVs that use prior information (LVs with any non-zero U coefficients) are more robust to the normalization procedure, as they are more likely to have a high-correlation match across the two datasets. The boxplot displays the 25th, 50th and 75th percentiles, with whiskers extending to 1.5x the interquartile range or the range of the data, whichever is smaller.
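The reciprocal best-match comparison described in this legend can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the dictionary layout (LV name mapped to per-sample values in a shared sample order) and the use of signed rather than absolute correlations are all assumptions made for the example.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def reciprocal_best_matches(lvs_a, lvs_b):
    """Pair LVs from two decompositions that are each other's best match.

    lvs_a, lvs_b: dict mapping a (hypothetical) LV name to per-sample values.
    Returns a list of (name_a, name_b, rho) tuples.
    """
    rho = {(a, b): spearman(va, vb)
           for a, va in lvs_a.items() for b, vb in lvs_b.items()}
    best_for_a = {a: max(lvs_b, key=lambda b: rho[(a, b)]) for a in lvs_a}
    best_for_b = {b: max(lvs_a, key=lambda a: rho[(a, b)]) for b in lvs_b}
    return [(a, b, rho[(a, b)])
            for a, b in best_for_a.items() if best_for_b[b] == a]


# Toy data: two LVs per decomposition, five samples each.
full_norm = {"A1": [1, 2, 3, 4, 5], "A2": [5, 3, 4, 1, 2]}
quantile_only = {"B1": [2, 4, 6, 8, 11], "B2": [9, 7, 8, 2, 3]}
matches = reciprocal_best_matches(full_norm, quantile_only)
```

Pairs whose correlation exceeds a chosen cutoff (0.9 in the main panel) could then be selected by filtering `matches` on the third element of each tuple.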

Supplementary Figure 2 PLIER decompositions isolate technical and biological variation in different components.

Using a PLIER decomposition obtained from the quantile-normalized DGN dataset, which has not been corrected for known technical factors, we plot the distribution of absolute values of the Pearson correlation (across 922 samples) between technical factors and LVs. Since technical factors are correlated, we select a non-redundant set for our evaluation (PF_ALIGNED_BASES, MEDIAN_3PRIME_BIAS, PCT_MRNA_BASES, %heme, duplicate%, yield, set aside globin, length corr, GC corr), where the last two are per-sample correlations between quantified gene expression vectors and transcript length and GC content, respectively. LVs associated with prior information (LVs with non-zero U coefficients) and LVs without prior information (LVs with zero U coefficients) are contrasted using a two-sided Wilcoxon rank-sum test. The correlation for LVs with prior information is significantly lower for all but one technical factor (PF_ALIGNED_BASES). Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5x the interquartile range or the range of the data, whichever is smaller.

Supplementary Figure 3 Pathway associations for the DGN dataset.

We plot the U matrix for the complete set of LVs that were associated with at least one prior-information pathway at FDR<0.05. LVs correspond to columns and pathways correspond to rows. For ease of visualization, only the top pathway (largest U coefficient) for each LV is retained and the U matrix is split into two columns. We note that pathways for abundant cell types such as neutrophils and erythrocytes are top hits for multiple LVs.
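Retaining only the top pathway per LV, as done for this figure, amounts to taking the row with the largest coefficient in each column of U. The sketch below assumes U is stored as a list of rows (pathways) with non-negative coefficients; the function name, the pathway and LV labels, and the omission of the FDR filter are simplifications introduced for illustration.

```python
def top_pathway_per_lv(u_matrix, pathways, lvs):
    """Return {lv: (pathway, coefficient)} for the largest U entry per LV.

    u_matrix[i][j] is the coefficient of pathway i for LV j (assumed >= 0).
    LVs whose column is all zeros use no prior information and are skipped.
    """
    top = {}
    for j, lv in enumerate(lvs):
        column = [row[j] for row in u_matrix]
        i = max(range(len(column)), key=column.__getitem__)
        if column[i] > 0:
            top[lv] = (pathways[i], column[i])
    return top


# Toy U matrix: 2 pathways (rows) x 3 LVs (columns); labels are hypothetical.
U = [[0.0, 0.7, 0.0],
     [0.3, 0.2, 0.0]]
top = top_pathway_per_lv(U, ["neutrophil", "erythrocyte"], ["LV1", "LV2", "LV3"])
```

In this toy example `LV3` has no non-zero coefficient and is therefore absent from the result, mirroring the restriction of the figure to LVs with pathway associations.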

Supplementary Figure 4 Pathway-centric PLIER eQTLs have a higher replication rate.

We compare trans-eQTLs discovered with the standard gene-centric approach to those discovered by PLIER with respect to independent replication in the NESDA dataset. While the gene-centric approach considers all possible SNP-gene associations that satisfy valid trans-eQTL criteria (see Methods), the pathway-centric approach considers only those gene-level effects that are associated with pathway-level eQTLs. We find that at the same raw p-value threshold, pathway-centric eQTLs have a notably higher replication rate. Since pathway-centric associations are by construction linked to a pathway-level effect, they are more likely to represent real and replicable indirect associations. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test. P-values indicated on the x-axis are uncorrected.

Supplementary Figure 5 Top scoring Immgen cell-types for genotype-associated LVs with no or ambiguous PLIER pathway annotations.

To aid interpretation of the pathway-level eQTLs that had either no or ambiguous pathway associations, we investigate the expression of top-loading genes in a comprehensive database of mouse immune cells (Immgen). This database was not used in the PLIER decomposition, so it provides an independent source of immune gene expression patterns. Z-scores were computed across 62 immune cell types in the Immgen dataset. The 10 cell types with the highest median z-score across the 20 genes with the highest loading values are shown.
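The scoring described here (per-gene z-scores across cell types, followed by a median over the top-loading genes) can be sketched as below. The data layout, the function name and the choice of the population standard deviation are assumptions for the example; the legend does not specify these details.

```python
from statistics import mean, median, pstdev


def top_cell_types(expression, top_genes, n_top=10):
    """Rank cell types by the median z-score of the given genes.

    expression: dict gene -> dict cell_type -> expression value.
    Each gene is z-scored across cell types; genes with zero variance
    are skipped because their z-scores are undefined.
    """
    z_by_cell = {}
    for gene in top_genes:
        profile = expression[gene]
        mu = mean(profile.values())
        sd = pstdev(profile.values())
        if sd == 0:
            continue
        for cell, value in profile.items():
            z_by_cell.setdefault(cell, []).append((value - mu) / sd)
    scores = {cell: median(zs) for cell, zs in z_by_cell.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_top]


# Toy data: 3 genes x 3 cell types (gene and cell-type names are hypothetical).
toy_expression = {
    "g1": {"T": 1, "B": 2, "NK": 9},
    "g2": {"T": 0, "B": 1, "NK": 5},
    "g3": {"T": 3, "B": 3, "NK": 3},  # constant, so it is skipped
}
ranking = top_cell_types(toy_expression, ["g1", "g2", "g3"], n_top=3)
```

With the settings in the legend, `top_genes` would hold the 20 genes with the highest loadings for an LV and `n_top` would stay at its default of 10.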

Supplementary Figure 6 Top genes for all genotype associated LVs.

We plot the top 15 genes for all LVs that had a significant genotype association. Data are plotted as z-scores across 922 subjects.

Supplementary Figure 7 The effects of rs1354034 on LVs 44 and 133 are independent.

We plot the relationship between LVs and minor allele counts of rs1354034 using raw LV estimates (first row) or corrected estimates, using residuals from the linear regression fit on the other LV (second row). We find that while the estimates for these two LVs are correlated, the eQTL effects are substantially improved when regressing one LV on the other and using the residuals for eQTL testing. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test, and uncorrected p-values are reported. Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5x the interquartile range or the range of the data, whichever is smaller.
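The correction step, in which one LV is regressed on the other and the residuals are kept, can be illustrated with a simple least-squares fit. This is a sketch under the assumption of a single-predictor linear model with an intercept; the function name and toy values are hypothetical.

```python
def residualize(y, x):
    """Return residuals of y after an ordinary least-squares fit y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Toy values standing in for two correlated LV estimates across four subjects.
lv_a = [0.0, 1.0, 2.0, 3.0]
lv_b = [1.0, 3.0, 5.0, 8.0]
lv_b_corrected = residualize(lv_b, lv_a)
```

The residuals are uncorrelated with the predictor by construction, so an association between `lv_b_corrected` and minor allele counts (tested, for example, with a Spearman rank correlation) cannot be explained by the linear dependence on the other LV.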

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Note 1 and Supplementary Data 1

Reporting Summary

Cite this article

Mao, W., Zaslavsky, E., Hartmann, B.M. et al. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods 16, 607–610 (2019). https://doi.org/10.1038/s41592-019-0456-1

  • DOI: https://doi.org/10.1038/s41592-019-0456-1
