Pathway-level information extractor (PLIER) for gene expression data

Abstract

A major challenge in gene expression analysis is to accurately infer relevant biological insights, such as variation in cell-type proportion or pathway activity, from global gene expression studies. We present pathway-level information extractor (PLIER) (https://github.com/wgmao/PLIER and http://gobie.csb.pitt.edu/PLIER), a broadly applicable solution for this problem that outperforms available cell proportion inference algorithms and can automatically identify specific pathways that regulate gene expression. Our method improves interstudy replicability and reveals biological insights when applied to trans-eQTL (expression quantitative trait loci) identification.

Fig. 1: PLIER overview.
Fig. 2

Data availability

Processed gene expression and cell proportion measurements generated for this study are available through the PLIER package. The raw data can be accessed through the Gene Expression Omnibus (GSE130824). The DGN dataset can be obtained from NIMH following the instructions provided in ref. 26. The NESDA dataset can be obtained from dbGaP (identifier: phs000486.v1).

References

  1. Leek, J. T. et al. Nat. Rev. Genet. 11, 733–739 (2010).
  2. Subramanian, A. et al. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
  3. Newman, A. M. et al. Nat. Methods 12, 453–457 (2015).
  4. Abbas, A. R. et al. PLoS One 4, e6098 (2009).
  5. Battle, A. et al. Genome Res. 24, 14–24 (2014).
  6. Westra, H.-J. et al. Nat. Genet. 45, 1238 (2013).
  7. Novershtern, N. et al. Cell 144, 296–309 (2011).
  8. Filiano, A. J. et al. Nature 535, 425–429 (2016).
  9. Wright, F. A. et al. Nat. Genet. 46, 430–437 (2014).
  10. Gieger, C. et al. Nature 480, 201–208 (2011).
  11. Olsson, A. et al. Nature 537, 698–702 (2016).
  12. The Cancer Genome Atlas Research Network. Nature 497, 67–73 (2013).
  13. Ross, D. T. et al. Nat. Genet. 24, 227–235 (2000).
  14. Alter, O., Brown, P. O. & Botstein, D. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
  15. Hore, V. et al. Nat. Genet. 48, 1094 (2016).
  16. Brunet, J.-P. et al. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
  17. Zou, H. & Hastie, T. J. R. Stat. Soc. Ser. B 67, 301–320 (2005).
  18. Witten, D. M. et al. Biostatistics 10, 515–534 (2009).
  19. Leek, J. T. et al. PLoS Genet. 3, e161 (2007).
  20. Levine, J. H. et al. Cell 162, 184–197 (2015).
  21. Wingett, S. W. & Andrews, S. F1000Res. 7, 1338 (2018).
  22. Dobin, A. et al. Bioinformatics 29, 15–21 (2013).
  23. Liao, Y., Smyth, G. K. & Shi, W. Bioinformatics 30, 923–930 (2013).
  24. Gaujoux, R. & Seoighe, C. Bioinformatics 29, 2211–2212 (2013).
  25. Zheng, S. C., Breeze, C. E., Beck, S. & Teschendorff, A. E. Nat. Methods 15, 1059–1066 (2018).
  26. Mostafavi, S. et al. Mol. Psychiatry 19, 1267–1274 (2014).
  27. Zerbino, D. R. et al. Nucleic Acids Res. 46, D754–D761 (2017).
  28. Machiela, M. J. & Chanock, S. J. Bioinformatics 31, 3555–3557 (2015).
  29. Storey, J. D. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).

Acknowledgements

This work was supported by the US National Institutes of Health (NIH) grants U54HG008540 and 5R03MH109008 to M.C., 1R01HG009299 to M.C. and W.M., and 5U19AI117873 and 5U24DK112331 to Z.E., S.S.C., and H.B.M. The authors acknowledge G. Nudelman for help with RNA-seq processing. This study uses data from dbGaP (phs000486.v1). Funding support for the Genetic Association Information Network (GAIN) Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (parent studies: NESDA and the Netherlands Twin Register (NTR)) was provided by the Netherlands Scientific Organization (904-61-090, 904-61-193, 480-04-004, 400-05-717, Netherlands Organisation for Scientific Research (NWO) Genomics, SPI 56-464-1419), the Centre for Neurogenomics and Cognitive Research (CNCR-VU), the European Union (EU/WLRT-2001-01254), ZonMW (geestkracht program, 10-000-1002), NIMH (RO1 MH059160) and matching funds from participating institutes in NESDA and NTR, and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The datasets used for the analyses described in this manuscript were obtained from dbGaP (http://www.ncbi.nlm.nih.gov/gap) through dbGaP accession number phs000486.v1.p1. Samples and associated phenotype data for the GAIN Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (PI, P. F. Sullivan, University of North Carolina) were provided by D. I. Boomsma and E. de Geus, VU University Amsterdam (PIs NTR), B. W. Penninx, VU University Medical Center Amsterdam, F. Zitman, Leiden University Medical Center, and W. Nolen, University Medical Center Groningen (PIs and site-PIs, NESDA).

Author information

M.C. conceived and led this work. W.M. and M.C. developed the analytical framework, analyzed data, and produced figures. M.C., W.M., Z.E., and S.S.C. drafted the manuscript. W.M. implemented the web interface. B.H.M. collected the RNA-seq and CyTOF data.

Correspondence to Maria Chikina.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lei Tang and Tal Nawy were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 PLIER decompositions are robust to normalization procedure.

We compare the PLIER decompositions obtained from two different versions of the DGN dataset: one normalized with all known technical factors and the other normalized only by quantile normalization. (Main panel) Heatmap of the Spearman rank correlations (computed across 922 samples) between LVs from the two decompositions. All pairwise correlations for the best-matched LVs (correlation >0.9) are shown. LVs are named with their corresponding top prior-information geneset (if any). Note that the prior information used is almost identical across the two decompositions. (Inset) The distribution of Spearman rank correlation values across all best reciprocal match pairs of LVs. LVs that use prior information (LVs with any non-zero U coefficients) are more robust to the normalization procedure, as they are more likely to have a high-correlation match across the two datasets. The boxplot displays the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.
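
The best-reciprocal-match step described in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy LV values, and the choice to compare signed (rather than absolute) correlations are all assumptions made for illustration. An LV in one decomposition is paired with an LV in the other only when each is the other's highest Spearman correlate.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def reciprocal_best_matches(lvs_a, lvs_b):
    """lvs_a, lvs_b: dict of LV name -> per-sample values (same sample order)."""
    corr = {(a, b): spearman(xa, xb)
            for a, xa in lvs_a.items() for b, xb in lvs_b.items()}
    best_for_a = {a: max(lvs_b, key=lambda b: corr[(a, b)]) for a in lvs_a}
    best_for_b = {b: max(lvs_a, key=lambda a: corr[(a, b)]) for b in lvs_b}
    # Keep a pair only if each LV is the other's top match.
    return {(a, b): corr[(a, b)]
            for a, b in best_for_a.items() if best_for_b[b] == a}

# Hypothetical toy data: two LVs per decomposition, five samples each.
decomp_full = {"LV1": [1, 2, 3, 4, 5], "LV2": [5, 3, 1, 2, 4]}
decomp_qn = {"LVx": [10, 20, 30, 40, 55], "LVy": [50, 30, 10, 20, 45]}
matches = reciprocal_best_matches(decomp_full, decomp_qn)
print(matches)
```

The correlation values in `matches` are what an inset like the one described would summarize; applying a threshold such as 0.9 to them would select the pairs shown in a heatmap.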

Supplementary Figure 2 PLIER decompositions isolate technical and biological variation in different components.

Using a PLIER decomposition obtained from the quantile-normalized DGN dataset, which has not been corrected for known technical factors, we plot the distribution of absolute values of Pearson correlation (across 922 samples) between technical factors and LVs. Since technical factors are correlated, we select a non-redundant set for our evaluation (PF_ALIGNED_BASES, MEDIAN_3PRIME_BIAS, PCT_MRNA_BASES, %heme, duplicate%, yield, set aside globin, length corr, GC corr), where the last two are per-sample correlations between quantified gene expression vectors and transcript length and GC content, respectively. LVs associated with prior information (LVs with non-zero U coefficients) and LVs without prior information (LVs with zero U coefficients) are contrasted using a two-sided Wilcoxon rank-sum test. The correlation for LVs with prior information is significantly lower for all but one technical factor (PF_ALIGNED_BASES). Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.

Supplementary Figure 3 Pathway associations for the DGN dataset.

We plot the U matrix for the complete set of LVs that were associated with at least one prior-information pathway with an FDR < 0.05. LVs correspond to columns and pathways correspond to rows. For ease of visualization, only the top pathway (largest U coefficient) for each LV is retained and the U matrix is split into two columns. We note that pathways for abundant cell types such as neutrophils and erythrocytes are top hits for multiple LVs.
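
As a rough illustration of "only the top pathway (largest U coefficient) for each LV is retained", the sketch below picks the row with the largest coefficient in each column of a small U matrix. The matrix values, pathway names and function name are hypothetical, the coefficients are assumed to be non-negative, and the FDR filtering mentioned in the caption is not reproduced.

```python
def top_pathway_per_lv(U, pathways, lvs):
    """U: list of rows (one per pathway), each with one coefficient per LV.

    Returns LV name -> pathway with the largest coefficient, skipping LVs
    whose column is entirely zero (no prior information used).
    """
    top = {}
    for j, lv in enumerate(lvs):
        column = [row[j] for row in U]
        best = max(range(len(pathways)), key=lambda i: column[i])
        if column[best] > 0:
            top[lv] = pathways[best]
    return top

# Hypothetical 2-pathway x 3-LV matrix.
pathways = ["neutrophil", "erythrocyte"]
lvs = ["LV1", "LV2", "LV3"]
U = [
    [0.0, 0.7, 0.0],  # neutrophil coefficients for LV1..LV3
    [0.4, 0.2, 0.0],  # erythrocyte coefficients for LV1..LV3
]
top = top_pathway_per_lv(U, pathways, lvs)
print(top)
```

Here `LV3` is dropped because it has no non-zero coefficient, mirroring the distinction the captions draw between LVs with and without prior information.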

Supplementary Figure 4 Pathway-centric PLIER eQTLs have a higher replication rate.

We compare trans-eQTLs discovered with the standard gene-centric approach to those discovered by PLIER with respect to independent replication in the NESDA dataset. While the gene-centric approach considers all possible SNP-gene associations that satisfy valid trans-eQTL criteria (see Methods), the pathway-centric approach considers only those gene-level effects that are associated with pathway-level eQTLs. We find that at the same raw p-value threshold, pathway-centric eQTLs have a notably higher replication rate. Since pathway-centric associations are by construction linked to a pathway-level effect, they are more likely to represent real and replicable indirect associations. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test. P-values indicated on the x-axis are uncorrected.

Supplementary Figure 5 Top scoring Immgen cell-types for genotype-associated LVs with no or ambiguous PLIER pathway annotations.

To add further interpretation to the pathway-level eQTLs that had either no or ambiguous pathway associations, we investigate the expression of top-loading genes in a comprehensive database of mouse immune cells (Immgen). This database was not used in the PLIER decomposition, so it provides an independent source of immune gene expression patterns. Z-scores were computed across 62 immune cell types in the Immgen dataset. The 10 cell types with the top median z-score across the 20 genes with the highest loading values are shown.
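
The ranking described here (z-score each gene across cell types, then take the median z-score per cell type over the top-loading genes) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' implementation: the function names and toy expression values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, median, pstdev

def zscore(values):
    """Z-score a gene's expression across cell types (population SD assumed).

    Assumes the gene is not constant across cell types (SD > 0).
    """
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def top_cell_types(expression, top_genes, cell_types, n_top=10):
    """expression: gene -> list of expression values, one per cell type."""
    z = [zscore(expression[g]) for g in top_genes]
    # Median z-score of each cell type across the selected genes.
    med = {ct: median(row[i] for row in z) for i, ct in enumerate(cell_types)}
    return sorted(med, key=med.get, reverse=True)[:n_top]

# Hypothetical toy data: three genes measured in three cell types.
cell_types = ["B cell", "T cell", "Neutrophil"]
expression = {
    "g1": [1, 2, 9],
    "g2": [2, 1, 8],
    "g3": [5, 1, 6],
}
top = top_cell_types(expression, ["g1", "g2", "g3"], cell_types, n_top=2)
print(top)
```

With 62 cell types and the 20 highest-loading genes of an LV, the same call with `n_top=10` would produce a ranking of the kind the caption describes.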

Supplementary Figure 6 Top genes for all genotype-associated LVs.

We plot the top 15 genes for all LVs that had a significant genotype association. Data is plotted as z-scores across 922 subjects.

Supplementary Figure 7 The effects of rs1354034 on LVs 44 and 133 are independent.

We plot the relationship between LVs and minor allele counts of rs1354034 using raw LV estimates (first row) or corrected estimates, using residuals from the linear regression fit on the other LV (second row). We find that while the estimates for these two LVs are correlated, the eQTL effects are substantially improved when regressing one LV on the other and using the residuals for eQTL testing. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test, and uncorrected p-values are reported. Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.
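
The correction step in this caption, taking residuals from a linear regression of one LV on the other, can be sketched with a simple least-squares fit. The function name and the six-sample toy values are hypothetical, and a single-predictor fit with an intercept is assumed; the subsequent eQTL test is not shown.

```python
from statistics import mean

def residuals(y, x):
    """Residuals of the least-squares fit y ~ intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical per-subject LV estimates (toy values, not from the study).
lv_133 = [0.0, 2.0, 1.0, 3.0, 5.0, 4.0]
lv_44 = [0.1, 2.0, 1.6, 3.4, 6.1, 4.9]

# LV 44 with the component linearly explained by LV 133 removed.
lv_44_corrected = residuals(lv_44, lv_133)
print(lv_44_corrected)
```

The corrected values would then be tested against minor allele counts with a rank correlation, for example with `scipy.stats.spearmanr` if SciPy is available.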

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Note 1 and Supplementary Data 1

Reporting Summary

Source data

Cite this article

Mao, W., Zaslavsky, E., Hartmann, B.M. et al. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods 16, 607–610 (2019). https://doi.org/10.1038/s41592-019-0456-1
