Pathway-level information extractor (PLIER) for gene expression data

Mao, Weiguang; Zaslavsky, Elena; Hartmann, Boris M.; Sealfon, Stuart C.; Chikina, Maria

doi:10.1038/s41592-019-0456-1

Brief Communication
Published: 27 June 2019

Nature Methods volume 16, pages 607–610 (2019)

Abstract

A major challenge in gene expression analysis is to accurately infer relevant biological insights, such as variation in cell-type proportion or pathway activity, from global gene expression studies. We present pathway-level information extractor (PLIER) (https://github.com/wgmao/PLIER and http://gobie.csb.pitt.edu/PLIER), a broadly applicable solution for this problem that outperforms available cell proportion inference algorithms and can automatically identify specific pathways that regulate gene expression. Our method improves interstudy replicability and reveals biological insights when applied to trans-eQTL (expression quantitative trait loci) identification.

Data availability

Processed gene expression and cell proportion measurements generated for this study are available through the PLIER package. The raw data can be accessed through the Gene Expression Omnibus (GSE130824). The DGN dataset can be obtained from NIMH following the instructions provided in ref. 26. The NESDA dataset can be obtained from dbGaP (identifier: phs000486.v1).

References

1. Leek, J. T. et al. Nat. Rev. Genet. 11, 733–739 (2010).
2. Subramanian, A. et al. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
3. Newman, A. M. et al. Nat. Methods 12, 453–457 (2015).
4. Abbas, A. R. et al. PLoS One 4, e6098 (2009).
5. Battle, A. et al. Genome Res. 24, 14–24 (2014).
6. Westra, H.-J. et al. Nat. Genet. 45, 1238 (2013).
7. Novershtern, N. et al. Cell 144, 296–309 (2011).
8. Filiano, A. J. et al. Nature 535, 425–429 (2016).
9. Wright, F. A. et al. Nat. Genet. 46, 430–437 (2014).
10. Gieger, C. et al. Nature 480, 201–208 (2011).
11. Olsson, A. et al. Nature 537, 698–702 (2016).
12. The Cancer Genome Atlas Research Network. Nature 497, 67–73 (2013).
13. Ross, D. T. et al. Nat. Genet. 24, 227–235 (2000).
14. Alter, O., Brown, P. O. & Botstein, D. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
15. Hore, V. et al. Nat. Genet. 48, 1094 (2016).
16. Brunet, J.-P. et al. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
17. Zou, H. & Hastie, T. J. R. Stat. Soc. Ser. B 67, 301–320 (2005).
18. Witten, D. M. et al. Biostatistics 10, 515–534 (2009).
19. Leek, J. T. et al. PLoS Genet. 3, e161 (2007).
20. Levine, J. H. et al. Cell 162, 184–197 (2015).
21. Wingett, S. W. & Andrews, S. F1000Res. 7, 1338 (2018).
22. Dobin, A. et al. Bioinformatics 29, 15–21 (2013).
23. Liao, Y., Smyth, G. K. & Shi, W. Bioinformatics 30, 923–930 (2013).
24. Gaujoux, R. & Seoighe, C. Bioinformatics 29, 2211–2212 (2013).
25. Zheng, S. C., Breeze, C. E., Beck, S. & Teschendorff, A. E. Nat. Methods 15, 1059–1066 (2018).
26. Mostafavi, S. et al. Mol. Psychiatry 19, 1267–1274 (2014).
27. Zerbino, D. R. et al. Nucleic Acids Res. 46, D754–D761 (2017).
28. Machiela, M. J. & Chanock, S. J. Bioinformatics 31, 3555–3557 (2015).
29. Storey, J. D. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).

Acknowledgements

This work was supported by the US National Institutes of Health (NIH) grants U54HG008540 and 5R03MH109008 to M.C., 1R01HG009299 to M.C. and W.M., and 5U19AI117873 and 5U24DK112331 to E.Z., S.C.S., and B.M.H. The authors acknowledge G. Nudelman for help with RNA-seq processing. This study uses data from dbGaP (phs000486.v1). Funding support for the Genetic Association Information Network (GAIN) Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (parent studies: NESDA and the Netherlands Twin Register (NTR)) was provided by the Netherlands Scientific Organization (904-61-090, 904-61-193, 480-04-004, 400-05-717, Netherlands Organisation for Scientific Research (NWO) Genomics, SPI 56-464-1419), the Centre for Neurogenomics and Cognitive Research (CNCR-VU), the European Union (EU/WLRT-2001-01254), ZonMW (geestkracht program, 10-000-1002), NIMH (RO1 MH059160) and matching funds from participating institutes in NESDA and NTR, and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The datasets used for the analyses described in this manuscript were obtained from dbGaP (http://www.ncbi.nlm.nih.gov/gap) through dbGaP accession number phs000486.v1.p1. Samples and associated phenotype data for the GAIN Major Depression study: Stage 1 Genome-wide Association In Population Based Samples Study (PI, P. F. Sullivan, University of North Carolina) were provided by D. I. Boomsma and E. de Geus, VU University Amsterdam (PIs NTR), B. W. Penninx, VU University Medical Center Amsterdam, F. Zitman, Leiden University Medical Center, and W. Nolen, University Medical Center Groningen (PIs and site-PIs, NESDA).

Author information

Authors and Affiliations

Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
Weiguang Mao & Maria Chikina
Carnegie Mellon-University of Pittsburgh PhD Program in Computational Biology, Pittsburgh, PA, USA
Weiguang Mao & Maria Chikina
Department of Neurology and Center for Advanced Research on Diagnostic Assays, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Elena Zaslavsky, Boris M. Hartmann & Stuart C. Sealfon

Contributions

M.C. conceived and led this work. W.M. and M.C. developed the analytical framework, analyzed data, and produced figures. M.C., W.M., E.Z., and S.C.S. drafted the manuscript. W.M. implemented the web interface. B.M.H. collected the RNA-seq and CyTOF data.

Corresponding author

Correspondence to Maria Chikina.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lei Tang and Tal Nawy were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 PLIER decompositions are robust to normalization procedure.

We compare the PLIER decompositions obtained from two different versions of the DGN dataset: one normalized with all known technical factors and the other normalized only by quantile normalization. (Main panel) Heatmap of the Spearman rank correlations (computed across 922 samples) between LVs from the two decompositions. All pairwise correlations for the best-matched LVs (correlation >0.9) are shown. LVs are named with their corresponding top prior-information gene set (if any). Note that the prior information used is almost identical across the two decompositions. (Inset) The distribution of Spearman rank correlation values across all best reciprocal match pairs of LVs. LVs that use prior information (LVs with any non-zero U coefficients) are more robust to the normalization procedure, as they are more likely to have a high-correlation match across the two datasets. The boxplot displays the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.
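The reciprocal best-match comparison described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the simulated LV matrices and the assumption of tie-free continuous LV values are all introduced here for the example.

```python
import numpy as np

def rank_columns(x):
    """Rank the values in each column (0 = smallest); assumes no ties."""
    return np.argsort(np.argsort(x, axis=0), axis=0).astype(float)

def spearman_cross(a, b):
    """Spearman correlation of every column of a with every column of b."""
    ra, rb = rank_columns(a), rank_columns(b)
    ra = (ra - ra.mean(axis=0)) / ra.std(axis=0)
    rb = (rb - rb.mean(axis=0)) / rb.std(axis=0)
    # Pearson correlation of the ranks is the Spearman correlation.
    return ra.T @ rb / a.shape[0]

def reciprocal_best_matches(a, b):
    """Return (index in a, index in b, rho) for LV pairs that are each other's best match."""
    rho = spearman_cross(a, b)
    best_in_b = rho.argmax(axis=1)  # best partner in b for each LV of a
    best_in_a = rho.argmax(axis=0)  # best partner in a for each LV of b
    return [(i, int(j), float(rho[i, j]))
            for i, j in enumerate(best_in_b) if best_in_a[j] == i]

# Simulated stand-ins for the two decompositions (samples x LVs).
rng = np.random.default_rng(0)
lvs_full = rng.normal(size=(922, 3))
# Second decomposition: the same LVs, reordered and monotonically transformed.
lvs_quantile = np.exp(lvs_full[:, [2, 0, 1]])
matches = reciprocal_best_matches(lvs_full, lvs_quantile)
```

Because Spearman correlation depends only on ranks, the monotone transform in the toy data leaves each matched pair with a correlation of 1.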

Supplementary Figure 2 PLIER decompositions isolate technical and biological variation in different components.

Using a PLIER decomposition obtained from the quantile normalized DGN dataset, which has not been corrected for known technical factors, we plot the distribution of absolute values of Pearson correlation (across 922 samples) between technical factors and LVs. Since technical factors are correlated, we select a non-redundant set for our evaluation (PF_ALIGNED_BASES, MEDIAN_3PRIME_BIAS, PCT_MRNA_BASES, %heme, duplicate%, yield, set aside globin, length corr, GC corr), where the last two are per-sample correlations between quantified gene expression vectors and transcript length and GC content, respectively. LVs associated with prior information (LVs with non-zero U coefficients) and LVs without prior information (LVs with zero U coefficients) are contrasted using a two-sided Wilcoxon rank-sum test. The correlation for LVs with prior information is significantly lower for all but one technical factor (PF_ALIGNED_BASES). Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.
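As an illustration of the comparison in this caption, the sketch below computes absolute Pearson correlations between one technical factor and two groups of LVs, then contrasts the groups with a two-sided Wilcoxon rank-sum test. The rank-sum test here uses the normal approximation and assumes no ties; the data are simulated, and all names are hypothetical rather than taken from the PLIER package.

```python
import math
import numpy as np

def abs_pearson(factor, lvs):
    """|Pearson r| between one technical factor (n,) and each LV column of lvs (n, k)."""
    f = (factor - factor.mean()) / factor.std()
    z = (lvs - lvs.mean(axis=0)) / lvs.std(axis=0)
    return np.abs(f @ z / len(f))

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1
    w = ranks[:n1].sum()                       # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2              # expectation under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Simulated example: 20 LVs of each kind across 922 samples.
rng = np.random.default_rng(1)
factor = rng.normal(size=922)                   # stand-in for one technical factor
lvs_with_prior = rng.normal(size=(922, 20))     # unrelated to the factor
lvs_without_prior = 0.5 * factor[:, None] + rng.normal(size=(922, 20))

r_with = abs_pearson(factor, lvs_with_prior)
r_without = abs_pearson(factor, lvs_without_prior)
z_stat, p_value = rank_sum_test(r_with, r_without)
```

A negative `z_stat` indicates that the first group (here, LVs with prior information) has the lower absolute correlations.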

Supplementary Figure 3 Pathway associations for the DGN dataset.

We plot the U matrix for the complete set of LVs that were associated with at least one prior-information pathway with an FDR <0.05. LVs correspond to columns and pathways correspond to rows. For ease of visualization, only the top pathway (largest U coefficient) for each LV is retained and the U matrix is split into two columns. We note that pathways for abundant cell types such as neutrophils and erythrocytes are top hits for multiple LVs.
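Retaining only the top pathway per LV, as done for this figure, amounts to a column-wise maximum over the U matrix. The following sketch assumes a small hypothetical U matrix (pathways in rows, LVs in columns) and omits the FDR filter, which would require the association statistics.

```python
import numpy as np

def top_pathway_per_lv(u, pathway_names):
    """Map each LV with any non-zero U coefficient to its largest-coefficient pathway."""
    result = {}
    for lv in range(u.shape[1]):
        column = u[:, lv]
        if np.any(column != 0):          # skip LVs that use no prior information
            top = int(np.argmax(column))
            result[lv] = (pathway_names[top], float(column[top]))
    return result

# Hypothetical 2-pathway x 3-LV example; the third LV has no pathway association.
u_demo = np.array([[0.0, 0.8, 0.0],
                   [0.5, 0.1, 0.0]])
names_demo = ["neutrophil", "erythrocyte"]
top = top_pathway_per_lv(u_demo, names_demo)
```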

Supplementary Figure 4 Pathway-centric PLIER eQTLs have a higher replication rate.

We compare trans-eQTLs discovered with the standard gene-centric approach to those discovered by PLIER with respect to independent replication in the NESDA dataset. While the gene-centric approach considers all possible SNP-gene associations that satisfy valid trans-eQTL criteria (see Methods), the pathway-centric approach only considers those gene-level effects that are associated with pathway-level eQTLs. We find that at the same raw p-value threshold, pathway-centric eQTLs have a notably higher replication rate. Since pathway-centric associations are by construction linked to a pathway-level effect, they are more likely to represent real and replicable indirect associations. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test. P-values indicated on the x-axis are uncorrected.

Supplementary Figure 5 Top scoring Immgen cell-types for genotype-associated LVs with no or ambiguous PLIER pathway annotations.

To add further interpretation to the pathway-level eQTLs that had either no or ambiguous pathway associations, we investigate the gene expression of top loading genes in a comprehensive database of mouse immune cells (Immgen). This database was not used in the PLIER decomposition, so it provides an independent source of immune gene expression patterns. Z-scores were computed across 62 immune cell types in the Immgen dataset. The 10 cell types with the top median z-score across the 20 genes with the highest loading values are shown.
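The ranking described in this caption can be sketched as below: z-score each top-loading gene across cell types, take the median z-score per cell type, and keep the highest-scoring cell types. The expression matrix and loadings are simulated, the function name is hypothetical, and the mapping between human genes and their mouse counterparts is assumed to have been done already.

```python
import numpy as np

def top_cell_types(expr, loadings, n_genes=20, n_types=10):
    """expr: genes x cell types; loadings: per-gene LV loadings (same gene order)."""
    top_genes = np.argsort(loadings)[::-1][:n_genes]   # highest-loading genes
    sub = expr[top_genes]
    # z-score each gene across cell types
    z = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, keepdims=True)
    median_z = np.median(z, axis=0)                    # one score per cell type
    order = np.argsort(median_z)[::-1][:n_types]
    return order, median_z[order]

# Simulated example: 100 genes, 62 cell types, with cell type 5 enriched
# for the 20 genes that have the highest loadings.
rng = np.random.default_rng(2)
expr_demo = rng.normal(size=(100, 62))
loadings_demo = rng.random(100)
expr_demo[np.argsort(loadings_demo)[-20:], 5] += 5.0
cell_order, cell_scores = top_cell_types(expr_demo, loadings_demo)
```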

Supplementary Figure 6 Top genes for all genotype associated LVs.

We plot the top 15 genes for all LVs that had a significant genotype association. Data is plotted as z-scores across 922 subjects.

Supplementary Figure 7 The effects of rs1354034 on LVs 44 and 133 are independent.

We plot the relationship between LVs and minor allele counts of rs1354034 using raw LV estimates (first row) or corrected estimates, using residuals from the linear regression fit on the other LV (second row). We find that while the estimates for these two LVs are correlated, the eQTL effects are substantially improved when regressing one LV on the other and using the residuals for eQTL testing. Statistics were computed using Spearman rank correlation across 922 subjects with a two-sided test and uncorrected p-values are reported. Boxplots display the 25th, 50th and 75th percentiles, with whiskers extending to 1.5× the interquartile range or the range of the data, whichever is smaller.
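The residual-based correction in this caption can be illustrated with a short simulation. The sketch regresses one LV on the other, then computes a tie-aware Spearman correlation between the residuals and minor allele counts. The effect sizes, variable names and simulated genotypes are assumptions made for the example, and p-values are not computed.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                 # highest rank in each tie group
    return (last_rank - (counts - 1) / 2)[inverse]

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def residualize(target, covariate):
    """Residuals of a simple linear regression of target on covariate."""
    slope, intercept = np.polyfit(covariate, target, 1)
    return target - (slope * covariate + intercept)

# Simulated example: two correlated LVs with opposite genotype effects.
rng = np.random.default_rng(3)
allele_count = rng.binomial(2, 0.3, size=922).astype(float)  # minor allele counts
shared = rng.normal(size=922)                                # variation common to both LVs
lv_a = shared + 0.3 * allele_count + 0.3 * rng.normal(size=922)
lv_b = shared - 0.3 * allele_count + 0.3 * rng.normal(size=922)

rho_raw = spearman(lv_a, allele_count)
rho_corrected = spearman(residualize(lv_a, lv_b), allele_count)
```

In this toy setting the shared variation masks the genotype effect in the raw estimates, so `rho_corrected` is larger in magnitude than `rho_raw`.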

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Note 1 and Supplementary Data 1

Reporting Summary

Source data

Source Data, Fig. 1

About this article

Cite this article

Mao, W., Zaslavsky, E., Hartmann, B.M. et al. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods 16, 607–610 (2019). https://doi.org/10.1038/s41592-019-0456-1

Received: 15 December 2017
Accepted: 16 May 2019
Published: 27 June 2019
Issue Date: July 2019
DOI: https://doi.org/10.1038/s41592-019-0456-1
